Monday, February 27, 2012

What do you want to know?

I'd meant the previous post to segue into a discussion of the things I'm looking to find out, but I got lost in the middle. In part I'll blame The Help for being really engrossing, but mostly I'm just easily distractible. So here, then, is the list of things I'd like to know. As I answer some of these questions, I'll link them through to the relevant blog posts.
  • Which words appear most/least in all puzzles?
  • Which words appear proportionally more often than in common usage?
  • Which words have the most/least variety of clues?
  • Which clues have the most answers?
  • Are any common words biased towards horizontal/vertical?
  • How full is the 3-4 letter distribution space? How does that distribution compare to the 3-4 letter Google 1-grams?
  • What does the distribution of other groups of things look like? (e.g. Greek letters, signs of the zodiac, days of the week, countries of the world)
  • Which words appear most often in the same position?
  • Which words appear in the largest variety of positions?
  • Which letter/word is most often in the corners?
  • Which letter most often starts/ends a word?
  • How does start/end letter distribution differ from the dictionary/common usage?
  • How does consonant/vowel ratio differ from dictionary/common usage?
  • What grid shape is most common?
  • Which letter is most common at each position in the mode grid size?
3/5/2012:
  • Can I identify enough place names that it becomes interesting to draw them on a map?
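For scale on the "how full is the distribution space" question above, the space of possible short answers is small enough to enumerate exactly. A quick sketch (nothing crossword-specific here, just counting strings over a 26-letter alphabet):

```java
// Back-of-the-envelope scale for the 3-4 letter distribution space:
// the number of distinct strings of length n over a 26-letter alphabet.
public class AnswerSpace {
    static long space(int length) {
        long total = 1;
        for (int i = 0; i < length; i++) total *= 26;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(space(3)); // 17576 possible 3-letter strings
        System.out.println(space(4)); // 456976 possible 4-letter strings
        System.out.println(space(3) + space(4)); // 474552 total
    }
}
```

So even a corpus of 15,000 puzzles can only ever sample a fraction of the ~475k possible short strings, which is part of what should make the comparison against real answer frequencies (and the Google 1-grams) interesting.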

I find each of these interesting in and of itself, but I also think it'll be interesting to make comparisons between publishers, between days of the week, over time, or along any other cross-tab I feel like at the time.

As always, I reserve the right to modify this list; hopefully my loyal readers have more suggestions?

Sunday, February 26, 2012

I've grown today

All of these crossword puzzles are in the AcrossLite .puz format, and I'm struggling with how to translate so many small binary files into something Hadoop is comfortable with.  This seems to be a common problem (the Small Files Problem), and I think the solution is SequenceFiles full of key/value pairs.  My current trouble is figuring out what I need to include in my puzzle maps and how much duplication I'm willing to put up with, since each box in the grid is included in at least two words.  The easy answer to the first question is 'everything', and I got some advice that the answer to the second is 'a lot, disk space is free'.  Still, I spent an hour or so coming up with complicated strategies to index my files so that I wouldn't have to duplicate any characters, and then I thought about it some more and realized that they're just characters: anything clever I tried would probably take more than one byte to store and more than one read to look up.
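The duplication in question can be sketched concretely. In the toy grid below (my own hypothetical in-memory representation, not the .puz format itself), extracting the across and down words emits every letter at least twice, and that's fine:

```java
import java.util.ArrayList;
import java.util.List;

// A hypothetical in-memory puzzle: one string per row, '.' marks a
// black square. Extracting across and down words duplicates every
// letter at least once -- and that's fine, they're just characters.
public class GridWords {
    static List<String> acrossWords(String[] rows) {
        List<String> words = new ArrayList<>();
        for (String row : rows) {
            for (String run : row.split("\\.")) {
                if (run.length() > 1) words.add(run);
            }
        }
        return words;
    }

    static List<String> downWords(String[] rows) {
        // Transpose the grid into column strings, then reuse the across logic.
        String[] cols = new String[rows[0].length()];
        for (int c = 0; c < cols.length; c++) {
            StringBuilder sb = new StringBuilder();
            for (String row : rows) sb.append(row.charAt(c));
            cols[c] = sb.toString();
        }
        return acrossWords(cols);
    }

    public static void main(String[] args) {
        String[] grid = {"CAT", "A.O", "BEE"};
        System.out.println(acrossWords(grid)); // [CAT, BEE]
        System.out.println(downWords(grid));   // [CAB, TOE]
    }
}
```

Eight of the nine squares above end up stored twice once the words are written out as key/value records, and any indexing scheme clever enough to avoid that would cost more than the bytes it saves.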

This is probably a realization that all programmers come to eventually, I'm lucky to have had it in the privacy of my own home...


Experiments with Google Charts

I don't even have any real data yet, so this is probably putting the cart before the horse, but I'm tired of trying to figure out SequenceFiles so I'm taking a detour through Data Visualization.  Check out the graph below the cut.

Data Characterization

So, while I reserve the right to monkey with this dataset before I start doing any real analysis, this seems like a good time to describe which sources I have puzzles from, some of their properties, and how many I ended up getting in my initial download.


Source/Publication            | Days of the Week Available | Latest Date | Earliest Date | Number Downloaded
Boston Globe                  | SUNDAY                     | 2012-02-19  | 2008-04-20    | 200
Chronicle of Higher Education | FRIDAY                     | 2012-02-17  | 2011-08-12    | 24
Ink Well                      | FRIDAY                     | 2012-02-17  | 2010-01-01    | 111
I Swear                       | FRIDAY                     | 2012-02-17  | 2010-01-01    | 112
Jonesin'                      | THURSDAY                   | 2012-02-16  | 2010-01-07    | 111
Thomas Joseph                 | MONDAY - SATURDAY          | 2012-02-22  | 2007-06-14    | 1470
Los Angeles Times             | DAILY                      | 2012-02-22  | 2012-01-26    | 28
Newsday                       | DAILY                      | 2012-02-22  | 2005-05-01    | 2485
New York Times                | DAILY                      | 2012-02-22  | 2004-01-01    | 2974
Onion AV Club                 | WEDNESDAY                  | 2012-02-22  | 2006-10-04    | 282
Philadelphia Inquirer         | SUNDAY                     | 2012-02-19  | 2004-01-04    | 425
Premier                       | MONDAY - SATURDAY          | 2012-02-19  | 2004-11-14    | 380
Sheffer                       | MONDAY - SATURDAY          | 2012-02-22  | 2007-06-14    | 1470
Thinks.com                    | DAILY                      |             |               |
Universal                     | DAILY                      | 2012-02-22  | 2004-01-01    | 2780
USA Today                     | MONDAY - SATURDAY          | 2012-02-22  | 2008-06-10    | 1044
Washington Post               | DAILY                      | 2012-02-22  | 2011-04-05    | 324
Washington Post Puzzler       | SUNDAY                     | 2012-02-19  | 2004-01-04    | 425
Wall Street Journal           | FRIDAY                     | 2012-02-17  | 2004-01-02    | 425

I'm pretty conflicted about this.  While it's nice to have over 15,000 puzzles, the distribution between sources is all over the map: I have only 28 LATs and 282 Onion AVs against 2,974 NYTs.  That's going to make drawing some kinds of conclusions difficult and statistically suspect, but I guess we'll cross that moat when we come to it.

Thursday, February 23, 2012

How'd you get your data?

While nothing compares to the pleasure (and terror!) of setting pen to newsprint on a New York Times Sunday crossword, I'm a very happy user of the Shortyz Android app by Robert "kebernet" Cooper.  Not only is Shortyz free, but Mr. Cooper also makes the source code available on Google Code (here).  Having written Android apps before, I figured it would be a cinch to convert some Android code for downloading puzzle files into a standalone desktop app that could do the same on a larger scale.

Well, no, of course not.

But over the course of a weekend, I was able to cobble together a process I'm happy with.  I wasn't interested in any of the Android UI or infrastructure components, and while there weren't terribly many Android dependencies in the Downloader code, I did have to rework some sections.  Thankfully, the code is well structured, with most Downloaders inheriting their logic from an abstract class that does most of the heavy lifting.  I did end up rewriting some bits purely to appease my own OCD/aesthetics, especially the way it figured out which providers had puzzles available on which days.  It looks like Google Code will let me clone the Shortyz repo GitHub-style so I can share my changes, but that may take a profession of interest from someone else before I get around to it.

So my modified Shortyz code netted me 20 Downloaders that, when given a valid date, query a URL for a puzzle file and save it to disk.  The next step was to build a framework that would loop over every date to grab as many puzzles as it could.  To do that I whipped up a simple little main method that initializes all the Downloaders, gets the current date, asks each Downloader for the puzzle from that day, then decrements the day.  Since I know none of these sources has an infinite backlog, I implemented some logic so that if a Downloader failed to deliver a puzzle when it was supposed to three times in a row, that Downloader got blacklisted and wasn't queried again.  Finally, to be a good citizen, I only request a puzzle from each website once every (very conservative) 30 seconds.
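The blacklisting loop is simple enough to sketch.  The `Downloader` interface below is a hypothetical stand-in for the modified Shortyz classes, not their real API, and the sketch skips the 30-second sleep:

```java
import java.time.LocalDate;
import java.util.*;

// Hypothetical stand-in for the (modified) Shortyz Downloader classes.
interface Downloader {
    String name();
    boolean shouldContain(LocalDate date); // does this source publish on this day?
    boolean download(LocalDate date);      // true if a puzzle was saved to disk
}

public class Harvester {
    static final int MAX_STRIKES = 3;

    // Walk backwards from 'start' for 'days' days, dropping any source that
    // misses three expected puzzles in a row. Returns the names still active.
    static Set<String> run(List<Downloader> sources, LocalDate start, int days) {
        Map<String, Integer> strikes = new HashMap<>();
        Set<String> active = new LinkedHashSet<>();
        for (Downloader d : sources) active.add(d.name());

        LocalDate date = start;
        for (int i = 0; i < days; i++, date = date.minusDays(1)) {
            for (Downloader d : sources) {
                if (!active.contains(d.name()) || !d.shouldContain(date)) continue;
                if (d.download(date)) {
                    strikes.put(d.name(), 0); // a success resets the count
                } else {
                    int s = strikes.merge(d.name(), 1, Integer::sum);
                    if (s >= MAX_STRIKES) active.remove(d.name()); // blacklisted
                }
                // The real loop also sleeps ~30 seconds between requests here.
            }
        }
        return active;
    }
}
```

The strike counter only advances on days the source was *supposed* to have a puzzle, so a Friday-only source doesn't get blacklisted for six quiet days a week.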

What is this?

I do a lot of crossword puzzles: at least one a day, and normally three or four on Sundays.  When you do that many puzzles from that many sources, you tend to notice things.  Two examples are "crossword puzzle words" like "EPEE" or "IOUS" or "ESAI" (Morales) that pop up again and again because they make life easier for the puzzle constructors, and the cases where two different editors at two different publications use the same word, sometimes with the same clue, on the same day.  I have a whole list of things I think will be interesting to look at that I'll document in more detail in a later post.

I'm also a Software Engineer looking to keep up with cool new technology, and what better way to learn than to have a personally interesting problem?  To come to even slightly meaningful generalizations about crossword puzzles, it's critical to have as much data as possible.  Conveniently, Big Data is a hot topic in the field as storage prices drop and the number of quantifiable interactions between people and systems increases.  Google, Facebook, Amazon, and pretty much every other company, on the internet or off it, are making buckets of money turning the terabytes of data they've collected into actionable analyses.

The main tool (or at least the one I'm interested in right now) for asking questions of data like this quickly and efficiently is MapReduce, a programming model developed at Google that was documented well enough for an open source implementation to emerge, maintained by Apache as Hadoop.
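The shape of MapReduce is easy to sketch without Hadoop at all.  Here's a toy word count over answer words in plain Java, standing in for the real map, shuffle, and reduce phases (Hadoop does the same thing, just distributed across machines):

```java
import java.util.*;
import java.util.stream.*;

// Toy illustration of the MapReduce shape, no Hadoop required:
// map each answer to a normalized key, group by that key (the
// "shuffle"), then reduce each group by counting its members.
public class ToyMapReduce {
    static Map<String, Long> wordCount(List<String> answers) {
        return answers.stream()
                .map(String::toUpperCase)            // map phase: normalize
                .collect(Collectors.groupingBy(      // shuffle: group by key
                        w -> w,
                        Collectors.counting()));     // reduce phase: count
    }

    public static void main(String[] args) {
        List<String> answers = List.of("epee", "IOUS", "EPEE", "esai", "EPEE");
        System.out.println(wordCount(answers)); // counts EPEE=3, IOUS=1, ESAI=1 (order may vary)
    }
}
```

Swap the in-memory list for 15,000 puzzle files and the stream for a cluster, and that's roughly the program I'm hoping to run against my corpus.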

I intend for this project to serve as a pleasant counterpart to my day job while helping me develop a skill set I'm interested in pursuing.  Whether I'm still interested in it by the end of this project is one of the more important questions :)

I intend for this blog to serve as a project log of the problems I encounter as well as how I resolved them, a document of my hypotheses and conclusions, and a gallery of (hopefully) pretty data visualizations. But honestly, by making this public, I also hope it will serve as a goad to keep up my enthusiasm long enough to come to interesting conclusions.