Thursday, February 23, 2012

What is this?

I do a lot of crossword puzzles. I try to do at least one a day, normally three or four on Sundays, and when you do that many puzzles from that many sources you tend to notice things. Two examples: "crossword puzzle words" like "EPEE" or "IOUS" or "ESAI" (Morales) that pop up again and again because they make life easier for the puzzle constructors, and those cases where two different editors at two different publications use the same word, sometimes with the same clue, on the same day. I have a whole list of things I think will be interesting to look at, which I'll document in more detail in a later post.

I'm also a software engineer looking to keep up with cool new technology, and what better way to learn than to have a personally interesting problem? To come to even slightly meaningful generalizations about crossword puzzles, it's critical to have as much data as possible. Conveniently, Big Data is a hot topic in the field as storage prices drop and the number of quantifiable interactions between people and systems increases. Google, Facebook, Amazon, and pretty much every company on the internet or off it is making buckets of money turning the terabytes of data they have collected into actionable analyses.

The main tool (or at least, the one I'm interested in right now) for asking questions of these data quickly and efficiently is MapReduce, a programming model developed at Google and documented well enough that an open source implementation was created. That implementation, Hadoop, is maintained by Apache.
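To make the idea concrete, here is a minimal sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop and runs on a single machine; the sample puzzles, the source names, and the function names are all made up for illustration. It counts how often each answer appears across puzzles, which is roughly the kind of question behind "crossword puzzle words".

```python
from collections import defaultdict

# Hypothetical sample data: (source, list of answers) for three puzzles.
PUZZLES = [
    ("paper_a", ["EPEE", "IOUS", "OREO"]),
    ("paper_b", ["ESAI", "EPEE", "AREA"]),
    ("paper_c", ["EPEE", "OREO", "ESAI"]),
]


def map_phase(puzzle):
    """Map: emit an (answer, 1) pair for every answer in one puzzle."""
    _source, answers = puzzle
    for answer in answers:
        yield answer, 1


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def count_answers(puzzles):
    """Run map, shuffle, and reduce in sequence, all in memory."""
    pairs = (pair for puzzle in puzzles for pair in map_phase(puzzle))
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = count_answers(PUZZLES)
for answer, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(answer, count)
```

In a real MapReduce system the map and reduce functions look much the same, but the framework runs them on many machines and handles the shuffle between them.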

I intend for this project to serve as a pleasant counterpart to my day job while helping me develop a skill set that I'm interested in pursuing. Whether I'm still interested in it by the end of this project is one of the more important questions :)

I intend for this blog to serve as a project log of the problems I encounter and how I resolve them, a record of my hypotheses and conclusions, and a gallery of (hopefully) pretty data visualizations. But honestly, by making this public, I also hope it will serve as a goad to keep up my enthusiasm long enough to come to interesting conclusions.
