Sunday, July 29, 2012

Three Letter Word Space

I've been sitting on this one for a while, but +Karl Steinke was pondering which length of word fills the largest percentage of its possibility space.  That is, of all possible 3 (or 4, or whatever) letter combinations, which length has the largest percentage of real words?  We decided it had to be short words, probably between 3 and 5 letters, but my money is on 3, which is convenient because that's also the largest number of letters that's easy to graph (since you can put it in 3 dimensions) and the shortest length of crossword answer.  In this case, I'm using the looser definition of "word" that encompasses any 3 letter combination that shows up in one of my puzzles.

For 3 letter combinations, there are 26^3 (17576) possible values, running from AAA to ZZZ.  The official Scrabble word list has 1014 words (5.77%), and my crossword word list has 3070 words (17.5%).  Check out the full graph and some highlights below the fold.
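The arithmetic is trivial, but for the record, here's a quick sketch of the percentage calculation (the word counts are just the figures above):

```java
public class WordSpace {
    // Percentage of the full 26^letters combination space that a word list fills.
    static double percentFilled(int words, int letters) {
        return 100.0 * words / Math.pow(26, letters);
    }

    public static void main(String[] args) {
        System.out.printf("Space size: %.0f%n", Math.pow(26, 3));          // 17576
        System.out.printf("Scrabble:   %.2f%%%n", percentFilled(1014, 3)); // ~5.77%
        System.out.printf("Crossword:  %.2f%%%n", percentFilled(3070, 3)); // ~17.47%
    }
}
```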

Wednesday, July 4, 2012

Digression: Visualizing Supreme Court Cases

I lost a couple of days' worth of work on the crossword puzzle stats, and I've been working on building back the motivation to do it all over again.  But last week (the day after the Obamacare ruling in the Supreme Court) I was listening to NPR, and they were talking about how polarized the court has become, as well as how "betrayed" the conservatives felt about how "their" Justice, John Roberts, voted.

So of course I thought, "that sounds like fun!".  And I tracked down an actually very nice dataset on Wikipedia for the 2000 session and onward (wiki:2000_term_opinions_of_the_Supreme_Court_of_the_United_States).

Over the course of last weekend,  I got to the point where I could visualize *something*, and it looked a lot like this:
Each vertical bar is a case, and each horizontal line shows how a Justice contributed.  Green blocks are majority opinions, blue blocks are concurrences, and red blocks are dissents; large gaps inside a block indicate that multiple Justices wrote (eg) separate dissents.  The Justices' lines are colored by the party of the President who placed them on the court.  By way of example, that third case there is Bush v. Gore.  You can see (well, maybe you can't; it's definitely the most complicated to draw) that five Justices participated in the majority, including three who also contributed to the same concurrence.  The other four Justices wrote four separate dissents.  As you can see, it gets very messy, very quickly.

As a first pass, I kept the Justices in their natural list order, so within each opinion, Rehnquist will be at the top and Breyer will be at the bottom.  I figured this would provide some continuity that would make it easier to follow.  But, looking at it, maybe not so much...

So Tuesday I sent out a plea to my computer science minded friends asking for suggestions on how to draw the lines such that they wouldn't cross at weird angles and would stay out of each other's way.

This is what I asked for, even if gmail dropped my stupid images:



(Before)               (Suggested)

Despite having no real idea what I was talking about, several of them pointed me at some very good resources on graph-drawing algorithms.  I think what I want is some sort of PCB routing algorithm, but the more I looked into that, the more I realized how much math was involved.  Did I really want to do that for fun?

So I started to think about what I could do short of that point, and it struck me that a lot of the "end" (right-hand) nodes were in the "wrong" places.  Figuring out the "right" places still involves a fair amount of math; finding the optimal ordering of (for example) 7 Justices in a majority opinion takes 5040 (7!) operations.  But it was at least math I was comfortable with (permutations).

So I whipped up a Permutation Iterator (ie, stole one from this guy) and modified it to accept a cost function that gets applied to every permutation.  The input to be permuted is an integer array containing the start (left) indexes.  So, looking at the Before image above and trying to find the correct order for the majority opinion on the right, I would submit an array that looked like {1, 2, 4, 6, 7, 8}.  That describes a possible ordering where node 1 on the left connects to node 0 on the right, 2 -> 1, 4 -> 2, etc.  Calculating the cost of that permutation is pretty straightforward: I take ABS(array value - array index) summed across all elements.  So for this example, the cost would be (1-0)+(2-1)+(4-2)+(6-3)+(7-4)+(8-5) = 13.
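The cost function itself is nearly a one-liner; a sketch:

```java
import java.util.stream.IntStream;

public class PermutationCost {
    // Cost of one permutation: sum over all elements of |value - index|.
    // perm[i] is the left-hand node that connects to right-hand slot i.
    static int cost(int[] perm) {
        return IntStream.range(0, perm.length)
                        .map(i -> Math.abs(perm[i] - i))
                        .sum();
    }

    public static void main(String[] args) {
        // The worked example: (1-0)+(2-1)+(4-2)+(6-3)+(7-4)+(8-5) = 13
        System.out.println(cost(new int[]{1, 2, 4, 6, 7, 8})); // prints 13
    }
}
```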

Just minimizing the distance helped, but it got even better when I minimized both distance *and* the number of intersecting lines.
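Counting the intersecting lines turns out to be just counting inversions: two straight connectors between the columns cross exactly when their left-hand order is the reverse of their right-hand order.  A sketch (how the two terms get weighted against each other in the combined score is a separate tuning question):

```java
public class CrossingCount {
    // perm[i] is the left-hand node connected to right-hand slot i.
    // Lines i and j (i < j) cross exactly when perm[i] > perm[j],
    // so the number of intersections is the inversion count.
    static int crossings(int[] perm) {
        int count = 0;
        for (int i = 0; i < perm.length; i++)
            for (int j = i + 1; j < perm.length; j++)
                if (perm[i] > perm[j]) count++;
        return count;
    }

    public static void main(String[] args) {
        System.out.println(crossings(new int[]{1, 2, 4, 6, 7, 8})); // sorted: 0 crossings
        System.out.println(crossings(new int[]{8, 1, 2, 4, 6, 7})); // 8 crosses all 5 others: 5
    }
}
```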

The result of all of that is below.  I think it's definitely an improvement (check out all the long lines!), but it's still not ideal.



I have some ideas on how to improve it further.  Right now I'm leaving the concurrences/dissents in their natural order, and I could probably get some more mileage out of adjusting their order to reduce the total distance.  I'm also wondering if looking ahead a case (or further) might help, but I'm less certain how to implement that.

I'm especially looking for advice on the long, parallel lines: the steeper the lines, the more they tend to blur together.  What should I do about this?

Thanks for your help!

Saturday, March 31, 2012

Link: A Better Strategy for Hangman

This is a very interesting article about how you're probably playing Hangman wrong.  It's very much in the vein of what I'm trying to do.

Sunday, March 25, 2012

What's a "Word"?

One of the things I didn't address at all in my last post is "what counts as a 'word'?".  While I did take a moment to distinguish between proper nouns and regular nouns, I expect that the further you get down the occurrence list, the fewer words will actually appear in a dictionary.  I'd really like to be able to look at that sort of thing, but unfortunately, the OED doesn't publish a web service.  I did download the Google 1-grams, but since they count anything separated by white space as a word, they're full of iffy values as well (eg, the first 35 entries are the letter A repeated from 1 to 35 times).  I also found a list of all the words that appeared in Wikipedia as of 2005, which will certainly be a smaller corpus.  Both lists are also likely to be full of misspellings :/  Does anyone out there know of a curated word list that might be appropriate for this kind of usage?

While I'm not exactly producing journal quality output, I'm a little more worried than normal about a GIGO issue here...

Crossword Word Analysis

I'm not feeling up to anything really complicated this weekend, but I had previously put together a Hadoop job aggregating all of the words used in the puzzles, so here's some high level analysis of that.  In my total puzzle set, there were 118,211 unique words used a total of 1,144,027 times.  Here are the 20 most used words and their occurrence counts; as one might expect, they're both very short and very vowel heavy.

ERA   1260
AREA  1258
ORE   1019
ERE   991
ARIA  978
ERIE  978
ALOE  969
ONE   967
ALE   895
ATE   877
ELSE  846
ARE   842
ERR   801
ETA   795
ALI   771
ALA   770
EAR   767
OREO  761
ODE   760
ADO   756

All but two of those would be valid Scrabble words; ALI (most likely Muhammad) and OREO are both proper nouns.  I'll probably look at vowel distribution next.  I bet the more common words will show a heavy bias towards vowels, as well as towards the six over-represented letters I pulled out in the last post.

To have somewhere to start analyzing all these words, I divided them into buckets by number of occurrences, subdivided by powers of 2: the first bucket contains all the words that occur 2^0 times, the second covers 2^0+1 to 2^1, and so on, all the way up to a final 2^10+ bucket.
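Mapping a count to its bucket is a quick loop; a sketch of my reading of the boundaries, with everything above 2^10 landing in the final bucket:

```java
public class OccurrenceBuckets {
    // Bucket index for a word used `count` times: bucket 0 holds counts of
    // exactly 1 (2^0), bucket k holds counts from 2^(k-1)+1 through 2^k,
    // and bucket 11 is the 2^10+ catch-all for counts above 1024.
    static int bucket(int count) {
        int k = 0;
        while ((1 << k) < count) k++;   // smallest k with 2^k >= count
        return Math.min(k, 11);         // cap at the 2^10+ bucket
    }

    public static void main(String[] args) {
        System.out.println(bucket(1));    // 0
        System.out.println(bucket(2));    // 1
        System.out.println(bucket(3));    // 2
        System.out.println(bucket(4));    // 2
        System.out.println(bucket(1260)); // 11 (ERA's 1260 uses land in the top bucket)
    }
}
```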

Below the cut, check out my graphs of the number of words in each bucket, the total number of occurrences in each bucket, and the average word length in each bucket.  I think they're kind of cool.

Tuesday, March 6, 2012

Global Distribution of Letters, Part 2, The Envisioning

As promised, here is my extended analysis of the Global Letter Distribution stats I posted last time.  First, instead of using the chart from Google, I've switched to an analysis of all the words in the OED.  As I probably should have expected, my initial, purely visual analysis was way off.  In reality, the use of the letter T is practically identical between the OED and the XWords.  To provide a little more mathematical rigor, I plotted the percent difference between the OED and XWord values as (XWord val - OED val) / (OED val).  What I found surprising was that six letters, S, A, E, D, O, and T, are overused at the expense of the other twenty letters, so if you're ever in doubt, that's a good place to start.
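That percent difference is a simple formula; here's a tiny sketch (the numbers in main are made up for illustration, not my measured frequencies):

```java
public class LetterSkew {
    // Percent difference of crossword letter frequency relative to the OED:
    // positive means the letter is overused in crosswords, negative underused.
    static double percentDiff(double xwordFreq, double oedFreq) {
        return 100.0 * (xwordFreq - oedFreq) / oedFreq;
    }

    public static void main(String[] args) {
        // Illustrative inputs only:
        System.out.printf("%.1f%%%n", percentDiff(12.0, 10.0)); // +20.0: overused
        System.out.printf("%.1f%%%n", percentDiff(1.5, 2.0));   // -25.0: underused
    }
}
```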

Google Charts wasn't doing what I wanted it to, so now I'm experimenting with Tableau, and let me tell you, it's pretty hot (if you're on rss, come view the whole post):


I'm definitely interested in any other functions you think would be fun to graph, as the little bit of formal stats I had was a long time ago. Please let me know what you think!

Sunday, March 4, 2012

Global Distribution of Letters

To take advantage of my momentum, I went ahead and found the global distribution of letters across all puzzles!


The # is a black box, and as you'd expect it's leading the pack, although it's a lot closer to A and E than I would have guessed.  There's also one poor little Ñ hanging out there at the end; I should figure out where he came from and whether it's a bug...

Here's the English Letter Frequency graph from Wikipedia

T's seem to be very strongly underrepresented in the crossword puzzles.  More interesting maths to come soon.

The data set is here, I'll update the post with a live Google Chart after I've gotten some sleep.

You're doing it wrong!

So it turns out the progress I thought I'd made in my last post wasn't nearly as impressive as I thought it was.  While I was actually mapping things, I completely missed the boat on what my output was doing.  What I'd hoped was happening was that each time my Map operation was called, a sort of anonymous record was created to encapsulate all the fields associated with each puzzle.  Instead, I just got a really big file full of key/value pairs, and I'd lost all concept of a distinct puzzle.

The solution was to make my records explicit, but that wasn't as hard as I thought.  To include an object as a key or value in Hadoop, it needs to implement Writable and define how its values should be encoded and decoded.  That turned out to be pretty straightforward, since I'd already done most of the work figuring out how to encode my objects in the previous SequenceFile version.  And it looked like it worked!  Even better, it reduced the size of my output significantly.  The SequenceFile encoded key/value pairs clocked in at 109MB, but the new PuzWritable version is just 67MB: still twice as large as the original dataset, but much more reasonable.  I think the largest difference is that before, every field stored keys with every value, whereas now the keys are assumed, determined by the encoding order of the fields.
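Here's a sketch of that encoding idea using plain java.io streams instead of Hadoop's interfaces (the real class would implement org.apache.hadoop.io.Writable, and the field names below are stand-ins, not my actual PuzWritable fields).  The size win comes from never writing the keys: writer and reader just agree on the field order.

```java
import java.io.*;
import java.util.*;

public class PuzRecord {
    String author = "";
    List<String> acrossClues = new ArrayList<>();

    void write(DataOutput out) throws IOException {
        out.writeUTF(author);               // field 1, by convention
        out.writeInt(acrossClues.size());   // field 2: a length-prefixed list
        for (String clue : acrossClues) out.writeUTF(clue);
    }

    void readFields(DataInput in) throws IOException {
        author = in.readUTF();
        acrossClues.clear();                // defensive: the framework may hand
                                            // us a reused object, not a fresh one
        int n = in.readInt();
        for (int i = 0; i < n; i++) acrossClues.add(in.readUTF());
    }

    static PuzRecord roundTrip(PuzRecord rec) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        rec.write(new DataOutputStream(bytes));
        PuzRecord copy = new PuzRecord();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        PuzRecord rec = new PuzRecord();
        rec.author = "Will Shortz";
        rec.acrossClues.addAll(List.of("Fencing sword", "Squeeze (out)"));
        PuzRecord copy = roundTrip(rec);
        System.out.println(copy.author + " / " + copy.acrossClues);
    }
}
```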

So now that I've got my second draft of a dataset working, it's time for a sanity check!  I just added up the counts of all the fields.  The fields that occur once per puzzle (author, date, source, title) should match up with my number of records, and the counts of across and down clues, and individual boxes, should be more interesting.  Here's what I got:

across    1798769086
author    14282
box    11024805265
date    14282
dimension    14282
down    1942045913
source    14282
title    14282

It looks good!  14282 is the right number of records!  So the across clues... wait... hundreds, thousands, millions, 1.7 billion across clues?  That can't possibly be right, can it?  1798769086 / 14282 = 125947.  Somehow I don't think the average puzzle has 125,000 across clues.  What the hell?!

Turns out, Hadoop doesn't instantiate a new Writable every time I expected it to.  Instead, it just calls the Writable.readFields method on the existing object.  That meant I was adding all of the clues to the same List, and when I queried its size after every record, the totals kept accumulating across records.  My solution was to clear the list during every readFields operation.  That should be fine, since every Map/Reduce operation should be stateless, and anything keeping track of an object reference is doing it wrong.  So, sanity check take two:

across    549134
author    14282
box    3395411
date    14282
dimension    14282
down    594893
source    14282
title    14282

woohoo!

Thursday, March 1, 2012

Say What? Progress?!


OMG you guys, I actually made progress tonight.  Yesterday I just tried to get Hadoop working; that was complicated on its own, but I didn't make it any easier on myself.

First, though: the Hadoop documentation sucks.  The 0.20 documentation has a really clear, basic example of using some packaged example jars to search a random set of XML files for matches to a regular expression.  That's a great introduction to the execution process and a good validation that everything is where it should be.  It's also very readable, with well enumerated steps.

Unfortunately, neither the 0.23 nor the 1.0 version I downloaded worked with that demo.  There are no example jars, or even a configuration directory, which left me struggling to figure out what was going on.  I also assumed I could run a simple, single node implementation directly from my IDE; in fact, I haven't really seen anything that says I can't, except that when I try it, I get an awfully unhelpful error that the running process doesn't have permission to write to a tmp folder.  (I know Eclipse is supposed to have a plugin, but I saw really mixed opinions of it online, too.)

And finally, the stupid bin/hadoop application can't cope with file paths that contain spaces!  It seems like that should be a no-brainer in 2012; that was annoying to figure out.

So, lacking external validation or an easy way to test, I was frustrated, but I pressed on.  I had been stressing about how I was going to feed these .puz files directly to the Mapper.  It looked like I might have to create custom input formats, some kind of puz file Writable implementation, all kinds of crap.  Then I ran across a random reference in all my googling and it all came together: just write a text file, one file path per line, and feed that to the mapper using the same code everyone else uses in all their examples.  Then you can just use plain file IO inside the mapper to do everything you need, so that was done 30 seconds later.
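The manifest trick is simple enough to sketch without any Hadoop on the classpath; everything here is plain JDK, and the file names in main are invented for illustration:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PuzManifest {
    // Walk a directory of downloaded .puz files and collect one absolute path
    // per line; the resulting text file is what the Hadoop job takes as input,
    // and the mapper opens each listed path itself with ordinary file IO.
    static List<String> listPuzFiles(Path puzDir) throws IOException {
        try (Stream<Path> files = Files.walk(puzDir)) {
            return files.filter(p -> p.toString().endsWith(".puz"))
                        .map(p -> p.toAbsolutePath().toString())
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("puz");
        Files.createFile(dir.resolve("nyt-2012-02-22.puz"));
        Files.createFile(dir.resolve("notes.txt")); // ignored: not a .puz file
        List<String> lines = listPuzFiles(dir);
        Files.write(dir.resolve("manifest.txt"), lines);
        System.out.println(lines.size()); // prints 1
    }
}
```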

Once I got everything straightened out, and a couple of other bugs fixed (Text objects don't like it when you pass them nulls), I was able to run my Mapper job to convert all my binary files into SequenceFiles.  As the screenshot shows, it took just shy of 3 minutes to read in all 15070 files and translate them into key/value pairs.  788 of those files broke, throwing an exception while parsing the input file; I think those are actually corrupted, but I can't tell you which ones they are at the moment...

Writing these out ballooned my data from 30MB to 109MB, but it should pay off as I finally get to do some real analysis.

Check back for that this weekend.

Monday, February 27, 2012

What do you want to know?

I'd meant the previous post to segue into a discussion of the things I'm looking to find out, but I got lost in the middle.  In part I'll blame The Help for being really engrossing, but mostly I'm just easily distractable.  So here now is the list of things I'd like to know.  As I answer some of these, I'll link them through to the relevant blog posts.
  • Which words appear most/least in all puzzles?
  • Which words appear proportionally more often than in common usage?
  • Which words have the most/least variety of clues?
  • Which clues have the most answers?
  • Are any common words biased towards horizontal/vertical?
  • How full is the 3-4 letter distribution space? How does that distribution compare to the 3-4 letter Google 1-grams?
  • What does the distribution of other groups of things look like? (e.g. greek letters, signs of the zodiac, days of the week, countries of the world)
  • Which words appear most often in the same position?
  • Which words appear in the largest variety of positions?
  • Which letter/word is most often in the corners?
  • Which letter most often starts/ends a word?
  • How does start/end letter distribution differ from the dictionary/common usage?
  • How does consonant/vowel ratio differ from dictionary/common usage?
  • What grid shape is most common?
  • Which letter is most common at each position in the mode grid size?
3/5/2012:
  • Can I identify enough place names that it becomes interesting to draw them on a map?

I find each of these interesting in and of itself, but I also think it'll be interesting to make comparisons between publishers, days of the week, over time, or any other cross tab I feel like at the time.

As always, I reserve the right to modify this list.  Hopefully my loyal readers have more suggestions?

Sunday, February 26, 2012

I've grown today

So all these crossword puzzles are in the AcrossLite .puz format.  I'm struggling with how to translate all these small binary files into something Hadoop is comfortable with.  This seems to be a common problem (the Small Files Problem), and I think the solution is SequenceFiles full of key/value pairs.  My current trouble is figuring out what I need to include in my puzzle maps and how much duplication I'm willing to put up with, since each box in the grid is included in at least two words.  The easy answer to the first is "everything", and I got some advice that the answer to the second is "a lot, disk space is free".  Then I thought about it some more and spent an hour or so coming up with complicated strategies to index my files so that I wouldn't have to duplicate any characters.  And then I thought about it some more and realized that they're just characters: anything I tried to do would probably take more than one byte to store and longer than one read to look up.

This is probably a realization that all programmers come to eventually, I'm lucky to have had it in the privacy of my own home...


Experiments with Google Charts

I don't even have any real data yet, so this is probably putting the cart before the horse, but I'm tired of trying to figure out SequenceFiles so I'm taking a detour through Data Visualization.  Check out the graph below the cut.

Data Characterization

So, while I reserve the right to monkey with this dataset before I start doing any real analysis, this seems like a good time to describe whose puzzles I have, some of their properties, and the number I ended up getting in my initial download.


Source/Publication            | Days available    | Latest Date | Earliest Date | Number Downloaded
Boston Globe                  | SUNDAY            | 2012-02-19  | 2008-04-20    | 200
Chronicle of Higher Education | FRIDAY            | 2012-02-17  | 2011-08-12    | 24
Ink Well                      | FRIDAY            | 2012-02-17  | 2010-01-01    | 111
I Swear                       | FRIDAY            | 2012-02-17  | 2010-01-01    | 112
Jonesin'                      | THURSDAY          | 2012-02-16  | 2010-01-07    | 111
Thomas Joseph                 | MONDAY - SATURDAY | 2012-02-22  | 2007-06-14    | 1470
Los Angeles Times             | DAILY             | 2012-02-22  | 2012-01-26    | 28
Newsday                       | DAILY             | 2012-02-22  | 2005-05-01    | 2485
New York Times                | DAILY             | 2012-02-22  | 2004-01-01    | 2974
Onion AV Club                 | WEDNESDAY         | 2012-02-22  | 2006-10-04    | 282
Philadelphia Inquirer         | SUNDAY            | 2012-02-19  | 2004-01-04    | 425
Premier                       | MONDAY - SATURDAY | 2012-02-19  | 2004-11-14    | 380
Sheffer                       | MONDAY - SATURDAY | 2012-02-22  | 2007-06-14    | 1470
Thinks.com                    | DAILY             |             |               |
Universal                     | DAILY             | 2012-02-22  | 2004-01-01    | 2780
USA Today                     | MONDAY - SATURDAY | 2012-02-22  | 2008-06-10    | 1044
Washington Post               | DAILY             | 2012-02-22  | 2011-04-05    | 324
Washington Post Puzzler       | SUNDAY            | 2012-02-19  | 2004-01-04    | 425
Wall Street Journal           | FRIDAY            | 2012-02-17  | 2004-01-02    | 425

I'm pretty conflicted about this: while it's nice to have over 15,000 puzzles, the distribution between sources is all over the chart.  I only have 28 LATs and 282 Onion AVs versus 2974 NYTs; that's going to make drawing some kinds of conclusions difficult and statistically suspect, but I guess we'll cross that moat when we come to it.

Thursday, February 23, 2012

How'd you get your data?

While nothing compares to the pleasure (and terror!) of setting pen to newsprint on a New York Times Sunday crossword, I'm a very happy user of the Shortyz Android app by Robert "kebernet" Cooper.  Not only is Shortyz free, but Mr. Cooper also makes the source code available on Google Code (here).  Having written Android apps before, I figured it would be a cinch to convert some Android code for downloading puzzle files into a standalone desktop app that could do the same on a larger scale.

Well, no, of course not.

But over the course of a weekend, I was able to cobble together a process I am happy with.  I wasn't interested in any of the Android UI or infrastructure components, and while there weren't too terribly many Android dependencies in the Downloader code, I did have to rework some sections.  Thankfully, the code is well structured, with most Downloaders inheriting their logic from an abstract class that does most of the heavy lifting.  However, I did end up rewriting some bits purely to appease my own OCD/aesthetics, especially the way it figured out which providers had puzzles available on which days.  It looks like Google Code will let me clone the Shortyz repo GitHub style so I can share my changes, but that may take a profession of interest from someone else before I get around to it.

So my modified Shortyz code netted me 20 Downloaders that, when given a valid date, query a URL for a puzzle file, then save it to disk.  The next step was to build a framework that would loop over every date to grab as many puzzles as it could.  To do that, I whipped up a simple little main method that initializes all the Downloaders, gets the current date, asks each Downloader for the puzzle from that day, then decrements the day.  Since I know none of these sources is going to have an infinite backlog, I implemented some logic so that if a Downloader failed to deliver a puzzle when it was supposed to three times in a row, it got blacklisted and wasn't queried again.  Finally, to be a good citizen, I only request a puzzle from each website once every (very conservative) 30 seconds.
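The loop logic is easy to sketch.  The Downloader interface below is my own simplification, not the real Shortyz API, and the fake source in main (with a five-day backlog) is just there to demonstrate the blacklisting:

```java
import java.time.LocalDate;
import java.util.*;

public class Harvest {
    // Stand-in for the real Shortyz Downloader classes.
    interface Downloader {
        boolean availableOn(LocalDate date); // e.g. false for a Sunday-only source
        boolean download(LocalDate date);    // fetch and save; false on failure
    }

    static final int MAX_STRIKES = 3;

    // Walk backwards from `start`, asking every non-blacklisted downloader for
    // each day's puzzle; three consecutive misses blacklists a source.
    // Returns the blacklist. delayMillis is the politeness pause (30000 in my run).
    static Set<Downloader> harvest(List<Downloader> downloaders, LocalDate start,
                                   int days, long delayMillis) throws InterruptedException {
        Map<Downloader, Integer> strikes = new HashMap<>();
        Set<Downloader> blacklist = new HashSet<>();
        LocalDate date = start;
        for (int d = 0; d < days; d++, date = date.minusDays(1)) {
            for (Downloader dl : downloaders) {
                if (blacklist.contains(dl) || !dl.availableOn(date)) continue;
                if (dl.download(date)) {
                    strikes.put(dl, 0);                // a success resets the count
                } else if (strikes.merge(dl, 1, Integer::sum) >= MAX_STRIKES) {
                    blacklist.add(dl);                 // stop querying this source
                }
                Thread.sleep(delayMillis);             // be a good citizen
            }
        }
        return blacklist;
    }

    public static void main(String[] args) throws InterruptedException {
        Downloader fake = new Downloader() {
            int served = 0;
            public boolean availableOn(LocalDate d) { return true; }
            public boolean download(LocalDate d) { return ++served <= 5; }
        };
        Set<Downloader> out = harvest(List.of(fake), LocalDate.of(2012, 2, 22), 20, 0);
        System.out.println(out.size()); // prints 1: blacklisted after 3 straight misses
    }
}
```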

What is this?

I do a lot of crossword puzzles (I try to do at least one a day, normally three or four on Sundays), and when you do that many puzzles from that many sources, you tend to notice things.  Two examples: "crossword puzzle words" like "EPEE" or "IOUS" or "ESAI" (Morales) that pop up again and again because they make life easier for the puzzle constructors, and those cases where two different editors at two different publications use the same word, sometimes with the same clue, on the same day.  I have a whole list of things I think will be interesting to look at that I'll document in more detail in a later post.

I'm also a Software Engineer looking to keep up with cool new technology, and what better way to learn than to have a personally interesting problem?  To come to even slightly meaningful generalizations about crossword puzzles, it's critical to have as much data as possible.  Conveniently, Big Data is a hot topic in the field as storage prices drop and the number of quantifiable interactions between people and systems increases.  Google, Facebook, Amazon, and pretty much every company on the internet or off it is making buckets of money turning the terabytes of data they have collected into actionable analyses.

The main tool (or at least, the one I'm interested in now) for asking questions of these data quickly and efficiently is MapReduce, an algorithm developed at Google but documented well enough that an open source implementation, Hadoop, was created and is maintained by Apache.

I intend for this project to serve as a pleasant counterpart to my day job while helping me develop a skill set that I'm interested in pursuing, whether I'm still interested in it by the end of this project is one of the more important questions :)

I intend for this blog to serve as a project log of the problems I encounter as well as how I resolved them, a document of my hypotheses and conclusions, and a gallery of (hopefully) pretty data visualizations. But honestly, by making this public, I also hope it will serve as a goad to keep up my enthusiasm long enough to come to interesting conclusions.