Sunday, March 25, 2012

Crossword Word Analysis

I'm not feeling up to anything really complicated this weekend, but I had previously put together a Hadoop job that aggregates all of the words used in the puzzles, so here's some high-level analysis of that. In my total puzzle set, there were 118,211 unique words used a total of 1,144,027 times. Here are the 20 most used words and their number of occurrences. As one might expect, they're both very short and very vowel-heavy.

ERA     1260
AREA    1258
ORE     1019
ERE      991
ARIA     978
ERIE     978
ALOE     969
ONE      967
ALE      895
ATE      877
ELSE     846
ARE      842
ERR      801
ETA      795
ALI      771
ALA      770
EAR      767
OREO     761
ODE      760
ADO      756

All but two of those would be valid Scrabble words; the exceptions, Ali (most likely Muhammad) and Oreo, are both proper nouns. I'll probably look at vowel distribution next. I bet the more common words will show a heavy bias towards vowels, as well as towards the 6 over-represented letters I pulled out in the last post.

To have somewhere to start analyzing all these words, I divided them into buckets by number of occurrences, with the bucket boundaries at powers of 2. The first bucket contains all the words that occur 2^0 times (exactly once), the second covers 2^0+1 to 2^1 (exactly twice), the third covers 2^1+1 to 2^2, and so on, all the way up to a final 2^10+ bucket.
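For anyone who wants to try the same bucketing, here's a rough Python sketch of how I'd describe it. This isn't the actual Hadoop job. The function name bucket_index is made up for illustration, and I'm assuming the last bucket holds everything above 2^10. The three real counts come from the list above, and the other sample words and counts are invented.

```python
from collections import Counter


def bucket_index(count: int) -> int:
    """Return k such that 2^(k-1) < count <= 2^k.

    Bucket 0 holds words seen exactly once. Anything above 2^10 is
    lumped into bucket 11 (an assumption about the "2^10+" bucket).
    """
    if count < 1:
        raise ValueError("a word must occur at least once")
    # (count - 1).bit_length() is the smallest k with count <= 2^k.
    return min((count - 1).bit_length(), 11)


# ERA, ODE and ADO use counts from the post; the rest are hypothetical.
word_counts = {
    "ERA": 1260,
    "ODE": 760,
    "ADO": 756,
    "EXAMPLEWORD": 1,  # hypothetical
    "SAMPLE": 3,       # hypothetical
    "TOY": 4,          # hypothetical
}

# Number of distinct words that land in each bucket.
words_per_bucket = Counter(bucket_index(n) for n in word_counts.values())

for bucket in sorted(words_per_bucket):
    print(bucket, words_per_bucket[bucket])
```

Using bit_length keeps everything in integer math, so there's no floating-point log2 to worry about at the exact powers of 2.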

Below the cut, check out my graphs of the number of words in each bucket, the total number of occurrences in each bucket, and the average word length in each bucket. I think they're kind of cool.


I really like the curvy shape of the Number of Total Usages graph. It's not far off the standard bell curve shape I was expecting, but I was surprised by how smooth the weighted average word length graph is. I guess it makes sense, since the longer a word is, the more likely it is to be unique, but I didn't expect the relationship to be that direct.
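Here's a small Python sketch of one way to compute a weighted average word length per bucket. I'm assuming "weighted" means each word's length is weighted by how many times it occurs, and that the top bucket is everything above 2^10. The helper names are invented for illustration, ERA and AREA use the counts from the list above, and the two single-use words are made-up sample data.

```python
from collections import defaultdict


def bucket_index(count: int) -> int:
    """Return k such that 2^(k-1) < count <= 2^k, capped at 11."""
    return min((count - 1).bit_length(), 11)


def weighted_avg_length(word_counts: dict[str, int]) -> dict[int, float]:
    """Average word length per bucket, weighted by occurrences."""
    length_sum = defaultdict(int)  # sum of len(word) * occurrences
    usage_sum = defaultdict(int)   # sum of occurrences
    for word, n in word_counts.items():
        bucket = bucket_index(n)
        length_sum[bucket] += len(word) * n
        usage_sum[bucket] += n
    return {b: length_sum[b] / usage_sum[b] for b in sorted(usage_sum)}


sample = {
    "ERA": 1260,    # count from the post
    "AREA": 1258,   # count from the post
    "LONGISH": 1,   # hypothetical single-use word
    "LENGTHLY": 1,  # hypothetical single-use word
}

averages = weighted_avg_length(sample)
for bucket, avg in averages.items():
    print(bucket, round(avg, 2))
```

The same loop also gives the total usages per bucket (usage_sum), which is the quantity in the Number of Total Usages graph.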
