Crossword Stats: Crossword Word Analysis

I'm not feeling up to anything really complicated this weekend, but I had previously put together a Hadoop job aggregating all of the words used in the puzzles, so here's some high-level analysis of that. In my total puzzle set, there were 118,211 unique words used a total of 1,144,027 times. Here are the 20 most used words and their number of occurrences. As one might expect, they're both very short and very vowel-heavy.

ERA	1260
AREA	1258
ORE	1019
ERE	991
ARIA	978
ERIE	978
ALOE	969
ONE	967
ALE	895
ATE	877
ELSE	846
ARE	842
ERR	801
ETA	795
ALI	771
ALA	770
EAR	767
OREO	761
ODE	760
ADO	756

All but two of those would be valid Scrabble words: the exceptions, Ali (most likely Muhammad) and Oreo, are both proper nouns. I'll probably look at vowel distribution next. I bet the more common words will show a heavy bias towards vowels, as well as towards the 6 over-represented letters I pulled out in the last post.
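As a rough way to put a number on "vowel heavy", here is a minimal Python sketch that computes the share of vowels in each of the 20 words listed above. The function name `vowel_fraction` is made up for illustration, and counting only A, E, I, O, U as vowels (not Y) is an assumption, not something the post specifies.

```python
# Assumption: only A, E, I, O, U count as vowels (Y is treated as a consonant).
VOWELS = set("AEIOU")


def vowel_fraction(word):
    """Return the fraction of letters in `word` that are vowels."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        raise ValueError("word must contain at least one letter")
    return sum(c in VOWELS for c in letters) / len(letters)


# The 20 most used words from the list in the post.
top_words = [
    "ERA", "AREA", "ORE", "ERE", "ARIA", "ERIE", "ALOE", "ONE", "ALE", "ATE",
    "ELSE", "ARE", "ERR", "ETA", "ALI", "ALA", "EAR", "OREO", "ODE", "ADO",
]

for word in top_words:
    print(f"{word}\t{vowel_fraction(word):.2f}")
```

Running the same function over the full word list, grouped by how often each word is used, would be one way to test the guess about common words.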

To have somewhere to start analyzing all these words, I divided them into buckets by number of occurrences, with the bucket boundaries at powers of 2. The first bucket contains all the words that occur exactly 2^0 times (once), the second runs from 2^0+1 to 2^1 (exactly twice), the third from 2^1+1 to 2^2 (3 or 4 times), and so on, all the way up to a final bucket for 2^10+.
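The bucketing rule can be sketched in a few lines of Python. This is an illustration, not the code used for the post: the name `bucket_index` is hypothetical, and treating "2^10+" as a single open-ended bucket for every count above 2^10 is an assumption about how the last bucket was defined.

```python
def bucket_index(count, top=11):
    """Return the power-of-two bucket for a word used `count` times.

    Bucket 0 holds count == 1, bucket 1 holds count == 2, bucket 2 holds
    3..4, bucket 3 holds 5..8, and so on. Anything above 2**10 lands in
    the open-ended bucket `top` (an assumption about the "2^10+" bucket).
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    # For count >= 1, (count - 1).bit_length() equals ceil(log2(count)).
    return min((count - 1).bit_length(), top)


# Counts for a few of the words listed in the post.
for word, count in [("ERA", 1260), ("ORE", 1019), ("ADO", 756)]:
    print(word, count, bucket_index(count))
```

Using integer `bit_length` rather than a floating-point `log2` avoids rounding trouble at the exact powers of two, which is where the bucket boundaries sit.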

Below the cut, check out my graphs of the number of words in each bucket, the total number of occurrences in each bucket, and the average word length in each bucket. I think they're kind of cool.

I really like the curvy shape of the Number of Total Usages graph. It's not far off the standard bell curve shape I was expecting, but I was surprised by how smooth the weighted average word length graph is. I guess it makes sense, as the longer a word is, the more likely it is to be unique, but I didn't expect the relationship to be that direct.
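For anyone wanting to reproduce the three per-bucket series (number of words, total usages, average word length), a minimal Python sketch follows. It assumes the word counts are already in a dictionary, that "weighted" means weighted by number of usages, and that buckets follow the power-of-two rule described earlier; the names `bucket_index` and `bucket_stats` are hypothetical, and this is not the Hadoop job behind the graphs.

```python
from collections import defaultdict


def bucket_index(count):
    """Power-of-two bucket: 1 -> 0, 2 -> 1, 3..4 -> 2, 5..8 -> 3, ..."""
    return (count - 1).bit_length()


def bucket_stats(word_counts):
    """Map bucket -> (num_words, total_usages, usage-weighted avg length)."""
    words = defaultdict(int)    # distinct words per bucket
    usages = defaultdict(int)   # total occurrences per bucket
    letters = defaultdict(int)  # sum of len(word) * count per bucket
    for word, count in word_counts.items():
        b = bucket_index(count)
        words[b] += 1
        usages[b] += count
        letters[b] += len(word) * count
    return {b: (words[b], usages[b], letters[b] / usages[b])
            for b in sorted(words)}


# Four of the words from the top-20 list, with their counts from the post.
sample = {"ERA": 1260, "AREA": 1258, "ODE": 760, "ADO": 756}
stats = bucket_stats(sample)
for bucket, (n_words, n_usages, avg_len) in stats.items():
    print(bucket, n_words, n_usages, round(avg_len, 2))
```

With the full word list as input, the three values per bucket correspond to the three graphs described above.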


Sunday, March 25, 2012
