I'm not feeling up to anything really complicated this weekend, but I had previously put together a Hadoop job aggregating all of the words used in the puzzles, so here's some high-level analysis of that. In my total puzzle set, there were 118,211 unique words used a total of 1,144,027 times. Here are the 20 most used words and their number of occurrences. As one might expect, they're both very short and very vowel heavy.
Word    Occurrences
ERA     1260
AREA    1258
ORE     1019
ERE     991
ARIA    978
ERIE    978
ALOE    969
ONE     967
ALE     895
ATE     877
ELSE    846
ARE     842
ERR     801
ETA     795
ALI     771
ALA     770
EAR     767
OREO    761
ODE     760
ADO     756
All but two of those would be valid Scrabble words; the exceptions, Ali (most likely Muhammad) and Oreo, are proper nouns. I'll probably look at vowel distribution next. I bet the more common words will show a heavy bias towards vowels, as well as towards the 6 over-represented letters I pulled out in the last post.
To have somewhere to start analyzing all these words, I divided them into buckets by number of occurrences, with the bucket boundaries at powers of 2. The first bucket contains all the words that occur 2^0 times (exactly once), the second covers 2^0+1 to 2^1 (exactly twice), the third covers 2^1+1 to 2^2, and so on, all the way up to 2^10+.
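For anyone who wants to try the same bucketing, here's a minimal Python sketch of how a count could be mapped to its bucket. This isn't the code from my Hadoop job; the function name `bucket_index` and the `top_exponent` parameter are made up for illustration, and I'm assuming the top "2^10+" bucket holds everything above 2^10.

```python
def bucket_index(occurrences: int, top_exponent: int = 10) -> int:
    """Return the power-of-2 bucket for a word's occurrence count.

    Bucket 0 holds counts of exactly 2^0 = 1, and bucket k (k >= 1) holds
    counts from 2^(k-1)+1 through 2^k. Assumption: everything above
    2^top_exponent lands in one final catch-all bucket, top_exponent + 1.
    """
    if occurrences < 1:
        raise ValueError("a word must occur at least once")
    # (n - 1).bit_length() is the smallest k with n <= 2^k for n >= 1.
    exponent = (occurrences - 1).bit_length()
    return min(exponent, top_exponent + 1)


# Counts taken from the top-20 list above.
for word, count in [("ERA", 1260), ("ORE", 1019), ("ADO", 756)]:
    print(word, count, bucket_index(count))
```

Using integer bit lengths instead of a floating-point logarithm keeps exact powers of 2 (like 1024) from slipping into the wrong bucket due to rounding.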
Below the cut, check out my graphs of the number of words in each bucket, the total number of occurrences in each bucket, and the average word length in each bucket. I think they're kind of cool.
I really like the curvy shape of the Number of Total Usages graph; it's not far off the standard bell curve shape I was expecting. I was surprised, though, by how smooth the weighted average word length graph is. I guess it makes sense, since the longer a word is, the more likely it is to be unique, but I didn't expect the relationship to be that direct.
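To make the "weighted" part concrete, here's a rough Python sketch of the two averages for a group of words. I'm assuming that weighted means each word's length is weighted by how many times it occurs; the function name `average_lengths` is just for illustration and isn't from the original job.

```python
def average_lengths(counts: dict[str, int]) -> tuple[float, float]:
    """Return (plain, weighted) average word length for a group of words.

    plain:    every unique word counts once.
    weighted: every word counts once per occurrence (assumed meaning of
              "weighted" here).
    """
    if not counts:
        raise ValueError("need at least one word")
    total_usages = sum(counts.values())
    plain = sum(len(word) for word in counts) / len(counts)
    weighted = sum(len(word) * n for word, n in counts.items()) / total_usages
    return plain, weighted


# The three most used words from the list above.
plain, weighted = average_lengths({"ERA": 1260, "AREA": 1258, "ORE": 1019})
print(f"plain={plain:.3f} weighted={weighted:.3f}")
```

Running this once per bucket would give one point per bucket for a graph like the one described above.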