Saturday, March 31, 2012
Link: A Better Strategy for Hangman
This is a very interesting article about how you're probably playing Hangman wrong. It's very much in the vein of what I'm trying to do.
Sunday, March 25, 2012
What's a "Word"
One of the things I didn't address at all in my last post is what counts as a "word". While I did take a moment to distinguish between proper nouns and regular nouns, I expect that the further you get down the occurrence list, the fewer entries will actually appear in a dictionary. I'd really like to look into that sort of thing, but unfortunately the OED doesn't publish a web service. I did download the Google 1-grams, but since they count anything separated by whitespace as a word, they're full of iffy values as well (e.g., the first 35 entries are the letter A repeated from 1 to 35 times). I also found a list of all the words that appeared in Wikipedia as of 2005, which is at least a smaller corpus. Both are also likely to be full of misspellings :/ Does anyone out there know of a curated word list that might be appropriate for this kind of use?
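To give a concrete idea of the mechanical cleanup involved, here's a quick sketch of the kind of filter I have in mind for 1-gram style input. The class name and the length cutoffs are placeholders I made up, not anything principled:

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class OneGramFilter {
    // Keep only runs of 2-25 ASCII letters (the bounds are guesses),
    // and drop "words" that are a single letter repeated, like "AAAA".
    private static final Pattern PLAUSIBLE = Pattern.compile("[A-Za-z]{2,25}");

    public static boolean isPlausibleWord(String token) {
        return PLAUSIBLE.matcher(token).matches()
                && !token.chars().allMatch(c -> c == token.charAt(0));
    }

    public static List<String> filter(List<String> tokens) {
        return tokens.stream()
                .filter(OneGramFilter::isPlausibleWord)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(filter(List.of("AAAA", "AREA", "x9z", "ERA"))); // [AREA, ERA]
    }
}
```

Of course this only catches junk tokens, not actual misspellings, which is why a curated list would still be better.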
While I'm not exactly producing journal quality output, I'm a little more worried than normal about a GIGO issue here...
Crossword Word Analysis
I'm not feeling up to anything really complicated this weekend, but I had previously put together a Hadoop job aggregating all of the words used in the puzzles, so here's some high-level analysis of that. In my total puzzle set, there were 118,211 unique words used a total of 1,144,027 times. Here are the 20 most used words with their occurrence counts; as one might expect, they're both very short and very vowel-heavy.
Word | Occurrences
ERA | 1260
AREA | 1258
ORE | 1019
ERE | 991
ARIA | 978
ERIE | 978
ALOE | 969
ONE | 967
ALE | 895
ATE | 877
ELSE | 846
ARE | 842
ERR | 801
ETA | 795
ALI | 771
ALA | 770
EAR | 767
OREO | 761
ODE | 760
ADO | 756
All but two of those would be valid Scrabble words; ALI (most likely Muhammad) and OREO are both proper nouns. I'll probably look at vowel distribution next. I bet the more common words will show a heavy bias towards vowels, as well as towards the six over-represented letters I pulled out in the last post.
To have somewhere to start analyzing all these words, I divided them into buckets by occurrence count, with boundaries at powers of 2: the first bucket holds words that occur exactly once (2^0), the second covers 2^0+1 to 2^1, and so on up to a final bucket for everything past 2^10.
Below the cut, check out my graphs of the number of words in each bucket, the total number of occurrences in each bucket, and the average word length in each bucket; I think they're kind of cool.
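For the record, computing a word's bucket index is a one-liner with a little bit twiddling. This sketch (the class name is mine) follows the bucket scheme described above:

```java
public class OccurrenceBuckets {
    // Bucket 0 holds words seen exactly once (2^0); bucket k (k >= 1) holds
    // counts in the range 2^(k-1)+1 .. 2^k; bucket 11 catches everything past 2^10.
    public static int bucketIndex(int occurrences) {
        if (occurrences <= 1) return 0;
        // 32 - numberOfLeadingZeros(n-1) is the smallest k with n <= 2^k
        int bucket = 32 - Integer.numberOfLeadingZeros(occurrences - 1);
        return Math.min(bucket, 11);
    }

    public static void main(String[] args) {
        System.out.println(bucketIndex(978)); // ARIA's 978 occurrences -> bucket 10 (513..1024)
    }
}
```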
Tuesday, March 6, 2012
Global Distribution of Letters, Part 2, The Envisioning
As promised, here is my extended analysis of the Global Letter Distribution stats from my last post. First, instead of using the chart from Google, I've switched to an analysis of all the words in the OED. As I probably should have expected, my initial, purely visual analysis was way off; in reality, the use of the letter T is practically identical between the OED and the XWords. To provide a little more mathematical rigor, I plotted the percent difference between the OED and XWord values as (XWord val - OED val) / (OED val). What I found surprising was that six letters, S, A, E, D, O, and T, are overused at the expense of the other twenty letters, so if you're ever in doubt, that's a good place to start.
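In code form the calculation is trivial, but for completeness, here it is; the frequencies in the example are hypothetical, just to show the shape of it:

```java
public class PercentDifference {
    // (XWord val - OED val) / (OED val), the same formula as in the post.
    // Positive means the letter is overused in crosswords relative to the OED.
    public static double percentDiff(double xword, double oed) {
        return (xword - oed) / oed;
    }

    public static void main(String[] args) {
        // A letter at 10% in crosswords but 8% in the OED is overused by ~25%.
        System.out.println(percentDiff(0.10, 0.08));
    }
}
```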
Google Charts wasn't doing what I wanted it to, so now I'm experimenting with Tableau, and let me tell you, it's pretty hot (if you're on rss, come view the whole post):
I'm definitely interested in any other functions you think would be fun to graph, as the little bit of formal stats I had was a long time ago. Please let me know what you think!
Sunday, March 4, 2012
Global Distribution of Letters
To take advantage of my momentum, I went ahead and found the global distribution of letters across all puzzles!
The # is a black box, and as you'd expect it's leading the pack, although it's a lot closer to A and E than I would have guessed. There's also one poor little Ñ hanging out there at the end; I should figure out where he came from and whether it's a bug...
Here's the English Letter Frequency graph from Wikipedia
T's seem to be very strongly under-represented in the crossword puzzles. More interesting maths to come soon.
The data set is here, I'll update the post with a live Google Chart after I've gotten some sleep.
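In case anyone wants to reproduce the tally, this is effectively all my Hadoop job computes, sketched here in plain Java without the Hadoop machinery (the grid strings in main are made-up samples, not real puzzle data):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LetterDistribution {
    // Tally every character across all solution grids. '#' marks a black box
    // in my data, so it shows up in the counts just like in the chart.
    public static Map<Character, Long> countLetters(List<String> grids) {
        Map<Character, Long> counts = new TreeMap<>();
        for (String grid : grids)
            for (char c : grid.toCharArray())
                counts.merge(c, 1L, Long::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countLetters(List.of("ERA#", "ONE#")));
        // prints {#=2, A=1, E=2, N=1, O=1, R=1}
    }
}
```

A stray Ñ in the real output would just mean some puzzle's grid contains that character, which is exactly the kind of thing this makes easy to chase down.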
You're doing it wrong!
So it turns out the progress I thought I'd made in my last post wasn't nearly as impressive as it seemed. While I was actually mapping things, I completely misunderstood what my output was doing. What I'd hoped was happening was that each time my Map operation was called, a sort of anonymous record was created to encapsulate all the fields associated with each puzzle. Instead, I just got a really big file full of key/value pairs, and I'd lost all concept of a distinct puzzle.
The solution was to make my records explicit, but that wasn't as hard as I feared. To include an object as a Key or Value in Hadoop, it needs to implement Writable and define how its values should be encoded and decoded. That turned out to be pretty straightforward, since I'd already done most of the work figuring out how to encode my objects in the previous SequenceFile version. And it looked like it worked! Even better, it reduced the size of my output significantly. The SequenceFile-encoded key/value pairs clocked in at 109MB, but the new PuzWritable version is just 67MB, still twice as large as the original dataset but much more reasonable. I think the biggest difference is that before, every field stored its key alongside every value, whereas now the keys are implicit, determined by the order in which the fields are encoded.
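To make the encoding idea concrete, here's a rough sketch of the Writable pattern using only java.io. The real thing implements org.apache.hadoop.io.Writable and my actual PuzWritable has more fields; the class and field names here are simplified stand-ins:

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class PuzRecord {
    public String author = "";
    public List<String> acrossClues = new ArrayList<>();

    // Fields are written in a fixed order, so no per-field keys are needed;
    // variable-length lists just get a length prefix.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(author);
        out.writeInt(acrossClues.size());
        for (String clue : acrossClues) out.writeUTF(clue);
    }

    public void readFields(DataInput in) throws IOException {
        acrossClues.clear(); // Hadoop reuses the instance; see below for why this matters
        author = in.readUTF();
        int n = in.readInt();
        for (int i = 0; i < n; i++) acrossClues.add(in.readUTF());
    }

    public static void main(String[] args) throws IOException {
        PuzRecord original = new PuzRecord();
        original.author = "Anonymous";
        original.acrossClues.add("Baseball stat");

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        original.write(new DataOutputStream(buf));

        PuzRecord copy = new PuzRecord();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(copy.author + " / " + copy.acrossClues); // Anonymous / [Baseball stat]
    }
}
```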
So now that I've got my second draft of a dataset working, it's time for a sanity check! I just added up the counts of all the fields. The fields that occur once per puzzle (author, date, source, title) should match my number of records, while the across and down clues, and the individual boxes, should be more interesting. Here's what I got:
across 1798769086
author 14282
box 11024805265
date 14282
dimension 14282
down 1942045913
source 14282
title 14282
It looks good! 14282 is the right number of records! So the across clues... wait... hundreds, thousands, millions, 1.7 billion across clues? That can't possibly be right, can it? 1798769086 / 14282 ≈ 125947. Somehow I don't think the average puzzle has 125,000 across clues. What the hell?!
Turns out, Hadoop doesn't instantiate a new Writable every time I expected it to. Instead, it just calls the Writable.readFields method on the existing object. That meant I was adding all of the clues to the same List, and since I queried the size after every record, the running total grew quadratically instead of linearly. My solution was to clear the list at the start of every readFields operation. That should be fine, since every Map/Reduce operation should be stateless, and anything keeping track of an object reference across calls is doing it wrong. So, sanity check take two:
across 549134
author 14282
box 3395411
date 14282
dimension 14282
down 594893
source 14282
title 14282
woohoo!
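The object-reuse behavior that bit me can be demonstrated without Hadoop at all. This toy (all names made up) feeds three "records" into a single instance, once forgetting to clear the list and once clearing it:

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    static class Record {
        List<String> clues = new ArrayList<>();
        void readFieldsBuggy(List<String> input) { clues.addAll(input); } // forgets to clear
        void readFieldsFixed(List<String> input) { clues.clear(); clues.addAll(input); }
    }

    // Simulate Hadoop deserializing `records` records into one reused instance.
    public static int sizeAfterBuggy(int records) {
        Record r = new Record();
        for (int i = 0; i < records; i++) r.readFieldsBuggy(List.of("clue"));
        return r.clues.size();
    }

    public static int sizeAfterFixed(int records) {
        Record r = new Record();
        for (int i = 0; i < records; i++) r.readFieldsFixed(List.of("clue"));
        return r.clues.size();
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterBuggy(3) + " vs " + sizeAfterFixed(3)); // prints "3 vs 1"
    }
}
```

Summing the buggy sizes over every record is exactly how a few hundred thousand clues turned into 1.7 billion.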
Thursday, March 1, 2012
Say What? Progress?!
OMG you guys, I actually made progress tonight. Yesterday I just tried to get Hadoop working; that was complicated on its own, but I didn't make it any easier on myself.
First, though: the Hadoop documentation sucks. The 0.20 documentation has a really clear, basic example of using some packaged example jars to search a random set of XML files for matches to a regular expression. That's a great introduction to the execution process and a good validation that everything is where it should be. It's also very readable, with well-enumerated steps.
Unfortunately, neither the 0.23 nor the 1.0 version I downloaded worked with that demo. There are no example jars, or even a configuration directory, which left me struggling to figure out what was going on. I also assumed that I could run a simple, single-node implementation directly from my IDE; in fact, I haven't really seen anything that says I can't, except that when I try it I get an awfully unhelpful error about the running process not having permission to write to a tmp folder. (I know Eclipse is supposed to have a plugin, but I saw really mixed opinions of it online, too.)
And finally, the stupid bin/hadoop application can't cope with file paths that contain spaces! It seems like that should be a no-brainer in 2012; that one was annoying to figure out.
So, lacking external validation or an easy way to test, I was frustrated, but I pressed on. I had been stressing about how I was going to feed these .puz files directly to the Mapper; it looked like I might have to create custom input formats, some kind of puz-file Writable implementation, all kinds of crap. Then I ran across a random reference in all my googling and it all came together: just write a text file with one file path per line, and feed that to the mapper using the same code everyone else uses in their examples. Then you can use regular file IO inside the mapper to do everything you need, so that part was done 30 seconds later.
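Here's roughly what the manifest-writing step looks like; the directory names are placeholders, and the mapper-side file IO isn't shown:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ManifestWriter {
    // Write one .puz path per line; this text file is what gets handed to the
    // standard text input format, and each mapper call receives one path.
    public static void writeManifest(Path puzDir, Path manifest) throws IOException {
        try (Stream<Path> files = Files.walk(puzDir)) {
            List<String> lines = files
                    .filter(p -> p.toString().endsWith(".puz"))
                    .map(Path::toString)
                    .sorted()
                    .collect(Collectors.toList());
            Files.write(manifest, lines);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args.length > 0 ? args[0] : "puzzles"); // placeholder directory
        if (Files.isDirectory(dir)) writeManifest(dir, Paths.get("manifest.txt"));
    }
}
```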
Once I got everything straightened out, and a couple of other bugs fixed (Text objects don't like it when you pass them nulls), I was able to run my Mapper job to convert all my binary files into SequenceFiles. As the screenshot says, it took just shy of 3 minutes to read in all 15,070 files and translate them into Key/Value pairs. 788 of those files broke, throwing an exception while parsing the input file; I think those are actually corrupted, but I can't tell you which ones they are at the moment...
Writing these guys ballooned my data from 30MB to 109MB, but it should pay off as I finally get to do some real analysis.
Check back for that this weekend.