Sunday, March 4, 2012

You're doing it wrong!

So it turns out the progress I thought I'd made in my last post wasn't nearly as impressive as I thought it was.  While I was actually mapping things, I completely missed the boat on what my output was really doing.  What I'd hoped was happening was that each time my Map operation was called, a sort of anonymous record was created to encapsulate all the fields associated with each puzzle.  Instead, I just got a really big file full of key/value pairs, and I'd lost all concept of a distinct puzzle.

The solution was to make my records explicit, but that wasn't as hard as I thought.  To include an object as a Key or Value in Hadoop, it needs to implement Writable and define how its values should be encoded and decoded (keys additionally need WritableComparable so they can be sorted).  That turned out to be pretty straightforward, since I'd already done most of the work figuring out how to encode my objects in the previous SequenceFile version.  And it looked like it worked!  Even better, it reduced the size of my output significantly.  The SequenceFile encoded key/value pairs clocked in at 109MB, but the new PuzWritable version is just 67MB, still twice as large as the original dataset, but much more reasonable.  I think the biggest difference is that before, every field stored its key alongside every value, whereas now the keys are implicit, determined by the order the fields are encoded in.
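For anyone curious, here's roughly the shape of the thing.  This is a trimmed-down sketch, not the actual class; the field names are just for illustration, and the real record also carries the clue lists and boxes.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Rough shape of the record class: the real PuzWritable has more fields,
// but the encode/decode mechanics look like this.
public class PuzWritable implements Writable {

    private String title = "";
    private String author = "";
    private int width;
    private int height;

    @Override
    public void write(DataOutput out) throws IOException {
        // Fields are written in a fixed order, so no per-field keys are stored.
        Text.writeString(out, title);
        Text.writeString(out, author);
        out.writeInt(width);
        out.writeInt(height);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Fields are read back in exactly the order they were written.
        title = Text.readString(in);
        author = Text.readString(in);
        width = in.readInt();
        height = in.readInt();
    }
}
```

Since the fields always come back in the same order they went out, there's no need to store a key next to each value, which is where most of the space savings comes from.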

So now that I've got my second draft of a dataset working, it's time for a sanity check! I just added up the counts of all the fields with a quick map-only job: the fields that occur once per puzzle (author, date, source, title) should match my record count exactly, and the numbers of across clues, down clues, and individual boxes should be more interesting.
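The counting pass is basically a mapper that bumps a counter per field.  Something like the sketch below, assuming LongWritable keys in the SequenceFile and getAcross()/getDown()/getBoxes() accessors on the record class; those names are my shorthand here, not necessarily what's in the real code.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only pass over the PuzWritable records that increments one counter
// per field.  The accessor names are placeholders for the real record class.
public class FieldCountMapper
        extends Mapper<LongWritable, PuzWritable, NullWritable, NullWritable> {

    @Override
    protected void map(LongWritable key, PuzWritable puz, Context context)
            throws IOException, InterruptedException {
        // Fields that appear exactly once per puzzle.
        context.getCounter("fields", "author").increment(1);
        context.getCounter("fields", "date").increment(1);
        context.getCounter("fields", "source").increment(1);
        context.getCounter("fields", "title").increment(1);
        context.getCounter("fields", "dimension").increment(1);

        // Repeated fields: count every clue and box.
        context.getCounter("fields", "across").increment(puz.getAcross().size());
        context.getCounter("fields", "down").increment(puz.getDown().size());
        context.getCounter("fields", "box").increment(puz.getBoxes().size());
    }
}
```

Here's what I got: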

across    1798769086
author    14282
box    11024805265
date    14282
dimension    14282
down    1942045913
source    14282
title    14282

It looks good! 14282 is the right number of records! So the across clues...wait...hundreds, thousands, millions, nearly 1.8 billion across clues?  That can't possibly be right, can it?  1798769086 / 14282 ≈ 125,947.  Somehow I don't think the average puzzle has 125,000 across clues.  What the hell?!

Turns out, Hadoop doesn't instantiate a new Writable every time I expected it to.  Instead, it just calls Writable.readFields on the same reused object.  That meant I was adding all of the clues to the same List, and when I counted its size after every record, the running total snowballed.  My solution was to clear the list at the top of every readFields call, as sketched below.  That should be fine, since every Map/Reduce operation should be stateless, and anything holding on to an object reference is doing it wrong.
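Here's the idea, boiled down to a standalone class with a single clue list.  The real PuzWritable has several lists, but the clear() at the top of readFields is the part that matters.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Same encode/decode idea as before, but with the list handling fixed:
// Hadoop hands readFields() the same reused instance for every record,
// so any accumulated state has to be reset before reading the next one.
public class ClueListWritable implements Writable {

    private final List<String> clues = new ArrayList<String>();

    public List<String> getClues() {
        return clues;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // Length-prefix the list, then write each clue.
        out.writeInt(clues.size());
        for (String clue : clues) {
            Text.writeString(out, clue);
        }
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        clues.clear();  // the fix: drop whatever the previous record left behind
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
            clues.add(Text.readString(in));
        }
    }
}
```

So sanity check take two: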

across    549134
author    14282
box    3395411
date    14282
dimension    14282
down    594893
source    14282
title    14282

woohoo!
