Thursday, March 1, 2012

Say What? Progress?!


OMG you guys, I actually made progress tonight.  Yesterday I just tried to get Hadoop working. That was complicated on its own, and I didn't make it any easier on myself.

First, though, the Hadoop documentation sucks.  The 0.20 documentation has a really clear, basic example of how to use some packaged example jars to search a random set of XML files for matches to a regular expression.  That's a great introduction to the execution process and a good validation that everything is where it should be.  It's also very readable, with clearly numbered steps.

Unfortunately, neither the 0.23 nor the 1.0 version I downloaded worked with that demo.  There are no example jars, or even a configuration directory, which left me struggling to figure out what was going on.  I also assumed that I could run a simple, single-node setup directly from my IDE. In fact, I haven't really seen anything that says I can't, except that when I try it I get an awfully unhelpful error saying the running process doesn't have permission to write to a tmp folder.  (I know Eclipse is supposed to have a plugin, but the opinions of it I saw online were really mixed, too.)

And finally, the stupid bin/hadoop application can't cope with file paths with spaces in them! Handling that seems like it should be a no-brainer in 2012, and it was annoying to figure out.

So, lacking external validation or an easy way to test, I was frustrated but I pressed on.  I had been stressing about how I was going to feed these .puz files directly to the Mapper. It looked like I might have to create custom input formats, some kind of puz file Writable implementation, all kinds of crap. Then I ran across a random reference in all my googling and it all came together.  Just write a text file, one file path per line, and feed that to the Mapper using the same code everyone else uses in all their examples.  Then you can just use File IO inside the Mapper to do everything you need to. That part was done 30 seconds later.
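Here is a rough sketch of that manifest trick in plain Java. It is not my actual job code, and it deliberately uses no Hadoop classes so it runs on its own: `PuzManifest`, `writeManifest`, and `readPuzzle` are names I made up for illustration. In a real job, each line of the manifest would arrive as the value passed to `map()`, and something like `readPuzzle` would be the first thing the Mapper does with it. It also assumes the paths are local and readable by the process running the Mapper, which is true on a single node.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PuzManifest {

    // Writes one absolute path per line for every .puz file directly under dir.
    public static void writeManifest(Path dir, Path manifest) throws IOException {
        List<String> lines;
        try (Stream<Path> files = Files.list(dir)) {
            lines = files
                .filter(p -> p.getFileName().toString().endsWith(".puz"))
                .map(p -> p.toAbsolutePath().toString())
                .sorted()
                .collect(Collectors.toList());
        }
        Files.write(manifest, lines, StandardCharsets.UTF_8);
    }

    // Hypothetical stand-in for the start of map(): the input value is one
    // manifest line, and ordinary file IO loads the binary puzzle it names.
    public static byte[] readPuzzle(String manifestLine) throws IOException {
        return Files.readAllBytes(Paths.get(manifestLine.trim()));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Paths.get(args[0]);       // folder holding the .puz files
        Path manifest = Paths.get(args[1]);  // text file to create
        writeManifest(dir, manifest);
        for (String line : Files.readAllLines(manifest, StandardCharsets.UTF_8)) {
            byte[] bytes = readPuzzle(line);
            System.out.println(line + "\t" + bytes.length + " bytes");
        }
    }
}
```

The nice part is that the job's input is just text, so none of the custom input format machinery is needed. The manifest lines are the only thing the framework has to split up.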

Once I got everything straightened out, and a couple of other bugs fixed (Text objects don't like it when you pass nulls), I was able to run my Mapper job to convert all my binary files into SequenceFiles.  As the screenshot says, it took just shy of 3 minutes to read in all 15,070 files and translate them into Key/Value pairs.  788 of those files broke, each throwing an Exception while parsing the input file. I think these guys are actually corrupted, but I can't tell you which ones they are at the moment...
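For what it's worth, here is one way the two loose ends above could be handled, sketched in plain Java with no Hadoop types. This is an assumption about what a fix might look like, not what my job does today: `FailureLog`, `Parser`, `orEmpty`, and `tryParse` are all hypothetical names. The idea is to swap nulls for empty strings before they reach a Text object, and to remember the path of every file whose parse throws, so the broken ones can be listed afterwards.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FailureLog {

    // Hypothetical parser shape: takes a file path, may throw on bad input.
    public interface Parser<T> {
        T parse(String path) throws Exception;
    }

    private final List<String> failedPaths = new ArrayList<>();

    // Replaces null with an empty string so downstream setters never see null.
    public static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    // Runs the parser; on any exception, records the path and returns null
    // so the caller can skip that file instead of failing the whole run.
    public <T> T tryParse(String path, Parser<T> parser) {
        try {
            return parser.parse(path);
        } catch (Exception e) {
            failedPaths.add(path + "\t" + e);
            return null;
        }
    }

    // Read-only view of the recorded failures, one "path<TAB>exception" each.
    public List<String> failedPaths() {
        return Collections.unmodifiableList(failedPaths);
    }
}
```

With something like this in place, dumping `failedPaths()` at the end of a run would say exactly which files need a closer look.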

Writing these guys ballooned my data from 30MB to 109MB, but it should pay off as I finally get to do some real analysis.

Check back for that this weekend.
