Sunday, February 26, 2012

I've grown today

So all these crossword puzzles are in the Across Lite .puz format.  I'm struggling with how to translate all these small binary files into something Hadoop is comfortable with.  This seems to be a common problem (the Small Files Problem), and I think the solution is SequenceFiles full of key/value pairs.  My current trouble is figuring out what I need to include in my puzzle maps and how much duplication I'm willing to put up with, since each box in the grid is typically included in two words (one across and one down).  The easy answer to the first is 'everything', and I got some advice that the answer to the second is 'a lot, disk space is free'.  Then I thought about it some more and spent an hour or so coming up with complicated strategies to index my files so that I wouldn't have to duplicate any characters.  Then I thought about it some more again and realized that they're just characters, and that any index I built would probably take more than one byte per character to store and longer than one read to look up.
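To put a rough number on that duplication, here is a minimal Python sketch.  It is not a .puz parser: it assumes the grid has already been pulled out as a list of equal-length strings with '.' marking black squares, and `extract_words` is a hypothetical helper name made up for illustration.

```python
def extract_words(grid):
    """Return every across and down run of two or more letters.

    Assumption: `grid` is a list of equal-length strings and '.'
    marks a black square.
    """
    rows = list(grid)
    cols = ["".join(col) for col in zip(*grid)]  # transpose to get the down lines
    words = []
    for line in rows + cols:
        # Black squares split a line into separate entries.
        words.extend(run for run in line.split(".") if len(run) >= 2)
    return words


# Toy 3x3 grid with no black squares, so every letter is in two words.
grid = [
    "CAT",
    "ARE",
    "BEE",
]

words = extract_words(grid)
letters_in_grid = sum(ch != "." for row in grid for ch in row)
letters_stored = sum(len(word) for word in words)

print(words)
print(letters_in_grid, "letters in the grid,", letters_stored, "stored as words")
```

On a fully crossed grid like this one, storing every word costs about twice the characters of the grid itself, which at one byte per character is still less than any per-character index would need.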

This is probably a realization that all programmers come to eventually; I'm lucky to have had it in the privacy of my own home...

