Thursday, February 23, 2012

How'd you get your data?

While nothing compares to the pleasure (and terror!) of setting pen to newsprint on a New York Times Sunday crossword, I'm a very happy user of the Shortyz Android app by Robert "kebernet" Cooper.  Not only is Shortyz free, but Mr. Cooper also makes the source code available on Google Code.  Having written Android apps before, I figured it would be a cinch to convert some Android code for downloading puzzle files into a standalone desktop app that could do the same on a larger scale.

Well, no, of course not.

But over the course of a weekend, I was able to cobble together a process I am happy with.  I wasn't interested in any of the Android UI or infrastructure components, and while there weren't terribly many Android dependencies in the Downloader code, I did have to rework some sections.  Thankfully, the code is well structured, with most Downloaders inheriting their logic from an abstract class that does most of the heavy lifting.  However, I did end up rewriting some bits purely to appease my own OCD/aesthetics, especially the way it figured out which providers had puzzles available on which days.  It looks like Google Code will let me clone the Shortyz repo GitHub style so I can share my changes, but that may take a profession of interest from someone else before I get around to it.

So my modified Shortyz code netted me 20 Downloaders that, when given a valid date, query a URL for a puzzle file, then save it to disk.  The next step was to build a framework that would loop over every date to grab as many puzzles as it could.  To do that I whipped up a simple little main method that initializes all the Downloaders, gets the current date, asks each Downloader for the puzzle from that day, then decrements the day.  Since I know none of these sources is going to have an infinite backlog, I implemented some logic so that if a Downloader failed to deliver a puzzle three times in a row on days when it was supposed to have one, that Downloader got blacklisted and wasn't queried again.  Finally, to be a good citizen, I only request a puzzle from each website once every (very conservative) 30 seconds.


Here's the actual code:

    do {
      LOG.info("Starting " + format.format(calendar.getTime()));
      for (Downloader next : downloaderLoader) {
        // Only ask sources that publish on this day and are not blacklisted
        if (next.canDownload(calendar) && !blacklist.contains(next)) {
          download = next.download(calendar.getTime());
          if (download != null) {
            // Success resets the consecutive-failure count
            failCount.put(next, 0);
          } else {
            // failCount is assumed to hold an entry (0) for every Downloader
            int value = failCount.get(next) + 1;
            failCount.put(next, value);
            // Three misses in a row: assume the backlog is exhausted
            if (value > 2) {
              LOG.severe("Blacklisted " + next.getName() + " " + format.format(calendar.getTime()));
              blacklist.add(next);
            }
          }
        }
      }
      // Step back one day
      calendar.add(Calendar.DATE, -1);
      // Be nice
      Thread.sleep(30000);
    } while (blacklist.size() < numDownloaders);
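
The fragment above leans on the Shortyz classes, so it won't run on its own. As a rough, self-contained sketch of the same idea (count consecutive misses, blacklist a source after three, stop when every source is blacklisted), something like the following should work. StubSource, BacklogCrawl, and crawl are hypothetical names invented for illustration, not part of Shortyz, and the 30 second pause is left out so the sketch finishes instantly.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BacklogCrawl {
    /** Hypothetical stand-in for a Downloader: it "has" puzzles back to a fixed oldest date. */
    static final class StubSource {
        final String name;
        final LocalDate oldest;

        StubSource(String name, LocalDate oldest) {
            this.name = name;
            this.oldest = oldest;
        }

        /** Returns a fake file name, or null once the backlog is exhausted. */
        String download(LocalDate date) {
            return date.isBefore(oldest) ? null : name + "-" + date + ".puz";
        }
    }

    static final int MAX_CONSECUTIVE_FAILURES = 3;

    /** Walks backwards from start until every source is blacklisted; returns what was fetched. */
    static List<String> crawl(List<StubSource> sources, LocalDate start) {
        Map<StubSource, Integer> failCount = new HashMap<>();
        Set<StubSource> blacklist = new HashSet<>();
        List<String> fetched = new ArrayList<>();
        LocalDate day = start;
        while (blacklist.size() < sources.size()) {
            for (StubSource source : sources) {
                if (blacklist.contains(source)) {
                    continue;
                }
                String file = source.download(day);
                if (file != null) {
                    failCount.put(source, 0);      // success resets the streak
                    fetched.add(file);
                } else if (failCount.merge(source, 1, Integer::sum) >= MAX_CONSECUTIVE_FAILURES) {
                    blacklist.add(source);         // third miss in a row
                }
            }
            day = day.minusDays(1);
            // A real crawl would pause here, as the post does with Thread.sleep(30000).
        }
        return fetched;
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2012, 2, 23);
        List<StubSource> sources = List.of(
            new StubSource("daily-a", start.minusDays(4)),   // 5 days of backlog
            new StubSource("daily-b", start.minusDays(1)));  // 2 days of backlog
        System.out.println(crawl(sources, start).size() + " puzzles fetched");
    }
}
```

Using Map.merge means a source with no entry yet starts its streak at 1, so the map does not need to be pre-populated with zeros.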

So, what did that get me? 26 hours later I have 12,424 puzzles.  Of the original 20 sources, 4 were still active when I hit 1/1/2004 and decided to kill it.  While some characterization of these puzzles will almost certainly be one of the first things I look at with Hadoop (think of it as a warmup), I think I'll also post some initial dataset analysis in the next post.

2 comments:

  1. Why a separate blacklist? Why not tell each Downloader the maximum failCount during initialization and let it count failures itself? Then it could just return false from the canDownload() method. Most of the code on that page is managing the count, and little of it is about managing the calling.

  2. I hear what you're saying, but I don't think it belongs in the Downloader logic. What canDownload(Calendar) returns is whether this publisher has a crossword puzzle available on this day of the week. So if it's a daily puzzle it always returns 'true', if it's good Mon-Sat it only returns false on Sun, etc. Other than that, the Downloaders are stateless, so I think it makes sense to have the controller portion external to them.

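
To make that reply a bit more concrete, here is a minimal sketch of a stateless day-of-week check along the lines described. PublishingSchedule and its factory methods are hypothetical names made up for this example; only the canDownload(Calendar) signature comes from the discussion above, and the real Shortyz code may well do this differently.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Set;

class PublishingSchedule {
    // Calendar.DAY_OF_WEEK values on which this (hypothetical) publisher has no puzzle.
    private final Set<Integer> daysOff;

    private PublishingSchedule(Set<Integer> daysOff) {
        this.daysOff = daysOff;
    }

    /** A daily puzzle: available every day of the week. */
    static PublishingSchedule daily() {
        return new PublishingSchedule(Set.of());
    }

    /** A Mon-Sat puzzle: available every day except Sunday. */
    static PublishingSchedule mondayToSaturday() {
        return new PublishingSchedule(Set.of(Calendar.SUNDAY));
    }

    /** Stateless: the answer depends only on the day of the week of the given date. */
    boolean canDownload(Calendar date) {
        return !daysOff.contains(date.get(Calendar.DAY_OF_WEEK));
    }

    public static void main(String[] args) {
        Calendar thursday = new GregorianCalendar(2012, Calendar.FEBRUARY, 23);
        Calendar sunday = new GregorianCalendar(2012, Calendar.FEBRUARY, 26);
        System.out.println(mondayToSaturday().canDownload(thursday)); // true
        System.out.println(mondayToSaturday().canDownload(sunday));   // false
        System.out.println(daily().canDownload(sunday));              // true
    }
}
```

Because the check carries no failure history, the decision about when to give up on a source stays with the calling loop.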