Sunday, February 26, 2012

Data Characterization

So, while I reserve the right to monkey with this dataset before I start doing any real analysis, this seems like a good time to describe whose puzzles I have, some of their properties, and the number I ended up getting in my initial download.


| Source/Publication | Days of the week available | Latest Date | Earliest Date | Number Downloaded |
|---|---|---|---|---|
| Boston Globe | SUNDAY | 2012-02-19 | 2008-04-20 | 200 |
| Chronicle of Higher Education | FRIDAY | 2012-02-17 | 2011-08-12 | 24 |
| Ink Well | FRIDAY | 2012-02-17 | 2010-01-01 | 111 |
| I Swear | FRIDAY | 2012-02-17 | 2010-01-01 | 112 |
| Jonesin' | THURSDAY | 2012-02-16 | 2010-01-07 | 111 |
| Thomas Joseph | MONDAY - SATURDAY | 2012-02-22 | 2007-06-14 | 1470 |
| Los Angeles Times | DAILY | 2012-02-22 | 2012-01-26 | 28 |
| Newsday | DAILY | 2012-02-22 | 2005-05-01 | 2485 |
| New York Times | DAILY | 2012-02-22 | 2004-01-01 | 2974 |
| Onion AV Club | WEDNESDAY | 2012-02-22 | 2006-10-04 | 282 |
| Philadelphia Inquirer | SUNDAY | 2012-02-19 | 2004-01-04 | 425 |
| Premier | MONDAY - SATURDAY | 2012-02-19 | 2004-11-14 | 380 |
| Sheffer | MONDAY - SATURDAY | 2012-02-22 | 2007-06-14 | 1470 |
| Thinks.com | DAILY | | | |
| Universal | DAILY | 2012-02-22 | 2004-01-01 | 2780 |
| USA Today | MONDAY - SATURDAY | 2012-02-22 | 2008-06-10 | 1044 |
| Washington Post | DAILY | 2012-02-22 | 2011-04-05 | 324 |
| Washington Post Puzzler | SUNDAY | 2012-02-19 | 2004-01-04 | 425 |
| Wall Street Journal | FRIDAY | 2012-02-17 | 2004-01-02 | 425 |

I'm pretty conflicted about this. While it's nice to have over 15,000 puzzles, the distribution between sources is all over the map: I only have 28 LATs and 282 Onion AVs, against 2974 NYTs. That imbalance is going to make some kinds of conclusions difficult to draw and statistically suspect, but I guess we'll cross that moat when we come to it.
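
To put a number on that imbalance, here is a minimal sketch that totals the counts from the table and works out each source's share of the whole. The `downloads` dictionary and the `summarize` helper are names made up for this illustration, not part of any download script, and Thinks.com is left out because the table lists no count for it. With these figures the total comes to 15,070.

```python
# Counts copied by hand from the table above.
# Thinks.com is omitted because no count is listed for it.
downloads = {
    "Boston Globe": 200,
    "Chronicle of Higher Education": 24,
    "Ink Well": 111,
    "I Swear": 112,
    "Jonesin'": 111,
    "Thomas Joseph": 1470,
    "Los Angeles Times": 28,
    "Newsday": 2485,
    "New York Times": 2974,
    "Onion AV Club": 282,
    "Philadelphia Inquirer": 425,
    "Premier": 380,
    "Sheffer": 1470,
    "Universal": 2780,
    "USA Today": 1044,
    "Washington Post": 324,
    "Washington Post Puzzler": 425,
    "Wall Street Journal": 425,
}


def summarize(counts):
    """Return the total puzzle count and each source's fraction of it."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = summarize(downloads)

# Print sources from largest to smallest, with their share of the dataset.
for name, n in sorted(downloads.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:32s}{n:6d}{shares[name]:8.1%}")
print(f"{'Total':32s}{total:6d}")
```

Sorting by count makes it easy to see how much of the dataset comes from a handful of daily sources, which is worth keeping in mind before comparing publications directly.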
