Sunday, March 25, 2012

What's a "Word"

One of the things I didn't address at all in my last post is "what counts as a 'word'?". While I did take a moment to distinguish between proper nouns and regular nouns, I expect that the further you get down the occurrence list, the fewer of the words will actually appear in a dictionary. I'd really like to be able to look at that sort of thing, but unfortunately, the OED doesn't publish a web service. I did download the Google 1-grams, but since they count anything separated by white space as a word, they're full of iffy values as well (e.g., the first 35 entries are the letter A repeated from 1 to 35 times). I also found a list of all the words that appear in Wikipedia as of 2005, which will certainly be a smaller corpus. Both are also likely to be full of misspellings :/ . Does anyone out there know of a curated word list that might be appropriate for this kind of usage?
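For what it's worth, here is a rough sketch of the check I have in mind once a word list turns up: bucket the occurrence list by rank and see what fraction of each bucket appears in the list. Everything named here (`looks_like_word`, `dictionary_coverage`, `known_words`, the bucket size) is made up for illustration, and the shape filter is just one guess at how to drop junk like the repeated-letter entries, not a real definition of a word.

```python
import re
from collections import Counter

# Assumed token shape: letters only, with optional internal apostrophes or hyphens.
TOKEN_RE = re.compile(r"[a-z]+(?:['-][a-z]+)*")
# One character repeated three or more times, e.g. "aaaa".
REPEAT_RE = re.compile(r"(.)\1{2,}")


def looks_like_word(token):
    """Cheap shape check; says nothing about whether the token is in a dictionary."""
    lowered = token.lower()
    if REPEAT_RE.fullmatch(lowered):
        return False
    return TOKEN_RE.fullmatch(lowered) is not None


def dictionary_coverage(counts, known_words, bucket_size=1000):
    """For each block of `bucket_size` ranks (most frequent first),
    return the fraction of entries found in `known_words`."""
    ranked = [word for word, _ in counts.most_common()]
    coverage = []
    for start in range(0, len(ranked), bucket_size):
        bucket = ranked[start:start + bucket_size]
        hits = sum(1 for word in bucket if word.lower() in known_words)
        coverage.append(hits / len(bucket))
    return coverage


# Toy data standing in for a real occurrence list and a real curated word list.
counts = Counter({"the": 50, "cat": 20, "aaaa": 5, "teh": 3})
known_words = {"the", "cat"}
coverage = dictionary_coverage(counts, known_words, bucket_size=2)
```

If the hunch above is right, the coverage numbers should fall off as the rank goes up. The shape filter could be applied to the 1-grams before counting, though it will not catch misspellings such as "teh"; only a decent word list can do that.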

While I'm not exactly producing journal-quality output, I'm a little more worried than normal about a GIGO (garbage in, garbage out) issue here...
