I've been sitting on this one for a while, but +Karl Steinke was pondering which length of word filled the largest percentage of its possibility space. That is, of all possible 3 (or 4 or whatever) letter combinations, what
length of combination has the largest percentage of real words? We decided it had to be short words, probably between 3 and 5, but my money is on 3, which is convenient because that's also the largest number of letters that's easy to graph (since you can put it in 3 dimensions) and the shortest length of crossword clue. In this case, I'm using the looser definition of "word" that encompasses any 3 letter combination that shows up in one of my puzzles.
For 3 letter combinations, there are 26^3 (17576) possible values, running from AAA to ZZZ. The official Scrabble word list has 1014 words (5.77%), and my crossword word list has 3070 words (17.5%). Check out the full graph and some highlights below the fold.
Crossword Stats
Sunday, July 29, 2012
Wednesday, July 4, 2012
Digression: Visualizing Supreme Court Cases
I lost a couple of days worth of work on the Crossword puzzle stats, and I've been working on building back the motivation to do it all over again. But last week (the day after the Obamacare ruling in the Supreme Court) I was listening to NPR and they were talking about how polarized the court has become, as well as how "betrayed" the conservatives felt about how "their" Justice, John Roberts, voted.
So of course I thought, "that sounds like fun!". And I tracked down an actually very nice dataset on Wikipedia for the 2000 session and onward (wiki:2000_term_opinions_of_the_Supreme_Court_of_the_United_States).
Over the course of last weekend, I got to the point where I could visualize *something*, and it looked a lot like this:
Each vertical bar is a case and each horizontal line shows how each Justice contributed, Green blocks are the Majority opinion, Blue blocks are Concurrences, and Red blocks are Dissents, large gaps inside a block indicates multiple Justices wrote (eg) Dissents. The Justices' lines are colored by the party of the President who placed them on the court. By way of example, that third case there is Bush v. Gore, you can see (well, maybe you can't it's definitely the most complicated to draw), that 5 justices participated in the majority, including three that also contributed to the same concurrence. The other four Justices wrote four separate Dissents. As you can see, it gets very messy, very quick.
As a first pass, I keep the Justices in their natural list order, so within each opinion, Rehnquist will be at the top and Breyer will be at the bottom. I figured this would provide some continuity that would make it easier to follow. But, looking at it, maybe not so much...
So Tuesday I sent out a plea to my computer science minded friends asking for suggestions on how to draw the lines such that they wouldn't cross at weird angles and would stay out of each others' way.
This is what I asked for, even if gmail dropped my stupid images:
(Before) (Suggested)
Despite the lack of any real idea what I was talking about, several of them pointed me at some very good resources for graph drawing algorithms, and I think what I want is some sort of PCB Routing algorithm, but the more I looked into that, the more I realized how much math was involved, and did I really want to do that for fun?
So I started to think about what I could do short of that point, and it struck me that a lot of the "end" (on the right) nodes were in the "wrong" places. Figuring out the "right" places still involves a fair amount of math, figuring out the optimal ordering of (for example) 7 Justices in a Majority opinion still involves 5040 (7!) operations, but it was at least math I was comfortable with (Permutations).
So I whipped up a Permutation Iterator (ie, stole, from this guy) and modified it to accept a Cost function that gets applied to every permutation. The input to be permutted is an integer array containing the start (left) indexes, so, looking at the Before image above, trying to find the correct order for the majority opinion on the right, I would submit an array that looked like {1, 2, 4, 6, 7, 8 }. That describes a possible ordering where node 1 on the left connects to node 0 on the right, 2 -> 1, 4 -> 2, etc. Calculating the cost of that permutation is pretty straightforward, I take the ABS( array Value - array Index) summed across all elements. So for this example, the cost would be (1-0)+(2-1)+(4-2)+(6-3)+(7-4)+(8-5) = 13.
Just minimizing the distance helped, but it got even better when I minimized both distance *and* the number of intersecting lines.
The result of all of that is below, I think it's definitely an improvement, check out all the long lines! but it's still not ideal.
I have some ideas on how to improve it further, right now I'm leaving the concurrences/dissents in their natural order, and I could probably get some more mileage out of adjusting their order to reduce the total distance, I'm also wondering if maybe looking ahead a second case (or further) might also help, but I'm less certain how to implement that.
I'm especially looking for advice on the long, parallel lines, where the steeper the lines, the more they tend to blur together, what should I do about this?
Thanks for your help!
So of course I thought, "that sounds like fun!". And I tracked down an actually very nice dataset on Wikipedia for the 2000 session and onward (wiki:2000_term_opinions_of_the_Supreme_Court_of_the_United_States).
Over the course of last weekend, I got to the point where I could visualize *something*, and it looked a lot like this:
Each vertical bar is a case and each horizontal line shows how each Justice contributed, Green blocks are the Majority opinion, Blue blocks are Concurrences, and Red blocks are Dissents, large gaps inside a block indicates multiple Justices wrote (eg) Dissents. The Justices' lines are colored by the party of the President who placed them on the court. By way of example, that third case there is Bush v. Gore, you can see (well, maybe you can't it's definitely the most complicated to draw), that 5 justices participated in the majority, including three that also contributed to the same concurrence. The other four Justices wrote four separate Dissents. As you can see, it gets very messy, very quick.
As a first pass, I keep the Justices in their natural list order, so within each opinion, Rehnquist will be at the top and Breyer will be at the bottom. I figured this would provide some continuity that would make it easier to follow. But, looking at it, maybe not so much...
So Tuesday I sent out a plea to my computer science minded friends asking for suggestions on how to draw the lines such that they wouldn't cross at weird angles and would stay out of each others' way.
This is what I asked for, even if gmail dropped my stupid images:
(Before) (Suggested)
Despite the lack of any real idea what I was talking about, several of them pointed me at some very good resources for graph drawing algorithms, and I think what I want is some sort of PCB Routing algorithm, but the more I looked into that, the more I realized how much math was involved, and did I really want to do that for fun?
So I started to think about what I could do short of that point, and it struck me that a lot of the "end" (on the right) nodes were in the "wrong" places. Figuring out the "right" places still involves a fair amount of math, figuring out the optimal ordering of (for example) 7 Justices in a Majority opinion still involves 5040 (7!) operations, but it was at least math I was comfortable with (Permutations).
So I whipped up a Permutation Iterator (ie, stole, from this guy) and modified it to accept a Cost function that gets applied to every permutation. The input to be permutted is an integer array containing the start (left) indexes, so, looking at the Before image above, trying to find the correct order for the majority opinion on the right, I would submit an array that looked like {1, 2, 4, 6, 7, 8 }. That describes a possible ordering where node 1 on the left connects to node 0 on the right, 2 -> 1, 4 -> 2, etc. Calculating the cost of that permutation is pretty straightforward, I take the ABS( array Value - array Index) summed across all elements. So for this example, the cost would be (1-0)+(2-1)+(4-2)+(6-3)+(7-4)+(8-5) = 13.
Just minimizing the distance helped, but it got even better when I minimized both distance *and* the number of intersecting lines.
The result of all of that is below, I think it's definitely an improvement, check out all the long lines! but it's still not ideal.
I have some ideas on how to improve it further, right now I'm leaving the concurrences/dissents in their natural order, and I could probably get some more mileage out of adjusting their order to reduce the total distance, I'm also wondering if maybe looking ahead a second case (or further) might also help, but I'm less certain how to implement that.
I'm especially looking for advice on the long, parallel lines, where the steeper the lines, the more they tend to blur together, what should I do about this?
Thanks for your help!
Saturday, March 31, 2012
Link: A Better Strategy for Hangman
This is a very interesting article about how you're probably playing Hangman wrong. It's very much in the vein of what I'm trying to do.
Sunday, March 25, 2012
What's a "Word"
One of the things I didn't address at all in my last post is "what counts as a 'word'?". While I did take a moment to distinguish between Proper nouns and regular nouns, I expect that the further you get down the occurrence list, the fewer words will actually appear in a dictionary. I'd really like to be able to look at that sort of thing, but unfortunately, the OED doesn't publish a web service. I did download the Google 1-grams, but since they count anything separated by white space as a word, it's full of iffy values as well (eg, the first 35 entries are the letter A repeated from 1 to 35 times). I also found a list of all the words that appear in Wikipedia as of 2005 that will certainly be a smaller corpus. They're also both likely to be full of misspellings : / . Does anyone out there know of a curated word list that might be appropriate for this kind of usage?
While I'm not exactly producing journal quality output, I'm a little more worried than normal about a GIGO issue here...
While I'm not exactly producing journal quality output, I'm a little more worried than normal about a GIGO issue here...
Crossword Word Analysis
I'm not feeling up to anything really complicated this weekend, but I had previously put together a Hadoop job aggregating all of the words used in the puzzles, so here's some high level analysis of that. In my total puzzle set, there were 118,211 unique words used a total of 1,144,027 times. Here are the 20 most used words and their number of occurrences, as one might expect, they're both very short, and very vowel heavy.
All but two of those would be valid Scrabble words, but Ali (most likely Muhammad) and Oreo are both proper nouns. I'll probably look at Vowel distribution next, I bet the more common words will see a heavy bias towards vowels, as well as towards the 6 over-represented letters I pulled out in the last post.
To have somewhere to start analyzing all these words, I divided them into buckets by the number of occurrences, subdivided by powers of 2, so the first bucket contains all the words that occur 2^0 times, the second is from 2^0+1 to 2^1, all the way up to 2^10+.
Below the cut, check out my graphs of the number of words in each bucket, the total number of occurrences in each bucket, and the average word length in each bucket, I think they're kind of cool.
ERA | 1260 |
AREA | 1258 |
ORE | 1019 |
ERE | 991 |
ARIA | 978 |
ERIE | 978 |
ALOE | 969 |
ONE | 967 |
ALE | 895 |
ATE | 877 |
ELSE | 846 |
ARE | 842 |
ERR | 801 |
ETA | 795 |
ALI | 771 |
ALA | 770 |
EAR | 767 |
OREO | 761 |
ODE | 760 |
ADO | 756 |
All but two of those would be valid Scrabble words, but Ali (most likely Muhammad) and Oreo are both proper nouns. I'll probably look at Vowel distribution next, I bet the more common words will see a heavy bias towards vowels, as well as towards the 6 over-represented letters I pulled out in the last post.
To have somewhere to start analyzing all these words, I divided them into buckets by the number of occurrences, subdivided by powers of 2, so the first bucket contains all the words that occur 2^0 times, the second is from 2^0+1 to 2^1, all the way up to 2^10+.
Below the cut, check out my graphs of the number of words in each bucket, the total number of occurrences in each bucket, and the average word length in each bucket, I think they're kind of cool.
Tuesday, March 6, 2012
Global Distribution of Letters, Part 2, The Envisioning
As promised, here is my extended analysis of the Global Letter Distribution stats I posted last. First, instead of using the chart from Google, I've switched to an analysis of all the words in the OED. As I should probably have expected, my initial, purely visual, analysis was way off. In reality, the use of the letter T is practically identical between the OED and the XWords. So to provide a little more mathematical rigor, I plotted the Percent Difference between the OED and XWord values as (XWord val - OED val) / (OED val). What I found surprising was that six letters, S, A, E, D, O, and T are overused at the expense of the other twenty letters, so if you're ever in doubt, that's a good place to start.
Google Charts wasn't doing what I wanted it to, so now I'm experimenting with Tableau, and let me tell you, it's pretty hot (if you're on rss, come view the whole post):
I'm definitely interested in any other functions you think would be fun to graph, as the little bit of formal stats I had was a long time ago. Please let me know what you think!
Google Charts wasn't doing what I wanted it to, so now I'm experimenting with Tableau, and let me tell you, it's pretty hot (if you're on rss, come view the whole post):
I'm definitely interested in any other functions you think would be fun to graph, as the little bit of formal stats I had was a long time ago. Please let me know what you think!
Sunday, March 4, 2012
Global Distribution of Letters
To take advantage of my momentum, I went ahead and found the global distribution of letters across all puzzles!
The # is a black box, and as you'd expect it's leading the pack, although it's a lot closer to A and E than I would have guessed. There's also one poor little Ñ hanging out there at the end, I should figure out where he came from and whether it's a bug...
Here's the English Letter Frequency graph from Wikipedia
T's seem to be very strongly under represented in the Crossword puzzles. More interesting maths to come soon.
The data set is here, I'll update the post with a live Google Chart after I've gotten some sleep.
The # is a black box, and as you'd expect it's leading the pack, although it's a lot closer to A and E than I would have guessed. There's also one poor little Ñ hanging out there at the end, I should figure out where he came from and whether it's a bug...
Here's the English Letter Frequency graph from Wikipedia
T's seem to be very strongly under represented in the Crossword puzzles. More interesting maths to come soon.
The data set is here, I'll update the post with a live Google Chart after I've gotten some sleep.
Subscribe to:
Posts (Atom)