Sunday, July 29, 2012

Three Letter Word Space

I've been sitting on this one for a while, but +Karl Steinke was pondering which length of word filled the largest percentage of its possibility space.  That is, of all possible 3 (or 4 or whatever) letter combinations, what length of combination has the largest percentage of real words?  We decided it had to be short words, probably between 3 and 5, but my money is on 3, which is convenient because that's also the largest number of letters that's easy to graph (since you can put it in 3 dimensions) and the shortest length of crossword clue.  In this case, I'm using the looser definition of "word" that encompasses any 3 letter combination that shows up in one of my puzzles.

For 3 letter combinations, there are 26^3 (17576) possible values, running from AAA to ZZZ.  The official Scrabble word list has 1014 words (5.77%), and my crossword word list has 3070 words (17.5%).  Check out the full graph and some highlights below the fold.

Wednesday, July 4, 2012

Digression: Visualizing Supreme Court Cases

I lost a couple of days worth of work on the Crossword puzzle stats, and I've been working on building back the motivation to do it all over again.  But last week (the day after the Obamacare ruling in the Supreme Court) I was listening to NPR and they were talking about how polarized the court has become, as well as how "betrayed" the conservatives felt about how "their" Justice, John Roberts, voted.

So of course I thought, "that sounds like fun!".  And I tracked down an actually very nice dataset on Wikipedia for the 2000 session and onward (wiki:2000_term_opinions_of_the_Supreme_Court_of_the_United_States).

Over the course of last weekend,  I got to the point where I could visualize *something*, and it looked a lot like this:
 Each vertical bar is a case and each horizontal line shows how each Justice contributed, Green blocks are the Majority opinion, Blue blocks are Concurrences, and Red blocks are Dissents, large gaps inside a block indicates multiple Justices wrote (eg) Dissents.  The Justices' lines are colored by the party of the President who placed them on the court.  By way of example, that third case there is Bush v. Gore, you can see (well, maybe you can't it's definitely the most complicated to draw), that 5 justices participated in the majority, including three that also contributed to the same concurrence.  The other four Justices wrote four separate Dissents.  As you can see, it gets very messy, very quick.

As a first pass, I keep the Justices in their natural list order, so within each opinion, Rehnquist will be at the top and Breyer will be at the bottom.  I figured this would provide some continuity that would make it easier to follow. But, looking at it, maybe not so much...

So Tuesday I sent out a plea to my computer science minded friends asking for suggestions on how to draw the lines such that they wouldn't cross at weird angles and would stay out of each others' way.

This is what I asked for, even if gmail dropped my stupid images:



(Before)               (Suggested)

Despite the lack of any real idea what I was talking about, several of them pointed me at some very good resources for graph drawing algorithms, and I think what I want is some sort of PCB Routing algorithm, but the more I looked into that, the more I realized how much math was involved, and did I really want to do that for fun?

So I started to think about what I could do short of that point, and it struck me that a lot of the "end" (on the right) nodes were in the "wrong" places.  Figuring out the "right" places still involves a fair amount of math, figuring out the optimal ordering of (for example) 7 Justices in a Majority opinion still involves 5040 (7!) operations, but it was at least math I was comfortable with (Permutations).

So I whipped up a Permutation Iterator (ie, stole, from this guy) and modified it to accept a Cost function that gets applied to every permutation.  The input to be permutted is an integer array containing the start (left) indexes, so, looking at the Before image above, trying to find the correct order for the majority opinion on the right, I would submit an array that looked like {1, 2, 4, 6, 7, 8 }.  That describes a possible ordering where node 1 on the left connects to node 0 on the right, 2 -> 1, 4 -> 2, etc.  Calculating the cost of that permutation is pretty straightforward, I take the ABS( array Value - array Index) summed across all elements.  So for this example, the cost would be (1-0)+(2-1)+(4-2)+(6-3)+(7-4)+(8-5) = 13.

Just minimizing the distance helped, but it got even better when I minimized both distance *and* the number of intersecting lines.

The result of all of that is below, I think it's definitely an improvement, check out all the long lines! but it's still not ideal. 



I have some ideas on how to improve it further, right now I'm leaving the concurrences/dissents in their natural order, and I could probably get some more mileage out of adjusting their order to reduce the total distance, I'm also wondering if maybe looking ahead a second case (or further) might also help, but I'm less certain how to implement that.

I'm especially looking for advice on the long, parallel lines, where the steeper the lines, the more they tend to blur together, what should I do about this?

 Thanks for your help!