Every year I try to come up with a way to pick my NCAA basketball bracket that is as complicated, math-laden and time-consuming to work through as it is sure to fail. I usually succeed on all counts. This year's installment is pretty simple mathematically, involves lots of data collection, and generates some nice pretty pictures and tables.
The model I had in mind for this year is that basketball games are multiplicative - not additive - plus Gaussian noise. That is, one team is, say, 5% better than the other team, so over any period of the game they are going to score 5% more points than the other team. That would make basketball games easy to predict if it were literally true, but then there is the element of noise. Any given period of time involves a lot of randomness. A few turnovers, a few missed shots, or conversely a few blown plays that result in baskets, etc. (and don't even get me started on the refs).
For things of this nature - things that are multiplicative with randomness - the result is often a lognormal distribution. In this case, that means that if I take the ratio of the two teams' scores and take the log of that, I should end up with errors that are normally distributed.
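To make the quantity concrete, here is a minimal Python sketch of the log-ratio for a single game. The function name log_ratio and the 84-80 score are made up for illustration, and the natural log is an assumption on my part.

```python
import math

def log_ratio(score_a, score_b):
    """Natural log of team A's score divided by team B's score."""
    return math.log(score_a / score_b)

# Hypothetical final score of 84-80: team A outscored team B by 5%.
r = log_ratio(84, 80)
print(round(r, 4))  # 0.0488

# Swapping the teams only flips the sign, which is what makes the
# log-ratio convenient: a 5% edge and a 5% deficit are mirror images.
print(round(log_ratio(80, 84), 4))  # -0.0488
```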
In addition, I assume that this randomness is the only thing that determines the outcome of basketball games beyond who is the better team. [I don't give any weight to the "matchup problem" argument (e.g. "Connecticut is going to be a matchup problem for Syracuse because of their quick guards"). That's not just for mathematical convenience - I suspect most of that stuff is bologna.] What you end up with is that I'm claiming there is one "master list" for the order of the teams, and if you see a win or loss not predicted by that list, the cause was randomness (sometimes you win at roulette, too).
The first problem was finding a list of all the scores of all the games. It turns out I couldn't find such a list. So with a little help from wget, CBS Sports and Perl, I made one. This is a list of every game played this year (up to 2/28) with the teams and the score.
From that, it's a simple matter to take the log-ratio of all the scores. After that, though, what I needed to find was a mean value for each team that minimized the residual errors. That is, I needed a set of values {x_0, ..., x_N} such that, for each game between teams n and m with log-ratio result r_{n,m}, the error e_{n,m} = (x_n - x_m) - r_{n,m} is as small as possible overall. I couldn't find an analytical way to do it, so I just iterated. That is, I started assuming all the x's were 0, and then corrected them at each iteration by a little bit in the direction that would attempt to minimize the overall error. It converges very quickly - which is itself a statement that this isn't totally bunk.
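Here is roughly what that kind of iteration can look like in Python. This is a sketch under my own assumptions, not the script behind the numbers in this post: it takes "overall error" to mean the sum of squared errors, nudges each team by a fraction of its average error per game, and uses made-up names (fit_ratings, step, iters) and made-up games.

```python
def fit_ratings(games, n_teams, step=0.5, iters=500):
    """Fit one log-strength per team.

    games: list of (n, m, r) where r = log(score_n / score_m).
    Returns x with the best team shifted to 0, so that the errors
    e = (x[n] - x[m]) - r are small in the least-squares sense.
    """
    x = [0.0] * n_teams              # start with every team equal
    for _ in range(iters):
        total = [0.0] * n_teams      # summed error per team
        count = [0] * n_teams        # games played per team
        for n, m, r in games:
            e = (x[n] - x[m]) - r
            total[n] += e            # team n looks too strong by e
            total[m] -= e            # team m looks too weak by e
            count[n] += 1
            count[m] += 1
        for i in range(n_teams):
            if count[i]:
                # Move a little bit against the average error.
                x[i] -= step * total[i] / count[i]
    best = max(x)
    return [xi - best for xi in x]

# Three hypothetical teams whose results are perfectly consistent
# with strengths 0, -0.05 and -0.10.
games = [(0, 1, 0.05), (1, 2, 0.05), (0, 2, 0.10)]
print([round(v, 3) for v in fit_ratings(games, 3)])  # [0.0, -0.05, -0.1]
```

If the overall error really is the sum of squares, this is a linear least-squares problem, so a standard solver should land on the same values.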
The output of the script is a list of each team and their x_n value, as well as the variance around that mean. Here is the top 10:
1: Kansas | 100.000 ( 0.000 ± 0.310) |
2: Kentucky | 99.449 (-0.006 ± 0.178) |
3: Duke | 99.430 (-0.006 ± 0.173) |
4: Syracuse | 95.651 (-0.044 ± 0.167) |
5: Ohio State | 94.774 (-0.054 ± 0.183) |
6: Mississippi State | 94.152 (-0.060 ± 0.187) |
7: Brigham Young | 94.059 (-0.061 ± 0.478) |
8: Kansas State | 93.852 (-0.063 ± 0.204) |
9: West Virginia | 91.649 (-0.087 ± 0.249) |
10: Baylor | 90.802 (-0.096 ± 0.123) |
The values on the right, before the parentheses, are how many points each team would score in a theoretical (mean) game against the #1 team (in this case Kansas), assuming that team scored exactly 100 pts. Inside the parentheses are their actual log-means (again relative to Kansas) and their variance around that mean.
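The conversion from a log-mean to that points column can be sketched in a couple of lines of Python. The function name points_vs_top is made up, and the natural log is my assumption; it does reproduce the table to within the rounding of the printed log-means.

```python
import math

def points_vs_top(x, top_points=100.0):
    """Expected points against the top team, given log-mean x relative to it."""
    return top_points * math.exp(x)

# Baylor's printed log-mean is -0.096.
print(round(points_vs_top(-0.096), 1))  # 90.8, in line with the table's 90.802
```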
You can see a few things. First, this list is very reasonable and is very close in many ways to the polls. Second, it's apparent that Kansas, Kentucky and Duke are all about the same, but then there's a pretty big drop-off. You can find the full list here (sorry Kentucky Christian).
Before we continue, we're now in a position to test my original hypothesis that games follow a lognormal distribution. If I take the error in each game relative to the set of x's I found above, I can generate a histogram of those errors. Here is that histogram (after scaling each team's error to unity variance):
Neat! That looks pretty normal. [I'm not sure what the deal is with the spikes. I'm not getting paid for this, so I'm going to overlook that for now].
One can also do a Q-Q plot of the data, which shows how close to normal it is:
If the result aligns perfectly with the straight line y=x, then you have a perfect normal distribution with infinite data points. Note that there are little up/downticks at the ends. This indicates that the tails are a bit heavier than a true normal distribution. That is, there are more blowouts than you would expect. Again, I'm not getting paid for this, so I'm going to forgo any additional analysis here.
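For anyone who wants to build the same kind of plot, here is a minimal Python sketch of how the Q-Q points can be computed. It is a simplification: it standardizes the pooled errors rather than scaling each team separately, the function name qq_points is made up, and the sample errors are invented.

```python
from statistics import NormalDist, mean, pstdev

def qq_points(errors):
    """Pair each sorted, standardized error with its theoretical normal quantile."""
    mu, sigma = mean(errors), pstdev(errors)
    z = sorted((e - mu) / sigma for e in errors)
    n = len(z)
    # Plotting position (i + 0.5) / n keeps the probabilities strictly
    # between 0 and 1, which inv_cdf requires.
    theory = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theory, z))

# Made-up errors, just to show the shape of the output.
for t, s in qq_points([-0.2, -0.1, 0.0, 0.1, 0.2]):
    print(round(t, 2), round(s, 2))
```

Plotting the second value against the first gives the Q-Q plot; points that drift away from y=x at the ends are the heavy tails mentioned above.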
From this point, we're finally able to make predictions. Unfortunately, I again don't know of any analytical way to turn the parameters of two normal distributions into a probability. But once again, we can fudge it with enough CPU power. So I simulated 1 million games between each pair of teams with the parameters from the previous step. Here are the top 10 against each other (I didn't run the full 500x500 table because it's excessive):
| Kansas | Kentucky | Duke | Syracuse | Ohio State | Mississippi State | Brigham Young | Kansas State | West Virginia | Baylor |
Kansas | - | 50.7 | 50.9 | 54.9 | 56.0 | 56.4 | 54.3 | 56.8 | 58.8 | 61.4 |
Kentucky | 49.3 | - | 50.0 | 56.3 | 57.6 | 58.2 | 54.3 | 58.5 | 60.5 | 66.1 |
Duke | 49.1 | 50.0 | - | 56.1 | 57.5 | 58.3 | 54.2 | 58.6 | 60.6 | 66.5 |
Syracuse | 45.1 | 43.7 | 43.9 | - | 51.7 | 52.5 | 51.2 | 53.1 | 55.5 | 59.8 |
Ohio State | 44.0 | 42.4 | 42.5 | 48.3 | - | 50.9 | 50.5 | 51.4 | 54.5 | 57.6 |
Mississippi State | 43.6 | 41.8 | 41.7 | 47.5 | 49.1 | - | 50.1 | 50.5 | 53.5 | 56.3 |
Brigham Young | 45.7 | 45.7 | 45.8 | 48.8 | 49.5 | 49.9 | - | 50.2 | 51.9 | 52.9 |
Kansas State | 43.2 | 41.5 | 41.4 | 46.9 | 48.6 | 49.5 | 49.8 | - | 53.0 | 55.4 |
West Virginia | 41.2 | 39.5 | 39.4 | 44.5 | 45.5 | 46.5 | 48.1 | 47.0 | - | 51.3 |
Baylor | 38.6 | 33.9 | 33.5 | 40.2 | 42.4 | 43.7 | 47.1 | 44.6 | 48.7 | - |
A value in row i, column j means that team i has that percentage chance of beating team j.
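A simulation along these lines can be sketched in Python as follows. The assumptions are mine: each team's log-performance in a game is drawn independently from a normal distribution, the ± figure is treated as a standard deviation, and the team with the larger draw wins. The function names and the smaller game count are made up for illustration. Under those same assumptions the difference of the two draws is itself normal, so the sketch also reads the probability straight off the normal CDF as a cross-check.

```python
import math
import random
from statistics import NormalDist

def simulate_win_prob(mu_a, sd_a, mu_b, sd_b, n_games=200_000, seed=0):
    """Fraction of simulated games in which team A's draw beats team B's."""
    rng = random.Random(seed)
    wins = sum(
        rng.gauss(mu_a, sd_a) > rng.gauss(mu_b, sd_b)
        for _ in range(n_games)
    )
    return wins / n_games

def closed_form_win_prob(mu_a, sd_a, mu_b, sd_b):
    """Same probability without simulation, assuming independent normals."""
    diff = NormalDist(mu_a - mu_b, math.hypot(sd_a, sd_b))
    return 1.0 - diff.cdf(0.0)

# Kansas (0.000 ± 0.310) against Baylor (-0.096 ± 0.123), from the top 10.
print(simulate_win_prob(0.000, 0.310, -0.096, 0.123))
print(closed_form_win_prob(0.000, 0.310, -0.096, 0.123))
```

Both numbers come out near 0.61, close to the 61.4 in the table; the small gap is what I'd expect from using the rounded parameters printed above.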
This is where I'll stop for now since the brackets aren't out yet. When they do come out, I can simulate each game using this approach and find who is most likely to win. Then I'll be rich, rich, rich!!!
Update: I completed the analysis after the season ended and the brackets were released. You can find that post here.