March Madness First Weekend Data Round Up

Which Dataset Did the Best?

After looking at more than 107 data points, it once again becomes clear that no single piece of data works for every game or matchup. That was true last year, the year before, and the years before that. Still, the chart below shows the data points that performed best among those I evaluated.

Personally, I almost always start my evaluations with both KenPom and Sagarin. These two data sets are always a good starting point. In fact, I pay for the KenPom subscription. It’s reasonably priced and fun to work with all season long. (Yeah, a free plug for both.) After that, I use other metrics and some tools we’ve developed, like the total adjusted efficiency and its differential.

As stated, a single data point is a terrible way to look at any game, but here are the data points that produced the best numbers. KenPom and 3 Pt Pct Shooting Differential were the two best “flat” predictors. However, as much as we complain about the Committee and seeding, Committee Rank and Seed would have given you a 25-7 record in the first round. If you combined that with the 3 Pt Pct Differential in the second round, you’d end up with a respectable 490 points (an ESPN 99% rank). In a lot of office pools, that would make you a clear favorite. Overall, you would be going into the Sweet 16 with a 37-11 record and a good chance of being on top at the end.
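
For anyone who wants to check the 490 figure, here is a quick sketch of the arithmetic. It assumes ESPN's standard scoring of 10 points per correct first-round pick and 20 per correct second-round pick, which is not spelled out above. The function and variable names are made up for illustration.

```python
# Assumption: 10 points per correct first-round pick, 20 per correct
# second-round pick (standard ESPN bracket scoring, not stated in the post).
POINTS_PER_CORRECT_PICK = {1: 10, 2: 20}


def bracket_points(correct_by_round):
    """Sum the points earned from the number of correct picks in each round."""
    return sum(
        POINTS_PER_CORRECT_PICK[rnd] * wins
        for rnd, wins in correct_by_round.items()
    )


first_round_wins = 25   # the 25-7 first round from Committee Rank and Seed
total_wins = 37         # the 37-11 overall record after two rounds
second_round_wins = total_wins - first_round_wins  # 12 of the 16 second-round games

points = bracket_points({1: first_round_wins, 2: second_round_wins})
print(points)
```

Under that scoring assumption, 25 first-round wins are worth 250 points and 12 second-round wins are worth 240, which lines up with the 490 quoted above.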

Yet when you look at each game set individually, the right set of stats could have left you perfect after these first two rounds. Yeah, even little data nuggets pointed to the Princeton win. The problem is that so many data sets point in the other direction. So what’s good and what’s junk? Several years ago, we did a study of five years of in-depth tourney data. Then last year, we had our data scientist look at some other data. That combination allowed us to win several pools last year and once again finish in the 95th percentile on ESPN. It’s still nowhere near as good as our best year, but it did give us some validation that we were on the right path. Unfortunately, this year we tried a new model. It has not performed as well as last year's, although we have some brackets in the 94% group as of now. However, our max points possible still remains

But if you are wondering what data might have led you to a perfect bracket, the truth is that a well-honed set of data could have given you one (at least until the next round). At the end of this silly season, we will go back into our hole in the ground and sort it out once more. We will then do a deep-dive verification of our data and publish our numbers. I expect to have this done by June 1st.

Of course, there are some other bracket principles we adhere to here. The main thing to remember when making any bracket is that the first round cannot be evaluated as a group. You must take each matchup and apply the right data set to it. That has been true each and every year. As a notable example, the 7/10 games seem to come down to which team has come into the tournament doing a better job of taking care of the ball. Hence, turnovers (TOs) matter the most. But again, at the end of this season, we will parse out this data and compare it with our last 10 years. Then we are going to endeavor to recreate the last 15 years for an even better look at the data and to build better models. The future goal is to go back and recreate the last 20 years for analysis.
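
To make the "right data set for the right matchup" idea concrete, here is a minimal sketch. It is not our actual model. The only rule taken from the text is that 7/10 games lean on turnovers; falling back to seed for every other matchup, the `Team` fields, and the team names and numbers in the example are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Team:
    name: str
    seed: int
    turnovers_per_game: float  # hypothetical field; lower is better


def pick_by_seed(a, b):
    """Assumed default rule: take the better (lower-numbered) seed."""
    return a if a.seed <= b.seed else b


def pick_by_turnovers(a, b):
    """Take the team that has been taking better care of the ball."""
    return a if a.turnovers_per_game <= b.turnovers_per_game else b


# One rule per seed pairing. Only the 7/10 entry comes from the post;
# any pairing not listed falls back to seed.
MATCHUP_RULES = {(7, 10): pick_by_turnovers}


def pick_winner(a, b):
    """Look up the rule for this seed pairing and apply it."""
    pairing = tuple(sorted((a.seed, b.seed)))
    rule = MATCHUP_RULES.get(pairing, pick_by_seed)
    return rule(a, b)


# Made-up teams and numbers, purely to show the lookup working.
seven = Team("Team A", seed=7, turnovers_per_game=13.5)
ten = Team("Team B", seed=10, turnovers_per_game=10.2)
print(pick_winner(seven, ten).name)
```

The point of the lookup table is that each pairing can get its own rule as the data supports it, rather than one metric being applied across all 32 first-round games.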

For now, our starting point for riding out the rest of the tournament is a tried-and-true method. We are going to start our personal analysis with Seeding and Sagarin. Of course, a little Sangria might help too.

We will follow up that analysis with a chaser of our own spreadsheet analysis that compares teams more thoroughly. That spreadsheet should be available later today for download.

For now and as always, choose joy and continue to enjoy the madness.
