Observed stats are bad, regressed stats are good

That’s what I have been told.

The fact that you are still basing your arguments on observed data means that, even now after at least six years of exhortations from successive generations of people who have tried to work with you and help you, you do not understand what it means to regress your data. I refuse to engage any longer till you learn to do this. —Benjamin Wendorf

Under the threat that Benjamin Wendorf will never engage with me again (how could I live with myself if that happened), I decided that I must finally investigate this regression thing that I have been putting off for so many years.

So, my first step was to go back and actually read what these really smart people have been telling me for years, and I was told that Eric Tulsky was an expert in these matters. I say ‘was’ an expert because he is now working for the Carolina Hurricanes and out of the public sphere. Everyone is talking about him, so he must be good, and if knowing about this regression stuff managed to get him an NHL job, maybe I’ll get an NHL job too. Ok, in addition to being able to engage with Benjamin Wendorf again, that will be my motivation for learning this regression stuff. Maybe I too will get an NHL job.

So off to Google I went to search for those old Tulsky posts where he was telling me about this cool regression stuff that I have ignored for all these years. I came up with this article, which seems like a great place to start as it was highly recommended by Nick Emptage. Nick described the process in the following way:

Essentially, the idea is to use the year-on-year autocorrelation r as a measure of a statistic’s repeatability, and use (1 – r) to regress the value of that measure toward the average.
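In code, that recipe is just a weighted average of the observed value and the league mean. A minimal sketch (the function name and the example numbers are mine, invented for illustration):

```python
def regress_to_mean(observed, league_mean, r):
    """Shrink an observed stat toward the league mean.

    r is the year-on-year autocorrelation: keep r of the
    observed value and pull the other (1 - r) back to the mean.
    """
    return r * observed + (1 - r) * league_mean

# e.g. an 11% on-ice shooting percentage, r = 0.33,
# league mean of 8%:
print(regress_to_mean(11.0, 8.0, 0.33))  # prints roughly 8.99
```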

Ok, so back to Tulsky’s post. Tulsky’s method was to use 3 years of data (2007-08 to 2009-10) vs. the next three years of data (2010-11 to 2012-13). Here is what he found:

It turns out that the correlation between their on-ice shooting percentage over the first three years and the next three years is only 0.33.

Ok, awesome. Now, let’s apply what Nick told us about.

So our simple statistical rules tell us that if all we know is a guy’s on-ice shooting percentage over a ~3000-minute sample, our best guess for what he’d do in the future would be to take that number and pull it back 67% of the way towards the average shooting percentage.

Hey, yeah, this regression stuff is actually sounding kind of simple. Next he showed me a beautiful scatter plot that showed me exactly what was happening.

You’ll notice that the line doesn’t have a slope of one; it’s much flatter than that. If you take a guy with an 11% on-ice shooting percentage in the first three years, the best-fit estimation of his next three years isn’t 11%; it’s (0.3485 * 11 + 5.3525) = 9.2%. Coincidentally enough, this is almost exactly 67% closer to the mean than the starting value was.
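Those two sentences can be checked with a few lines of arithmetic. A sketch using only the slope and intercept quoted above (the variable names are mine):

```python
slope, intercept = 0.3485, 5.3525   # Tulsky's best-fit line

observed = 11.0
predicted = slope * observed + intercept   # about 9.19%

# The line's fixed point (where prediction equals observation)
# is the mean the line shrinks everything toward:
implied_mean = intercept / (1 - slope)     # about 8.22%

# Fraction of the distance to that mean the line pulls back:
pulled_back = (observed - predicted) / (observed - implied_mean)
print(round(predicted, 2), round(pulled_back, 2))
# 9.19 0.65 -- i.e. close to the 67% pull-back from 1 - r
```

The pull-back fraction works out to exactly 1 minus the slope, which is why a flat best-fit line and heavy regression to the mean are the same statement.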

So there is no groundbreaking news here. The laws of arithmetic apply to hockey.

Ok, awesome. I have studied this and I am ready to apply this all on my own. I mean, if it is good enough for Tulsky it is good enough for me.

Ok, so the first thing I need to do is calculate the auto-correlation of a statistic, so I went and grabbed 3 years of player data and then a subsequent 3 years of player data. I chose to use 2010-11 to 2012-13 and 2013-14 to 2015-16 as my data sets and used players with at least 1500 minutes of ice time in each data set.

So, I calculated the auto-correlation and I was shocked. The auto-correlation of CF% is 0.55! Yes, 0.55. That, according to statistical theory, means we ought to be regressing CF% 45% towards the mean. How could hockey analytics not be doing this for all these years?

Let me show you a scatter plot.


You’ll notice that the line doesn’t have a slope of one; it’s flatter than that. If you take a guy with a 55% CF% in the first three years, the best-fit estimation of his next three years isn’t 55%; it’s (0.6535 * 55 + 17.038) = 53%. Coincidentally enough, this is not far off 45% closer to the mean than the starting value was.

Time to start a campaign to regress CF% to the mean. I am a believer. Observed stats are bad, regressed stats are good. Who is with me?

Oh, and I want to revise my critique of Ryan Stimson’s passing data post that started this whole discussion. The problem isn’t that Stimson is using a stat that is heavily teammate influenced (as Ryan so eloquently pointed out) to dismiss claims Andrew Berkshire made about a player’s individual playmaking ability. The problem is that Ryan is using observed data, and this is bad. Until Ryan regresses his data I refuse to engage him further (Wait, what was that I just heard? A roaring cheer from the Hockey Graphs crowd maybe?).

I am sold. Regression is a wonderful thing. CF% should be regressed 45% towards the mean. Do it or else Benjamin Wendorf won’t engage with you again.

Oh, if you think that sounds like a lot, don’t worry, Benjamin Wendorf is willing to wager that it isn’t.

I would wager that if people are regressing towards the mean because of their data, they aren’t “way over regress”-ing.

The data tells me I should regress CF% 45% towards the mean, so I can’t be over-regressing, can I? Trust Wendorf. Trust Tulsky. Observed stats are for losers. Regress your data. Regress CF% towards the mean 45%. It will all work out. Simple statistical theory and the laws of arithmetic do apply to hockey.


Update: I have written a follow-up post to this one. If you read this one, the follow-up post is a must read.

This article has 3 Comments

  1. When an article is this snarky, it’s hard to know if you actually support regressed stats or not. This is my first time reading your work, so maybe I am missing all the context.

  2. While I’m a huge fan of regression analysis and do regress everything (including Fenwick, which I prefer over Corsi because of its slightly better R^2), I found this absolutely hilarious.
