Are we predicting the future or analysing the past?

So, yesterday’s post drew an interesting and, to my delight, mostly positive response. Credit goes to Matt Cane, who hit the nail on the head in figuring out the response I was looking for.

Matt is correct: CF% has quite good in-season correlations and shooting percentage does not. I chose CF% to be my ‘very first foray into regression’ to emphasize what is actually happening when you regress using the method that Tulsky (and others) used. The result is that you over-regress. CF% doesn’t need to be regressed because it doesn’t suffer from sample size issues. Sh% does suffer from sample size issues, so it could be regressed, but you need to know why and to what extent.

The methodology you use to regress data matters a whole lot, depending on what your goal is. Let me present two scenarios:

  1. You play fantasy sports and you want to predict what a player’s statistics will be in future seasons for your keeper league.
  2. You are a general manager of an NHL organization and you want to evaluate a player’s current talent to decide whether to trade for him for your playoff run.

These two scenarios lead to two related, but very different, questions. The first is: what statistics can I expect this guy to put up in the future? The second is: how good is this guy right now? Those questions may sound similar, but they must be answered in different ways.

At the Rochester Institute of Technology Hockey Analytics Conference (RITHAC) held last October, I made a presentation (which I have failed to write a post on) that included this slide showing all of the factors that might impact on-ice results (measurable outcomes).

[Figure: measuringtalent]

So a player’s statistics are the result of a number of factors, including that player’s talent, the talent level of his teammates, the talent level of the players he plays against, how the player is used by the coach, playing style, score effects, injuries, etc., as well as a component for luck and randomness.

A key point here is that player talent and ability will not change much from year to year, or even over several years (except for players at the very beginning or very end of their careers – excluding Jagr of course). From year to year all those other components could change, and in some instances will change a lot. For example, if a player was traded or signed as a free agent with another team, his quality of teammates will change, as might his usage, QoC, coaching tactics, and score effects.

If I am attempting to predict future statistics for my fantasy hockey league, I want to factor into my thinking the chances that a player might change teams or move up in the lineup. If the player in question is currently a winger for a superstar and he changes teams (or the superstar does), his future situation is unlikely to be as good, so his statistics will likely regress down (through no fault of his own – his own talent is fairly persistent). Conversely, if the player is on a terrible team and gets traded, his stats are likely to regress up, which again has nothing to do with his own talent.

I showed this at RITHAC by looking at differences in persistence between players who remained on the same team and those who switched teams. Here is an example slide.

[Figure: CF60_Y1vsY2_same_vs_changed_teams]

If we were 100% confident that a player was going to stay on the same team, we would be inclined to regress less when predicting his future performance than if we believed he was going to switch teams.

Getting back to the topic at hand: when Eric Tulsky figures out how much to regress using a 3-year vs 3-year correlation, he is factoring into the equation a lot of roster and coaching changes, not to mention ageing curves. All those changes will dilute the correlation. We see it easily using CF%. I calculated a 3-year vs 3-year correlation of 0.55, while Matt Cane calculated the in-season correlation to be 0.74 and 0.69 for forwards and defense respectively.
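To make the dilution concrete, here is a minimal sketch. It assumes the (1 - r) method means shrinking an observed value toward the mean by a fraction of 1 - r. The function name, the 55% CF% player and the 50% mean are hypothetical, chosen only for illustration; the three correlations are the ones quoted above.

```python
def regress_one_minus_r(observed, mean, r):
    """Shrink an observed value toward the mean by a fraction of (1 - r).

    Assumption: this is how the (1 - r) method is applied; r is the
    correlation between two samples of the same statistic.
    """
    return mean + r * (observed - mean)


# Correlations quoted in the post.
correlations = {
    "3-year vs 3-year CF%": 0.55,
    "in-season CF%, forwards": 0.74,
    "in-season CF%, defense": 0.69,
}

# Hypothetical player and mean, for illustration only.
observed_cf = 0.55
mean_cf = 0.50

for label, r in correlations.items():
    estimate = regress_one_minus_r(observed_cf, mean_cf, r)
    print(f"{label}: regress {1 - r:.0%} of the way to the mean, "
          f"estimate {estimate:.3f}")
```

The lower the correlation, the further the same observed number gets pulled toward the mean, which is why the choice of correlation matters so much.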

Now, if you are trying to predict the future for your keeper hockey pool, then it is reasonable to factor in the possibility of trades, coaching changes and ageing, so go right ahead and apply Tulsky’s method (use it for CF% too!). If you are a general manager of an NHL team looking for a player to help you on a playoff run, you would potentially be undervaluing (or overvaluing) a player’s current talent level because you over-regressed.

In evaluating past results we don’t want to factor in possible team changes, coaching changes, or ageing because we already have all that information. The great thing about the past is that it has already happened and all those other factors have already been accounted for in the stats.

When we are evaluating the past, ultimately we only want to regress for actual randomness (and possibly other factors we cannot account for, but I’ll leave that out of the equation for now). The challenge is to quantify the extent to which actual randomness influences the data. I attempted to do this for shooting percentage already so you don’t have to. My methodology was to do a split-halves correlation, but split based on even and odd seconds. By doing this we practically eliminate all other factors, as QoT, QoC, coaching, usage, etc. will be the same in the even seconds as in the odd seconds. This allows us to focus purely on randomness due to limited sample sizes, which is the key problem with shooting percentage. As Tangotiger pointed out, the end result is that to properly regress for randomness we should add ~600 shots at league-average shooting percentage to the player’s actual statistics.
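As a rough sketch of what adding ~600 league-average shots looks like in practice: the function name is hypothetical, and the 8% league-average on-ice shooting percentage and the 40-goals-on-400-shots player are placeholder assumptions, not figures from my data.

```python
def padded_shooting_pct(goals, shots, league_pct, padding_shots=600):
    """Regress on-ice shooting percentage by adding league-average shots.

    padding_shots=600 is the figure quoted in the post; league_pct must be
    supplied by the caller.
    """
    padded_goals = goals + padding_shots * league_pct
    padded_shots = shots + padding_shots
    return padded_goals / padded_shots


LEAGUE_PCT = 0.08  # assumption: placeholder league average, not a measured value

# Hypothetical forward: on the ice for 40 goals on 400 shots (10.0% observed).
estimate = padded_shooting_pct(goals=40, shots=400, league_pct=LEAGUE_PCT)
print(f"observed {40 / 400:.1%}, regressed {estimate:.1%}")
```

The more shots a player has actually been on the ice for, the less the 600 added shots move his number.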

Last season there were 117 forwards who were on the ice for >500 shots for and 202 forwards on the ice for >400 shots for. If you are a top-line forward you will probably be on the ice for more than 500 shots, and if you are a top-six forward you will almost certainly be on the ice for >400 shots. All of these numbers are for 5v5 situations only. Now let’s look at how adding 600 league-average shots compares to Tulsky’s regression of 67% (other people thought we should regress 80% towards the mean).

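| # of Years | Shots 2nd Liner | Regression 2nd Liner | Shots 1st Liner | Regression 1st Liner |
| --- | --- | --- | --- | --- |
| 1-year | 400 | 60% | 500 | 55% |
| 2-year | 800 | 43% | 1000 | 38% |
| 3-year | 1200 | 33% | 1500 | 29% |
| 4-year | 1600 | 27% | 2000 | 23% |
| 5-year | 2000 | 23% | 2500 | 19% |
| 6-year | 2400 | 20% | 3000 | 17% |
| 7-year | 2800 | 18% | 3500 | 15% |
| 8-year | 3200 | 16% | 4000 | 13% |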
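The regression percentages in the table follow from the 600-shot rule: the fraction regressed is 600 / (600 + shots). A short sketch that reproduces the rows, assuming 400 shots per season for a second liner and 500 for a first liner as in the table (the function and constant names are my own):

```python
PADDING_SHOTS = 600  # league-average shots added, per the post


def regression_fraction(shots, padding_shots=PADDING_SHOTS):
    """Fraction of the way an observed rate is pulled toward the mean."""
    return padding_shots / (padding_shots + shots)


SECOND_LINER_SHOTS_PER_YEAR = 400
FIRST_LINER_SHOTS_PER_YEAR = 500

for years in range(1, 9):
    second = SECOND_LINER_SHOTS_PER_YEAR * years
    first = FIRST_LINER_SHOTS_PER_YEAR * years
    print(f"{years}-year  "
          f"{second:>5} {regression_fraction(second):>4.0%}  "
          f"{first:>5} {regression_fraction(first):>4.0%}")
```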

Regression is still significant for one or two years, but much less than Tulsky’s 67% regression, which was calculated with a base of 3 seasons of data. With more years, regression is in the 15-30% range, which is a relatively small regression (on par with Matt Cane’s in-season calculation for CF%). With 3 years of shooting percentage data we are really starting to get a good indication of true talent and not randomness. I often like to look at 3-year data for this reason.

Let’s look at how these regressions look for actual players. I’ll look at Ovechkin as he has the most shots over the past 8 seasons, Alex Tanguay because he is exceptional at boosting on-ice shooting percentage, and Ryan Callahan as he has been on the ice for a lot of shots but has a below average on-ice shooting percentage.

[Figure: regressionplots]

Eric Tulsky did eventually realize that his 67% regression method was over-regressing. He solved the problem by choosing not to regress to a league-wide mean shooting percentage, but instead to a mean representing the typical shooting percentage for a player getting an equal amount of ice time. Top players regressed towards 9.7% while bottom-tier players regressed towards ~6.3%, and his regression methodology changed from the (1-r) method discussed yesterday to something a little closer to what I am doing here. The resulting method regressed the data substantially less than his initial estimate of 67%, and almost certainly less than adding 600 league-average shots when dealing with one or two years of data. The reason for this is that he is factoring more information into his analysis. That additional information is the coach’s opinion of how good the player is, based on how much ice time the coach gives the player. One could certainly apply this same theory to my 600-shot rule by using a different mean to regress to.
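Here is a minimal sketch of that last idea: the 600-shot rule applied with an ice-time-based mean instead of a league-wide one. This is not Tulsky’s actual method, just the padding rule with a swapped-in target. The 9.7% and 6.3% tier means come from the paragraph above; the 8% league-wide mean and the example player are placeholder assumptions, and the names are hypothetical.

```python
def regress_to_mean(observed_pct, shots, prior_mean, padding_shots=600):
    """Weighted average of the observed rate and a chosen prior mean.

    Equivalent to adding `padding_shots` shots at `prior_mean`.
    """
    weight = shots / (shots + padding_shots)
    return weight * observed_pct + (1 - weight) * prior_mean


TOP_TIER_MEAN = 0.097     # from the post: mean for top ice-time players
BOTTOM_TIER_MEAN = 0.063  # from the post: mean for bottom-tier players
LEAGUE_MEAN = 0.08        # assumption: placeholder league-wide mean

# Hypothetical top-line forward: 10.5% on-ice shooting over 500 shots.
observed, shots = 0.105, 500

to_league = regress_to_mean(observed, shots, LEAGUE_MEAN)
to_tier = regress_to_mean(observed, shots, TOP_TIER_MEAN)
print(f"regressed to league mean: {to_league:.2%}")
print(f"regressed to top-tier mean: {to_tier:.2%}")
```

The weight on the observed data is identical in both cases; only the target changes, so a high-ice-time player keeps more of his observed shooting percentage.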

Long story short though, these improved regression methods paint a dramatically different picture of the importance of shooting percentage. A shooting percentage range of 6.3% to 9.7% is substantial. It means players with good on-ice shooting percentages would score >50% more goals than players with poor on-ice shooting percentages on an equal number of shots. The range for shot rates over longer time periods is substantially smaller than that. When it comes to goal production, shooting percentage matters as much as or more than shot rate.
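The >50% figure is simple arithmetic on the two ends of that range. A quick check, using a hypothetical 1,000 shots for each player (any equal shot count gives the same ratio):

```python
HIGH_PCT = 0.097  # top of the range quoted in the post
LOW_PCT = 0.063   # bottom of the range quoted in the post

shots = 1000  # hypothetical; the ratio does not depend on this value

goals_high = shots * HIGH_PCT
goals_low = shots * LOW_PCT
extra_goals = goals_high / goals_low - 1

print(f"{goals_high:.0f} vs {goals_low:.0f} goals on {shots} shots, "
      f"{extra_goals:.0%} more")
```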

Before I finish up, I need to mention that we haven’t really isolated individual talent yet. These regression techniques are only giving us a best estimate of on-ice shooting percentage after attempting to factor out luck and randomness. That on-ice shooting percentage is the result of all five players on the ice, not the individual. That is where Rel and RelTM metrics come into play. These metrics attempt to isolate a player’s individual contribution from the quality of his teammates. At RITHAC I showed that these techniques actually do a pretty good job, as shown by these charts comparing players that stayed on the same team to those that switched teams.

[Figure: CF60RelTM_Y1vsY2_same_vs_changed_teams]

In this chart we see that the RelTM stats persist year to year even for players that change teams. This is a definite improvement over the charts I showed earlier. There is probably a lot more work that can be done to isolate individual contribution but at minimum we should be using Rel or RelTM stats.

In summary, my post yesterday was really just a setup post for this one. I believe most early estimates of how much we needed to regress significantly over-regressed. I knew this back then from looking at observed data and seeing how players did in fact maintain on-ice shooting percentages over a number of seasons significantly higher than the regressed estimates. Furthermore, simply looking at long-term on-ice shooting percentages showed that high on-ice shooting percentages were associated with high-end offensive forwards and poor on-ice shooting percentages were associated with 3rd and 4th line defensive players. This structure doesn’t result from randomness. Eventually, it seems, Tulsky realized this too and adjusted his regression technique. Another important takeaway from all of this is that the range of on-ice shooting percentages is quite substantial and needs to be factored into player evaluation.

My final message is that it is great to know how to apply statistical methods, but one must still understand what the methods are telling us and what they are not. It is also important, where possible, to test them with actual observed data to see if they make sense. Observed results still matter. Sorry Bejamin.

 

This article has 2 Comments

  1. Hey David, as a fellow statistician, I just wanted to drop you a quick note and let you know that you absolutely knocked it out of the park with this one. Great job!

  2. Comparing the variance in a regressed stat to the variance in an observed stat is flawed; obviously the regressed stat will vary less, since it’s been regressed.

    A regressed stat doesn’t mean that’s what the player’s skill level is; it means that’s the data’s best guess at the player’s skill level.

    If I shot 11% last season, the data has no way of knowing whether that was due to a few lucky bounces or because I take high quality shots. What the regression tells us is that more often than not a high shooting% is more about variance and luck than about skill, and so taking that into account we regress the stat to get our best guess for what the player’s true shooting% is.
