Anecdotes and validating statistical models

I often get criticized by others in the hockey analytics community, particularly on Twitter, when I throw out a piece of data that runs counter to conventional analytical thinking and raise a question as to whether it means something or not. Sometimes I do it because I believe in what I am saying, but other times I do it because I think we should constantly be challenging and testing current conventional wisdom. That is how we learn new things. We make observations, we ask questions about their relevance, we investigate, and we either toss out the new idea or we revise our previous way of thinking. Stop doing this and you stop learning.

Earlier, it was brought up on Twitter that the Anaheim Ducks’ shooting percentage had dropped considerably this season. I replied to that tweet, pointing out that the Ducks’ Corsi has also increased significantly this season, and wondered if it was a coincidence.

I am generally one of the few in the hockey analytics community who actually believes there is a negative correlation between playing strong puck possession hockey and good shooting percentage hockey. It is difficult to prove to a statistically significant confidence level, but I have seen enough to believe there is such a relationship. I know others vehemently deny this, though, so I figured when I made that tweet it would generate a reaction. Practically within seconds Matt Cane piped in with

Which deserved the reply

Which brought Micah Blake McCurdy into the discussion.

Although I will still argue that data is just an aggregate of anecdotes or single data points, and that the saying is just stats people being defensive, I do agree with the point Matt and Micah are trying to make. Anecdotes don’t make an argument; rigorous statistical analysis does.

Where I think I differ, though, is that I believe anecdotes and single data points ought to be used to test and validate our statistical models.

Why use data to check your statistical models? The answer is simple: to ensure your conclusions are valid, to ensure you are interpreting the results correctly, and to ensure any assumptions you have made aren’t affecting the results negatively. Let me give you an example from hockey analytics.

The hockey analytics community used to think that zone starts had a significant impact on a player’s statistics. Vic Ferrari, now with the Washington Capitals, crunched the numbers and came up with the notion that each additional zone start was worth 0.8 additional shot attempts. Dirk Hoag estimated it to be closer to 1.1. Linear regression models were used to come up with these estimates.

Gabe Desjardins used an estimate like these and claimed that Daniel Sedin got an additional 7-9 points because of his extra offensive zone starts. In the 2011-12 season Sedin had 242 more offensive zone starts than defensive zone starts. At 0.8 shot attempts per additional zone start, that equates to 194 extra shot attempts. At 1.1 shot attempts per additional zone start, that would be 266 extra shot attempts. At a league average shot attempt to goal conversion rate (which Sedin would be well above), that would mean Sedin would be on the ice for 8-11 additional goals. Using these numbers, Desjardins’ estimates seem reasonable. That is what the model said, but is that what actually happened?
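The arithmetic above can be reproduced in a few lines. This is only a sketch of the calculation as described in this post. The names `NET_OZ_STARTS`, `ESTIMATES`, `QUOTED_GOALS` and `extra_shot_attempts` are made up for illustration, and the conversion rate is not an independent league figure: it is backed out from the 8-11 goal range quoted above, on the assumption that the low end pairs with the 0.8 estimate and the high end with the 1.1 estimate.

```python
# Net offensive-zone starts for Daniel Sedin in 2011-12 (figure from the post).
NET_OZ_STARTS = 242

# Extra shot attempts per additional zone start, per the two regression estimates.
ESTIMATES = {"Ferrari": 0.8, "Hoag": 1.1}

# Extra on-ice goals quoted in the post for the low and high ends (assumed pairing).
QUOTED_GOALS = {"Ferrari": 8, "Hoag": 11}


def extra_shot_attempts(net_starts, attempts_per_start):
    """Extra shot attempts the zone start model attributes to a player."""
    return net_starts * attempts_per_start


for name, per_start in ESTIMATES.items():
    attempts = extra_shot_attempts(NET_OZ_STARTS, per_start)
    # Conversion rate implied by the quoted goal totals, not a measured value.
    implied_rate = QUOTED_GOALS[name] / attempts
    print(f"{name}: {attempts:.0f} extra attempts, "
          f"implied conversion rate {implied_rate:.1%}")
```

Both ends of the range imply a conversion rate of roughly 4.1% of shot attempts becoming goals, which is the number the model's prediction quietly depends on.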

(Warning: The following paragraph uses anecdotes which may frighten some statisticians. Reader discretion is advised.)

We can of course check the actual data to see. Looking at all the goals that Sedin was on the ice for following an offensive zone start, there is no way he was on the ice for that many goals after a face off, and he certainly didn’t get that many extra points. In fact, with generous assumptions it is far more likely that the benefit of all those additional offensive zone starts was that he was on the ice for an additional 3-4 goals (and it is probably closer to 2-3). I recall someone looking at Ovechkin as well and finding that he also wasn’t on the ice for nearly enough goals to benefit to the extent that the zone start model predicts. I recommend that any doubters pick some players and count how many goals they were on the ice for immediately after a face off. Ask yourself if that fits the 0.8-1.1 shot attempts per zone start model. I’ll suggest it won’t.

How can this be? Why would the regression models lie to us? Well, the regression used was CF% vs OZone% and there is in fact a strong correlation. The problem is that this is a classic case of correlation not being causation. The players who start in the defensive zone tend to have less offensive talent. They are also generally assigned defensive roles, so they are less inclined to take risks and venture deep into the offensive zone. As a result of both of these, they will, on average, have a lower CF%. The opposite is true for offensive zone players. It isn’t the zone start that is driving the CF%; it is the talent of the players and the roles they are assigned that drive Corsi. The talent of the players is also what drives the coach to assign zone starts, giving offensive zone starts to offensive players and defensive zone starts to players he is asking to play defensive hockey. Talent drives Corsi and talent drives zone starts. Zone starts don’t drive Corsi to a very significant degree.
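The confounding argument above can be illustrated with a small simulation. To be clear, this runs on entirely synthetic data with made-up coefficients, not real NHL numbers. It assumes a hidden "talent" value drives both a player's OZone% and his CF%, while zone starts themselves have zero direct effect on CF%. All names (`simulate`, `ols_slope`, `residuals`) are hypothetical.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(x, y):
    """What is left of y after removing its least-squares fit on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = ols_slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n_players=500, seed=1):
    """Synthetic league: talent drives both zone starts and Corsi."""
    rng = random.Random(seed)
    talent, oz_pct, cf_pct = [], [], []
    for _ in range(n_players):
        t = rng.gauss(0, 1)
        talent.append(t)
        # The coach hands more offensive zone starts to talented players.
        oz_pct.append(50 + 5 * t + rng.gauss(0, 3))
        # CF% depends on talent only; there is NO zone start term here.
        cf_pct.append(50 + 3 * t + rng.gauss(0, 2))
    return talent, oz_pct, cf_pct


talent, oz_pct, cf_pct = simulate()

# Naive regression of CF% on OZone%, as in the original zone start models.
naive_slope = ols_slope(oz_pct, cf_pct)

# Same regression after removing what talent explains from both variables.
adjusted_slope = ols_slope(residuals(talent, oz_pct), residuals(talent, cf_pct))

print(f"naive slope:    {naive_slope:.2f}")
print(f"adjusted slope: {adjusted_slope:.2f}")
```

The naive regression reports a clearly positive slope even though zone starts do nothing in this toy league, and the slope falls to roughly zero once talent is accounted for. In real data talent is not directly observable, which is exactly why checking the model against what actually happened on the ice is useful.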

Of course, I checked the data in another way. I isolated segments of time after each zone face off and compared players’ statistics with and without that segment of time. The conclusion was that the influence of the face off was largely eliminated within the first 10 seconds. When you actually look at the numbers for players with heavy zone start biases, the impact of the zone starts is relatively small. Micah Blake McCurdy recently confirmed this with a very thorough analysis of shift starts.
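For anyone who wants to try the with-and-without comparison described above, here is a minimal sketch of the idea. It is not the code behind the analysis in this post. The `ShotAttempt` record, its fields, and the toy events are all hypothetical, and the 10 second window is simply the figure mentioned above.

```python
from dataclasses import dataclass


@dataclass
class ShotAttempt:
    for_team: bool             # True if taken by the player's team while he is on the ice
    secs_since_faceoff: float  # seconds elapsed since the preceding face off
    faceoff_zone: str          # "O", "D" or "N" from the player's point of view


def cf_pct(events):
    """On-ice shot attempts for, as a percentage of all on-ice attempts."""
    if not events:
        return None
    return 100.0 * sum(e.for_team for e in events) / len(events)


def cf_pct_excluding_window(events, window_secs=10.0):
    """CF% after dropping attempts taken shortly after an O or D zone face off."""
    kept = [
        e for e in events
        if e.faceoff_zone == "N" or e.secs_since_faceoff > window_secs
    ]
    return cf_pct(kept)


# Toy events for one hypothetical player (made-up values, not real data).
events = [
    ShotAttempt(True, 4, "O"),
    ShotAttempt(True, 8, "O"),
    ShotAttempt(False, 6, "D"),
    ShotAttempt(True, 25, "O"),
    ShotAttempt(False, 40, "N"),
    ShotAttempt(True, 3, "N"),
    ShotAttempt(False, 30, "D"),
    ShotAttempt(True, 55, "O"),
]

print(cf_pct(events))                   # all attempts
print(cf_pct_excluding_window(events))  # zone face off windows removed
```

The gap between the two numbers is the part of a player's CF% that can plausibly be credited to where his shifts started. Repeating this with different window lengths shows how quickly the face off effect fades.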

Of all the players in the league in 2014-2015 with 100 minutes played at 5v5, only 5% of them see their on-ice shot percentage move more than one percentage point

To summarize, initial estimates of the impact of zone starts were derived by linear regression analysis and the identified impact was quite significant. There is nothing wrong with linear regression as a statistical methodology; it is the application of it that produced the faulty conclusions. Unfortunately these faulty conclusions stuck with the hockey analytics community for years to come, and I have spent a lot of time and effort fighting the “zone starts are important” claims (for example, here in response to Sam Ventura now of the Penguins, here in response to Tyler Dellow now of the Oilers, as well as here).

Statistical analysis is more than plugging numbers into a statistical model and presto, out come the results. Statistical models must also be tested using actual, even anecdotal, data. It can save a lot of trouble down the road, such as overestimating the impact of zone starts or even over-regressing shooting percentages. It’s one thing if you are playing around with hockey analytics in blogs and on Twitter, but it is another if you are making recommendations to the general manager of an NHL team. That is when you want to be sure you get it right, so it is critical that you always question your work to ensure all your assumptions and conclusions are valid. In the real world of NHL hockey it would be devastating to undervalue Daniel Sedin because of all his offensive zone starts and decide it might be worth trading him for that guy you are overvaluing because of all his defensive zone starts.

1. Rupert says:

“It is difficult to prove to a statistically significant confidence level but I have seen enough to believe there is such relationship.”

And there’s your problem, if the statistics say it’s possibly not significant it’s better to focus on what is known to be significant, and your focus on anecdotes will frequently lead you to poor analysis and flawed conclusions.

Overall I agree with your logic, and I do find it stupid that most statistics-based hockey analysis agrees that score affects shooting quality (you generate more low quality shots when trailing) and yet struggles to accept any other variable affecting shooting quality.

Is there anyone well versed in the hockey analytics community who can make a deductive argument against shot quality outside of p value>0,05 hurr durr?

1. Shingo says:

Hi Rupert, I don’t necessarily think that David is ‘focusing’ too much on anecdotes; to me it seems that he is simply pointing out that ignoring them can be damaging.

I can almost guarantee that every advanced stat that has been investigated to date was done so due to an aggregate of anecdotes developed over time in the statistician’s head, whether they want to accept that fact or not.

2. Chris says:

Definitely late to the party but I find it interesting that you think possession is negatively associated with shooting percentage. Has anyone actually explored the relationship between possession, say via corsi, and number of goals scored or points accrued?

I ran a model like that to try to predict goal scoring in the playoffs (to try to win a hockey pool) and didn’t find an association. So just wondering if my result is a fluke or if possession really isn’t correlated with scoring and if anyone has done this work before.