Apparently hockey analytics hates “binning”, and ‘petbugs’ is the latest to write about it, today over at Hockey Graphs. There are reasons not to use binning, but binning can help overcome noise in observed data. As this is a hockey blog, let’s use shooting percentage as an example.
In hockey, observed shooting percentages contain a lot of randomness because of the small sample sizes we are dealing with. Here is a simple model to show this.
I have 80 players, and I will assign each of them a ‘true talent’ shooting percentage ranging from 6.00 to 9.95 in steps of 0.05. So the 80 players would have ‘true talent’ shooting percentages of 6.00, 6.05, 6.10, 6.15, …, 9.90, 9.95.
Next I’ll conduct coin flip trials to generate ‘observed’ shooting percentages, using for each player a weighted coin whose chance of coming up ‘goal’ equals that player’s ‘true talent’ shooting percentage. For this I am going to model 100 coin flips, or 100 shots, per player to determine how many goals are scored and then calculate the observed shooting percentages.
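A minimal sketch of this model, assuming Python's standard `random` module. The function names and the seed are illustrative choices, not the code used to produce the charts in this post.

```python
import random

def true_talents(n_players=80, start=6.0, step=0.05):
    """True-talent shooting percentages: 6.00, 6.05, ..., 9.95."""
    return [round(start + i * step, 2) for i in range(n_players)]

def observed_pct(talent_pct, shots, rng):
    """Flip a weighted coin once per shot and return goals per 100 shots."""
    goals = sum(rng.random() < talent_pct / 100 for _ in range(shots))
    return 100 * goals / shots

rng = random.Random(42)  # arbitrary seed so the run is repeatable
talents = true_talents()
observed = [observed_pct(t, 100, rng) for t in talents]

# Show the first few players: true talent next to what 100 shots produced.
for t, o in list(zip(talents, observed))[:5]:
    print(f"true {t:.2f}%  observed {o:.2f}%")
```

With only 100 shots, each observed value is a whole number of goals, so a 6.05% shooter can easily show up as 3% or 10%.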
If I now plot the players’ ‘observed’ shooting percentage against their ‘true talent’ shooting percentage I’ll get something like this.
If we saw this chart we would conclude that there is some relationship between the two variables but that it isn’t overly strong. However, we know that there is a relationship between true talent and observed shooting percentage. In fact, if I increased the trials from n=100 toward n=infinity, the R^2 would approach 1.00: the observed shooting percentage and true talent shooting percentage would become identical.
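To put a rough number on that noise, here is a small sketch using the standard binomial formula for the standard error of a proportion, sqrt(p(1-p)/n). The 8% shooter and the shot counts are illustrative assumptions, not values from the charts.

```python
import math

def se_pct(talent_pct, shots):
    """Standard error, in percentage points, of an observed shooting percentage."""
    p = talent_pct / 100
    return 100 * math.sqrt(p * (1 - p) / shots)

for shots in (100, 1000, 10000):
    print(f"{shots:>5} shots: +/- {se_pct(8.0, shots):.2f} points")
# Prints 2.71, 0.86 and 0.27 points respectively.
```

At 100 shots the noise on one player (about 2.7 points) is comparable to the whole 6.00 to 9.95 spread in true talent, which is why the unbinned relationship looks weak. The noise shrinks with the square root of the number of shots.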
The problem is, we can’t play an infinite number of hockey games to overcome the small sample size. However, we could bin our data to improve our ability to observe the relationship. For example, let’s use a bin size of 5, making bin 1 have true talent shooting percentages of 6.00, 6.05, 6.10, 6.15 and 6.20. The second bin would have the next 5 true talent shooting percentages, and so on. Then we calculate the bin average of both the true talent shooting percentage and the observed shooting percentage. When we plot this we would get the following:
Well, now that is a better R^2. We are getting close to what we expect the relationship to be (which I remind you is an R^2 of 1.00).
How about we try a bin size of 10?
Well, now we are getting somewhere. If you saw that you would conclude there is a fairly strong correlation between the two variables, which we know is the case here.
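The binning steps above, for bin sizes of 5 and 10, can be sketched as follows. This is a standard-library-only illustration: the helper names `pearson_r2` and `bin_means` and the seed are assumptions, and the exact R^2 values will differ from the charts because the coin flips are random.

```python
import random

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)

def bin_means(values, bin_size):
    """Average consecutive groups of `bin_size` values."""
    means = []
    for i in range(0, len(values), bin_size):
        chunk = values[i:i + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means

rng = random.Random(2024)  # arbitrary seed so the run is repeatable
shots = 100

# Players are already ordered by true talent, so consecutive
# groups of players form the bins described in the text.
talents = [round(6.0 + 0.05 * i, 2) for i in range(80)]
observed = [
    100 * sum(rng.random() < t / 100 for _ in range(shots)) / shots
    for t in talents
]

for bin_size in (1, 5, 10):  # a bin size of 1 is the unbinned data
    r2 = pearson_r2(bin_means(talents, bin_size),
                    bin_means(observed, bin_size))
    print(f"bin size {bin_size:2d}: R^2 = {r2:.2f}")
```

Averaging the players in a bin cancels some of each player's random noise while leaving the bin's average true talent intact, so the binned R^2 tends to come out higher than the unbinned one, though the exact figures vary from run to run.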
Binning can be used to overcome noise in observations. Of course, in reality we don’t know true talent shooting percentage, but you could bin based on some other variable. In one of my favourite hockey analytics articles, Tom Awad binned players based on ice time to make it easier to study what makes good players better than not-so-good players. I binned players by role to investigate whether roles influence a player’s statistics. Tyler Dellow is using a bin of ‘star players’ as a way to determine which players are getting the toughest quality of competition. In each of these cases bins are used to strengthen our ability to observe relationships within the data.
Hockey analytics has enough limitations in the data we are working with that we don’t need to impose more limitations on ourselves by arbitrarily throwing away useful tools and techniques at our disposal.
Use bins wisely, but don’t hate the bin.