Binning to Overcome Noisy Data

Apparently hockey analytics hates “binning” and ‘petbugs’ is the latest to write about it today over at Hockey Graphs. There are reasons not to use binning, but binning can help overcome noise in observed data. As this is a hockey blog, let’s use shooting percentage as an example.

In hockey, shooting percentage observations show a lot of randomness due to the small sample sizes we are dealing with. Here is a simple model to show this.

I have 80 players, and I will assign each of them a ‘true talent’ shooting percentage ranging from 6.00 to 9.95 in steps of 0.05. So the 80 players would have ‘true talent’ shooting percentages of 6.00, 6.05, 6.10, 6.15, …, 9.90, 9.95.

Next I’ll conduct coin flip trials to generate ‘observed’ shooting percentages, using a coin weighted to match each player’s ‘true talent’ shooting percentage. For this I am going to model 100 coin flips, or 100 shots, per player to determine how many goals are scored and then calculate the observed shooting percentages.
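The setup described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the charts in this post: the function name `simulate_observed`, the constants, and the seed are assumptions made for the example.

```python
import random

random.seed(42)  # arbitrary seed (an assumption) so the run is repeatable

N_PLAYERS = 80
N_SHOTS = 100

# 'True talent' shooting percentages: 6.00, 6.05, ..., 9.95
true_talent = [round(6.0 + 0.05 * i, 2) for i in range(N_PLAYERS)]


def simulate_observed(true_pct, n_shots):
    """Flip a weighted coin n_shots times and return the observed Sh% (0-100)."""
    goals = sum(random.random() < true_pct / 100 for _ in range(n_shots))
    return 100 * goals / n_shots


# One observed shooting percentage per player, in the same order as true_talent
observed = [simulate_observed(p, N_SHOTS) for p in true_talent]

# Show the first few players
for t, o in list(zip(true_talent, observed))[:5]:
    print(f"true {t:.2f}%  observed {o:.2f}%")
```

With only 100 shots each, every observed value is a whole number of goals per 100 shots, so two players with nearly identical true talent can easily end up several percentage points apart.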

If I now plot the players’ ‘observed’ shooting percentage against their ‘true talent’ shooting percentage I’ll get something like this.

If we saw this chart we would conclude that there is some relationship between the two variables but that it isn’t overly strong. However, we do know that there is a relationship between true talent and observed shooting percentage. In fact, if I increased the number of trials from n=100 toward infinity, the r^2 would approach 1.00: the observed shooting percentage and true talent shooting percentage would become identical.
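To illustrate how the correlation tightens as the sample grows, here is a rough sketch that repeats the simulation with 100, 1,000 and 10,000 shots per player and computes r^2 each time. The helper names, the seed, and the particular sample sizes are assumptions chosen for illustration only.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def observed_pct(true_pct, n_shots, rng):
    """Simulate n_shots weighted coin flips and return the observed Sh% (0-100)."""
    goals = sum(rng.random() < true_pct / 100 for _ in range(n_shots))
    return 100 * goals / n_shots


rng = random.Random(1)  # arbitrary seed (an assumption) for repeatability
true_talent = [round(6.0 + 0.05 * i, 2) for i in range(80)]

# r^2 between true talent and observed Sh% for each number of shots per player
r2_by_shots = {}
for n_shots in (100, 1000, 10000):
    observed = [observed_pct(p, n_shots, rng) for p in true_talent]
    r2_by_shots[n_shots] = r_squared(true_talent, observed)
    print(f"shots per player = {n_shots:>5}: r^2 = {r2_by_shots[n_shots]:.3f}")
```

The exact values depend on the seed, but the r^2 should climb toward 1.00 as the number of shots per player increases.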

The problem is, we can’t play an infinite number of hockey games to overcome the sample size. However, we could bin our data to improve our ability to observe the relationship. For example, let’s use a bin size of 5, so that bin 1 has true talent shooting percentages of 6.00, 6.05, 6.10, 6.15 and 6.20. The second bin would have the next 5 true talent shooting percentages, and so on. Then we calculate the bin average of both the true talent shooting percentage and the observed shooting percentage. When we plot this we would get the following:

Well, now that is a better R^2. We are getting close to what we expect the relationship to be (which I remind you is an R^2 of 1.00).

How about we try a bin size of 10?

Well now we are getting somewhere. If you saw that you would conclude there is a fairly strong correlation between the two variables, which we know is the case here.
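The binning procedure described above, for bin sizes of 5 and 10, can be sketched as follows. This is an illustrative sketch only: `bin_means`, `r_squared`, and the seed are assumptions for the example, and because the coin flips are random the r^2 values it prints will not match the charts exactly.

```python
import random


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def bin_means(values, bin_size):
    """Average each consecutive group of bin_size values."""
    means = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means


rng = random.Random(7)  # arbitrary seed (an assumption) for repeatability

# Players are listed in order of true talent, so consecutive players share a bin
true_talent = [round(6.0 + 0.05 * i, 2) for i in range(80)]
observed = [
    100 * sum(rng.random() < p / 100 for _ in range(100)) / 100
    for p in true_talent
]

# Bin size 1 is the unbinned, player-level comparison
r2_by_bin_size = {}
for bin_size in (1, 5, 10):
    binned_true = bin_means(true_talent, bin_size)
    binned_obs = bin_means(observed, bin_size)
    r2_by_bin_size[bin_size] = r_squared(binned_true, binned_obs)
    print(f"bin size {bin_size:>2}: {len(binned_true):>2} points, "
          f"r^2 = {r2_by_bin_size[bin_size]:.3f}")
```

Note that the r^2 for a bin size of 5 is computed on 16 bin averages and the r^2 for a bin size of 10 on 8 bin averages, rather than on the 80 individual players.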

Binning can be used to overcome noise in observations. Of course, in reality we don’t know true talent shooting percentage, but you could bin based on some other variable. In one of my favourite hockey analytics articles, Tom Awad binned players based on ice time to make it easier to study what makes good players better than not so good players. I binned players by role to investigate if roles influence a player’s statistics. Tyler Dellow is using a bin of ‘star players’ as a way to determine which players are getting the toughest quality of competition. In each of these cases bins are used to strengthen our ability to observe relationships within the data.

Hockey analytics has enough limitations in the data we are working with that we don’t need to impose more limitations on ourselves by arbitrarily throwing away useful tools and techniques at our disposal.

Use bins wisely, but don’t hate the bin.


This article has 8 Comments

  1. Let’s say I have a player who played one season and had an Sh% of x (let’s say he played a league average amount of time). How much should we regress his numbers (towards either league average or a prior; this isn’t the point of this example)? I only ask this because you do not make this clear in your post.

      1. That’s exactly my point. What I’m trying to say is that it’s very nice that you can bin players together and achieve some absurdly high correlation, but what’s the point? What does it achieve? What do I now do with this newfound information and this value of r^2=.8364? You say binning strengthens “our ability to observe relationships within the data” (which is something which is actually pretty obvious if the variables already have a relationship to begin with). Ok, but what real conclusions can I draw from this binned data?

        To go back to my first comment, the binned data here won’t have any effect on how much I would regress a player. I would never use binned data for that. And that’s really all I care to do. I have a player’s Sh% and I want to know how much to regress. Binning doesn’t help me with that.

        Note: I’d also really appreciate it if you could give me a real answer and not some snide remark. I’m honestly not sure what the point of this post is.

        1. Here are what bins are good for, as explained in the article.
          “Binning can be used to overcome noise in observations.”
          “…bins are used to strengthen our ability to observe relationships within the data.”

          Using binning to regress shooting percentage is beyond the scope of the article; however, the idea of “bins” can (and should) be used in regression.

          In the past people have looked at Y1 vs Y2 or even vs odd charts of player On-ice Sh% or On-ice Sv%, observed poor correlations, and suggested we should regress shooting percentage 80% (Desjardins) or 67% (Tulsky).

          For years I argued this is over regression and frequently referenced Tom Awad’s article to show that talented players (those that get a lot of ice time) actually do maintain elevated shooting percentages.

          Eventually Tulsky took this into account and determined an expected on-ice shooting percentage for a player given his ice time, and then regressed towards that instead of the league-wide mean on-ice shooting percentage. The amount you would regress a player’s shooting percentage was reduced dramatically. Tulsky didn’t explicitly use bins; however, it appears he used rolling averages of Sh% data sorted by player ice time. Rolling averages are essentially bins but on a continuous scale. People use rolling averages to smooth out data all the time. Bins are just discrete versions of rolling averages.

          Ultimately, binning primarily allows us to overcome noise in data making it easier to observe and identify relationships in the data. We can then utilize that information to improve on our data models, including regression.

          I have written about this previously in my failures of predictive analytics article.

          Hope this helps.

          1. “For years I argued this is over regression and frequently referenced Tom Awad’s article to show that talented players (those that get a lot of ice time) actually do maintain elevated shooting percentages.”

            David, there’s a difference between “over regression” and using a better prior. Tulsky (and Awad) showed that players with more TOI tend to have a better Sh%, not that they regress less (they would technically regress less because they are likely to have taken more shots, but a better regression method would take care of this). So if two players each had 100 shots but played a different number of minutes a game, they would be regressed the same amount (if the regression depends on shots) but to different means.

            Also, the claim that “Bins are just discrete versions of rolling averages” isn’t true either, and that’s fairly obvious. Splitting players into four tiers like Awad did and making the measure continuous are not the same, as the former ignores the variance inside each bucket (those at the beginning of each bucket are different from those at the end).

          2. You are playing semantics. Over-regressed, over-adjusted. You know what I mean. The first regression method would regress a high ice time 10% shooter to ~8.4% while the second method would regress to ~9.8%. That’s a big difference, and it caused a lot of unnecessary debates, as one could easily look at the data and realize Sidney Crosby shouldn’t have his shooting percentage regressed to 8.4%.

  2. It’s not letting me reply to your last comment… Nevertheless, I really don’t think it’s semantics. The term regressed has a technical definition which should be used consistently. When you say “over-regressed”, do you mean that literally, or just that a better prior should be used? They are two different things. I’d like to also note, most good models do use priors (e.g. DTM’s war model).

    For what it’s worth, you still haven’t given me an answer as to why binning is better beyond that it essentially just displays relationships better (any r^2 from a binned correlation is worthless as it ignores differences in that bucket).

    1. As I said in my original post, binning is useful because it allows you to recognize relationships and the strength of the relationship better. It’s less valuable in an actual data model; however, understanding relationships better should lead to development of better models, regression models being a perfect example. Awad’s binning analysis showed Sh% is a more significant talent than year over year or split half analysis did.
