Machine Learning, NBA

Back to 50+ PT. 1 of n

Ever since I did my deep dive on scoring 50+ (read about it here), I have come back to the subject over and over in my mind. Part of my fascination is that the game has changed so much, and scoring 50+ is now more attainable than ever thanks to the three-point line and free throws. (Although the NBA has said there will be an emphasis on calling real shooting fouls this year. I for one will be watching closely.) So, here we go around again… below is some more commentary and detail on scoring 50+ points. Several posts are forthcoming, each with a different machine learning model. In this post we explore linear regression.

Correlation

First, and yes, to belabor the point, scoring 50+ points is highly correlated with three-pointers and free throws. To illustrate this, I counted the number of 50+ point games per season (from 1979 to 2020), then found the correlation between that count and the league averages per game on 20 recorded stats such as FT, FTA, TOV, BLK, eFG% and ORtg (see basketball reference). These are good placeholders for the general sense of the league's change over time, and the same stats will be used later for our linear model. Here are the top five correlations; note that I am treating anything over .4 as strongly correlated (a rule of thumb, and conventions vary).

  1. eFG% – .62
  2. ORtg – .56
  3. 3P – .55
  4. 3PA – .55
  5. FT% – .50
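For anyone who wants to reproduce this kind of number, here is a minimal sketch of a Pearson correlation in plain Python. It is not the code behind the figures above: the function name `pearson_r` is my own, and the two short lists are made-up placeholder values rather than real season data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalized because the 1/n terms cancel
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Placeholder values for illustration only (NOT real NBA data):
# one entry per season in each list.
fifty_point_games = [3, 5, 4, 8, 12, 14]
league_efg_pct = [0.478, 0.482, 0.490, 0.501, 0.514, 0.521]

r = pearson_r(fifty_point_games, league_efg_pct)
print(f"correlation between 50+ games and eFG%: {r:.2f}")
```

Running the same function once per league stat and sorting the results would give a ranking like the one above.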

Looking at the above figures, it makes sense that if players cannot shoot the ball accurately, we aren't going to see a lot of players hitting 50 points. The same goes for offensive rating: if a team lacks a proficient offense, then again there will be fewer high marks. Admittedly, this is a simplified view, but it shows the macro trend. The below chart shows the total number of 50+ games with the eFG% overlaid. Note that the scaling on the eFG% starts at 45%; no season has ever been below that mark, and it would be much harder to see a change if the axis started at zero.

Players are getting better at putting the ball into the net

Multiple Linear Regression Model

Going from macro to micro now, let's talk briefly about multiple linear regression, which is one of the most basic machine learning models. Essentially, it is an equation that describes the relationship (slope or line) between multiple predictors and the outcome. It searches for the line with the least amount of error; formally, it minimizes the sum of squared differences between the predicted and actual values. I like to think of the model as showing the slope of a line towards the outcome or answer that is being searched for. In this model we are predicting points scored and asking how a player arrives at that outcome. One may naturally be inclined to ask, "How does the number of field goals made, in addition to other in-game stats like minutes played, affect the relationship?" Well, that is what the linear model is trying to discover. It looks at the data and spits back the line that fits "best". The model results show the magnitude and direction of each variable's effect. For instance, if a player has more personal fouls, it will affect their playing time, thereby affecting their chance for more points.
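To make "the line that fits best" concrete, here is a minimal sketch of ordinary least squares in plain Python, solving the normal equations by elimination. Everything in it is an assumption for illustration: the function name `fit_linear_regression` is my own, and the handful of game logs are made up, with points constructed from the box-score arithmetic so the expected answer is known. In practice a library routine would normally do this job.

```python
def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, coef_2, ...].
    """
    if len(rows) != len(targets):
        raise ValueError("rows and targets must have the same length")
    # Prepend 1.0 to every row so the first coefficient is the intercept
    design = [[1.0] + [float(v) for v in row] for row in rows]
    n = len(design[0])
    # Build the augmented matrix [X^T X | X^T y]
    aug = []
    for i in range(n):
        line = [sum(d[i] * d[j] for d in design) for j in range(n)]
        line.append(sum(d[i] * t for d, t in zip(design, targets)))
        aug.append(line)
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

# Made-up game logs (NOT real data): (FGM, FG3M, FTM) per player-game,
# with points built as 2*FGM + FG3M + FTM.
rows = [(10, 2, 5), (7, 0, 3), (12, 4, 8), (3, 1, 0), (9, 3, 2), (5, 2, 6), (0, 0, 2)]
points = [2 * fgm + fg3m + ftm for fgm, fg3m, ftm in rows]

intercept, b_fgm, b_fg3m, b_ftm = fit_linear_regression(rows, points)
print(f"intercept={intercept:.3f} FGM={b_fgm:.3f} FG3M={b_fg3m:.3f} FTM={b_ftm:.3f}")
```

Each fitted coefficient is the slope for one predictor with the others held constant, which is the "magnitude and direction" described above.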

For this model I found the last 17 years of game detail (thanks to someone on GitHub), with over half a million observations. This dataset has all the in-game details for a player and works really nicely for the linear model in theory. Note that this model takes into account every player for every game. I did omit the entries where a player didn't record a stat. Out of the 20 predictors, just six were found to be significant (reach back into your memory banks from your 101 stats class and recall confidence intervals of 95%; that's the threshold here).

Below is the equation:

PTS scored = -1.298E-11 + (Mins played x 2.726E-13) + (FGM x 2) + (FGA x 1.71E-14) + 
(FG% x 2.69E-13) + (FG3M x 1) + (FTM x 1)
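As a quick sanity check, the equation can be evaluated directly. The sketch below copies the coefficients from the equation above; the dictionary keys, the function name `predict_points` and the sample game log are hypothetical, and I am assuming FG% is expressed as a fraction (its coefficient is so small that the scale makes no practical difference).

```python
# Coefficients copied from the equation above
INTERCEPT = -1.298e-11
COEFFICIENTS = {
    "MIN": 2.726e-13,     # minutes played
    "FGM": 2.0,           # field goals made (includes three-pointers)
    "FGA": 1.71e-14,      # field goals attempted
    "FG_PCT": 2.69e-13,   # field goal percentage (assumed to be a fraction)
    "FG3M": 1.0,          # three-pointers made
    "FTM": 1.0,           # free throws made
}

def predict_points(game_log):
    """Apply the linear equation to one player's game log."""
    return INTERCEPT + sum(coef * game_log[name] for name, coef in COEFFICIENTS.items())

# Hypothetical game log, made up for illustration
game_log = {"MIN": 38, "FGM": 17, "FGA": 29, "FG_PCT": 17 / 29, "FG3M": 6, "FTM": 10}

predicted = predict_points(game_log)
box_score = 2 * game_log["FGM"] + game_log["FG3M"] + game_log["FTM"]
print(f"model: {predicted:.6f}  box score: {box_score}")
```

The two numbers differ only by floating-point dust, because the terms for minutes, FGA and FG% carry coefficients that are effectively zero.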

Due to the number of variables, I am not able to show a graphical representation of the full linear model (it would have to be a high-dimensional visual rather than 2D). However, the below charts plot each predictor used (those found to be statistically significant in the model) against our outcome (PTS). Again, we are trying to find the best-fitting line, or relationship, between the predictors and the outcome.

This shows the linear relationship for the specified predictor, while others are held constant

Looking at the model, it is pretty obvious that the results will be right in line, as we are essentially summing the actual points up: the coefficients on FGM, FG3M and FTM (2, 1 and 1) simply add up the points, while the remaining coefficients are effectively zero. But hey, we have our first model. Below is a tool to use this model. Simply input a player's game log to see how close the model is to predicting the actual score.
