1st, 2nd and 3rd All-NBA Teams and 1st and 2nd All-NBA Rookie Teams prediction - project for the course "Selected topics of machine learning"
- 2023/2024 NBA Awards prediction
- Table of Contents
- Requirements
- 1. Getting the data
- 2. Data preprocessing
- 2.1. Seasons and types of matches
- 2.2. Player statistics
- 2.3. Awards
- 2.4. Average statistics and normalization
- 2.4.1. Eliminating players with low statistics for All-NBA teams prediction
- 2.4.2. Eliminating players with low statistics for Rookie All-NBA teams prediction
- 2.4.3. Statistics correlation for All-NBA teams prediction
- 2.4.4. Statistics correlation for All-NBA Rookie teams prediction
- 3. Splitting the data for training and validation sets
- 4. Metric
- 5. Models
- 5.1. All-NBA teams prediction
- 5.1.1. Baseline model (score: 148.25)
- 5.1.2. Random Forest Classifier with only per game statistics (score: 141.50)
- 5.1.3. Random Forest Classifier with prediction voting (score: 154.50)
- 5.1.4. Comparison of different default models - Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, K-Nearest Neighbors, XGBOOST, LightGBM, Voting Classifier (score: 158.75)
- 5.1.5. Hyperparameter tuning and feature selection (score 175.75)
- 5.1.6. How predictions for validation set could be improved
- 5.2. Rookie All-NBA teams prediction
- 6. Predictions for 2023/2024 season
- 7. Summary
- 8. Possible improvements
To run the prediction for the 2023/2024 season, run the main.py script with the path to a file where the data (as JSON) should be saved. Example:
python main.py ~/Documents/predictions.json
The project was written in Python 3.11. The required packages are listed in the requirements.txt file. To install them, run:
pip install -r requirements.txt
The data was downloaded from nba.com/stats using the nba_api library. It covers all NBA seasons (1946-47 - 2023-24) and contains:
- player statistics in each game (downloaded by this script) - because the file with statistics from all matches is too big to be uploaded to the repository (around 250MB), it is available in this Kaggle dataset,
- team statistics in each game (calculated based on the data from the previous point in this script),
- player statistics in each season (calculated based on the data from the player statistics in this script),
- player awards (downloaded by this script),
- information about rookie seasons of the players (downloaded by this script),
- dates of beginning and end of each season (regular season, playoffs and finals) - based on Wikipedia data.
All data was saved in the data directory in CSV format.
Note
I found some mistakes in the data for older seasons - for example, some players appeared in the box score of a match even though they didn't play in that game (they weren't on the roster of either team). This was usually caused by two players sharing the same last name, which duplicated the data, so the aggregated final statistics might differ from the real ones.
The data had to be preprocessed because the NBA website only provides seasonal statistics for seasons 1996-97 - 2023-24. Because of that, the data was downloaded for each game in the history of the NBA and then aggregated to get the seasonal statistics (if a specific statistic was recorded at the time - link to list).
Because the All-NBA teams are selected after the regular season, the data was divided into the following types of matches:
- Regular Season,
- All-Star Game,
- Play-in Tournament,
- Playoffs,
- Finals,
- In-Season Tournament Final (other games of the In-Season Tournament are officially considered as Regular Season games).
The NBA_Seasons_Dates.csv file contains the start and end dates of the regular season, playoffs and finals for each season. That information was used to tag each game's statistics with the match type and the season.
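A minimal sketch of how that tagging could look with pandas (the file and column names below are illustrative, not necessarily the ones used in the repository):

```python
import pandas as pd

# Illustrative column names; the real CSVs may differ.
games = pd.read_csv("data/player_game_stats.csv", parse_dates=["GAME_DATE"])
seasons = pd.read_csv(
    "data/NBA_Seasons_Dates.csv",
    parse_dates=["REGULAR_SEASON_START", "REGULAR_SEASON_END", "PLAYOFFS_START", "FINALS_END"],
)

def classify_game(date, season_row):
    """Return the match type for a game date within one season's date ranges."""
    if season_row["REGULAR_SEASON_START"] <= date <= season_row["REGULAR_SEASON_END"]:
        return "Regular Season"
    if season_row["PLAYOFFS_START"] <= date <= season_row["FINALS_END"]:
        return "Playoffs"
    return "Other"  # e.g. All-Star Game or Play-in Tournament, handled separately

def tag_game(date):
    # Find the season whose date range contains the game and label the game.
    for _, row in seasons.iterrows():
        if row["REGULAR_SEASON_START"] <= date <= row["FINALS_END"]:
            return pd.Series({"SEASON": row["SEASON"], "MATCH_TYPE": classify_game(date, row)})
    return pd.Series({"SEASON": None, "MATCH_TYPE": None})

games[["SEASON", "MATCH_TYPE"]] = games["GAME_DATE"].apply(tag_game)
```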
Apart from the statistics available on the NBA website, the following statistics were calculated:
- Fantasy Points - based on the formula: `FP = PTS + 1.2 * REB + 1.5 * AST + 3 * STL + 3 * BLK - TO`,
- Player Impact Estimate - based on the formula: `PIE = (PTS + FGM + FTM - FGA - FTA + DREB + 0.5 * OREB + AST + STL + 0.5 * BLK - PF - TO) / (GmPTS + GmFGM + GmFTM - GmFGA - GmFTA + GmDREB + 0.5 * GmOREB + GmAST + GmSTL + 0.5 * GmBLK - GmPF - GmTO)`,
- number of statistics in double digits - if the number was >= 2, the player recorded a double-double (DD), and if the number was >= 3, a triple-double (TD),
- field goals made (and 3PT shots made) counted only if the number of attempts was available - in older seasons not all statistics were saved, which could cause FG% to be over 100%,
- information about win/loss in the match.
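A minimal sketch of the first two derived statistics, assuming a pandas DataFrame of box-score rows with columns named as in the formulas above (the `Gm*` columns stand for the per-game totals of both teams and are illustrative):

```python
import pandas as pd

def add_fantasy_points(df: pd.DataFrame) -> pd.DataFrame:
    # FP = PTS + 1.2 * REB + 1.5 * AST + 3 * STL + 3 * BLK - TO
    df["FP"] = (df["PTS"] + 1.2 * df["REB"] + 1.5 * df["AST"]
                + 3 * df["STL"] + 3 * df["BLK"] - df["TO"])
    return df

def add_pie(df: pd.DataFrame) -> pd.DataFrame:
    # Player's contribution divided by the same expression summed over the whole game.
    player = (df["PTS"] + df["FGM"] + df["FTM"] - df["FGA"] - df["FTA"]
              + df["DREB"] + 0.5 * df["OREB"] + df["AST"] + df["STL"]
              + 0.5 * df["BLK"] - df["PF"] - df["TO"])
    game = (df["GmPTS"] + df["GmFGM"] + df["GmFTM"] - df["GmFGA"] - df["GmFTA"]
            + df["GmDREB"] + 0.5 * df["GmOREB"] + df["GmAST"] + df["GmSTL"]
            + 0.5 * df["GmBLK"] - df["GmPF"] - df["GmTO"])
    df["PIE"] = player / game
    return df
```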
After that, the data was summed up to get the seasonal statistics for each player.
The data about awards was downloaded for each player and information about the following awards was added to the dataset:
- Most Valuable Player,
- Rookie of the Year,
- Defensive Player of the Year,
- Most Improved Player,
- 6th Man of the Year,
- All-NBA teams (1st, 2nd, 3rd),
- All-Defensive teams (1st, 2nd),
- All-Rookie teams (1st, 2nd),
- All-Star Game player,
- All-Star Game MVP,
- Finals MVP,
- number of Player of the Week awards,
- number of Player of the Month awards,
- number of Rookie of the Month awards.
The relationship between the awards and selection to the All-NBA teams since the 1988-89 season was checked, and the data is shown in the table below (for POTW and POTM, a player is counted if they won at least one such award during the season):
Award | 1st All-NBA Team | 2nd All-NBA Team | 3rd All-NBA Team | Not selected |
---|---|---|---|---|
MVP | 35 | 0 | 0 | 0 |
DPOY | 11 | 6 | 8 | 10 |
ROY | 1 | 0 | 1 | 35 |
6MOY | 0 | 0 | 1 | 24 |
MIP | 0 | 5 | 4 | 26 |
All-Star Game Player | 163 | 156 | 142 | 356 |
All-Star Game MVP | 26 | 7 | 1 | 2 |
POTW | 151 | 125 | 105 | 423 |
POTM | 109 | 56 | 31 | 52 |
The data shows that MVPs are always selected to the 1st All-NBA Team, and All-Star Game MVPs are usually selected to the 1st or 2nd All-NBA Team. DPOYs, All-Star Game players, POTWs and POTMs also have a high chance of being selected to an All-NBA team.
The relationship between the awards and selection to the All-NBA Rookie teams since the 1988-89 season was checked, and the data is shown in the table below (for ROTM, a player is counted if they won at least one such award during the season):
Award | 1st All-NBA Rookie Team | 2nd All-NBA Rookie Team | Not selected |
---|---|---|---|
MVP | 0 | 0 | 0 |
DPOY | 0 | 0 | 0 |
ROY | 37 | 0 | 0 |
6MOY | 1 | 0 | 0 |
MIP | 0 | 0 | 0 |
All-Star Game Player | 7 | 0 | 0 |
All-Star Game MVP | 0 | 0 | 0 |
POTW | 24 | 0 | 0 |
POTM | 0 | 0 | 0 |
ROTM | 111 | 28 | 23 |
The table shows that most of these awards weren't won by players who were selected to any of the All-NBA Rookie Teams. However, all Rookie of the Year winners were selected to the 1st All-NBA Rookie Team, and winning Rookie of the Month also gives a player a high chance of being selected to one of the All-NBA Rookie Teams. Unfortunately, the data doesn't include the Rising Stars games played during All-Star Weekend, which could also have an impact.
The statistics were averaged for each player to get his average impact on the game per match (this way the number of games a player played doesn't matter).
Also, because basketball and the players have evolved over the years, the statistics were normalized per season so that the player with the highest value of a given statistic in a specific season gets 1 and the rest of the players get proportionally lower values.
However, this could cause problems with players who played just a few games during a season and had very high statistics in those games.
To eliminate the issue, after displaying the data for all players who were selected to All-NBA teams (graph below), the following filters were applied:
- Games Played >= 40,
- Minutes played during the season >= 1250,
- Points scored during the season >= 333,
- Fantasy Points scored during the season >= 1250.
By doing so, the data for seasons 1988-89 till 2023-24 was reduced from 16711 to 6074 players. For the 2023-24 season there was an additional requirement of Games Played >= 65, which left only 146 players eligible for the All-NBA teams.
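A minimal sketch of the per-season normalization and the eligibility filters described above (column names are illustrative):

```python
import pandas as pd

# Per-game statistics to normalize per season; names are illustrative.
STATS_TO_NORMALIZE = ["PTS_per_GP", "REB_per_GP", "AST_per_GP", "FP_per_GP", "PIE_per_GP"]

def normalize_per_season(df: pd.DataFrame) -> pd.DataFrame:
    # The best player in a given season gets 1.0; everyone else a proportional value.
    for col in STATS_TO_NORMALIZE:
        season_max = df.groupby("SEASON")[col].transform("max")
        df[col] = df[col] / season_max
    return df

def filter_eligible(df: pd.DataFrame) -> pd.DataFrame:
    # Season-total thresholds used to drop players with very few games/minutes.
    mask = (
        (df["GP"] >= 40)
        & (df["MIN"] >= 1250)
        & (df["PTS"] >= 333)
        & (df["FP"] >= 1250)
    )
    return df[mask]
```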
First, all non-rookie players were removed from the dataset. After that, the same statistics were chosen and displayed as for the All-NBA teams. The following filters were applied for rookies:
- Games Played >= 24,
- Minutes played during the season >= 650,
- Points scored during the season >= 250,
- Fantasy Points scored during the season >= 500.
The filters reduced the data from 2801 players to just 917. Only 25 players were eligible for the All-NBA Rookie Teams in the 2023-24 season.
After normalizing the data, the correlation between the normalized statistics and the selection to All-NBA teams was checked. The correlation matrix is shown below:
Based on the correlation matrix, the statistics with the highest importance for selection to All-NBA teams are:
- Player Impact Estimate,
- Fantasy Points,
- Points,
- Free Throws Made,
- Field Goals Made.
The high correlation between these statistics and being selected to All-NBA teams is understandable, as they (apart from Free Throws Made) directly show impact on the game. Free Throws Made may be correlated because good players usually play more and create more actions, so the possibility of being fouled is higher.
The least correlated statistics are:
- Free Throw Percentage,
- 3PT Field Goal Percentage,
- 3PT Field Goals Made.
The low correlation of the 3PT shooting statistics is probably caused by the fact that centers and power forwards usually don't shoot many 3PT shots. In the past those kinds of players also weren't good free throw shooters, which explains the low correlation with Free Throw Percentage.
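One way such a correlation check can be done with pandas, assuming the selection is encoded numerically (the `ALL_NBA_TEAM` column name below is hypothetical):

```python
import pandas as pd

def selection_correlations(df: pd.DataFrame, stat_columns: list[str]) -> pd.Series:
    # 1 if the player was selected to any All-NBA team, 0 otherwise.
    target = (df["ALL_NBA_TEAM"] > 0).astype(int)
    # Correlation of each normalized statistic with the selection target.
    corr = df[stat_columns].corrwith(target)
    return corr.sort_values(ascending=False)
```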
After normalizing the data for Rookie players, the correlation between the normalized statistics and the selection to All-NBA Rookie teams was checked. The correlation matrix is shown below:
The statistics most correlated with being selected to All-NBA Rookie teams are:
- Field Goals Made,
- Fantasy Points,
- Points,
- Player Impact Estimate,
- Minutes.
Most of the statistics are the same as for the All-NBA teams prediction. The fact that Minutes is highly correlated with being selected to All-NBA Rookie teams may be because most rookies aren't starters and don't play as much as experienced players (so only really good rookies get a lot of playing time).
The least correlated statistics are:
- 3PT Field Goal Percentage,
- Free Throw Percentage,
- Wins,
- Field Goal Percentage,
- Triple Doubles.
The low correlation between Wins and being selected to All-NBA Rookie teams is understandable because the best rookies usually play in teams from the bottom of the table (the worst teams get the first picks in the draft). The low correlation between Triple Doubles and being selected to All-NBA Rookie teams is probably caused by the fact that achieving a triple-double is difficult even for experienced players, and rookies usually spend less time on the court, so it's even harder for them (most of them don't record even one).
The data was split into training and validation sets so that each season is fully contained in either the training or the validation set. 4 validation seasons were randomly selected and the score on the validation set was calculated as the mean value of the metric over those 4 seasons.
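A minimal sketch of such a season-level split (the `SEASON` column name is an assumption):

```python
import random
import pandas as pd

def split_by_season(df: pd.DataFrame, n_val_seasons: int = 4, seed: int = 42):
    # Whole seasons go to either training or validation, never both.
    seasons = sorted(df["SEASON"].unique())
    rng = random.Random(seed)
    val_seasons = set(rng.sample(seasons, n_val_seasons))
    train = df[~df["SEASON"].isin(val_seasons)]
    val = df[df["SEASON"].isin(val_seasons)]
    return train, val
```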
The following metric was used to evaluate the model (proposed by the course lecturer):
- +10 points for each player placed in the correct team,
- +8 points for each player placed in a team whose number differs by 1 from the correct one,
- +6 points for each player placed in a team whose number differs by 2 from the correct one,
- +5 points if 2 players of a team are predicted correctly,
- +10 points if 3 players of a team are predicted correctly,
- +20 points if 4 players of a team are predicted correctly,
- +40 points if 5 players of a team are predicted correctly.
That means that the maximum number of points for a season is 3 * (5 * 10 + 40) = 270 for the All-NBA teams and 2 * (5 * 10 + 40) = 180 for the All-NBA Rookie teams.
Using metrics like accuracy would be misleading because the number of players not selected to any of the All-NBA teams is much higher than the number of those who got selected. For example, classifying each of the 146 players eligible for the All-NBA teams in the 2023-24 season as not selected would already give an accuracy of 0.89.
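A sketch of the metric as a Python function, following the rules listed above (the input format is an assumption):

```python
# `predicted` and `actual` map team number (1, 2, 3) to the set of player names
# selected for that team.
TEAM_BONUS = {2: 5, 3: 10, 4: 20, 5: 40}
PLAYER_POINTS = {0: 10, 1: 8, 2: 6}  # keyed by |predicted team - actual team|

def season_score(predicted: dict[int, set[str]], actual: dict[int, set[str]]) -> int:
    score = 0
    for team_no, players in predicted.items():
        correct_in_team = 0
        for player in players:
            for actual_team, actual_players in actual.items():
                if player in actual_players:
                    diff = abs(team_no - actual_team)
                    score += PLAYER_POINTS.get(diff, 0)
                    if diff == 0:
                        correct_in_team += 1
                    break
        # Bonus for getting several players of the same team exactly right.
        score += TEAM_BONUS.get(correct_in_team, 0)
    return score
```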
Below are some of the models that were used to predict the players selected to All-NBA teams.
The baseline model was a Random Forest Classifier with n_estimators = 100 that predicted the probability of a player being selected to each of the All-NBA teams. The mean score on the validation set was 148.25 out of 270 points. The feature importance for the model is shown below:
The baseline model got a high score so it's a good starting point, but also makes it harder to find early improvements.
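A minimal sketch of such a baseline, assuming feature matrices prepared from the split described earlier:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def baseline_probabilities(X_train: np.ndarray, y_train: np.ndarray,
                           X_val: np.ndarray) -> np.ndarray:
    # Default Random Forest predicting class probabilities
    # (not selected, 1st, 2nd or 3rd All-NBA team) for each player.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    return model.predict_proba(X_val)  # shape: (n_players, n_classes)
```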
After removing the statistics that weren't calculated as mean per game, the score of the model decreased to 141.5.
The model predicted the probability of a player being selected to each of the All-NBA teams, and then the predictions were used to calculate voting points from the formula:

The formula is based on the one used to calculate the results of the real All-NBA Team voting. After calculating the points, the top players were added to each team. The score of the model was 154.5.
Only the mean per game statistics were used as the score was higher than for the model with all statistics.
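A sketch of how the voting step could work; the exact formula from the project isn't reproduced here, so the 5/3/1 weights below (the ballot weights of the real All-NBA voting) are only an illustrative assumption:

```python
import pandas as pd

# Assumed ballot weights; the project's actual formula may differ.
WEIGHTS = {"1st": 5, "2nd": 3, "3rd": 1}

def assign_teams(proba: pd.DataFrame) -> dict[str, list[str]]:
    """`proba` is indexed by player name and has columns '1st', '2nd', '3rd'
    holding the predicted probabilities of being selected to each team."""
    # Weighted sum of probabilities gives each player's voting points.
    points = sum(proba[team] * weight for team, weight in WEIGHTS.items())
    ranking = points.sort_values(ascending=False).index.tolist()
    # Top players fill the teams in order.
    return {"1st": ranking[:5], "2nd": ranking[5:10], "3rd": ranking[10:15]}
```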
5.1.4. Comparison of different default models - Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, K-Nearest Neighbors, XGBOOST, LightGBM, Voting Classifier (score: 158.75)
The comparison of the models is shown in the table below:
Model | Only per game stats + Voting | No per game stats + Voting | All stats + Voting | Only per game stats + No Voting | No per game stats + No Voting | All stats + No Voting |
---|---|---|---|---|---|---|
Logistic Regression | 121.75 | 109.25 | 109.25 | 116.00 | 109.25 | 111.75 |
Support Vector Machine | 113.25 | 124.25 | 124.25 | 118.50 | 122.75 | 122.75 |
Decision Tree | 120.50 | 102.75 | 110.00 | 117.00 | 86.75 | 87.75 |
Random Forest | **154.50** | **158.75** | **145.75** | **142.00** | **149.25** | **148.25** |
K-Nearest Neighbors | 106.50 | 105.25 | 105.25 | 95.00 | 104.25 | 104.25 |
XGBOOST | 141.00 | 143.00 | 143.25 | 131.75 | 131.00 | 138.50 |
LightGBM | 135.50 | 147.75 | 137.50 | 137.25 | 140.50 | 137.50 |
Voting Classifier* | 133.75 | 145.00 | 136.50 | 139.25 | 139.50 | 136.25 |
*Voting Classifier was built from all the above models.
The best score for each configuration is marked in bold.
The best score was achieved by the Random Forest Classifier (158.75). Scores above 140 points were also achieved by:
- XGBOOST - in 3 configurations,
- LightGBM - in 2 configurations,
- Voting Classifier - in 1 configuration.
Only the 4 models that achieved 140 points at least once were selected for hyperparameter tuning.
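A sketch of the comparison loop for the scikit-learn models (the XGBoost and LightGBM classifiers would be added to the dictionary in the same way); `evaluate` stands for a routine that trains a model on the training seasons and returns the mean metric score over the validation seasons:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Default models compared against each other.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(probability=True),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
# Soft-voting ensemble built from all the models above.
models["Voting Classifier"] = VotingClassifier(
    estimators=[(name, model) for name, model in models.items()], voting="soft"
)

def compare(evaluate) -> dict[str, float]:
    # Returns the validation score of each model under the same evaluation routine.
    return {name: evaluate(model) for name, model in models.items()}
```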
The following parameter grid was created (Voting Classifier was built only from other models in this table):
 | Random Forest Classifier | XGBoost | LightGBM | Voting Classifier |
---|---|---|---|---|
Parameters | {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [10, 25, 50, 100, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4], 'max_features': ['sqrt', 'log2']} | {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [10, 25, 50, 100, None], 'learning_rate': [0.01, 0.05, 0.1, 0.2], 'subsample': [0.6, 0.8, 1], 'colsample_bytree': [0.5, 0.8, 1], 'gamma': [0, 0.1, 0.2, 0.3, 0.4]} | {'n_estimators': [100, 200, 300, 400, 500], 'max_depth': [10, 25, 50, 100, None], 'learning_rate': [0.01, 0.05, 0.1, 0.2], 'subsample': [0.6, 0.8, 1], 'colsample_bytree': [0.5, 0.8, 1]} | {'weights': [[1, 1, 1], [1, 2, 1], [1, 1, 2], [2, 1, 1], [2, 2, 1], [1, 2, 2]], 'voting': 'soft'} |
Feature selection was also implemented. In each iteration a random subset of features (at least 5) was chosen from the list of statistics, and for each set of features 50 iterations of hyperparameter tuning were run.
By randomly choosing the features 50 times and then randomly choosing the parameters 50 times, there were 2500 results for each model (10000 in total). Each model was also tested with and without the additional voting for the prediction, so 20000 combinations were checked in total. The optimization process took ~6 hours. The best model got a score of 175.75, which is a significant improvement over the baseline model. The features and hyperparameters of the best model are as follows:
- model: Random Forest Classifier,
- model parameters: {'n_estimators': 200, 'max_depth': 10, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'sqrt'},
- features: ['STL', 'FTM_2', 'STL_per_GP', 'PIE_per_GP', 'FTA', 'FG3M_per_GP', 'POTM', 'All-Star', 'MIN', 'FGM_2', 'FP', 'DD', 'GP', 'FTA_per_GP', 'PTS', 'REB', 'DPOY', 'FT_PCT', 'REB_per_GP', 'All-Star-MVP', 'L', 'TD', 'FG3_PCT', 'BLK_per_GP', 'PTS_per_GP', 'AST', 'PIE', 'W', 'FG3M_2', 'TO_per_GP', 'FGM_per_GP', 'FGA', 'FTM_per_GP', 'ROTM'],
- additional voting: True.
All the models with their parameters and features were saved to a csv file.
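A sketch of the combined feature and hyperparameter search described above, shown here for the Random Forest grid; `all_features`, `param_grid` and `evaluate` are assumed to exist as described in the text:

```python
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterSampler

def random_search(all_features, param_grid, evaluate,
                  n_feature_sets=50, n_param_draws=50, seed=42):
    rng = random.Random(seed)
    results = []
    for _ in range(n_feature_sets):
        # Random feature subset with at least 5 features.
        k = rng.randint(5, len(all_features))
        features = rng.sample(all_features, k)
        # Random hyperparameter draws for this feature subset.
        for params in ParameterSampler(param_grid, n_iter=n_param_draws,
                                       random_state=rng.randint(0, 10**6)):
            model = RandomForestClassifier(**params)
            score = evaluate(model, features)
            results.append({"features": features, "params": params, "score": score})
    # Best configuration according to the validation metric.
    return max(results, key=lambda r: r["score"])
```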
Before the 2023/24 season, each All-NBA team contained 2 guards, 2 forwards and 1 center (since 2023/24 the voting is positionless). With that in mind, the model could be improved by adding information about the player's position and then filtering the predictions so that each team has the correct number of players at each position.
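A sketch of how such a position constraint could be applied on top of the voting points; the `POSITION` column is hypothetical, as positions are not part of the current dataset:

```python
import pandas as pd

# Pre-2023/24 team composition: 2 guards, 2 forwards, 1 center.
QUOTA = {"G": 2, "F": 2, "C": 1}

def fill_team(candidates: pd.DataFrame) -> list[str]:
    """`candidates` is indexed by player name and has columns
    'POSITION' and 'POINTS' (voting points)."""
    team = []
    used = {pos: 0 for pos in QUOTA}
    ranked = candidates.sort_values("POINTS", ascending=False)
    for player, row in ranked.iterrows():
        pos = row["POSITION"]
        # Take the highest-ranked players while respecting the position quotas.
        if pos in QUOTA and used[pos] < QUOTA[pos]:
            team.append(player)
            used[pos] += 1
        if len(team) == 5:
            break
    return team
```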
Random Forest Classifier with n_estimators = 100
was used as a baseline model. The score on the validation set (with all features) was 131.25 (out of 180).
Similar to the model for All-NBA teams, the probability voting was added. The formula for voting was changed to:
After adding the voting, the score for the model increased to 136.5.
Using the model that was best for predicting All-NBA teams, with the same features, resulted in a score of 126.00 with additional voting and 115.50 without it.
The same parameter grid was used as for All-NBA teams prediction. Once again the parameters were randomly chosen 50 times for each of 50 randomly chosen sets of features. The best score (174.5) was achieved by 3 models:
- XGBoost:
  - model parameters: {'n_estimators': 400, 'max_depth': 25, 'learning_rate': 0.1, 'subsample': 0.8, 'colsample_bytree': 0.5, 'gamma': 0.0},
  - features: ['All-Star-MVP' 'FG3_PCT' 'W' 'REB_per_GP' 'POTW' 'FP_per_GP' 'FGM_2' 'STL_per_GP' 'STL' 'TD' 'FG_PCT' 'REB' 'FG3A' 'ROTM' 'FGA' 'FGA_per_GP' 'FTM_2' 'PTS' 'FP' 'AST_per_GP' 'DD' 'POTM' 'MIN' 'AST' 'FG3M_per_GP' 'TO_per_GP' 'All-Star' 'PIE' 'PTS_per_GP' 'L' 'BLK_per_GP' 'PIE_per_GP' 'FTA_per_GP'],
  - additional voting: True,
- LightGBM:
  - model parameters: {'n_estimators': 300, 'max_depth': 10, 'learning_rate': 0.01, 'subsample': 0.6, 'colsample_bytree': 0.8},
  - features: ['All-Star-MVP' 'POTW' 'W' 'ROTM' 'FG3A' 'STL' 'REB' 'PIE' 'REB_per_GP' 'TO_per_GP' 'FG3M_2' 'PTS' 'STL_per_GP' 'FP_per_GP' 'L' 'FP' 'MIN' 'MIN_per_GP' 'FG3A_per_GP' 'PIE_per_GP' 'FTM_per_GP' 'FGA_per_GP' 'FG3M_per_GP' 'POTM' 'GP' 'FG3_PCT'],
  - additional voting: True,
- LightGBM:
  - model parameters: {'n_estimators': 100, 'max_depth': None, 'learning_rate': 0.05, 'subsample': 0.8, 'colsample_bytree': 1.0},
  - features: ['All-Star-MVP' 'POTW' 'W' 'ROTM' 'FG3A' 'STL' 'REB' 'PIE' 'REB_per_GP' 'TO_per_GP' 'FG3M_2' 'PTS' 'STL_per_GP' 'FP_per_GP' 'L' 'FP' 'MIN' 'MIN_per_GP' 'FG3A_per_GP' 'PIE_per_GP' 'FTM_per_GP' 'FGA_per_GP' 'FG3M_per_GP' 'POTM' 'GP' 'FG3_PCT'],
  - additional voting: True.
XGBoost was chosen because it was the first model to reach the highest score. All the models with their parameters, features and scores were saved to a csv file.
Predictions for the 2023/2024 season are based on the best model from section 5.1.5. The predictions are shown in the table below:
1st Team | 2nd Team | 3rd Team |
---|---|---|
Nikola Jokic | Jalen Brunson | Devin Booker |
Luka Doncic | Anthony Davis | Domantas Sabonis |
Shai Gilgeous-Alexander | Anthony Edwards | Damian Lillard |
Giannis Antetokounmpo | Kevin Durant | Kawhi Leonard |
Jayson Tatum | LeBron James | Tyrese Haliburton |
Predictions for the 2023/2024 season are based on the best model from section 5.2.4. The predictions are shown in the table below:
1st Team | 2nd Team |
---|---|
Victor Wembanyama | Scoot Henderson |
Chet Holmgren | Keyonte George |
Brandon Miller | Amen Thompson |
Jaime Jaquez Jr. | Dereck Lively II |
Brandin Podziemski | GG Jackson |
The final score on the 2023-24 season was 356 points (out of 450):
- All-NBA Teams:
  - 1st Team: 10+10+10+10+10+40 = 90,
  - 2nd Team: 10+10+10+10+8+20 = 68,
  - 3rd Team: 10+10+0+8+10+10 = 48,
- All-NBA Rookie Teams:
  - 1st Team: 10+10+10+10+10+40 = 90,
  - 2nd Team: 0+10+10+10+10+20 = 60.
The models correctly predicted 21 out of 25 players, 2 players were in wrong teams (difference of 1 team) and 2 players were missing.
As mentioned in section 5.1.6., before the 2023/24 season players were chosen to the All-NBA Teams based on their position, which could be added to the model for predictions on the validation seasons but not for the 2023-24 season (this doesn't apply to the All-NBA Rookie Teams, which have always been positionless). This also means the training data wasn't perfectly consistent with the current rules.
Apart from adding the information about the positions of the players, the following improvements could be implemented:
- adding the player's age/number of seasons in the NBA,
- adding the draft pick number,
- using the whole dataset since the 1946-47 season (the problem is that before 1988-89 only 2 All-NBA teams were selected),
- creating an even bigger parameter grid and testing more combinations.