Historical NBA Player Stat Analysis
Overview
In this notebook, I demonstrate how to scrape historical NBA player statistics from the web and then analyze trends in the league over time.
Part 1 - Scraping
For this project, I scraped data from https://stats.nba.com using a Python API client called nba_api. The dataset is also available on Kaggle. The scraper takes about 12 hours to run because of API rate limits.
Step 1 - Import Libraries
This crawler uses the following libraries:
- time: to pause the script between API calls and avoid rate limits
- random: to pick a random pause length, so pauses differ each time and we don't trip bot detection
- nba_api: to interact with the API
- requests: to detect when an API request has timed out, which we treat as a sign of rate limiting
import time # python time library
import random # python randomization library
from nba_api.stats.static import players # library to interact with the players endpoints of the NBA api
from nba_api.stats.endpoints import playercareerstats # library to interact with the career stats endpoint of the NBA API
from requests.exceptions import ReadTimeout # library to determine if our API request has timed out
Step 2 - Define Scraper Function
The web scraper function does the following:
- take a dictionary called player as input
- extract the player's name and id from the dictionary
- try to make an API request for the player's stats using the id
- if the request times out (a sign of rate limiting), pause for 30-90 seconds and try again
- once the request succeeds, save the career stats into a dataframe
- add a column for the player's name
- sleep for 0.4-0.7 seconds to avoid read timeouts
- return the stats
def crawl(player): # 1. define function and input player dictionary
    player_id, name = player['id'], player['full_name'] # 2. extract id and name from dictionary
    try: # 3. try to make API call
        stats = playercareerstats.PlayerCareerStats(player_id=player_id).get_data_frames()[0] # 5. save player's stats into a dataframe
        stats['Name'] = name # 6. add a column for the player's name
    except ReadTimeout: # 4. run if the request timed out (likely rate limited)
        time.sleep(random.uniform(30, 90)) # 4. pause for 30-90 seconds
        stats = crawl(player) # 4. call function again
    time.sleep(random.uniform(0.4, 0.7)) # 7. sleep for 0.4-0.7 seconds
    return stats # 8. return dataframe
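The retry step above calls crawl recursively, so a long run of timeouts keeps nesting calls with no upper bound. A minimal sketch of the same pause-and-retry idea written as a bounded loop is shown here. The helper name call_with_retries and its parameters (fetch, retry_on, max_retries, pause_range, sleep) are hypothetical and not part of nba_api or the original notebook.

```python
import random
import time


def call_with_retries(fetch, retry_on=(TimeoutError,), max_retries=5,
                      pause_range=(30, 90), sleep=time.sleep):
    """Call fetch() and retry after a random pause when it raises retry_on.

    Hypothetical helper: fetch is any zero-argument callable, retry_on is a
    tuple of exception types that should trigger a retry.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # pause for a random length inside pause_range before retrying
            sleep(random.uniform(*pause_range))
```

Inside crawl, fetch could be a lambda wrapping the PlayerCareerStats request, with retry_on=(ReadTimeout,). The sleep parameter exists only so the pause can be swapped out when experimenting.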
Step 3 - Calling the Function
This section does the following:
- set the header flag to false, because the csv does not yet have a header
- make an API call to request all players, saving their names and ids into a list of dictionaries
- open the output csv
- iterate through the list of players
- scrape each player's stats and append them to the csv
- set the header flag to true, so the header is not repeated after every player
h = False # 1. set header flag to false (no header written yet)
nba_players = players.get_players() # 2. call API for list of players
with open("data/ALL_NBA_PLAYERS.csv", mode='w', newline='', encoding='utf-8') as file: # 3. open output csv file
    for i in nba_players: # 4. iterate through scraped player names
        crawl(i).to_csv(file, header=not h, index=False) # 5. scrape stats for current player and append to csv
        h = True # 6. header has been written, so skip it from now on
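The header-once pattern can be tried without any API calls. The sketch below writes two made-up dataframes into an in-memory buffer instead of a file. The column values are invented for illustration and are not real player data.

```python
import io

import pandas as pd

# two tiny stand-ins for the dataframes returned by crawl (made-up values)
frames = [
    pd.DataFrame({"PLAYER_ID": [1], "PTS": [10]}),
    pd.DataFrame({"PLAYER_ID": [2], "PTS": [20]}),
]

buffer = io.StringIO()
for position, frame in enumerate(frames):
    # only the first dataframe writes the column names
    frame.to_csv(buffer, header=(position == 0), index=False)

csv_text = buffer.getvalue()
print(csv_text)
```

The resulting text has one header line followed by one row per dataframe, which is the layout pd.read_csv expects in Part 2.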
Part 2 - Analyzing
This notebook calculates and plots the following data:
- Average Player Age by Season
- Average Defensive vs Offensive Rebounds by Season
- Average Personal Fouls by Season
- Average Free Throw Percent by Season
- Average Steals vs Blocks by season
Import Libraries
We use the following libraries:
- pandas: to load data from the spreadsheet
- matplotlib: to plot data
import pandas as pd # for loading data
import matplotlib.pyplot as plt # for plotting graphs
Read data
This part of the code stores the data from the dataset into a pandas dataframe.
df = pd.read_csv("data/ALL_NBA_PLAYERS.csv") # store data in pandas dataframe
Calculate and Plot Average Player Age by Season
This code calculates the average player age by season, then sets up and plots the graph.
avg_age = df.groupby('SEASON_ID')["PLAYER_AGE"].mean() # store the average player age in pandas series by season
plt.figure(figsize=(10,6)) # set graph size
plt.plot(avg_age.index, avg_age, label='Average Age', marker='o', color='r') # plot lines
plt.xlabel('Season') # set x axis label
plt.ylabel('Average Age') # set y axis label
plt.title('Average Player Age by Season') # set graph title
plt.legend() # create key
plt.xticks(avg_age.index[::2], rotation=90) # set X axis ticks to every other season, rotate 90 degrees
plt.grid(True) # enable grid
plt.show() # show graph
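To see what the groupby step produces before plotting, here is a small sketch on made-up rows. The ages are invented, and only the column names SEASON_ID and PLAYER_AGE come from the dataset.

```python
import pandas as pd

# three made-up player-season rows
sample = pd.DataFrame({
    "SEASON_ID": ["1996-97", "1996-97", "1997-98"],
    "PLAYER_AGE": [22.0, 30.0, 25.0],
})

# one mean per season, indexed by SEASON_ID
avg_age_sample = sample.groupby("SEASON_ID")["PLAYER_AGE"].mean()
print(avg_age_sample)
```

The result is a pandas series whose index holds the season labels, which is why avg_age.index can be passed straight to plt.plot and plt.xticks.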
Calculate and Plot Average Offensive, Defensive, and Total Rebounds
This code calculates the average offensive, defensive, and total rebounds, sets up the graph, then plots all three for comparison.
avg_reb = df.groupby("SEASON_ID")[["OREB", "DREB", "REB"]].mean() # store average offensive, defensive, and total rebounds in a dataframe by season
plt.figure(figsize=(10,6)) # set graph size
plt.plot(avg_reb.index, avg_reb["OREB"], label="Average Offensive Rebounds", color='red', marker="^") # plot lines
plt.plot(avg_reb.index, avg_reb["DREB"], label="Average Defensive Rebounds", color='blue', marker="v") # plot lines
plt.plot(avg_reb.index, avg_reb["REB"], label="Average Combined Rebounds", color='green', marker="o") # plot lines
plt.xlabel("Season") # set x axis label
plt.ylabel("Average Rebounds") # set y axis label
plt.title("Average Offensive Rebounds vs Average Defensive Rebounds by Season") # set graph title
plt.legend() # create key
plt.xticks(avg_reb.index[::2], rotation=90) # set X axis ticks to every other season, rotate 90 degrees
plt.grid(True) # enable grid
plt.show() # show graph
Calculate and Plot Average Personal Fouls by Season
This code calculates the average personal fouls by season then graphs the data.
avg_pf = df.groupby('SEASON_ID')["PF"].mean() # store average personal fouls grouped by season
plt.figure(figsize=(10,6)) # set graph size
plt.plot(avg_pf.index, avg_pf.values, label="Average Personal Fouls", color='red', marker="o") # plot lines
plt.xlabel("Season") # set x axis label
plt.ylabel("Average Personal Fouls") # set y axis label
plt.title("Average Personal Fouls by Season") # set graph title
plt.legend() # create key
plt.xticks(avg_pf.index[::2], rotation=90) # set X axis ticks to every other season, rotate 90 degrees
plt.grid(True) # enable grid
plt.show() # show graph
Calculate and Plot Average Free Throw Percent by Season
This code calculates the average free throw percent by season, multiplies the decimal by 100 to get a percentage, then graphs the result.
avg_ftpct = df.groupby("SEASON_ID")["FT_PCT"].mean() # store average free throw percent grouped by season
avg_ftpct *= 100 # multiply decimal by 100 to get %
plt.figure(figsize=(10,6)) # set graph size
plt.plot(avg_ftpct.index, avg_ftpct.values, label="Average Free Throw Percent", color="red", marker="o") # plot lines
plt.xlabel("Season") # set x axis label
plt.ylabel("Average Free Throw Percent") # set y axis label
plt.title("Average Free Throw Percent by Season") # set graph title
plt.legend() # create key
plt.xticks(avg_ftpct.index[::2], rotation=90) # set X axis ticks to every other season, rotate 90 degrees
plt.grid(True) # enable grid
plt.show() # show graph
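Averaging FT_PCT gives every player row equal weight, so a player with four attempts counts as much as one with hundreds. A possible alternative, sketched below on made-up numbers, is to divide total makes by total attempts per season. This assumes the dataset has FTM and FTA columns, which is worth checking against the csv before relying on it.

```python
import pandas as pd

# two made-up rows in the same season: a high-volume and a low-volume shooter
sample = pd.DataFrame({
    "SEASON_ID": ["1996-97", "1996-97"],
    "FTM": [90, 1],   # free throws made (assumed column name)
    "FTA": [100, 4],  # free throws attempted (assumed column name)
})
sample["FT_PCT"] = sample["FTM"] / sample["FTA"]

# per-player average, as in the notebook: each row counts equally
unweighted = sample.groupby("SEASON_ID")["FT_PCT"].mean() * 100

# attempt-weighted average: total makes divided by total attempts
totals = sample.groupby("SEASON_ID")[["FTM", "FTA"]].sum()
weighted = totals["FTM"] / totals["FTA"] * 100

print(unweighted["1996-97"], weighted["1996-97"])
```

On these rows the per-player mean is 57.5 while the attempt-weighted figure is 87.5, so the two approaches can answer noticeably different questions.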
Compare Average Steals vs Blocks by Season
This code calculates the average steals and blocks by season, then graphs them for comparison.
avg_blocks_steals = df.groupby("SEASON_ID")[["STL", "BLK"]].mean() # get averages for steals and blocks by season
plt.figure(figsize=(10,6)) # set graph size
plt.plot(avg_blocks_steals.index, avg_blocks_steals.STL, label="Average Steals", color="red", marker="^") # plot steals line
plt.plot(avg_blocks_steals.index, avg_blocks_steals.BLK, label="Average Blocks", color="blue", marker="v") # plot blocks line
plt.xlabel("Season") # set x axis label
plt.ylabel("Averages") # set y axis label
plt.title("Average Blocks vs Average Steals by Season") # set graph title
plt.xticks(avg_blocks_steals.index[::2], rotation=90) # set X axis ticks to every other season, rotate 90 degrees
plt.grid(True) # enable grid
plt.legend() # create key
plt.show() # show graph
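The five plots above repeat the same setup: figure size, axis labels, title, legend, every-other-season ticks, and grid. One way to reduce the repetition is a small helper like the sketch below. The function name plot_season_averages and its parameters are hypothetical, and it uses column names as legend labels rather than the custom labels used above.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_season_averages(averages, title, ylabel, tick_step=2):
    """Plot each column of a season-indexed dataframe as one line.

    Hypothetical helper: averages is the result of a
    df.groupby("SEASON_ID")[[...]].mean() call.
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    for column in averages.columns:
        ax.plot(averages.index, averages[column], label=column, marker="o")
    ax.set_xlabel("Season")
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    # label every tick_step-th season and rotate the labels 90 degrees
    ax.set_xticks(list(averages.index[::tick_step]))
    ax.tick_params(axis="x", rotation=90)
    ax.grid(True)
    ax.legend()
    return ax
```

A call such as plot_season_averages(avg_blocks_steals, "Average Blocks vs Average Steals by Season", "Averages") followed by plt.show() would then stand in for one of the plotting cells.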
License
This notebook and its code are licensed under the Apache 2.0 open source license.