Historical NFL Data Analysis
Overview
This notebook covers scraping the web for historical NFL data and then analyzing the league's trends over time.
Part 1 - Scraping
For this project, I scraped data from https://www.pro-football-reference.com/ using a Python web crawler. The crawler saves each season to a separate file, but this notebook uses the combined data. I have made the dataset publicly available on Kaggle (https://www.kaggle.com/datasets/flynn28/1926-2024-nfl-scores), so you can skip this step if you prefer.
Step 1 - Import Libraries
This crawler uses requests to fetch each page, BeautifulSoup to parse the HTML, pandas to build the DataFrame, and time to pause between requests.
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
Step 2 - Define Crawl Function
The crawl function takes a year, fetches that season's page from our source, parses the HTML for the results table, cleans the rows, saves them to a CSV file, and returns the cleaned rows so they can be appended to the combined dataset.
def crawl(year): # define the function and its input
    response = requests.get(f"https://www.pro-football-reference.com/years/{year}/games.htm") # get the content of that season's page
    soup = BeautifulSoup(response.content, 'html.parser') # parse the HTML
    table = soup.find('table', {'class': 'stats_table'}) # find the results table
    if table is None: # no table on the page (missing season or blocked request)
        time.sleep(5) # still pause before the next request
        return [] # an empty list lets the caller report "not found"
    data = [] # raw rows of cell text
    for row in table.find_all('tr')[1:]: # iterate through the table, skipping the header
        columns = [col.get_text().strip() for col in row.find_all('td')] # text of every cell in the row
        if columns: # skip rows without data cells
            data.append(columns) # append the row to data
    game_type = "Regular Season" # rows before the "Playoffs" divider are regular season games
    cleaned = [] # list to store the cleaned rows
    for i in data:
        if i[1] == "Playoffs": # the divider row marks the start of the playoffs
            game_type = "Playoff" # note: the divider row itself is also kept and tagged "Playoff"
        cleaned.append([i[1], i[0], i[3], i[5], i[7], i[8], game_type, year]) # date, day, winner, loser, scores, type, season
    df = pd.DataFrame(cleaned, columns=["Date", "DOW", "WT", "LT", "WTS", "LTS", "Type", "Season"]) # build a dataframe with our data and header
    df.to_csv(f"data/{year}_NFL_SCORES.csv", index=False) # save the season to csv
    time.sleep(5) # wait 5 seconds between requests to avoid overloading the server
    return cleaned # return the list
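The row-cleaning step can be tried offline on hand-made rows. The sketch below is a variant rather than the crawler's exact logic: the helper name tag_rows and the sample rows are hypothetical, the cell positions are assumed to match the indices used in crawl, and this version drops the "Playoffs" divider row instead of keeping it.

```python
def tag_rows(rows, year, divider="Playoffs"):
    """Label each scraped row as a regular season or playoff game.

    Assumed cell layout (mirrors the indices used in crawl):
    0 = day, 1 = date, 3 = winner, 5 = loser, 7 = winner points, 8 = loser points.
    """
    game_type = "Regular Season"
    cleaned = []
    for cells in rows:
        if cells[1] == divider:
            # Everything after the divider is a playoff game. The divider
            # itself is not a game, so this variant skips it.
            game_type = "Playoff"
            continue
        cleaned.append([cells[1], cells[0], cells[3], cells[5],
                        cells[7], cells[8], game_type, year])
    return cleaned


# Made-up rows for illustration only (not real results).
sample = [
    ["Sun", "2000-09-03", "1:00PM", "Team A", "@", "Team B", "boxscore", "21", "14"],
    ["", "Playoffs", "", "", "", "", "", "", ""],
    ["Sat", "2000-12-30", "4:00PM", "Team C", "", "Team A", "boxscore", "10", "3"],
]
result = tag_rows(sample, 2000)
for row in result:
    print(row)
```

Whether to keep or drop the divider row is a design choice; dropping it avoids an empty "game" in the playoff counts later on.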
Step 3 - Iterate Seasons and Save
This section of code iterates through our range of years, 1926 through 2024, and saves the combined data when finished.
all_seasons = [] # define list to save all the data to
for year in range(1926, 2025): # repeat the crawl function for every year from 1926 to 2024 (the end of range is exclusive)
    season = crawl(year) # store the rows returned by the crawl function
    if season: # check if data exists
        all_seasons.extend(season) # append to the combined data
    else:
        print(f"{year}: not found") # print a message if no data is found
df = pd.DataFrame(all_seasons, columns=["Date", "DOW", "WT", "LT", "WTS", "LTS", "Type", "Season"]) # define dataframe to store data
df.to_csv("data/1926-2024_COMBINED_NFL_SCORES.csv", index=False) # save dataframe to csv
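The driver loop can be exercised without touching the network by passing in a stand-in for crawl. This is a minimal sketch under that assumption: collect_seasons and fake_crawl are hypothetical names, and the placeholder rows are not real data.

```python
def collect_seasons(crawl_fn, first_year, last_year):
    """Call crawl_fn for every season from first_year through last_year inclusive."""
    all_rows, missing = [], []
    for year in range(first_year, last_year + 1):  # range() excludes its end value
        rows = crawl_fn(year)
        if rows:
            all_rows.extend(rows)
        else:
            missing.append(year)  # remember seasons that returned nothing
    return all_rows, missing


def fake_crawl(year):
    # Stand-in for crawl(): one placeholder row per season, nothing for 1930.
    if year == 1930:
        return []
    return [["date", "day", "W", "L", "7", "0", "Regular Season", year]]


rows, missing = collect_seasons(fake_crawl, 1926, 2024)
print(len(rows), missing)
```

Collecting the missing years in a list, rather than only printing them, makes it easy to retry just those seasons later.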
Part 2 - AnalyzingΒΆ
In this section, I extract various features from the previously scraped data to see how the league has evolved over the years.
Step 1 - Import Libraries
This section uses two libraries: pandas to read the data from the CSV file and Matplotlib to plot it.
import pandas as pd
import matplotlib.pyplot as plt
Step 2 - Importing data
Load the previously scraped data using Pandas.
df = pd.read_csv("data/1926-2024_COMBINED_NFL_SCORES.csv") # store the data frame into "df"
Step 3 - Feature Extraction
Extract and plot features such as average winning and losing scores by season, average point differential by season, regular season and playoff game totals by season, average regular season games per team, and average points per game.
avg_scores = df.groupby('Season')[['WTS', 'LTS']].mean() # average winning and losing scores for each season
plt.figure(figsize=(10,6)) # create the figure and set its size
plt.plot(avg_scores.index, avg_scores['WTS'], label='Average Winning Score', marker='o', color='g') # plot the average winning scores
plt.plot(avg_scores.index, avg_scores['LTS'], label='Average Losing Score', marker='o', color='r') # plot the average losing scores
plt.xlabel('Season') # label the X axis
plt.ylabel('Score') # label the Y axis
plt.title('Average Winning and Losing Scores by Season') # title the graph
plt.legend() # add a legend to the graph
plt.grid(True) # turn on the grid
plt.show() # display the graph
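To make the per-season averaging concrete, here is the same calculation spelled out in plain Python on a few made-up games. It is only a sketch of what the grouped mean computes, not a replacement for the pandas call; the sample scores are invented.

```python
from collections import defaultdict
from statistics import mean

# Made-up games: (season, winner points, loser points).
games = [
    (2000, 21, 14),
    (2000, 30, 10),
    (2001, 17, 16),
]

# Collect the scores of each season into separate lists.
by_season = defaultdict(lambda: {"WTS": [], "LTS": []})
for season, wts, lts in games:
    by_season[season]["WTS"].append(wts)
    by_season[season]["LTS"].append(lts)

# Average each list, giving one winning and one losing average per season.
avg_scores = {
    season: {col: mean(values) for col, values in cols.items()}
    for season, cols in by_season.items()
}
print(avg_scores)
```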
df['Point_Differential'] = df['WTS'] - df['LTS'] # create a column containing point differentials (winning score - losing score)
avg_point_differential = df.groupby('Season')['Point_Differential'].mean() # get the average of the point differential column by season
plt.figure(figsize=(10, 6)) # define graph and set size
plt.plot(avg_point_differential.index, avg_point_differential.values, label='Average Point Differential', marker='o', color='b') # plot the data
plt.xlabel('Season') # title the X axis
plt.ylabel('Average Point Differential') # title the Y axis
plt.title('Average Point Differential by Season') # title the graph
plt.grid(True) # enable grid
plt.show() # display graph
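One way to sanity-check this plot: within a season, the average point differential equals the gap between the average winning score and the average losing score, because averaging is linear. A small check on made-up scores (the numbers are invented for illustration):

```python
from statistics import mean

# Made-up scores for one season: (winner points, loser points).
scores = [(21, 14), (30, 10), (17, 16), (24, 20)]

winners = [w for w, _ in scores]
losers = [l for _, l in scores]
differentials = [w - l for w, l in scores]

avg_diff = mean(differentials)                # mean of the per-game margins
gap_of_means = mean(winners) - mean(losers)   # gap between the two averages

print(avg_diff, gap_of_means)
```

The two values match, so the differential curve should track the vertical distance between the winning and losing score curves.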
game_counts = df.groupby(['Season', 'Type']).size().unstack(fill_value=0) # count each type of game for each season
game_counts.plot(figsize=(10, 6), color=['blue', 'red']) # plot the data
plt.xlabel('Season') # label X axis
plt.ylabel('Number of Games') # label Y axis
plt.title('Number of Regular Season and Playoff Games by Season') # title graph
plt.legend(title='Game Type') # add key
plt.grid(True) # turn on grid
plt.show() # display graph
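The counting step can be illustrated without pandas. This sketch uses made-up (season, type) pairs and a Counter to show the shape of the result, including how a season with no playoff rows ends up with 0 rather than a missing value.

```python
from collections import Counter

# Made-up (season, type) pairs standing in for the Season and Type columns.
rows = [
    (1930, "Regular Season"), (1930, "Regular Season"),
    (2000, "Regular Season"), (2000, "Regular Season"), (2000, "Playoff"),
]

counts = Counter(rows)
seasons = sorted({season for season, _ in rows})
types = sorted({game_type for _, game_type in rows})

# A Counter returns 0 for a missing key, which plays the role of fill_value=0.
game_counts = {s: {t: counts[(s, t)] for t in types} for s in seasons}
print(game_counts)
```

Note that the types come out in sorted order, so "Playoff" precedes "Regular Season". Assuming the pandas result is ordered the same way, the first color in the plot's color list applies to playoff games.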
games = df[df['Type'] == 'Regular Season'] # create a dataframe with only regular season games
games_per_year = games.groupby('Season').size() # count the regular season games in each season
teams = pd.concat([games['WT'], games['LT']]).groupby(games['Season']).nunique() # count the distinct teams in each season
avg_games_per_team = 2 * games_per_year / teams # each game involves two teams, so team appearances are twice the game count
plt.figure(figsize=(10, 6)) # define graph
plt.plot(avg_games_per_team.index, avg_games_per_team.values, marker='o', color='g') # plot data
plt.xlabel('Season') # label X axis
plt.ylabel('Average Games per Team (Regular Season)') # label Y axis
plt.title('Average Regular Season Games per Season per Team') # title graph
plt.grid(True) # enable grid
plt.show() # display graph
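A small worked example of the per-team arithmetic, assuming each game appears once in the data: every game adds one appearance for each of its two teams, so the per-team average is twice the game count divided by the number of teams. The schedule below is made up.

```python
# Made-up regular season: four teams, each playing the other three once.
games = [
    (2000, "A", "B"), (2000, "A", "C"), (2000, "D", "A"),
    (2000, "B", "C"), (2000, "D", "B"), (2000, "C", "D"),
]

games_per_season = {}
teams_per_season = {}
for season, winner, loser in games:
    games_per_season[season] = games_per_season.get(season, 0) + 1
    teams_per_season.setdefault(season, set()).update((winner, loser))

# Two team appearances per game, spread over the distinct teams.
avg_games_per_team = {
    season: 2 * games_per_season[season] / len(teams_per_season[season])
    for season in games_per_season
}
print(avg_games_per_team)
```

With 6 games and 4 teams, each team plays 3 games, which is what the formula returns; dividing games by teams without the factor of two would give 1.5.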
df['Total_Points'] = df['WTS'] + df['LTS'] # create a column for total points in a game
combined_points = df.groupby('Season')['Total_Points'].sum() # calculate sum of total points for each year
games_per_year = df.groupby('Season').size() # count the number of games each season
avg_points_per_game = combined_points / games_per_year # calculate average points per game for each season
plt.figure(figsize=(10, 6)) # define plot
plt.plot(avg_points_per_game.index, avg_points_per_game.values, marker='o', color='g') # plot data
plt.xlabel('Season') # Label X axis
plt.ylabel('Average Points per Game') # label Y axis
plt.title('Average Points per Game by Season') # title graph
plt.grid(True) # turn on grid
plt.show() # display graph
License
This Notebook has been released under the Apache 2.0 open source license.