Historical NFL Data Analysis

Overview

This notebook walks through scraping historical NFL data from the web and then analyzing the league's trends over time.

Part 1 - Scraping

For this project, I scraped data from https://www.pro-football-reference.com/ using a Python web crawler. The crawler also saves each season to a separate file, but this notebook uses the combined data. I have made the dataset publicly available on Kaggle (https://www.kaggle.com/datasets/flynn28/1926-2024-nfl-scores), so you can skip this step by downloading it.

Step 1 - Import Libraries

This crawler uses requests to fetch each page, BeautifulSoup to parse the HTML, Pandas to build the dataframe, and time to pause between requests.

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

Step 2 - Define Crawl Function

The crawl function takes a year as input, fetches that season's page from our source, parses the HTML for the games table, cleans the data, saves it to a CSV file, and returns the rows so they can be appended to the combined dataset.

In [2]:
def crawl(year): # define function and its input
    response = requests.get(f"https://www.pro-football-reference.com/years/{year}/games.htm") # get the content of that season's page
    soup = BeautifulSoup(response.content, 'html.parser') # define HTML parser
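    table = soup.find('table', {'class': 'stats_table'}) # parse html for table
    if table is None: # no games table found (missing page or blocked request)
        time.sleep(5) # still pause before the next request
        return [] # empty list lets the caller report the season as not found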
    data = [] # define data variable
    for row in table.find_all('tr')[1:]: # iterate through table skipping header
        columns = [col.get_text().strip() for col in row.find_all('td')] # store all columns into "columns"
        if columns: # make sure columns exist
            data.append(columns) # append columns to data

    game_type = "Regular Season" # rows count as regular season until the "Playoffs" divider row appears
    cleaned = [] # define list to store cleaned data
    for i in data: # walk the rows in table order
        if i[1] == "Playoffs": # divider row: switch the label and skip the row itself
            game_type = "Playoff"
            continue
        cleaned.append([i[1], i[0], i[3], i[5], i[7], i[8], game_type, year]) # keep date, day, teams, scores, type, season


    df = pd.DataFrame(cleaned, columns=["Date", "DOW", "WT", "LT", "WTS", "LTS", "Type", "Season"]) # define dataframe with our data and header
    df.to_csv(f"data/{year}_NFL_SCORES.csv", index=False) # save data to csv
    time.sleep(5) # wait 5 seconds between requests to avoid overloading the site
    return cleaned # return the list
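The playoff labelling in the cleaning step can be checked on its own, without touching the network. The sketch below is a minimal illustration rather than the crawler's exact code: `label_rows` is a hypothetical helper, and the sample rows are made up to mimic the column order the crawler indexes into (day, date, time, winner, location marker, loser, box score link, winner points, loser points). It assumes the table contains a divider row whose date cell reads "Playoffs".

```python
def label_rows(rows, year):
    """Tag each row as regular season or playoff, dropping the divider row.

    Hypothetical helper for illustration; assumes a divider row whose
    second cell is the string "Playoffs".
    """
    game_type = "Regular Season"
    cleaned = []
    for row in rows:
        if row[1] == "Playoffs":
            game_type = "Playoff"  # everything after the divider is a playoff game
            continue               # the divider itself is not a game
        cleaned.append([row[1], row[0], row[3], row[5], row[7], row[8], game_type, year])
    return cleaned


# Made-up rows in the assumed column order (not real scraped data)
sample_rows = [
    ["Sun", "2023-09-10", "1:00PM", "Team A", "@", "Team B", "boxscore", "24", "17"],
    ["", "Playoffs", "", "", "", "", "", "", ""],
    ["Sat", "2024-01-13", "4:30PM", "Team C", "", "Team D", "boxscore", "31", "14"],
]

labelled = label_rows(sample_rows, 2023)
for row in labelled:
    print(row)
```

Keeping this logic in a plain loop makes it easy to see that the label only changes once, at the divider, and that the divider never ends up in the output.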

Step 3 - Iterate Seasons and Save

This section of code iterates through our range of years, 1926 through 2024, and saves the combined data when finished.

In [3]:
all_seasons = [] # define list to save all the data to
for year in range(1926, 2025):  # this repeats the crawl function for every year from 1926-2024 
    season = crawl(year) # sets the season variable to the data returned by the crawl function
    if season: # check if data exists
        all_seasons.extend(season) # appends to combined data
    else:
        print(f"{year}: not found") # return message if no data is found

df = pd.DataFrame(all_seasons, columns=["Date", "DOW", "WT", "LT", "WTS", "LTS", "Type", "Season"]) # define dataframe to store data
df.to_csv("data/1926-2024_COMBINED_NFL_SCORES.csv", index=False) # save dataframe to csv
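Python's `range` excludes its stop value, so `range(1926, 2025)` ends with the 2024 season. A quick check, using only the bounds from the loop above:

```python
seasons = list(range(1926, 2025))  # same bounds as the crawl loop

print(seasons[0], seasons[-1])  # first and last season requested
print(len(seasons))             # number of seasons requested
```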

Part 2 - Analyzing

In this section, I extract various features from the previously scraped data to see how the league has evolved over the years.

Step 1 - Import Libraries

This section uses two libraries: Pandas to read data from the CSV file and Matplotlib to graph it.

In [4]:
import pandas as pd
import matplotlib.pyplot as plt

Step 2 - Importing data

Load the previously scraped data using Pandas.

In [6]:
df = pd.read_csv("data/1926-2024_COMBINED_NFL_SCORES.csv") # store the data frame into "df"
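`read_csv` infers column types from the file contents, so a stray non-numeric or blank cell can change how a score column is read. Before aggregating, it can help to confirm that the score columns are numeric. The sketch below does not touch the real file: it builds a tiny in-memory CSV with made-up rows (one of them deliberately missing its scores) and shows one way to coerce and filter with `pd.to_numeric`.

```python
import io
import pandas as pd

# Made-up sample with the same header as the combined file; the second row
# has blank scores to imitate a non-game row that slipped through scraping.
raw = io.StringIO(
    "Date,DOW,WT,LT,WTS,LTS,Type,Season\n"
    "2023-09-10,Sun,Team A,Team B,24,17,Regular Season,2023\n"
    "Playoffs,,,,,,Playoff,2023\n"
    "2024-01-13,Sat,Team C,Team D,31,14,Playoff,2023\n"
)
sample = pd.read_csv(raw)

# Coerce scores to numbers; anything unparseable becomes NaN
for col in ["WTS", "LTS"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Keep only rows that have both scores
games_only = sample.dropna(subset=["WTS", "LTS"]).reset_index(drop=True)
print(len(sample), "rows read,", len(games_only), "rows with scores")
```

Whether the real dataset needs this filtering is an open question; the check is cheap enough to run once after loading.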

Step 3 - Feature Extraction

Extract and plot features such as average winning and losing scores by season, average point differential by season, regular season and playoff game totals by season, average regular season games per team, and average points per game.

In [7]:
avg_scores = df.groupby('Season')[['WTS', 'LTS']].mean() # average winning and losing scores for each season

plt.figure(figsize=(10,6)) # define graph and set size 
plt.plot(avg_scores.index, avg_scores['WTS'], label='Average Winning Score', marker='o', color='g') # plot the average winning scores
plt.plot(avg_scores.index, avg_scores['LTS'], label='Average Losing Score', marker='o', color='r') # plot the average losing scores
plt.xlabel('Season') # set the X axis label
plt.ylabel('Score') # set the Y axis label
plt.title('Average Winning and Losing Scores by Season') # set the graph's title
plt.legend() # create a key to the graph
plt.grid(True) # turn on grid for graph
plt.show() # display the graph
[Figure: Average Winning and Losing Scores by Season]

In [8]:
df['Point_Differential'] = df['WTS'] - df['LTS'] # create a column containing point differentials (winning score - losing score)
avg_point_differential = df.groupby('Season')['Point_Differential'].mean() # get the average of the point differential column by season

plt.figure(figsize=(10, 6)) # define graph and set size 
plt.plot(avg_point_differential.index, avg_point_differential.values, label='Average Point Differential', marker='o', color='b') # plot the data
plt.xlabel('Season') # title the X axis
plt.ylabel('Average Point Differential') # title the Y axis
plt.title('Average Point Differential by Season') # title the graph
plt.grid(True) # enable grid
plt.show() # display graph
[Figure: Average Point Differential by Season]

In [9]:
game_counts = df.groupby(['Season', 'Type']).size().unstack(fill_value=0) # count each type of game for each season

game_counts.plot(figsize=(10, 6), color=['blue', 'red']) # plot the data
plt.xlabel('Season') # label X axis
plt.ylabel('Number of Games') # label Y axis
plt.title('Number of Regular Season and Playoff Games by Season') # title graph
plt.legend(title='Game Type') # add key
plt.grid(True) # turn on grid
plt.show() # display graph
[Figure: Number of Regular Season and Playoff Games by Season]
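The `groupby(...).size().unstack(fill_value=0)` chain used for the game counts is easier to follow on a tiny table. The sketch below uses made-up rows, not the scraped data. It shows that `unstack` turns the `Type` level into columns, sorted alphabetically, and that `fill_value=0` supplies a zero for a season with no playoff rows instead of leaving a gap.

```python
import pandas as pd

# Made-up rows: two seasons, one of which has no playoff games
toy = pd.DataFrame({
    "Season": [1930, 1930, 1933, 1933, 1933],
    "Type": ["Regular Season", "Regular Season", "Regular Season", "Regular Season", "Playoff"],
})

# One row per season, one column per game type
counts = toy.groupby(["Season", "Type"]).size().unstack(fill_value=0)
print(counts)
```

Because the columns come out in alphabetical order, a color list passed to `plot` is applied in that order too, with the playoff column first.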

In [10]:
games = df[df['Type'] == 'Regular Season'] # create a dataframe with only regular season games
games_per_year = games.groupby('Season').size() # count regular season games in each season
teams = pd.concat([games['WT'], games['LT']]).groupby(games['Season']).nunique() # count the distinct teams that played each season
avg_games_per_team = 2 * games_per_year / teams # each game involves two teams, so double the game count before dividing

plt.figure(figsize=(10, 6)) # define graph
plt.plot(avg_games_per_team.index, avg_games_per_team.values, marker='o', color='g') # plot data
plt.xlabel('Season') # label X axis
plt.ylabel('Average Games per Team (Regular Season)') # label Y axis
plt.title('Average Regular Season Games per Season per Team') # title graph
plt.grid(True) # enable grid
plt.show() # display graph
[Figure: Average Regular Season Games per Season per Team]
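One way to cross-check a games-per-team figure is to count team appearances directly: every game contributes one appearance for the winner and one for the loser. The sketch below uses a made-up four-team round robin, not the scraped data, where each team is known to play three games.

```python
import pandas as pd

# Made-up round robin: 4 teams, each pair meets once (6 games, 3 per team)
toy = pd.DataFrame({
    "Season": [2000] * 6,
    "WT": ["A", "A", "A", "B", "B", "C"],
    "LT": ["B", "C", "D", "C", "D", "D"],
})

# One appearance for the winner and one for the loser of every game
appearances = pd.concat([
    toy[["Season", "WT"]].rename(columns={"WT": "Team"}),
    toy[["Season", "LT"]].rename(columns={"LT": "Team"}),
])

# Games played by each team, then the average across teams per season
games_per_team = appearances.groupby(["Season", "Team"]).size()
avg_games = games_per_team.groupby(level="Season").mean()
print(avg_games)
```

Six games shared among four teams gives an average of three games per team, which matches twice the game count divided by the number of teams.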

In [11]:
df['Total_Points'] = df['WTS'] + df['LTS'] # create a column for total points in a game

combined_points = df.groupby('Season')['Total_Points'].sum() # calculate sum of total points for each year
games_per_year = df.groupby('Season').size() # count the number of games each season
avg_points_per_game = combined_points / games_per_year # calculate average points per game for each season

plt.figure(figsize=(10, 6)) # define plot
plt.plot(avg_points_per_game.index, avg_points_per_game.values, marker='o', color='g') # plot data
plt.xlabel('Season') # Label X axis
plt.ylabel('Average Points per Game') # label Y axis 
plt.title('Average Points per Game by Season') # title graph 
plt.grid(True) # turn on grid
plt.show() # display graph
[Figure: Average Points per Game by Season]
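Dividing a season's summed points by its number of rows gives the same result as taking the mean of `Total_Points`, provided no score is missing. If a row has no scores, `sum` skips it while `size` still counts it, so the two can drift apart. The sketch below checks the equivalence on made-up numbers.

```python
import pandas as pd

# Made-up scores for two seasons
toy = pd.DataFrame({
    "Season": [1950, 1950, 1951],
    "WTS": [21, 14, 35],
    "LTS": [7, 10, 28],
})
toy["Total_Points"] = toy["WTS"] + toy["LTS"]

# Sum divided by row count, as in the cell above
by_ratio = toy.groupby("Season")["Total_Points"].sum() / toy.groupby("Season").size()
# Direct mean, which ignores missing values
by_mean = toy.groupby("Season")["Total_Points"].mean()

print(by_ratio)
print(by_mean)
```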

License

This Notebook has been released under the Apache 2.0 open source license.