Predicting Formula 1 Race Winners using pre-race data
Proposal
Dataset
import pandas as pd

races = pd.read_csv('data/raw/races.csv')
results = pd.read_csv('data/raw/results.csv')
drivers = pd.read_csv('data/raw/drivers.csv')
constructors = pd.read_csv('data/raw/constructors.csv')
driver_standings = pd.read_csv('data/raw/driver_standings.csv')
constructor_standings = pd.read_csv('data/raw/constructor_standings.csv')
qualifying = pd.read_csv('data/raw/qualifying.csv')
circuits = pd.read_csv('data/raw/circuits.csv')
Our analysis uses the comprehensive Formula 1 World Championship dataset available on Kaggle, originally compiled from the Ergast Developer API. It contains the historical record of Formula 1 racing from 1950 to 2024.
Dataset Dimensions:
- Coverage: 74 seasons (1950–2024)
- Total races: ~1,100 Grand Prix events
- Race results: ~25,000 driver-race combinations
- Files: 14 CSV files including results, qualifying, standings, circuits, and lap times
Why This Dataset?
- Sufficient data for training robust machine learning models.
- The data has been carefully curated and cleaned by the F1 analytics community, reducing preprocessing overhead and ensuring reliability.
- The dataset’s structure naturally separates pre-race information (qualifying results, standings, circuit data) from race outcomes, which is crucial for building a prediction model that avoids data leakage.
- The dataset is well-documented and widely used in motorsport analytics research, providing a strong foundation for reproducible research.
Motivation & Real-World Applications
Understanding which pre-race factors drive Formula 1 success provides actionable insights for multiple stakeholders:
- For F1 Teams: Predictive models can inform resource allocation between qualifying optimization and race-day strategy. By quantifying the relative importance of grid position, recent form, and team performance, teams can justify investments in qualifying simulations, aerodynamic development, or strategic planning.
- For Analysts & Broadcasters: Data-driven pre-race predictions enhance viewer engagement through evidence-based commentary.
- For Sports Analytics Research: This project demonstrates best practices for temporal modeling in sequential competitions: preventing data leakage in time-series data, balancing model complexity with interpretability, and validating predictions using chronological splits. These techniques transfer to other motorsports and sports with similar temporal structures.
Questions
Question 1: Which pre-race factors are most predictive of a Formula 1 driver’s race outcome? (e.g., grid position, constructor, driver form, circuit characteristics).
Question 2: Can regression models accurately predict driver finishing positions using only pre-race data, and which model performs best for this task?
- The focus will be on understanding what influences success rather than perfect prediction accuracy.
Analysis plan
We will predict a driver’s finishing position (positionOrder) using data available before or at the start of each race. This will help identify which pre-race factors contribute most strongly to race outcomes.
Data Preparation
- Merge relevant CSVs: results, races, drivers, constructors, circuits, and qualifying.
- Select only pre-race variables to avoid leakage from in-race or post-race data.
- Apply feature engineering to derive:
  - Average driver points from recent races
  - Team performance trend (constructor standings)
  - Qualifying position and grid advantage
  - Circuit-level attributes (e.g., country, type)
- Handle missing values, encode categorical variables, and normalize numerical ones.
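The merge-then-engineer steps above can be sketched as follows. This is a minimal illustration on toy data, not the final pipeline: the column names (`raceId`, `driverId`, `points`, `grid`, `year`, `round`) are assumed from the Ergast schema, and the "recent form" feature uses a shifted rolling mean so each row only sees races that finished before it.

```python
import pandas as pd

# Toy stand-ins for races.csv and results.csv (column names assumed
# from the Ergast schema; adjust if the real files differ).
races = pd.DataFrame({
    "raceId": [1, 2, 3, 4],
    "year":   [2023, 2023, 2023, 2023],
    "round":  [1, 2, 3, 4],
})
results = pd.DataFrame({
    "raceId":        [1, 2, 3, 4],
    "driverId":      [10, 10, 10, 10],
    "grid":          [2, 1, 3, 2],
    "points":        [18, 25, 15, 18],
    "positionOrder": [2, 1, 4, 2],
})

# Merge results with race metadata and order each driver's rows chronologically.
df = results.merge(races, on="raceId").sort_values(["driverId", "year", "round"])

# Average points over the driver's previous 3 races. The shift(1) is the
# leakage guard: the feature for race N uses only races before N.
df["form_last3"] = (
    df.groupby("driverId")["points"]
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
print(df[["raceId", "grid", "points", "form_last3"]])
```

The same shift-then-roll pattern extends to any per-driver or per-constructor history feature.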
Modeling Approach
We will start with multiple regression models to predict finishing position (positionOrder) for each driver:
- Linear Regression: baseline for interpretability
- Random Forest Regressor: captures non-linear interactions
- XGBoost Regressor: efficient, high-performance gradient boosting method

If results show instability or complementary strengths across models, we may test a simple ensemble (e.g., stacking) to assess improvement, but only if it adds meaningful value.
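A minimal sketch of the model comparison loop, using scikit-learn on a synthetic feature matrix (the real one would come from the merged pre-race features). XGBoost is left as a commented placeholder since it is a separate install; the feature names in the comment are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic pre-race features, e.g. grid position, recent form, team strength.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target loosely driven by the first feature (think: grid position) plus noise.
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) + 10

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # "xgboost": xgboost.XGBRegressor(...)  # add if xgboost is installed
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))  # training R^2, for illustration
```

In the actual analysis these scores would come from held-out seasons, not the training set.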
Evaluation Metrics
We will evaluate model performance using:
- Mean Absolute Error (MAE): how far, on average, the predicted position is from the true position
- R² Score: how much of the variance in finishing position the model explains

We will also compare feature importances to identify the most influential predictors (e.g., grid position, constructor, driver consistency).
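Both metrics are one call each in scikit-learn. A small worked example with hypothetical true and predicted finishing positions for five drivers:

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical finishing positions for five drivers in one race.
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 3, 3, 5, 4]

mae = mean_absolute_error(y_true, y_pred)  # average positions off: 0.6
r2 = r2_score(y_true, y_pred)              # variance explained: 0.7
print(mae, r2)
```

An MAE of 0.6 means predictions are off by a little over half a position on average, which is easier to communicate than R² alone.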
Validation
- Train/test split by season to mimic real-world forecasting
- 5-fold cross-validation to ensure generalization and reduce overfitting
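The season-based split can be as simple as filtering on year, holding out the most recent seasons so the model never trains on the future. A sketch on a toy table (column names assumed from races.csv; the cutoff year is illustrative):

```python
import pandas as pd

# Toy race table spanning several seasons.
df = pd.DataFrame({
    "year":          [2019, 2020, 2021, 2022, 2023, 2024],
    "positionOrder": [1, 2, 3, 1, 2, 3],
})

# Chronological split: train on the past, test on the held-out recent seasons.
cutoff = 2023
train = df[df["year"] < cutoff]
test = df[df["year"] >= cutoff]
print(len(train), len(test))
```

For the cross-validation step, a time-aware splitter (e.g., scikit-learn's TimeSeriesSplit over seasons) keeps each fold's test data strictly after its training data, consistent with the leakage-avoidance goal above.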
Weekly Plan
Weeks 1–2: Loading data, performing exploratory analysis, and engineering features. Merging the 14 CSV files and creating historical rolling averages.
Weeks 3–4: Training individual models and performing hyperparameter tuning. Establishing baseline performance and identifying the strongest individual model through cross-validation.
Week 5: Implementing the ensemble model and conducting a comprehensive evaluation.
Week 6: Performing final analysis, creating one or more visualizations, and preparing the report.