Predicting Formula 1 Race Winners using pre-race data
Proposal
Dataset
import pandas as pd

races = pd.read_csv('data/raw/races.csv')
results = pd.read_csv('data/raw/results.csv')
drivers = pd.read_csv('data/raw/drivers.csv')
constructors = pd.read_csv('data/raw/constructors.csv')
driver_standings = pd.read_csv('data/raw/driver_standings.csv')
constructor_standings = pd.read_csv('data/raw/constructor_standings.csv')
qualifying = pd.read_csv('data/raw/qualifying.csv')
circuits = pd.read_csv('data/raw/circuits.csv')
Our analysis uses the comprehensive Formula 1 World Championship dataset available on Kaggle, originally compiled from the Ergast Developer API. It contains the historical record of Formula 1 racing from 1950 to 2024.
Dataset Dimensions:
- Coverage: 74 seasons (1950–2024)
- Total races: ~1,100 Grand Prix events
- Race results: ~25,000 driver-race combinations
- Files: 14 CSV files including results, qualifying, standings, circuits, and lap times
Why This Dataset?
- Sufficient data for training robust machine learning models.
- The data has been carefully curated and cleaned by the F1 analytics community, reducing preprocessing overhead and ensuring reliability.
- The dataset’s structure naturally separates pre-race information (qualifying results, standings, circuit data) from race outcomes, which is crucial for building a prediction model that avoids data leakage.
- The dataset is well-documented and widely used in motorsport analytics research, providing a strong foundation for reproducible research.
Motivation & Real-World Applications
Understanding which pre-race factors drive Formula 1 success provides actionable insights for multiple stakeholders:
- For F1 Teams: Predictive models can inform resource allocation between qualifying optimization and race-day strategy. By quantifying the relative importance of grid position, recent form, and team performance, teams can justify investments in qualifying simulations, aerodynamic development, or strategic planning.
- For Analysts & Broadcasters: Data-driven pre-race predictions enhance viewer engagement through evidence-based commentary.
- For Sports Analytics Research: This project demonstrates best practices for temporal modeling in sequential competitions: preventing data leakage in time-series data, balancing model complexity with interpretability, and validating predictions using chronological splits. These techniques transfer to other motorsports and sports with similar temporal structures.
Questions
Question 1: Which pre-race factors are most predictive of a Formula 1 driver’s race outcome? (e.g., grid position, constructor, driver form, circuit characteristics).
Question 2: Can regression models accurately predict driver finishing positions using only pre-race data, and which model performs best for this task?
- The focus will be on understanding what influences success rather than perfect prediction accuracy.
Analysis plan
We will predict a driver’s finishing position (positionOrder) using data available before or at the start of each race. This will help identify which pre-race factors contribute most strongly to race outcomes.
Data Preparation
- Merge relevant CSVs: results, races, drivers, constructors, circuits, and qualifying.
- Select only pre-race variables to avoid leakage from in-race or post-race data.
- Apply feature engineering to derive:
  - Average driver points from recent races
  - Team performance trend (constructor standings)
  - Qualifying position and grid advantage
  - Circuit-level attributes (e.g., country, type)
- Handle missing values, encode categorical variables, and normalize numerical ones.
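The merge-then-engineer steps above can be sketched as follows. This is a minimal illustration on toy data, not the final pipeline: the column names (`raceId`, `driverId`, `points`, `grid`, `year`, `round`) are assumed from the Ergast schema, and the "recent form" feature uses a shifted rolling mean so each row only sees races that finished before it.

```python
import pandas as pd

# Toy stand-ins for races.csv and results.csv (column names assumed
# from the Ergast schema; adjust if the real files differ).
races = pd.DataFrame({
    "raceId": [1, 2, 3, 4],
    "year":   [2023, 2023, 2023, 2023],
    "round":  [1, 2, 3, 4],
})
results = pd.DataFrame({
    "raceId":        [1, 2, 3, 4],
    "driverId":      [10, 10, 10, 10],
    "grid":          [2, 1, 3, 2],
    "points":        [18, 25, 15, 18],
    "positionOrder": [2, 1, 4, 2],
})

# Merge results with race metadata and order each driver's rows chronologically.
df = results.merge(races, on="raceId").sort_values(["driverId", "year", "round"])

# Average points over the driver's previous 3 races. The shift(1) is the
# leakage guard: the feature for race N uses only races before N.
df["form_last3"] = (
    df.groupby("driverId")["points"]
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
print(df[["raceId", "grid", "points", "form_last3"]])
```

The same shift-then-roll pattern extends to any per-driver or per-constructor history feature.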
Modeling Approach
We will start with multiple regression models to predict finishing position (positionOrder) for each driver:
- Linear Regression: baseline for interpretability
- Random Forest Regressor: captures non-linear interactions
- XGBoost Regressor: efficient, high-performance gradient boosting method

If results show instability or complementary strengths across models, we may test a simple ensemble (e.g., stacking) to assess improvement, but only if it adds meaningful value.
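A minimal sketch of the model comparison loop, using scikit-learn on a synthetic feature matrix (the real one would come from the merged pre-race features). XGBoost is left as a commented placeholder since it is a separate install; the feature names in the comment are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic pre-race features, e.g. grid position, recent form, team strength.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target loosely driven by the first feature (think: grid position) plus noise.
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) + 10

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # "xgboost": xgboost.XGBRegressor(...)  # add if xgboost is installed
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))  # training R^2, for illustration
```

In the actual analysis these scores would come from held-out seasons, not the training set.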
Evaluation Metrics
We will evaluate model performance using:
- Mean Absolute Error (MAE): how far, on average, the predicted position is from the true position
- R² Score: how much of the variance in finishing position the model explains

We will also compare feature importances to identify the most influential predictors (e.g., grid position, constructor, driver consistency).
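Both metrics are one call each in scikit-learn. A small worked example with hypothetical true and predicted finishing positions for five drivers:

```python
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical finishing positions for five drivers in one race.
y_true = [1, 2, 3, 4, 5]
y_pred = [1, 3, 3, 5, 4]

mae = mean_absolute_error(y_true, y_pred)  # average positions off: 0.6
r2 = r2_score(y_true, y_pred)              # variance explained: 0.7
print(mae, r2)
```

An MAE of 0.6 means predictions are off by a little over half a position on average, which is easier to communicate than R² alone.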
Validation
- Train/test split by season to mimic real-world forecasting
- 5-fold cross-validation to ensure generalization and reduce overfitting
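The season-based split can be as simple as filtering on year, holding out the most recent seasons so the model never trains on the future. A sketch on a toy table (column names assumed from races.csv; the cutoff year is illustrative):

```python
import pandas as pd

# Toy race table spanning several seasons.
df = pd.DataFrame({
    "year":          [2019, 2020, 2021, 2022, 2023, 2024],
    "positionOrder": [1, 2, 3, 1, 2, 3],
})

# Chronological split: train on the past, test on the held-out recent seasons.
cutoff = 2023
train = df[df["year"] < cutoff]
test = df[df["year"] >= cutoff]
print(len(train), len(test))
```

For the cross-validation step, a time-aware splitter (e.g., scikit-learn's TimeSeriesSplit over seasons) keeps each fold's test data strictly after its training data, consistent with the leakage-avoidance goal above.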
Weekly Plan
Weeks 1–2: Loading data, performing exploratory analysis, and engineering features. Merging the 14 CSV files and creating historical rolling averages.
Weeks 3–4: Training individual models and performing hyperparameter tuning. Establishing baseline performance and identifying the strongest individual model through cross-validation.
Week 5: Implementing the ensemble model and conducting a comprehensive evaluation.
Week 6: Performing final analysis, creating one or more visualizations, and preparing the report.