Exploration of Air Quality Index Geographic and Seasonal Factors

Proposal

Project description

Author

Affiliation

Team Name - Jack Stevens, Lila Carruthers, Rithik Mehta

College of Information Science, University of Arizona

The overall goal of our project is to generate an interactive dashboard that displays information about air quality in the United States over the past year. Ideally, this dashboard could serve as a tool for people to better understand the factors influencing air quality both in their area and across the United States more broadly.We are aiming to have visualizations for each of the major air pollutants in addition to an overall graph of the AQI over time. Further visualizations could include a rendering of the AQI scale showing the range from good to hazardous, a graph of the top 10 counties with the worst air quality on average, or another similar visual displaying patterns or outliers in our data. The dashboard will include filters for time period, season, and geographic region. During the EDA phase of our project, we may also choose to enrich the data further with population density by county or other key features that may provide further context to the dashboard. Our dashboard will be one page to ensure data can be easily compared and it is as intuitive to use as possible.

import numpy as np
import pandas as pd

Dataset

data = pd.read_csv('./data/daily_aqi_by_county_2025.csv')
print(f'Data Shape of "daily_aqi_by_county_2025.csv:"', data.shape, '\n')
print('Data Types for "daily_aqi_by_county_2025.csv":')
print(data.dtypes)
data2 = pd.read_csv('./data/daily_TEMP_2025.csv')
print(f'Data Shape of "daily_TEMP_2025.csv:"', data2.shape, '\n')
print('Data Types for "daily_TEMP_2025.csv":')
print(data2.dtypes)
data3 = pd.read_csv('./data/daily_WIND_2025.csv')
print(f'Data Shape of "daily_WIND_2025.csv:"', data2.shape, '\n')
print('Data Types for "daily_TEMP_2025.csv":')
print(data2.dtypes)

Data Shape of "daily_aqi_by_county_2025.csv:" (105869, 10) 

Data Types for "daily_aqi_by_county_2025.csv":
State Name                   object
county Name                  object
State Code                    int64
County Code                   int64
Date                         object
AQI                           int64
Category                     object
Defining Parameter           object
Defining Site                object
Number of Sites Reporting     int64
dtype: object
Data Shape of "daily_TEMP_2025.csv:" (82755, 29) 

Data Types for "daily_TEMP_2025.csv":
State Code               int64
County Code              int64
Site Num                 int64
Parameter Code           int64
POC                      int64
Latitude               float64
Longitude              float64
Datum                   object
Parameter Name          object
Sample Duration         object
Pollutant Standard     float64
Date Local              object
Units of Measure        object
Event Type             float64
Observation Count        int64
Observation Percent    float64
Arithmetic Mean        float64
1st Max Value          float64
1st Max Hour             int64
AQI                    float64
Method Code              int64
Method Name             object
Local Site Name         object
Address                 object
State Name              object
County Name             object
City Name               object
CBSA Name               object
Date of Last Change     object
dtype: object
Data Shape of "daily_WIND_2025.csv:" (82755, 29) 

Data Types for "daily_TEMP_2025.csv":
State Code               int64
County Code              int64
Site Num                 int64
Parameter Code           int64
POC                      int64
Latitude               float64
Longitude              float64
Datum                   object
Parameter Name          object
Sample Duration         object
Pollutant Standard     float64
Date Local              object
Units of Measure        object
Event Type             float64
Observation Count        int64
Observation Percent    float64
Arithmetic Mean        float64
1st Max Value          float64
1st Max Hour             int64
AQI                    float64
Method Code              int64
Method Name             object
Local Site Name         object
Address                 object
State Name              object
County Name             object
City Name               object
CBSA Name               object
Date of Last Change     object
dtype: object

We decided to use the United States Environmental Protection Agency’s (EPA) 2025 daily Air Quality Index (AQI) dataset by county. It contains 105,869 rows with 10 variables showing insight into daily air quality and how it changes over days, months, and seasons. The dataset comes from collections of air quality measurements by specific monitoring stations over the year. Key variables include the location of the measurement, the AQI value (standardized value representing the pollution level), and the air quality category (qualitative variable representing how poor or great the air quality is on a scale).

This dataset was ideal for our goal to create an interactive dashboard portraying visual patterns and trends with air quality in different regions and seasons over time. Additionally, it will allow for further discovery into the differences in air quality over weeks, months and the whole year. Furthermore, being able to create a dashboard with real environmental implications was important as we would be able to visualize how pollution has affected certain regions over the last year. Using daily data with several data points will also allow us to create a more accurate representation of pollution and air quality in different regions and seasons over 2025. Ultimately, the EPA’s daily AQI index dataset was ideal for portraying air quality in an interactive dashboard through data driven visualization.

Variables:

State Name: US State
County Name: US County Name
State Code: Code for State
County Code: Code for County
Date: date of measurement
AQI: Air Quality Index value
Category: Qualitative value for AQI
Defining Paramter: AQS code corresponding to the parameter measured
Defining Site: Site corresponding to the site where measurement was reported from
Number of Sites Reporting: Total amount of sites with recording measurement

Data Types:

State Name: String
County Name: String
State Code: Int
County Code: Int
Date: String
AQI: Int
Category: String
Defining Parameter: String
Defining Site: String
Number of Sites Reporting: Int

Questions

How the Air Quality Index (AQI) changes across different regions and over time: Are there any extreme fluctuations in AQI within certain regions or shifts in patterns throughout the year?
How do AQI factors change over time: Are there seasonality or geographic factors that contribute to the AQI or its key contributors?

Analysis plan

Week 1 - EDA / Preprocessing - Lila

This week should be focused on understanding the data and any interesting relationships between the features and the target variable (AQI value). The focus should be handling any data quality issues (missing value, encoding categorical variables, scaling values) and removing any unnecessary columns. By the end of this week, we should have a summary of key findings from the EDA and a preprocessed dataset.

Week 2 - Modeling - Rithik

This week should be focused on performing advanced modeling techniques ( OLS and other regression models ) to understand the importance of the features and rank the influence that the features have on the target variable. By the end of this week, we should have a list of features that contribute to the Air Quality Index.

Week 3 - Dashboard Wireframing - Jack

This week should be focused on designing and mapping out the metrics to use on the interactive dashboard. This should include the type of visuals, the expected interactions of variables and require data structures to build the report. By the end of this week, we should have a rough sketch of the dashboard that answers the questions above.

Week 4 - Dashboard Building - Jack

This week should be focused on implementing wireframed dashboards using the Stream-lit library. This should include the building, testing, and finding key interactions between the visuals to demonstrate during the final presentation. By the end of this week, we should have a deployed dashboard with 2-3 patterns identified by the dashboard that demonstrate correlation between the AQI value and chosen features of the dashboard.

Week 5 - Final Report

This week should be focused on completing the final write-up and presentation. The group members for each week should discuss their decision about the work they were responsible for. By the end of the week, we should have a finalized report and recorded video.

Repository Structure

Data-

This folder should contain the datasets that are being used for this analysis. This folder should have a README.md that contains the description of the dataset as well.

Image-

This folder should contain any graphs / images that we plan on using for the presentation. This is where the wireframe of the dashboard will be stored.

notebook-

This folder should contain the work that we complete each week. These notebooks should be used to demonstrate analytical thought process behind the problem and experimentation.

src-

This is where any scripts will be stored to referenced in the notebooks. This could be any custom graphing, preprocessing, or model validation functions. This folder will also contain the script for the dashboard.

All other files / folders

This items will be used to generate the final presentation and site of our findings.