work/project/2025-03-23

India Air Quality Dataset

A reverse-engineered AQI dataset built to estimate pollutant concentrations from CPCB AQI data, forming part of the data foundation behind BreatheEasy.

This project was created as part of the early data engineering work behind BreatheEasy. While CPCB datasets provided AQI values, they did not expose pollutant-level data in a usable format for the forecasting pipeline we were building.

The challenge was not collecting AQI data — it was reconstructing the missing pollutant information required to make the system usable.

The Problem

Most publicly accessible CPCB datasets focused on AQI values rather than detailed pollutant concentrations. For machine learning workflows this was a major limitation: the system lacked the pollutant-level granularity needed for forecasting and environmental analysis.

  • AQI data available without structured pollutant values
  • Limited ML-ready preprocessing
  • Inconsistent accessibility across cities
  • No unified estimation workflow

Reverse Engineering Pollutants

To bridge this gap, I developed a generalized estimation system that derives approximate pollutant concentrations from AQI values using reverse-calculated linear scaling: each pollutant is estimated as a fixed multiple of the AQI.

Instead of city-specific tuning, the system intentionally used generalized estimation coefficients that could scale consistently across multiple datasets.

excel
PM2.5 = ROUND(AQI * 0.55, 2)
PM10  = ROUND(AQI * 1.08, 2)
NO2   = ROUND(AQI * 0.83, 2)
SO2   = ROUND(AQI * 1.14, 2)
CO    = ROUND(AQI * 0.0105, 2)

The formulas were designed through iterative testing and adjustment to reduce excessive deviation while still maintaining usable relationships between AQI and estimated pollutant values.
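
The scaling rules above translate directly into a small Python function. This is a minimal sketch, not the project's actual code: the names `COEFFICIENTS` and `estimate_pollutants` are hypothetical, and `Decimal` with `ROUND_HALF_UP` is used on the assumption that the goal is to match Excel's `ROUND`, which rounds halves away from zero (Python's built-in `round` rounds halves to even).

```python
from decimal import Decimal, ROUND_HALF_UP

# Coefficients copied from the spreadsheet formulas in this section.
COEFFICIENTS = {
    "PM2.5": Decimal("0.55"),
    "PM10": Decimal("1.08"),
    "NO2": Decimal("0.83"),
    "SO2": Decimal("1.14"),
    "CO": Decimal("0.0105"),
}


def estimate_pollutants(aqi):
    """Return approximate pollutant values for one AQI reading, rounded to 2 places."""
    aqi_dec = Decimal(str(aqi))
    return {
        name: float(
            (aqi_dec * coeff).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
        )
        for name, coeff in COEFFICIENTS.items()
    }


estimates = estimate_pollutants(68)
print(estimates)
# {'PM2.5': 37.4, 'PM10': 73.44, 'NO2': 56.44, 'SO2': 77.52, 'CO': 0.71}
```

For an AQI of 68 the result matches the Bangalore sample row in the dataset.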

Dataset Structure

The resulting dataset combined AQI information with estimated pollutant concentrations and weather parameters, creating a significantly richer structure for experimentation and forecasting workflows.

csv
City, Date, AQI, PM2.5, PM10, NO2, SO2, CO, O3
Bangalore, 2018-01-01, 68, 37.4, 73.44, 56.44, 77.52, 0.71, ...

+ Weather Parameters
- Temperature
- Humidity
- Pressure
- Cloud Cover
- Wind Speed
- Rainfall
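
A file with this layout could be read with Python's standard `csv` module. The sketch below is an assumption about how a consumer might load it, not the repository's loader: `load_rows` and `NUMERIC_COLUMNS` are hypothetical names, the weather columns are omitted because their exact headers are not shown here, and the O3 value is left empty because it is elided in the sample row.

```python
import csv
import io

# Header and row copied from the sample above; the O3 value is elided in the
# source, so it is left empty here rather than guessed.
SAMPLE = """City, Date, AQI, PM2.5, PM10, NO2, SO2, CO, O3
Bangalore, 2018-01-01, 68, 37.4, 73.44, 56.44, 77.52, 0.71,
"""

NUMERIC_COLUMNS = ("AQI", "PM2.5", "PM10", "NO2", "SO2", "CO", "O3")


def load_rows(text):
    """Parse dataset text into dicts; numeric columns become float, or None if empty."""
    # skipinitialspace drops the space that follows each comma in this layout.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for raw in reader:
        row = {"City": raw["City"].strip(), "Date": raw["Date"].strip()}
        for name in NUMERIC_COLUMNS:
            value = (raw.get(name) or "").strip()
            row[name] = float(value) if value else None
        rows.append(row)
    return rows


rows = load_rows(SAMPLE)
print(rows[0]["City"], rows[0]["AQI"], rows[0]["O3"])
```

Keeping missing values as `None` instead of zero avoids treating an absent reading as a real measurement in later steps.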

Coverage

Cities Covered: 5 Indian Cities
Core Source: CPCB AQI Data
Dataset Type: ML-Oriented AQI Dataset
Purpose: Forecasting Foundation

Impact on BreatheEasy

Although the dataset was not perfectly accurate, it became an important experimental foundation during the development of BreatheEasy. It helped identify what the forecasting pipeline lacked, how pollutant relationships behaved, and what additional data engineering steps were required.

This project was less about precision and more about building missing infrastructure when clean data did not exist.

Reflection

One of the hardest parts was finding generalized formulas that produced stable approximations without creating extreme deviation across cities. The process involved repeated experimentation, comparison, and refinement.
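
This write-up does not describe how deviation was measured, so the following is only one plausible way to run that kind of comparison, assuming reference pollutant readings are available for at least some dates. The function names `fit_coefficient` and `mean_abs_pct_deviation` are hypothetical. The first fits a single through-origin scaling factor by least squares; the second reports how far a given coefficient strays from the reference values, which could be computed per city to spot extreme deviation.

```python
def fit_coefficient(aqi_values, reference_values):
    """Least-squares slope k for reference ≈ k * AQI (line through the origin)."""
    numerator = sum(a * r for a, r in zip(aqi_values, reference_values))
    denominator = sum(a * a for a in aqi_values)
    if denominator == 0:
        raise ValueError("at least one non-zero AQI value is required")
    return numerator / denominator


def mean_abs_pct_deviation(aqi_values, reference_values, coefficient):
    """Average of |estimate - reference| / reference, as a percentage."""
    deviations = [
        abs(coefficient * a - r) / r * 100
        for a, r in zip(aqi_values, reference_values)
        if r != 0  # skip zero references to avoid division by zero
    ]
    if not deviations:
        raise ValueError("no non-zero reference values to compare against")
    return sum(deviations) / len(deviations)


# The only AQI/PM2.5 pair shown in this write-up (Bangalore, 2018-01-01).
k = fit_coefficient([68], [37.4])
deviation = mean_abs_pct_deviation([68], [37.4], 0.55)
print(round(k, 4), round(deviation, 4))
```

Because that sample row was itself generated with the 0.55 coefficient, the fit simply recovers it; with independent reference measurements the same functions would give a best-fit value and a per-city error figure.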

This project reinforced an important lesson: data science is often constrained less by models and more by data availability, quality, and structure.

Project Access

View Dataset Repository on GitHub

System Dependencies

Python, Pandas, CPCB Data, Excel

Reference UID

CPP-2025-AQI-DATASET

Data Integrity

Verified Stable / 2025-03-23
