Air Quality Index and Air Pollution Prognosis Using Machine Learning Technology

This study proceeded in three major stages. The first stage was the preparation and processing of the air quality parameters. The second stage was the calculation of the AQI. In the final stage, the ML models were developed and evaluated. The framework adopted is shown in Figure 1.

Data preparation and processing

The air pollution data used in this study are available online in the real-time air pollution monitoring dataset generated using IoT, hosted on Mendeley Data^{22}. The dataset was collected hourly from January 1, 2022 to December 31, 2022, using an IoT-based monitoring system in Ghazipur, Bangladesh. It contains the concentration levels of six pollutants (PM₂.₅, PM₁₀, CO, NO₂, SO₂, and O₃), which were used to calculate the air quality index (AQI). The AQI was calculated following the US Environmental Protection Agency (EPA) methodology, using the linear interpolation equation employed by the Department of Environment (DOE) of Bangladesh and the national air quality breakpoints (see Table 1). For each pollutant, the sub-index ${i}_{p}$ was calculated using Eq. (1)^{23}, and the overall AQI was taken as the maximum of the sub-indices computed for the six pollutants.

$${i}_{p} = \frac{{i}_{high} - {i}_{low}}{{c}_{high} - {c}_{low}} \left({c}_{p} - {c}_{low}\right) + {i}_{low}$$

(1)

where:

${i}_{p}$ = AQI sub-index value for pollutant p.

${c}_{p}$ = measured concentration of pollutant p.

${c}_{low}$ = concentration breakpoint that is ≤ ${c}_{p}$.

${c}_{high}$ = concentration breakpoint that is ≥ ${c}_{p}$.

${i}_{low}$ = index breakpoint corresponding to ${c}_{low}$.

${i}_{high}$ = index breakpoint corresponding to ${c}_{high}$.

Table 1. AQI standards according to Bangladesh's DOE.
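The interpolation in Eq. (1) and the maximum rule for the overall AQI can be illustrated with a short Python sketch. The breakpoint rows below are hypothetical placeholders chosen only to make the arithmetic easy to follow; they are not the DOE values of Table 1, and the pollutant names `pollutant_a` and `pollutant_b` and the function names are likewise assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# One breakpoint row: (c_low, c_high, i_low, i_high).
Breakpoint = Tuple[float, float, float, float]

# HYPOTHETICAL breakpoints for illustration only.
# Replace them with the DOE rows of Table 1 before any real use.
EXAMPLE_BREAKPOINTS: Dict[str, List[Breakpoint]] = {
    "pollutant_a": [(0.0, 10.0, 0.0, 50.0), (10.0, 30.0, 50.0, 100.0)],
    "pollutant_b": [(0.0, 100.0, 0.0, 50.0), (100.0, 200.0, 50.0, 100.0)],
}


def sub_index(c_p: float, table: List[Breakpoint]) -> float:
    """Eq. (1): interpolate linearly inside the interval that contains c_p."""
    for c_low, c_high, i_low, i_high in table:
        if c_low <= c_p <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (c_p - c_low) + i_low
    raise ValueError(f"concentration {c_p} is outside the breakpoint table")


def overall_aqi(concentrations: Dict[str, float],
                tables: Dict[str, List[Breakpoint]]) -> float:
    """Overall AQI = maximum sub-index over all measured pollutants."""
    return max(sub_index(c, tables[name]) for name, c in concentrations.items())


sample = {"pollutant_a": 20.0, "pollutant_b": 50.0}
aqi = overall_aqi(sample, EXAMPLE_BREAKPOINTS)
```

With these placeholder rows, `pollutant_a` at 20.0 falls in the second interval and yields a sub-index of 75, `pollutant_b` at 50.0 yields 25, and the overall AQI is the larger of the two.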

To ensure data quality, box plots (Figure 2) were first used to identify and remove outliers from the raw concentration values of each pollutant. Each box plot displays the distribution of one pollutant in its actual unit of measurement: PM₂.₅ and PM₁₀ (μg/m³), CO (mg/m³), and SO₂, NO₂, and O₃ (μg/m³). After outlier removal, all variables were normalized to the 0-1 range using min-max scaling, which places the features on a common scale suitable for machine learning algorithms while preserving the shape of their original distributions. The cleaned dataset was split into 80% for training and 20% for testing. To reduce sampling bias and improve generalizability, training and testing were repeated multiple times, and 10-fold cross-validation was performed to assess model stability.
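The preprocessing steps described above can be sketched in plain Python. This is a minimal illustration, not the study's pipeline: the text does not state which whisker rule the box plots used, so the common 1.5 × IQR rule is assumed here, and the function names, the seed, and the single-column treatment are all illustrative choices. In a multi-pollutant table, a row flagged as an outlier in any column would normally be dropped as a whole.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def remove_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Drop points outside [Q1 - k*IQR, Q3 + k*IQR] (assumed box-plot rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale to the 0-1 range; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def train_test_split(rows: Sequence, test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[list, list]:
    """Shuffle once with a fixed seed, then hold out test_fraction of the rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


raw = [10, 12, 11, 13, 12, 11, 10, 12, 500]   # 500 is an obvious outlier
cleaned = remove_outliers(raw)
scaled = min_max_scale(cleaned)
train, test = train_test_split(range(100))
```

Min-max scaling is a linear map, which is why it preserves the shape of each distribution; it is also sensitive to extreme values, which is one reason to remove outliers before scaling rather than after.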

To identify the most influential input variables for AQI prediction, a random forest was employed for feature importance assessment. This technique effectively captures nonlinear relationships and interactions between variables, enabling a robust, data-driven approach to feature selection. The analysis showed that PM₂.₅ had the highest importance score (12.6654), followed by PM₁₀ (1.8387) and CO (1.7082). Although PM₂.₅ and PM₁₀ showed a moderate correlation (r = 0.3014), calculated using the Pearson correlation coefficient defined in Eq. (2), both were retained because of their clear and substantial contribution to AQI prediction. In contrast, NO₂ (0.7395), SO₂ (0.6767), and O₃ (0.6499) showed lower importance and were excluded from the final model. A bar chart summarizing these feature importance scores is shown in Figure 3 to improve the clarity, transparency, and reproducibility of the variable selection process, in line with best practices in machine learning-based environmental modeling.

$$r = \frac{\sum_{i=1}^{n} \left({x}_{i} - \overline{x}\right)\left({y}_{i} - \overline{y}\right)}{\sqrt{\sum_{i=1}^{n} {\left({x}_{i} - \overline{x}\right)}^{2}} \, \sqrt{\sum_{i=1}^{n} {\left({y}_{i} - \overline{y}\right)}^{2}}}$$

(2)

where:

$r$ = Pearson correlation coefficient.

${x}_{i}$ and ${y}_{i}$ = individual values of pollutants x and y.

$\overline{x}$ and $\overline{y}$ = means of x and y, respectively.

n = number of data points.
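Eq. (2) translates directly into a few lines of Python. The sketch below is only an illustration of the formula; the function name `pearson_r` and the small input lists are assumptions and do not come from the study's dataset.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient as written in Eq. (2)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of the paired deviations from the means.
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: product of the square roots of the summed squared deviations.
    denominator = (math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
                   * math.sqrt(sum((yi - mean_y) ** 2 for yi in y)))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant series")
    return numerator / denominator


r_example = pearson_r([1, 2, 3], [1, 3, 2])
```

For the toy series above the numerator is 1 and the denominator is 2, so `r_example` is 0.5. A perfectly linear pair such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]` gives a value of 1 up to floating-point rounding.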

Development and evaluation of ML models

The Regression Learner app is a graphical interface provided within MATLAB's Statistics and Machine Learning Toolbox^{24}. It simplifies the development and analysis of regression models for predictive modeling tasks, offering an intuitive interface for interactive data exploration, model construction, performance evaluation of algorithms, and prediction. This study uses five regression techniques: Gaussian process regression (GPR), ensemble regression (ER), support vector machine (SVM), regression tree (RT), and kernel approximation regression (KAR). Each model was selected for its theoretical suitability and prior success in environmental forecasting tasks. GPR offers probabilistic outputs and robustness to noise. ER improves generalization by aggregating multiple base learners. SVM is well suited to high-dimensional data spaces. RT offers interpretability and simplicity. KAR enhances the ability to capture complex, nonlinear relationships.

Model training was performed using the standardized input variables (PM₂.₅, CO, and PM₁₀), and hyperparameters were tuned to optimize performance. To ensure robustness and minimize overfitting, all models were validated using 10-fold cross-validation, a procedure that systematically partitions the data to reduce model bias and variance. Performance was assessed using established regression metrics, including R², RMSE, and MAE. Table 2 summarizes the detailed configuration and optimized hyperparameters applied to each model.
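The partitioning behind 10-fold cross-validation can be sketched as follows. The study ran its validation inside MATLAB's tooling, so this Python generator only illustrates the idea; the function name, the seed, and the way folds are formed (striding over shuffled indices) are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(n_samples: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(25, k=10))
```

Each sample lands in exactly one validation fold and in the training set of the other nine, so every model is scored on data it did not see during fitting. The per-fold metric values are then averaged to give the cross-validated estimate.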

Table 2. Hyperparameters applied during the training phase.

Performance assessment of machine learning models

Model evaluation is essential when predicting AQI with ML, and the Regression Learner tool reports three key metrics for this purpose: the mean absolute error (MAE), the root mean square error (RMSE), and the coefficient of determination (R²). These statistical indicators are defined by the following equations:

(a)

MAE.

This metric measures the magnitude of the errors in a set of predictions without regard to their direction. The MAE reflects the average absolute deviation between observed and predicted values across the test sample. It can be calculated using Eq. (3):

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| {x}_{i} - {y}_{i} \right|$$

(3)

where:

$n$ = number of data points.

${x}_{i}$ = actual value.

${y}_{i}$ = predicted value.
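As a quick check of Eq. (3), the following Python sketch computes the MAE for a small set of made-up numbers. The function name and the values are illustrative assumptions, not results from the study.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float],
                        predicted: Sequence[float]) -> float:
    """Eq. (3): average of the absolute differences |x_i - y_i|."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / len(actual)


# Absolute errors are 1, 0 and 2, so the mean is 3 / 3.
mae_example = mean_absolute_error([3, 5, 7], [2, 5, 9])
```

The under-prediction of 1 and the over-prediction of 2 both count as positive errors, which is what makes the metric insensitive to the direction of the error.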

(b)

RMSE.

RMSE is also used to quantify the prediction error. It is obtained by averaging the squared differences between the actual and predicted values and then taking the square root of that average, as calculated in Eq. (4):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} {\left({x}_{i} - {y}_{i}\right)}^{2}}$$

(4)

where:

${x} _ {i} $ = Actual observation.

${y} _ {i} $=Predicted value.

n = number of data points.
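Eq. (4) can be sketched in Python in the same way. The function name and the sample values below are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def root_mean_square_error(actual: Sequence[float],
                           predicted: Sequence[float]) -> float:
    """Eq. (4): square root of the mean squared difference (x_i - y_i)^2."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_squared = sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)
    return math.sqrt(mean_squared)


# Squared errors are 1, 0 and 4, so the RMSE is sqrt(5 / 3).
rmse_example = root_mean_square_error([3, 5, 7], [2, 5, 9])
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than it raises the MAE, so the RMSE is never smaller than the MAE on the same data.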

The coefficient of determination evaluates how well the model's predictions explain the variance of the observed data. Specifically, it quantifies the proportion of the total variation in the actual values that the model can explain, that is, the ratio of the variance explained by the model to the total variance observed in the data. Its values typically range from 0 to 1, with higher values indicating better model performance. An R² value approaching 1 indicates that the model's predictions closely match the actual data values. It can be calculated using Eq. (5):

$${R}^{2} = 1 - \frac{\sum_{i=1}^{n} {\left({x}_{i} - {y}_{i}\right)}^{2}}{\sum_{i=1}^{n} {\left({x}_{i} - \overline{x}\right)}^{2}}$$

(5)

where:

${x}_{i}$ = actual value.

${y}_{i}$ = predicted value.

$\overline{x}$ = average of the actual values.

n = number of data points.
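A minimal Python sketch of Eq. (5) is given below. The function name and the toy inputs are assumptions made for illustration and are unrelated to the study's results.

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Eq. (5): 1 - (residual sum of squares) / (total sum of squares)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((x - y) ** 2 for x, y in zip(actual, predicted))
    ss_tot = sum((x - mean_actual) ** 2 for x in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1 - ss_res / ss_tot


# Residual sum of squares is 1 and total sum of squares is 2.
r2_example = r_squared([1, 2, 3], [1, 2, 4])
```

A model that always predicts the mean of the actual values scores 0 under this definition, and a perfect model scores 1. A model that does worse than predicting the mean produces a negative value, so a score below 0 is a sign of a poor fit rather than a calculation error.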
