Evaluating machine learning models for air quality error mapping in Kraków, Poland

Mateusz Zaręba; Szymon Cogiel; Elżbieta Węglińska; Tomasz Danek

doi:10.2478/mgrsd-2025-0026

Full Article

Introduction

The smart city concept is one of the urban development approaches based on using technology to improve residents’ quality of life (Firszt & Jabłoński 2023). One of the key elements of the smart city concept is clean air (Liang & Gong 2020). Kobus et al. (2018) emphasized the importance of delivering air quality information to city residents to raise awareness about this critical issue. Jonek-Kowalska (2023) pointed out the lack of long-term studies on air pollution in cities aspiring to become smart cities. Zareba et al. (2024a) performed an in-depth analysis of air pollution concentration to evaluate the effects of energy transition and environmental policies, aiming to understand the long-term environmental consequences of these actions. Kraków (Figure 1) has progressed in terms of air pollution but extreme smog episodes (ESE) persist (Danek & Zareba 2021; Zareba & Danek 2021). There are no strict PM limit values defining smog but, in Poland, the limit values for PM₁₀ are set at 100 µg/m³ as the information threshold and 150 µg/m³ as the alarm level, while, for PM_2.5, Poland has no official 24-hour limit (Rada Ministrów 2019); however, WHO recommends a 24-hour guideline of 15 µg/m³ for PM_2.5 and 45 µg/m³ for PM₁₀ (WHO 2021). Despite a solid fuel ban, smog remains a problem in Kraków due to its geography (Morawska-Horawska & Lewik 2003). The city’s location along the Wisła River valley, surrounded by hills, causes pollution to flow into Kraków from neighbouring towns where the burning ban is not in effect (Danek et al., 2022). While reference stations provide highly accurate air quality data, they are expensive and limited in coverage. In contrast, low-cost sensors (LCS) are a reliable and affordable choice for smart cities to monitor air pollution on a larger scale, even though they are sensitive to humidity (Di Antonio et al. 2018). As part of the Internet of Things (IoT), these sensors provide real-time air quality data, helping cities to track harmful emissions and take action to improve public health. Smart cities use web-based and remote monitoring to better understand pollution levels in public and green spaces. By identifying major sources of pollution and informing the public, cities can encourage eco-friendly practices, promote sustainable policies, and reduce overall pollution levels for a cleaner and healthier environment (Ullah et al. 2023). Despite the high availability of data from LCS, reliable pollution predictions are challenging and depend on the model choice, especially for ESE. Previous studies successfully demonstrated a distinct seasonality in air pollution levels in Kraków (Zareba et al. 2024b) and introduced machine learning (ML) techniques that proved highly effective for spatiotemporal analysis of PM_x distributions (Zareba et al. 2023).

This paper’s novelty lies in a detailed examination of the spatial distribution of prediction errors across various algorithms. The study presents the results of global models, using XGBoost (Chen & Guestrin 2016) and DLinear (Zeng et al. 2022), and evaluating their effectiveness compared to the classic ARIMA (Box & Jenkins, 1970) method. In previous research by Zareba et al. (2024a), these models were used to calculate 24-hour forecasts of PM_2.5 concentrations at the sensor locations, and the errors between the real hourly data and the model predictions were calculated. The 24-hour forecast is generated by predicting each hour separately, one step at a time, for the next 24 hours using the trained model. This study goes further by explaining how these errors vary spatially. A hot-spot analysis was conducted for prediction errors from each method, comparing an ESE with a low pollution day, using the local Getis-Ord G_i* spatial autocorrelation method. We consider a smog day as one in which a PM_2.5 episode exceeding 50 μg/m³ occurs – a threshold commonly used for PM₁₀. A day on which this limit value is not exceeded is classified as a low pollution day. An interpretation of other factors, such as location and proximity to water bodies, was also carried out to assess their potential influence on prediction errors. In this study, prediction error refers to the discrepancy between the observed PM_2.5 concentrations from the LCS and the values predicted by the models. Depending on the context, this is quantified using metrics such as RMSE (Root Mean Squared Error) or MAE (Mean Absolute Error). Rather than solely evaluating overall prediction accuracy, this study focuses on understanding the factors influencing model performance in different geographic conditions. It should be noted that models are compared with LCS data, so result from inaccuracies in the measurement values, not solely from incorrect model predictions. By leveraging explainable AI (XAI) techniques, we analyse why specific models perform better in certain areas, and under what conditions errors increase, aiming to uncover whether a single predictive model can generalize across Kraków or if a hybrid approach is necessary. Our findings indicate that different models exhibit varying sensitivity to local environmental factors, such as humidity and proximity to water bodies, which significantly impact prediction reliability. This suggests that selecting the most suitable predictive approach requires not just statistical performance evaluation but also an understanding of spatial dependencies, ultimately contributing to more interpretable and effective air pollution forecasting.

Materials and Methods

The impact of the Low-Emission Reduction Programme (PONE) in Kraków was analysed for the period from 2014 to 2019 (Kraków 2019). Spatiotemporal data from 52 Airly LCS (Airly 2024) were used. LCS may have lower accuracy than reference stations but their coverage allows for a quasi-uniform spatial distribution across the city. In previous research on the sensors used in this study, the median difference from reference stations was approximately 3.0 µg/m³, with pollution levels reaching around 80.0 µg/m³ on some days (Danek & Zareba 2021). While LCS have inherent uncertainties compared to reference stations, the Airly LCS in this study underwent manufacturer calibration using AI/ML and weather corrections, achieving performance validated within approximately 10% of reference measurements. The data accessed via API were already adjusted through this manufacturer calibration process. These inherent uncertainties do not affect the analysis of ML prediction errors, as the same calibrated data were used both for training and evaluation. In studies comparing two sensors side by side, the Airly LCS uncertainty was 2.4 µg/m³ for PM_2.5 (Zięba & Dworakowska 2018). In other studies, Airly LCS without prior calibration demonstrated strong correlations with reference instruments for PM_2.5 (0.77 < R² < 0.90) (South Coast Air Quality Management District n.d.). The impact of the Low-Emission Reduction Programme (PONE) in Kraków was analysed for the 2014 –2019 period (Kraków 2019). An automated pipeline was developed to clean the data, fill in missing values using linear interpolation, perform preprocessing, and conduct feature engineering. Missing value imputation can also be done using machine learning methods (Arnaut et al. 2024). During the feature engineering phase, cyclical features (e.g. day, hour, wind, etc.), sunset times, and seasonal trends were created for different lags. The training process used 168 unique lagged features, derived from 28 input data features. The models were trained with global and local approaches for the PM_2.5 forecast: XGBoost and DLinear from Darts library (Herzen et al. 2021) trained on all LCS (global), while ARIMA trained on each one separately. The training period spanned from 1 January 2022 to 20 October 2022 at 01:00:00. Backtesting, a method for evaluating predictive models on historical data, was then performed using a rolling window with a stride of 1 from 20 October 2022 at 01:00:00 to 1 January 2023. The backtesting concept is described in Herzen et al. (2021). The model trained on 80% of the data and made 24-hour forecasts at each step, updating iteratively with new data. The testing dataset consisted of the remaining 20% of the data. In this study, specific dates were selected based on concentration analysis from Zareba et al. (2024a), and the model training followed the methodology outlined in that study. The XGBoost model is a scalable and efficient gradient-boosting framework that builds additive regression trees. Additionally, XGBoost is highly effective for time series forecasting because it can include additional covariates that influence the target feature. Its ability to extract and model complex, general features from data ensures robust and accurate predictions, making it well suited for spatial forecasting tasks.

XGBoost aims to minimize the following objective function, which includes a regularization term to prevent overfitting, as shown in Eq. (1): (1) $O b j (Θ) = \sum_{t = 1}^{n} l (y_{t}, \hat{y_{t}}) + \sum_{k = 1}^{K} Ω (f_{k})$ Obj\left( \Theta \right) = \sum\limits_{t = 1}^n {l\left( {y_t ,\widehat{y_t }} \right) + \sum\limits_{k = 1}^K {\Omega \left( {f_k } \right)} } where l is a differentiable convex loss function (e.g. mean squared error), y_t represents the measured PM_2.5 concentration at the t-th hour for a given sensor, y_t represents the predicted PM_2.5 concentration at the t-th hour, Ω(f_k) is the regularization term for the k-th tree f_k, and Θ represents the parameters of all trees. The XGBoost model was trained using the following hyperparameters: number of boosting rounds = 500 (controlling model complexity), maximum tree depth = 6 (capturing complex patterns), learning rate = 0.3 (adjusting weight updates per round), L2 regularization = 1 (mitigating overfitting), and output chunk length = 24 (predicting 24 future time steps at once). The lags used were [−48, −24, −2, −1] for the main variables, [−48, −24, −12, −8, −2, −1] for past covariates, and [−8, −4, −3, −2, −1, 0] for future covariates. Following model training and backtesting, the XGBoost model achieved an MAE of 4.104 and an RMSE of 6.763, indicating a satisfactory level of predictive accuracy for the given forecasting task.

The ARIMA method (Box & Jenkins 1970) is an Autoregressive Integrated Moving Average model, used for analysing and forecasting time series data. It is defined by the equation Eq. (2): (2) $y_{t} = c + ϕ_{1} y_{t - 1} + ϕ_{2} y_{t - 2} + \dots + ϕ_{p} y_{t - p} + θ_{1} ɛ_{t - 1} + θ_{2} ɛ_{t - 2} + \dots θ_{q} ɛ_{t - q} + ɛ_{t}$ y_t = c + \phi _1 y_{t - 1} + \phi _2 y_{t - 2} + \cdots + \phi _p y_{t - p} + \theta _1 \varepsilon _{t - 1} + \theta _2 \varepsilon _{t - 2} + \cdots \theta _q \varepsilon _{t - q} + \varepsilon _t where y_t is the differenced PM_2.5 time series at time t, c is a constant term, ϕ₁, ⋯,ϕ_p are the autoregressive (AR) coefficients, θ₁, ⋯,θ_q are the moving average (MA) coefficients, and ϵ_t is the error term at time t. For comparison, the ARIMA model underperformed with an MAE of 6.688 and an RMSE of 9.282, confirming that XGBoost captured the underlying patterns more effectively.

The DLinear model (Zeng et al. 2022) is a deep learning architecture designed for long-term time series forecasting. It focuses on modelling the linear dynamics by decomposing the time series and employing linear transformations. DLinear achieved the best results, with an MAE of 2.947 and an RMSE of 3.888. SHAP (SHapley Additive exPlanations) values were computed following the methodology of Lundberg & Lee (2017) to quantify the contribution of each meteorological factor according to the model’s predictions.

The study focused on global models (linear neural networks or ensemble trees) and the local ARIMA method to assess predictive potential using data from either all sensors or individual ones. The hot-spot analysis was conducted using the Getis-Ord G_i^* (High/Low Clustering) tool, implemented in ArcGIS Pro (ESRI 2021). This tool identifies statistically significant clusters of high and low values. The Getis-Ord local statistic G_i^* (Getis & Ord 1992) is calculated according to the formula Eq. (3): (3) $G_{i}^{*} = \frac{\sum_{j = 1}^{n} w_{i j} x_{j} - \bar{X} \sum_{j = 1}^{n} w_{i j}}{S \sqrt{\frac{n \sum_{j = 1}^{n} w_{i j}^{2} - {(\sum_{j = 1}^{n} w_{i j})}^{2}}{n - 1}}}$ G_i^ * = {{\sum\nolimits_{j = 1}^n {w_{ij} \,x_j - \overline X } \sum\nolimits_{j = 1}^n {w_{ij} } } \over {S\sqrt {{{n\sum\nolimits_{j = 1}^n {w_{ij}^2 - \left( {\sum\nolimits_{j = 1}^n {w_{ij} } } \right)^2 } } \over {n - 1}}} }} where x_j is the attribute value of x at location j, n is the number of location points, w_ij is the spatial weight between locations i and j, X̅ is the mean, and S is the standard deviation of the attribute.

Results

The research uncovers spatiotemporal dynamics in air pollution distribution. Long-term observations in Kraków show that ESE remain significant despite an overall decline in pollution.

Figure 2 presents a clear decrease in yearly PM_2.5 averages, with the largest drop between 2015 and 2016. Similar trends have been seen in neighbouring Slovakia, linked to winter heating practices (Sediva & Stefanik 2024). A linear relationship exists between the amount spent on the PONE (Kraków 2019) programme and the removal of coal furnaces and boilers. A similar behaviour can be observed in renewable energy sources.

Violin plots show prediction errors for ESE on 30 December at 23:00 (Figure 3) and steady low pollution on 12 December at 10:00 (Figure 4). The DLinear model accurately captured rare values on both analysed days, with the highest error below 25 µg/m³. For the ESE, the highest prediction errors occurred with the XGBoost model (orange), reaching nearly 150 µg/m³. On the no-smog day, the highest error was from the ARIMA model (green), at 30 µg/m³, while the highest error for DLinear on this day was less than 7 µg/m³. Errors for the DLinear models (blue) on both days were around the LCS accuracy level of 10 µg/m³. For the ARIMA model during ESE days, two groups can be observed – one with the majority of errors around 12.5 µg/m³ and the other with around 37.5 µg/m³.

Figure 5 presents the distribution of errors for the DLinear, XGBoost, and Arima models – for both ESE and steady low pollution days. For the DLinear model during ESE, the errors across the entire area are relatively low and distributed along a NW–SE axis, with a tendency towards a latitudinal orientation in the northern part. The largest errors are associated with LCS 10 (Krzywaczka), 41 (Trojanowice), 43 (Karniów), and the 30–36–37 group (in the vicinity of the open swimming and fishing area in Zabierzow Bochenski). The lowest errors are observed for LCS 8 (Kwapinka), 17 (Rzeszotary), 25 (Zwierzyniec-Kraków city), and 32 (Aleksandrowice). These XGBoost errors during ESE, are significantly larger (7–8 times) than those using the DLinear model. Three distinct areas of high errors are evident, distributed across various sections of the studied terrain. The largest error is for LCS 10 (Krzywaczka), located in a locally isolated valley near the Skawinka River. This is followed by LCS 22 (Zagórze), situated in a local depression at the confluence of small streams with the Podlezynka River and a small pond, and LCS 37 (Chabot) in the open fishing and swimming area and floodplain of the Wisła River. The ARIMA error distribution during ESE, may be described as the superposition of DLinear and XGBoost surface locations of the highest errors. Error values lie between those obtained with the above-mentioned ML models. The errors are significantly lower across all methods during steady low pollution days. While a latitudinal distribution characterizes the DLinear model, the XGBoost model exhibits a more longitudinal orientation, whereas ARIMA appears to fall somewhere in between. The area around the floodplains near Zaborze Bocheńskie is consistently characterized by relatively high errors, which is not evident in the case of ARIMA. For both XGBoost and ARIMA, the largest error is observed in LCS 49, located in Proszowice.

Figure 6 presents the relative importance of meteorological factors across LCS 10, 19, and 37, as computed from contributions to the total SHAP values attributed to each sensor. The influence of various phases of water is observable. Relative humidity and soil moisture are identified as the most significant factors, with air temperature typically occupying the third position in terms of importance. This finding is particularly noteworthy, as existing research indicates that the presence of water vapour, such as fog, can act as a precursor to pollution accumulation (Tsyro 2005). Furthermore, it is important to consider that optical measurements derived from LCS may be affected by the presence of H₂O, thereby potentially leading to increased measurement errors under these conditions (Peltier et al. 2020).

Figure 7 shows hot-spot and cold-spot maps with their confidence level for PM_2.5 error prediction using Getis-Ord G_i* for ESE and low pollution days. The results were presented for each model using: DLinear, XGBoost, and ARIMA. This method only allowed hot spots to be identified. The largest prediction errors for the DLinear method were observed at LCS 10 (95% confidence level), and LCS 4, 41, and 43 (90% confidence level) during smog events. For the low pollution day, significant errors were noted at LCS 4, 36, and 37 (90% confidence level). Exceptionally high prediction errors for the XGBoost model were observed during the ESE at LCS 10 (99% confidence level), and at LCS 19 and 37 (95% confidence level). On the low pollution day, error hot spots were identified at LCS 31, 49, and 50 (99% confidence level) and LCS 39 and 39 (90% confidence level). For the ARIMA model, error hot spots were identified with 99% confidence at LCS 9, with 95% confidence at LCS 23, and with 90% confidence at LCS 24. On the low pollution day, a hot spot was identified with 99% confidence at LCS 49 and with 90% confidence at LCS 3.

Discussion

Short- and long-term PM_2.5 data, combined with PONE data, enabled a spatiotemporal assessment of ML methods in air quality management.

For long-term analyses, a downward trend in average PM_2.5 concentrations is evident. Unfortunately, despite generally good annual average concentrations, Kraków continues to experience ESE due to the inflow of pollutants into the city. In general, financial expenditure for the PONE programme has been significant but has been declining since 2017. It can be observed that the reduction in funding directly impacts the number of replaced furnaces and the installation of new renewable energy sources. It is essential to increase funding and intensify efforts at the regional level, so that not only the city of Kraków but also the surrounding areas, which potentially produce pollution, employ environmentally friendly heating sources. Such actions would yield health benefits for the entire region.

Prediction errors play a key role in both model evaluation and XAI. Defined as the difference between observed and predicted PM_2.5 values, they provide insights into model reliability, while also revealing spatial and temporal patterns that can improve interpretability. Unusual errors may indicate sensor limitations, missing features, or complex relationships that are not effectively captured by the model. Building on previous research by Zareba et al. (2024a), this study shows that errors are not just performance metrics but can enhance our understanding of model behaviour. Spatial patterns in errors highlight areas where predictions are less reliable, offering a way to cross-check feature importance and detect biases. PM_2.5 prediction errors during ESE varied significantly – LCS 10, 17, and 22 had very high errors, especially for the XGBoost model (over 130 µg/m³), while for the DLinear models, the errors are much lower (20 µg/m³ or less). The ARIMA local model is challenging to evaluate: it shows lower errors than XGBoost in some cases but tends to have higher errors in others. Overall, DLinear is the most accurate model, outperforming XGBoost due to specific data characteristics and model mechanics. For example:

DLinear models are designed to predict temporal dependencies and, thanks to their linear architecture, they can be more reactive to rapid changes in datasets, handling outlier predictions (like ESE) better. In contrast, the XGBoost model is not primarily designed for time series analysis and may struggle to learn temporal patterns.
The DLinear model architecture uses trend, seasonality, and residuals decomposition, which naturally allows both the general trend and short-term rare events, such as smog spikes, to be captured. XGBoost, which is based on ensemble methods with decision trees, generally provides good predictions but prioritizes overall trend predictions rather than rare peaks, as splits are done based on the most representative part of the data, not rare events. The DLinear model’s sensitivity to ESE is higher than that of XGBoost. It is important to mention that the analysis of high-error areas from XGBoost errors can be used as a feature or be part of the interpretation.
The loss function in XGBoost is typically mean square error, which is not optimal for learning patterns with rare peaks, as the goal is to minimize the error for the majority of the data. DLinear models, on the other hand, typically use a quantile loss function that is more inclusive of large errors, allowing the model to learn from spike events.

A key direction for further research is to use errors as lagged features in hybrid models. Capturing spatial and temporal dependencies in errors could improve both predictive accuracy and interpretability, aligning with XAI principles to make air quality models more transparent and trustworthy.

Conclusions

For annual average PM_2.5 concentrations, a clear decreasing trend is evident. The yearly PM_2.5 average levels (Bujaka station) have declined from 44 µg/m³ to 20 µg/m³. There is a connection between the PM_2.5 trend and financial investments in the PONE programme, the number of coal-fired stoves or boilers replaced, and the number of installations based on renewable energy sources. Despite these positive actions, Kraków struggles with ESE, primarily due to the influx of pollutants from outside the city. Predicting these rare episodes is crucial and requires a complex spatiotemporal approach, as the dynamics of the atmosphere and PM_2.5 migration pathways and deposition trends are characterized by a high degree of complexity. The research analysed ESE and low pollution days. One of the key findings of our study is that DLinear models consistently outperform other methods in air pollution forecasting, particularly for rare smog episodes, when trained as a global model across all monitoring stations. Prediction errors are generally the lowest and least varied for global linear models of neural networks (DLinear). The global XGBoost model exhibited the largest errors, reaching nearly 130 µg/m³ (13 times the accuracy of the particulate matter measuring device) on ESE days, while, on days without such episodes, this dependency was exhibited by the local model of traditional ARIMA prediction. Detailed feature contribution analysis for LCS with the highest errors showed a clear connection with vapour-related features.

Evaluating machine learning models for air quality error mapping in Kraków, Poland

Full Article

Paradigm

My account