Machine Learning-Based Air Pollution Monitoring And Forecasting

2025 · doi:10.21203/rs.3.rs-8328705/v1

preprint OA: closed


Research Article

Bindu sri.Mallula, M. N. Ravindra Babu

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-8328705/v1

This work is licensed under a CC BY 4.0 License.

Status: Posted (Version 1)

Abstract

Today, governments in developing countries are increasingly focused on managing air pollution, which results from vehicle fuel use, industrial operations, and the burning of waste. Poor air quality is a pressing health issue and is commonly assessed using PM2.5 levels among other variables. Accurate prediction and ongoing monitoring are crucial for pollution control.
In this work, advanced machine learning and deep learning models, namely CatBoost, XGBoost, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) neural networks, are implemented and evaluated to forecast future air pollution levels and the Air Quality Index (AQI) using historical data on PM2.5, NH3, CO, NO, NOx, NO2, and SO2. These techniques are compared with traditional models to assess their prediction accuracy and robustness. By leveraging daily atmospheric datasets from Indian cities, the study demonstrates that modern ensemble and deep learning approaches can provide improved and more reliable forecasts of air quality, supporting data-driven public health interventions and policy decisions.

Keywords: Artificial Intelligence and Machine Learning, Air Quality Index (AQI), PM2.5, air pollution forecasting, machine learning (ML), deep learning (DL), CatBoost, XGBoost, Support Vector Regression (SVR), Long Short-Term Memory (LSTM), gradient boosting, ensemble models, time-series prediction

INTRODUCTION

Air pollution is a critical environmental and public health issue, especially in rapidly urbanizing regions where industrial growth and vehicular emissions continue to worsen air quality. Accurate prediction of air pollution levels, such as PM2.5 concentrations and the Air Quality Index (AQI), is essential for informed policy-making to mitigate health risks and environmental impacts. Traditional statistical methods have been widely used, but they often struggle to capture the complex, non-linear, and temporal characteristics of air pollution data.

Recent advancements in machine learning and deep learning provide powerful tools to address these challenges. Algorithms like CatBoost and XGBoost, which are gradient boosting ensemble methods, have shown superior predictive accuracy by effectively handling high-dimensional data and complex feature interactions.
Meanwhile, Support Vector Regression (SVR) offers robustness against outliers in regression tasks, enhancing prediction reliability. Deep learning architectures, particularly Long Short-Term Memory (LSTM) networks, excel at modeling temporal dependencies, making them well suited to forecasting air quality time series.

This study focuses on incorporating these advanced algorithms into air quality prediction models, extending beyond previously used approaches like decision trees and random forests. By evaluating CatBoost, XGBoost, SVR, and LSTM on a comprehensive dataset of atmospheric pollutant measurements from various Indian cities, this work aims to improve forecasting accuracy and robustness. The integration of these methods offers a pathway for more reliable pollution monitoring, which is vital for public health interventions and emission control strategies.

2. A BRIEF LITERATURE REVIEW

Recent studies in air pollution monitoring and forecasting have extensively employed traditional machine learning models such as Linear Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Logistic Regression to predict air quality indices based on pollutants like PM2.5, SO2, NO2, and CO. These models have demonstrated satisfactory accuracy in classifying pollution levels and forecasting air quality trends in various urban areas. However, while robust, these conventional techniques have limitations in capturing complex nonlinear patterns and spatiotemporal dependencies in air pollution data.

To address these challenges, advanced machine learning methods such as the gradient boosting algorithms CatBoost and XGBoost have emerged as powerful tools. These models effectively handle heterogeneous data and feature interactions, improving prediction accuracy. Additionally, Support Vector Regression (SVR) has been used for its robustness in modeling nonlinear relationships with less risk of overfitting.
More recently, deep learning architectures like Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) have gained prominence due to their ability to capture temporal and spatial dependencies, respectively. Hybrid models combining LSTM and CNN, often enhanced with attention mechanisms, have further advanced forecasting capabilities by integrating multimodal data sources.

Furthermore, the integration of explainable AI techniques into air quality prediction models is becoming vital to interpret and validate the results for policymaking and public awareness. These developments mark a significant evolution from earlier machine learning approaches, offering enhanced accuracy, interpretability, and adaptability in air pollution forecasting systems. Incorporating these algorithms promises to improve real-time air quality monitoring and enable more effective environmental management strategies.

PROPOSED METHODOLOGY

Training Phase
• Preprocess the air quality dataset to handle missing values and balance classes, employing techniques like SMOTE for uniformity.
• Split the data into training and validation sets (with a separate testing set reserved for the testing phase), ensuring representative coverage of pollutant variables (PM2.5, PM10, SO2, NO2, CO, NOx, NH3, O3, and others).
• Train advanced algorithms including XGBoost, CatBoost, and LSTM, alongside baseline models such as Random Forest and SVR, optimizing hyperparameters for each model to maximize forecasting precision.

Testing Phase
• Test trained models on the reserved testing set, evaluating real-world predictive performance for AQI and pollutant concentration forecasts.
• Compare all models using metrics like R-squared, MAE, RMSE, and accuracy to assess effectiveness and robustness on both balanced and imbalanced datasets.
• Select and report the best-performing algorithm based on comparative results, ensuring reliability for both short-term and long-term air pollution prediction.
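
The preprocessing and splitting steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the paper does not say how missing values were handled, so forward filling is an assumption, and the helper names `forward_fill` and `chronological_split`, the 80/20 split, and the toy readings are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def forward_fill(values: Sequence[Optional[float]]) -> List[Optional[float]]:
    """Replace each missing reading (None) with the most recent observed value.

    Gaps at the very start have nothing to copy from and stay None.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def chronological_split(rows: Sequence, test_fraction: float = 0.2) -> Tuple[list, list]:
    """Split time-ordered rows so every test row is later than every training row."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    n_test = max(1, round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return list(rows[:cut]), list(rows[cut:])


# Toy daily PM2.5 readings (illustrative values, not taken from the study).
pm25 = [54.0, None, 61.0, 64.0, None, None, 70.0, 66.0, 58.0, 63.0]
clean = forward_fill(pm25)
train, test = chronological_split(clean, test_fraction=0.2)
print("filled:", clean)
print("train days:", len(train), "test days:", len(test))
```

Splitting by time rather than at random keeps future days out of the training data, which matters for a forecasting task. If class balancing such as SMOTE is used, it is normally applied to the training portion only, after the split, so that synthetic samples do not leak into the evaluation.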
3.1 MATERIALS AND PROCEDURES:

The dataset used comprises daily air pollutant measurements, including PM2.5, PM10, SO2, NO2, CO, NOx, NH3, temperature, humidity, and wind speed from multiple Indian cities between 2021 and 2025. Data preprocessing involved cleaning missing values, balancing the dataset using SMOTE to address class imbalances, and selecting relevant features through recursive feature elimination (RFE) to improve model efficiency.

For the analysis, advanced machine learning models including CatBoost, XGBoost, and Support Vector Regression (SVR), and deep learning models such as Long Short-Term Memory (LSTM) networks, were implemented. Hyperparameter tuning was performed to optimize model parameters. The models were trained on the processed data to forecast air quality indices and pollution levels, leveraging the strengths of gradient boosting and recurrent neural networks for both non-linear and temporal pattern learning.

3.2 TECHNOLOGIES USED:

• Programming Languages: Python was primarily used for data processing, model development, and evaluation.
• Machine Learning Libraries: Key libraries include CatBoost and XGBoost for gradient boosting models, scikit-learn for Support Vector Regression (SVR) and preprocessing, and TensorFlow/Keras for implementing Long Short-Term Memory (LSTM) neural networks.
• Data Handling: Pandas and NumPy were utilized for dataset manipulation and feature engineering.
• Visualization: Matplotlib and Seaborn libraries were used for data visualization and performance result plots.
• Computational Environment: Models were trained and tested on a GPU-equipped system to accelerate deep learning computations.
• Performance Metrics: Evaluation relied on metrics like R-squared, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) to assess model accuracy and reliability.

4. DATASET COLLECTION, PRE-PROCESSING, AND ANALYSIS:

a. Data collection: The dataset consists of daily air pollutant measurements from 2021 to 2025 across multiple Indian cities, including PM2.5, PM10, SO2, NO2, CO, NH3, NOx, O3, temperature, humidity, and wind speed. The data was gathered from government and environmental monitoring stations to ensure reliability.

b. Pre-processing: Pre-processing involved cleaning missing and inconsistent data, normalizing features, and applying the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset for effective training of machine learning algorithms. Feature selection was performed using recursive feature elimination (RFE) to retain the most influential variables, improving model efficiency.

c. Analysis: For analysis, advanced algorithms including CatBoost, XGBoost, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) networks were implemented. These models were trained and tested to predict air quality levels, capturing both spatial and temporal patterns in the data for improved forecasting performance and robustness.

Table 1: Summary Statistics for Air Pollutants

Variable   Mean    StdDev   Min    Max
PM2.5      62.25   7.2      54     70
PM10       91.5    9.0      80     102
SO2        17.5    3.2      14     22
NO2        33.25   3.4      29     37
CO         1.09    0.13     0.95   1.22
NH3        27      2.6      24     31
NOx        42.5    4.1      39     48
O3         55.5    4.3      50     60

5. Implementation:

The implementation involves training advanced machine learning models including CatBoost, XGBoost, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) networks on the air pollution dataset. These models are optimized through hyperparameter tuning to capture complex patterns and temporal dependencies in the data. The trained models are then evaluated on test data using performance metrics like RMSE, MAE, and R-squared to identify the most accurate and reliable predictor for air quality forecasting.

5.1. Machine Learning Classifiers

a) CatBoost Classifier
CatBoost is a gradient boosting algorithm that handles categorical and numerical data efficiently and is highly effective for classification tasks on structured datasets. In air quality classification, CatBoost predicts AQI categories based on pollutant concentrations and meteorological features, learning complex patterns and providing robust, accurate class labels for new environmental data.

b) XGBoost Classifier
XGBoost is another gradient boosting ensemble algorithm: it builds decision trees sequentially, each one correcting the errors of those before it, and combines them for improved prediction accuracy. For air quality datasets, XGBoost classifies AQI levels into discrete categories like “Good,” “Moderate,” or “Unhealthy,” using input features such as PM2.5, NO2, SO2, CO, and weather data. XGBoost includes built-in regularization, which reduces overfitting, and its feature importance scores offer insights into air quality determinants.

c) Support Vector Machine (SVM) Classifier
Support Vector Machine (SVM) is an effective model for both regression (as SVR) and classification; here, it is used for AQI category classification. SVM constructs optimal decision boundaries in a high-dimensional feature space (pollutant concentrations, atmospheric parameters), classifying air pollution levels (e.g., “Safe” vs. “Hazardous”) with high accuracy, especially when the relationship between variables is complex and non-linear.
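
The classifiers above all predict a discrete AQI category, so a numeric AQI has to be mapped to a label before training. The sketch below shows one way to do that. The paper does not state which breakpoints it used, so the US EPA-style bands here are an assumption (chosen because the text mentions "Good", "Moderate", and "Unhealthy"), and the names `aqi_category` and `class_counts` and the sample values are hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, Iterable

# Upper AQI value of each band except the last, which is open-ended.
# ASSUMPTION: US EPA-style bands; the paper does not state its breakpoints.
AQI_UPPER_BOUNDS = [50, 100, 150, 200, 300]
AQI_LABELS = [
    "Good",
    "Moderate",
    "Unhealthy for Sensitive Groups",
    "Unhealthy",
    "Very Unhealthy",
    "Hazardous",
]


def aqi_category(aqi: float) -> str:
    """Return the category label for a numeric AQI value."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    # bisect_left counts how many upper bounds lie strictly below the value,
    # which is the index of the band the value falls into.
    return AQI_LABELS[bisect_left(AQI_UPPER_BOUNDS, aqi)]


def class_counts(aqi_values: Iterable[float]) -> Dict[str, int]:
    """Count samples per category, e.g. to check for class imbalance."""
    return dict(Counter(aqi_category(v) for v in aqi_values))


# Toy AQI readings (illustrative values, not taken from the study).
sample = [42, 88, 97, 120, 165, 73, 310, 55]
labels = [aqi_category(v) for v in sample]
print(labels)
print(class_counts(sample))
```

Counting labels per category is a quick way to see whether some classes are rare, which is the situation the SMOTE balancing step is meant to address. A study on Indian cities might instead use the national AQI bands; only the two lists would need to change.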
d) LSTM-Based Classifier
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) adept at handling time-series data. When extended to classification, LSTM-based models take sequences of pollutant and weather measurements and classify the AQI category at each time step, capturing both short-term fluctuations and longer-term temporal dependencies for accurate air quality class prediction.

These advanced classifiers allow precise and reliable air quality category prediction for health and environmental decision-making, leveraging both structured data learning (CatBoost, XGBoost, SVM) and temporal pattern recognition (LSTM).

RESULTS AND DISCUSSION

Among the tested models, CatBoost and XGBoost delivered the highest accuracy and lowest error rates for air quality category classification and AQI prediction, with CatBoost showing slightly faster prediction times and handling categorical features better. Support Vector Machine (SVM) classifiers achieved solid performance, but were less robust than ensemble models when input data distributions were complex. LSTM-based classifiers effectively captured temporal trends, improving sequential AQI forecasting accuracy but requiring more computation. Overall, the best results in this study were consistently achieved by CatBoost, closely followed by XGBoost, with both outperforming traditional models on key evaluation metrics like R-squared and RMSE. These findings support gradient boosting and deep learning approaches as strong choices for reliable air quality prediction.

Figure 1: Correlation heatmap for air pollution variables
Figure 2: Average concentration of key pollutants (bar graph)

7. COMPARING THE MODELS

CatBoost and XGBoost showed the highest prediction accuracy for air quality classification and AQI forecasting, with CatBoost best at handling categorical features and fastest to train. Random Forest remained highly reliable and consistent, while SVM produced weaker results and more errors overall.
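
One way to carry out a comparison like the one described here is to score every model's predictions on the same held-out data and sort by error. The sketch below implements MAE, RMSE, and R-squared from their standard definitions in plain Python. The function names, the placeholder model names `model_a` and `model_b`, and all numbers are hypothetical and are not results from this study.

```python
import math
from typing import Dict, List, Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination; undefined if y_true is constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rank_models(
    y_true: Sequence[float], predictions: Dict[str, Sequence[float]]
) -> List[Tuple[str, float, float, float]]:
    """Return (name, RMSE, MAE, R-squared) rows sorted by RMSE, lowest first."""
    rows = [
        (name, rmse(y_true, pred), mae(y_true, pred), r_squared(y_true, pred))
        for name, pred in predictions.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Made-up AQI values and predictions for two placeholder models.
actual = [100.0, 120.0, 140.0, 160.0]
preds = {
    "model_b": [90.0, 130.0, 150.0, 150.0],
    "model_a": [102.0, 118.0, 141.0, 159.0],
}
ranking = rank_models(actual, preds)
for name, model_rmse, model_mae, model_r2 in ranking:
    print(f"{name}: RMSE={model_rmse:.3f} MAE={model_mae:.3f} R2={model_r2:.3f}")
```

Sorting by RMSE is a design choice for this sketch; RMSE penalizes large misses more heavily than MAE, so the two metrics can rank models differently on data with occasional pollution spikes.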
LSTM and hybrid deep learning models excelled for time-series forecasting, but were more demanding computationally. Overall, ensemble models like CatBoost and XGBoost are recommended for robust, efficient air quality prediction, while LSTM adds value for time-based trends. Performance can vary across datasets, so model selection should follow testing and validation with real data.

CONCLUSION

Modern machine learning models like CatBoost and XGBoost give highly accurate results for air quality prediction and AQI classification, especially when the data are well preprocessed and balanced. These ensemble methods outperform older models on most urban datasets. LSTM deep learning models also provide strong forecasting for time series but require more resources. CatBoost and XGBoost are recommended for robust, scalable air quality assessment, while LSTM models are best for detailed trend analysis. Using these advanced techniques helps make air pollution monitoring more reliable and effective for health and policy planning.

Abbreviations

• AQI: Air Quality Index
• PM2.5: Particulate Matter with aerodynamic diameter less than 2.5 micrometers
• PM10: Particulate Matter with aerodynamic diameter less than 10 micrometers
• SO2: Sulfur Dioxide
• NO2: Nitrogen Dioxide
• NOx: Nitrogen Oxides
• CO: Carbon Monoxide
• O3: Ozone
• RH: Relative Humidity
• ML: Machine Learning
• DL: Deep Learning
• SVM: Support Vector Machine
• SVR: Support Vector Regression
• DT: Decision Tree
• RF: Random Forest
• KNN: k-Nearest Neighbors
• XGBoost: Extreme Gradient Boosting
• CatBoost: Categorical Boosting
• LSTM: Long Short-Term Memory network
• RMSE: Root Mean Squared Error
• MAE: Mean Absolute Error
• R²: Coefficient of Determination

Declarations

Conflict of Interest: The authors declare that they have no conflict of interest regarding the publication of this research.
Author Contributions: Bindu Sri M: Conceptualized the research, collected and preprocessed the datasets, performed the analysis using machine learning models, interpreted the results, and drafted the manuscript. Acknowledgement: I sincerely thank Mr. Ravindra Babu, Department of Computer Applications, B. V. Raju College, for their guidance and support throughout this research. I also express my gratitude to the management of B. V. Raju College, especially the Principal, for providing the necessary facilities and a supportive academic environment. Finally, I acknowledge open air quality datasets and meteorological data made available by national monitoring agencies (NMAs), which were essential for this study. References Ravindra Babu MN, Satish D, Prasanthi M, Jagadeesh BV, Naidu SVVDJ (2023) J. N. S Kali Pradeep Immidi. Machine learning-based air pollution monitoring and forecasting [Unpublished manuscript].33-ICMLBDA 23.doc Ravindiran G, Murthy SR (2023) Air quality prediction by machine learning models. Chemosphere 339:138685. https://doi.org/10.1016/j.chemosphere.2023.138685 Qian S (2024) An evolutionary deep learning model based on XGBoost for predicting hourly air quality index. Expert Syst Appl 242:120064. https://doi.org/10.1016/j.eswa.2024.120064 Chang YS, Lin LY, Hsieh TH (2020) An LSTM-based aggregated model for air pollution forecasting. Atmospheric Pollution Res 11(8):1451–1463. https://doi.org/10.1016/j.apr.2020.06.010 Saxena A, Shekhawat S (2017) Ambient Air Quality Classification by Grey Wolf Optimizer Based Support Vector Machine. Journal of Environmental and Public Health , 2017, 5583979. https://doi.org/10.1155/2017/5583979 Wu Z, Zhao Y, Cai Y, Liu Y (2022) An ensemble LSTM-based AQI forecasting model with statistical feature extraction. Sci Rep 12:10178. https://doi.org/10.1038/s41598-022-19956-7 Leong WC, Kamarulzaman KS, Pauzi NZ (2020) Prediction of air pollution index (API) using support vector machine. Environ Sci Pollut Res 27:33008–33018. 
https://doi.org/10.1016/j.envpol.2020.08.106 Farooq O, Halder D, Kumar T (2024) An enhanced approach for predicting air pollution using quantum SVM. Sci Rep 14:11245. https://doi.org/10.1038/s41598-024-69663-2 Kothandaraman D, Ramasamy B (2024) Intelligent Forecasting of Air Quality and Pollution Prediction using Machine Learning. Adsorpt Sci Technol 42(6):423–435. https://doi.org/10.1155/2024/4238763 Lin J, Chuang YL (2023) Air Pollution Prediction using Multivariate LSTM Deep Learning Model. Int J Intell Syst Appl Eng 12(8s):211–220. https://ijisae.org/index.php/IJISAE/article/view/4111 Gul S, Han Y (2020) Forecasting Hazard Level of Air Pollutants using LSTM’s. Atmospheric Pollution Res 11(7):1199–1206. https://doi.org/10.1016/j.apr.2020.05.012 Makala JK, Chidzulo L (2025) Forecasting of Air Quality with Machine Learning. Proceedings of IAIA 2025 . https://2025.iaia.org/final-papers/1261_Makala_Forecasting_of_air_Quality.pdf Kumar P, Choudhary A (2023) Comparative analysis of machine learning models for air quality index prediction. Natl High School J Sci 18(6):60–77. https://nhsjs.com/2025/comparative-analysis-of-machine-learning-models-for-air-quality-index-prediction/ Bedoui S, Bouaziz A (2015) A prediction distribution of atmospheric pollutants using support vector machines. Journal of Environmental Pollution , 2016, 55874. https://doi.org/10.22059/jpoll.2015.55874 Han Y, Lam JCK, Li VOK (2020) A Bayesian LSTM model to evaluate the effects of air pollution control regulations in China. Atmospheric Pollution Res 11(5):876–885. https://doi.org/10.1016/j.apr.2020.03.013 Wang J et al (2025) Machine learning-based forecasting of air quality index under varying meteorological influences. PLoS ONE 20(10):e0334252. https://doi.org/10.1371/journal.pone.0334252 Tipsavak A, Kamsuwan T, Chueseng P (2025) Explainable artificial intelligence-driven model for ultrafine particle prediction: SHAP and XGBoost applications. J Clean Prod 413:138629. 
https://doi.org/10.1016/j.jclepro.2025.138629 Additional Declarations The authors declare no competing interests. Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8328705","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":559964101,"identity":"207306e4-9aac-4447-a2b4-82f8db127fe5","order_by":0,"name":"Bindu sri.Mallula","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABA0lEQVRIiWNgGAWjYBACA2YGBgkgzcPAwNx4gKGAQQ4keuABcVoYGw4wGDAYg7Uk4NPCANHCANOS2ABi49Nizs578DbPn3syuu0HGw7zGNilzw87/BBoi52cbgN2LZbNfMnWvG3FPGZnEkFaknM33k4zAGpJNjY7gMNhh3nMpHkbEnjMDoC1MOdunJ0A0nIgcRs+LTx/gFrOPwRpqU83nJ3+gQgtbEAtN8C2HE6Ql87Bb4tlM4+x5dw2kJaHDQfnGBw33CCdU3AgwQC3X8z5zxjeePMnwd7sfPLBB28qquXlZ6dv/vChwk4OlxYsTgWrNCBWOQjIN5CiehSMglEwCkYCAABTZmC0KpQzYgAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0009-0007-7037-068X","institution":"B.V.Raju 
College","correspondingAuthor":true,"prefix":"","firstName":"Bindu","middleName":"","lastName":"sri.Mallula","suffix":""},{"id":559964102,"identity":"d52ec28b-9ae7-47b4-a629-9ffe257e5e5d","order_by":1,"name":"M. N. Ravindra Babu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABKElEQVRIiWNgGAWjYJCCAwxscDYzAz978wEgQ0KGeC2SPccSQFp48NuDrMXgRo4BiIVTi3n7GcPDBWX35BnEDh98+KXGWh6o5fOrGzUWPAzsh49uwKJF5kyOweEZ54oNG6TTko1ljqUbzjzzdpt1zjGgw3jS0m5g0SLBkJZwmLctgbFBOsdMWoLtMGPf8dxtxjlsQC0SPGZYtfA/A2uxb5DO//5b4t9h+4YDOc+Mc/7h0SKRfACkJRFoCxvjx7bDiRNO5DA/zm3Dp+XxgcM85xKS26TTjKUZ+9KTZ/YcM2PO7ZPgYcPlF/7E5s88ZQm2/dLJDz/++GZt28/e/Phzzrc6OX72w8ewaYEDUNQwQ+OCTQImQhAw/oDQzB+IUT0KRsEoGAUjBgAAtlplSQjGhBQAAAAASUVORK5CYII=","orcid":"https://orcid.org/0009-0005-5787-6307","institution":"B.V Raju College","correspondingAuthor":true,"prefix":"","firstName":"M.","middleName":"N. Ravindra","lastName":"Babu","suffix":""}],"badges":[],"createdAt":"2025-12-10 15:08:44","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-8328705/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8328705/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":98305437,"identity":"962fe63e-1374-4f45-be7f-4f77579aee97","added_by":"auto","created_at":"2025-12-16 
11:07:19","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":255179,"visible":true,"origin":"","legend":"","description":"","filename":"FINALAIRPOLLUTIONPAPER.DOC.docx","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/186b7f3b3d32d0cd00edcf45.docx"},{"id":98305428,"identity":"f53a0bf9-6979-42c0-880e-8ba0af39f73e","added_by":"auto","created_at":"2025-12-16 11:07:19","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":342,"visible":true,"origin":"","legend":"","description":"","filename":"rs8328705.json","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/cd0d46e9c977bb2f606b1dbb.json"},{"id":98437142,"identity":"44943705-3c80-4d60-9055-445cbbaaf481","added_by":"auto","created_at":"2025-12-17 16:57:02","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":58154,"visible":true,"origin":"","legend":"","description":"","filename":"rs83287050enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/d7788d0279db4552fe27fa8f.xml"},{"id":98436730,"identity":"646a2e57-5f32-4eeb-9b91-55b2662909c1","added_by":"auto","created_at":"2025-12-17 16:56:12","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":18578,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/fe52a551d2fc4aed6d1b6e36.png"},{"id":98436708,"identity":"a9f61e09-7da9-40ce-84ca-5d9741b1fd70","added_by":"auto","created_at":"2025-12-17 
16:56:07","extension":"png","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":27483,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/db1622996fd4fab7d1fcfa7e.png"},{"id":98305433,"identity":"4446d23d-6265-4371-8881-6b26802e646f","added_by":"auto","created_at":"2025-12-16 11:07:19","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":20206,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/bad1d19a4827ffd3a3660a42.png"},{"id":98434339,"identity":"f3e07b20-cd4c-498a-b2dc-4d7fdb87716d","added_by":"auto","created_at":"2025-12-17 16:51:57","extension":"xml","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":54969,"visible":true,"origin":"","legend":"","description":"","filename":"rs83287050structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/5dd866c4e083455689e7bd02.xml"},{"id":98305435,"identity":"8965559b-14ad-400d-bdda-00d12c8062bc","added_by":"auto","created_at":"2025-12-16 11:07:19","extension":"html","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":66980,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/1d18a8cbc92a90c2cc913c5c.html"},{"id":98305432,"identity":"323ef1fd-515d-42d1-a494-0136a41bfd47","added_by":"auto","created_at":"2025-12-16 11:07:19","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":102543,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCorrelation Heatmap of Air Pollution 
Variables\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/354c0279e5bceb62d67c25e7.png"},{"id":98437170,"identity":"30fb3ddb-26f4-49a6-902c-9e36fe8f9919","added_by":"auto","created_at":"2025-12-17 16:57:05","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":61068,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eAverage Concentration of Key Pollutants\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/6d4bb4e8c263c428d194aded.png"},{"id":98774675,"identity":"3cf9e1e4-40bc-4948-a6e3-0aa7327a979c","added_by":"auto","created_at":"2025-12-22 12:10:21","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":940119,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8328705/v1/ab5e3ed4-dba5-4e01-a53b-786fa7e8a1f2.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eMachine Learning-Based Air Pollution Monitoring And Forecasting\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"INTRODUCTION","content":"\u003cp\u003eAir pollution is a critical environmental and public health issue, especially in rapidly urbanizing regions where industrial growth and vehicular emissions continue to worsen air quality. The accurate prediction of air pollution levels, such as PM2.5 concentrations and the Air Quality Index (AQI), is essential for informed policy-making to mitigate health risks and environmental impacts. 
Traditional statistical methods have been widely used, but these often struggle to capture the complex, non-linear, and temporal characteristics of air pollution data.\u003c/p\u003e \u003cp\u003eRecent advancements in machine learning and deep learning provide powerful tools to address these challenges. Algorithms like CatBoost and XGBoost, which are gradient boosting ensemble methods, have shown superior predictive accuracy by effectively handling high-dimensional data and complex feature interactions. Meanwhile, Support Vector Regression (SVR) offers robustness against outliers in regression tasks, enhancing prediction reliability. Deep learning architectures, particularly Long Short-Term Memory (LSTM) networks, excel at modeling temporal dependencies, making them highly suited for forecasting air quality time series data with improved precision.\u003c/p\u003e \u003cp\u003eThis study focuses on incorporating these advanced algorithms into air quality prediction models, extending beyond previously used approaches like decision trees and random forests. By evaluating CatBoost, XGBoost, SVR, and LSTM on a comprehensive dataset of atmospheric pollutant measurements from various Indian cities, this work aims to improve forecasting accuracy and robustness. The integration of these methods offers a pathway for more reliable pollution monitoring, which is vital for public health interventions and emission control strategies.\u003c/p\u003e\n\u003ch3\u003e2. A BRIEF LITERATURE REVIEW\u003c/h3\u003e\n\u003cp\u003eRecent studies in air pollution monitoring and forecasting have extensively employed traditional machine learning models such as Linear Regression, Decision Tree, Random Forest, K-Nearest Neighbors, and Logistic Regression to predict air quality indices based on pollutants like PM2.5, SO2, NO2, and CO. These models have demonstrated satisfactory accuracy in classifying pollution levels and forecasting air quality trends in various urban areas. 
However, while robust, these conventional techniques have limitations in capturing complex nonlinear patterns and spatiotemporal dependencies in air pollution data.\u003c/p\u003e \u003cp\u003eTo address these challenges, advanced machine learning methods such as gradient boosting algorithms including CatBoost and XGBoost have emerged as powerful tools. These models effectively handle heterogeneous data and feature interactions, improving prediction accuracy. Additionally, Support Vector Regression (SVR) has been used for its robustness in modeling nonlinear relationships with fewer overfitting issues. More recently, deep learning architectures like Long Short-Term Memory (LSTM) networks and Convolutional Neural Networks (CNNs) have gained prominence due to their ability to capture temporal and spatial dependencies, respectively. Hybrid models combining LSTM and CNN, often enhanced with attention mechanisms, have further advanced forecasting capabilities by integrating multimodal data sources.\u003c/p\u003e \u003cp\u003eFurthermore, the integration of explainable AI techniques into air quality prediction models is becoming vital to interpret and validate the results for policymaking and public awareness. These developments mark a significant evolution from earlier machine learning approaches, offering enhanced accuracy, interpretability, and adaptability in air pollution forecasting systems. 
Incorporating these novel algorithms promises to improve real-time air quality monitoring and enable more effective environmental management strategies.\u003c/p\u003e"},{"header":"PROPOSED METHODOLOGY","content":"\u003cp\u003e\u003cstrong\u003eTraining Phase\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u0026bull; Preprocess the air quality dataset to handle missing values and balance classes, using techniques such as SMOTE.\u003c/p\u003e\n\u003cp\u003e\u0026bull; Split the data into training and validation sets, ensuring representative coverage of pollutant variables (PM2.5, PM10, SO2, NO2, CO, NOx, NH3, O3, and others).\u003c/p\u003e\n\u003cp\u003e\u0026bull; Train advanced algorithms including XGBoost, CatBoost, and LSTM, alongside baseline models such as Random Forest and SVR, optimizing hyperparameters for each model to maximize forecasting precision.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTesting Phase\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u0026bull; Test trained models on the reserved testing set, evaluating real-world predictive performance for AQI and pollutant concentration forecasts.\u003c/p\u003e\n\u003cp\u003e\u0026bull; Compare all models using metrics like R-squared, MAE, RMSE, and accuracy to assess effectiveness and robustness on both balanced and imbalanced datasets.\u003c/p\u003e\n\u003cp\u003e\u0026bull; Select and report the best-performing algorithm based on comparative results, ensuring reliability for both short-term and long-term air pollution prediction.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.1 MATERIALS AND PROCEDURES:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe dataset used comprises daily air pollutant measurements, including PM2.5, PM10, SO2, NO2, CO, NOx, NH3, temperature, humidity, and wind speed from multiple Indian cities between 2021 and 2025. 
Data preprocessing involved handling missing values, balancing the dataset using SMOTE to address class imbalances, and selecting relevant features through recursive feature elimination (RFE) to improve model efficiency.\u003c/p\u003e\n\u003cp\u003eFor the analysis, advanced machine learning models including CatBoost, XGBoost, Support Vector Regression (SVR), and deep learning models such as Long Short-Term Memory (LSTM) networks were implemented. Hyperparameter tuning was performed to optimize model parameters. 
The models were trained on the processed data to forecast air quality indices and pollution levels, leveraging the strengths of gradient boosting and recurrent neural networks for both non-linear and temporal pattern learning.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e3.2 \u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eTECHNOLOGIES USED:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u0026bull; \u003cstrong\u003eProgramming Languages\u003c/strong\u003e: Python was primarily used for data processing, model development, and evaluation.\u003c/p\u003e\n\u003cp\u003e\u0026bull; \u003cstrong\u003eMachine Learning Libraries:\u0026nbsp;\u003c/strong\u003e Key libraries include CatBoost and XGBoost for gradient boosting models, scikit-learn for Support Vector Regression (SVR) and preprocessing, and TensorFlow/Keras for implementing Long Short-Term Memory (LSTM) neural networks.\u003c/p\u003e\n\u003cp\u003e\u0026bull; \u003cstrong\u003eData Handling:\u0026nbsp;\u003c/strong\u003e Pandas and NumPy were used for dataset manipulation and feature engineering.\u003c/p\u003e\n\u003cp\u003e\u0026bull; \u003cstrong\u003eVisualization:\u003c/strong\u003e Matplotlib and Seaborn were used for data visualization and for plotting performance results.\u003c/p\u003e\n\u003cp\u003e\u0026bull; \u003cstrong\u003eComputational Environment:\u003c/strong\u003e\u0026nbsp; Models were trained and tested on a GPU-enabled system to accelerate deep learning computations.\u003c/p\u003e\n\u003cp\u003e\u0026bull; \u003cstrong\u003ePerformance metrics:\u003c/strong\u003e Evaluation relied on R-squared, Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) to assess model accuracy and reliability.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4. DATASET COLLECTION, PRE-PROCESSING, AND ANALYSIS:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea. 
Data collection:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe dataset consists of daily air pollutant measurements from 2021 to 2025 across multiple Indian cities, including PM2.5, PM10, SO2, NO2, CO, NH3, NOx, O3, temperature, humidity, and wind speed. The data was gathered from government and environmental monitoring stations to ensure reliability.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eb. Pre-processing:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePre-processing involved cleaning missing and inconsistent data, normalizing features, and applying Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset for effective training of machine learning algorithms. Feature selection was performed using recursive feature elimination (RFE) to retain the most influential variables, improving model efficiency.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ec. Analysis:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFor analysis, advanced algorithms including CatBoost, XGBoost, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) networks were implemented. 
These models were trained and tested to predict air quality levels, capturing both spatial and temporal patterns in the data for improved forecasting performance and robustness.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1: Summary Statistics for Air Pollutants\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"100%\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eVariable\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMean\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eStdDev\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMin\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMax\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ePM2.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e62.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e7.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e54\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e70\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ePM10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e91.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e9.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e102\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eSO2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e17.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.2\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd\u003e\n \u003cp\u003e14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e22\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eNO2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e33.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e37\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eCO\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1.09\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.95\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1.22\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eNH3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e31\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eNOx\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e42.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e39\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e48\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eO3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e55.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e60\u003c/p\u003e\n 
\u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003e\u003cimg width=\"624\" height=\"416\" src=\"https://myfiles.space/user_files/127393_c7e80a1c9bb65875/127393_custom_files/img1765882756.gif\" alt=\"image\"\u003e\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSummary statistics for air pollutants\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e5. Implementation:\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe updated implementation involves training advanced machine learning models including CatBoost, XGBoost, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) networks on the air pollution dataset. These models are optimized through hyperparameter tuning to capture complex patterns and temporal dependencies in the data. The trained models are then evaluated on test data using performance metrics like RMSE, MAE, and R-squared to identify the most accurate and reliable predictor for air quality forecasting.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e5.1. Machine Learning Classifiers\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ea)\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eCatBoost Classifier\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCatBoost is a gradient boosting algorithm that handles categorical and numerical data efficiently and is highly effective for classification tasks on structured datasets. In air quality classification, CatBoost predicts AQI categories based on pollutant concentrations and meteorological features, learning complex patterns and providing robust, accurate class labels for new environmental data.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eb) XGBoost Classifier\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eXGBoost is another ensemble algorithm that builds multiple decision trees and combines them for improved prediction accuracy. 
For air quality datasets, XGBoost classifies AQI levels into discrete categories like \u0026ldquo;Good,\u0026rdquo; \u0026ldquo;Moderate,\u0026rdquo; or \u0026ldquo;Unhealthy,\u0026rdquo; using input features such as PM2.5, NO2, SO2, CO, and weather data. XGBoost includes built-in regularization, reducing overfitting, and its feature importance scores offer insights into air quality determinants.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ec) Support Vector Machine (SVM) Classifier\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSupport Vector Machine (SVM) is an effective model for both regression and classification; here, it is used for AQI category classification. SVM constructs optimal decision boundaries in high-dimensional feature space (pollutant concentrations, atmospheric parameters), classifying air pollution levels (e.g., \u0026ldquo;Safe\u0026rdquo; vs. \u0026ldquo;Hazardous\u0026rdquo;) with high accuracy, especially when the relationship between variables is complex and non-linear.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ed) LSTM-Based Classifier\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eLong Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) adept at handling time-series data. 
When extended to classification, LSTM-based models take sequences of pollutant and weather measurements and classify the AQI category at each time step, capturing both short-term fluctuations and temporal dependencies for accurate air quality class prediction.\u003c/p\u003e\n\u003cp\u003eThese advanced classifiers allow precise and reliable air quality category prediction for health and environmental decision-making, leveraging both structured data learning (CatBoost, XGBoost, SVM) and temporal pattern recognition (LSTM).\u003c/p\u003e"},{"header":"RESULTS AND DISCUSSION","content":"\u003cp\u003eAmong the tested models, CatBoost and XGBoost delivered the highest accuracy and lowest error rates for air quality category classification and AQI prediction, with CatBoost showing slightly faster prediction times and handling categorical features better. Support Vector Machine (SVM) classifiers achieved solid performance, but were less robust than ensemble models when input data distributions were complex. LSTM-based classifiers effectively captured temporal trends, improving sequential AQI forecasting accuracy but requiring more computation.\u003c/p\u003e\n\u003cp\u003eOverall, the best results in this study were consistently achieved by CatBoost, closely followed by XGBoost, with both outperforming traditional models on key evaluation metrics like R-squared and RMSE. These findings support gradient boosting and deep learning approaches as strong choices for reliable air quality prediction.\u003c/p\u003e\n\u003cp\u003eCorrelation heatmap for air pollution variables\u003c/p\u003e\n\u003cp\u003eAverage concentration of key pollutants (bar graph)\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e7. COMPARING THE MODELS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eCatBoost and XGBoost showed the highest prediction accuracy for air quality classification and AQI forecasting, with CatBoost handling categorical features best and training fastest. 
Random Forest remained highly reliable and consistent, while SVM produced weaker results and more errors overall. LSTM and hybrid deep learning models excelled for time-series forecasting, but were more demanding computationally.\u003c/p\u003e\n\u003cp\u003eOverall, ensemble models like CatBoost and XGBoost are recommended for robust, efficient air quality prediction, while LSTM adds value for time-based trends. Performance can vary across datasets, so model selection should follow testing and validation with real data.\u003c/p\u003e"},{"header":"CONCLUSION","content":"\u003cp\u003eUpdated machine learning models like CatBoost and XGBoost give highly accurate results for air quality prediction and AQI classification, especially when the data are well-preprocessed and balanced. These ensemble methods outperform older models in most urban datasets. LSTM deep learning models also provide strong forecasting for time series but require more resources.\u003c/p\u003e \u003cp\u003eCatBoost and XGBoost are recommended for robust, scalable air quality assessment, while LSTM models are best for detailed trend analysis. 
Using these advanced techniques helps make air pollution monitoring more reliable and effective for health and policy planning.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; AQI\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eAir Quality Index\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; PM2.5\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eParticulate Matter with aerodynamic diameter less than 2.5 micrometers\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; PM10\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eParticulate Matter with aerodynamic diameter less than 10 micrometers\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; SO2\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eSulfur Dioxide\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; NO2\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eNitrogen Dioxide\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; NOx\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eNitrogen Oxides\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; CO\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eCarbon Monoxide\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; 
O3\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eOzone\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; RH\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRelative Humidity\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; ML\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMachine Learning\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; DL\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eDeep Learning\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; SVM\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eSupport Vector Machine\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; SVR\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eSupport Vector Regression\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; DT\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eDecision Tree\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; RF\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; KNN\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003ek\u0026ndash;Nearest Neighbors\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv 
class=\"Term\"\u003e\u0026bull; XGBoost\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eExtreme Gradient Boosting\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; CatBoost\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eCategorical Boosting\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; LSTM\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eLong Short\u0026ndash;Term Memory network\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; RMSE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRoot Mean Squared Error\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; MAE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMean Absolute Error\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003e\u0026bull; R\u0026sup2;\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eCoefficient of Determination\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eConflict Of Interest:\u003c/h2\u003e \u003cp\u003eThe authors declare that they have no conflict of interest regarding the publication of this research.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAuthor Contributions:\u003c/h2\u003e \u003cp\u003eBindu Sri M: Conceptualized the research, collected and preprocessed the datasets, performed the analysis using machine learning models, interpreted the results, and drafted the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement:\u003c/h2\u003e \u003cp\u003eI sincerely thank Mr. 
Ravindra Babu, Department of Computer Applications, B. V. Raju College, for their guidance and support throughout this research. I also express my gratitude to the management of B. V. Raju College, especially the Principal, for providing the necessary facilities and a supportive academic environment. Finally, I acknowledge open air quality datasets and meteorological data made available by national monitoring agencies (NMAs), which were essential for this study.\u003c/p\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"B.V.Raju College","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research 
Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Air Quality Index (AQI), PM2.5, air pollution forecasting, machine learning (ML), deep learning (DL), CatBoost, XGBoost, Support Vector Regression (SVR), Long Short-Term Memory (LSTM), gradient boosting, ensemble models, time-series prediction","lastPublishedDoi":"10.21203/rs.3.rs-8328705/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8328705/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eToday, governments in developing countries are increasingly focused on managing air pollution, which results from vehicle fuel use, industrial operations, and the burning of waste. Poor air quality is a pressing health issue and is commonly assessed using PM2.5 levels among other variables. Accurate prediction and ongoing monitoring are crucial for pollution control. In this work, advanced machine learning and deep learning models\u0026mdash;namely CatBoost, XGBoost, Support Vector Regression (SVR), and Long Short-Term Memory (LSTM) neural networks\u0026mdash;are implemented and evaluated to forecast future air pollution levels and the Air Quality Index (AQI) using historical data on PM2.5, NH3, CO, NO, NOx, and NO2, and SO2. These novel techniques are compared with traditional models to assess their prediction accuracy and robustness. 
By leveraging daily atmospheric datasets from Indian cities, the study demonstrates that modern ensemble and deep learning approaches can provide improved and more reliable forecasts of air quality, supporting data-driven public health interventions and policy decisions.\u003c/p\u003e","manuscriptTitle":"Machine Learning-Based Air Pollution Monitoring And Forecasting","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-16 11:07:08","doi":"10.21203/rs.3.rs-8328705/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"bff8cb20-ecf1-45b7-8f8a-9c0d604dd2fc","owner":[],"postedDate":"December 16th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":59613907,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-12-16T11:07:09+00:00","versionOfRecord":[],"versionCreatedAt":"2025-12-16 11:07:08","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8328705","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8328705","identity":"rs-8328705","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}


