Machine Learning Insights into Regional Dynamics and Prevalence of COVID-19 Variants in US Health and Human Services Regions

Research Square
Research Article
Lejia Hu, Xuan Zhang, Fabian D’Souza
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4208741/v1
This work is licensed under a CC BY 4.0 License
Status: Posted. Version 1 posted.
Abstract
Background
The COVID-19 pandemic, arising from the emergence of SARS-CoV-2 in late 2019, has led to global devastation with millions of lives lost by January 2024. Despite the WHO's declaration of the end of the global health emergency in May 2023, the virus persists, propelled by mutations. Variants continue to challenge vaccination efforts, underscoring the necessity for ongoing vigilance.
This study aimed to contribute to a more data-driven approach to pandemic management by employing random forest regression to analyze regional variant prevalence.
Methods
This study utilized data from various sources, including the National COVID Cohort Collaborative database, the Bureau of Transportation Statistics, World Weather Online, the EPA, and the US Census. Key variables include pollution, weather, travel patterns, and demographics. Preprocessing steps involved merging and normalization of datasets. Training data spanned from January 2021 to February 2023. The Random Forest Regressor was chosen for its accuracy in modeling. To prevent data leakage, time series splits were employed. Model performance was evaluated using metrics such as MSE and R-squared.
Results
The Alpha variant was predominant in the Southeast, with less than 80% share even at its peak. Delta surged initially in Kansas City and maintained dominance there for over 5 months. Omicron subvariant BA.5 spread nationwide, becoming predominant across all Health and Human Services regions simultaneously, with New York seeing the earliest and fastest decline in its share. Variant XBB.1.5 concentrated more in the Northeast, but limited data hindered full analysis. Using the RF regressor, key features affecting spread patterns were identified, with high predictive accuracy. Each variant showed specific environmental correlations: for instance, Alpha with air quality index and temperature, Delta with ozone density, BA.5 with UV index, and XBB.1.5 with location, land area, and income. Correlation analysis further highlighted variant-specific associations.
Conclusions
This research provides a comprehensive analysis of the regional distribution of COVID-19 variants, offering critical insights for devising targeted public health strategies.
By utilizing machine learning, the study uncovers the complex factors contributing to variant spread and reveals how specific factors contribute to variant prevalence, offering insights crucial for pandemic management.
Keywords: COVID-19 variants, Random Forest Regressor, Regional spread, Factor Importance
Background
The COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, emerged in late 2019 and rapidly spread globally. Characterized by respiratory symptoms, the virus led to widespread illness, overwhelmed healthcare systems, and resulted in extensive social and economic disruptions. Since the virus first emerged, the World Health Organization (WHO) has documented over 774 million COVID-19 cases and 7 million COVID-related deaths globally as of January 14, 2024 [1]. Efforts to control the pandemic included lockdowns, vaccination campaigns, and public health measures, with ongoing challenges and impacts across different regions. Although the WHO declared the end of the COVID-19 “global health emergency” in May 2023 [2], the virus remains in circulation and exhibits periodic fluctuations characterized by peaks and troughs, representing phases of increased and decreased infection rates. One key factor hampering efforts to contain the virus is the emergence of new mutations. Different variants of the SARS-CoV-2 virus exhibit changes in key genetic material that confer specific characteristics. Some variants spread more easily, evade immunity gained through previous infection or vaccination, or cause more severe illness [3]. Health bodies globally continue to track COVID-19 mutations, designating some mutations as “variants of concern” (VOC) and “variants of interest” (VOI).
A VOI is a variant with genetic alterations affecting virus characteristics, causing notable community transmission across multiple countries, and showing increasing prevalence and cases over time, posing an emerging risk to global public health [ 4 ]. A VOC meets the criteria for a VOI and has been established as a threat to health concerning transmissibility, health outcomes, and/or vaccination or therapeutic efficacy [ 3 ]. Monitoring and understanding these variants are crucial for public health efforts to adopt strategies such as vaccine updates and preventive measures to effectively combat the evolving nature of the virus. In pandemics, authorities can increase the production of medical supplies, share resources, or redistribute demand to manage healthcare resources [ 5 ]. Regional analysis, in the context of COVID-19, can help fulfill this objective by enabling the efficient allocation of resources, including medical supplies and vaccines, by identifying regions with higher infection rates or vulnerable populations. Additionally, regional data guides policymakers in making informed decisions, allowing for the adoption of nuanced, region-specific policies and resource management that balance virus containment with socioeconomic considerations. Regional analyses may uncover predictors of variant prevalence and contribute to a more data-driven approach to pandemic management by investigating regional distribution and trends of specific COVID-19 variants. The primary approach for studying and predicting the spread of infectious diseases involves the creation of mathematical models. The two most commonly used methods for this are equation-based modeling, which uses differential equations to describe population-level dynamics [ 6 ], and agent-based modeling [ 7 ], which simulates individual interactions [ 8 ]. 
A review of modeling papers found most models were compartmental epidemiological models [ 9 ], such as susceptible–infectious–recover models [ 10 ], or a modified version, susceptible-exposed-infected-recovered models, which focus on a human-to-human transmission pathway [ 11 ]. By establishing models that reflect the process, laws, and trends of infectious disease spread, this approach allows for a thorough analysis of dynamic characteristics, providing a solid foundation for understanding the causes and key factors of disease transmission and devising optimal prevention and control strategies [ 12 ]. While machine learning has been used to model COVID-19 [ 13 ], this approach is used far less frequently than mathematical modeling [ 9 ]. Machine learning can enhance traditional mathematical epidemic models by leveraging extensive datasets, such as epidemic, genetic, demographic, geospatial, and mobility data [ 14 ]. These datasets often surpass the scale that conventional mathematical models can effectively handle, allowing for more comprehensive and improved modeling. Additionally, machine learning does not have to assume any underlying relationships between variables [ 15 ]. This study analyzed the influence of various regional or temporal factors on the percentage of four key COVID-19 variants over time from different Health and Human Services (HHS) regions in the United States (US) using the machine learning technique of random forest (RF) regression. This represents a novel application of a well-established technique, by using RF to analyze the share of COVID-19 variants at a regional level to identify predictors of variant prevalence. The results of this study aim to contribute to a more data-driven approach to pandemic management. 
Methods
Data sources and research objects
Feature selection is a key process to enhance machine learning model efficiency, interpretability, and generalization by excluding irrelevant or redundant features and minimizing the likelihood of overfitting [16]. Therefore, the datasets used for model training were chosen to reflect factors that are important in disease spread and factors that have been empirically linked to the spread of COVID-19. Important variables highlighted in the literature included pollution and air quality [17], weather [18], travel and transport [19], and population demographics [20]. Additionally, a systematic review of the literature regarding machine learning modeling for detection and prediction of disease outbreaks found that the most commonly used databases were epidemiological and meteorological databases [21]. These factors were considered to determine the appropriate dataset selections for model training. The data sources chosen were the National COVID Cohort Collaborative (N3C) database, the Bureau of Transportation Statistics, World Weather Online, the United States Environmental Protection Agency, and US Census data. The N3C database provides a comprehensive view of COVID-19 variants, their share across various HHS regions, and how these variants evolve every week. The term "share" represents the relative percentage of a specific variant's prevalence at a particular time within a specific region, allowing tracking of variant evolution and spread. HHS regions are ten administrative divisions of the United States used by the US government to manage and coordinate healthcare and public health activities. These regions serve the following important purposes: coordination of healthcare programs, response to public health emergencies, and healthcare resource allocation.
The other databases used in this study consist of publicly available data, including travel, weather, demographics, and air quality information. Collectively, these databases form a rich source of important variables that facilitate an in-depth study of COVID-19 variants and various environmental and societal factors.
Outcomes and predictor variables
Four COVID-19 variants, Delta, Alpha, XBB.1.5, and Omicron BA.5, were selected for the analysis. Delta, Alpha, and XBB.1.5 were declared VOCs by the WHO in April 2021, December 2020, and January 2023, respectively [22]. Omicron BA.5 was categorized as a VOC in May 2022 by the European Centre for Disease Prevention and Control [23]. The outcome of interest was the share of each variant in each HHS region. Predictor variables used for the model training include travel behaviors (e.g., travel frequency and number of people staying at home), atmospheric levels of NO2, SO2, CO, and ozone, income levels, educational attainment, population density, air temperature, humidity, atmospheric pressure, and UV index.
Quality control
The primary goal of data preprocessing was to prepare collected datasets for seamless integration into the N3C database. This began with data cleaning on four publicly sourced datasets covering weather, air quality, travel, and demographics. Each dataset underwent merging, cleaning, and aggregation, employing techniques such as summing and averaging across different regions, depending on the nature of the variable. For instance, land areas were summed up for each HHS region, while incomes were averaged. After the preprocessing, each dataset was merged with the N3C dataset based on the HHS region and the time. The resulting dataset comprised 103 variables summarizing regional characteristics. To refine the dataset, seven variables including housing units and population estimates from 2000 were removed due to outdated information or observed collinearity.
Additionally, two time-related variables, sun hour and moon illumination, were eliminated. Next, datasets were created for each of the four selected variants as well as for the ‘Other’ group (a set of mixed variants in N3C), which served as a baseline variant. Data normalization was undertaken for each of the five datasets using the min-max scaler method. Once scaled, all feature values lie between 0 and 1, which prevents differences in units and magnitude from biasing the model's learning process.
Sample size
The N3C data recorded variant shares over the period from 2021-01-02 to 2023-02-11. Other databases provided records from December 2020 to October 2023 to cover the full period of the N3C database. The numbers of records for the four variants were 210 for XBB.1.5, 33357 for B.1.617.2, 4398 for B.1.1.7, and 8032 for BA.5. The sizes were large enough to provide sufficient training and testing data.
Machine learning methods
The machine learning algorithm used in this study was the RF Regressor, which includes a built-in feature importance function, making it an excellent choice for our research purpose. Feature importance is a way to understand the contribution of each feature (variable) in predicting the target variable (outcome). RF is an ensemble algorithm that combines multiple decision trees to make predictions. The importance of a feature is determined by how much it contributes across all trees in the forest. Studies have shown that RF is a highly accurate method for modeling diseases. Özen used RF regression to predict daily COVID-19 cases and deaths in Turkey [24]. In the same year, Kolozali et al. used the algorithm to accurately predict biomarker values indicative of gestational diabetes mellitus, further showcasing the versatility and efficacy of the RF Regressor in diverse medical fields [25].
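The normalization and feature-importance steps described above can be illustrated with a minimal sketch. It assumes scikit-learn and uses synthetic data; the feature names, value ranges, and the synthetic target are hypothetical and are not taken from the study's datasets. For simplicity the scaler is fitted on all rows; in a time-ordered evaluation it would usually be fitted on the training portion only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical predictors on very different scales (not the study's data).
feature_names = ["uv_index", "temperature", "ozone", "aqi"]
low = [0.0, -10.0, 0.0, 0.0]
high = [11.0, 40.0, 0.1, 200.0]
X_raw = rng.uniform(low, high, size=(400, 4))

# Synthetic "share" in [0, 1], driven mainly by the first feature.
share = np.clip(0.08 * X_raw[:, 0] + rng.normal(0.0, 0.02, 400), 0.0, 1.0)

# Min-max scaling: x' = (x - min) / (max - min), applied per column.
X = MinMaxScaler().fit_transform(X_raw)

# Fit the forest and read the impurity-based feature importances.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, share)

importances = dict(zip(feature_names, model.feature_importances_))
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
for name, value in ranked:
    print(f"{name:12s} {value:.3f}")
```

The importances returned by scikit-learn sum to 1, so each value can be read as a relative contribution, which is how a top-features bar chart is typically built.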
After initial feature selection, all remaining features were included in the model, which allowed for a comprehensive understanding of each feature's contribution. This is feasible with RF because the ensemble nature of the algorithm makes it robust and less susceptible to overfitting. Thus, we were able to fully examine the contrast between less important features and those making the largest contributions to the model. To prevent data leakage, we used time series splits to divide the training set and testing set [26]. In the time series split, the dataset was split into 10 consecutive folds for cross-validation purposes. This ensures that only past data are used for training in each fold, so that performance estimates on future data are more robust.
Establishment and validation of the models
To evaluate the model’s performance, we employed four common metrics in regression analysis: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²). These metrics were evaluated on the testing set, providing insights into the model’s ability to capture underlying patterns. Feature importance graphs, Spearman’s correlation graphs, and tree plots were used to visualize the patterns and to aid interpretation.
Results
We analyzed the share of the four selected variants in each HHS region during the study period. The Alpha variant predominantly surfaced in the Southeast. It did not have an absolute dominance over the other variants even at its peak, with less than 80% of share in the Southeast regions and even lower in other regions. The Delta variant initially surged in HHS Region 7 (IA, KS, MO, and NE), with Kansas City at its center, and persisted longest in this region. When the neighboring regions were experiencing a switch to another more prominent variant, the most prevalent variant in Kansas City remained Delta. The Delta variant had the longest time at its peak, holding nearly 100% share for over 5 months.
The Omicron subvariant BA.5 exhibited a unique nationwide spread, where all HHS regions experienced a surge of BA.5 and it became the most prevalent variant across all regions simultaneously. HHS Region 2 (NY, NJ, PR, VI) with New York City at its center had its BA.5 share decreased first and at the fastest rate. Variant XBB.1.5 showed a higher concentration in the Northeast, but we are not able to see how it behaved along the full span of the variant’s life due to insufficient data (Fig. 1 ). The set of graphs in Fig. 1 presents a temporal comparison of the prevalence of different COVID-19 variants across ten regions. Each graph corresponds to a specific variant, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D), and displays the share of each variant over time, as evidenced by the x-axis denoting time from early 2021 to early 2023. The y-axis represents the proportion of the variant in the population, ranging from 0 to 1 (0–100%). Each line within a graph represents one of the ten regions, with color coding used to differentiate between them. The graphs allow for the observation of trends in variant dominance, showing how quickly each variant became prevalent and subsequently declined in each region. The regional dynamics of the share indicate that the speed and patterns of the virus’ surge and diminishment differ across regions for each variant. To understand the contributing factors, we then used the RF regressor to predict the share of each of the four variants across the HHS regions and identify the important predictors. Our models exhibited excellent predictive accuracy, well surpassing the 0.72 R² value for a mixed-variant baseline. The models for Alpha variant and Omicron subvariant BA.5 both displayed high accuracy rates, with R² = 0.94 and R² = 0.93, respectively, followed by Omicron subvariant XBB.1.5 with R² = 0.92 and Delta variant with R² = 0.89. 
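To show how the reported metrics relate to the time-ordered evaluation described in the Methods, the following is a minimal sketch assuming scikit-learn's `TimeSeriesSplit` with `n_splits=10`. The weekly series, the two predictors, and the model settings are synthetic and hypothetical; the study's exact fold construction is not specified beyond "10 consecutive folds", so this is one plausible reading rather than the authors' code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)

# Synthetic weekly data, already sorted by time (hypothetical values).
n_weeks = 220
t = np.arange(n_weeks)
X = np.column_stack([np.sin(t / 8.0), rng.uniform(0.0, 1.0, n_weeks)])
y = 0.5 + 0.4 * np.sin(t / 8.0) + rng.normal(0.0, 0.03, n_weeks)

splitter = TimeSeriesSplit(n_splits=10)
fold_metrics = []
for train_idx, test_idx in splitter.split(X):
    # Every training index precedes every test index, so no future
    # observation leaks into the fit.
    assert train_idx.max() < test_idx.min()

    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])

    mse = mean_squared_error(y[test_idx], pred)
    fold_metrics.append({
        "mse": mse,
        "rmse": float(np.sqrt(mse)),  # RMSE is the square root of MSE
        "mae": mean_absolute_error(y[test_idx], pred),
        "r2": r2_score(y[test_idx], pred),
    })

for i, m in enumerate(fold_metrics, start=1):
    print(i, {k: round(v, 4) for k, v in m.items()})
```

Because RMSE is the square root of MSE, the two columns of a metrics table can be cross-checked against each other; for example, an MSE of 0.006641 corresponds to an RMSE of about 0.0815.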
Table 1 provides a summary of the performance metrics for predictive models of various COVID-19 variants over a specific time frame. Each row represents a different variant, including B.1.1.7, B.1.617.2, BA.5, XBB.1.5, and a category labeled ‘Other’ (mixed variants), which served as a baseline for comparison purposes.

Table 1 Model Performance Metrics for COVID-19 Variant Predictions

Start Date   End Date     Variant Name   Data Length   MSE        RMSE       MAE        R-squared
2021-01-02   2021-10-30   B.1.1.7        4398          0.006641   0.081491   0.033258   0.936054
2021-01-30   2023-02-11   B.1.617.2      33357         0.021188   0.145561   0.031952   0.885554
2021-09-25   2023-02-11   BA.5           8032          0.010736   0.103613   0.025812   0.925882
2022-10-22   2023-02-11   XBB.1.5        210           0.006165   0.078517   0.040888   0.920208
2021-01-02   2023-02-11   Other          38751         0.016741   0.129385   0.049647   0.715303

MSE, Mean Squared Error; RMSE, Root Mean Squared Error; MAE, Mean Absolute Error

Our findings indicate a complex interplay between environmental factors and the spread of different variants. The top features identified by the RF regressor that could affect the spreading patterns include temperature, UV index, ozone value, and air quality index. Each variant was associated with a specific set of environmental conditions. For instance, the Alpha variant showed a strong correlation with the air quality index and the temperature. The Delta variant exhibited a significant relationship with ozone density. Similarly, the Omicron subvariant BA.5 demonstrated a connection with the UV index. Lastly, the Omicron subvariant XBB.1.5 revealed associations with location, land area, and income (Fig. 2). Figure 2 presents a machine learning model’s feature importance for predicting the prevalence of four COVID-19 variants, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). The bar charts display the top 15 features that contribute to the model’s predictions, with the length of each bar indicating the relative importance of that feature.
The x-axis shows the relative importance, quantifying the strength of each feature’s influence on the model’s output, while the y-axis lists the features. Correlation results showed that the Alpha variant had a positive correlation with AQI and a negative correlation with temperature. The Delta variant had a negative correlation with ozone value and a positive correlation with temperature. The BA.5 variant had a positive correlation with the UV index, and the XBB.1.5 variant had a negative correlation with land area and a positive correlation with income (Fig. 3). The set of bar graphs in Fig. 3 represents the Spearman correlation coefficients between various environmental and demographic factors and the prevalence of the four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). The Spearman correlation is a nonparametric measure that assesses how well the relationship between two variables can be described using a monotonic function. For each variant, the factors are listed on the x-axis, and the y-axis shows the correlation coefficient, which ranges from −1 to 1. A value of 1 indicates a perfect positive correlation, −1 indicates a perfect negative correlation, and values closer to 0 indicate a weaker or no monotonic relationship. A detailed correlation graph of the BA.5 variant and its top-ranked factor, the UV index, is shown in Fig. 4. This collection of scatter plots depicts the relationship between the UV index and the prevalence share of the BA.5 COVID-19 variant across ten different regions. Each plot corresponds to a region, with the x-axis representing the UV index and the y-axis showing the variant’s share within the region. The points on each plot indicate individual observations or measurements. From this plot, we can see that there is a positive correlation between the BA.5 variant and the UV index, in agreement with Fig. 3. Decision tree plots were also generated to illustrate each factor’s impact on each variant, as depicted in Fig. 5.
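As a brief aside on the Spearman coefficient used in Fig. 3, the sketch below illustrates why it is described as a measure of monotonic rather than linear association. It assumes SciPy's `spearmanr`; the UV values and shares are made-up numbers chosen only to be monotonic and nonlinear, not observations from the study.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical UV index values and a share that rises monotonically
# but nonlinearly with them (illustrative numbers only).
uv = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
share = uv ** 3 / 8.0 ** 3

# Spearman works on ranks, so any strictly increasing relationship
# gives +1 and any strictly decreasing relationship gives -1.
rho_up, _ = spearmanr(uv, share)
rho_down, _ = spearmanr(uv, -share)

# Pearson measures linear association and is below 1 for this curve.
pearson = np.corrcoef(uv, share)[0, 1]

print(f"Spearman (increasing): {rho_up:.3f}")
print(f"Spearman (decreasing): {rho_down:.3f}")
print(f"Pearson  (increasing): {pearson:.3f}")
```

A coefficient near 0 therefore signals the absence of a consistent upward or downward trend, which does not rule out non-monotonic patterns.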
Figure 5 depicts decision tree models for the prediction of the prevalence of the four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). In a decision tree, each internal node represents a “test” on an attribute, each branch represents the outcome of the test, and each leaf node holds a prediction (in a regression tree, a predicted value such as a variant share rather than a class label). This graph provides a detailed explanation of the decision-making process employed by the decision tree, serving as a tool for visualizing the thresholds.
Discussion
The impact of environmental factors on the COVID-19 virus has been reported in the literature. A study conducted by Gunthe et al. in 2022 unveiled a notable correlation between environmental factors, specifically temperature and UV index, and COVID-19 cases. Their findings indicate a generally inverse relationship between the UV index and the occurrence of COVID-19 cases, suggesting that higher UV exposure may contribute to lower transmission rates. Additionally, they observed a clustering of COVID-19 cases in temperature ranges between 3°C and 12°C, highlighting a specific thermal window where the virus may find more favorable conditions for transmission [27]. In parallel, research conducted by Pérez-Gilaberte et al. in Spain has corroborated these observations, demonstrating that elevated UV index levels and temperatures are associated with a decrease in COVID-19 incidence [28]. This reinforces the hypothesis that certain environmental conditions may indeed play a role in mitigating the spread of the virus. Beyond climatic factors, there has been an exploration into the therapeutic potential of ozone. A meta-analysis of ozone adjuvant therapy concluded that it may positively affect polymerase chain reaction (PCR) test outcomes and serum lactate dehydrogenase levels, in addition to potentially reducing COVID-19 mortality [29]. This suggests that ozone therapy could offer a complementary treatment avenue for managing the disease.
Moreover, the effectiveness of quarantine measures in curtailing COVID-19 spread has been consistently supported across studies. These findings collectively underline the multifaceted nature of managing the pandemic, where environmental conditions, therapeutic interventions, and public health measures such as quarantine, all converge to influence the trajectory of COVID-19 incidence rates [ 30 , 31 ]. The above studies have highlighted that specific environmental factors may influence the occurrence of COVID-19, as well as the prevalence of its variants. For instance, Gunthe’s research suggested that the UV index could potentially neutralize the COVID-19 virus. However, our investigation has revealed that the BA.5 subvariant exhibits a unique relationship with the UV index, showing a positive correlation. This indicates that BA.5 may be less susceptible to the effects of UV radiation compared to other variants. Our analysis provides a more detailed exploration of these dynamics. Further examination of the correlation between the BA.5 variant’s prevalence and UV index values reveals nuanced patterns (Fig. 4 ). At lower UV index levels (below 2), the share of BA.5 is minimal. At intermediate UV index levels, the distribution of BA.5 shows a distinct hollow pattern, while at higher UV index levels (above 6) the share of BA.5 significantly increases. This suggests that higher UV index levels may markedly inhibit the survival of other variants, while BA.5 manages to thrive. This observation points towards the BA.5 variant’s potential resilience in environments with higher UV radiation, which provides some novel insight into the characteristics of this specific COVID-19 variant. The Alpha and Delta variants exhibit distinct correlations with air quality Index (AQI) and ozone concentration; Alpha correlates strongly with AQI at ozone sites, while Delta correlates with ozone levels. Spearman Correlation (Fig. 
3) reveals Alpha’s positive correlation and Delta’s negative correlation with ozone, suggesting that high levels of ozone reduce Delta’s share while Alpha remains unaffected. These findings align with Jafari-Oori’s research on ozone therapy’s benefits for COVID patients [29], with data collected before April 2022, when Delta dominated variant shares. The Omicron subvariant XBB.1.5, on the other hand, was strongly correlated with location, indicating that it is a rather localized variant and that environmental factors play a less significant role than for the other three variants. Many previous studies have found differences in severity among COVID-19 variants. Yuan et al. conducted a comprehensive comparison across several metrics, concluding that the Omicron variant exhibits lower severity in all assessed categories compared to other COVID-19 variants [32]. A similar finding was published by Varea-Jiménez, demonstrating a higher severity of COVID-19 cases caused by the Delta variant than by either Alpha or Omicron [33]. These findings underscore the variability in severity among different COVID-19 variants; therefore, disease management strategies tailored to each variant’s characteristics may also be worth considering in public health planning.
Limitations
One potential limitation arises from our focus solely on weather, air quality, travel, and demographics within specific regions at a given time. This narrow scope may not provide a comprehensive understanding of the entire ecosystem in which these variants thrive, potentially leading to overlooked confounding factors. Additionally, our data represent the relative share of variants rather than absolute values, restricting our ability to isolate and analyze each variant across all HHS regions.
Moreover, it is important to note that while we identify associations through feature importance and correlation analysis, these associations do not imply a causal relationship between the factors and the spread of variants. These considerations underscore the need for a cautious interpretation of our findings and the exploration of broader contextual factors in future studies.
Conclusions
This research provides a comprehensive analysis of the regional distribution of COVID-19 variants, offering critical insights for devising targeted public health strategies. By utilizing machine learning, the study uncovers the complex factors contributing to variant spread and reveals how specific factors contribute to variant prevalence, offering insights crucial for pandemic management.
Abbreviations
WHO: World Health Organization
RF: Random Forest
US: United States
HHS: Health and Human Services
MSE: Mean Squared Error
RMSE: Root Mean Squared Error
MAE: Mean Absolute Error
N3C: National COVID Cohort Collaborative
VOC: variants of concern
VOI: variants of interest
Declarations
Author Contribution
FDS initiated the idea. XZ directed and designed the study. LH participated in the study design, analyzed the data, and drafted the manuscript. XZ critically reviewed and revised the manuscript. All authors read and approved the final manuscript.
Acknowledgement
We would like to thank Ian Weimer, MS, a data scientist at Boston Strategic Partners Inc., and Sian Bissell O’Sullivan, MS, a healthcare associate at Boston Strategic Partners Inc., for their expertise and editorial assistance.
Data Availability
The data sources chosen were the National COVID Cohort Collaborative (N3C) database, the Bureau of Transportation Statistics, World Weather Online, the United States Environmental Protection Agency, and US Census data.
Funding
This study was funded by Boston Strategic Partners, Inc.
References
1. World Health Organization. COVID-19 Dashboard. https://data.who.int/dashboards/covid19/cases?n=c. Accessed 1 April 2024.
2. Wise J. Covid-19: WHO declares end of global health emergency. BMJ. 2023;381:1041.
3. Otto SP, Day T, Arino J, Colijn C, Dushoff J, Li M. The origins and potential future of SARS-CoV-2 variants of concern in the evolving COVID-19 pandemic. Curr Biol. 2021;31(14):R918-R929.
4. Choi JY, Smith DM. SARS-CoV-2 Variants of Concern. Yonsei Med J. 2021;62(11):961-968.
5. Fattahi M, Keyvanshokooh E, Kannan D, Govindan K. Resource planning strategies for healthcare systems during a pandemic. Eur J Oper Res. 2023;304(1):192-206.
6. Ivorra B, Ferrández MR, Vela-Pérez M, Ramos AM. Mathematical modeling of the spread of the coronavirus disease 2019 (COVID-19) taking into account the undetected infections: The case of China. Commun Nonlinear Sci Numer Simul. 2020;88:105303.
7. Hunter E, Namee BM, Kelleher JD. A Model for the Spread of Infectious Diseases in a Region. Int J Environ Res Public Health. 2020;17(9):3119.
8. Kumaresan V, Balachandar N, Poole SF, Myers LJ, Varghese P, Washington V. Fitting and validation of an agent-based model for COVID-19 case forecasting in workplaces and universities. PLoS One. 2023;18(3):e0283517.
9. Ojokoh BA, Sarumi OA, Salako KV, Gabriel AJ, Taiwo AE, Johnson OV. Modeling and predicting the spread of COVID-19: a continental analysis. Data Science for COVID-19. 2022; doi: 10.1016/B978-0-323-90769-9.00039-6.
10. Nguyen TK, Hoang NH, Currie G, Vu HL. Enhancing Covid-19 virus spread modeling using an activity travel model. Transp Res Part A Policy Pract. 2022;161:186-199.
11. Yang C, Wang J. Modeling the transmission of COVID-19 in the US - A case study. Infect Dis Model. 2020;6:195-211.
12. Bin S, Sun G, Chen CC. Spread of Infectious Disease Modeling and Analysis of Different Factors on Spread of Infectious Disease Based on Cellular Automata. Int J Environ Res Public Health. 2019;16(23):4683.
13. Altieri N, Barter RL, Duncan J, Dwivedi R, Kumbier K, Li X. Curating a COVID-19 Data Repository and Forecasting County-Level Death Counts in the United States. Harvard Data Science Review. Special Issue 1; doi: 10.1162/99608f92.1d4e0dae.
14. Wang J. Mathematical models for COVID-19: applications, limitations, and potentials. J Public Health Emerg. 2020;4:9.
15. Mayer LM, Strich JR, Kadri SS, Lionakis MS, Evans NG, Prevots DR. Machine Learning in Infectious Disease for Risk Factor Identification and Hypothesis Generation: Proof of Concept Using Invasive Candidiasis. Open Forum Infect Dis. 2022;9(8):ofac401.
16. Wiemken TL, Kelley RR. Machine Learning in Epidemiology and Health Outcomes Research. Annu Rev Public Health. 2020;41:21-36.
17. Hernandez Carballo I, Bakola M, Stuckler D. The impact of air pollution on COVID-19 incidence, severity, and mortality: A systematic review of studies in Europe and North America. Environ Res. 2022;215(Pt 1):114155.
18. Prata DN, Rodrigues W, Bermejo PH. Temperature significantly changes COVID-19 transmission in (sub)tropical cities of Brazil. Sci Total Environ. 2020;729:138862.
19. Hamidi S, Sabouri S, Ewing R. Does Density Aggravate the COVID-19 Pandemic? Journal of the American Planning Association. doi: 10.1080/01944363.2020.1777891.
20. Dowd JB, Andriano L, Brazel DM, Rotondi V, Block P, Ding X. Demographic science aids in understanding the spread and fatality rates of COVID-19. Proc Natl Acad Sci U S A. 2020;117(18):9696-9698.
21. Alfred R, Obit JH. The roles of machine learning methods in limiting the spread of deadly diseases: A systematic review. Heliyon. 2021;7(6):e07371.
22. Gupta P, Gupta V, Singh CM, Singhal L. Emergence of COVID-19 Variants: An Update. Cureus. 2023;15(7):e41295.
23. Islam MR, Shahriar M, Bhuiyan MA. The latest Omicron BA.4 and BA.5 lineages are frowning toward COVID-19 preventive measures: A threat to global public health. Health Sci Rep. 2022;5(6):e884.
24. Özen F. Random forest regression for prediction of Covid-19 daily cases and deaths in Turkey. Heliyon. 2024;10(4):e25746.
25. Kolozali S, White SL, Norris S, Fasli M, Van Heerden A. Explainable Early Prediction of Gestational Diabetes Biomarkers by Combining Medical Background and Wearable Devices: A Pilot Study with a Cohort Group in South Africa. IEEE J Biomed Health Inform. 2024; doi: 10.1109/JBHI.2024.3361505.
26. Cerqueira V, Torgo L, Mozetič I. Evaluating time series forecasting models: an empirical study on performance estimation methods. Mach Learn. 2020;109. doi: 10.1007/s10994-020-05910-7.
27. Gunthe SS, Swain B, Patra SS, Amte A. On the global trends and spread of the COVID-19 outbreak: preliminary assessment of the potential relation between location-specific temperature and UV index. Z Gesundh Wiss. 2022;30(1):219-228.
28. Pérez-Gilaberte JB, Martín-Iranzo N, Aguilera J, Almenara-Blasco M, de Gálvez MV, Gilaberte Y. Correlation between UV Index, Temperature and Humidity with Respect to Incidence and Severity of COVID 19 in Spain. Int J Environ Res Public Health. 2023;20(3):1973.
29. Jafari-Oori M, Vahedian-Azimi A, Ghorbanzadeh K, Sepahvand E, Dehi M, Ebadi A. Efficacy of ozone adjuvant therapy in COVID-19 patients: A meta-analysis study. Front Med (Lausanne). 2022;9:1037749.
30. Feiz AM, Babaei-Pouya A, Poursadeqiyan M. The health effects of quarantine during the COVID-19 pandemic. Work. 2020;67(3):523-527.
31. Auranen K, Shubin M, Erra E, et al. Efficacy and effectiveness of case isolation and quarantine during a growing phase of the COVID-19 epidemic in Finland. Sci Rep. 2023;13:298.
32. Yuan Z, Shao Z, Ma L, Guo R. Clinical Severity of SARS-CoV-2 Variants during COVID-19 Vaccination: A Systematic Review and Meta-Analysis. Viruses. 2023;15(10):1994.
33. Varea-Jiménez E. Comparative severity of COVID-19 cases caused by Alpha, Delta or Omicron SARS-CoV-2 variants and its association with vaccination. Enfermedades Infecciosas y Microbiología Clínica. EIMC. 2022; doi: 10.1016/j.eimc.2022.11.003.
Additional Declarations
No competing interests reported.
{"props":{"pageProps":{"initialData":{"identity":"rs-4208741","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":289022456,"identity":"2652663b-dc53-4fba-8f76-87c16513e6d6","order_by":0,"name":"Lejia Hu","email":"","orcid":"","institution":"Boston Strategic Partners, Inc","correspondingAuthor":false,"prefix":"","firstName":"Lejia","middleName":"","lastName":"Hu","suffix":""},{"id":289022457,"identity":"a97556ff-8f5e-42b7-bbce-1a773bd135d6","order_by":1,"name":"Xuan Zhang","email":"","orcid":"","institution":"Boston Strategic Partners, Inc","correspondingAuthor":false,"prefix":"","firstName":"Xuan","middleName":"","lastName":"Zhang","suffix":""},{"id":289022458,"identity":"5879fe6b-998a-4066-b3b2-e3b2c5937bb7","order_by":2,"name":"Fabian 
D’Souza","email":"","orcid":"","institution":"Boston Strategic Partners, Inc","correspondingAuthor":true,"prefix":"","firstName":"Fabian","middleName":"","lastName":"D’Souza","suffix":""}],"badges":[],"createdAt":"2024-04-02 20:59:22","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4208741/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4208741/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":54342992,"identity":"f0aad9ca-2546-4d93-b120-15e90f9789fe","added_by":"auto","created_at":"2024-04-09 06:03:14","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":547689,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eThe Temporal Evolution of COVID-19 Variant Prevalence Across HHS Regions.\u003c/strong\u003e This set of graphs presents a temporal comparison of the prevalence of different COVID-19 variants across ten regions. Each graph corresponds to a specific variant, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D), and displays the share of each variant over time, as evidenced by the x-axis denoting time from early 2021 to early 2023. The y-axis represents the proportion of the variant in the population, ranging from 0 to 1 (0% to 100%). Each line within a graph represents one of the ten regions, with color coding used to differentiate between them. 
The graphs allow for the observation of trends in variant dominance, showing how quickly each variant became prevalent and subsequently declined in each region.\u003c/p\u003e","description":"","filename":"Figure1.png","url":"https://assets-eu.researchsquare.com/files/rs-4208741/v1/254549dfaa18efa47d1ff155.png"},{"id":54342990,"identity":"fcda7409-3603-4c74-aa2f-bb0b3527bd33","added_by":"auto","created_at":"2024-04-09 06:03:14","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":368910,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDifferential Feature Impact on COVID-19 Variant Predictive Models.\u003c/strong\u003e This graph presents a machine learning model's feature importance for predicting the prevalence of four COVID-19 variants, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). The bar charts display the top 15 features that contribute to the model's predictions, with the length of each bar indicating the relative importance of that feature. The x-axis shows the relative importance, quantifying the strength of each feature's influence on the model's output, while the y-axis lists the features.\u003c/p\u003e","description":"","filename":"Figure2.png","url":"https://assets-eu.researchsquare.com/files/rs-4208741/v1/648f3df5d440456711555574.png"},{"id":54342991,"identity":"311790a3-ff85-4593-8c64-85a80b75a736","added_by":"auto","created_at":"2024-04-09 06:03:14","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":361051,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eSpearman Correlation Analysis for Environmental and Demographic Factors with COVID-19 Variants.\u003c/strong\u003e This set of bar graphs represents the Spearman’s correlation coefficients between various environmental and demographic factors and the prevalence of four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). 
Spearman’s correlation is a nonparametric measure that assesses how well the relationship between two variables can be described using a monotonic function. For each variant, the factors are listed along the x-axis, and the y-axis shows the correlation coefficient, which ranges from -1 to 1. A value of 1 indicates a perfect positive monotonic correlation, -1 indicates a perfect negative monotonic correlation, and values closer to 0 indicate a weak or no monotonic relationship.\u003c/p\u003e","description":"","filename":"Figure3.png","url":"https://assets-eu.researchsquare.com/files/rs-4208741/v1/c82fe0a906c6ae89884ca6a5.png"},{"id":54342994,"identity":"6c51d772-d5b6-499a-9c86-cdc8fc5c367e","added_by":"auto","created_at":"2024-04-09 06:03:14","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":894475,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCorrelation Scatter Plots of UV Index and BA.5 Variant Prevalence by Region.\u003c/strong\u003e This collection of scatter plots depicts the relationship between the UV index and the prevalence share of the BA.5 COVID-19 variant across ten different regions. Each plot corresponds to a region, with the x-axis representing the UV index and the y-axis showing the variant's share within the region. The points on each plot indicate individual observations or measurements.\u003c/p\u003e","description":"","filename":"Figure4.png","url":"https://assets-eu.researchsquare.com/files/rs-4208741/v1/ee2b38847acc5e12d6f08cdb.png"},{"id":54342993,"identity":"386e62d7-48e0-4ce1-aec6-8faed3c82a1f","added_by":"auto","created_at":"2024-04-09 06:03:14","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":978851,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDecision Tree Analysis for Predicting COVID-19 Variant Prevalence\u003c/strong\u003e. 
This visualization depicts decision tree models for the prediction of the prevalence of four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). In a decision tree, each internal node represents a \"test\" on an attribute, each branch represents the outcome of the test, and each leaf node holds the prediction (in a regression tree, a predicted value such as a variant share).\u003c/p\u003e","description":"","filename":"Figure5.png","url":"https://assets-eu.researchsquare.com/files/rs-4208741/v1/3e5ad7c84f460e7a05cee504.png"},{"id":54377180,"identity":"71b8869a-6f1d-46a7-a48e-ecf7f738a065","added_by":"auto","created_at":"2024-04-09 14:28:13","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1260817,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4208741/v1/322a8486-c229-438b-a9c3-cccf8f5cc9fb.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Machine Learning Insights into Regional Dynamics and Prevalence of COVID-19 Variants in US Health and Human Services Regions","fulltext":[{"header":"Background","content":"\u003cp\u003eThe COVID-19 pandemic, caused by the novel coronavirus SARS-CoV-2, emerged in late 2019 and rapidly spread globally. Characterized by respiratory symptoms, the virus led to widespread illness, overwhelmed healthcare systems, and resulted in extensive social and economic disruptions. Since its initial emergence, the World Health Organization (WHO) has documented over 774\u0026nbsp;million COVID-19 cases and 7\u0026nbsp;million COVID-related deaths globally as of January 14, 2024 [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. 
Efforts to control the pandemic included lockdowns, vaccination campaigns, and public health measures, with ongoing challenges and impacts across different regions.\u003c/p\u003e \u003cp\u003eAlthough the WHO declared the end of the COVID-19 \u0026ldquo;global health emergency\u0026rdquo; in May 2023 [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e], the virus remains in circulation and exhibits periodic fluctuations characterized by peaks and troughs, representing phases of increased and decreased infection rates. One key factor hampering efforts to contain the virus is the emergence of new mutations. Different variants of the SARS-CoV-2 virus exhibit changes in key genetic material that give rise to distinct characteristics. Some variants spread more easily, evade immunity gained through previous infection or vaccination, or cause more severe illness [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Health bodies globally continue to track COVID-19 mutations, designating some mutations as \u0026ldquo;variants of concern\u0026rdquo; (VOC) and \u0026ldquo;variants of interest\u0026rdquo; (VOI). A VOI is a variant with genetic alterations affecting virus characteristics, causing notable community transmission across multiple countries, and showing increasing prevalence and cases over time, posing an emerging risk to global public health [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. A VOC meets the criteria for a VOI and has been established as a threat to health concerning transmissibility, health outcomes, and/or vaccination or therapeutic efficacy [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. 
Monitoring and understanding these variants are crucial for public health efforts to adopt strategies such as vaccine updates and preventive measures to effectively combat the evolving nature of the virus.\u003c/p\u003e \u003cp\u003eIn pandemics, authorities can increase the production of medical supplies, share resources, or redistribute demand to manage healthcare resources [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Regional analysis, in the context of COVID-19, can help fulfill this objective by enabling the efficient allocation of resources, including medical supplies and vaccines, by identifying regions with higher infection rates or vulnerable populations. Additionally, regional data guides policymakers in making informed decisions, allowing for the adoption of nuanced, region-specific policies and resource management that balance virus containment with socioeconomic considerations. Regional analyses may uncover predictors of variant prevalence and contribute to a more data-driven approach to pandemic management by investigating regional distribution and trends of specific COVID-19 variants.\u003c/p\u003e \u003cp\u003eThe primary approach for studying and predicting the spread of infectious diseases involves the creation of mathematical models. The two most commonly used methods for this are equation-based modeling, which uses differential equations to describe population-level dynamics [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], and agent-based modeling [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], which simulates individual interactions [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
A review of modeling papers found most models were compartmental epidemiological models [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e], such as susceptible\u0026ndash;infectious\u0026ndash;recovered models [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e], or a modified version, susceptible-exposed-infected-recovered models, which focus on a human-to-human transmission pathway [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. By establishing models that reflect the process, laws, and trends of infectious disease spread, this approach allows for a thorough analysis of dynamic characteristics, providing a solid foundation for understanding the causes and key factors of disease transmission and devising optimal prevention and control strategies [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eWhile machine learning has been used to model COVID-19 [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], this approach is used far less frequently than mathematical modeling [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. Machine learning can enhance traditional mathematical epidemic models by leveraging extensive datasets, such as epidemic, genetic, demographic, geospatial, and mobility data [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. These datasets often surpass the scale that conventional mathematical models can effectively handle, allowing for more comprehensive and improved modeling. 
Additionally, machine learning does not have to assume any underlying relationships between variables [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThis study analyzed the influence of various regional or temporal factors on the percentage of four key COVID-19 variants over time from different Health and Human Services (HHS) regions in the United States (US) using the machine learning technique of random forest (RF) regression. This represents a novel application of a well-established technique, by using RF to analyze the share of COVID-19 variants at a regional level to identify predictors of variant prevalence. The results of this study aim to contribute to a more data-driven approach to pandemic management.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eData sources and research objects\u003c/h2\u003e \u003cp\u003eFeature selection is a key process to enhance machine learning model efficiency, interpretability, and generalization by excluding irrelevant or redundant features and minimizing the likelihood of overfitting [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Therefore, when selecting datasets to be used for model training, they were chosen to reflect factors that are important in disease spread and factors that have been empirically linked to the spread of COVID-19. Important variables highlighted in the literature included pollution and air quality [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], weather [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], travel and transport [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], and population demographics [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
Additionally, a systematic review of the literature regarding machine learning modeling for detection and prediction of disease outbreaks found that the most commonly used databases were epidemiological and meteorological databases [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. These factors were considered to determine the appropriate dataset selections for model training.\u003c/p\u003e \u003cp\u003eThe data sources chosen were the National COVID Cohort Collaborative (N3C) database, the Bureau of Transportation Statistics, World Weather Online, the United States Environmental Protection Agency, and US Census data. The N3C database provides a comprehensive view of COVID-19 variants, their share across various HHS regions, and how these variants evolve from week to week. The term \"share\" represents the relative percentage of a specific variant's prevalence at a particular time within a specific region, allowing tracking of variant evolution and spread. The ten HHS regions are administrative divisions used by the US government to manage and coordinate healthcare and public health activities. These regions serve the following important purposes: coordination of healthcare programs, response to public health emergencies, and healthcare resource allocation. The other databases used in this study consist of publicly available data, including travel, weather, demographics, and air quality information. Collectively, these databases form a rich source of important variables that facilitate an in-depth study of COVID-19 variants and various environmental and societal factors.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eOutcomes and predictor variables\u003c/h2\u003e \u003cp\u003eFour COVID-19 variants, Delta, Alpha, XBB.1.5, and Omicron BA.5, were selected for the analysis. 
Delta, Alpha, and XBB.1.5 were declared VOCs by the WHO in April 2021, December 2020, and January 2023, respectively [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. Omicron BA.5 was categorized as a VOC in May 2022 by the European Centre for Disease Prevention and Control [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. The outcome of interest was the share of each variant in each HHS region.\u003c/p\u003e \u003cp\u003ePredictor variables used for the model training include travel behaviors (\u003cem\u003ee.g. travel frequency and number of people staying at home\u003c/em\u003e), atmospheric levels of NO2, SO2, CO, and Ozone, income levels, educational attainment, population density, air temperature, humidity, atmospheric pressure, and UV index.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eQuality control\u003c/h2\u003e \u003cp\u003eThe primary goal of data preprocessing was to prepare collected datasets for seamless integration into the N3C database. This began with data cleaning on four publicly sourced datasets covering weather, air quality, travel, and demographics. Each dataset underwent merging, cleaning, and aggregation, employing techniques such as summing and averaging across different regions, depending on the nature of the variable. For instance, land areas were summed up for each HHS region, while incomes were averaged.\u003c/p\u003e \u003cp\u003eAfter the preprocessing, each dataset was merged with the N3C dataset based on the HHS region and the time. The resulting dataset comprised 103 variables summarizing regional characteristics.\u003c/p\u003e \u003cp\u003eTo refine the dataset, seven variables including housing units and population estimates from 2000 were removed due to outdated information or observed collinearity. 
Additionally, two time-related variables, sun hour and moon illumination, were eliminated.\u003c/p\u003e \u003cp\u003eNext, datasets were created for each of the four selected variants as well as for the \u0026lsquo;Other\u0026rsquo; group (a group of mixed variants in N3C), which served as a baseline.\u003c/p\u003e \u003cp\u003eData normalization was undertaken for each of the five datasets using the min-max scaler method. Once scaled, all values of the features range between 0 and 1, preventing bias in the model's learning process.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eSample size\u003c/h2\u003e \u003cp\u003eThe N3C data contained records of variant share from 2021-01-02 to 2023-02-11. Other databases provided records from December 2020 to October 2023 to cover the full period of the N3C database. The numbers of records for the four variants were 210 for XBB.1.5, 33357 for B.1.617.2, 4398 for B.1.1.7, and 8032 for BA.5. The sizes were large enough to provide sufficient training and testing data.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eMachine learning methods\u003c/h2\u003e \u003cp\u003eThe machine learning algorithm used in this study was RF Regressor, which includes a built-in feature importance function, making it an excellent choice for our research purpose. Feature importance is a way to understand the contribution of each feature (variable) in predicting the target variable (outcome). RF is an ensemble algorithm that combines multiple decision trees to make predictions. The importance of a feature is determined by how much it contributes across all trees in the forest.\u003c/p\u003e \u003cp\u003eStudies have shown that RF is a highly accurate method for modeling diseases. 
\u0026Ouml;zen used the RF Regressor to predict daily COVID-19 cases and deaths in Turkey [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. In the same year, Kolozali and colleagues used this algorithm to predict biomarker values indicative of gestational diabetes mellitus, illustrating the applicability of the RF Regressor across medical fields [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAfter initial feature selection, all remaining features were included in the model, which allowed for a comprehensive understanding of each respective feature contribution. This is feasible with RF because it is robust and less susceptible to overfitting due to the ensemble nature of the algorithm. Thus, we were able to fully examine the contrast between less important features and those making the largest contributions to the model.\u003c/p\u003e \u003cp\u003eTo prevent data leakage, we used time series splits to divide the training set and testing set [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. In the time series split, the dataset was split into 10 consecutive folds for cross-validation purposes. This ensures that only past data are used in the training, making the predictions on future data more robust and accurate.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eEstablishment and validation of the models\u003c/h2\u003e \u003cp\u003eTo evaluate the model\u0026rsquo;s performance, we employed four common metrics in regression analysis: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared (R\u0026sup2;). These metrics were evaluated with the testing set, providing insights into the model\u0026rsquo;s ability to capture underlying patterns. 
Feature importance graphs, Spearman\u0026rsquo;s correlation graphs, and tree plots were used to visualize the patterns as well as to aid interpretations.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cp\u003eWe analyzed the share of the four selected variants in each HHS region during the study period. The Alpha variant predominantly surfaced in the Southeast. It did not have an absolute dominance over the other variants even at its peak, with less than 80% of share in the Southeast regions and even lower in other regions. The Delta variant initially surged in HHS Region 7 (IA, KS, MO, and NE), with Kansas City at its center, and persisted longest in this region. When the neighboring regions were experiencing a switch to another more prominent variant, the most prevalent variant in Kansas City remained Delta. The Delta variant stayed at its peak the longest, holding nearly 100% share for over 5 months. The Omicron subvariant BA.5 exhibited a unique nationwide spread, where all HHS regions experienced a surge of BA.5 and it became the most prevalent variant across all regions simultaneously. HHS Region 2 (NY, NJ, PR, VI), with New York City at its center, saw its BA.5 share decrease first and at the fastest rate. Variant XBB.1.5 showed a higher concentration in the Northeast, but we are not able to see how it behaved along the full span of the variant\u0026rsquo;s life due to insufficient data (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The set of graphs in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e presents a temporal comparison of the prevalence of different COVID-19 variants across ten regions. Each graph corresponds to a specific variant, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D), and displays the share of each variant over time, with the x-axis denoting time from early 2021 to early 2023. 
The y-axis represents the proportion of the variant in the population, ranging from 0 to 1 (0\u0026ndash;100%). Each line within a graph represents one of the ten regions, with color coding used to differentiate between them. The graphs allow for the observation of trends in variant dominance, showing how quickly each variant became prevalent and subsequently declined in each region.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe regional dynamics of the share indicate that the speed and patterns of the virus\u0026rsquo;s surge and decline differ across regions for each variant. To understand the contributing factors, we then used the RF regressor to predict the share of each of the four variants across the HHS regions and identify the important predictors. Our models exhibited high predictive accuracy, surpassing the R\u0026sup2; value of 0.72 for the mixed-variant baseline. The models for the Alpha variant and the Omicron subvariant BA.5 performed best, with R\u0026sup2; = 0.94 and R\u0026sup2; = 0.93, respectively, followed by Omicron subvariant XBB.1.5 with R\u0026sup2; = 0.92 and the Delta variant with R\u0026sup2; = 0.89. Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e provides a summary of the performance metrics for predictive models of various COVID-19 variants over a specific time frame. 
Each row represents a different variant, including B.1.1.7, B.1.617.2, BA.5, XBB.1.5, and a category labeled \u0026lsquo;Other\u0026rsquo; (mixed variants), which served as a baseline for comparison purposes.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eModel Performance Metrics for COVID-19 Variant Predictions\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026minus;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eStart Date\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEnd Date\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eVariant Name\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eData Length\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eMSE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c6\"\u003e \u003cp\u003eRMSE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eMAE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eR-Square\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2021-01-02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c2\"\u003e \u003cp\u003e2021-10-30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB.1.1.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e4398\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.006641\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.081491\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.033258\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.936054\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2021-01-30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c2\"\u003e \u003cp\u003e2023-02-11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eB.1.617.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e33357\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.021188\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.145561\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.031952\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" 
colname=\"c8\"\u003e \u003cp\u003e0.885554\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2021-09-25\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c2\"\u003e \u003cp\u003e2023-02-11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eBA.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e8032\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.010736\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.103613\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.025812\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.925882\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2022-10-22\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026minus;\" colname=\"c2\"\u003e \u003cp\u003e2023-02-11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eXBB.1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e210\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.006165\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.078517\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.040888\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.920208\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e2021-01-02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\"\u0026minus;\" colname=\"c2\"\u003e \u003cp\u003e2023-02-11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eOther\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e38751\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.016741\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.129385\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.049647\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.715303\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eMSE, Mean Squared Error; RMSE, Root Mean Squared Error; MAE, Mean Absolute Error\u003c/p\u003e \u003cp\u003eOur findings indicate a complex interplay between environmental factors and the spread of different variants. The top features identified by RF regressor that could affect the spreading patterns include temperature, UV index, ozone value, and air quality index. Each variant has specific favorable environmental conditions. For instance, the Alpha variant showed a strong correlation with the air quality index and the temperature. The Delta variant exhibited a significant relationship with ozone density. Similarly, the Omicron subvariant BA.5 demonstrated a connection with the UV index. Lastly, the Omicron subvariant XBB.1.5 revealed associations with location, land area, and income (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). 
Figure\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e presents a machine learning model\u0026rsquo;s feature importance for predicting the prevalence of four COVID-19 variants, labeled B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). The bar charts display the top 15 features that contribute to the model\u0026rsquo;s predictions, with the length of each bar indicating the relative importance of that feature. The x-axis shows the relative importance, quantifying the strength of each feature\u0026rsquo;s influence on the model\u0026rsquo;s output, while the y-axis lists the features.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eCorrelation results showed that the Alpha variant had a positive correlation with AQI and a negative correlation with temperature. The Delta variant had a negative correlation with the ozone (OZ) value and a positive correlation with temperature. The BA.5 variant had a positive correlation with the UV index, and the XBB.1.5 variant had a negative correlation with land area and a positive correlation with income (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). The set of bar graphs in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the Spearman correlation coefficients between various environmental and demographic factors and the prevalence of the four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). The Spearman correlation is a nonparametric measure that assesses how well the relationship between two variables can be described using a monotonic function. For each variant, the factors are shown on the x-axis and the correlation coefficient, which ranges from \u0026minus;1 to 1, on the y-axis. 
A value of 1 indicates a perfect positive correlation, \u0026minus;1 indicates a perfect negative correlation, and values closer to 0 indicate a weaker or no monotonic relationship.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eA detailed correlation graph of the BA.5 variant and its top related factor, the UV index, is depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e. This collection of scatter plots depicts the relationship between the UV index and the prevalence share of the BA.5 COVID-19 variant across the ten regions. Each plot corresponds to a region, with the x-axis representing the UV index and the y-axis showing the variant\u0026rsquo;s share within the region. The points on each plot indicate individual observations. These plots show a positive correlation between the BA.5 share and the UV index, in agreement with Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eDecision tree plots were also generated to illustrate each factor\u0026rsquo;s impact on each variant, as depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. This visualization depicts decision tree models for the prediction of the prevalence of the four COVID-19 variants: B.1.1.7 (A), B.1.617.2 (B), BA.5 (C), and XBB.1.5 (D). In a decision tree, each internal node represents a \u0026ldquo;test\u0026rdquo; on an attribute, each branch represents the outcome of the test, and each leaf node represents a predicted outcome (a class label in classification, or a numeric value such as the variant share in regression). This graph traces the decision-making process employed by the decision tree and serves as a tool for visualizing the thresholds.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThe impact of environmental factors on COVID-19 has been reported in the literature. 
A study conducted by Gunthe et al. in 2022 unveiled a notable correlation between environmental factors, specifically temperature and UV index, and COVID-19 cases. Their findings indicate a generally inverse relationship between the UV index and the occurrence of COVID-19 cases, suggesting that higher UV exposure may contribute to lower transmission rates. Additionally, they observed a clustering of COVID-19 cases in temperature ranges between 3\u0026deg;C and 12\u0026deg;C, highlighting a specific thermal window where the virus may find more favorable conditions for transmission [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn parallel, research conducted by P\u0026eacute;rez-Gilaberte in Spain has corroborated these observations, demonstrating that elevated UV index levels and temperatures are associated with a decrease in COVID-19 incidence [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. This reinforces the hypothesis that certain environmental conditions may indeed play a role in mitigating the spread of the virus.\u003c/p\u003e \u003cp\u003eBeyond climatic factors, there has been an exploration into the therapeutic potential of ozone. Various studies have systematically reviewed the application of ozone therapy, concluding that it may positively affect polymerase chain reaction (PCR) test outcomes and serum lactate dehydrogenase levels, in addition to potentially reducing COVID-19 mortality [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. This suggests that ozone therapy could offer a complementary treatment avenue for managing the disease.\u003c/p\u003e \u003cp\u003eMoreover, the effectiveness of quarantine measures in curtailing COVID-19 spread has been consistently supported across studies. 
These findings collectively underline the multifaceted nature of managing the pandemic, where environmental conditions, therapeutic interventions, and public health measures such as quarantine all converge to influence the trajectory of COVID-19 incidence rates [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eThe above studies have highlighted that specific environmental factors may influence the occurrence of COVID-19, as well as the prevalence of its variants. For instance, Gunthe et al. suggested that higher UV exposure may contribute to lower transmission of the virus. However, our investigation has revealed that the BA.5 subvariant exhibits a unique relationship with the UV index, showing a positive correlation. This indicates that BA.5 may be less susceptible to the effects of UV radiation compared to other variants. Our analysis provides a more detailed exploration of these dynamics.\u003c/p\u003e \u003cp\u003eFurther examination of the correlation between the BA.5 variant\u0026rsquo;s prevalence and UV index values reveals nuanced patterns (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). At lower UV index levels (below 2), the share of BA.5 is minimal. At intermediate UV index levels, the distribution of BA.5 shows a distinct hollow pattern, while at higher UV index levels (above 6) the share of BA.5 increases substantially. This suggests that higher UV index levels may markedly inhibit the survival of other variants, while BA.5 manages to thrive. 
This observation points towards the BA.5 variant\u0026rsquo;s potential resilience in environments with higher UV radiation, which provides some novel insight into the characteristics of this specific COVID-19 variant.\u003c/p\u003e \u003cp\u003eThe Alpha and Delta variants exhibit distinct correlations with the air quality index (AQI) and ozone concentration: Alpha correlates strongly with AQI at ozone sites, while Delta correlates with ozone levels. Spearman correlation analysis (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e) reveals Alpha\u0026rsquo;s positive correlation and Delta\u0026rsquo;s negative correlation with ozone, suggesting that high levels of ozone reduce Delta\u0026rsquo;s share whereas Alpha remains unaffected. These findings align with Jafari-Oori\u0026rsquo;s research on the benefits of ozone therapy for COVID-19 patients, which used data collected before April 2022, when Delta dominated variant shares. The Omicron subvariant XBB.1.5, on the other hand, was strongly correlated with location, indicating that it is a rather localized variant and that the ecosystem does not play as significant a role as it does for the other three variants.\u003c/p\u003e \u003cp\u003eMany previous studies have found differences in severity among COVID-19 variants. Yuan et al. conducted a comprehensive comparison across several metrics, concluding that the Omicron variant exhibits lower severity in all assessed categories compared to other COVID-19 variants [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. A similar finding was published by Varea-Jim\u0026eacute;nez, demonstrating a higher severity of COVID-19 cases caused by the Delta variant than by either Alpha or Omicron [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. 
These findings underscore the variability in severity among different COVID-19 variants; therefore, disease management strategies tailored to each variant\u0026rsquo;s characteristics may also be worth considering when addressing public health issues.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eLimitations\u003c/h2\u003e \u003cp\u003eOne potential limitation arises from our focus solely on weather, air quality, travel, and demographics within specific regions at a given time. This narrow scope may not provide a comprehensive understanding of the entire ecosystem in which these variants thrive, potentially leading to overlooked confounding factors. Additionally, our data represent the relative share of variants rather than absolute values, restricting our ability to isolate and analyze each variant across all HHS regions. Moreover, while we identify associations through feature importance and correlation analysis, these associations do not imply a causal relationship between the factors and the spread of variants. These considerations underscore the need for a cautious interpretation of our findings and the exploration of broader contextual factors in future studies.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eThis research provides a comprehensive analysis of the regional distribution of COVID-19 variants, offering critical insights for devising targeted public health strategies. 
By utilizing machine learning, the study uncovers the complex factors contributing to variant spread and reveals how specific factors contribute to variant prevalence, offering insights crucial for pandemic management.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eWHO\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eWorld Health Organization\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRF\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eUS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eUnited States\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eHHS\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eHealth and Human Services\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eMSE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMean Squared Error\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRMSE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRoot Mean Squared Error\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eMAE\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eMean Absolute Error\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eN3C\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eNational COVID Cohort 
Collaborative\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eVOC\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eVariants of Concern\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eVOI\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eVariants of Interest\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eFDS initiated the idea. XZ directed and designed the study. LH participated in the study design, analyzed the data, and drafted the manuscript. XZ critically reviewed and revised the manuscript. All authors read and approved the final manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eWe would like to thank Ian Weimer, MS, a data scientist at Boston Strategic Partners Inc., and Sian Bissell O\u0026rsquo;Sullivan, MS, a healthcare associate at Boston Strategic Partners Inc., for their expertise and editorial assistance.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe data sources chosen were the National COVID Cohort Collaborative (N3C) database, the Bureau of Transportation Statistics, World Weather Online, the United States Environmental Protection Agency, and US Census data.\u003c/p\u003e\n\u003cp\u003eFunding\u003c/p\u003e\n\u003cp\u003eThis study was funded by Boston Strategic Partners, Inc.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eWorld Health Organization. COVID-19 Dashboard. https://data.who.int/dashboards/covid19/cases?n=c. Accessed 1 April 2024.\u003c/li\u003e\n\u003cli\u003eWise J. Covid-19: WHO declares end of global health emergency. BMJ. 2023;381:1041. \u003c/li\u003e\n\u003cli\u003eOtto SP, Day T, Arino J, Colijn C, Dushoff J, Li M. 
The origins and potential future of SARS-CoV-2 variants of concern in the evolving COVID-19 pandemic. Curr Biol. 2021;31(14):R918-R929.\u003c/li\u003e\n\u003cli\u003eChoi JY, Smith DM. SARS-CoV-2 Variants of Concern. Yonsei Med J. 2021;62(11):961-968. \u003c/li\u003e\n\u003cli\u003eFattahi M, Keyvanshokooh E, Kannan D, Govindan K. Resource planning strategies for healthcare systems during a pandemic. Eur J Oper Res. 2023;304(1):192-206. \u003c/li\u003e\n\u003cli\u003eIvorra B, Ferr\u0026aacute;ndez MR, Vela-P\u0026eacute;rez M, Ramos AM. Mathematical modeling of the spread of the coronavirus disease 2019 (COVID-19) taking into account the undetected infections: The case of China. Commun Nonlinear Sci Numer Simul. 2020;88:105303.\u003c/li\u003e\n\u003cli\u003eHunter E, Namee BM, Kelleher JD. A Model for the Spread of Infectious Diseases in a Region. Int J Environ Res Public Health. 2020;17(9):3119. \u003c/li\u003e\n\u003cli\u003eKumaresan V, Balachandar N, Poole SF, Myers LJ, Varghese P, Washington V. Fitting and validation of an agent-based model for COVID-19 case forecasting in workplaces and universities. PLoS One. 2023;18(3):e0283517. \u003c/li\u003e\n\u003cli\u003eOjokoh BA, Sarumi OA, Salako KV, Gabriel AJ, Taiwo AE, Johnson OV. Modeling and predicting the spread of COVID-19: a continental analysis. Data Science for COVID-19. 2022; doi: 10.1016/B978-0-323-90769-9.00039-6. \u003c/li\u003e\n\u003cli\u003eNguyen TK, Hoang NH, Currie G, Vu HL. Enhancing Covid-19 virus spread modeling using an activity travel model. Transp Res Part A Policy Pract. 2022;161:186-199. \u003c/li\u003e\n\u003cli\u003eYang C, Wang J. Modeling the transmission of COVID-19 in the US - A case study. Infect Dis Model. 2020;6:195-211.\u003c/li\u003e\n\u003cli\u003eBin S, Sun G, Chen CC. Spread of Infectious Disease Modeling and Analysis of Different Factors on Spread of Infectious Disease Based on Cellular Automata. Int J Environ Res Public Health. 2019;16(23):4683. 
\u003c/li\u003e\n\u003cli\u003eAltieri N, Barter RL, Duncan J, Dwivedi R, Kumbier K, Li X. Curating a COVID-19 Data Repository and Forecasting County-Level Death Counts in the United States. Harvard Data Science Review. Special Issue 1; doi: 10.1162/99608f92.1d4e0dae.\u003c/li\u003e\n\u003cli\u003eWang J. Mathematical models for COVID-19: applications, limitations, and potentials. J Public Health Emerg. 2020;4:9. \u003c/li\u003e\n\u003cli\u003eMayer LM, Strich JR, Kadri SS, Lionakis MS, Evans NG, Prevots DR. Machine Learning in Infectious Disease for Risk Factor Identification and Hypothesis Generation: Proof of Concept Using Invasive Candidiasis. Open Forum Infect Dis. 2022;9(8):ofac401.\u003c/li\u003e\n\u003cli\u003eWiemken TL, Kelley RR. Machine Learning in Epidemiology and Health Outcomes Research. Annu Rev Public Health. 2020;41:21-36. \u003c/li\u003e\n\u003cli\u003eHernandez Carballo I, Bakola M, Stuckler D. The impact of air pollution on COVID-19 incidence, severity, and mortality: A systematic review of studies in Europe and North America. Environ Res. 2022;215(Pt 1):114155. \u003c/li\u003e\n\u003cli\u003ePrata DN, Rodrigues W, Bermejo PH. Temperature significantly changes COVID-19 transmission in (sub)tropical cities of Brazil. Sci Total Environ. 2020;729:138862. \u003c/li\u003e\n\u003cli\u003eHamidi S, Sabouri S, Ewing R. Does Density Aggravate the COVID-19 Pandemic?. Journal of the American Planning Association. doi: 10.1080/01944363.2020.1777891\u003c/li\u003e\n\u003cli\u003eDowd JB, Andriano L, Brazel DM, Rotondi V, Block P, Ding X. Demographic science aids in understanding the spread and fatality rates of COVID-19. Proc Natl Acad Sci U S A. 2020;117(18):9696-9698. \u003c/li\u003e\n\u003cli\u003eAlfred R, Obit JH. The roles of machine learning methods in limiting the spread of deadly diseases: A systematic review. Heliyon. 2021;7(6):e07371. \u003c/li\u003e\n\u003cli\u003eGupta P, Gupta V, Singh CM, Singhal L. Emergence of COVID-19 Variants: An Update. 
Cureus. 2023;15(7):e41295. \u003c/li\u003e\n\u003cli\u003eIslam MR, Shahriar M, Bhuiyan MA. The latest Omicron BA.4 and BA.5 lineages are frowning toward COVID-19 preventive measures: A threat to global public health. Health Sci Rep. 2022;5(6):e884. \u003c/li\u003e\n\u003cli\u003e\u0026Ouml;zen F. Random forest regression for prediction of Covid-19 daily cases and deaths in Turkey. Heliyon. 2024;10(4):e25746. \u003c/li\u003e\n\u003cli\u003eKolozali S, White SL, Norris S, Fasli M, Van Heerden A. Explainable Early Prediction of Gestational Diabetes Biomarkers by Combining Medical Background and Wearable Devices: A Pilot Study with a Cohort Group in South Africa. IEEE J Biomed Health Inform. 2024; doi: 10.1109/JBHI.2024.3361505. \u003c/li\u003e\n\u003cli\u003eCerqueira V, Torgo L, Mozetič I. Evaluating time series forecasting models: an empirical study on performance estimation methods. Mach Learn 109. 2020; doi: 10.1007/s10994-020-05910-7\u003c/li\u003e\n\u003cli\u003eGunthe SS, Swain B, Patra SS, Amte A. On the global trends and spread of the COVID-19 outbreak: preliminary assessment of the potential relation between location-specific temperature and UV index. Z Gesundh Wiss. 2022;30(1):219-228. \u003c/li\u003e\n\u003cli\u003eP\u0026eacute;rez-Gilaberte JB, Mart\u0026iacute;n-Iranzo N, Aguilera J, Almenara-Blasco M, de G\u0026aacute;lvez MV, Gilaberte Y. Correlation between UV Index, Temperature and Humidity with Respect to Incidence and Severity of COVID 19 in Spain. Int J Environ Res Public Health. 2023;20(3):1973. \u003c/li\u003e\n\u003cli\u003eJafari-Oori M, Vahedian-Azimi A, Ghorbanzadeh K, Sepahvand E, Dehi M, Ebadi A. Efficacy of ozone adjuvant therapy in COVID-19 patients: A meta-analysis study. Front Med (Lausanne). 2022;9:1037749. \u003c/li\u003e\n\u003cli\u003eFeiz AM, Babaei-Pouya A, Poursadeqiyan M. The health effects of quarantine during the COVID-19 pandemic. Work. 2020;67(3):523-527.\u003c/li\u003e\n\u003cli\u003eAuranen K, Shubin M, Erra E. et al. 
Efficacy and effectiveness of case isolation and quarantine during a growing phase of the COVID-19 epidemic in Finland. Sci Rep. 2023; 13:298. \u003c/li\u003e\n\u003cli\u003eYuan Z, Shao Z, Ma L, Guo R. Clinical Severity of SARS-CoV-2 Variants during COVID-19 Vaccination: A Systematic Review and Meta-Analysis. Viruses. 2023;15(10):1994. \u003c/li\u003e\n\u003cli\u003eVarea-Jim\u0026eacute;nez E. Comparative severity of COVID-19 cases caused by Alpha, Delta or Omicron SARS-CoV-2 variants and its association with vaccination. Enfermedades Infecciosas y Microbiolog\u0026iacute;a Cl\u0026iacute;nica. EIMC. 2022; doi:10.1016/j.eimc.2022.11.003\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"COVID-19 variants, Random Forest Regressor, Regional spread, Factor Importance","lastPublishedDoi":"10.21203/rs.3.rs-4208741/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4208741/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eThe COVID-19 pandemic arising from the emergence of SARS-CoV-2 in late 2019 has led to 
global devastation with millions of lives lost by January 2024. Despite the WHO's declaration of the end of the global health emergency in May 2023, the virus persists, propelled by mutations. Variants continue to challenge vaccination efforts, underscoring the necessity for ongoing vigilance. This study aimed at contributing to a more data-driven approach to pandemic management by employing random forest regression to analyze regional variant prevalence.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eThis study utilized data from various sources including National COVID Cohort Collaborative database, Bureau of Transportation Statistics, World Weather Online, EPA, and US Census. Key variables include pollution, weather, travel patterns, and demographics. Preprocessing steps involved merging and normalization of datasets. Training data spanned from January 2021 to February 2023. The Random Forest Regressor was chosen for its accuracy in modeling. To prevent data leakage, time series splits were employed. Model performance was evaluated using metrics such as MSE and R-squared.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe Alpha variant was predominant in the Southeast, with less than 80% share even at its peak. Delta surged initially in Kansas City and maintained dominance there for over 5 months. Omicron subvariant BA.5 spread nationwide, becoming predominant across all Health and Human Services regions simultaneously, with New York seeing the earliest and fastest decline in its share. Variant XBB.1.5 concentrated more in the Northeast, but limited data hindered full analysis. Using RF regressor, key features affecting spread patterns were identified, with high predictive accuracy. Each variant showed specific environmental correlations; for instance, Alpha with air quality index and temperature, Delta with ozone density, BA.5 with UV index, and XBB.1.5 with location, land area, and income. 
Correlation analysis further highlighted variant-specific associations.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eThis research provides a comprehensive analysis of the regional distribution of COVID-19 variants, offering critical insights for devising targeted public health strategies. By utilizing machine learning, the study uncovers the complex factors contributing to variant spread and reveals how specific factors contribute to variant prevalence, offering insights crucial for pandemic management.\u003c/p\u003e","manuscriptTitle":"Machine Learning Insights into Regional Dynamics and Prevalence of COVID-19 Variants in US Health and Human Services Regions","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-04-09 06:03:09","doi":"10.21203/rs.3.rs-4208741/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"c32c9da6-fc98-4fe9-94fa-e9553a7acd60","owner":[],"postedDate":"April 9th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2024-05-03T14:21:37+00:00","versionOfRecord":[],"versionCreatedAt":"2024-04-09 
06:03:09","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-4208741","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4208741","identity":"rs-4208741","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
