Using machine learning to overcome mosquito collections missing data for malaria modeling

preprint OA: closed CC-BY-4.0
📄 Open PDF Full text JSON View at publisher
AI-generated deep summary by claude@2026-06, 2026-06-16 · read from full text

This paper applies several quantitative machine learning methods to impute missing mosquito count time series for malaria modeling, using entomological surveillance data from 2009–2016 in Bolívar State, Venezuela. The authors compare Linear Regression, Stochastic Linear Regression, K-nearest neighbor, and Gradient Boosting for reconstructing missing abundance counts of Anopheles mosquitoes with leave-one-out cross-validation, and then feed the imputed series along with climate covariates (El Niño 3.4 Index, rainfall, mean air temperature) into generalized time-series models to forecast Plasmodium vivax and P. falciparum incidence. They report that machine learning imputation significantly improved the accuracy and reliability of P. vivax incidence predictions, while failing to predict P. falciparum incidence, and note that performance depends on reconstruction quality and that data continuity/coverage limitations remain. The paper does not explicitly discuss endometriosis or adenomyosis; it was included in the corpus via a keyword match in the upstream search index.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice. How this works

Full text 112,153 characters · extracted from oa-pdf · click to expand
1 1 Running Title: Machine learning to overcome mosquito missing data 2 Using machine learning to overcome mosquito 3 collections missing data for malaria modeling 4 Yasmin Rubio-Palis1*, Linjia Feng2#a, Karena S. Liang2#b, 5 Chen Song2#c, Songyuan Wang2, Tymon Duchnicki2, Xiao 6 Zhang2#d, Lelys Bravo de Guenni2* 7 8 1 Instituto de Investigaciones Biomédicas "Dr. Francisco J. Triana Alonso" 9 (BIOMED), Universidad de Carabobo, Maracay, Aragua, Venezuela 10 2 Department of Statistics, University of Illinois, Urbana-Champaign, Illinois, United 11 States of America 12 #a Current address: Department of Financial Engineering, Cornell University, New 13 York, New York, United States of America 14 15 #b Current address: Department of Computer Science, University of Illinois, Urbana- 16 Champaign, Illinois, United States of America 17 18 #c Current address: The Grainger College of Engineering, University of Illinois, 19 Urbana-Champaign, Illinois, United States of America 20 21 #d Current address: Department of Statistics, University of California, Santa Cruz, 22 California, United States of America 23 24 * Corresponding author 25 Email: [email protected] 26 Email: [email protected] 27 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 2 28 29 Abstract 30 Entomological surveillance plays a crucial role in areas where malaria remains 31 endemic, yet gathering data on mosquito populations is often expensive and 32 complicated, particularly in remote locations with challenging logistics and 33 inconsistent sampling schedules. Access to extensive time series data on mosquito 34 species at specific sites would greatly enhance insights into seasonal trends and 35 the biting habits of vectors of malaria parasites. Gaps in mosquito count records 36 pose a significant challenge for researchers and public health officials seeking to 37 establish early warning systems and effective vector control programs. In this 38 study, we apply quantitative machine learning techniques to address missing data 39 in estimates of mosquito abundance collected from 2009 to 2016 in Bolívar State, 40 Venezuela. We evaluated Linear Regression, Stochastic Linear Regression, K 41 Nearest-Neighbor, and Gradient Boosting methods for imputing missing counts of 42 Anopheles mosquitoes, employing a leave-one-out cross-validation strategy. 43 Additionally, we developed a predictive malaria transmission model incorporating 44 mosquito abundance and climate variables (El Niño 3.4 Index, rainfall, and mean 45 air temperature) as covariates. Our generalized time series model forecasts 46 malaria incidence of Plasmodium vivax and Plasmodium falciparum based on 47 climate dynamics and imputed mosquito data. Model performance was assessed 48 using root mean square error, mean absolute error, and mean absolute percentage 49 error. The final results demonstrated that machine learning imputation significantly .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 3 50 improved the accuracy and reliability of P. vivax malaria incidence predictions but 51 failed to predict P. falciparum incidence. The study demonstrates that method 52 choice significantly influences the reconstruction of seasonal abundance patterns 53 and the performance of malaria incidence models. Nevertheless, the proposed 54 models strengthen the foundation for targeted interventions and surveillance in 55 endemic regions. Despite limitations in data continuity and coverage, the findings 56 highlight the value of combining multiyear entomological data sets with robust 57 imputation and sensitivity analyses to improve predictive modeling in resource- 58 constrained, malaria-endemic settings. 59 Introduction 60 Malaria stands as the most significant parasitic disease transmitted by mosquitoes. 61 According to the World Health Organization [1], an estimate of 262 million malaria 62 cases occurred across 80 endemic countries in 2024. Within the Americas, 63 Venezuela reported an estimated incidence of 9.7-12.6 x 1,000 population at risk, 64 significantly higher than the incidence reported by Brazil (3.08-3.43 x 1,000 65 population at risk), accounting for 17 percent of all malaria cases in the region. 66 Notably, more than 70 percent of these cases originated from Bolívar State [1]. 67 Mosquito control is a crucial intervention for preventing the transmission of malaria 68 parasites. Nevertheless, effective surveillance and monitoring of mosquito vector 69 populations are often constrained by the challenges of conducting regular 70 collections across multiple locations, particularly in vast and remote regions. These 71 logistical hurdles frequently result in incomplete data, leading to potentially .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 4 72 inaccurate assumptions. Furthermore, efforts to model malaria transmission based 73 on entomological data encounter significant limitations stemming from these data 74 gaps. 75 A key mathematical approach to addressing this limitation involves leveraging 76 machine learning techniques for imputing missing data. Numerous researchers 77 have adopted such methods to estimate missing values effectively. For instance, 78 Emmanuel et al. [2] provided a comprehensive review of the methodologies used 79 for missing data imputation. Among the techniques employed, the K-nearest 80 neighbor (KNN) method has demonstrated notable success. Their study also 81 highlights that the performance of imputation methods, such as random forest- 82 based algorithms and KNN, is highly dependent on the dataset being analyzed. 83 However, a potential limitation of the KNN method is its inability to consider the 84 correlation between potential predictors, which can lead to unsatisfactory results 85 [3]. To address such shortcomings, boosting algorithms have emerged as 86 promising alternatives. Research by Cauthen et al. [4] demonstrated that boosting 87 methods can outperform non-boosting approaches, especially in scenarios with 88 high rates of missing values. Their study, which focused on imputing the number of 89 dengue cases in India, highlighted the enhanced accuracy and robustness offered 90 by boosting algorithms in handling complex, incomplete datasets. 91 The Extreme Gradient Boosting (XGBoosting) algorithm, an improved boosting 92 method, has demonstrated superior performance in imputing medical data [3]. 93 Additionally, both kernel-based and tree-based machine learning algorithms have 94 been successfully applied to meteorological datasets [5]. In recent years, .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 5 95 Recurrent Neural Network (RNN) models based on deep learning have become 96 increasingly popular for imputing missing values in time series data, owing to their 97 ability to extract and leverage temporal dependencies within datasets [6,7]. These 98 studies typically rely on prediction error measurements, most commonly, root mean 99 square error (RMSE) to evaluate algorithm performance. While machine learning 100 methods are well-established for missing data imputation in various domains, their 101 applicability to imputing mosquito abundance time series data remains largely 102 unexplored. Expanding research in this direction could offer significant advances in 103 handling incomplete entomological datasets, ultimately improving the accuracy of 104 malaria modeling and the effectiveness of vector control strategies. 105 In this study, we systematically compare four distinct approaches for imputing 106 missing data within incomplete time series of mosquito counts, focusing on the 107 most abundant Anopheles species collected in a remote Amerindian community in 108 southern Venezuela. Each imputation method leverages a range of potential 109 climate drivers as predictors, including indices that capture the influence of the El 110 Niño phenomenon. By applying these imputation strategies, we generate complete 111 mosquito abundance time series, subsequently evaluating their performance using 112 cross-validation error metrics. For both the most prevalent mosquito species 113 (Anopheles darlingi) and for the aggregated dataset encompassing all species, we 114 select the imputed time series that yield the lowest cross-validation error. 115 Anopheles darlingi is the most efficient vector of malaria parasites in the neotropics 116 [8,9,10], and the confirmed vector in the study area [11]; all the Anopheles species 117 collected are potential vectors [10,12,13,14], therefore were included in the .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 6 118 analysis. These optimally imputed mosquito abundance datasets are incorporated 119 as potential predictors in our time series models of malaria incidence, with the aim 120 of enhancing the accuracy of malaria transmission modeling in this challenging and 121 data-limited context. 122 Beyond the challenges of imputing mosquito abundance data, numerous studies 123 have explored the impact of climate drivers on malaria incidence across diverse 124 regions. The importance of climate as a driving force of malaria transmission has 125 been known for over a century [15]. The development and survival of Anopheles 126 mosquitoes and the Plasmodium parasites are highly dependent on temperature, 127 rainfall and humidity [16,17]. On the other hand, large scale climatic drivers as El 128 Niño/La Niña (ENSO) phenomena are key in modulating the inter-annual variability 129 of the precipitation worldwide [18]. The relation between ENSO and malaria 130 epidemics has been studied in South America [19,20,21], and particularly in 131 Venezuela [22,23,24,25,26]. The integration of climate variables as potential 132 covariates in malaria prediction models has garnered increasing attention. For 133 instance, Kifle et al. [27] demonstrated that rainfall data served as a reasonably 134 good predictor of malaria incidence in Eritrea, highlighting the role of precipitation 135 in shaping transmission dynamics. Similarly, Kumar et al. [28] introduced a 136 Bayesian regression model incorporating climate predictors to successfully model 137 malaria incidence in India. These findings underscore the potential of climate 138 variables -such as rainfall, temperature, and humidity- as essential components in 139 forecasting malaria risk and improving the precision of epidemiological models 140 tailored to specific local contexts. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 7 141 Importantly, mosquito abundance time series constitute a critical input for 142 elucidating the causal pathways that drive fluctuations in malaria incidence. 143 Previous works, such as Laneri et al. [29] have largely defined the causal 144 landscape of malaria transmission in terms of climate variables, focusing on factors 145 such as rainfall, temperature, and humidity. While this approach has yielded 146 valuable insights, it may overlook the direct role that vector population dynamics 147 play in modulating malaria risk. 148 The present study describes the missing-data imputation methods for Anopheles 149 abundance time series, and the generalized time series models for the 150 Plasmodium vivax and P. falciparum incidence in a malaria endemic region in 151 southern Venezuela. We present the application of the different imputation 152 approaches and compare their performance using a leave-one-out cross-validation 153 (LOOCV) approach, together with an assessment of the generalized time series 154 models for P. vivax and P. falciparum malaria. 155 After applying a variable-selection strategy based on predictive performance in a 156 training set, the generalized time series models can be used to forecast future 157 malaria cases in a testing set. These predictive models enhance our understanding 158 of the interactions between climate, vector abundance and malaria incidence in 159 remote areas. 160 Materials and Methods .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 8 161 Data Sources 162 Mosquito collection methods 163 Entomological surveillance in collaboration with local leaders from remote 164 Amerindian communities was implemented in 2009 [30]. This action was 165 demanded by the chief and members of the Ye’kwana community of Boca de 166 Nichare (lat 06∘34.44’N, long 64∘49.39’W, altitude 59 msnm), Sucre Municipality, 167 Bolívar State. In this area several studies have been conducted since 2005 with 168 local leaders, and the demographic, ecologic, epidemiologic, hydrologic and 169 geomorphologic characteristics have been described [11,30,31,32,33,34]. Local 170 leaders were trained and supervised by one of the authors (YRP) in mosquito 171 collections using the Mosquito Magnet Liberty Plus Trap (MMLPT) (American 172 Biophysics Corporation, North Kingstown, RI). The training topics included 173 mosquito identification, data collection, reporting, preservation, and shipping of 174 specimens collected to the central laboratory in Maracay (Aragua State, 175 Venezuela), where identification to species level and numbers collected were 176 confirmed. The MMLPT device is battery operated and utilizes Carbon Dioxide 177 (CO2), one of the products of the catalysis of propane gas, as an attractant 178 together with Octenol. This trap has been previously evaluated in this region of 179 southern Venezuela, and its efficiency to collect anophelines compared to human 180 landing catches was determined [35]. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 9 181 Mosquito Data 182 Mosquitoes were collected monthly between 2009 and 2016, for 4 to 10 nights per 183 month around the full moon from 18:00 to 06:00 hours. To estimate the biting 184 pattern of the most abundant species, the trap was checked and the net with 185 trapped mosquitoes was removed every four hours: 18:00 to 22:00 hrs; 22:00 to 186 02:00 hrs, 02:00 to 06:00 hrs. The total number of individuals was recorded, and 187 mosquito species were identified. The most predominant species were Anopheles 188 darlingi (darl.1), Anopheles oswaldoi (osw.5), Anopheles goeldii (goel.6), and other 189 species with a less important presence in the area included An. braziliensis, An. 190 triannulatus, An. benarrochi and An. mattogrossensis. A time series of the 191 estimated mean number of mosquitoes by combining all Anopheles species 192 collected was also considered in the analysis and will be referred to as All in the 193 text. The main assumption is that Anopheles abundance and species diversity is 194 representative of the municipality. 195 Mosquito Abundance Time Series 196 The monthly mosquito abundance time series for the most abundant species (An. 197 darlingi, An. goeldii, and An. oswaldoi), and the monthly time series combining all 198 Anopheles species collected (All) is presented in Fig 1. 199 Fig 1. Estimated total number of mosquitoes (ETNM) collected per month. 200 All = total number of mosquitoes collected of all Anopheles species; darl.1 = 201 Anopheles darlingi; goel.6 = Anopheles goeldii; osw.5 = Anopheles oswaldoi; Other 202 = total number of other Anopheles species collected. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 10 203 204 All Anopheles species identified during the surveys are considered potential 205 vectors [9,10,11,36] and, hence were included in the time series and other 206 analysis. These values were calculated by adding the total number of mosquitoes 207 collected per night during three consecutive four-hour periods from 18:00 to 06:00 208 hours. The average number of mosquitoes per night was calculated by adding the 209 total number of mosquitoes for all nights of observation and dividing by the total 210 number of nights in a particular month. This estimate was converted into a monthly 211 estimate by multiplying the mean number of mosquitoes per night by 𝑛𝑑𝑎𝑦𝑠, where 212 𝑛𝑑𝑎𝑦𝑠 is the total number of days in each month. 213 A distinctive feature of this data set is the important proportion of missing data. The 214 proportion of missing observations (or months without collections) is 60.4% of the 215 total number of observations. As shown in Fig 1, there are important data gaps, 216 especially during the years 2012-2013 due to travel limitations to the collection site, 217 including economic restrictions and the shortage of fuel. 218 Mosquito Abundance and Biting Activity 219 Mosquito abundance and biting activity can vary from year to year, depending on 220 the climatic drivers. It can vary between species and could also vary between 221 different collection periods (night hours) [11,37,38,39]. To determine the 222 significance of these factors on the mosquito biting behavior, a three-way ANOVA 223 model was fitted to the mean number of mosquitoes by year, species and 224 collection time, including also the interactions among all the three different factors. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 11 225 In this case, a normality Shapiro-Wilks test was not rejected. The variable 226 representing collection times (with three intervals: 18:00 to 22:00 hrs; 22:00 to 227 02:00 hrs, 02:00 to 06:00 hrs) and its interaction with the other two factors, resulted 228 as non-significant, indicating that there are not significant differences in the number 229 of mosquitoes among these distinct mosquito collection periods. For this reason, 230 this factor was not accounted for in the analysis herein. 231 Malaria Data Cases 232 Monthly malaria cases due to P. vivax (PV) and P. falciparum (PF) were obtained 233 from the Malaria Program report for the Sucre Municipality. Cases resulting from 234 mixed infections associated with both parasite species were also available and 235 added to the corresponding number of malaria cases for each parasite. Data on 236 malaria cases for the study location (Boca de Nichare) was not available. 237 Population data for the Sucre municipality was available from 2009 to 2017 238 (Instituto de Salud Pública, Bolívar State). The monthly P. vivax and P. falciparum 239 incidence were estimated by dividing the total number of cases of each parasite 240 species by the population per 1,000 population. The variables P. falciparum 241 incidence (PF), and P. vivax incidence (PV) were used as response variables in 242 the malaria model described in the methodology section. Time series of this data 243 set are shown in Fig 2. 244 245 Fig 2. Plasmodium vivax and Plasmodium falciparum incidence, Sucre 246 Municipality, Bolívar State, Venezuela, 2009-2017. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 12 247 Climate data 248 The climate of the study region has been classified as a Tropical Savanna Climate 249 (Aw) according to Köppen-Geiger climate classification system, with annual 250 average temperatures of approximately 23.46°C and average annual precipitation 251 of 2,764 mm/yr. 252 Nearby meteorological stations with long term time series data are unavailable for 253 the study site. Historical climate information is readily available from open data 254 sources, including large scale oceanic-atmospheric indices depicting the temporal 255 variability of the El Niño-Southern Oscillation (ENSO) index. 256 Rainfall and mean air temperature data were downloaded from the World Bank 257 data portal https://climateknowledgeportal.worldbank.org/. Fig 3 shows the monthly 258 historical time series for the study region. 259 Fig 3. Rainfall and Mean Air temperature time series during the period 2009- 260 2016. A. Total Rainfall in mm. B. Mean Air Temperature in degrees Celsius. Boca 261 de Nichare, Bolívar State, Venezuela. 262 The El Niño 3.4 Index (Nino3.4) was downloaded from the National Oceanic and 263 Atmospheric Administration (NOAA) website https://psl.noaa.gov/enso/data.html. 264 This index depicts the values of the Sea Surface Temperature (SST) Anomalies for 265 the Pacific Ocean region with extents (5N-5S, 120W-170W) (S1 Fig). 266 Fig 4 shows the estimated total number of mosquitoes (ETNM) per month time 267 series together with the anomalized rainfall and the anomalized temperature. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 13 268 Fig 4. Estimated total number of mosquitoes (ETNM) per month for All 269 Anopheles species (red line) compared with climate data. A. Anomalized mean 270 monthly rainfall in mm (blue line). B. Anomalized mean monthly air temperature in 271 degrees Celsius (green line). Boca de Nichare, Bolívar State, Venezuela (2009- 272 2016). 273 Anomalized rainfall and temperature time series were calculated by subtracting the 274 long-term monthly means from each monthly value. The anomalization procedure 275 filters out the series’ seasonal components while the monthly and inter-annual 276 variability remain as dominant components. The long-term monthly means were 277 calculated using data from the period 1981 to 2010. 278 To compare the estimated mean number of mosquito time series with the ENSO 279 data, Fig 5 overlaps the different ENSO occurrence modes with the mosquito data 280 before imputation. The highlighted peak in the mosquito times series occurs during 281 a weak La Niña phase. These maxima occur during a positive rainfall anomaly 282 period, also preceded by a positive air temperature anomaly. 283 Fig 5. Estimated Total Number of Mosquitoes (ETNM) per Month and El Niño 284 3.4 index for the period 2009 – 2016. Boca de Nichare, Bolívar State, 285 Venezuela. 286 A summary of the climatic variables during the study period 2009-2016 is 287 presented in Table 1. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 14 288 Table 1. Summary monthly statistics for climatic variables Rainfall, Mean Air 289 Temperature and El Niño 3.4 index during years 2009-2016. 290 __________________________________________________________________ Variable Mean Median Standard Deviation Max Min Rainfall (mm) 176.68 168.95 130.77 469.1 1.6 Mean Air Temperature (⁰C) 26.44 26.4 0.69 28.6 25.2 El Niño 3.4 Index 0.1176 -0.085 0.9871 2.57 -1.7 291 292 Methodology 293 In this section, we describe the different methods used for imputation of 294 mosquitoes (An. darlingi and all Anopheles species) time series missing data and 295 present the generalized time series model used to model malaria incidence. 296 Missing Data Imputation Methods 297 Missing data handling methods are usually adapted to the nature of missing data 298 mechanisms. The missing data mechanisms are related to the dependence or not 299 of the missing data on the observed values. Emmanuel et al. [2] and Sterne et al. 300 [40] propose classifying missing data according to their type: missing completely at 301 random, missing at random and missing not at random. The type of missing data .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 15 302 we are dealing with in this study can be classified as missing completely at random 303 because missing values do not depend on the observed or unobserved data, but 304 rather on the ability to collect information in a timely manner if economic and/or 305 logistic resources (gasoline, propane gas, trap functioning) are readily available. 306 Deleting missing data is one method of handling missing values called the 307 Complete Case Analysis. This procedure is rarely applied because deleting 308 missing data points (Listwise deletion) can significantly reduce our sample size for 309 model training purposes, and therefore, accuracy would be a concern. 310 Using imputation methods as regression imputation allows for estimating missing 311 values by using other variables as covariates or predictors. In this research, 312 different approaches were used for missing data imputation. The following methods 313 will be discussed and compared: 314 Linear Regression (LR), Stochastic Linear Regression (SLR), K Nearest-Neighbor 315 (KNN) and Gradient Boosting (GB). 316 The imputation methods LR, SLR, KNN and GB were implemented in Python 317 (Google Colab tool) using packages fancyimpute, sklearn and xgboost. 318 A brief description of each method and the steps for their implementation are 319 described below. 320 Linear Regression and Stochastic Linear Regression Imputation 321 Regression imputation is a technique used to replace missing values 𝑌𝑚 in a data 322 set based on a set of predictors or attributes 𝑋1,𝑋2,…,𝑋𝑝. It can be classified into .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 16 323 two types: deterministic regression imputation and stochastic regression 324 imputation. Deterministic regression imputation predicts missing values using the 325 estimated regression model: 326 𝑦𝑚 = 𝛽0 + 𝛽1𝑥1 + … + 𝛽𝑝𝑥𝑝 327 where 𝛽𝑖 are the parameters of the estimated regression model using least 328 squares, and 𝑦𝑚 are the estimated mean values of the response variable 𝑌 given a 329 set of values for the predictors 𝑥1,𝑥2,…,𝑥𝑝. Estimation of the parameters 𝛽0,𝛽1,…,𝛽𝑝 330 is based on the observed values 𝑌𝑜 and the corresponding observations for the 331 predictors (complete cases). However, this method uses a point estimate that does 332 not consider any random variation (i.e., an error term) around the regression 333 model, leading to imputed values that may be too precise resulting in an 334 overestimation of the correlation between 𝑋𝑖 and 𝑌. 335 To address this limitation, stochastic regression imputation was used. This method 336 incorporates a random error term into the predicted value: 337 𝑦𝑚 = 𝛽0 + 𝛽1𝑥1 + … + 𝛽𝑝𝑥𝑝 + 𝜖 338 where 𝜖 is a simulated random error term from a normal distribution with zero mean 339 and variance estimated using least squares. By incorporating this error term, 340 stochastic regression imputation can reproduce the correlation between 𝑋𝑖 and 𝑌 341 more accurately. 342 Using regression imputation, we used the Leave-One-Out Cross-Validation 343 (LOOCV) method to calculate the average RMSE for the evaluation of imputation 344 performance. LOOCV is a special type of k-fold cross-validation in which the .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 17 345 number of folds k is equal to the number of observations N in the data set. Under 346 this approach, each observation will be treated as a validation set once, and the 347 rest (N-1) observations will be used as a training set. This method aims at reducing 348 bias and preventing over-fitting, and it is useful for data sets with a small number of 349 observations. 350 K-Nearest Neighbor method 351 This algorithm falls under the umbrella of supervised learning methods. It can be 352 used for both regression and classification problems. It is considered a Lazy 353 learner algorithm [41] and does not assume a probability distribution of the data. 354 Instead, KNN assumes that similar inputs have similar outputs, and it will impute 355 the missing data with the average value amongst its k most similar inputs. KNN 356 depends heavily on the parameter k, which is the number of nearest neighbors to 357 be used for imputation. The range of k values used in our imputation process 358 varied from 1 to 40. 359 The KNN algorithm uses a distance measure to detect the closest data points to a 360 target vector 𝑥0 in a neighborhood, for which a missing characteristic 𝑦0 should be 361 imputed. It is also important to note that the K neighbors do not have any missing 362 features. Examples of distance measures are the Minkowski distance, Manhattan 363 Distance, Cosine Distance, Jaccard Distance, Hamming Distance and Euclidean 364 distance [2]. Among these distance measures, Euclidean distance is commonly 365 used in practice. The equation is: 366 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 18 367 𝐷(𝑝,𝑞) = 𝛴𝑛 𝑑=1(𝑝𝑑 ― 𝑞𝑑)2 368 369 where 𝐷(𝑝,𝑞) is the Euclidean distance, 𝑝 and 𝑞 are two data points in an n- 370 dimensional space and 𝑛 is the dimension of the space. 371 The imputed value at 𝑥0 (𝑦0) is the average of all responses 𝑦𝑖 in the 372 neighborhood: 373 𝑦0 = 1 𝐾 𝐾 𝑖=1 𝑦𝑖 374 375 The LOOCV method (Leave-One-Out Cross-Validation) is used to select the 376 optimum value of 𝐾 by comparing the value of the root mean square error (RMSE) 377 for each number of neighbors 𝐾 and selecting 𝐾 that minimizes RMSE. 378 Gradient Boosting Method 379 The Gradient Boosting method (GBoosting) is also a supervised learning algorithm 380 where a training data set is used to predict a response 𝑦0 from an input vector of 381 features 𝑥0. It can be used both in a regression context and as a classifier. The 382 final prediction is an ensemble of decision trees constructed in an iterative way. 383 Each tree is considered a weak learner, and the algorithm uses the steepest 384 descent method to move into the direction of minimizing a loss function at each 385 step. The Gradient Boosting method employs regularization techniques to prevent 386 overfitting, and it can handle missing data automatically. The Gradient Boosting .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 19 387 model is flexible for data with a large missing data rate and small size, and it is also 388 useful for minimizing bias error. 389 A wide range of hyperparameters must be specified for the Gradient Boosting 390 model. The hyperparameters used in the imputation process are max_depth 391 (maximum depth of a tree), min_samples_leaf (minimum number of observations in 392 a leaf), and learning_rate (how quickly the model learns). The LOOCV method is 393 used to select the best combination of the three parameters with the lowest RMSE. 394 The gradient boost algorithm objective is to find an approximation of the function 395 F(x) that can minimize the expected value of the loss function L(y, F(x)) [42]: 396 In the first step, we want to minimize the loss function 𝐿(𝑦𝑖,𝛼), which represents the 397 difference between the actual values and the predicted values. A constant 398 prediction of the model 𝐹0(𝑥) is given by: 399 𝐹0(𝑥) = 𝑎𝑟𝑔 𝑚𝑖𝑛𝛼 𝑁 𝑖=1 𝐿(𝑦𝑖,𝛼) 400 Where 𝛼 is the value that minimizes the loss function ∑𝑁 𝑖=1 𝐿(𝑦𝑖,𝛼). 401 In subsequent steps (𝑚 = 1,…,𝑀) the following expression is minimized: 402 𝛼𝑗𝑚 = 𝑥𝑖∈𝑅𝑗𝑚 𝐿(𝑦𝑖 ,𝐹𝑚―1 (𝑥𝑖) + 𝛼 ) 403 At each step, several regression trees are created on the residuals of the model. 404 These residuals are calculated as follows: 405 𝑟𝑖𝑚 = ―[ ∂𝐿(𝑦𝑖,𝐹(𝑥𝑖)) ∂𝐹(𝑥𝑖) ]𝐹(𝑥)=𝐹𝑚―1(𝑥) .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 20 406 where 𝑖 = 1,…,𝑁; and m is the index of each tree. 407 The derivative is taken with respect to the previous prediction 𝐹𝑚―1 , and the result 408 is proportional to the residual value at the step 𝑚 ― 1. 409 In step 𝑚 the prediction 𝐹𝑚(𝑥) is updated as: 410 𝐹𝑚(𝑥) = 𝐹𝑚―1(𝑥) + 𝑣 𝐽𝑚 𝑗=1 𝛼𝑗𝑚1(𝑥 ∈ 𝑅𝑗𝑚) 411 𝑗 is the index of each terminal node (leaf); 1(.) is the indicator function and 𝐽𝑚 is the 412 total number of leaves in the tree. 𝑅𝑗𝑚 represents each terminal node and 𝑣 ∈ (0,1) 413 is the learning rate. 414 Generalized Time Series Model for Malaria Incidence 415 One method for modeling malaria cases with covariates is using a generalized 416 linear model for count time series. Let {𝑌𝑡:𝑡 ∈ 𝑁} be a count time series and {𝑋𝑡 417 :𝑡 ∈ 𝑁} be a time-varying 𝑟 dimensional covariate vector. We then model the 418 conditional mean 𝐸(𝑌𝑡|𝐹𝑡―1) of the time series by a process {𝜆𝑡:𝑡 ∈ 𝑁} such that 𝐸( 419 𝑌𝑡|𝐹𝑡―1) = 𝜆𝑡, where 𝐹𝑡 is the history of the joint process {𝑌𝑡,𝜆𝑡,𝑋𝑡+1:𝑡 ∈ 𝑁} up to 420 time 𝑡 and includes the information of covariates at time 𝑡 + 1. Furthermore, let 𝑔: 421 𝑅+→𝑅 be a link function and 𝑔:𝑁0→𝑅 be a transformation function. The count time 422 series model, as proposed in Liboschik et al. [43], is of the form: 423 𝑔(𝜆𝑡) = 𝛽0 + 𝑝 𝑘=1 𝛽𝑘 𝑔(𝑌𝑡―𝑖𝑘) + 𝑞 𝑙=1 𝛼𝑙𝑔(𝜆𝑡―𝑗𝑙) + 𝜂𝑇𝑋𝑡 424 Here 𝜂 is the parameter vector of the covariate effects, and 𝛽0 is the intercept. We 425 can regress on past observations of the time series by defining a set 𝑃 = {𝑖1,𝑖2,…,𝑖𝑝} .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 21 426 of integers where 0 < 𝑖1 < 𝑖2… < 𝑖𝑝 < ∞ and then regressing on lagged 427 observations 𝑌𝑡―𝑖1,𝑌𝑡―𝑖2,…,𝑌𝑡―𝑖𝑝 with effects 𝛽. We are also able to regress on 428 lagged conditional means by defining a similar set 𝑄 = {𝑗1,𝑗2,…,𝑗𝑞} and regressing 429 on conditional means 𝜆𝑡―𝑗1,𝜆𝑡―𝑗2,…,𝜆𝑡―𝑗𝑞 with effects 𝛼. 430 The model is fitted using quasi-conditional maximum likelihood estimation. If the 431 link function is the identity function, then 𝑔(𝑥) = 𝑔(𝑥) = 𝑥. Otherwise, if the link 432 function is logarithmic, then 𝑔(𝑥) = 𝑙𝑜𝑔(𝑥) and 𝑔(𝑥) = 𝑙𝑜𝑔(𝑥 + 1). Furthermore, it 433 must be assumed that 𝑌𝑡|𝐹𝑡―1 ∼ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆𝑡) or 𝑌𝑡|𝐹𝑡―1 ∼ 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝜆𝑡,𝜙) 434 where 𝜙 ∈ (0,∞) is an additional dispersion parameter. For details on parameter 435 estimation, refer to the documentation of the R package tscount [43]. 436 Results 437 Climate data and mosquito time series dependence 438 Sample cross-correlation functions were calculated to understand the association 439 between the estimated total number of mosquitoes per month and the climatic 440 variables. As mentioned, the rainfall and temperature data time series were 441 anomalized to partially remove the strong seasonal pattern from these climate 442 variables. 443 Fig 6 shows the sample cross-correlation function between the estimated total 444 number of mosquitoes (ETNM) per month and the monthly values of the El Niño .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 22 445 index (Niño3.4), the anomalized mean temperature, and mean monthly rainfall for 446 all Anopheles species. 447 Fig 6. Cross-correlation between the estimated total number of mosquitoes 448 (ETNM) for all Anopheles species time series and the Niño3.4 index (A), the 449 monthly mean temperature (B), and monthly rainfall (C). 450 By inspecting the negative lag portion of the figures, we can detect the leading lag 451 for which the mosquito data is significantly correlated with the climate variables. 452 The maximum correlation in absolute value within a year occurs for lags of 9, 6, 453 and 2 months respectively. There is a significant positive association between 454 mosquito abundance and El Niño index, temperature, and rainfall after 9, 6, and 2 455 months, respectively. 456 A similar calculation was repeated for the most abundant species, An. darlingi 457 (darl.1) and the same lagged significant effect of climate drivers on mosquito 458 abundance were observed. 459 Imputation methods comparison 460 Linear regression and stochastic linear regression 461 The linear regression model was fitted by least-squares, using all non-missing data 462 values (referred to as complete cases) as the training data set. This fitted model 463 was subsequently used to impute missing values. The model was based on 464 predictors including temperature and rainfall anomalies and the El Niño3.4 index. 465 These predictors were initially considered without any lagged effects (i.e., no time .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 23 466 delays). After analyzing the cross-correlations (Fig 6), lagged versions of the 467 climate predictors were incorporated into the model, using the selected lags 468 exhibiting the highest cross-correlation with the estimated total number of 469 mosquitoes (ETNM) response variable. We then made a comparative analysis of 470 model performance under these two scenarios: using lagged predictors and using 471 non-lagged predictors. 472 Our primary focus was on the imputation of data associated with all mosquito 473 species, obtained by aggregating mosquito counts from all species. In addition, we 474 examine data imputation for each species separately. For assessing the prediction 475 error after imputation of missing values, the following approach was employed: 476 1. Due to the limited availability of non-missing data, a simple random 477 sampling of the observed data with replacement was implemented. The 478 number of samples from this sampling process was equivalent to the 479 number of missing data values. Multiple training data sets were initially 480 generated by repeatedly applying the random sampling method for 𝐾 481 iterations. 482 2. A linear regression model was fitted to each training sample using least- 483 squares. To assess model performance, a leave-one-out cross-validation 484 (LOOCV) error was calculated for each of these training data sets, and the 485 average root mean square error (RMSE) was obtained. 486 3. In our analysis, we observed that using lagged climatic variables as 487 predictors resulted in a better model performance (RMSE= 141.40) .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 24 488 compared to using climatic variables without lagging (RMSE=142.69) (S2 489 Fig). 490 4. The same procedure was followed for the stochastic model: climatic 491 variables as predictors were considered before lagging (no lag) and after 492 lagging, and model performance was compared in these two cases. The 493 stochastic regression approach used the same random sampling function as 494 the linear regression model to impute missing values. It generated random 495 predictions based on the regression model, considering the prediction 496 standard error to fill in the missing values. Random predictions were 497 generated from a normal distribution centered at the estimated mean 498 predicted value, with a standard error equal to the standard deviation of the 499 residuals for each estimate. If the random prediction was greater than zero, 500 it was used as the imputed value for that specific missing value. To evaluate 501 the performance of the stochastic imputation method, we used the leave- 502 one-out cross-validation (LOOCV) method, resulting in a list of RMSE 503 values for each training sample that were averaged for all iterations. Our 504 analysis indicated that utilizing lagged climatic variables mean (RMSE= 505 0.13) led to a slightly better performance compared to using climatic 506 variables without lagging mean (RMSE=0.10) (S3 Fig). 507 K-Nearest Neighbor (KNN) 508 The model was trained using the leave-one-out cross-validation method (LOOCV) 509 to adjust the parameter 𝐾 (number of nearest-neighbors). Climatic variables as 510 predictors were considered before lagging (no lag) and after lagging, and the .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 25 511 model performance was compared in these two cases. The following steps were 512 followed: 513 The LOOCV method was used to select the best value of the 𝐾 parameter. A KNN 514 model was fitted to all five species for values of 𝐾 in the range of 1 to 40. The value 515 of 𝐾 with the lowest RMSE for the LOOCV method was 𝐾 = 16 in An. darlingi 516 (darl.1). Fig 7 shows a comparison of the RMSE for each value of 𝐾 for An. darlingi 517 (darl.1) and All mosquito species. 518 Fig 7. Comparison of the leave-one-out cross-validation (LOOCV) root mean 519 square error (RMSE) for each 𝐾 value for Anopheles darlingi (darl.1) and for 520 All Anopheles mosquito species. 521 The KNN imputation model was constructed for An. darlingi (darl.1) and for All 522 species with the number of neighbors 𝑛𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠=16 and a data 𝑡𝑟𝑎𝑖𝑛:𝑡𝑒𝑠𝑡 split 523 ratio of 8:2. The model was run 1,000 times, and the root mean square error 524 (RMSE) was calculated for each iteration. A better performance of the 525 methodology was found when using lagged climatic variables (RMSE= 126.83) in 526 comparison to using climatic variables with no lagging (RMSE= 164.47) for darl.1. 527 Fig 8 shows a comparison of the RMSE for each iteration using climatic variables 528 before and after lagging. These results confirm better predictability of the fitted 529 model when using lagged climate predictors. 530 Fig 8. Comparison of root mean square error (RMSE) for K-nearest neighbor 531 (KNN) method in each run, using climatic variables before and after lagging. 532 A: Anopheles darlingi (darl.1); B: All Anopheles species collected. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 26 533 Gradient Boosting (GB) 534 The model was trained using the leave-one-out cross-validation (LOOCV) method 535 by tuning three parameters in the Gradient Boosting algorithm, which include max 536 depth, min samples leaf, and learning rate. Climatic variables as predictors were 537 considered before lagging (no lags) and after lagging in the same way as with 538 previous imputation methods, and model performance was compared in these two 539 cases. The following steps were followed: 540 1. The LOOCV method was used to select the best value of the parameters 541 mentioned above. The GB model was fitted to species An. darlingi (darl.1) 542 and All species for different values of the hyperparameters. We used a grid 543 search to tune the hyperparameters, aiming to find the optimal combination 544 of three hyperparameters. The hyperparameter values with the lowest 545 RMSE for LOOCV were min samples leaf = 1, learning rate = 0.1, max 546 depth = 10. 547 2. The Gradient Boosting imputation model with the default hyperparameters 548 of GB was constructed for An. darlingi (darl.1) to compare the performance 549 between lagged and no-lagged climate variables. The dataset was split into 550 training and testing sets at an 8:2 ratio. The model was run 100 times and 551 the root mean square error (RMSE) was calculated for each iteration. The 552 methodology performed better when using lagged climatic variables 553 (RMSE= 163.75) compared to non-lagged climatic variables (RMSE= 554 217.44) for darl.1. Fig 9 shows a comparison of the RMSE for each iteration .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 27 555 using climatic variables before and after lagging. These results again 556 highlight the advantage of using lagged predictors over non-lagged ones. 557 Fig 9. Comparison of root mean square error (RMSE) for the Gradient 558 Boosting (GB) method in each run. Using climatic variables before and after 559 lagging. A: Anopheles darlingi (darl.1); B: All Anopheles species collected. 560 Fig 10 show the contrast between the original data (in blue) and the imputed data 561 (in yellow) for the four different imputation methods used for the species An. 562 darlingi (darl.1). 563 Fig 10. Estimated total number of Anopheles darlingi abundance (ETNM) time 564 series imputation methods. A. Linear Regression (LR). B. Stochastic Linear 565 Regression (SLR). C. K-Nearest Neighbor (KNN). D. Gradient Boosting (GB). 566 Important differences can be distinguished among the different methods, especially 567 between linear regression imputation and the remaining methods. These observed 568 differences are not surprising given the variety of imputation approaches used. A 569 comparison of the prediction errors based on the non-missing or complete case 570 data values can be assessed by estimating the RMSE using a LOOCV method for 571 each imputation approach. These results are presented in Table 2. Table entries 572 are relative numbers, calculated after dividing by the mean number of mosquitoes 573 for each species. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 28 574 Table 2. Leave-One-Out Cross Validation (LOOCV) method errorsa for the 575 different imputation methods and mosquito species. 576 _______________________________________________________________ Anopheles Species Imputation Method KNN G. Boosting Linear R. Stochastic R. Allb 0.7638104005 0.0000000297 0.968946389 0.000917769 An. darlingi 0.7727904964 0.0000000539 0.972812828 0.001148532 An. goeldii 0.8078696262 0.00000012 1.308744024 0.002081277 An. oswaldoi 1.345153286 0.000000627 1.503379349 0.001657576 Other 1.073280571 0.00004271 1.120539972 0.001491594 577 aThe root mean square error (RMSE) values were calculated using the complete 578 case observations (non-missing cases). 579 bAll Anopheles species collected. 580 c Other than An. darlingi, An. goeldii and An. oswaldoi species collected. 581 The GB methodology shows the best predictive performance with the lowest 582 testing errors in all cases. The same analysis was done for All Anopheles species, 583 with similar results (S4 Fig). .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 29 584 Malaria Incidence Time Series Model 585 Malaria Incidence Model Covariates 586 For the generalized linear time series count model described above, we consider 587 several potential covariates for inclusion in the model: monthly rainfall, mean 588 monthly temperature, ENSO index (El Niño 3.4), and an imputed mosquito count 589 time series. Initially, we selected the imputed mosquito time series with the lowest 590 LOOCV error which corresponds to the Gradient Boosting approach (Table 2). 591 Additionally, we considered lagged versions of each predictor. To select the 592 optimal lag, we calculated the rank cross-correlation of the malaria incidence time 593 series and each climatic anomalized predictor after removing the long-term 594 monthly means. The selected lag was the one that resulted in the highest cross- 595 correlation between the anomalized climatic time series and malaria incidence. For 596 rainfall, a two-month lagged time series was considered based on the time lag that 597 takes into account the time for creating habitats for the development of the aquatic 598 stages of Anopheles mosquitoes (from eggs to adults), the buildup of mosquito 599 populations, blood feeding interval and the development of the parasite inside the 600 mosquito (sporogony) [37,44,45,46,47]. 601 Figs 11, 12, 13, and 14 show the sample rank cross correlation functions of the P. 602 vivax and P. falciparum incidence with rainfall, mean air temperature, ENSO index, 603 and imputed All Anopheles species time series. The lag values for the negative 604 portion of the graphs with the highest rank cross-correlations in absolute value are 605 shown in Table 3. Note that cross-correlations between rainfall and P. vivax .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 30 606 incidence are nonsignificant at any lag. However, the negative lag corresponding to 607 the highest cross-correlation (in absolute value) was selected as a model 608 covariate. This lag varied between 4 and 5 months for rainfall, 8 months for 609 temperature, 16 and 17 months for the ENSO index, and 4 to 6 months for 610 mosquito counts (Table 3). 611 Fig 11. Anomalized rainfall and malaria incidence Rank Cross-Correlation. 612 A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum incidence (PF). 613 ACF= Sample Cross-correlation Function. 614 Fig 12. Anomalized temperature and malaria incidence Rank Cross- 615 Correlation. A. Plasmodium vivax incidence. B. Plasmodium falciparum 616 incidence. ACF= Sample Cross-correlation Function. 617 Fig 13. Anomalized El Niño 3.4 (ENSO) and malaria incidence Rank Cross- 618 Correlation. A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum 619 incidence (PF). ACF= Sample Cross-correlation Function. 620 Fig 14. All Anopheles species counts and malaria incidence Rank Cross- 621 Correlation. A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum 622 incidence (PF). ACF= Sample Cross-correlation Function. 623 Table 3. Selected lags for climate and mosquito count covariates to be 624 included in the Plasmodium vivax and Plasmodium falciparum malaria 625 incidence models for All Anopheles species and for Anopheles darlingi. 626 ________________________________________________________________ .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 31 Malaria Model Mosquito Species Rainfall Temperature ENSO Mosquito count P. vivax All species 2, 4 8 17 4 P. falciparum 2, 5 8 16 5 P. vivax An. darlingi 2, 4 8 17 4 P. falciparum 2, 5 8 16 6 627 628 629 Finally, we considered regressing the malaria incidence on past observations and 630 past conditional means. The Sample Autocorrelation Function (ACF) and Partial 631 Autocorrelation Function (PACF) plots of the malaria time series shown in Fig 15 632 was used to determine the lag of past observations that had the greatest impact on 633 the malaria incidence cases at time 𝑡. Due to the highly significant partial 634 autocorrelation at lag 1, we included past malaria observations at time 𝑡 ― 1 only. 635 Furthermore, after including the past observation, we also consider adding the 636 conditional mean at 𝑡 ― 12 to account for possible seasonal effects . 637 Fig 15. Sample Autocorrelation Function (ACF) and Partial Autocorrelation 638 Function (PACF) for Plasmodium vivax incidence time series. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 32 639 For all models, the logarithmic link function 𝑔(𝜆𝑡) = 𝑙𝑜𝑔(𝜆𝑡) was used. The identity 640 function cannot fit models with negative covariates (such as the ENSO index) and 641 cannot estimate effect parameters less than 0. Since some effects may be 642 negative, the identity function is too restrictive in this case. 643 Malaria Model Fitting and Prediction 644 The malaria model was fitted to the incidence of P. vivax and P. falciparum. To 645 select the best Time Series Generalized Linear Model (TSGLM), we calculated 646 different accuracy metrics for each unique combination of lagged and non-lagged 647 covariates (rainfall, temperature, ENSO, and mosquito counts), an autoregressive 648 term of malaria incidence at 𝑡 ― 1, and the conditional mean malaria incidence at 649 𝑡 ― 12. This term considers any seasonal dependence that might be present in the 650 data. The conditional mean was only considered if the autoregressive term was 651 already in the model. Mosquito counts for All Anopheles species imputed with the 652 GB method were proposed as potential covariates. Additional results were also 653 produced using An. darlingi mosquito counts as a predictor. We also included an 654 additional trend shift component to account for the observed trend increase of 655 malaria incidence from year 2015 onward. 656 We focus on selecting the model that gives the best prediction accuracy when 657 compared to observations. For this purpose, we used an 80-20 training-testing 658 sample split. The training set was used to fit models according to the above 659 algorithm, and each model was used for prediction in the testing set. Using these 660 predictions, we calculated a variety of error metrics: the root mean square error .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 33 661 (RMSE), the mean absolute error (MAE), and the mean absolute percentage error 662 (MAPE). Lower error values are preferred. The best models, as indicated by the 663 lowest values of RMSE, MAE, and MAPE, are presented in Table 4. 664 Table 4. Selected covariates for the Plasmodium vivax and Plasmodium 665 falciparum malaria incidence models that minimize the prediction error in a 666 testing set using different accuracy measures. 667 Plasmodium vivax Plasmodium falciparum Covariates RMSE MAE MAPE Covariates RMSE MAE MAPE Rainfall Xa X Xa Rainfall Xa Xa Xa Rainfall (lag 2) X X Rainfall (lag 2) X Xa Xa Rainfall (lag 4) X X X Rainfall (lag 5) Temperature X X X Temperature X X Temp (lag 8) Temp (lag 8) X ENSO X X X ENSO Xa Xa Xa ENSO (lag 17) ENSO (lag 17) Xa Xa Xa Mosquito Countb X X Mosquito Countb .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 34 Mosq Count (lag 4) X X X Mosq Count (lag 4) Past Obsc (lag 1) Xa Xa Xa Past Obsc (lag 1) Xa Xa Xa Past Mean (lag 12) Past Mean (lag 12) X X X Time Shiftd Xa Xa Xa Time Shiftd X X X 668 669 RMSE= root mean square error; MAE= mean absolute error; MAPE= mean 670 absolute percentage error. 671 a Coefficient is significant at 5% 672 b Mosquito counts include All Anopheles species imputed using Gradient Boosting 673 c Past Observation 674 d Time Shift Post-2015 675 The general malaria incidence model for P. vivax or P. falciparum can be written 676 as: 677 𝑙𝑜𝑔(𝜆𝑡) = 𝛽0 + 𝛽1𝑙𝑜𝑔(𝑌𝑡―1) + 𝛼1𝑙𝑜𝑔(𝜆𝑡―12) + 𝜂𝑅⊺𝑅𝑡 + 𝜂𝑇⊺𝑇𝑡 + 𝜂𝐸𝑛⊺𝐸𝑛𝑡 + 𝜂𝑀𝐶⊺𝑀𝐶𝑡 + 𝜂𝐼𝐼𝑡―2015 678 where 𝜆𝑡 is the expected malaria incidence rate and 𝑌𝑡 is the observed incidence 679 rate at time 𝑡 [(cases/population) × 1,000]. 𝑅𝑡, 𝑇𝑡 and 𝐸𝑛𝑡 are the climatic variables 680 (rainfall, temperature, ENSO respectively) vector with non-lagged and lagged 681 values at time 𝑡. 𝑀𝐶𝑡 is the mosquito counts vector with non-lagged and lagged 682 values at time 𝑡. 𝐼𝑡―2015 is an indicator variable which is zero if the year is lower 683 than 2015 (constant trend), and it is proportional to time if time is greater than .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 35 684 2015. Model parameters 𝛽0 (intercept), 𝛽1 (𝑡 ― 1 observation), 𝛼1 (𝑡 ― 12 mean), 685 and parameter vectors 𝜂𝑅,𝜂𝑇,𝜂𝐸𝑛,𝜂𝐼 are estimated by maximum likelihood using the 686 R library tscount [43]. 687 Selected covariates in Table 4 (marked with “X”) minimize the different prediction 688 error metrics and may vary depending on the prediction criteria used. For the P. 689 vivax malaria model, rainfall, lagged rainfall (lags 2 (MAE and MAPE) and 4), 690 temperature, ENSO, mosquito counts (RMSE, MAE), lagged mosquito counts (lag 691 4) and the previous time-step expected malaria incidence are selected as the most 692 important covariates in the sense of minimizing the predictive errors for the testing 693 set, with a significant trend shift after the year 2015. Only rainfall, past month 694 observations, and the trend shift component are significant at the 95% confidence 695 level. 696 For the P. falciparum malaria incidence model, rainfall, lagged rainfall (lag 2), 697 temperature (MAE and MAPE), lagged temperature (lag 8, RMSE), ENSO, lagged 698 ENSO (lag 16), the previous time step malaria incidence observation, past year 699 mean incidence and a linear trend shift are selected as the most important 700 covariates that minimize the model predictive errors over the testing data set. Only 701 rainfall and lagged rainfall, ENSO and lagged ENSO, and the previous month 702 observation are statistically significant according to the 95% confidence intervals 703 calculated using the normal approximation for the model parameter estimates. 704 Predictors selected by minimizing the predictive errors are not necessarily going to 705 be statistically significant according to the Wald test statistic (see Table 4). .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 36 706 However, predictability performance was preferred over statistical significance in 707 this case. 708 Fig 16 shows the model fitting and forecast results for the P. vivax malaria 709 incidence model for All Anopheles species and An. darlingi counts as predictors. 710 Model forecasts also include the 80% and 95% confidence intervals. Two model 711 versions were fitted: the first one used imputed An. darlingi as a predictor; and the 712 second one used All Anopheles species mosquito counts as a predictor. 713 Fig 16. Model fit and predictions for Plasmodium vivax malaria incidence 714 time series. A: All Anopheles species used as a covariate for mosquito counts. B: 715 Anopheles darlingi used as a covariate for mosquito counts. 716 Fig 16 shows a better forecasting performance when using All mosquito species 717 as a predictor (Fig 16A) in comparison when using An. darlingi (Fig 16B). In this 718 case most of the observations in the testing set are included within the 80% 719 confidence intervals with a clear upward trend from year 2015. 720 Fig 17 shows the model fitting and forecast results for the P. falciparum malaria 721 incidence model. Based on the lowest prediction errors, the mosquito count 722 variable was not included as a predictor (Table 4). 723 Fig 17. Model fit and predictions for Plasmodium falciparum malaria 724 incidence time series. 725 According to Fig 17, data observations in the testing set are included within the 726 95% and 80% confidence intervals of the predicted values. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 37 727 Malaria Model sensitivity to mosquito data imputation methods 728 An important question to be addressed is the sensitivity of the mosquito counts 729 data imputation methods in the malaria model configuration and prediction 730 accuracy. Although the Gradient Boosting approach resulted in the best imputation 731 method performance according to the results presented in Table 2, mosquito 732 counts imputed with all methods were tested as potential predictors, and variable 733 selection was recorded in each case. We explored how predictability changes 734 when using different mosquito imputed time series in the model selection process. 735 Model configuration (or best predictors set) can also change when different 736 imputed mosquito count time series are used in the model selection process. 737 Gradient Boosting (GB) imputed mosquito counts and lagged mosquito counts for 738 All Anopheles species were not included as potential model covariates in the P. 739 falciparum best predictive model; however, mosquito counts were selected as 740 covariates in the P. vivax malaria model by all prediction accuracy criteria. 741 In Table 5, we compare the different accuracy measures in the 20% testing data 742 set for the P. vivax and P. falciparum malaria models and the four different time 743 series of imputed mosquito counts for All Anopheles species. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 38 744 Table 5. Plasmodium vivax and Plasmodium falciparum malaria models. 745 Comparison of prediction errors for different imputation methods for 746 mosquito counts time series for All Anopheles species. 747 Plasmodium vivax Plasmodium falciparum Imp Method RMSE MAE MAPE RMSE MAE MAPE LR 17.537 14.207 63.244 0.855 1.168 21.941 SRL 16.813 14.543 58.895 1.023 0.711 23.94 KNN 5.295 4.268 26.392 0.879 0.697 22.707 GB 5.368 4.137 25.838 0.879 0.697 22.707 748 749 LR= Linear Regression; SLR= Stochastic Linear Regression; KNN= K-Nearest 750 Neighbor; GB= Gradient Boosting. 751 RMSE= root mean square error; MAE= mean absolute error; MAPE= mean 752 absolute percentage error. 753 The results show that for the P. vivax malaria incidence model, the lowest 754 prediction errors are obtained when imputed K-Nearest Neighbor (KNN) and 755 Gradient Boosting (GB) All Anopheles species counts are used as covariates. 756 Larger prediction errors are obtained when using imputed LR and SLR mosquito 757 time series as covariates in the model, compared with using imputed GB and KNN 758 mosquito time series. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 39 759 The best predictive P. falciparum malaria incidence models exclude the covariate 760 mosquito counts when K-Nearest Neighbor (KNN) and Gradient Boosting (GB) 761 imputed time series are used as predictors. In this case, prediction errors for the 762 malaria incidence model are the same for both time series. However, this result 763 changed when Linear Regression (LR) or Stochastic Linear Regression (SLR) 764 were used as imputation methods. All Anopheles species-imputed time series were 765 included as potential covariates in the malaria incidence model in this case. 766 Prediction errors across each predictive measure (RMSE, MAE, MAPE) do not 767 show high variability among different imputation methods. 768 Results from Table 5 indicate a greater sensitivity to the imputation methods for the 769 P. vivax malaria incidence model than for the P. falciparum model. 770 Discussion 771 The elimination of malaria requires a strong surveillance system. A key component 772 is entomological surveillance which relies on monitoring anopheline populations 773 mainly in terms of species composition, abundance, biting behavior, and 774 insecticide resistance [48]. The collection of entomological data requires a huge 775 investment in regular mosquito collections in various sentinel sites throughout 776 endemic areas. Systematic data collection approaches as the one presented by 777 Drakou et al [49] provides a good example of the implementation of mosquito 778 network surveillance with adequate spatial coverage, while in our study 779 implementing a comprehensive spatial surveillance effort was logistically 780 unfeasible. In the study by Drakou et al [49], surveillance was limited to a single .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 40 781 year. However, a multiyear analysis, such as the one we are presenting, is always 782 valuable for understanding the impact of environmental factors on mosquito 783 bionomics and to understand year-to-year variability. 784 In remote locations with difficult access, a way to contribute to providing relevant 785 entomological data to the malaria program is through well-trained local leaders. 786 This approach was attempted in an Amerindian village of the Caura River basin in 787 southern Venezuela. Nevertheless, the political, economic, and social crisis [50,51] 788 jeopardized the success of such an initiative, resulting in irregular interruptions of 789 mosquito collections with the end result of large missing data, making it difficult to 790 interpret the data and provide reliable information for decision-making for vector 791 control. 792 To overcome this limitation, the application of data imputation methods arise as a 793 potential solution. Machine learning approaches, specifically Boosting methods 794 have been proposed by other authors when there is a significant percentage of 795 missingness [4]. However, imputing missing data under the assumption that it is 796 missing at random may not be entirely accurate, as intrinsic seasonal and 797 operational factors may have influenced the sampling effort. 798 Given the large percentage of missing data in the mosquito abundance time series, 799 the application of different imputation methods provides a good alternative when 800 mosquito surveillance is highly fragmented, and continuous time series are not 801 available. However, different imputation methods might capture different data 802 features, and testing the feasibility of the imputed values still remains a challenging .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 41 803 task. Four imputation methods were compared: Linear Regression, Stochastic 804 Linear Regression, K‑Nearest Neighbor, and Gradient Boosting. 805 Considering the limited data availability and rather short time series, the LOOCV 806 approach was the best option to evaluate predictability among the different 807 methods. Gradient Boosting and the Stochastic Linear Regression methods 808 demonstrated better efficiency in terms of accuracy, reporting the lowest LOOCV 809 errors (Table 2). In contrast, the Linear Regression approach was less efficient 810 than the other methods, showing the highest LOOCV errors and a poorer 811 representation of the seasonal cycle. Comparison among different imputation 812 methods for mosquito data and any other time series data is a necessary first step 813 in the evaluation of potential biases in the new data information provided [52]. If 814 this new data is used as input to an impact model, conducting a sensitivity analysis 815 of the imputation methods on the expected response would provide additional 816 insights into the usefulness of the imputed data. In general, the seasonal cycle was 817 well represented by all imputation methods, with marked seasonal peaks between 818 August and September, two to three months after the beginning of the rainy 819 season. Similar results were recorded in other malaria endemic areas of 820 Venezuela [13,39], while in other regions the seasonal peaks are related to the 821 transition period between rainy and dry season [38,53,54]. The large spike 822 observed in 2010 (Fig 5), has been associated to a weak La Niña condition. A 823 similar spike to the one observed in 2010 was estimated for 2016 when applying 824 the GB imputation method to the An. darlingi time series, with a maximum 825 comparable to the one recorded in 2010. However, mosquito measurements were .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 42 826 only available through July 2016, so direct validation for this period is not possible. 827 Furthermore, we cannot conclude that there is a direct association of this spike 828 with a weak La Niña event, since the next event was reported during the Fall 2017 829 up to Spring 2018 [55]. It is important to point out that although there are abundant 830 references worldwide relating malaria and other diseases to ENSO events, the 831 available data on mosquito abundance and ENSO is scant. El Niño is associated 832 with an increase in rainfall in the Colombia Pacific coast and East Africa, and La 833 Niña with drought. A three-year study conducted in the Colombian Pacific coast 834 reported no relation between ENSO and the abundance of An. albimanus and An. 835 darlingi [56] while a one-year study in Tanzania showed a decline in the 836 abundance of An. arabiensis and An. funestus associated with the drought due to 837 La Niña 2016-2017 [57]. 838 The relationship between climate and malaria is powerful, complex, and highly 839 localized. Several studies clearly show that factors like temperature and rainfall, 840 and the global climate patterns like El Niño that influence them, are critical drivers 841 of malaria epidemics [17,58,59]). Previous studies in Venezuela have shown that 842 El Niño is associated with drought, resulting in an increase in malaria cases the 843 following year [21,22,23]). In the present study the significant lags for rainfall and 844 ENSO were one month earlier for the P. vivax model than for the P. falciparum 845 model (Table 3), which is reasonable considering that the sporogonic cycle of P. 846 falciparum is larger than that of P. vivax [60,61]. Although variations in temperature 847 have a great impact on the developmental time of the mosquitoes and the 848 parasites [60,61,62], the lag of 8 months was the most significant for both models. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 43 849 When fitting the P. vivax and P. falciparum malaria incidence models, the main 850 issue is the non-significance or exclusion of mosquito abundance as covariates for 851 the P. falciparum malaria incidence model. These findings suggest that it is 852 insufficient to treat mosquito abundance and diversity data collected from a single 853 locality as representative for the entire municipality, because these ecological 854 characteristics can differ significantly between localities within the same region 855 [11,38,39,53]. 856 However, the proposed malaria model is still useful for predictive purposes, since 857 the model accounts for potential seasonal effects and significant changes in the 858 incidence trend, lagged and non-lagged environmental factors, and the impact of 859 the previous month's malaria incidence. The better model performance observed 860 when using All Anopheles species as a mosquito count predictor, compared with 861 using An. darlingi (Fig 16) might be due to the unexpected spikes obtained in the 862 imputed mosquito count time series of An. darlingi for the Boosting method (GB) 863 (Fig 10D). 864 The sensitivity analysis using different imputation methods is a key component in 865 the results presented in this work. These results are demonstrated through the 866 process of evaluating both model accuracy and model structure, as these aspects 867 are sensitive to the imputation method employed. Therefore, it is important to 868 explore all available methods to ensure robust and reliable conclusions. When 869 assessing the sensitivity of the different imputation methods on the malaria model 870 predictive performance, the MAPE results shown in Table 5 confirms a reasonably 871 good predictive performance of the P. vivax model when mosquito counts time .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 44 872 series for All Anopheles species imputed with KNN and GB methods are used as 873 predictors (20% <MAPE<30%), while values of MAPE are greater than 50% when 874 mosquito count time series imputed with LR and SLR methods are used instead. 875 Thresholds to evaluate model predictive performance according to the MAPE 876 criteria have been discussed by Ma and Liu [63].These results might be an 877 indication that these two imputation methods (LR and SLR) do not accurately 878 represent the mosquito abundance variability for All Anopheles species, and as a 879 consequence might fail to reproduce the association between mosquito abundance 880 and malaria incidence. In the case of the P. falciparum model predictive 881 performance according to MAPE is more homogeneous across the different 882 imputation methods (Table 5). But in this case mosquito abundance does not yield 883 a better model accuracy when the KNN and GB imputed mosquito counts are 884 considered as predictors. Differences in model accuracy between P. vivax and P. 885 falciparum are driven by the selection of the sets of predictors that minimize 886 prediction errors. Mosquito counts imputed with the regression models for All 887 Anopheles species preserve the seasonality in the mosquito abundance cycle but 888 fail to represent important year-to-year variability fluctuations exhibiting a higher 889 degree of smoothing. However, when used as predictors they can explain a good 890 percentage of the malaria time series variability in the P. falciparum model. This is 891 not the case for the GB and KNN imputed mosquito time series, for which the more 892 complex fluctuations do not contribute to explain the observed variance in the 893 malaria time series model. We can speculate that data limitations may be an 894 important factor underlying these results. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 45 895 Despite data limitations, this work provides a template for integrating multiyear 896 entomological observations, climate information, and robust imputation techniques 897 to improve malaria prediction systems in remote settings. The approaches 898 developed here can support more timely and geographically informed 899 decision‑making, enabling public health authorities to anticipate high‑risk periods, 900 optimize vector control interventions, and better allocate scarce resources. 901 Continued investment in local capacity for mosquito monitoring—combined with 902 modern data‑analytic tools—can strengthen the foundations of malaria surveillance 903 and contribute to more effective control strategies in the Amazon and similar 904 hard‑to‑reach regions. 905 Finally, a key entomological parameter for the implementation of a particular 906 intervention for control is to determine the vector biting hourly pattern. This pattern 907 is species specific; nevertheless, An. darlingi exhibits different patterns along its 908 range of distribution, even among locations within a few kilometers apart 909 [11,64,65,66]. During the present study, the mosquito abundance at different hour 910 intervals was not significant, contrasting with previous studies in Sucre municipality 911 along the Caura River where An. darlingi showed a biting peak around midnight in 912 Las Majadas [64] whereas in Jabillal, there were two distinctive peaks at sunset 913 and sunrise [11]). The lack of significance in the hour intervals might be due to the 914 low abundance of mosquitoes collected throughout the study. Similar results were 915 reported for An. calderoni in Colombia [67]. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 46 916 Conclusion 917 This study demonstrates that machine learning–based imputation methods offer a 918 powerful tool for reconstructing highly incomplete mosquito abundance time series 919 in remote, resource‑limited malaria‑endemic regions. Incorporating the imputed 920 mosquito series and climate predictors into generalized time‑series models of 921 malaria incidence revealed important differences between P. vivax and P. 922 falciparum. Models for P. vivax responded strongly to mosquito abundance, 923 rainfall, temperature, ENSO anomalies, and recent malaria history, with Gradient 924 Boosting and KNN imputations yielding the most accurate forecasts. In contrast, P. 925 falciparum incidence was largely insensitive to mosquito abundance, likely 926 reflecting mismatches between the geographic scale of entomological and 927 epidemiological datasets, lower case counts, or intrinsic differences in 928 transmission. Climate drivers—particularly rainfall and ENSO—retained predictive 929 power for both parasites, underscoring their central role in shaping transmission 930 potential. 931 Acknowledgment 932 We are most grateful to Simón Caura, the local leader, Hernán Guzmán, Yarys 933 Estrada, Victor Sánchez, Jorge Moreno and Jesús González, who participated in 934 field activities. To the people of Boca de Nichare, for their support and friendship. 935 To Ana María Ibañez and Horacio Vargas for logistic support and friendship. 936 Mariapia Bevilacqua, Lya Cárdenas, and Zenaida Muria (ACOANA) and Dirección 937 de Salud Ambiental (MPPSalud) for logistic support. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 47 938 References 939 1. World Health Organization (WHO). World Malaria Report 2025: addressing the 940 threat of antimalaria drug resistance. Geneva: World Health Organization; 2025. 941 2. Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A 942 survey on missing data in machine learning. J Big Data. 2021; 8: 140. 943 3. Zhang X, Yan C, Gao C, Malin BA, Chen Y. Predicting missing values in medical 944 data via XGBoost regression. J Health Inform Res. 2020; 4: 383–394. 945 4. Cauthen KR, Lambert G, Ray J, Lefantzi S. Imputing data that are missing at 946 high rates using a boosting algorithm. Sandia National Laboratories. SAND-2016. 947 Available from: https://www.sandia.gov/app/uploads/sites/203/2022/06/sand2016- 948 9430J.pdf. 949 5. Sattari MT, Roshan E, Bakhshandeh M, Pourghasemi HR. Potential of kernel 950 and tree-based machine-learning models for estimating missing data of rainfall. 951 Eng Appl Comput Fluid Mech. 2020; 14(1): 1078-1094. 952 6. Yoon J, Zame WR, van der Schaar M. Multi-directional recurrent neural 953 networks: A novel method for estimating missing data. In: Time series workshop in 954 International Conference on Machine Learning; 2017. p. 1-9. 955 7. Fang C, Wang C. Time series data imputation: A survey on deep learning 956 approaches. 2020; arXiv preprint arXiv: 2011.11347. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 48 957 8. Rubio-Palis Y, Zimmerman RH. Ecoregional classification of malaria vectors in 958 the Neotropics. J Med Entomol. 1997; 34: 499-510. 959 9. Sinka ME, Rubio-Palis Y, Manguin S, Patil AP, Temperley WH, Gething PW, et 960 al. The dominant Anopheles vectors of human malaria in the Americas: occurrence 961 data, distribution maps and bionomic precis. Parasites Vectors. 2010; 3: 1-72. doi: 962 10.1186/1756-3305-3-72. 963 10. Sinka ME, Bangs MJ, Manguin S, Rubio-Palis Y, Chareonviriyaphap T, 964 Coetzee M, et al. A global map of dominant malaria vectors. Parasites Vectors. 965 2012; 5: 69. doi: 10.1186/1756-3305-5-69. 966 11. Rubio-Palis Y, Bevilacqua M, Medina DA, Moreno JE, Cárdenas L, Sánchez V, 967 et al. Malaria entomological risk factors in relation to land cover in the Lower Caura 968 River Basin, Venezuela. Mem Inst Oswaldo Cruz. 2013; 108: 220-228. doi: 969 10.1590/0074-0276108022013015. 970 12. Arruda M, Carvalho MB, Nussenzweig RS, Maracic M, Ferreira AW, Cochrane 971 AH. Potential Vectors of Malaria and Their Different Susceptibility to Plasmodium 972 falciparum and Plasmodium vivax in Northern Brazil Identified by Immunoassay. 973 Am J Trop Med Hyg. 1986; 35: 873–881. doi: 10.4269/ajtmh.1986.35.873. 974 13. Rubio-Palis Y, Wirtz RA, Curtis CF. Malaria entomological inoculation rates in 975 western Venezuela. Acta Trop. 1992; 52: 167-174. doi: 10.1016/0001- 976 706x(92)90033-t. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 49 977 14. Galardo AK, Arruda M, D’Almeida Couto AA, Wirtz R, Lounibos LP, et al. 978 Malaria vector incrimination in three rural riverine villages in the Brazilian Amazon. 979 Am J Trop Med Hyg. 2007; 76(3): 461–469. 980 15. Macdonald G: The epidemiology and control of malaria London: Oxford 981 University Press; 1957. 982 16. Garret-Jones C, Shidrawi G. Malaria vectorial capacity of a population of 983 Anopheles gambiae - An exercise in Epidemiological Entomology. Bull Wld Hlth 984 Org. 1969; 40: 531-545. 985 17. Magersa DM, Lou X-S. Effects of Climate Change on Malaria Risk to Human 986 Health: A Review. Atmosphere. 2025; 16: 71. doi: 10.3390/atmos16010071. 987 18. Stern P, Easterling WE. Making climate forecast matter. Washington DC: 988 National Academy Press; 1999. 989 19. Kavats SR. El Niño and health. Bull Wld Hlth Org. 2000; 78:1127-1134. 990 20. Poveda G, Rojas W, Quiñones ML, Vélez ID, Mantilla RI, Ruíz D, et al. 991 Coupling between annual and ENSO timescales in the malaria climate association 992 in Colombia. Environ Hlth Perspect. 2001; 109: 489-493. 993 21. Gagnon AS, Smoyer-Tomic KE, Bush ABG. The El Niño Southern Oscillation 994 and malaria epidemics in South America. Int J Biometeorol. 2002; 46: 81–89. doi: 995 10.1007/s00484-001-0119-6. 996 22. Bouma MJ, Dye C. Cycles of malaria associated with El Niño in Venezuela. 997 JAMA. 1997; 278(21): 1772-1774. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 50 998 23. Sáez-Sáez V, Rubio-Palis Y, Pino JC. Variabilidad climática y malaria estudio 999 regional: municipio Sifontes, estado Bolívar, Venezuela. Terra Nueva Etapa. 2009; 1000 25: 93-112. 1001 24. Grillet ME, El Souki M, Laguna F, León JR. The periodicity of Plasmodium 1002 vivax and Plasmodium falciparum in Venezuela. Acta Trop. 2014; 129: 52-60. 1003 25. Laguna F, Grillet ME, León JR, Ludeña C. Modelling malaria incidence by an 1004 autoregressive distributed lag model with spatial component. Spat Spatiotemporal 1005 Epidemiol. 2017; 22: 27-37. doi: 10.1016/j.sste.2017.05.001. 1006 26. Lynn MK, Bossak BH. Analysis of a Secular Trend in Malaria Incidence: 1007 Venezuela, 1959-2015. J Epidemiol Glob Health. 2017; 1:54-59. 1008 27. Kifle MM, Teklemariam TT, Teweldeberhan AM, Tesfamariam EH, 1009 Andegiorgish AK, Kidane EA. Malaria Risk Stratification and Modeling the Effect of 1010 Rainfall on Malaria Incidence in Eritrea. J Environ Public Health. 2019; Article ID 1011 7314129. doi: 10.1155/2019/7314129. 1012 28. Kumar P, Vatsa R, Parth Sarthi P, Kumar M, Gangare V. Modeling an 1013 association between malaria cases and climate variables for Keonjhar district of 1014 Odisha, India: a Bayesian approach. J Parasit Des. 2020; 44(2): 319–331. 1015 29. Laneri K, Cabella B, Prado PI, Mendes Coutinho R, Kraenkel RA. Climate 1016 drivers of malaria at its southern fringe in the Americas. PLoS ONE. 2019; 14(7): 1017 e0219249. doi: 10.1371/journal.pone.0219249. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 51 1018 30. Bevilacqua M, Medina DA, Cárdenas L, Rubio-Palis Y, Moreno J, Martínez A. 1019 Orientaciones para Fortalecer el Programa de Malaria en Zonas Remotas con 1020 Población Indígena en el Caura, Venezuela. Bol Mal Salud Amb. 2009; 49(1): 53- 1021 71. Available from: http://www.iaes.edu.ve/index.php/centro-de- 1022 descargas/viewcategory/16-vol49-no1-ano-2009. 1023 31. Rubio-Palis Y, Moreno JE, Bevilacqua M, Medina D, Martínez A, Cárdenas L, 1024 et al. Caracterización ecológica de los anofelinos y otros culícidos en territorio 1025 indígena del Bajo Caura, estado Bolívar, Venezuela. Bol Mal Salud Amb. 2010; 1026 50(1): 95-107. Available from: http://www.iaes.edu.ve/index.php/centro-de- 1027 descargas/viewcategory/12-vol50-no1-ano-2010. 1028 32. Medina D, Bevilacqua M, Cárdenas L, Morales LG, Rubio-Palis Y, Martínez A, 1029 et al. Mapa de riesgo de transmisión de malaria en la cuenca del río Caura, 1030 Venezuela. Bol Mal Salud Amb. 2011; 51(2): 23-38. Available from: 1031 http://www.iaes.edu.ve/index.php/centro-de-descargas/viewcategory/105-vol51- 1032 no2-ano-2011. 1033 33. Bevilacqua M, Rubio-Palis Y, Medina DA, Cárdenas L. Malaria control in 1034 Amerindian communities of Venezuela. Ecohealth. 2015; 12: 253-266. doi: 1035 10.1007/s10393-015-1026-3. 1036 34. Moreno JE, Rubio-Palis Y, Bevilacqua M, Sánchez V, Guzmán H. 1037 Caracterización de hábitats larvales de anofelinos en la cuenca baja del Río 1038 Caura, estado Bolívar, Venezuela. Bol Mal Salud Amb. 2018; 58: 32-45. Available 1039 from: http://iaes.edu.ve/iaespro/ojs/index.php/bmsa/article/view/57/29. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 52 1040 35. Rubio-Palis Y, Moreno JE, Sánchez V, Estrada Y, Anaya W, Bevilacqua M, et 1041 al. Can Mosquito Magnet® substitute for human-landing catches to sample 1042 anopheline populations? Mem Inst Oswaldo Cruz. 2012;107: 546-549. doi: 1043 10.1590/S0074-02762012000400017. 1044 36. Pimenta PFP, Orfano AS, Bahia AC, Duarte APM, Ríos-Velásquez CM, Melo 1045 FF, et al. An overview of malaria transmission from the perspective of Amazon 1046 Anopheles vectors. Mem Inst Oswaldo Cruz. 2015; 110(1): 23-47. dor: 1047 10.1590/0074-02760140266. 1048 37. Magris M, Rubio-Palis Y, Menare C, Villegas L. Vector bionomics and malaria 1049 transmission in the Upper Orinoco River, southern Venezuela. Mem Inst Oswaldo 1050 Cruz. 2007; 102: 303-311. doi: 10.1590/s0074-02762007005000049. 1051 38. Galardo AKR, Zimmerman RH, Lounibos LP, Young LJ, Galardo CD, Arruda 1052 M, et al. Seasonal abundance of anopheline mosquitoes and their association with 1053 rainfall and malaria along the Matapí River, Amapá, Brazil. Med Vet Entomol. 1054 2009; 23(4): 335-349. doi: 10.1111/j.1365-2915.2009.00839.x. 1055 39. Rubio-Palis Y, Curtis CF. Biting and resting behavior of anophelines in western 1056 Venezuela and implications for control of malaria transmission. Med Vet Entomol. 1057 1992; 6: 325-334. 1058 40. Sterne AC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. 1059 Multiple imputation for missing data in epidemiological and clinical research: 1060 potential and pitfalls. BMJ. 2009; 338. doi: 10.1136/bmj.b2393. 1061 41. Zhang ML, Zhou ZH. ML-KNN: A lazy learning approach to multi-label learning, .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 53 1062 Pattern Recog. 2007; 40(7): 2038-2048. doi: 10.1016/j.patcog.2006.12.019. 1063 42. Bentéjac C, Csörgő A, Martínez-Muñoz G. A comparative analysis of gradient 1064 boosting algorithms. Artif Intell Rev. 2021; 54:1937–1967. doi: 0.1007/s10462-020- 1065 09896-5. 1066 43. Liboschik T, Fokianos K, Fried R. tscount: An R package for analysis of count 1067 time series following generalized linear models. J Stat Softw. 2017; 82(5): 1–51. 1068 doi: 10.18637/jss.v082.i05. 1069 44. Cox FE. History of the discovery of malaria parasites and their vectors. Parasit 1070 Vectors. 2010; 3:1-9. 1071 45. Smith M, Mackli MG, Thomas CJ. Hydrological and geomorphological controls 1072 of malaria transmission. Earth Sci Rev. 2013; 116: 109-127. 1073 46. Ikeda T, Behera SK, Morioka Y, Minakawa H, Hashizume M, Tsuzuki A, et al. 1074 Seasonally lagged effects climatic factor son malaria incidence in South Africa. Sci 1075 Rep. 2017; 7: 2458. doi: 10.1038/s41598-017-02680-6. 1076 47. Okeneye K, Gumel AB. Analysis of a temperature and rainfall dependant model 1077 for malaria transmission dynamics. Math Biosci. 2017; 287: 72-92. 1078 48. World Health Organization. Guidelines for malaria vector control. Geneva: 1079 World Health Organization; 2019. 1080 49. Drakou K, Nikolaou T, Vasquez M, Petric D, Michaelakis A, Kapranas A, et al. 1081 The Effect of Weather Variables on Mosquito Activity: A Snapshot of the Main 1082 Point of Entry of Cyprus. Int J Environ Res Public Health. 2020; 17: 1403. doi: .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 54 1083 10.3390/ijerph17041403. 1084 50. Grillet ME, Hernández-Villena JV, Llewellyn MS, et al. Venezuela’s 1085 humanitarian crisis, resurgence of vector-borne diseases, and implications for 1086 spillover in the region. Lancet Infect Dis. 2019; 19: e149-161. 1087 51. Briceño-León R, Guevara M, Terán I. Politics as Disease in Venezuela. Vector 1088 control before and after the Bolivarian revolution. In: Locating Zika. Social Change 1089 and Governance in an Age of Mosquito Pandemics. Kevin Bardosh Editor. 1090 Routledge, London. 238 p. doi: 10.4324/9780429456558. 1091 52. Batista GEAPA, Monard MC. An analysis of four missing data treatment 1092 methods for supervised learning. Appl Artif Intell. 2003; 17:5-6, 519-533. doi: 1093 10.1080/713827181. 1094 53. Moreno JE, Rubio-Palis Y, Páez E, Pérez E, Sánchez V. Abundance, biting behavior 1095 and parous rate of anopheline mosquito species in relation to malaria incidence in gold- 1096 mining areas of southern Venezuela. Med Vet Entom. 2007; 21: 339-349. doi: 1097 10.1111/j.1365-2915.2007.00904.x. 1098 54. Dos Santos F, Xu M, Bravo de Guenni L, Lourenço-de-Oliveira R, Rubio-Palis 1099 Y. Characterization of larval habitats of Anopheles (Nyssorhynchus) darlingi and 1100 associated species in malaria areas in western Brazilian Amazon. Mem Inst 1101 Oswaldo Cruz. 2024; 119: e240116. 1102 55. NOAA. Physical Sciences Laboratory. Monthly Climate/Ocean Indices (Time- 1103 Series) at PSL. Available from: https://psl.noaa.gov/data/timeseries/month. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 55 1104 56. Poveda G, Quiñones ML, Vélez ID, Rojas W, Rúa-Uribe G, Ruíz CD et al. 1105 Desarrollo de un sistema de alerta temprana para la malaria en Colombia. 1106 Universidad Internacional de Andalucía, Sevilla, España. 2008; 181 p. 1107 57. Kreppel K, Caminade C, Govella N, Morse AP, Ferguson HM, Baylis M. Impact 1108 of ENSO 2016-2017 on regional climate and malaria vector dynamics in Tanzania. 1109 Environ Res Lett. 2019; 14: 075009. 1110 58. Mandal S, Sarkar RP, Sinha S. Mathematical models of malaria – a review. Mal 1111 J. 2011; 10: 202. Available from: https://www.malariajournal.com/content/10/1/202. 1112 59. Yiga V, Nampala H, Tumwiine J. Analysis of the modelo n the effect of 1113 seasonal factor son malaria transmission dynamics. J Appl Math. 2020; doi: 1114 10.1155/2020/8885558. 1115 60. World Health Organization. Manual on Practical Entomology in Malaria. Part I 1116 and II. Geneva: World Health Organization; 1975. 1117 61. Martens WJM, Niessen LW, Rotmans J, Jetten TH, McMichael AJ. Potential 1118 impact of global climate change on malaria risk. Environ Health Perspect. 1995; 1119 103: 458-464. 1120 62. Yadav CP, Hussain SSA, Mullick R, Rahi M, Sharma A. Climate zones are a 1121 key component of the heterogeneous presentation of malaria and should be added 1122 as a malariometric for the planning of malaria elimination. PLOS Glob Public 1123 Health. 2023; 3(6): e0001878. doi:10.1371/journal.pgph.0001878. 1124 63. Ma X, Liu Z. Application of a novel time-delayed polynomial grey model to 1125 predict the natural gas consumption in China. J Comput Appl Math. 2017; 324: 17- 1126 24. doi: 10.1016/j.cam.2017.04.020. .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 56 1127 64. Rubio-Palis Y. Observaciones sobre el patrón de actividad hematofágica del 1128 vector de malaria Anopheles darlingi en poblaciones del sur de Venezuela. Bol Dir 1129 Malariol Sal Amb. 1995; 35(2): 66-70. 1130 65. Rosa-Freitas MG, Broomfield G, Priestman A, Milligan PJM, Momen H, 1131 Molyneux DH. Cuticular hydrocarbons, isoenzymes and behavior of three 1132 populations of Anopheles darlingi from Brazil. J Amer Mosq Control Assoc. 1992; 1133 8:357-366. 1134 66. Moreno M, Saavedra MP, Bickersmith SA, Lainhart W, Tong C, Alava F, et al. 1135 Implications for changes in Anopheles darlingi biting behaviour in three 1136 communities in the peri-Iquitos region in Amazonian Perú. Malar J. 2015; 14:290. 1137 doi: 10.1186/s12936-015-0804-2. 1138 67. Orjuela LI, Ahumada ML, Avila I, Herrera S, Beier JC, Quiñones ML. Human 1139 biting activity, spatial-temporal distribution and malaria vector role of Anopheles 1140 calderoni in the southwest of Colombia. Malar J. 2015; 14:256. doi: 1141 10.1186/s12936-015-0764-6. 1142 1143 Supporting information 1144 S1 Fig. El Niño 3.4 index for the period 2009 – 2016. 1145 S2 Fig. Root mean square prediction error (RMSE) after 1,000 iterations of 1146 the random sampling approach [leave-one-out cross-validation (LOOCV)] for 1147 each Anopheles species, showcasing the impact of lagging/no-lagging of .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint 57 1148 climatic variables using an approach with linear regression imputation. All= 1149 all Anopheles species collected; darl.1= Anopheles darlingi; goel.6= Anopheles 1150 goeldii; osw.5= Anopheles oswaldoi; other= other Anopheles species collected. 1151 S3 Fig. Root mean square prediction error (RMSE) for each species using a 1152 leave-one-out cross-validation (LOOCV) approach with stochastic linear 1153 regression imputation. All= all Anopheles species collected; darl.1= Anopheles 1154 darlingi; goel.6= Anopheles goeldii; osw.5= Anopheles oswaldoi; other= other 1155 Anopheles species collected. 1156 S4 Fig. Estimated total number of All Anopheles species abundance (ETNM) 1157 time series Gradient Boosting (GB) imputation method. 1158 .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint .CC-BY 4.0 International licenseavailable under a (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint

Text is read by the "Ask this paper" AI Q&A widget below. Extraction quality varies by source — PMC NXML preserves structure cleanly, OA-HTML may include some navigation residue, and OA-PDF can have broken hyphenation. The publisher copy (via DOI) is the canonical version.

My notes (saved in your browser only)

Ask this paper AI returns verbatim quotes from the full text · source: oa-pdf

Answers must be backed by verbatim quotes from this paper's full text. Hallucinated quotes are dropped automatically; if no verbatim passage answers the question, we say so. How this works

Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2026) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-23T02:00:01.238055+00:00
License: CC-BY-4.0