{"paper_id":"14bb577d-2f32-47d0-859d-6c2d4ecbd38a","body_text":"1\n1 Running Title: Machine learning to overcome mosquito missing data\n2 Using machine learning to overcome mosquito \n3 collections missing data for malaria modeling\n4 Yasmin Rubio-Palis1*, Linjia Feng2#a, Karena S. Liang2#b, \n5 Chen Song2#c, Songyuan Wang2, Tymon Duchnicki2, Xiao \n6 Zhang2#d, Lelys Bravo de Guenni2*\n7\n8 1 Instituto de Investigaciones Biomédicas \"Dr. Francisco J. Triana Alonso\" \n9 (BIOMED), Universidad de Carabobo, Maracay, Aragua, Venezuela\n10 2 Department of Statistics, University of Illinois, Urbana-Champaign, Illinois, United \n11 States of America\n12 #a Current address: Department of Financial Engineering, Cornell University, New \n13 York, New York, United States of America\n14\n15 #b Current address: Department of Computer Science, University of Illinois, Urbana-\n16 Champaign, Illinois, United States of America\n17\n18 #c Current address: The Grainger College of Engineering, University of Illinois, \n19 Urbana-Champaign, Illinois, United States of America\n20\n21 #d Current address: Department of Statistics, University of California, Santa Cruz, \n22 California, United States of America\n23\n24 * Corresponding author\n25 Email: lbravo@illinois.edu\n26 Email: rubiopalis@gmail.com\n27\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n2\n28\n29 Abstract\n30 Entomological surveillance plays a crucial role in areas where malaria remains \n31 endemic, yet gathering data on mosquito populations is often expensive and \n32 complicated, particularly in remote locations with challenging logistics and \n33 inconsistent sampling schedules. Access to extensive time series data on mosquito \n34 species at specific sites would greatly enhance insights into seasonal trends and \n35 the biting habits of vectors of malaria parasites. Gaps in mosquito count records \n36 pose a significant challenge for researchers and public health officials seeking to \n37 establish early warning systems and effective vector control programs. In this \n38 study, we apply quantitative machine learning techniques to address missing data \n39 in estimates of mosquito abundance collected from 2009 to 2016 in Bolívar State, \n40 Venezuela. We evaluated Linear Regression, Stochastic Linear Regression, K \n41 Nearest-Neighbor, and Gradient Boosting methods for imputing missing counts of \n42 Anopheles mosquitoes, employing a leave-one-out cross-validation strategy. \n43 Additionally, we developed a predictive malaria transmission model incorporating \n44 mosquito abundance and climate variables (El Niño 3.4 Index, rainfall, and mean \n45 air temperature) as covariates. Our generalized time series model forecasts \n46 malaria incidence of Plasmodium vivax and Plasmodium falciparum based on \n47 climate dynamics and imputed mosquito data. Model performance was assessed \n48 using root mean square error, mean absolute error, and mean absolute percentage \n49 error. The final results demonstrated that machine learning imputation significantly \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n3\n50 improved the accuracy and reliability of P. vivax malaria incidence predictions but \n51 failed to predict P. falciparum incidence. The study demonstrates that method \n52 choice significantly influences the reconstruction of seasonal abundance patterns \n53 and the performance of malaria incidence models. Nevertheless, the proposed \n54 models strengthen the foundation for targeted interventions and surveillance in \n55 endemic regions. Despite limitations in data continuity and coverage, the findings \n56 highlight the value of combining multiyear entomological data sets with robust \n57 imputation and sensitivity analyses to improve predictive modeling in resource-\n58 constrained, malaria-endemic settings.\n59 Introduction\n60 Malaria stands as the most significant parasitic disease transmitted by mosquitoes. \n61 According to the World Health Organization [1], an estimate of 262 million malaria \n62 cases occurred across 80 endemic countries in 2024. Within the Americas, \n63 Venezuela reported an estimated incidence of 9.7-12.6 x 1,000 population at risk, \n64 significantly higher than the incidence reported by Brazil (3.08-3.43 x 1,000 \n65 population at risk), accounting for 17 percent of all malaria cases in the region. \n66 Notably, more than 70 percent of these cases originated from Bolívar State [1].\n67 Mosquito control is a crucial intervention for preventing the transmission of malaria \n68 parasites. Nevertheless, effective surveillance and monitoring of mosquito vector \n69 populations are often constrained by the challenges of conducting regular \n70 collections across multiple locations, particularly in vast and remote regions. These \n71 logistical hurdles frequently result in incomplete data, leading to potentially \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n4\n72 inaccurate assumptions. Furthermore, efforts to model malaria transmission based \n73 on entomological data encounter significant limitations stemming from these data \n74 gaps.\n75 A key mathematical approach to addressing this limitation involves leveraging \n76 machine learning techniques for imputing missing data. Numerous researchers \n77 have adopted such methods to estimate missing values effectively. For instance, \n78 Emmanuel et al. [2]  provided a comprehensive review of the methodologies used \n79 for missing data imputation. Among the techniques employed, the K-nearest \n80 neighbor (KNN) method has demonstrated notable success. Their study also \n81 highlights that the performance of imputation methods, such as random forest-\n82 based algorithms and KNN, is highly dependent on the dataset being analyzed.\n83 However, a potential limitation of the KNN method is its inability to consider the \n84 correlation between potential predictors, which can lead to unsatisfactory results \n85 [3]. To address such shortcomings, boosting algorithms have emerged as \n86 promising alternatives. Research by Cauthen et al. [4]  demonstrated that boosting \n87 methods can outperform non-boosting approaches, especially in scenarios with \n88 high rates of missing values. Their study, which focused on imputing the number of \n89 dengue cases in India, highlighted the enhanced accuracy and robustness offered \n90 by boosting algorithms in handling complex, incomplete datasets.\n91 The Extreme Gradient Boosting (XGBoosting) algorithm, an improved boosting \n92 method, has demonstrated superior performance in imputing medical data [3]. \n93 Additionally, both kernel-based and tree-based machine learning algorithms have \n94 been successfully applied to meteorological datasets [5]. In recent years, \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n5\n95 Recurrent Neural Network (RNN) models based on deep learning have become \n96 increasingly popular for imputing missing values in time series data, owing to their \n97 ability to extract and leverage temporal dependencies within datasets [6,7]. These \n98 studies typically rely on prediction error measurements, most commonly, root mean \n99 square error (RMSE) to evaluate algorithm performance. While machine learning \n100 methods are well-established for missing data imputation in various domains, their \n101 applicability to imputing mosquito abundance time series data remains largely \n102 unexplored. Expanding research in this direction could offer significant advances in \n103 handling incomplete entomological datasets, ultimately improving the accuracy of \n104 malaria modeling and the effectiveness of vector control strategies.\n105 In this study, we systematically compare four distinct approaches for imputing \n106 missing data within incomplete time series of mosquito counts, focusing on the \n107 most abundant Anopheles species collected in a remote Amerindian community in \n108 southern Venezuela. Each imputation method leverages a range of potential \n109 climate drivers as predictors, including indices that capture the influence of the El \n110 Niño phenomenon. By applying these imputation strategies, we generate complete \n111 mosquito abundance time series, subsequently evaluating their performance using \n112 cross-validation error metrics. For both the most prevalent mosquito species \n113 (Anopheles darlingi) and for the aggregated dataset encompassing all species, we \n114 select the imputed time series that yield the lowest cross-validation error. \n115 Anopheles darlingi is the most efficient vector of malaria parasites in the neotropics \n116 [8,9,10], and the confirmed vector in the study area [11]; all the Anopheles species \n117 collected are potential vectors [10,12,13,14], therefore were included in the \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n6\n118 analysis. These optimally imputed mosquito abundance datasets are incorporated \n119 as potential predictors in our time series models of malaria incidence, with the aim \n120 of enhancing the accuracy of malaria transmission modeling in this challenging and \n121 data-limited context.\n122 Beyond the challenges of imputing mosquito abundance data, numerous studies \n123 have explored the impact of climate drivers on malaria incidence across diverse \n124 regions. The importance of climate as a driving force of malaria transmission has \n125 been known for over a century [15]. The development and survival of Anopheles \n126 mosquitoes and the Plasmodium parasites are highly dependent on temperature, \n127 rainfall and humidity [16,17]. On the other hand, large scale climatic drivers as El \n128 Niño/La Niña (ENSO) phenomena are key in modulating the inter-annual variability \n129 of the precipitation worldwide [18]. The relation between ENSO and malaria \n130 epidemics has been studied in South America [19,20,21], and particularly in \n131 Venezuela [22,23,24,25,26]. The integration of climate variables as potential \n132 covariates in malaria prediction models has garnered increasing attention. For \n133 instance, Kifle et al. [27] demonstrated that rainfall data served as a reasonably \n134 good predictor of malaria incidence in Eritrea, highlighting the role of precipitation \n135 in shaping transmission dynamics. Similarly, Kumar et al. [28] introduced a \n136 Bayesian regression model incorporating climate predictors to successfully model \n137 malaria incidence in India. These findings underscore the potential of climate \n138 variables -such as rainfall, temperature, and humidity- as essential components in \n139 forecasting malaria risk and improving the precision of epidemiological models \n140 tailored to specific local contexts. \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n7\n141 Importantly, mosquito abundance time series constitute a critical input for \n142 elucidating the causal pathways that drive fluctuations in malaria incidence. \n143 Previous works, such as Laneri et al. [29] have largely defined the causal \n144 landscape of malaria transmission in terms of climate variables, focusing on factors \n145 such as rainfall, temperature, and humidity. While this approach has yielded \n146 valuable insights, it may overlook the direct role that vector population dynamics \n147 play in modulating malaria risk.\n148 The present study describes the missing-data imputation methods for Anopheles \n149 abundance time series, and the generalized time series models for the \n150 Plasmodium vivax and P. falciparum incidence in a malaria endemic region in \n151 southern Venezuela. We present the application of the different imputation \n152 approaches and compare their performance using a leave-one-out cross-validation \n153 (LOOCV) approach, together with an assessment of the generalized time series \n154 models for P. vivax and P. falciparum malaria. \n155 After applying a variable-selection strategy based on predictive performance in a \n156 training set, the generalized time series models can be used to forecast future \n157 malaria cases in a testing set. These predictive models enhance our understanding \n158 of the interactions between climate, vector abundance and malaria incidence in \n159 remote areas.  \n160 Materials and Methods\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n8\n161 Data Sources\n162 Mosquito collection methods\n163 Entomological surveillance in collaboration with local leaders from remote \n164 Amerindian communities was implemented in 2009 [30]. This action was \n165 demanded by the chief and members of the Ye’kwana community of Boca de \n166 Nichare (lat 06∘34.44’N, long 64∘49.39’W, altitude 59 msnm), Sucre Municipality, \n167 Bolívar State. In this area several studies have been conducted since 2005 with \n168 local leaders, and the demographic, ecologic, epidemiologic, hydrologic and \n169 geomorphologic characteristics have been described [11,30,31,32,33,34]. Local \n170 leaders were trained and supervised by one of the authors (YRP) in mosquito \n171 collections using the Mosquito Magnet Liberty Plus Trap (MMLPT) (American \n172 Biophysics Corporation, North Kingstown, RI). The training topics included \n173 mosquito identification, data collection, reporting, preservation, and shipping of \n174 specimens collected to the central laboratory in Maracay (Aragua State, \n175 Venezuela), where identification to species level and numbers collected were \n176 confirmed. The MMLPT device is battery operated and utilizes Carbon Dioxide \n177 (CO2), one of the products of the catalysis of propane gas, as an attractant \n178 together with Octenol. This trap has been previously evaluated in this region of \n179 southern Venezuela, and its efficiency to collect anophelines compared to human \n180 landing catches was determined [35].\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n9\n181 Mosquito Data\n182 Mosquitoes were collected monthly between 2009 and 2016, for 4 to 10 nights per \n183 month around the full moon from 18:00 to 06:00 hours. To estimate the biting \n184 pattern of the most abundant species, the trap was checked and the net with \n185 trapped mosquitoes was removed every four hours: 18:00 to 22:00 hrs; 22:00 to \n186 02:00 hrs, 02:00 to 06:00 hrs. The total number of individuals was recorded, and \n187 mosquito species were identified. The most predominant species were Anopheles \n188 darlingi (darl.1), Anopheles oswaldoi (osw.5), Anopheles goeldii (goel.6), and other \n189 species with a less important presence in the area included An. braziliensis, An. \n190 triannulatus, An. benarrochi and An. mattogrossensis. A time series of the \n191 estimated mean number of mosquitoes by combining all Anopheles species \n192 collected was also considered in the analysis and will be referred to as All in the \n193 text. The main assumption is that Anopheles abundance and species diversity is \n194 representative of the municipality.\n195 Mosquito Abundance Time Series\n196 The monthly mosquito abundance time series for the most abundant species (An. \n197 darlingi, An. goeldii, and An. oswaldoi), and the monthly time series combining all \n198 Anopheles species collected (All) is presented in Fig 1. \n199 Fig 1. Estimated total number of mosquitoes (ETNM) collected per month. \n200 All = total number of mosquitoes collected of all Anopheles species; darl.1 = \n201 Anopheles darlingi; goel.6 = Anopheles goeldii; osw.5 = Anopheles oswaldoi; Other \n202 = total number of other Anopheles species collected.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n10\n203\n204 All Anopheles species identified during the surveys are considered potential \n205 vectors [9,10,11,36] and, hence were included in the time series and other \n206 analysis. These values were calculated by adding the total number of mosquitoes \n207 collected per night during three consecutive four-hour periods from 18:00 to 06:00 \n208 hours. The average number of mosquitoes per night was calculated by adding the \n209 total number of mosquitoes for all nights of observation and dividing by the total \n210 number of nights in a particular month. This estimate was converted into a monthly \n211 estimate by multiplying the mean number of mosquitoes per night by 𝑛𝑑𝑎𝑦𝑠, where \n212 𝑛𝑑𝑎𝑦𝑠 is the total number of days in each month.\n213 A distinctive feature of this data set is the important proportion of missing data. The \n214 proportion of missing observations (or months without collections) is 60.4% of the \n215 total number of observations. As shown in Fig 1, there are important data gaps, \n216 especially during the years 2012-2013 due to travel limitations to the collection site, \n217 including economic restrictions and the shortage of fuel.\n218 Mosquito Abundance and Biting Activity\n219 Mosquito abundance and biting activity can vary from year to year, depending on \n220 the climatic drivers. It can vary between species and could also vary between \n221 different collection periods (night hours) [11,37,38,39]. To determine the \n222 significance of these factors on the mosquito biting behavior, a three-way ANOVA \n223 model was fitted to the mean number of mosquitoes by year, species and \n224 collection time, including also the interactions among all the three different factors. \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n11\n225 In this case, a normality Shapiro-Wilks test was not rejected. The variable \n226 representing collection times (with three intervals: 18:00 to 22:00 hrs; 22:00 to \n227 02:00 hrs, 02:00 to 06:00 hrs) and its interaction with the other two factors, resulted \n228 as non-significant, indicating that there are not significant differences in the number \n229 of mosquitoes among these distinct mosquito collection periods. For this reason, \n230 this factor was not accounted for in the analysis herein.\n231 Malaria Data Cases\n232 Monthly malaria cases due to P. vivax (PV) and P. falciparum (PF) were obtained \n233 from the Malaria Program report for the Sucre Municipality. Cases resulting from \n234 mixed infections associated with both parasite species were also available and \n235 added to the corresponding number of malaria cases for each parasite. Data on \n236 malaria cases for the study location (Boca de Nichare) was not available. \n237 Population data for the Sucre municipality was available from 2009 to 2017 \n238 (Instituto de Salud Pública, Bolívar State). The monthly P. vivax and P. falciparum \n239 incidence were estimated by dividing the total number of cases of each parasite \n240 species by the population per 1,000 population. The variables P. falciparum \n241 incidence (PF), and P. vivax incidence (PV) were used as response variables in \n242 the malaria model described in the methodology section. Time series of this data \n243 set are shown in Fig 2.\n244\n245 Fig 2. Plasmodium vivax and Plasmodium falciparum incidence, Sucre \n246 Municipality, Bolívar State, Venezuela, 2009-2017.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n12\n247 Climate data\n248 The climate of the study region has been classified as a Tropical Savanna Climate \n249 (Aw) according to Köppen-Geiger climate classification system, with annual \n250 average temperatures of approximately 23.46°C and average annual precipitation \n251 of 2,764 mm/yr.\n252 Nearby meteorological stations with long term time series data are unavailable for \n253 the study site. Historical climate information is readily available from open data \n254 sources, including large scale oceanic-atmospheric indices depicting the temporal \n255 variability of the El Niño-Southern Oscillation (ENSO) index.\n256 Rainfall and mean air temperature data were downloaded from the World Bank \n257 data portal https://climateknowledgeportal.worldbank.org/. Fig 3 shows the monthly \n258 historical time series for the study region.\n259 Fig 3. Rainfall and Mean Air temperature time series during the period 2009-\n260 2016. A. Total Rainfall in mm. B. Mean Air Temperature in degrees Celsius. Boca \n261 de Nichare, Bolívar State, Venezuela.\n262 The El Niño 3.4 Index (Nino3.4) was downloaded from the National Oceanic and \n263 Atmospheric Administration (NOAA) website https://psl.noaa.gov/enso/data.html. \n264 This index depicts the values of the Sea Surface Temperature (SST) Anomalies for \n265 the Pacific Ocean region with extents (5N-5S, 120W-170W) (S1 Fig).\n266 Fig 4 shows the estimated total number of mosquitoes (ETNM) per month time \n267 series together with the anomalized rainfall and the anomalized temperature.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n13\n268 Fig 4. Estimated total number of mosquitoes (ETNM) per month for All \n269 Anopheles species (red line) compared with climate data. A. Anomalized mean \n270 monthly rainfall in mm (blue line). B. Anomalized mean monthly air temperature in \n271 degrees Celsius (green line). Boca de Nichare, Bolívar State, Venezuela (2009-\n272 2016).\n273 Anomalized rainfall and temperature time series were calculated by subtracting the \n274 long-term monthly means from each monthly value. The anomalization procedure \n275 filters out the series’ seasonal components while the monthly and inter-annual \n276 variability remain as dominant components. The  long-term monthly means were \n277 calculated using data from the period 1981 to 2010.\n278 To compare the estimated mean number of mosquito time series with the ENSO \n279 data, Fig 5 overlaps the different ENSO occurrence modes with the mosquito data \n280 before imputation. The highlighted peak in the mosquito times series occurs during \n281 a weak La Niña phase. These maxima occur during a positive rainfall anomaly \n282 period, also preceded by a positive air temperature anomaly.\n283 Fig 5. Estimated Total Number of Mosquitoes (ETNM) per Month and El Niño \n284 3.4 index for the period 2009 – 2016. Boca de Nichare, Bolívar State, \n285 Venezuela.\n286 A summary of the climatic variables during the study period 2009-2016 is \n287 presented in Table 1.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n14\n288 Table 1. Summary monthly statistics for climatic variables Rainfall, Mean Air \n289 Temperature and El Niño 3.4 index during years 2009-2016.\n290 __________________________________________________________________\nVariable Mean Median\nStandard \nDeviation Max Min\nRainfall (mm) 176.68 168.95 130.77 469.1 1.6\nMean Air Temperature (⁰C) 26.44 26.4 0.69 28.6 25.2\nEl Niño 3.4 Index 0.1176 -0.085 0.9871 2.57 -1.7\n291\n292 Methodology\n293 In this section, we describe the different methods used for imputation of \n294 mosquitoes (An. darlingi and all Anopheles species) time series missing data and \n295 present the generalized time series model used to model malaria incidence. \n296 Missing Data Imputation Methods\n297 Missing data handling methods are usually adapted to the nature of missing data \n298 mechanisms. The missing data mechanisms are related to the dependence or not \n299 of the missing data on the observed values. Emmanuel et al. [2]  and Sterne et al. \n300 [40] propose classifying missing data according to their type: missing completely at \n301 random, missing at random and missing not at random. The type of missing data \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n15\n302 we are dealing with in this study can be classified as missing completely at random \n303 because missing values do not depend on the observed or unobserved data, but \n304 rather on the ability to collect information in a timely manner if economic and/or \n305 logistic resources (gasoline, propane gas, trap functioning) are readily available.\n306 Deleting missing data is one method of handling missing values called the \n307 Complete Case Analysis. This procedure is rarely applied because deleting \n308 missing data points (Listwise deletion) can significantly reduce our sample size for \n309 model training purposes, and therefore, accuracy would be a concern.\n310 Using imputation methods as regression imputation allows for estimating missing \n311 values by using other variables as covariates or predictors. In this research, \n312 different approaches were used for missing data imputation. The following methods \n313 will be discussed and compared:\n314 Linear Regression (LR), Stochastic Linear Regression (SLR), K Nearest-Neighbor \n315 (KNN) and Gradient Boosting (GB).\n316 The imputation methods LR, SLR, KNN and GB were implemented in Python \n317 (Google Colab tool) using packages fancyimpute, sklearn and xgboost.\n318 A brief description of each method and the steps for their implementation are \n319 described below.\n320 Linear Regression and Stochastic Linear Regression Imputation\n321 Regression imputation is a technique used to replace missing values 𝑌𝑚 in a data \n322 set based on a set of predictors or attributes 𝑋1,𝑋2,…,𝑋𝑝. It can be classified into \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n16\n323 two types: deterministic regression imputation and stochastic regression \n324 imputation. Deterministic regression imputation predicts missing values using the \n325 estimated regression model:\n326 𝑦𝑚 = 𝛽0 + 𝛽1𝑥1 + … + 𝛽𝑝𝑥𝑝\n327 where 𝛽𝑖 are the parameters of the estimated regression model using least \n328 squares, and 𝑦𝑚 are the estimated mean values of the response variable 𝑌 given a \n329 set of values for the predictors 𝑥1,𝑥2,…,𝑥𝑝. Estimation of the parameters 𝛽0,𝛽1,…,𝛽𝑝 \n330 is based on the observed values 𝑌𝑜 and the corresponding observations for the \n331 predictors (complete cases). However, this method uses a point estimate that does \n332 not consider any random variation  (i.e., an error term) around the regression \n333 model, leading to imputed values that may be too precise resulting in an \n334 overestimation of the correlation between 𝑋𝑖 and 𝑌.\n335 To address this limitation, stochastic regression imputation was used. This method \n336 incorporates a random error term into the predicted value:\n337 𝑦𝑚 = 𝛽0 + 𝛽1𝑥1 + … + 𝛽𝑝𝑥𝑝 + 𝜖\n338 where 𝜖 is a simulated random error term from a normal distribution with zero mean \n339 and variance estimated using least squares. By incorporating this error term, \n340 stochastic regression imputation can reproduce the correlation between 𝑋𝑖 and 𝑌 \n341 more accurately.\n342 Using regression imputation, we used the Leave-One-Out Cross-Validation \n343 (LOOCV) method to calculate the average RMSE for the evaluation of imputation \n344 performance. LOOCV is a special type of k-fold cross-validation in which the \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n17\n345 number of folds k is equal to the number of observations N in the data set. Under \n346 this approach, each observation will be treated as a validation set once, and the \n347 rest (N-1) observations will be used as a training set. This method aims at reducing \n348 bias and preventing over-fitting, and it is useful for data sets with a small number of \n349 observations.\n350 K-Nearest Neighbor method\n351 This algorithm falls under the umbrella of supervised learning methods. It can be \n352 used for both regression and classification problems. It is considered a Lazy \n353 learner algorithm [41] and does not assume a probability distribution of the data. \n354 Instead, KNN assumes that similar inputs have similar outputs, and it will impute \n355 the missing data with the average value amongst its k most similar inputs. KNN \n356 depends heavily on the parameter k, which is the number of nearest neighbors to \n357 be used for imputation. The range of k values used in our imputation process \n358 varied from 1 to 40.\n359 The KNN algorithm uses a distance measure to detect the closest data points to a \n360 target vector 𝑥0 in a neighborhood, for which a missing characteristic 𝑦0 should be \n361 imputed. It is also important to note that the K neighbors do not have any missing \n362 features. Examples of distance measures are the Minkowski distance, Manhattan \n363 Distance, Cosine Distance, Jaccard Distance, Hamming Distance and Euclidean \n364 distance [2]. Among these distance measures, Euclidean distance is commonly \n365 used in practice. The equation is:\n366\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n18\n367 𝐷(𝑝,𝑞) = 𝛴𝑛\n𝑑=1(𝑝𝑑 ― 𝑞𝑑)2\n368\n369 where 𝐷(𝑝,𝑞) is the Euclidean distance, 𝑝 and 𝑞 are two data points in an n-\n370 dimensional space and 𝑛 is the dimension of the space.\n371 The imputed value at 𝑥0 (𝑦0) is the average of all responses 𝑦𝑖 in the \n372 neighborhood:\n373 𝑦0 = 1\n𝐾\n𝐾\n𝑖=1\n𝑦𝑖\n374\n375 The LOOCV method (Leave-One-Out Cross-Validation) is used to select the \n376 optimum value of 𝐾 by comparing the value of the root mean square error (RMSE) \n377 for each number of neighbors 𝐾 and selecting 𝐾 that minimizes RMSE.\n378 Gradient Boosting Method\n379 The Gradient Boosting method (GBoosting) is also a supervised learning algorithm \n380 where a training data set is used to predict a response 𝑦0 from an input vector of \n381 features 𝑥0. It can be used both in a regression context and as a classifier. The \n382 final prediction is an ensemble of decision trees constructed in an iterative way. \n383 Each tree is considered a weak learner, and the algorithm uses the steepest \n384 descent method to move into the direction of minimizing a loss function at each \n385 step. The Gradient Boosting method employs regularization techniques to prevent \n386 overfitting, and it can handle missing data automatically. The Gradient Boosting \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n19\n387 model is flexible for data with a large missing data rate and small size, and it is also \n388 useful for minimizing bias error.\n389 A wide range of hyperparameters must be specified for the Gradient Boosting \n390 model. The hyperparameters used in the imputation process are max_depth \n391 (maximum depth of a tree), min_samples_leaf (minimum number of observations in \n392 a leaf), and learning_rate (how quickly the model learns). The LOOCV method is \n393 used to select the best combination of the three parameters with the lowest RMSE.\n394 The gradient boost algorithm objective is to find an approximation of the function  \n395 F(x) that can minimize the expected value of the loss function L(y, F(x)) [42]:\n396 In the first step, we want to minimize the loss function 𝐿(𝑦𝑖,𝛼), which represents the \n397 difference between the actual values and the predicted values. A constant \n398 prediction of the model 𝐹0(𝑥) is given by:\n399 𝐹0(𝑥) = 𝑎𝑟𝑔 𝑚𝑖𝑛𝛼 \n𝑁\n𝑖=1\n𝐿(𝑦𝑖,𝛼)\n400 Where 𝛼 is the value that minimizes the loss function ∑𝑁\n𝑖=1 𝐿(𝑦𝑖,𝛼).\n401 In subsequent steps (𝑚 = 1,…,𝑀) the following expression is minimized:\n402 𝛼𝑗𝑚 =\n𝑥𝑖∈𝑅𝑗𝑚\n𝐿(𝑦𝑖 ,𝐹𝑚―1 (𝑥𝑖) + 𝛼 )\n403 At each step, several regression trees are created on the residuals of the model. \n404 These residuals are calculated as follows:\n405 𝑟𝑖𝑚 = ―[ ∂𝐿(𝑦𝑖,𝐹(𝑥𝑖))\n∂𝐹(𝑥𝑖) ]𝐹(𝑥)=𝐹𝑚―1(𝑥)\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n20\n406 where 𝑖 = 1,…,𝑁; and m is the index of each tree.\n407 The derivative is taken with respect to the previous prediction 𝐹𝑚―1 , and the result \n408 is proportional to the residual value at the step 𝑚 ― 1.\n409 In step 𝑚 the prediction 𝐹𝑚(𝑥) is updated as:\n410 𝐹𝑚(𝑥) = 𝐹𝑚―1(𝑥) + 𝑣\n𝐽𝑚\n𝑗=1\n𝛼𝑗𝑚1(𝑥 ∈ 𝑅𝑗𝑚)\n411 𝑗 is the index of each terminal node (leaf); 1(.) is the indicator function and 𝐽𝑚 is the \n412 total number of leaves in the tree. 𝑅𝑗𝑚 represents each terminal node and 𝑣 ∈ (0,1) \n413 is the learning rate.\n414 Generalized Time Series Model for Malaria Incidence\n415 One method for modeling malaria cases with covariates is using a generalized \n416 linear model for count time series. Let {𝑌𝑡:𝑡 ∈ 𝑁} be a count time series and {𝑋𝑡\n417 :𝑡 ∈ 𝑁} be a time-varying 𝑟 dimensional covariate vector. We then model the \n418 conditional mean 𝐸(𝑌𝑡|𝐹𝑡―1) of the time series by a process {𝜆𝑡:𝑡 ∈ 𝑁} such that 𝐸(\n419 𝑌𝑡|𝐹𝑡―1) = 𝜆𝑡, where 𝐹𝑡 is the history of the joint process {𝑌𝑡,𝜆𝑡,𝑋𝑡+1:𝑡 ∈ 𝑁} up to \n420 time 𝑡 and includes the information of covariates at time 𝑡 + 1. Furthermore, let 𝑔:\n421 𝑅+→𝑅 be a link function and 𝑔:𝑁0→𝑅 be a transformation function. The count time \n422 series model, as proposed in Liboschik et al. [43],  is of the form:\n423 𝑔(𝜆𝑡) =  𝛽0 +\n𝑝\n𝑘=1\n𝛽𝑘 𝑔(𝑌𝑡―𝑖𝑘) +  \n𝑞\n𝑙=1\n𝛼𝑙𝑔(𝜆𝑡―𝑗𝑙)  +  𝜂𝑇𝑋𝑡 \n424 Here 𝜂 is the parameter vector of the covariate effects, and 𝛽0 is the intercept. We \n425 can regress on past observations of the time series by defining a set 𝑃 = {𝑖1,𝑖2,…,𝑖𝑝} \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n21\n426 of integers where 0 < 𝑖1 < 𝑖2… < 𝑖𝑝 < ∞ and then regressing on lagged \n427 observations 𝑌𝑡―𝑖1,𝑌𝑡―𝑖2,…,𝑌𝑡―𝑖𝑝 with effects 𝛽. We are also able to regress on \n428 lagged conditional means by defining a similar set 𝑄 = {𝑗1,𝑗2,…,𝑗𝑞} and regressing \n429 on conditional means 𝜆𝑡―𝑗1,𝜆𝑡―𝑗2,…,𝜆𝑡―𝑗𝑞 with effects 𝛼.\n430 The model is fitted using quasi-conditional maximum likelihood estimation. If the \n431 link function is the identity function, then 𝑔(𝑥) = 𝑔(𝑥) = 𝑥. Otherwise, if the link \n432 function is logarithmic, then 𝑔(𝑥) = 𝑙𝑜𝑔(𝑥) and 𝑔(𝑥) = 𝑙𝑜𝑔(𝑥 + 1). Furthermore, it \n433 must be assumed that 𝑌𝑡|𝐹𝑡―1 ∼ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆𝑡) or 𝑌𝑡|𝐹𝑡―1 ∼ 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝜆𝑡,𝜙) \n434 where 𝜙 ∈ (0,∞) is an additional dispersion parameter. For details on parameter \n435 estimation, refer to the documentation of the R package tscount [43].\n436 Results\n437 Climate data and mosquito time series dependence\n438 Sample cross-correlation functions were calculated to understand the association \n439 between the estimated total number of mosquitoes per month and the climatic \n440 variables. As mentioned, the rainfall and temperature data time series were \n441 anomalized to partially remove the strong seasonal pattern from these climate \n442 variables.\n443 Fig 6 shows the sample cross-correlation function between the estimated total \n444 number of mosquitoes (ETNM) per month and the monthly values of the El Niño \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n22\n445 index (Niño3.4), the anomalized mean temperature, and mean monthly rainfall for \n446 all Anopheles species.\n447 Fig 6. Cross-correlation between the estimated total number of mosquitoes \n448 (ETNM) for all Anopheles species time series and the Niño3.4 index (A), the \n449 monthly mean temperature (B), and monthly rainfall (C).\n450 By inspecting the negative lag portion of the figures, we can detect the leading lag \n451 for which the mosquito data is significantly correlated with the climate variables.\n452 The maximum correlation in absolute value within a year occurs for lags of 9, 6, \n453 and 2 months respectively. There is a significant positive association between \n454 mosquito abundance and El Niño index, temperature, and rainfall after 9, 6, and 2 \n455 months, respectively.\n456 A similar calculation was repeated for the most abundant species, An. darlingi \n457 (darl.1) and the same lagged significant effect of climate drivers on mosquito \n458 abundance were observed.\n459 Imputation methods comparison\n460 Linear regression and stochastic linear regression\n461 The linear regression model was fitted by least-squares, using all non-missing data \n462 values (referred to as complete cases) as the training data set. This fitted model \n463 was subsequently used to impute missing values. The model was based on \n464 predictors including temperature and rainfall anomalies and the El Niño3.4 index. \n465 These predictors were initially considered without any lagged effects (i.e., no time \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n23\n466 delays). After analyzing the cross-correlations (Fig 6), lagged versions of the \n467 climate predictors were incorporated into the model, using the selected lags \n468 exhibiting the highest cross-correlation with the estimated total number of \n469 mosquitoes (ETNM) response variable. We then made a comparative analysis of \n470 model performance under these two scenarios: using lagged predictors and using \n471 non-lagged predictors.\n472 Our primary focus was on the imputation of data associated with all mosquito \n473 species, obtained by aggregating mosquito counts from all species. In addition, we \n474 examine data imputation for each species separately. For assessing the prediction \n475 error after imputation of missing values, the following approach was employed:\n476 1. Due to the limited availability of non-missing data, a simple random \n477 sampling of the observed data with replacement was implemented. The \n478 number of samples from this sampling process was equivalent to the \n479 number of missing data values. Multiple training data sets were initially \n480 generated by repeatedly applying the random sampling method for 𝐾 \n481 iterations.\n482 2. A linear regression model was fitted to each training sample using least-\n483 squares. To assess model performance, a leave-one-out cross-validation \n484 (LOOCV) error was calculated for each of these training data sets, and the \n485 average root mean square error (RMSE) was obtained.\n486 3. In our analysis, we observed that using lagged climatic variables as \n487 predictors resulted in a better model performance (RMSE= 141.40) \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n24\n488 compared to using climatic variables without lagging (RMSE=142.69)  (S2 \n489 Fig). \n490 4. The same procedure was followed for the stochastic model: climatic \n491 variables as predictors were considered before lagging (no lag) and after \n492 lagging, and model performance was compared in these two cases. The \n493 stochastic regression approach used the same random sampling function as \n494 the linear regression model to impute missing values. It generated random \n495 predictions based on the regression model, considering the prediction \n496 standard error to fill in the missing values. Random predictions were \n497 generated from a normal distribution centered at the estimated mean \n498 predicted value, with a standard error equal to the standard deviation of the \n499 residuals for each estimate. If the random prediction was greater than zero, \n500 it was used as the imputed value for that specific missing value. To evaluate \n501 the performance of the stochastic imputation method, we used the leave-\n502 one-out cross-validation (LOOCV) method, resulting in a list of RMSE \n503 values for each training sample that were averaged for all iterations. Our \n504 analysis indicated that utilizing lagged climatic variables mean (RMSE= \n505 0.13) led to a slightly better performance compared to using climatic \n506 variables without lagging mean (RMSE=0.10) (S3 Fig). \n507 K-Nearest Neighbor (KNN)\n508 The model was trained using the leave-one-out cross-validation method (LOOCV) \n509 to adjust the parameter 𝐾 (number of nearest-neighbors). Climatic variables as \n510 predictors were considered before lagging (no lag) and after lagging, and the \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n25\n511 model performance was compared in these two cases. The following steps were \n512 followed:\n513 The LOOCV method was used to select the best value of the 𝐾 parameter. A KNN \n514 model was fitted to all five species for values of 𝐾 in the range of 1 to 40. The value \n515 of 𝐾 with the lowest RMSE for the LOOCV method was 𝐾 = 16 in An. darlingi \n516 (darl.1). Fig 7 shows a comparison of the RMSE for each value of 𝐾 for An. darlingi \n517 (darl.1) and All mosquito species. \n518 Fig 7. Comparison of the leave-one-out cross-validation (LOOCV) root mean \n519 square error (RMSE) for each 𝐾 value for Anopheles darlingi (darl.1) and for \n520 All Anopheles mosquito species. \n521 The KNN imputation model was constructed for An. darlingi (darl.1) and for All \n522 species with the number of neighbors 𝑛𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠=16 and a data 𝑡𝑟𝑎𝑖𝑛:𝑡𝑒𝑠𝑡 split \n523 ratio of 8:2. The model was run 1,000 times, and the root mean square error \n524 (RMSE) was calculated for each iteration. A better performance of the \n525 methodology was found when using lagged climatic variables (RMSE= 126.83) in \n526 comparison to using climatic variables with no lagging (RMSE= 164.47) for darl.1. \n527 Fig 8 shows a comparison of the RMSE for each iteration using climatic variables \n528 before and after lagging. These results confirm better predictability of the fitted \n529 model when using lagged climate predictors. \n530 Fig 8. Comparison of root mean square error (RMSE) for K-nearest neighbor \n531 (KNN) method in each run, using climatic variables before and after lagging. \n532 A: Anopheles darlingi (darl.1); B: All Anopheles species collected.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n26\n533 Gradient Boosting (GB)\n534 The model was trained using the leave-one-out cross-validation (LOOCV) method \n535 by tuning three parameters in the Gradient Boosting algorithm, which include max \n536 depth, min samples leaf, and learning rate. Climatic variables as predictors were \n537 considered before lagging (no lags) and after lagging in the same way as with \n538 previous imputation methods, and model performance was compared in these two \n539 cases. The following steps were followed:\n540 1. The LOOCV method was used to select the best value of the parameters \n541 mentioned above. The GB model was fitted to species An. darlingi (darl.1) \n542 and All species for different values of the hyperparameters. We used a grid \n543 search to tune the hyperparameters, aiming to find the optimal combination \n544 of three hyperparameters. The hyperparameter values with the lowest \n545 RMSE for LOOCV were min samples leaf = 1, learning rate = 0.1, max \n546 depth = 10.\n547 2. The Gradient Boosting imputation model with the default hyperparameters \n548 of GB was constructed for An. darlingi (darl.1) to compare the performance \n549 between lagged and no-lagged climate variables. The dataset was split into \n550 training and testing sets at an 8:2 ratio. The model was run 100 times and \n551 the root mean square error (RMSE) was calculated for each iteration. The \n552 methodology performed better when using lagged climatic variables \n553 (RMSE= 163.75) compared to non-lagged climatic variables (RMSE= \n554 217.44) for darl.1. Fig 9 shows a comparison of the RMSE for each iteration \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n27\n555 using climatic variables before and after lagging. These results again \n556 highlight the advantage of using lagged predictors over non-lagged ones.\n557 Fig   9. Comparison of root mean square error (RMSE) for the Gradient \n558 Boosting (GB) method in each run. Using climatic variables before and after \n559 lagging. A: Anopheles darlingi (darl.1); B: All Anopheles species collected.\n560 Fig 10 show the contrast between the original data (in blue) and the imputed data \n561 (in yellow) for the four different imputation methods used for the species An. \n562 darlingi (darl.1).\n563 Fig 10. Estimated total number of Anopheles darlingi abundance (ETNM) time \n564 series imputation methods. A. Linear Regression (LR). B. Stochastic Linear \n565 Regression (SLR). C. K-Nearest Neighbor (KNN). D. Gradient Boosting (GB).\n566 Important differences can be distinguished among the different methods, especially \n567 between linear regression imputation and the remaining methods. These observed \n568 differences are not surprising given the variety of imputation approaches used. A \n569 comparison of the prediction errors based on the non-missing or complete case \n570 data values can be assessed by estimating the RMSE using a LOOCV method for \n571 each imputation approach. These results are presented in Table 2. Table entries \n572 are relative numbers, calculated after dividing by the mean number of mosquitoes \n573 for each species.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n28\n574 Table 2. Leave-One-Out Cross Validation (LOOCV) method errorsa for the \n575 different imputation methods and mosquito species.\n576 _______________________________________________________________\nAnopheles \nSpecies Imputation Method\nKNN G. Boosting Linear R. Stochastic R.\nAllb 0.7638104005 0.0000000297  0.968946389 0.000917769\nAn. darlingi 0.7727904964 0.0000000539   0.972812828 0.001148532\nAn. goeldii 0.8078696262 0.00000012   1.308744024 0.002081277\nAn. oswaldoi 1.345153286 0.000000627  1.503379349 0.001657576\nOther 1.073280571 0.00004271  1.120539972 0.001491594\n577 aThe root mean square error (RMSE) values were calculated using the complete \n578 case observations (non-missing cases).\n579 bAll Anopheles species collected. \n580 c Other than An. darlingi, An. goeldii and An. oswaldoi species collected. \n581 The GB methodology shows the best predictive performance with the lowest \n582 testing errors in all cases. The same analysis was done for All Anopheles species, \n583 with similar results (S4 Fig). \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n29\n584 Malaria Incidence Time Series Model\n585 Malaria Incidence Model Covariates\n586 For the generalized linear time series count model described above, we consider \n587 several potential covariates for inclusion in the model: monthly rainfall, mean \n588 monthly temperature, ENSO index (El Niño 3.4), and an imputed mosquito count \n589 time series. Initially, we selected the imputed mosquito time series with the lowest \n590 LOOCV error which corresponds to the Gradient Boosting approach (Table 2).\n591 Additionally, we considered lagged versions of each predictor. To select the \n592 optimal lag, we calculated the rank cross-correlation of the malaria incidence time \n593 series and each climatic anomalized predictor after removing the long-term \n594 monthly means. The selected lag was the one that resulted in the highest cross-\n595 correlation between the anomalized climatic time series and malaria incidence. For \n596 rainfall, a two-month lagged time series was considered based on the time lag that \n597 takes into account the time for creating habitats for the development of the aquatic \n598 stages of Anopheles mosquitoes (from eggs to adults), the buildup of mosquito \n599 populations, blood feeding interval and the development of the parasite inside the \n600 mosquito (sporogony) [37,44,45,46,47]. \n601 Figs 11, 12, 13, and 14 show the sample rank cross correlation functions of the P. \n602 vivax and P. falciparum incidence with rainfall, mean air temperature, ENSO index, \n603 and imputed All Anopheles species time series. The lag values for the negative \n604 portion of the graphs with the highest rank cross-correlations in absolute value are \n605 shown in Table 3. Note that cross-correlations between rainfall and P. vivax \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n30\n606 incidence are nonsignificant at any lag. However, the negative lag corresponding to \n607 the highest cross-correlation (in absolute value) was selected as a model \n608 covariate. This lag varied between 4 and 5 months for rainfall, 8 months for \n609 temperature, 16 and 17 months for the ENSO index, and 4 to 6 months for \n610 mosquito counts (Table 3).\n611 Fig 11. Anomalized rainfall and malaria incidence Rank Cross-Correlation. \n612 A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum incidence (PF). \n613 ACF= Sample Cross-correlation Function.\n614 Fig 12. Anomalized temperature and malaria incidence Rank Cross-\n615 Correlation.  A. Plasmodium vivax incidence. B. Plasmodium falciparum \n616 incidence. ACF= Sample Cross-correlation Function.\n617 Fig 13. Anomalized El Niño 3.4 (ENSO) and malaria incidence Rank Cross-\n618 Correlation. A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum \n619 incidence (PF). ACF= Sample Cross-correlation Function.\n620 Fig 14. All Anopheles species counts and malaria incidence Rank Cross-\n621 Correlation. A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum \n622 incidence (PF). ACF= Sample Cross-correlation Function.\n623 Table 3. Selected lags for climate and mosquito count covariates to be \n624 included in the Plasmodium vivax and Plasmodium falciparum malaria \n625 incidence models for All Anopheles species and for Anopheles darlingi.\n626 ________________________________________________________________\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n31\nMalaria \nModel\nMosquito \nSpecies Rainfall Temperature ENSO\nMosquito \ncount\nP. vivax All species 2, 4 8 17 4\nP. falciparum  2, 5 8 16 5\nP. vivax An. darlingi 2, 4 8 17 4\nP. falciparum  2, 5 8 16 6\n627\n628\n629 Finally, we considered regressing the malaria incidence on past observations and \n630 past conditional means. The Sample Autocorrelation Function (ACF) and Partial \n631 Autocorrelation Function (PACF) plots of the malaria time series shown in Fig 15 \n632 was used to determine the lag of past observations that had the greatest impact on \n633 the malaria incidence cases at time 𝑡. Due to the highly significant partial \n634 autocorrelation at lag 1, we included past malaria observations at time 𝑡 ― 1 only. \n635 Furthermore, after including the past observation, we also consider adding the \n636 conditional mean at 𝑡 ― 12 to account for possible seasonal  effects .\n637 Fig  15. Sample Autocorrelation Function (ACF) and Partial Autocorrelation \n638 Function (PACF) for Plasmodium vivax incidence time series. \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n32\n639 For all models, the logarithmic link function 𝑔(𝜆𝑡) = 𝑙𝑜𝑔(𝜆𝑡) was used. The identity \n640 function cannot fit models with negative covariates (such as the ENSO index) and \n641 cannot estimate effect parameters less than 0. Since some effects may be \n642 negative, the identity function is too restrictive in this case.\n643 Malaria Model Fitting and Prediction\n644 The malaria model was fitted to the incidence of P. vivax and P. falciparum. To \n645 select the best Time Series Generalized Linear Model (TSGLM), we calculated \n646 different accuracy metrics for each unique combination of lagged and non-lagged \n647 covariates (rainfall, temperature, ENSO, and mosquito counts), an autoregressive \n648 term of malaria incidence at 𝑡 ― 1, and the conditional mean malaria incidence at \n649 𝑡 ― 12. This term considers any seasonal dependence that might be present in the \n650 data. The conditional mean was only considered if the autoregressive term was \n651 already in the model. Mosquito counts for All Anopheles species imputed with the \n652 GB method were proposed as potential covariates. Additional results were also \n653 produced using An. darlingi mosquito counts as a predictor. We also included an \n654 additional trend shift component to account for the observed trend increase of \n655 malaria incidence from year 2015 onward.\n656 We focus on selecting the model that gives the best prediction accuracy when \n657 compared to observations. For this purpose, we used an 80-20 training-testing \n658 sample split. The training set was used to fit models according to the above \n659 algorithm, and each model was used for prediction in the testing set. Using these \n660 predictions, we calculated a variety of error metrics: the root mean square error \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n33\n661 (RMSE), the mean absolute error (MAE), and the mean absolute percentage error \n662 (MAPE). Lower error values are preferred. The best models, as indicated by the \n663 lowest values of RMSE, MAE, and MAPE, are presented in Table 4.\n664 Table 4. Selected covariates for the Plasmodium vivax and Plasmodium \n665 falciparum malaria incidence models that minimize the prediction error in a \n666 testing set using different accuracy measures.\n667                 Plasmodium vivax                             Plasmodium falciparum\nCovariates RMSE MAE MAPE Covariates RMSE MAE MAPE\nRainfall Xa X Xa Rainfall Xa Xa Xa\nRainfall (lag 2) X X Rainfall (lag 2) X Xa Xa\nRainfall (lag 4) X X X Rainfall (lag 5)\nTemperature X X X Temperature X X\nTemp (lag 8) Temp (lag 8) X\nENSO X X X ENSO Xa Xa Xa\nENSO (lag 17) ENSO (lag 17) Xa Xa Xa\nMosquito Countb X X Mosquito Countb\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n34\nMosq Count (lag 4) X X X Mosq Count (lag 4)\nPast Obsc (lag 1) Xa Xa Xa Past Obsc (lag 1) Xa Xa Xa\nPast Mean (lag 12) Past Mean (lag 12) X X X\nTime Shiftd Xa Xa Xa Time Shiftd X X X\n668\n669 RMSE= root mean square error; MAE= mean absolute error; MAPE= mean \n670 absolute percentage error.\n671 a Coefficient is significant at 5%\n672 b Mosquito counts include All Anopheles species imputed using Gradient Boosting\n673 c Past Observation\n674 d Time Shift Post-2015\n675 The general malaria incidence model for P. vivax or P. falciparum can be written \n676 as:\n677 𝑙𝑜𝑔(𝜆𝑡) = 𝛽0 + 𝛽1𝑙𝑜𝑔(𝑌𝑡―1) + 𝛼1𝑙𝑜𝑔(𝜆𝑡―12) + 𝜂𝑅⊺𝑅𝑡 + 𝜂𝑇⊺𝑇𝑡 + 𝜂𝐸𝑛⊺𝐸𝑛𝑡 + 𝜂𝑀𝐶⊺𝑀𝐶𝑡\n+ 𝜂𝐼𝐼𝑡―2015\n678 where 𝜆𝑡 is the expected malaria incidence rate and 𝑌𝑡 is the observed incidence \n679 rate at time 𝑡 [(cases/population) × 1,000]. 𝑅𝑡, 𝑇𝑡 and 𝐸𝑛𝑡 are the climatic variables \n680 (rainfall, temperature, ENSO respectively) vector with non-lagged and lagged \n681 values at time 𝑡.  𝑀𝐶𝑡 is the mosquito counts vector with non-lagged and lagged \n682 values at time 𝑡.  𝐼𝑡―2015 is an indicator variable which is zero if the year is lower \n683 than 2015 (constant trend), and it is proportional to time if time is greater than \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n35\n684 2015. Model parameters 𝛽0 (intercept), 𝛽1 (𝑡 ― 1 observation), 𝛼1 (𝑡 ― 12 mean), \n685 and parameter vectors 𝜂𝑅,𝜂𝑇,𝜂𝐸𝑛,𝜂𝐼 are estimated by maximum likelihood using the \n686 R library tscount [43].\n687 Selected covariates in Table 4 (marked with “X”) minimize the different prediction \n688 error metrics and may vary depending on the prediction criteria used. For the P. \n689 vivax malaria model, rainfall, lagged rainfall (lags 2 (MAE and MAPE) and 4), \n690 temperature, ENSO, mosquito counts (RMSE, MAE), lagged mosquito counts (lag \n691 4) and the previous time-step expected malaria incidence are selected as the most \n692 important covariates in the sense of minimizing the predictive errors for the testing \n693 set, with a significant trend shift after the year 2015. Only rainfall, past month \n694 observations, and the trend shift component are significant at the 95% confidence \n695 level.\n696 For the P. falciparum malaria incidence model, rainfall, lagged rainfall (lag 2), \n697 temperature (MAE and MAPE), lagged temperature (lag 8, RMSE), ENSO, lagged \n698 ENSO (lag 16), the previous time step malaria incidence observation, past year \n699 mean incidence and a linear trend shift are selected as the most important \n700 covariates that minimize the model predictive errors over the testing data set. Only \n701 rainfall and lagged rainfall, ENSO and lagged ENSO, and the previous month \n702 observation are statistically significant according to the 95% confidence intervals \n703 calculated using the normal approximation for the model parameter estimates. \n704 Predictors selected by minimizing the predictive errors are not necessarily going to \n705 be statistically significant according to the Wald test statistic (see Table 4). \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n36\n706 However, predictability performance was preferred over statistical significance in \n707 this case.\n708 Fig 16 shows the model fitting and forecast results for the P. vivax malaria \n709 incidence model for All Anopheles species and An. darlingi counts as predictors. \n710 Model forecasts also include the 80% and 95% confidence intervals. Two model \n711 versions were fitted: the first one used imputed An. darlingi as a predictor; and the \n712 second one used All Anopheles species mosquito counts as a predictor.\n713 Fig 16. Model fit and predictions for Plasmodium vivax malaria incidence \n714 time series. A: All Anopheles species used as a covariate for mosquito counts. B: \n715 Anopheles darlingi used as a covariate for mosquito counts. \n716 Fig 16  shows a better forecasting performance when using All mosquito species \n717 as a predictor (Fig 16A) in comparison when using An. darlingi (Fig 16B). In this \n718 case  most of the observations in the testing set are included within the 80% \n719 confidence intervals with a clear upward trend from year 2015.\n720 Fig 17 shows the model fitting and forecast results for the P. falciparum malaria  \n721 incidence model. Based on the lowest prediction errors, the mosquito count \n722 variable was not included as a predictor (Table 4).\n723 Fig 17. Model fit and predictions for Plasmodium falciparum malaria \n724 incidence time series. \n725 According to Fig 17, data observations in the testing set are included within the \n726 95% and 80% confidence intervals of the predicted values.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n37\n727 Malaria Model sensitivity to mosquito data imputation methods\n728 An important question to be addressed is the sensitivity of the mosquito counts \n729 data imputation methods in the malaria model configuration and prediction \n730 accuracy. Although the Gradient Boosting approach resulted in the best imputation \n731 method performance according to the results presented in Table 2, mosquito \n732 counts imputed with all methods were tested as potential predictors, and variable \n733 selection was recorded in each case. We explored how predictability changes \n734 when using different mosquito imputed time series in the model selection process. \n735 Model configuration (or best predictors set) can also change when different \n736 imputed mosquito count time series are used in the model selection process.\n737  Gradient Boosting (GB) imputed mosquito counts and lagged mosquito counts for \n738 All Anopheles species were not included as potential model covariates in the P. \n739 falciparum best predictive model; however, mosquito counts were selected as \n740 covariates in the P. vivax malaria model by all prediction accuracy criteria.      \n741 In Table 5, we compare the different accuracy measures in the 20% testing data \n742 set for the P. vivax and P. falciparum malaria models and the four different time \n743 series of imputed mosquito counts for All Anopheles species.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n38\n744 Table 5. Plasmodium vivax and Plasmodium falciparum malaria models. \n745 Comparison of prediction errors for different imputation methods for \n746 mosquito counts time series for All Anopheles species.\n747                                      Plasmodium vivax                 Plasmodium falciparum\nImp Method RMSE MAE MAPE RMSE MAE MAPE\nLR 17.537 14.207 63.244 0.855 1.168 21.941\nSRL 16.813 14.543 58.895 1.023 0.711 23.94\nKNN  5.295  4.268 26.392 0.879 0.697 22.707\nGB  5.368  4.137 25.838 0.879 0.697 22.707\n748\n749 LR= Linear Regression; SLR= Stochastic Linear Regression; KNN= K-Nearest \n750 Neighbor; GB= Gradient Boosting.\n751 RMSE= root mean square error; MAE= mean absolute error; MAPE= mean \n752 absolute percentage error.\n753 The results show that for the P. vivax malaria incidence model, the lowest \n754 prediction errors are obtained when imputed K-Nearest Neighbor (KNN) and \n755 Gradient Boosting (GB) All Anopheles species counts are used as covariates. \n756 Larger prediction errors are obtained when using imputed LR and SLR mosquito \n757 time series as covariates in the model, compared with using imputed GB and KNN \n758 mosquito time series.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n39\n759 The best predictive P. falciparum malaria incidence models exclude the covariate \n760 mosquito counts when K-Nearest Neighbor (KNN) and Gradient Boosting (GB) \n761 imputed time series are used as predictors. In this case, prediction errors for the \n762 malaria incidence model are the same for both time series. However, this result \n763 changed when Linear Regression (LR) or Stochastic Linear Regression (SLR) \n764 were used as imputation methods. All Anopheles species-imputed time series were \n765 included as potential covariates in the malaria incidence model in this case. \n766 Prediction errors across each predictive measure (RMSE, MAE, MAPE) do not \n767 show high variability among different imputation methods.\n768 Results from Table 5 indicate a greater sensitivity to the imputation methods for the \n769 P. vivax malaria incidence model than for the P. falciparum model.\n770 Discussion\n771 The elimination of malaria requires a strong surveillance system. A key component \n772 is entomological surveillance which relies on monitoring anopheline populations \n773 mainly in terms of species composition, abundance, biting behavior, and \n774 insecticide resistance [48]. The collection of entomological data requires a huge \n775 investment in regular mosquito collections in various sentinel sites throughout \n776 endemic areas. Systematic data collection approaches as the one presented by \n777 Drakou et al [49] provides a good example of the implementation of mosquito \n778 network surveillance with adequate spatial coverage, while in our study \n779 implementing a comprehensive spatial surveillance effort was logistically \n780 unfeasible. In the study by Drakou et al [49], surveillance was limited to a single \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n40\n781 year. However, a multiyear analysis, such as the one we are presenting, is always \n782 valuable for understanding the impact of environmental factors on mosquito \n783 bionomics and to understand year-to-year variability.\n784  In remote locations with difficult access, a way to contribute to providing relevant \n785 entomological data to the malaria program is through well-trained local leaders. \n786 This approach was attempted in an Amerindian village of the Caura River basin in \n787 southern Venezuela. Nevertheless, the political, economic, and social crisis [50,51] \n788 jeopardized the success of such an initiative, resulting in irregular interruptions of \n789 mosquito collections with the end result of large missing data, making it difficult to \n790 interpret the data and provide reliable information for decision-making for vector \n791 control. \n792 To overcome this limitation, the application of data imputation methods arise as a \n793 potential solution. Machine learning approaches, specifically Boosting methods \n794 have been proposed by other authors when there is a significant percentage of \n795 missingness [4]. However, imputing missing data under the assumption that it is \n796 missing at random may not be entirely accurate, as intrinsic seasonal and \n797 operational factors may have influenced the sampling effort.\n798 Given the large percentage of missing data in the mosquito abundance time series, \n799 the application of different imputation methods provides a good alternative when \n800 mosquito surveillance is highly fragmented, and continuous time series are not \n801 available. However, different imputation methods might capture different data \n802 features, and testing the feasibility of the imputed values still remains a challenging \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n41\n803 task. Four imputation methods were compared: Linear Regression, Stochastic \n804 Linear Regression, K‑Nearest Neighbor, and Gradient Boosting.\n805 Considering the limited data availability and rather short time series, the LOOCV \n806 approach was the best option to evaluate predictability among the different \n807 methods. Gradient Boosting and the Stochastic Linear Regression methods \n808 demonstrated better efficiency in terms of accuracy, reporting the lowest LOOCV \n809 errors (Table 2). In contrast, the Linear Regression approach was less efficient \n810 than the other methods, showing the highest LOOCV errors and a poorer \n811 representation of the seasonal cycle. Comparison among different imputation \n812 methods for mosquito data and any other time series data is a necessary first step \n813 in the evaluation of potential biases in the new data information provided [52]. If \n814 this new data is used as input to an impact model, conducting a sensitivity analysis \n815 of the imputation methods on the expected response would provide additional \n816 insights into the usefulness of the imputed data. In general, the seasonal cycle was \n817 well represented by all imputation methods, with marked seasonal peaks between \n818 August and September, two to three months after the beginning of the rainy \n819 season. Similar results were recorded in other malaria endemic areas of \n820 Venezuela [13,39], while in other regions the seasonal peaks are related to the \n821 transition period between rainy and dry season [38,53,54]. The large spike \n822 observed in 2010 (Fig 5), has been associated to a weak La Niña condition. A \n823 similar spike to the one observed in 2010 was estimated for 2016 when applying \n824 the GB imputation method to the An. darlingi time series, with a maximum \n825 comparable to the one recorded in 2010. However, mosquito measurements were \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n42\n826 only available through July 2016, so direct validation for this period is not possible. \n827 Furthermore, we cannot conclude that there is a direct association of this spike \n828 with a weak La Niña event, since the next event was reported during the Fall 2017 \n829 up to Spring 2018 [55]. It is important to point out that although there are abundant \n830 references worldwide relating malaria and other diseases to ENSO events, the \n831 available data on mosquito abundance and ENSO is scant. El Niño is associated \n832 with an increase in rainfall in the Colombia Pacific coast and East Africa, and La \n833 Niña with drought. A three-year study conducted in the Colombian Pacific coast \n834 reported no relation between ENSO and the abundance of An. albimanus and An. \n835 darlingi [56] while a one-year study in Tanzania showed a decline in the \n836 abundance of An. arabiensis and An. funestus associated with the drought due to \n837 La Niña 2016-2017 [57]. \n838 The relationship between climate and malaria is powerful, complex, and highly \n839 localized. Several studies clearly show that factors like temperature and rainfall, \n840 and the global climate patterns like El Niño that influence them, are critical drivers \n841 of malaria epidemics [17,58,59]). Previous studies in Venezuela have shown that \n842 El Niño is associated with drought, resulting in an increase in malaria cases the \n843 following year [21,22,23]). In the present study the significant lags for rainfall and \n844 ENSO were one month earlier for the P. vivax model than for the P. falciparum \n845 model (Table 3), which is reasonable considering that the sporogonic cycle of P. \n846 falciparum is larger than that of P. vivax [60,61]. Although variations in temperature \n847 have a great impact on the developmental time of the mosquitoes and the \n848 parasites [60,61,62], the lag of 8 months was the most significant for both models.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n43\n849 When fitting the P. vivax and P. falciparum malaria incidence models, the main \n850 issue is the non-significance or exclusion of mosquito abundance as covariates for \n851 the P. falciparum malaria incidence model. These findings suggest that it is \n852 insufficient to treat mosquito abundance and diversity data collected from a single \n853 locality as representative for the entire municipality, because these ecological \n854 characteristics can differ significantly between localities within the same region \n855 [11,38,39,53].\n856  However, the proposed malaria model is still useful for predictive purposes, since \n857 the model accounts for potential seasonal effects and significant changes in the \n858 incidence trend, lagged and non-lagged environmental factors, and the impact of \n859 the previous month's malaria incidence. The better model performance observed \n860 when using All Anopheles species as a mosquito count predictor, compared with \n861 using An. darlingi  (Fig 16) might be due to the unexpected spikes obtained in the \n862 imputed mosquito count time series of An. darlingi for the Boosting method (GB) \n863 (Fig 10D). \n864 The sensitivity analysis using different imputation methods is a key component in \n865 the results presented in this work. These results are demonstrated through the \n866 process of evaluating both model accuracy and model structure, as these aspects \n867 are sensitive to the imputation method employed. Therefore, it is important to \n868 explore all available methods to ensure robust and reliable conclusions.  When \n869 assessing the sensitivity of the different imputation methods on the malaria model \n870 predictive performance, the MAPE results shown in Table 5 confirms a reasonably \n871 good predictive performance of the P. vivax model when mosquito counts time \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n44\n872 series for All Anopheles species imputed with KNN and GB methods are used as \n873 predictors  (20% <MAPE<30%), while values of MAPE are greater than 50% when \n874 mosquito count time series imputed with LR and SLR methods are used instead. \n875 Thresholds to evaluate model predictive performance according to the MAPE \n876 criteria have been discussed by Ma and Liu [63].These results might be an \n877 indication that these two imputation methods (LR and SLR) do not accurately \n878 represent the mosquito abundance variability for All Anopheles species, and as a \n879 consequence might fail to reproduce the association between mosquito abundance \n880 and malaria incidence. In the case of the P. falciparum model predictive \n881 performance according to MAPE is more homogeneous across the different \n882 imputation methods (Table 5). But in this case mosquito abundance does not yield \n883 a better model accuracy when the KNN and GB imputed mosquito counts are \n884 considered as predictors. Differences in model accuracy between P. vivax and P. \n885 falciparum are driven by the selection of the sets of predictors that minimize \n886 prediction errors. Mosquito counts imputed with the regression models for All \n887 Anopheles species preserve the seasonality in the mosquito abundance cycle but \n888 fail to represent important year-to-year variability fluctuations exhibiting a higher \n889 degree of smoothing. However, when used as predictors they can explain a good \n890 percentage of the malaria time series variability in the P. falciparum model. This is \n891 not the case for the GB and KNN imputed mosquito time series, for which the more \n892 complex fluctuations do not contribute to explain the observed variance in the \n893 malaria time series model. We can speculate that data limitations may be an \n894 important factor underlying these results. \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n45\n895 Despite data limitations, this work provides a template for integrating multiyear \n896 entomological observations, climate information, and robust imputation techniques \n897 to improve malaria prediction systems in remote settings. The approaches \n898 developed here can support more timely and geographically informed \n899 decision‑making, enabling public health authorities to anticipate high‑risk periods, \n900 optimize vector control interventions, and better allocate scarce resources. \n901 Continued investment in local capacity for mosquito monitoring—combined with \n902 modern data‑analytic tools—can strengthen the foundations of malaria surveillance \n903 and contribute to more effective control strategies in the Amazon and similar \n904 hard‑to‑reach regions.\n905 Finally, a key entomological parameter for the implementation of a particular \n906 intervention for control is to determine the vector biting hourly pattern. This pattern \n907 is species specific; nevertheless, An. darlingi exhibits different patterns along its \n908 range of distribution, even among locations within a few kilometers apart \n909 [11,64,65,66]. During the present study, the mosquito abundance at different hour \n910 intervals was not significant, contrasting with previous studies in Sucre municipality \n911 along the Caura River where An. darlingi showed a biting peak around midnight in \n912 Las Majadas [64] whereas in Jabillal, there were two distinctive peaks at sunset \n913 and sunrise [11]). The lack of significance in the hour intervals might be due to the \n914 low abundance of mosquitoes collected throughout the study. Similar results were \n915 reported for An. calderoni in Colombia [67].\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n46\n916 Conclusion\n917 This study demonstrates that machine learning–based imputation methods offer a \n918 powerful tool for reconstructing highly incomplete mosquito abundance time series \n919 in remote, resource‑limited malaria‑endemic regions. Incorporating the imputed \n920 mosquito series and climate predictors into generalized time‑series models of \n921 malaria incidence revealed important differences between P. vivax and P. \n922 falciparum. Models for P. vivax responded strongly to mosquito abundance, \n923 rainfall, temperature, ENSO anomalies, and recent malaria history, with Gradient \n924 Boosting and KNN imputations yielding the most accurate forecasts. In contrast, P. \n925 falciparum incidence was largely insensitive to mosquito abundance, likely \n926 reflecting mismatches between the geographic scale of entomological and \n927 epidemiological datasets, lower case counts, or intrinsic differences in \n928 transmission. Climate drivers—particularly rainfall and ENSO—retained predictive \n929 power for both parasites, underscoring their central role in shaping transmission \n930 potential.\n931 Acknowledgment\n932 We are most grateful to Simón Caura, the local leader, Hernán Guzmán, Yarys \n933 Estrada, Victor Sánchez, Jorge Moreno and Jesús González, who participated in \n934 field activities. To the people of Boca de Nichare, for their support and friendship. \n935 To Ana María Ibañez and Horacio Vargas for logistic support and friendship. \n936 Mariapia Bevilacqua, Lya Cárdenas, and Zenaida Muria (ACOANA) and Dirección \n937 de Salud Ambiental (MPPSalud) for logistic support.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n47\n938 References\n939 1. World Health Organization (WHO). World Malaria Report 2025: addressing the \n940 threat of antimalaria drug resistance. Geneva: World Health Organization; 2025. \n941 2. Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A \n942 survey on missing data in machine learning. J Big Data. 2021; 8: 140.\n943 3. Zhang X, Yan C, Gao C, Malin BA, Chen Y. Predicting missing values in medical \n944 data via XGBoost regression. J Health Inform Res. 2020; 4: 383–394.\n945 4. Cauthen KR, Lambert G, Ray J, Lefantzi S. Imputing data that are missing at \n946 high rates using a boosting algorithm. Sandia National Laboratories. SAND-2016. \n947 Available from: https://www.sandia.gov/app/uploads/sites/203/2022/06/sand2016-\n948 9430J.pdf.\n949 5. Sattari MT, Roshan E, Bakhshandeh M, Pourghasemi HR. Potential of kernel \n950 and tree-based machine-learning models for estimating missing data of rainfall. \n951 Eng Appl Comput Fluid Mech. 2020; 14(1): 1078-1094.\n952 6. Yoon J, Zame WR, van der Schaar M. Multi-directional recurrent neural \n953 networks: A novel method for estimating missing data. In: Time series workshop in \n954 International Conference on Machine Learning; 2017. p. 1-9.\n955 7. Fang C, Wang C. Time series data imputation: A survey on deep learning \n956 approaches. 2020; arXiv preprint arXiv: 2011.11347.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n48\n957 8. Rubio-Palis Y, Zimmerman RH. Ecoregional classification of malaria vectors in \n958 the Neotropics. J Med Entomol. 1997; 34: 499-510.\n959 9. Sinka ME, Rubio-Palis Y, Manguin S, Patil AP, Temperley WH, Gething PW, et \n960 al. The dominant Anopheles vectors of human malaria in the Americas: occurrence \n961 data, distribution maps and bionomic precis. Parasites Vectors. 2010; 3: 1-72. doi: \n962 10.1186/1756-3305-3-72.\n963 10. Sinka ME, Bangs MJ, Manguin S, Rubio-Palis Y, Chareonviriyaphap T, \n964 Coetzee M, et al. A global map of dominant malaria vectors. Parasites Vectors. \n965 2012; 5: 69. doi: 10.1186/1756-3305-5-69.\n966 11. Rubio-Palis Y, Bevilacqua M, Medina DA, Moreno JE, Cárdenas L, Sánchez V, \n967 et al. Malaria entomological risk factors in relation to land cover in the Lower Caura \n968 River Basin, Venezuela. Mem Inst Oswaldo Cruz. 2013; 108: 220-228. doi: \n969 10.1590/0074-0276108022013015.\n970 12. Arruda M, Carvalho MB, Nussenzweig RS, Maracic M, Ferreira AW, Cochrane \n971 AH. Potential Vectors of Malaria and Their Different Susceptibility to Plasmodium \n972 falciparum and Plasmodium vivax in Northern Brazil Identified by Immunoassay. \n973 Am J Trop Med Hyg. 1986; 35: 873–881. doi: 10.4269/ajtmh.1986.35.873.\n974 13. Rubio-Palis Y, Wirtz RA, Curtis CF. Malaria entomological inoculation rates in \n975 western Venezuela. Acta Trop. 1992; 52: 167-174. doi: 10.1016/0001-\n976 706x(92)90033-t.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n49\n977 14. Galardo AK, Arruda M, D’Almeida Couto AA, Wirtz R, Lounibos LP, et al. \n978 Malaria vector incrimination in three rural riverine villages in the Brazilian Amazon. \n979 Am J Trop Med Hyg. 2007; 76(3): 461–469. \n980 15. Macdonald G: The epidemiology and control of malaria London: Oxford\n981 University Press; 1957.\n982 16. Garret-Jones C, Shidrawi G. Malaria vectorial capacity of a population of \n983 Anopheles gambiae - An exercise in Epidemiological Entomology. Bull Wld Hlth \n984 Org. 1969; 40: 531-545.\n985 17. Magersa DM, Lou X-S. Effects of Climate Change on Malaria Risk to Human \n986 Health: A Review. Atmosphere. 2025; 16: 71. doi: 10.3390/atmos16010071.\n987 18. Stern P, Easterling WE. Making climate forecast matter. Washington DC: \n988 National Academy Press; 1999.\n989 19. Kavats SR. El Niño and health. Bull Wld Hlth Org. 2000; 78:1127-1134.\n990 20. Poveda G, Rojas W, Quiñones ML, Vélez ID, Mantilla RI, Ruíz D, et al. \n991 Coupling between annual and ENSO timescales in the malaria climate association \n992 in Colombia. Environ Hlth Perspect. 2001; 109: 489-493.\n993 21. Gagnon AS, Smoyer-Tomic KE, Bush ABG. The El Niño Southern Oscillation \n994 and malaria epidemics in South America. Int J Biometeorol. 2002; 46: 81–89. doi: \n995 10.1007/s00484-001-0119-6.\n996 22. Bouma MJ, Dye C. Cycles of malaria associated with El Niño in Venezuela. \n997 JAMA. 1997; 278(21): 1772-1774.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n50\n998 23. Sáez-Sáez V, Rubio-Palis Y, Pino JC. Variabilidad climática y malaria estudio \n999 regional: municipio Sifontes, estado Bolívar, Venezuela. Terra Nueva Etapa. 2009; \n1000 25: 93-112.\n1001 24. Grillet ME, El Souki M, Laguna F, León JR. The periodicity of Plasmodium \n1002 vivax and Plasmodium falciparum in Venezuela. Acta Trop. 2014; 129: 52-60.\n1003 25. Laguna F, Grillet ME, León JR, Ludeña C. Modelling malaria incidence by an \n1004 autoregressive distributed lag model with spatial component. Spat Spatiotemporal \n1005 Epidemiol. 2017; 22: 27-37. doi: 10.1016/j.sste.2017.05.001.\n1006 26. Lynn MK, Bossak BH. Analysis of a Secular Trend in Malaria Incidence: \n1007 Venezuela, 1959-2015. J Epidemiol Glob Health. 2017; 1:54-59.\n1008 27. Kifle MM, Teklemariam TT, Teweldeberhan AM, Tesfamariam EH, \n1009 Andegiorgish AK, Kidane EA. Malaria Risk Stratification and Modeling the Effect of \n1010 Rainfall on Malaria Incidence in Eritrea. J Environ Public Health. 2019; Article ID \n1011 7314129. doi: 10.1155/2019/7314129.\n1012 28. Kumar P, Vatsa R, Parth Sarthi P, Kumar M, Gangare V. Modeling an \n1013 association between malaria cases and climate variables for Keonjhar district of \n1014 Odisha, India: a Bayesian approach. J Parasit Des. 2020; 44(2): 319–331. \n1015 29. Laneri K, Cabella B, Prado PI, Mendes Coutinho R, Kraenkel RA. Climate \n1016 drivers of malaria at its southern fringe in the Americas. PLoS ONE. 2019; 14(7): \n1017 e0219249. doi: 10.1371/journal.pone.0219249.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n51\n1018 30. Bevilacqua M, Medina DA, Cárdenas L, Rubio-Palis Y, Moreno J, Martínez A. \n1019 Orientaciones para Fortalecer el Programa de Malaria en Zonas Remotas con \n1020 Población Indígena en el Caura, Venezuela. Bol Mal Salud Amb. 2009; 49(1): 53-\n1021 71. Available from: http://www.iaes.edu.ve/index.php/centro-de-\n1022 descargas/viewcategory/16-vol49-no1-ano-2009.\n1023 31. Rubio-Palis Y, Moreno JE, Bevilacqua M, Medina D, Martínez A, Cárdenas L, \n1024 et al. Caracterización ecológica de los anofelinos y otros culícidos en territorio \n1025 indígena del Bajo Caura, estado Bolívar, Venezuela. Bol Mal Salud Amb. 2010; \n1026 50(1): 95-107. Available from: http://www.iaes.edu.ve/index.php/centro-de-\n1027 descargas/viewcategory/12-vol50-no1-ano-2010.\n1028 32. Medina D, Bevilacqua M, Cárdenas L, Morales LG, Rubio-Palis Y, Martínez A, \n1029 et al. Mapa de riesgo de transmisión de malaria en la cuenca del río Caura, \n1030 Venezuela. Bol Mal Salud Amb. 2011; 51(2): 23-38. Available from: \n1031 http://www.iaes.edu.ve/index.php/centro-de-descargas/viewcategory/105-vol51-\n1032 no2-ano-2011.\n1033 33. Bevilacqua M, Rubio-Palis Y, Medina DA, Cárdenas L. Malaria control in \n1034 Amerindian communities of Venezuela. Ecohealth. 2015; 12: 253-266. doi: \n1035 10.1007/s10393-015-1026-3.\n1036 34. Moreno JE, Rubio-Palis Y, Bevilacqua M, Sánchez V, Guzmán H. \n1037 Caracterización de hábitats larvales de anofelinos en la cuenca baja del Río \n1038 Caura, estado Bolívar, Venezuela. Bol Mal Salud Amb. 2018; 58: 32-45. Available \n1039 from: http://iaes.edu.ve/iaespro/ojs/index.php/bmsa/article/view/57/29.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n52\n1040 35. Rubio-Palis Y, Moreno JE, Sánchez V, Estrada Y, Anaya W, Bevilacqua M, et \n1041 al. Can Mosquito Magnet® substitute for human-landing catches to sample \n1042 anopheline populations? Mem Inst Oswaldo Cruz. 2012;107: 546-549. doi: \n1043 10.1590/S0074-02762012000400017.\n1044 36. Pimenta PFP, Orfano AS, Bahia AC, Duarte APM, Ríos-Velásquez CM, Melo \n1045 FF, et al. An overview of malaria transmission from the perspective of Amazon \n1046 Anopheles vectors. Mem Inst Oswaldo Cruz. 2015; 110(1): 23-47. dor: \n1047 10.1590/0074-02760140266.\n1048 37. Magris M, Rubio-Palis Y, Menare C, Villegas L. Vector bionomics and malaria \n1049 transmission in the Upper Orinoco River, southern Venezuela. Mem Inst Oswaldo \n1050 Cruz. 2007; 102: 303-311. doi: 10.1590/s0074-02762007005000049.\n1051 38. Galardo AKR, Zimmerman RH, Lounibos LP, Young LJ, Galardo CD, Arruda \n1052 M, et al. Seasonal abundance of anopheline mosquitoes and their association with \n1053 rainfall and malaria along the Matapí River, Amapá, Brazil. Med Vet Entomol. \n1054 2009; 23(4): 335-349. doi: 10.1111/j.1365-2915.2009.00839.x.\n1055 39. Rubio-Palis Y, Curtis CF. Biting and resting behavior of anophelines in western \n1056 Venezuela and implications for control of malaria transmission. Med Vet Entomol. \n1057 1992; 6: 325-334. \n1058 40. Sterne AC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al. \n1059 Multiple imputation for missing data in epidemiological and clinical research: \n1060 potential and pitfalls. BMJ. 2009; 338. doi: 10.1136/bmj.b2393.\n1061 41. Zhang ML, Zhou ZH. ML-KNN: A lazy learning approach to multi-label learning, \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n53\n1062 Pattern Recog. 2007; 40(7): 2038-2048. doi: 10.1016/j.patcog.2006.12.019.\n1063 42. Bentéjac C, Csörgő A,  Martínez-Muñoz G. A comparative analysis of gradient \n1064 boosting algorithms. Artif Intell Rev. 2021; 54:1937–1967. doi: 0.1007/s10462-020-\n1065 09896-5.\n1066 43. Liboschik T, Fokianos K, Fried R. tscount: An R package for analysis of count \n1067 time series following generalized linear models. J Stat Softw. 2017; 82(5): 1–51. \n1068 doi: 10.18637/jss.v082.i05.\n1069 44. Cox FE. History of the discovery of malaria parasites and their vectors. Parasit \n1070 Vectors. 2010; 3:1-9.\n1071 45. Smith M, Mackli MG, Thomas CJ. Hydrological and geomorphological controls \n1072 of malaria transmission. Earth Sci Rev. 2013; 116: 109-127.\n1073 46. Ikeda T, Behera SK, Morioka Y, Minakawa H, Hashizume M, Tsuzuki A, et al. \n1074 Seasonally lagged effects climatic factor son malaria incidence in South Africa. Sci \n1075 Rep. 2017; 7: 2458. doi: 10.1038/s41598-017-02680-6.\n1076 47. Okeneye K, Gumel AB. Analysis of a temperature and rainfall dependant model \n1077 for malaria transmission dynamics. Math Biosci. 2017; 287: 72-92.\n1078 48. World Health Organization. Guidelines for malaria vector control. Geneva: \n1079 World Health Organization; 2019.\n1080 49. Drakou K, Nikolaou T, Vasquez M, Petric D, Michaelakis A, Kapranas A, et al. \n1081 The Effect of Weather Variables on Mosquito Activity: A Snapshot of the Main \n1082 Point of Entry of Cyprus. Int J Environ Res Public Health. 2020; 17: 1403. doi: \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n54\n1083 10.3390/ijerph17041403.\n1084 50. Grillet ME, Hernández-Villena JV, Llewellyn MS, et al. Venezuela’s \n1085 humanitarian crisis, resurgence of vector-borne diseases, and implications for \n1086 spillover in the region. Lancet Infect Dis. 2019; 19: e149-161.\n1087 51. Briceño-León R, Guevara M, Terán I. Politics as Disease in Venezuela. Vector \n1088 control before and after the Bolivarian revolution. In: Locating Zika. Social Change \n1089 and Governance in an Age of Mosquito Pandemics. Kevin Bardosh Editor. \n1090 Routledge, London. 238 p. doi: 10.4324/9780429456558.\n1091 52. Batista GEAPA, Monard MC. An analysis of four missing data treatment \n1092 methods for supervised learning. Appl Artif Intell. 2003; 17:5-6, 519-533. doi: \n1093 10.1080/713827181.\n1094 53. Moreno JE, Rubio-Palis Y, Páez E, Pérez E, Sánchez V. Abundance, biting behavior \n1095 and parous rate of anopheline mosquito species in relation to malaria incidence in gold-\n1096 mining areas of southern Venezuela. Med Vet Entom. 2007; 21: 339-349. doi: \n1097 10.1111/j.1365-2915.2007.00904.x.\n1098 54. Dos Santos F, Xu M, Bravo de Guenni L, Lourenço-de-Oliveira R, Rubio-Palis \n1099 Y.  Characterization of larval habitats of Anopheles (Nyssorhynchus) darlingi and \n1100 associated species in malaria areas in western Brazilian Amazon. Mem Inst \n1101 Oswaldo Cruz. 2024; 119: e240116.\n1102 55. NOAA. Physical Sciences Laboratory. Monthly Climate/Ocean Indices (Time-\n1103 Series) at PSL. Available from: https://psl.noaa.gov/data/timeseries/month.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n55\n1104 56. Poveda G, Quiñones ML, Vélez ID, Rojas W, Rúa-Uribe G, Ruíz CD et al. \n1105 Desarrollo de un sistema de alerta temprana para la malaria en Colombia. \n1106 Universidad Internacional de Andalucía, Sevilla, España. 2008; 181 p.\n1107 57. Kreppel K, Caminade C, Govella N, Morse AP, Ferguson HM, Baylis M. Impact \n1108 of ENSO 2016-2017 on regional climate and malaria vector dynamics in Tanzania. \n1109 Environ Res Lett. 2019; 14: 075009.\n1110 58. Mandal S, Sarkar RP, Sinha S. Mathematical models of malaria – a review. Mal \n1111 J. 2011; 10: 202. Available from: https://www.malariajournal.com/content/10/1/202.\n1112 59. Yiga V, Nampala H, Tumwiine J. Analysis of the modelo n the effect of \n1113 seasonal factor son malaria transmission dynamics. J Appl Math. 2020; doi: \n1114 10.1155/2020/8885558.\n1115 60. World Health Organization. Manual on Practical Entomology in Malaria. Part I \n1116 and II. Geneva: World Health Organization; 1975.\n1117 61. Martens WJM, Niessen LW, Rotmans J, Jetten TH, McMichael AJ. Potential \n1118 impact of global climate change on malaria risk. Environ Health Perspect. 1995; \n1119 103: 458-464.\n1120 62. Yadav CP, Hussain SSA, Mullick R, Rahi M, Sharma A. Climate zones are a \n1121 key component of the heterogeneous presentation of malaria and should be added \n1122 as a malariometric for the planning of malaria elimination. PLOS Glob Public \n1123 Health. 2023; 3(6): e0001878. doi:10.1371/journal.pgph.0001878.\n1124 63. Ma X, Liu Z. Application of a novel time-delayed polynomial grey model to \n1125 predict the natural gas consumption in China. J Comput Appl Math. 2017; 324: 17-\n1126 24. doi: 10.1016/j.cam.2017.04.020.\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n56\n1127 64. Rubio-Palis Y. Observaciones sobre el patrón de actividad hematofágica del \n1128 vector de malaria Anopheles darlingi en poblaciones del sur de Venezuela. Bol Dir \n1129 Malariol Sal Amb. 1995; 35(2): 66-70.\n1130 65. Rosa-Freitas MG, Broomfield G, Priestman A, Milligan PJM, Momen H, \n1131 Molyneux DH. Cuticular hydrocarbons, isoenzymes and behavior of three \n1132 populations of Anopheles darlingi from Brazil. J Amer Mosq Control Assoc. 1992; \n1133 8:357-366.\n1134 66. Moreno M, Saavedra MP, Bickersmith SA, Lainhart W, Tong C, Alava F, et al. \n1135 Implications for changes in Anopheles darlingi biting behaviour in three \n1136 communities in the peri-Iquitos region in Amazonian Perú. Malar J. 2015; 14:290. \n1137 doi: 10.1186/s12936-015-0804-2.\n1138 67. Orjuela LI, Ahumada ML, Avila I, Herrera S, Beier JC, Quiñones ML. Human \n1139 biting activity, spatial-temporal distribution and malaria vector role of Anopheles \n1140 calderoni in the southwest of Colombia. Malar J. 2015; 14:256. doi: \n1141 10.1186/s12936-015-0764-6.\n1142  \n1143 Supporting information\n1144 S1 Fig. El Niño 3.4 index for the period 2009 – 2016.\n1145 S2 Fig. Root mean square prediction error (RMSE)  after 1,000 iterations of \n1146 the random sampling approach [leave-one-out cross-validation (LOOCV)] for \n1147 each Anopheles species, showcasing the impact of lagging/no-lagging of \n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n57\n1148 climatic variables using an approach with linear regression imputation. All= \n1149 all Anopheles species collected; darl.1= Anopheles darlingi; goel.6= Anopheles \n1150 goeldii; osw.5= Anopheles oswaldoi; other= other Anopheles species collected.\n1151 S3 Fig. Root mean square prediction error (RMSE) for each species using a \n1152 leave-one-out cross-validation (LOOCV) approach with stochastic linear \n1153 regression imputation. All= all Anopheles species collected; darl.1= Anopheles \n1154 darlingi; goel.6= Anopheles goeldii; osw.5= Anopheles oswaldoi; other= other \n1155 Anopheles species collected.\n1156 S4 Fig. Estimated total number of All Anopheles species abundance (ETNM) \n1157 time series Gradient Boosting (GB) imputation method. \n1158\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint \n\n.CC-BY 4.0 International licenseavailable under a \n(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made \nThe copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint","source_license":"CC-BY-4.0","license_restricted":false}