1
1 Running Title: Machine learning to overcome mosquito missing data
2 Using machine learning to overcome mosquito
3 collections missing data for malaria modeling
4 Yasmin Rubio-Palis1*, Linjia Feng2#a, Karena S. Liang2#b,
5 Chen Song2#c, Songyuan Wang2, Tymon Duchnicki2, Xiao
6 Zhang2#d, Lelys Bravo de Guenni2*
7
8 1 Instituto de Investigaciones Biomédicas "Dr. Francisco J. Triana Alonso"
9 (BIOMED), Universidad de Carabobo, Maracay, Aragua, Venezuela
10 2 Department of Statistics, University of Illinois, Urbana-Champaign, Illinois, United
11 States of America
12 #a Current address: Department of Financial Engineering, Cornell University, New
13 York, New York, United States of America
14
15 #b Current address: Department of Computer Science, University of Illinois, Urbana-
16 Champaign, Illinois, United States of America
17
18 #c Current address: The Grainger College of Engineering, University of Illinois,
19 Urbana-Champaign, Illinois, United States of America
20
21 #d Current address: Department of Statistics, University of California, Santa Cruz,
22 California, United States of America
23
24 * Corresponding author
25 Email:
[email protected]
26 Email:
[email protected]
27
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
2
28
29 Abstract
30 Entomological surveillance plays a crucial role in areas where malaria remains
31 endemic, yet gathering data on mosquito populations is often expensive and
32 complicated, particularly in remote locations with challenging logistics and
33 inconsistent sampling schedules. Access to extensive time series data on mosquito
34 species at specific sites would greatly enhance insights into seasonal trends and
35 the biting habits of vectors of malaria parasites. Gaps in mosquito count records
36 pose a significant challenge for researchers and public health officials seeking to
37 establish early warning systems and effective vector control programs. In this
38 study, we apply quantitative machine learning techniques to address missing data
39 in estimates of mosquito abundance collected from 2009 to 2016 in Bolívar State,
40 Venezuela. We evaluated Linear Regression, Stochastic Linear Regression, K
41 Nearest-Neighbor, and Gradient Boosting methods for imputing missing counts of
42 Anopheles mosquitoes, employing a leave-one-out cross-validation strategy.
43 Additionally, we developed a predictive malaria transmission model incorporating
44 mosquito abundance and climate variables (El Niño 3.4 Index, rainfall, and mean
45 air temperature) as covariates. Our generalized time series model forecasts
46 malaria incidence of Plasmodium vivax and Plasmodium falciparum based on
47 climate dynamics and imputed mosquito data. Model performance was assessed
48 using root mean square error, mean absolute error, and mean absolute percentage
49 error. The final results demonstrated that machine learning imputation significantly
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
3
50 improved the accuracy and reliability of P. vivax malaria incidence predictions but
51 failed to predict P. falciparum incidence. The study demonstrates that method
52 choice significantly influences the reconstruction of seasonal abundance patterns
53 and the performance of malaria incidence models. Nevertheless, the proposed
54 models strengthen the foundation for targeted interventions and surveillance in
55 endemic regions. Despite limitations in data continuity and coverage, the findings
56 highlight the value of combining multiyear entomological data sets with robust
57 imputation and sensitivity analyses to improve predictive modeling in resource-
58 constrained, malaria-endemic settings.
59 Introduction
60 Malaria stands as the most significant parasitic disease transmitted by mosquitoes.
61 According to the World Health Organization [1], an estimate of 262 million malaria
62 cases occurred across 80 endemic countries in 2024. Within the Americas,
63 Venezuela reported an estimated incidence of 9.7-12.6 x 1,000 population at risk,
64 significantly higher than the incidence reported by Brazil (3.08-3.43 x 1,000
65 population at risk), accounting for 17 percent of all malaria cases in the region.
66 Notably, more than 70 percent of these cases originated from Bolívar State [1].
67 Mosquito control is a crucial intervention for preventing the transmission of malaria
68 parasites. Nevertheless, effective surveillance and monitoring of mosquito vector
69 populations are often constrained by the challenges of conducting regular
70 collections across multiple locations, particularly in vast and remote regions. These
71 logistical hurdles frequently result in incomplete data, leading to potentially
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
4
72 inaccurate assumptions. Furthermore, efforts to model malaria transmission based
73 on entomological data encounter significant limitations stemming from these data
74 gaps.
75 A key mathematical approach to addressing this limitation involves leveraging
76 machine learning techniques for imputing missing data. Numerous researchers
77 have adopted such methods to estimate missing values effectively. For instance,
78 Emmanuel et al. [2] provided a comprehensive review of the methodologies used
79 for missing data imputation. Among the techniques employed, the K-nearest
80 neighbor (KNN) method has demonstrated notable success. Their study also
81 highlights that the performance of imputation methods, such as random forest-
82 based algorithms and KNN, is highly dependent on the dataset being analyzed.
83 However, a potential limitation of the KNN method is its inability to consider the
84 correlation between potential predictors, which can lead to unsatisfactory results
85 [3]. To address such shortcomings, boosting algorithms have emerged as
86 promising alternatives. Research by Cauthen et al. [4] demonstrated that boosting
87 methods can outperform non-boosting approaches, especially in scenarios with
88 high rates of missing values. Their study, which focused on imputing the number of
89 dengue cases in India, highlighted the enhanced accuracy and robustness offered
90 by boosting algorithms in handling complex, incomplete datasets.
91 The Extreme Gradient Boosting (XGBoosting) algorithm, an improved boosting
92 method, has demonstrated superior performance in imputing medical data [3].
93 Additionally, both kernel-based and tree-based machine learning algorithms have
94 been successfully applied to meteorological datasets [5]. In recent years,
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
5
95 Recurrent Neural Network (RNN) models based on deep learning have become
96 increasingly popular for imputing missing values in time series data, owing to their
97 ability to extract and leverage temporal dependencies within datasets [6,7]. These
98 studies typically rely on prediction error measurements, most commonly, root mean
99 square error (RMSE) to evaluate algorithm performance. While machine learning
100 methods are well-established for missing data imputation in various domains, their
101 applicability to imputing mosquito abundance time series data remains largely
102 unexplored. Expanding research in this direction could offer significant advances in
103 handling incomplete entomological datasets, ultimately improving the accuracy of
104 malaria modeling and the effectiveness of vector control strategies.
105 In this study, we systematically compare four distinct approaches for imputing
106 missing data within incomplete time series of mosquito counts, focusing on the
107 most abundant Anopheles species collected in a remote Amerindian community in
108 southern Venezuela. Each imputation method leverages a range of potential
109 climate drivers as predictors, including indices that capture the influence of the El
110 Niño phenomenon. By applying these imputation strategies, we generate complete
111 mosquito abundance time series, subsequently evaluating their performance using
112 cross-validation error metrics. For both the most prevalent mosquito species
113 (Anopheles darlingi) and for the aggregated dataset encompassing all species, we
114 select the imputed time series that yield the lowest cross-validation error.
115 Anopheles darlingi is the most efficient vector of malaria parasites in the neotropics
116 [8,9,10], and the confirmed vector in the study area [11]; all the Anopheles species
117 collected are potential vectors [10,12,13,14], therefore were included in the
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
6
118 analysis. These optimally imputed mosquito abundance datasets are incorporated
119 as potential predictors in our time series models of malaria incidence, with the aim
120 of enhancing the accuracy of malaria transmission modeling in this challenging and
121 data-limited context.
122 Beyond the challenges of imputing mosquito abundance data, numerous studies
123 have explored the impact of climate drivers on malaria incidence across diverse
124 regions. The importance of climate as a driving force of malaria transmission has
125 been known for over a century [15]. The development and survival of Anopheles
126 mosquitoes and the Plasmodium parasites are highly dependent on temperature,
127 rainfall and humidity [16,17]. On the other hand, large scale climatic drivers as El
128 Niño/La Niña (ENSO) phenomena are key in modulating the inter-annual variability
129 of the precipitation worldwide [18]. The relation between ENSO and malaria
130 epidemics has been studied in South America [19,20,21], and particularly in
131 Venezuela [22,23,24,25,26]. The integration of climate variables as potential
132 covariates in malaria prediction models has garnered increasing attention. For
133 instance, Kifle et al. [27] demonstrated that rainfall data served as a reasonably
134 good predictor of malaria incidence in Eritrea, highlighting the role of precipitation
135 in shaping transmission dynamics. Similarly, Kumar et al. [28] introduced a
136 Bayesian regression model incorporating climate predictors to successfully model
137 malaria incidence in India. These findings underscore the potential of climate
138 variables -such as rainfall, temperature, and humidity- as essential components in
139 forecasting malaria risk and improving the precision of epidemiological models
140 tailored to specific local contexts.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
7
141 Importantly, mosquito abundance time series constitute a critical input for
142 elucidating the causal pathways that drive fluctuations in malaria incidence.
143 Previous works, such as Laneri et al. [29] have largely defined the causal
144 landscape of malaria transmission in terms of climate variables, focusing on factors
145 such as rainfall, temperature, and humidity. While this approach has yielded
146 valuable insights, it may overlook the direct role that vector population dynamics
147 play in modulating malaria risk.
148 The present study describes the missing-data imputation methods for Anopheles
149 abundance time series, and the generalized time series models for the
150 Plasmodium vivax and P. falciparum incidence in a malaria endemic region in
151 southern Venezuela. We present the application of the different imputation
152 approaches and compare their performance using a leave-one-out cross-validation
153 (LOOCV) approach, together with an assessment of the generalized time series
154 models for P. vivax and P. falciparum malaria.
155 After applying a variable-selection strategy based on predictive performance in a
156 training set, the generalized time series models can be used to forecast future
157 malaria cases in a testing set. These predictive models enhance our understanding
158 of the interactions between climate, vector abundance and malaria incidence in
159 remote areas.
160 Materials and Methods
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
8
161 Data Sources
162 Mosquito collection methods
163 Entomological surveillance in collaboration with local leaders from remote
164 Amerindian communities was implemented in 2009 [30]. This action was
165 demanded by the chief and members of the Ye’kwana community of Boca de
166 Nichare (lat 06∘34.44’N, long 64∘49.39’W, altitude 59 msnm), Sucre Municipality,
167 Bolívar State. In this area several studies have been conducted since 2005 with
168 local leaders, and the demographic, ecologic, epidemiologic, hydrologic and
169 geomorphologic characteristics have been described [11,30,31,32,33,34]. Local
170 leaders were trained and supervised by one of the authors (YRP) in mosquito
171 collections using the Mosquito Magnet Liberty Plus Trap (MMLPT) (American
172 Biophysics Corporation, North Kingstown, RI). The training topics included
173 mosquito identification, data collection, reporting, preservation, and shipping of
174 specimens collected to the central laboratory in Maracay (Aragua State,
175 Venezuela), where identification to species level and numbers collected were
176 confirmed. The MMLPT device is battery operated and utilizes Carbon Dioxide
177 (CO2), one of the products of the catalysis of propane gas, as an attractant
178 together with Octenol. This trap has been previously evaluated in this region of
179 southern Venezuela, and its efficiency to collect anophelines compared to human
180 landing catches was determined [35].
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
9
181 Mosquito Data
182 Mosquitoes were collected monthly between 2009 and 2016, for 4 to 10 nights per
183 month around the full moon from 18:00 to 06:00 hours. To estimate the biting
184 pattern of the most abundant species, the trap was checked and the net with
185 trapped mosquitoes was removed every four hours: 18:00 to 22:00 hrs; 22:00 to
186 02:00 hrs, 02:00 to 06:00 hrs. The total number of individuals was recorded, and
187 mosquito species were identified. The most predominant species were Anopheles
188 darlingi (darl.1), Anopheles oswaldoi (osw.5), Anopheles goeldii (goel.6), and other
189 species with a less important presence in the area included An. braziliensis, An.
190 triannulatus, An. benarrochi and An. mattogrossensis. A time series of the
191 estimated mean number of mosquitoes by combining all Anopheles species
192 collected was also considered in the analysis and will be referred to as All in the
193 text. The main assumption is that Anopheles abundance and species diversity is
194 representative of the municipality.
195 Mosquito Abundance Time Series
196 The monthly mosquito abundance time series for the most abundant species (An.
197 darlingi, An. goeldii, and An. oswaldoi), and the monthly time series combining all
198 Anopheles species collected (All) is presented in Fig 1.
199 Fig 1. Estimated total number of mosquitoes (ETNM) collected per month.
200 All = total number of mosquitoes collected of all Anopheles species; darl.1 =
201 Anopheles darlingi; goel.6 = Anopheles goeldii; osw.5 = Anopheles oswaldoi; Other
202 = total number of other Anopheles species collected.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
10
203
204 All Anopheles species identified during the surveys are considered potential
205 vectors [9,10,11,36] and, hence were included in the time series and other
206 analysis. These values were calculated by adding the total number of mosquitoes
207 collected per night during three consecutive four-hour periods from 18:00 to 06:00
208 hours. The average number of mosquitoes per night was calculated by adding the
209 total number of mosquitoes for all nights of observation and dividing by the total
210 number of nights in a particular month. This estimate was converted into a monthly
211 estimate by multiplying the mean number of mosquitoes per night by 𝑛𝑑𝑎𝑦𝑠, where
212 𝑛𝑑𝑎𝑦𝑠 is the total number of days in each month.
213 A distinctive feature of this data set is the important proportion of missing data. The
214 proportion of missing observations (or months without collections) is 60.4% of the
215 total number of observations. As shown in Fig 1, there are important data gaps,
216 especially during the years 2012-2013 due to travel limitations to the collection site,
217 including economic restrictions and the shortage of fuel.
218 Mosquito Abundance and Biting Activity
219 Mosquito abundance and biting activity can vary from year to year, depending on
220 the climatic drivers. It can vary between species and could also vary between
221 different collection periods (night hours) [11,37,38,39]. To determine the
222 significance of these factors on the mosquito biting behavior, a three-way ANOVA
223 model was fitted to the mean number of mosquitoes by year, species and
224 collection time, including also the interactions among all the three different factors.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
11
225 In this case, a normality Shapiro-Wilks test was not rejected. The variable
226 representing collection times (with three intervals: 18:00 to 22:00 hrs; 22:00 to
227 02:00 hrs, 02:00 to 06:00 hrs) and its interaction with the other two factors, resulted
228 as non-significant, indicating that there are not significant differences in the number
229 of mosquitoes among these distinct mosquito collection periods. For this reason,
230 this factor was not accounted for in the analysis herein.
231 Malaria Data Cases
232 Monthly malaria cases due to P. vivax (PV) and P. falciparum (PF) were obtained
233 from the Malaria Program report for the Sucre Municipality. Cases resulting from
234 mixed infections associated with both parasite species were also available and
235 added to the corresponding number of malaria cases for each parasite. Data on
236 malaria cases for the study location (Boca de Nichare) was not available.
237 Population data for the Sucre municipality was available from 2009 to 2017
238 (Instituto de Salud Pública, Bolívar State). The monthly P. vivax and P. falciparum
239 incidence were estimated by dividing the total number of cases of each parasite
240 species by the population per 1,000 population. The variables P. falciparum
241 incidence (PF), and P. vivax incidence (PV) were used as response variables in
242 the malaria model described in the methodology section. Time series of this data
243 set are shown in Fig 2.
244
245 Fig 2. Plasmodium vivax and Plasmodium falciparum incidence, Sucre
246 Municipality, Bolívar State, Venezuela, 2009-2017.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
12
247 Climate data
248 The climate of the study region has been classified as a Tropical Savanna Climate
249 (Aw) according to Köppen-Geiger climate classification system, with annual
250 average temperatures of approximately 23.46°C and average annual precipitation
251 of 2,764 mm/yr.
252 Nearby meteorological stations with long term time series data are unavailable for
253 the study site. Historical climate information is readily available from open data
254 sources, including large scale oceanic-atmospheric indices depicting the temporal
255 variability of the El Niño-Southern Oscillation (ENSO) index.
256 Rainfall and mean air temperature data were downloaded from the World Bank
257 data portal https://climateknowledgeportal.worldbank.org/. Fig 3 shows the monthly
258 historical time series for the study region.
259 Fig 3. Rainfall and Mean Air temperature time series during the period 2009-
260 2016. A. Total Rainfall in mm. B. Mean Air Temperature in degrees Celsius. Boca
261 de Nichare, Bolívar State, Venezuela.
262 The El Niño 3.4 Index (Nino3.4) was downloaded from the National Oceanic and
263 Atmospheric Administration (NOAA) website https://psl.noaa.gov/enso/data.html.
264 This index depicts the values of the Sea Surface Temperature (SST) Anomalies for
265 the Pacific Ocean region with extents (5N-5S, 120W-170W) (S1 Fig).
266 Fig 4 shows the estimated total number of mosquitoes (ETNM) per month time
267 series together with the anomalized rainfall and the anomalized temperature.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
13
268 Fig 4. Estimated total number of mosquitoes (ETNM) per month for All
269 Anopheles species (red line) compared with climate data. A. Anomalized mean
270 monthly rainfall in mm (blue line). B. Anomalized mean monthly air temperature in
271 degrees Celsius (green line). Boca de Nichare, Bolívar State, Venezuela (2009-
272 2016).
273 Anomalized rainfall and temperature time series were calculated by subtracting the
274 long-term monthly means from each monthly value. The anomalization procedure
275 filters out the series’ seasonal components while the monthly and inter-annual
276 variability remain as dominant components. The long-term monthly means were
277 calculated using data from the period 1981 to 2010.
278 To compare the estimated mean number of mosquito time series with the ENSO
279 data, Fig 5 overlaps the different ENSO occurrence modes with the mosquito data
280 before imputation. The highlighted peak in the mosquito times series occurs during
281 a weak La Niña phase. These maxima occur during a positive rainfall anomaly
282 period, also preceded by a positive air temperature anomaly.
283 Fig 5. Estimated Total Number of Mosquitoes (ETNM) per Month and El Niño
284 3.4 index for the period 2009 – 2016. Boca de Nichare, Bolívar State,
285 Venezuela.
286 A summary of the climatic variables during the study period 2009-2016 is
287 presented in Table 1.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
14
288 Table 1. Summary monthly statistics for climatic variables Rainfall, Mean Air
289 Temperature and El Niño 3.4 index during years 2009-2016.
290 __________________________________________________________________
Variable Mean Median
Standard
Deviation Max Min
Rainfall (mm) 176.68 168.95 130.77 469.1 1.6
Mean Air Temperature (⁰C) 26.44 26.4 0.69 28.6 25.2
El Niño 3.4 Index 0.1176 -0.085 0.9871 2.57 -1.7
291
292 Methodology
293 In this section, we describe the different methods used for imputation of
294 mosquitoes (An. darlingi and all Anopheles species) time series missing data and
295 present the generalized time series model used to model malaria incidence.
296 Missing Data Imputation Methods
297 Missing data handling methods are usually adapted to the nature of missing data
298 mechanisms. The missing data mechanisms are related to the dependence or not
299 of the missing data on the observed values. Emmanuel et al. [2] and Sterne et al.
300 [40] propose classifying missing data according to their type: missing completely at
301 random, missing at random and missing not at random. The type of missing data
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
15
302 we are dealing with in this study can be classified as missing completely at random
303 because missing values do not depend on the observed or unobserved data, but
304 rather on the ability to collect information in a timely manner if economic and/or
305 logistic resources (gasoline, propane gas, trap functioning) are readily available.
306 Deleting missing data is one method of handling missing values called the
307 Complete Case Analysis. This procedure is rarely applied because deleting
308 missing data points (Listwise deletion) can significantly reduce our sample size for
309 model training purposes, and therefore, accuracy would be a concern.
310 Using imputation methods as regression imputation allows for estimating missing
311 values by using other variables as covariates or predictors. In this research,
312 different approaches were used for missing data imputation. The following methods
313 will be discussed and compared:
314 Linear Regression (LR), Stochastic Linear Regression (SLR), K Nearest-Neighbor
315 (KNN) and Gradient Boosting (GB).
316 The imputation methods LR, SLR, KNN and GB were implemented in Python
317 (Google Colab tool) using packages fancyimpute, sklearn and xgboost.
318 A brief description of each method and the steps for their implementation are
319 described below.
320 Linear Regression and Stochastic Linear Regression Imputation
321 Regression imputation is a technique used to replace missing values 𝑌𝑚 in a data
322 set based on a set of predictors or attributes 𝑋1,𝑋2,…,𝑋𝑝. It can be classified into
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
16
323 two types: deterministic regression imputation and stochastic regression
324 imputation. Deterministic regression imputation predicts missing values using the
325 estimated regression model:
326 𝑦𝑚 = 𝛽0 + 𝛽1𝑥1 + … + 𝛽𝑝𝑥𝑝
327 where 𝛽𝑖 are the parameters of the estimated regression model using least
328 squares, and 𝑦𝑚 are the estimated mean values of the response variable 𝑌 given a
329 set of values for the predictors 𝑥1,𝑥2,…,𝑥𝑝. Estimation of the parameters 𝛽0,𝛽1,…,𝛽𝑝
330 is based on the observed values 𝑌𝑜 and the corresponding observations for the
331 predictors (complete cases). However, this method uses a point estimate that does
332 not consider any random variation (i.e., an error term) around the regression
333 model, leading to imputed values that may be too precise resulting in an
334 overestimation of the correlation between 𝑋𝑖 and 𝑌.
335 To address this limitation, stochastic regression imputation was used. This method
336 incorporates a random error term into the predicted value:
337 𝑦𝑚 = 𝛽0 + 𝛽1𝑥1 + … + 𝛽𝑝𝑥𝑝 + 𝜖
338 where 𝜖 is a simulated random error term from a normal distribution with zero mean
339 and variance estimated using least squares. By incorporating this error term,
340 stochastic regression imputation can reproduce the correlation between 𝑋𝑖 and 𝑌
341 more accurately.
342 Using regression imputation, we used the Leave-One-Out Cross-Validation
343 (LOOCV) method to calculate the average RMSE for the evaluation of imputation
344 performance. LOOCV is a special type of k-fold cross-validation in which the
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
17
345 number of folds k is equal to the number of observations N in the data set. Under
346 this approach, each observation will be treated as a validation set once, and the
347 rest (N-1) observations will be used as a training set. This method aims at reducing
348 bias and preventing over-fitting, and it is useful for data sets with a small number of
349 observations.
350 K-Nearest Neighbor method
351 This algorithm falls under the umbrella of supervised learning methods. It can be
352 used for both regression and classification problems. It is considered a Lazy
353 learner algorithm [41] and does not assume a probability distribution of the data.
354 Instead, KNN assumes that similar inputs have similar outputs, and it will impute
355 the missing data with the average value amongst its k most similar inputs. KNN
356 depends heavily on the parameter k, which is the number of nearest neighbors to
357 be used for imputation. The range of k values used in our imputation process
358 varied from 1 to 40.
359 The KNN algorithm uses a distance measure to detect the closest data points to a
360 target vector 𝑥0 in a neighborhood, for which a missing characteristic 𝑦0 should be
361 imputed. It is also important to note that the K neighbors do not have any missing
362 features. Examples of distance measures are the Minkowski distance, Manhattan
363 Distance, Cosine Distance, Jaccard Distance, Hamming Distance and Euclidean
364 distance [2]. Among these distance measures, Euclidean distance is commonly
365 used in practice. The equation is:
366
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
18
367 𝐷(𝑝,𝑞) = 𝛴𝑛
𝑑=1(𝑝𝑑 ― 𝑞𝑑)2
368
369 where 𝐷(𝑝,𝑞) is the Euclidean distance, 𝑝 and 𝑞 are two data points in an n-
370 dimensional space and 𝑛 is the dimension of the space.
371 The imputed value at 𝑥0 (𝑦0) is the average of all responses 𝑦𝑖 in the
372 neighborhood:
373 𝑦0 = 1
𝐾
𝐾
𝑖=1
𝑦𝑖
374
375 The LOOCV method (Leave-One-Out Cross-Validation) is used to select the
376 optimum value of 𝐾 by comparing the value of the root mean square error (RMSE)
377 for each number of neighbors 𝐾 and selecting 𝐾 that minimizes RMSE.
378 Gradient Boosting Method
379 The Gradient Boosting method (GBoosting) is also a supervised learning algorithm
380 where a training data set is used to predict a response 𝑦0 from an input vector of
381 features 𝑥0. It can be used both in a regression context and as a classifier. The
382 final prediction is an ensemble of decision trees constructed in an iterative way.
383 Each tree is considered a weak learner, and the algorithm uses the steepest
384 descent method to move into the direction of minimizing a loss function at each
385 step. The Gradient Boosting method employs regularization techniques to prevent
386 overfitting, and it can handle missing data automatically. The Gradient Boosting
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
19
387 model is flexible for data with a large missing data rate and small size, and it is also
388 useful for minimizing bias error.
389 A wide range of hyperparameters must be specified for the Gradient Boosting
390 model. The hyperparameters used in the imputation process are max_depth
391 (maximum depth of a tree), min_samples_leaf (minimum number of observations in
392 a leaf), and learning_rate (how quickly the model learns). The LOOCV method is
393 used to select the best combination of the three parameters with the lowest RMSE.
394 The gradient boost algorithm objective is to find an approximation of the function
395 F(x) that can minimize the expected value of the loss function L(y, F(x)) [42]:
396 In the first step, we want to minimize the loss function 𝐿(𝑦𝑖,𝛼), which represents the
397 difference between the actual values and the predicted values. A constant
398 prediction of the model 𝐹0(𝑥) is given by:
399 𝐹0(𝑥) = 𝑎𝑟𝑔 𝑚𝑖𝑛𝛼
𝑁
𝑖=1
𝐿(𝑦𝑖,𝛼)
400 Where 𝛼 is the value that minimizes the loss function ∑𝑁
𝑖=1 𝐿(𝑦𝑖,𝛼).
401 In subsequent steps (𝑚 = 1,…,𝑀) the following expression is minimized:
402 𝛼𝑗𝑚 =
𝑥𝑖∈𝑅𝑗𝑚
𝐿(𝑦𝑖 ,𝐹𝑚―1 (𝑥𝑖) + 𝛼 )
403 At each step, several regression trees are created on the residuals of the model.
404 These residuals are calculated as follows:
405 𝑟𝑖𝑚 = ―[ ∂𝐿(𝑦𝑖,𝐹(𝑥𝑖))
∂𝐹(𝑥𝑖) ]𝐹(𝑥)=𝐹𝑚―1(𝑥)
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
20
406 where 𝑖 = 1,…,𝑁; and m is the index of each tree.
407 The derivative is taken with respect to the previous prediction 𝐹𝑚―1 , and the result
408 is proportional to the residual value at the step 𝑚 ― 1.
409 In step 𝑚 the prediction 𝐹𝑚(𝑥) is updated as:
410 𝐹𝑚(𝑥) = 𝐹𝑚―1(𝑥) + 𝑣
𝐽𝑚
𝑗=1
𝛼𝑗𝑚1(𝑥 ∈ 𝑅𝑗𝑚)
411 𝑗 is the index of each terminal node (leaf); 1(.) is the indicator function and 𝐽𝑚 is the
412 total number of leaves in the tree. 𝑅𝑗𝑚 represents each terminal node and 𝑣 ∈ (0,1)
413 is the learning rate.
414 Generalized Time Series Model for Malaria Incidence
415 One method for modeling malaria cases with covariates is using a generalized
416 linear model for count time series. Let {𝑌𝑡:𝑡 ∈ 𝑁} be a count time series and {𝑋𝑡
417 :𝑡 ∈ 𝑁} be a time-varying 𝑟 dimensional covariate vector. We then model the
418 conditional mean 𝐸(𝑌𝑡|𝐹𝑡―1) of the time series by a process {𝜆𝑡:𝑡 ∈ 𝑁} such that 𝐸(
419 𝑌𝑡|𝐹𝑡―1) = 𝜆𝑡, where 𝐹𝑡 is the history of the joint process {𝑌𝑡,𝜆𝑡,𝑋𝑡+1:𝑡 ∈ 𝑁} up to
420 time 𝑡 and includes the information of covariates at time 𝑡 + 1. Furthermore, let 𝑔:
421 𝑅+→𝑅 be a link function and 𝑔:𝑁0→𝑅 be a transformation function. The count time
422 series model, as proposed in Liboschik et al. [43], is of the form:
423 𝑔(𝜆𝑡) = 𝛽0 +
𝑝
𝑘=1
𝛽𝑘 𝑔(𝑌𝑡―𝑖𝑘) +
𝑞
𝑙=1
𝛼𝑙𝑔(𝜆𝑡―𝑗𝑙) + 𝜂𝑇𝑋𝑡
424 Here 𝜂 is the parameter vector of the covariate effects, and 𝛽0 is the intercept. We
425 can regress on past observations of the time series by defining a set 𝑃 = {𝑖1,𝑖2,…,𝑖𝑝}
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
21
426 of integers where 0 < 𝑖1 < 𝑖2… < 𝑖𝑝 < ∞ and then regressing on lagged
427 observations 𝑌𝑡―𝑖1,𝑌𝑡―𝑖2,…,𝑌𝑡―𝑖𝑝 with effects 𝛽. We are also able to regress on
428 lagged conditional means by defining a similar set 𝑄 = {𝑗1,𝑗2,…,𝑗𝑞} and regressing
429 on conditional means 𝜆𝑡―𝑗1,𝜆𝑡―𝑗2,…,𝜆𝑡―𝑗𝑞 with effects 𝛼.
430 The model is fitted using quasi-conditional maximum likelihood estimation. If the
431 link function is the identity function, then 𝑔(𝑥) = 𝑔(𝑥) = 𝑥. Otherwise, if the link
432 function is logarithmic, then 𝑔(𝑥) = 𝑙𝑜𝑔(𝑥) and 𝑔(𝑥) = 𝑙𝑜𝑔(𝑥 + 1). Furthermore, it
433 must be assumed that 𝑌𝑡|𝐹𝑡―1 ∼ 𝑃𝑜𝑖𝑠𝑠𝑜𝑛(𝜆𝑡) or 𝑌𝑡|𝐹𝑡―1 ∼ 𝑁𝑒𝑔𝑎𝑡𝑖𝑣𝑒 𝐵𝑖𝑛𝑜𝑚𝑖𝑎𝑙(𝜆𝑡,𝜙)
434 where 𝜙 ∈ (0,∞) is an additional dispersion parameter. For details on parameter
435 estimation, refer to the documentation of the R package tscount [43].
436 Results
437 Climate data and mosquito time series dependence
438 Sample cross-correlation functions were calculated to understand the association
439 between the estimated total number of mosquitoes per month and the climatic
440 variables. As mentioned, the rainfall and temperature data time series were
441 anomalized to partially remove the strong seasonal pattern from these climate
442 variables.
443 Fig 6 shows the sample cross-correlation function between the estimated total
444 number of mosquitoes (ETNM) per month and the monthly values of the El Niño
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
22
445 index (Niño3.4), the anomalized mean temperature, and mean monthly rainfall for
446 all Anopheles species.
447 Fig 6. Cross-correlation between the estimated total number of mosquitoes
448 (ETNM) for all Anopheles species time series and the Niño3.4 index (A), the
449 monthly mean temperature (B), and monthly rainfall (C).
450 By inspecting the negative lag portion of the figures, we can detect the leading lag
451 for which the mosquito data is significantly correlated with the climate variables.
452 The maximum correlation in absolute value within a year occurs for lags of 9, 6,
453 and 2 months respectively. There is a significant positive association between
454 mosquito abundance and El Niño index, temperature, and rainfall after 9, 6, and 2
455 months, respectively.
456 A similar calculation was repeated for the most abundant species, An. darlingi
457 (darl.1) and the same lagged significant effect of climate drivers on mosquito
458 abundance were observed.
459 Imputation methods comparison
460 Linear regression and stochastic linear regression
461 The linear regression model was fitted by least-squares, using all non-missing data
462 values (referred to as complete cases) as the training data set. This fitted model
463 was subsequently used to impute missing values. The model was based on
464 predictors including temperature and rainfall anomalies and the El Niño3.4 index.
465 These predictors were initially considered without any lagged effects (i.e., no time
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
23
466 delays). After analyzing the cross-correlations (Fig 6), lagged versions of the
467 climate predictors were incorporated into the model, using the selected lags
468 exhibiting the highest cross-correlation with the estimated total number of
469 mosquitoes (ETNM) response variable. We then made a comparative analysis of
470 model performance under these two scenarios: using lagged predictors and using
471 non-lagged predictors.
472 Our primary focus was on the imputation of data associated with all mosquito
473 species, obtained by aggregating mosquito counts from all species. In addition, we
474 examine data imputation for each species separately. For assessing the prediction
475 error after imputation of missing values, the following approach was employed:
476 1. Due to the limited availability of non-missing data, a simple random
477 sampling of the observed data with replacement was implemented. The
478 number of samples from this sampling process was equivalent to the
479 number of missing data values. Multiple training data sets were initially
480 generated by repeatedly applying the random sampling method for 𝐾
481 iterations.
482 2. A linear regression model was fitted to each training sample using least-
483 squares. To assess model performance, a leave-one-out cross-validation
484 (LOOCV) error was calculated for each of these training data sets, and the
485 average root mean square error (RMSE) was obtained.
486 3. In our analysis, we observed that using lagged climatic variables as
487 predictors resulted in a better model performance (RMSE= 141.40)
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
24
488 compared to using climatic variables without lagging (RMSE=142.69) (S2
489 Fig).
490 4. The same procedure was followed for the stochastic model: climatic
491 variables as predictors were considered before lagging (no lag) and after
492 lagging, and model performance was compared in these two cases. The
493 stochastic regression approach used the same random sampling function as
494 the linear regression model to impute missing values. It generated random
495 predictions based on the regression model, considering the prediction
496 standard error to fill in the missing values. Random predictions were
497 generated from a normal distribution centered at the estimated mean
498 predicted value, with a standard error equal to the standard deviation of the
499 residuals for each estimate. If the random prediction was greater than zero,
500 it was used as the imputed value for that specific missing value. To evaluate
501 the performance of the stochastic imputation method, we used the leave-
502 one-out cross-validation (LOOCV) method, resulting in a list of RMSE
503 values for each training sample that were averaged for all iterations. Our
504 analysis indicated that utilizing lagged climatic variables mean (RMSE=
505 0.13) led to a slightly better performance compared to using climatic
506 variables without lagging mean (RMSE=0.10) (S3 Fig).
507 K-Nearest Neighbor (KNN)
508 The model was trained using the leave-one-out cross-validation method (LOOCV)
509 to adjust the parameter 𝐾 (number of nearest-neighbors). Climatic variables as
510 predictors were considered before lagging (no lag) and after lagging, and the
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
25
511 model performance was compared in these two cases. The following steps were
512 followed:
513 The LOOCV method was used to select the best value of the 𝐾 parameter. A KNN
514 model was fitted to all five species for values of 𝐾 in the range of 1 to 40. The value
515 of 𝐾 with the lowest RMSE for the LOOCV method was 𝐾 = 16 in An. darlingi
516 (darl.1). Fig 7 shows a comparison of the RMSE for each value of 𝐾 for An. darlingi
517 (darl.1) and All mosquito species.
518 Fig 7. Comparison of the leave-one-out cross-validation (LOOCV) root mean
519 square error (RMSE) for each 𝐾 value for Anopheles darlingi (darl.1) and for
520 All Anopheles mosquito species.
521 The KNN imputation model was constructed for An. darlingi (darl.1) and for All
522 species with the number of neighbors 𝑛𝑛𝑒𝑖𝑔ℎ𝑏𝑜𝑟𝑠=16 and a data 𝑡𝑟𝑎𝑖𝑛:𝑡𝑒𝑠𝑡 split
523 ratio of 8:2. The model was run 1,000 times, and the root mean square error
524 (RMSE) was calculated for each iteration. A better performance of the
525 methodology was found when using lagged climatic variables (RMSE= 126.83) in
526 comparison to using climatic variables with no lagging (RMSE= 164.47) for darl.1.
527 Fig 8 shows a comparison of the RMSE for each iteration using climatic variables
528 before and after lagging. These results confirm better predictability of the fitted
529 model when using lagged climate predictors.
530 Fig 8. Comparison of root mean square error (RMSE) for K-nearest neighbor
531 (KNN) method in each run, using climatic variables before and after lagging.
532 A: Anopheles darlingi (darl.1); B: All Anopheles species collected.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
26
533 Gradient Boosting (GB)
534 The model was trained using the leave-one-out cross-validation (LOOCV) method
535 by tuning three parameters in the Gradient Boosting algorithm, which include max
536 depth, min samples leaf, and learning rate. Climatic variables as predictors were
537 considered before lagging (no lags) and after lagging in the same way as with
538 previous imputation methods, and model performance was compared in these two
539 cases. The following steps were followed:
540 1. The LOOCV method was used to select the best value of the parameters
541 mentioned above. The GB model was fitted to species An. darlingi (darl.1)
542 and All species for different values of the hyperparameters. We used a grid
543 search to tune the hyperparameters, aiming to find the optimal combination
544 of three hyperparameters. The hyperparameter values with the lowest
545 RMSE for LOOCV were min samples leaf = 1, learning rate = 0.1, max
546 depth = 10.
547 2. The Gradient Boosting imputation model with the default hyperparameters
548 of GB was constructed for An. darlingi (darl.1) to compare the performance
549 between lagged and no-lagged climate variables. The dataset was split into
550 training and testing sets at an 8:2 ratio. The model was run 100 times and
551 the root mean square error (RMSE) was calculated for each iteration. The
552 methodology performed better when using lagged climatic variables
553 (RMSE= 163.75) compared to non-lagged climatic variables (RMSE=
554 217.44) for darl.1. Fig 9 shows a comparison of the RMSE for each iteration
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
27
555 using climatic variables before and after lagging. These results again
556 highlight the advantage of using lagged predictors over non-lagged ones.
557 Fig 9. Comparison of root mean square error (RMSE) for the Gradient
558 Boosting (GB) method in each run. Using climatic variables before and after
559 lagging. A: Anopheles darlingi (darl.1); B: All Anopheles species collected.
560 Fig 10 show the contrast between the original data (in blue) and the imputed data
561 (in yellow) for the four different imputation methods used for the species An.
562 darlingi (darl.1).
563 Fig 10. Estimated total number of Anopheles darlingi abundance (ETNM) time
564 series imputation methods. A. Linear Regression (LR). B. Stochastic Linear
565 Regression (SLR). C. K-Nearest Neighbor (KNN). D. Gradient Boosting (GB).
566 Important differences can be distinguished among the different methods, especially
567 between linear regression imputation and the remaining methods. These observed
568 differences are not surprising given the variety of imputation approaches used. A
569 comparison of the prediction errors based on the non-missing or complete case
570 data values can be assessed by estimating the RMSE using a LOOCV method for
571 each imputation approach. These results are presented in Table 2. Table entries
572 are relative numbers, calculated after dividing by the mean number of mosquitoes
573 for each species.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
28
574 Table 2. Leave-One-Out Cross Validation (LOOCV) method errorsa for the
575 different imputation methods and mosquito species.
576 _______________________________________________________________
Anopheles
Species Imputation Method
KNN G. Boosting Linear R. Stochastic R.
Allb 0.7638104005 0.0000000297 0.968946389 0.000917769
An. darlingi 0.7727904964 0.0000000539 0.972812828 0.001148532
An. goeldii 0.8078696262 0.00000012 1.308744024 0.002081277
An. oswaldoi 1.345153286 0.000000627 1.503379349 0.001657576
Other 1.073280571 0.00004271 1.120539972 0.001491594
577 aThe root mean square error (RMSE) values were calculated using the complete
578 case observations (non-missing cases).
579 bAll Anopheles species collected.
580 c Other than An. darlingi, An. goeldii and An. oswaldoi species collected.
581 The GB methodology shows the best predictive performance with the lowest
582 testing errors in all cases. The same analysis was done for All Anopheles species,
583 with similar results (S4 Fig).
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
29
584 Malaria Incidence Time Series Model
585 Malaria Incidence Model Covariates
586 For the generalized linear time series count model described above, we consider
587 several potential covariates for inclusion in the model: monthly rainfall, mean
588 monthly temperature, ENSO index (El Niño 3.4), and an imputed mosquito count
589 time series. Initially, we selected the imputed mosquito time series with the lowest
590 LOOCV error which corresponds to the Gradient Boosting approach (Table 2).
591 Additionally, we considered lagged versions of each predictor. To select the
592 optimal lag, we calculated the rank cross-correlation of the malaria incidence time
593 series and each climatic anomalized predictor after removing the long-term
594 monthly means. The selected lag was the one that resulted in the highest cross-
595 correlation between the anomalized climatic time series and malaria incidence. For
596 rainfall, a two-month lagged time series was considered based on the time lag that
597 takes into account the time for creating habitats for the development of the aquatic
598 stages of Anopheles mosquitoes (from eggs to adults), the buildup of mosquito
599 populations, blood feeding interval and the development of the parasite inside the
600 mosquito (sporogony) [37,44,45,46,47].
601 Figs 11, 12, 13, and 14 show the sample rank cross correlation functions of the P.
602 vivax and P. falciparum incidence with rainfall, mean air temperature, ENSO index,
603 and imputed All Anopheles species time series. The lag values for the negative
604 portion of the graphs with the highest rank cross-correlations in absolute value are
605 shown in Table 3. Note that cross-correlations between rainfall and P. vivax
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
30
606 incidence are nonsignificant at any lag. However, the negative lag corresponding to
607 the highest cross-correlation (in absolute value) was selected as a model
608 covariate. This lag varied between 4 and 5 months for rainfall, 8 months for
609 temperature, 16 and 17 months for the ENSO index, and 4 to 6 months for
610 mosquito counts (Table 3).
611 Fig 11. Anomalized rainfall and malaria incidence Rank Cross-Correlation.
612 A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum incidence (PF).
613 ACF= Sample Cross-correlation Function.
614 Fig 12. Anomalized temperature and malaria incidence Rank Cross-
615 Correlation. A. Plasmodium vivax incidence. B. Plasmodium falciparum
616 incidence. ACF= Sample Cross-correlation Function.
617 Fig 13. Anomalized El Niño 3.4 (ENSO) and malaria incidence Rank Cross-
618 Correlation. A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum
619 incidence (PF). ACF= Sample Cross-correlation Function.
620 Fig 14. All Anopheles species counts and malaria incidence Rank Cross-
621 Correlation. A. Plasmodium vivax incidence (PV). B. Plasmodium falciparum
622 incidence (PF). ACF= Sample Cross-correlation Function.
623 Table 3. Selected lags for climate and mosquito count covariates to be
624 included in the Plasmodium vivax and Plasmodium falciparum malaria
625 incidence models for All Anopheles species and for Anopheles darlingi.
626 ________________________________________________________________
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
31
Malaria
Model
Mosquito
Species Rainfall Temperature ENSO
Mosquito
count
P. vivax All species 2, 4 8 17 4
P. falciparum 2, 5 8 16 5
P. vivax An. darlingi 2, 4 8 17 4
P. falciparum 2, 5 8 16 6
627
628
629 Finally, we considered regressing the malaria incidence on past observations and
630 past conditional means. The Sample Autocorrelation Function (ACF) and Partial
631 Autocorrelation Function (PACF) plots of the malaria time series shown in Fig 15
632 was used to determine the lag of past observations that had the greatest impact on
633 the malaria incidence cases at time 𝑡. Due to the highly significant partial
634 autocorrelation at lag 1, we included past malaria observations at time 𝑡 ― 1 only.
635 Furthermore, after including the past observation, we also consider adding the
636 conditional mean at 𝑡 ― 12 to account for possible seasonal effects .
637 Fig 15. Sample Autocorrelation Function (ACF) and Partial Autocorrelation
638 Function (PACF) for Plasmodium vivax incidence time series.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
32
639 For all models, the logarithmic link function 𝑔(𝜆𝑡) = 𝑙𝑜𝑔(𝜆𝑡) was used. The identity
640 function cannot fit models with negative covariates (such as the ENSO index) and
641 cannot estimate effect parameters less than 0. Since some effects may be
642 negative, the identity function is too restrictive in this case.
643 Malaria Model Fitting and Prediction
644 The malaria model was fitted to the incidence of P. vivax and P. falciparum. To
645 select the best Time Series Generalized Linear Model (TSGLM), we calculated
646 different accuracy metrics for each unique combination of lagged and non-lagged
647 covariates (rainfall, temperature, ENSO, and mosquito counts), an autoregressive
648 term of malaria incidence at 𝑡 ― 1, and the conditional mean malaria incidence at
649 𝑡 ― 12. This term considers any seasonal dependence that might be present in the
650 data. The conditional mean was only considered if the autoregressive term was
651 already in the model. Mosquito counts for All Anopheles species imputed with the
652 GB method were proposed as potential covariates. Additional results were also
653 produced using An. darlingi mosquito counts as a predictor. We also included an
654 additional trend shift component to account for the observed trend increase of
655 malaria incidence from year 2015 onward.
656 We focus on selecting the model that gives the best prediction accuracy when
657 compared to observations. For this purpose, we used an 80-20 training-testing
658 sample split. The training set was used to fit models according to the above
659 algorithm, and each model was used for prediction in the testing set. Using these
660 predictions, we calculated a variety of error metrics: the root mean square error
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
33
661 (RMSE), the mean absolute error (MAE), and the mean absolute percentage error
662 (MAPE). Lower error values are preferred. The best models, as indicated by the
663 lowest values of RMSE, MAE, and MAPE, are presented in Table 4.
664 Table 4. Selected covariates for the Plasmodium vivax and Plasmodium
665 falciparum malaria incidence models that minimize the prediction error in a
666 testing set using different accuracy measures.
667 Plasmodium vivax Plasmodium falciparum
Covariates RMSE MAE MAPE Covariates RMSE MAE MAPE
Rainfall Xa X Xa Rainfall Xa Xa Xa
Rainfall (lag 2) X X Rainfall (lag 2) X Xa Xa
Rainfall (lag 4) X X X Rainfall (lag 5)
Temperature X X X Temperature X X
Temp (lag 8) Temp (lag 8) X
ENSO X X X ENSO Xa Xa Xa
ENSO (lag 17) ENSO (lag 17) Xa Xa Xa
Mosquito Countb X X Mosquito Countb
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
34
Mosq Count (lag 4) X X X Mosq Count (lag 4)
Past Obsc (lag 1) Xa Xa Xa Past Obsc (lag 1) Xa Xa Xa
Past Mean (lag 12) Past Mean (lag 12) X X X
Time Shiftd Xa Xa Xa Time Shiftd X X X
668
669 RMSE= root mean square error; MAE= mean absolute error; MAPE= mean
670 absolute percentage error.
671 a Coefficient is significant at 5%
672 b Mosquito counts include All Anopheles species imputed using Gradient Boosting
673 c Past Observation
674 d Time Shift Post-2015
675 The general malaria incidence model for P. vivax or P. falciparum can be written
676 as:
677 𝑙𝑜𝑔(𝜆𝑡) = 𝛽0 + 𝛽1𝑙𝑜𝑔(𝑌𝑡―1) + 𝛼1𝑙𝑜𝑔(𝜆𝑡―12) + 𝜂𝑅⊺𝑅𝑡 + 𝜂𝑇⊺𝑇𝑡 + 𝜂𝐸𝑛⊺𝐸𝑛𝑡 + 𝜂𝑀𝐶⊺𝑀𝐶𝑡
+ 𝜂𝐼𝐼𝑡―2015
678 where 𝜆𝑡 is the expected malaria incidence rate and 𝑌𝑡 is the observed incidence
679 rate at time 𝑡 [(cases/population) × 1,000]. 𝑅𝑡, 𝑇𝑡 and 𝐸𝑛𝑡 are the climatic variables
680 (rainfall, temperature, ENSO respectively) vector with non-lagged and lagged
681 values at time 𝑡. 𝑀𝐶𝑡 is the mosquito counts vector with non-lagged and lagged
682 values at time 𝑡. 𝐼𝑡―2015 is an indicator variable which is zero if the year is lower
683 than 2015 (constant trend), and it is proportional to time if time is greater than
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
35
684 2015. Model parameters 𝛽0 (intercept), 𝛽1 (𝑡 ― 1 observation), 𝛼1 (𝑡 ― 12 mean),
685 and parameter vectors 𝜂𝑅,𝜂𝑇,𝜂𝐸𝑛,𝜂𝐼 are estimated by maximum likelihood using the
686 R library tscount [43].
687 Selected covariates in Table 4 (marked with “X”) minimize the different prediction
688 error metrics and may vary depending on the prediction criteria used. For the P.
689 vivax malaria model, rainfall, lagged rainfall (lags 2 (MAE and MAPE) and 4),
690 temperature, ENSO, mosquito counts (RMSE, MAE), lagged mosquito counts (lag
691 4) and the previous time-step expected malaria incidence are selected as the most
692 important covariates in the sense of minimizing the predictive errors for the testing
693 set, with a significant trend shift after the year 2015. Only rainfall, past month
694 observations, and the trend shift component are significant at the 95% confidence
695 level.
696 For the P. falciparum malaria incidence model, rainfall, lagged rainfall (lag 2),
697 temperature (MAE and MAPE), lagged temperature (lag 8, RMSE), ENSO, lagged
698 ENSO (lag 16), the previous time step malaria incidence observation, past year
699 mean incidence and a linear trend shift are selected as the most important
700 covariates that minimize the model predictive errors over the testing data set. Only
701 rainfall and lagged rainfall, ENSO and lagged ENSO, and the previous month
702 observation are statistically significant according to the 95% confidence intervals
703 calculated using the normal approximation for the model parameter estimates.
704 Predictors selected by minimizing the predictive errors are not necessarily going to
705 be statistically significant according to the Wald test statistic (see Table 4).
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
36
706 However, predictability performance was preferred over statistical significance in
707 this case.
708 Fig 16 shows the model fitting and forecast results for the P. vivax malaria
709 incidence model for All Anopheles species and An. darlingi counts as predictors.
710 Model forecasts also include the 80% and 95% confidence intervals. Two model
711 versions were fitted: the first one used imputed An. darlingi as a predictor; and the
712 second one used All Anopheles species mosquito counts as a predictor.
713 Fig 16. Model fit and predictions for Plasmodium vivax malaria incidence
714 time series. A: All Anopheles species used as a covariate for mosquito counts. B:
715 Anopheles darlingi used as a covariate for mosquito counts.
716 Fig 16 shows a better forecasting performance when using All mosquito species
717 as a predictor (Fig 16A) in comparison when using An. darlingi (Fig 16B). In this
718 case most of the observations in the testing set are included within the 80%
719 confidence intervals with a clear upward trend from year 2015.
720 Fig 17 shows the model fitting and forecast results for the P. falciparum malaria
721 incidence model. Based on the lowest prediction errors, the mosquito count
722 variable was not included as a predictor (Table 4).
723 Fig 17. Model fit and predictions for Plasmodium falciparum malaria
724 incidence time series.
725 According to Fig 17, data observations in the testing set are included within the
726 95% and 80% confidence intervals of the predicted values.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
37
727 Malaria Model sensitivity to mosquito data imputation methods
728 An important question to be addressed is the sensitivity of the mosquito counts
729 data imputation methods in the malaria model configuration and prediction
730 accuracy. Although the Gradient Boosting approach resulted in the best imputation
731 method performance according to the results presented in Table 2, mosquito
732 counts imputed with all methods were tested as potential predictors, and variable
733 selection was recorded in each case. We explored how predictability changes
734 when using different mosquito imputed time series in the model selection process.
735 Model configuration (or best predictors set) can also change when different
736 imputed mosquito count time series are used in the model selection process.
737 Gradient Boosting (GB) imputed mosquito counts and lagged mosquito counts for
738 All Anopheles species were not included as potential model covariates in the P.
739 falciparum best predictive model; however, mosquito counts were selected as
740 covariates in the P. vivax malaria model by all prediction accuracy criteria.
741 In Table 5, we compare the different accuracy measures in the 20% testing data
742 set for the P. vivax and P. falciparum malaria models and the four different time
743 series of imputed mosquito counts for All Anopheles species.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
38
744 Table 5. Plasmodium vivax and Plasmodium falciparum malaria models.
745 Comparison of prediction errors for different imputation methods for
746 mosquito counts time series for All Anopheles species.
747 Plasmodium vivax Plasmodium falciparum
Imp Method RMSE MAE MAPE RMSE MAE MAPE
LR 17.537 14.207 63.244 0.855 1.168 21.941
SRL 16.813 14.543 58.895 1.023 0.711 23.94
KNN 5.295 4.268 26.392 0.879 0.697 22.707
GB 5.368 4.137 25.838 0.879 0.697 22.707
748
749 LR= Linear Regression; SLR= Stochastic Linear Regression; KNN= K-Nearest
750 Neighbor; GB= Gradient Boosting.
751 RMSE= root mean square error; MAE= mean absolute error; MAPE= mean
752 absolute percentage error.
753 The results show that for the P. vivax malaria incidence model, the lowest
754 prediction errors are obtained when imputed K-Nearest Neighbor (KNN) and
755 Gradient Boosting (GB) All Anopheles species counts are used as covariates.
756 Larger prediction errors are obtained when using imputed LR and SLR mosquito
757 time series as covariates in the model, compared with using imputed GB and KNN
758 mosquito time series.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
39
759 The best predictive P. falciparum malaria incidence models exclude the covariate
760 mosquito counts when K-Nearest Neighbor (KNN) and Gradient Boosting (GB)
761 imputed time series are used as predictors. In this case, prediction errors for the
762 malaria incidence model are the same for both time series. However, this result
763 changed when Linear Regression (LR) or Stochastic Linear Regression (SLR)
764 were used as imputation methods. All Anopheles species-imputed time series were
765 included as potential covariates in the malaria incidence model in this case.
766 Prediction errors across each predictive measure (RMSE, MAE, MAPE) do not
767 show high variability among different imputation methods.
768 Results from Table 5 indicate a greater sensitivity to the imputation methods for the
769 P. vivax malaria incidence model than for the P. falciparum model.
770 Discussion
771 The elimination of malaria requires a strong surveillance system. A key component
772 is entomological surveillance which relies on monitoring anopheline populations
773 mainly in terms of species composition, abundance, biting behavior, and
774 insecticide resistance [48]. The collection of entomological data requires a huge
775 investment in regular mosquito collections in various sentinel sites throughout
776 endemic areas. Systematic data collection approaches as the one presented by
777 Drakou et al [49] provides a good example of the implementation of mosquito
778 network surveillance with adequate spatial coverage, while in our study
779 implementing a comprehensive spatial surveillance effort was logistically
780 unfeasible. In the study by Drakou et al [49], surveillance was limited to a single
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
40
781 year. However, a multiyear analysis, such as the one we are presenting, is always
782 valuable for understanding the impact of environmental factors on mosquito
783 bionomics and to understand year-to-year variability.
784 In remote locations with difficult access, a way to contribute to providing relevant
785 entomological data to the malaria program is through well-trained local leaders.
786 This approach was attempted in an Amerindian village of the Caura River basin in
787 southern Venezuela. Nevertheless, the political, economic, and social crisis [50,51]
788 jeopardized the success of such an initiative, resulting in irregular interruptions of
789 mosquito collections with the end result of large missing data, making it difficult to
790 interpret the data and provide reliable information for decision-making for vector
791 control.
792 To overcome this limitation, the application of data imputation methods arise as a
793 potential solution. Machine learning approaches, specifically Boosting methods
794 have been proposed by other authors when there is a significant percentage of
795 missingness [4]. However, imputing missing data under the assumption that it is
796 missing at random may not be entirely accurate, as intrinsic seasonal and
797 operational factors may have influenced the sampling effort.
798 Given the large percentage of missing data in the mosquito abundance time series,
799 the application of different imputation methods provides a good alternative when
800 mosquito surveillance is highly fragmented, and continuous time series are not
801 available. However, different imputation methods might capture different data
802 features, and testing the feasibility of the imputed values still remains a challenging
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
41
803 task. Four imputation methods were compared: Linear Regression, Stochastic
804 Linear Regression, K‑Nearest Neighbor, and Gradient Boosting.
805 Considering the limited data availability and rather short time series, the LOOCV
806 approach was the best option to evaluate predictability among the different
807 methods. Gradient Boosting and the Stochastic Linear Regression methods
808 demonstrated better efficiency in terms of accuracy, reporting the lowest LOOCV
809 errors (Table 2). In contrast, the Linear Regression approach was less efficient
810 than the other methods, showing the highest LOOCV errors and a poorer
811 representation of the seasonal cycle. Comparison among different imputation
812 methods for mosquito data and any other time series data is a necessary first step
813 in the evaluation of potential biases in the new data information provided [52]. If
814 this new data is used as input to an impact model, conducting a sensitivity analysis
815 of the imputation methods on the expected response would provide additional
816 insights into the usefulness of the imputed data. In general, the seasonal cycle was
817 well represented by all imputation methods, with marked seasonal peaks between
818 August and September, two to three months after the beginning of the rainy
819 season. Similar results were recorded in other malaria endemic areas of
820 Venezuela [13,39], while in other regions the seasonal peaks are related to the
821 transition period between rainy and dry season [38,53,54]. The large spike
822 observed in 2010 (Fig 5), has been associated to a weak La Niña condition. A
823 similar spike to the one observed in 2010 was estimated for 2016 when applying
824 the GB imputation method to the An. darlingi time series, with a maximum
825 comparable to the one recorded in 2010. However, mosquito measurements were
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
42
826 only available through July 2016, so direct validation for this period is not possible.
827 Furthermore, we cannot conclude that there is a direct association of this spike
828 with a weak La Niña event, since the next event was reported during the Fall 2017
829 up to Spring 2018 [55]. It is important to point out that although there are abundant
830 references worldwide relating malaria and other diseases to ENSO events, the
831 available data on mosquito abundance and ENSO is scant. El Niño is associated
832 with an increase in rainfall in the Colombia Pacific coast and East Africa, and La
833 Niña with drought. A three-year study conducted in the Colombian Pacific coast
834 reported no relation between ENSO and the abundance of An. albimanus and An.
835 darlingi [56] while a one-year study in Tanzania showed a decline in the
836 abundance of An. arabiensis and An. funestus associated with the drought due to
837 La Niña 2016-2017 [57].
838 The relationship between climate and malaria is powerful, complex, and highly
839 localized. Several studies clearly show that factors like temperature and rainfall,
840 and the global climate patterns like El Niño that influence them, are critical drivers
841 of malaria epidemics [17,58,59]). Previous studies in Venezuela have shown that
842 El Niño is associated with drought, resulting in an increase in malaria cases the
843 following year [21,22,23]). In the present study the significant lags for rainfall and
844 ENSO were one month earlier for the P. vivax model than for the P. falciparum
845 model (Table 3), which is reasonable considering that the sporogonic cycle of P.
846 falciparum is larger than that of P. vivax [60,61]. Although variations in temperature
847 have a great impact on the developmental time of the mosquitoes and the
848 parasites [60,61,62], the lag of 8 months was the most significant for both models.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
43
849 When fitting the P. vivax and P. falciparum malaria incidence models, the main
850 issue is the non-significance or exclusion of mosquito abundance as covariates for
851 the P. falciparum malaria incidence model. These findings suggest that it is
852 insufficient to treat mosquito abundance and diversity data collected from a single
853 locality as representative for the entire municipality, because these ecological
854 characteristics can differ significantly between localities within the same region
855 [11,38,39,53].
856 However, the proposed malaria model is still useful for predictive purposes, since
857 the model accounts for potential seasonal effects and significant changes in the
858 incidence trend, lagged and non-lagged environmental factors, and the impact of
859 the previous month's malaria incidence. The better model performance observed
860 when using All Anopheles species as a mosquito count predictor, compared with
861 using An. darlingi (Fig 16) might be due to the unexpected spikes obtained in the
862 imputed mosquito count time series of An. darlingi for the Boosting method (GB)
863 (Fig 10D).
864 The sensitivity analysis using different imputation methods is a key component in
865 the results presented in this work. These results are demonstrated through the
866 process of evaluating both model accuracy and model structure, as these aspects
867 are sensitive to the imputation method employed. Therefore, it is important to
868 explore all available methods to ensure robust and reliable conclusions. When
869 assessing the sensitivity of the different imputation methods on the malaria model
870 predictive performance, the MAPE results shown in Table 5 confirms a reasonably
871 good predictive performance of the P. vivax model when mosquito counts time
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
44
872 series for All Anopheles species imputed with KNN and GB methods are used as
873 predictors (20% <MAPE<30%), while values of MAPE are greater than 50% when
874 mosquito count time series imputed with LR and SLR methods are used instead.
875 Thresholds to evaluate model predictive performance according to the MAPE
876 criteria have been discussed by Ma and Liu [63].These results might be an
877 indication that these two imputation methods (LR and SLR) do not accurately
878 represent the mosquito abundance variability for All Anopheles species, and as a
879 consequence might fail to reproduce the association between mosquito abundance
880 and malaria incidence. In the case of the P. falciparum model predictive
881 performance according to MAPE is more homogeneous across the different
882 imputation methods (Table 5). But in this case mosquito abundance does not yield
883 a better model accuracy when the KNN and GB imputed mosquito counts are
884 considered as predictors. Differences in model accuracy between P. vivax and P.
885 falciparum are driven by the selection of the sets of predictors that minimize
886 prediction errors. Mosquito counts imputed with the regression models for All
887 Anopheles species preserve the seasonality in the mosquito abundance cycle but
888 fail to represent important year-to-year variability fluctuations exhibiting a higher
889 degree of smoothing. However, when used as predictors they can explain a good
890 percentage of the malaria time series variability in the P. falciparum model. This is
891 not the case for the GB and KNN imputed mosquito time series, for which the more
892 complex fluctuations do not contribute to explain the observed variance in the
893 malaria time series model. We can speculate that data limitations may be an
894 important factor underlying these results.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
45
895 Despite data limitations, this work provides a template for integrating multiyear
896 entomological observations, climate information, and robust imputation techniques
897 to improve malaria prediction systems in remote settings. The approaches
898 developed here can support more timely and geographically informed
899 decision‑making, enabling public health authorities to anticipate high‑risk periods,
900 optimize vector control interventions, and better allocate scarce resources.
901 Continued investment in local capacity for mosquito monitoring—combined with
902 modern data‑analytic tools—can strengthen the foundations of malaria surveillance
903 and contribute to more effective control strategies in the Amazon and similar
904 hard‑to‑reach regions.
905 Finally, a key entomological parameter for the implementation of a particular
906 intervention for control is to determine the vector biting hourly pattern. This pattern
907 is species specific; nevertheless, An. darlingi exhibits different patterns along its
908 range of distribution, even among locations within a few kilometers apart
909 [11,64,65,66]. During the present study, the mosquito abundance at different hour
910 intervals was not significant, contrasting with previous studies in Sucre municipality
911 along the Caura River where An. darlingi showed a biting peak around midnight in
912 Las Majadas [64] whereas in Jabillal, there were two distinctive peaks at sunset
913 and sunrise [11]). The lack of significance in the hour intervals might be due to the
914 low abundance of mosquitoes collected throughout the study. Similar results were
915 reported for An. calderoni in Colombia [67].
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
46
916 Conclusion
917 This study demonstrates that machine learning–based imputation methods offer a
918 powerful tool for reconstructing highly incomplete mosquito abundance time series
919 in remote, resource‑limited malaria‑endemic regions. Incorporating the imputed
920 mosquito series and climate predictors into generalized time‑series models of
921 malaria incidence revealed important differences between P. vivax and P.
922 falciparum. Models for P. vivax responded strongly to mosquito abundance,
923 rainfall, temperature, ENSO anomalies, and recent malaria history, with Gradient
924 Boosting and KNN imputations yielding the most accurate forecasts. In contrast, P.
925 falciparum incidence was largely insensitive to mosquito abundance, likely
926 reflecting mismatches between the geographic scale of entomological and
927 epidemiological datasets, lower case counts, or intrinsic differences in
928 transmission. Climate drivers—particularly rainfall and ENSO—retained predictive
929 power for both parasites, underscoring their central role in shaping transmission
930 potential.
931 Acknowledgment
932 We are most grateful to Simón Caura, the local leader, Hernán Guzmán, Yarys
933 Estrada, Victor Sánchez, Jorge Moreno and Jesús González, who participated in
934 field activities. To the people of Boca de Nichare, for their support and friendship.
935 To Ana María Ibañez and Horacio Vargas for logistic support and friendship.
936 Mariapia Bevilacqua, Lya Cárdenas, and Zenaida Muria (ACOANA) and Dirección
937 de Salud Ambiental (MPPSalud) for logistic support.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
47
938 References
939 1. World Health Organization (WHO). World Malaria Report 2025: addressing the
940 threat of antimalaria drug resistance. Geneva: World Health Organization; 2025.
941 2. Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A
942 survey on missing data in machine learning. J Big Data. 2021; 8: 140.
943 3. Zhang X, Yan C, Gao C, Malin BA, Chen Y. Predicting missing values in medical
944 data via XGBoost regression. J Health Inform Res. 2020; 4: 383–394.
945 4. Cauthen KR, Lambert G, Ray J, Lefantzi S. Imputing data that are missing at
946 high rates using a boosting algorithm. Sandia National Laboratories. SAND-2016.
947 Available from: https://www.sandia.gov/app/uploads/sites/203/2022/06/sand2016-
948 9430J.pdf.
949 5. Sattari MT, Roshan E, Bakhshandeh M, Pourghasemi HR. Potential of kernel
950 and tree-based machine-learning models for estimating missing data of rainfall.
951 Eng Appl Comput Fluid Mech. 2020; 14(1): 1078-1094.
952 6. Yoon J, Zame WR, van der Schaar M. Multi-directional recurrent neural
953 networks: A novel method for estimating missing data. In: Time series workshop in
954 International Conference on Machine Learning; 2017. p. 1-9.
955 7. Fang C, Wang C. Time series data imputation: A survey on deep learning
956 approaches. 2020; arXiv preprint arXiv: 2011.11347.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
48
957 8. Rubio-Palis Y, Zimmerman RH. Ecoregional classification of malaria vectors in
958 the Neotropics. J Med Entomol. 1997; 34: 499-510.
959 9. Sinka ME, Rubio-Palis Y, Manguin S, Patil AP, Temperley WH, Gething PW, et
960 al. The dominant Anopheles vectors of human malaria in the Americas: occurrence
961 data, distribution maps and bionomic precis. Parasites Vectors. 2010; 3: 1-72. doi:
962 10.1186/1756-3305-3-72.
963 10. Sinka ME, Bangs MJ, Manguin S, Rubio-Palis Y, Chareonviriyaphap T,
964 Coetzee M, et al. A global map of dominant malaria vectors. Parasites Vectors.
965 2012; 5: 69. doi: 10.1186/1756-3305-5-69.
966 11. Rubio-Palis Y, Bevilacqua M, Medina DA, Moreno JE, Cárdenas L, Sánchez V,
967 et al. Malaria entomological risk factors in relation to land cover in the Lower Caura
968 River Basin, Venezuela. Mem Inst Oswaldo Cruz. 2013; 108: 220-228. doi:
969 10.1590/0074-0276108022013015.
970 12. Arruda M, Carvalho MB, Nussenzweig RS, Maracic M, Ferreira AW, Cochrane
971 AH. Potential Vectors of Malaria and Their Different Susceptibility to Plasmodium
972 falciparum and Plasmodium vivax in Northern Brazil Identified by Immunoassay.
973 Am J Trop Med Hyg. 1986; 35: 873–881. doi: 10.4269/ajtmh.1986.35.873.
974 13. Rubio-Palis Y, Wirtz RA, Curtis CF. Malaria entomological inoculation rates in
975 western Venezuela. Acta Trop. 1992; 52: 167-174. doi: 10.1016/0001-
976 706x(92)90033-t.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
49
977 14. Galardo AK, Arruda M, D’Almeida Couto AA, Wirtz R, Lounibos LP, et al.
978 Malaria vector incrimination in three rural riverine villages in the Brazilian Amazon.
979 Am J Trop Med Hyg. 2007; 76(3): 461–469.
980 15. Macdonald G: The epidemiology and control of malaria London: Oxford
981 University Press; 1957.
982 16. Garret-Jones C, Shidrawi G. Malaria vectorial capacity of a population of
983 Anopheles gambiae - An exercise in Epidemiological Entomology. Bull Wld Hlth
984 Org. 1969; 40: 531-545.
985 17. Magersa DM, Lou X-S. Effects of Climate Change on Malaria Risk to Human
986 Health: A Review. Atmosphere. 2025; 16: 71. doi: 10.3390/atmos16010071.
987 18. Stern P, Easterling WE. Making climate forecast matter. Washington DC:
988 National Academy Press; 1999.
989 19. Kavats SR. El Niño and health. Bull Wld Hlth Org. 2000; 78:1127-1134.
990 20. Poveda G, Rojas W, Quiñones ML, Vélez ID, Mantilla RI, Ruíz D, et al.
991 Coupling between annual and ENSO timescales in the malaria climate association
992 in Colombia. Environ Hlth Perspect. 2001; 109: 489-493.
993 21. Gagnon AS, Smoyer-Tomic KE, Bush ABG. The El Niño Southern Oscillation
994 and malaria epidemics in South America. Int J Biometeorol. 2002; 46: 81–89. doi:
995 10.1007/s00484-001-0119-6.
996 22. Bouma MJ, Dye C. Cycles of malaria associated with El Niño in Venezuela.
997 JAMA. 1997; 278(21): 1772-1774.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
50
998 23. Sáez-Sáez V, Rubio-Palis Y, Pino JC. Variabilidad climática y malaria estudio
999 regional: municipio Sifontes, estado Bolívar, Venezuela. Terra Nueva Etapa. 2009;
1000 25: 93-112.
1001 24. Grillet ME, El Souki M, Laguna F, León JR. The periodicity of Plasmodium
1002 vivax and Plasmodium falciparum in Venezuela. Acta Trop. 2014; 129: 52-60.
1003 25. Laguna F, Grillet ME, León JR, Ludeña C. Modelling malaria incidence by an
1004 autoregressive distributed lag model with spatial component. Spat Spatiotemporal
1005 Epidemiol. 2017; 22: 27-37. doi: 10.1016/j.sste.2017.05.001.
1006 26. Lynn MK, Bossak BH. Analysis of a Secular Trend in Malaria Incidence:
1007 Venezuela, 1959-2015. J Epidemiol Glob Health. 2017; 1:54-59.
1008 27. Kifle MM, Teklemariam TT, Teweldeberhan AM, Tesfamariam EH,
1009 Andegiorgish AK, Kidane EA. Malaria Risk Stratification and Modeling the Effect of
1010 Rainfall on Malaria Incidence in Eritrea. J Environ Public Health. 2019; Article ID
1011 7314129. doi: 10.1155/2019/7314129.
1012 28. Kumar P, Vatsa R, Parth Sarthi P, Kumar M, Gangare V. Modeling an
1013 association between malaria cases and climate variables for Keonjhar district of
1014 Odisha, India: a Bayesian approach. J Parasit Des. 2020; 44(2): 319–331.
1015 29. Laneri K, Cabella B, Prado PI, Mendes Coutinho R, Kraenkel RA. Climate
1016 drivers of malaria at its southern fringe in the Americas. PLoS ONE. 2019; 14(7):
1017 e0219249. doi: 10.1371/journal.pone.0219249.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
51
1018 30. Bevilacqua M, Medina DA, Cárdenas L, Rubio-Palis Y, Moreno J, Martínez A.
1019 Orientaciones para Fortalecer el Programa de Malaria en Zonas Remotas con
1020 Población Indígena en el Caura, Venezuela. Bol Mal Salud Amb. 2009; 49(1): 53-
1021 71. Available from: http://www.iaes.edu.ve/index.php/centro-de-
1022 descargas/viewcategory/16-vol49-no1-ano-2009.
1023 31. Rubio-Palis Y, Moreno JE, Bevilacqua M, Medina D, Martínez A, Cárdenas L,
1024 et al. Caracterización ecológica de los anofelinos y otros culícidos en territorio
1025 indígena del Bajo Caura, estado Bolívar, Venezuela. Bol Mal Salud Amb. 2010;
1026 50(1): 95-107. Available from: http://www.iaes.edu.ve/index.php/centro-de-
1027 descargas/viewcategory/12-vol50-no1-ano-2010.
1028 32. Medina D, Bevilacqua M, Cárdenas L, Morales LG, Rubio-Palis Y, Martínez A,
1029 et al. Mapa de riesgo de transmisión de malaria en la cuenca del río Caura,
1030 Venezuela. Bol Mal Salud Amb. 2011; 51(2): 23-38. Available from:
1031 http://www.iaes.edu.ve/index.php/centro-de-descargas/viewcategory/105-vol51-
1032 no2-ano-2011.
1033 33. Bevilacqua M, Rubio-Palis Y, Medina DA, Cárdenas L. Malaria control in
1034 Amerindian communities of Venezuela. Ecohealth. 2015; 12: 253-266. doi:
1035 10.1007/s10393-015-1026-3.
1036 34. Moreno JE, Rubio-Palis Y, Bevilacqua M, Sánchez V, Guzmán H.
1037 Caracterización de hábitats larvales de anofelinos en la cuenca baja del Río
1038 Caura, estado Bolívar, Venezuela. Bol Mal Salud Amb. 2018; 58: 32-45. Available
1039 from: http://iaes.edu.ve/iaespro/ojs/index.php/bmsa/article/view/57/29.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
52
1040 35. Rubio-Palis Y, Moreno JE, Sánchez V, Estrada Y, Anaya W, Bevilacqua M, et
1041 al. Can Mosquito Magnet® substitute for human-landing catches to sample
1042 anopheline populations? Mem Inst Oswaldo Cruz. 2012;107: 546-549. doi:
1043 10.1590/S0074-02762012000400017.
1044 36. Pimenta PFP, Orfano AS, Bahia AC, Duarte APM, Ríos-Velásquez CM, Melo
1045 FF, et al. An overview of malaria transmission from the perspective of Amazon
1046 Anopheles vectors. Mem Inst Oswaldo Cruz. 2015; 110(1): 23-47. dor:
1047 10.1590/0074-02760140266.
1048 37. Magris M, Rubio-Palis Y, Menare C, Villegas L. Vector bionomics and malaria
1049 transmission in the Upper Orinoco River, southern Venezuela. Mem Inst Oswaldo
1050 Cruz. 2007; 102: 303-311. doi: 10.1590/s0074-02762007005000049.
1051 38. Galardo AKR, Zimmerman RH, Lounibos LP, Young LJ, Galardo CD, Arruda
1052 M, et al. Seasonal abundance of anopheline mosquitoes and their association with
1053 rainfall and malaria along the Matapí River, Amapá, Brazil. Med Vet Entomol.
1054 2009; 23(4): 335-349. doi: 10.1111/j.1365-2915.2009.00839.x.
1055 39. Rubio-Palis Y, Curtis CF. Biting and resting behavior of anophelines in western
1056 Venezuela and implications for control of malaria transmission. Med Vet Entomol.
1057 1992; 6: 325-334.
1058 40. Sterne AC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, et al.
1059 Multiple imputation for missing data in epidemiological and clinical research:
1060 potential and pitfalls. BMJ. 2009; 338. doi: 10.1136/bmj.b2393.
1061 41. Zhang ML, Zhou ZH. ML-KNN: A lazy learning approach to multi-label learning,
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
53
1062 Pattern Recog. 2007; 40(7): 2038-2048. doi: 10.1016/j.patcog.2006.12.019.
1063 42. Bentéjac C, Csörgő A, Martínez-Muñoz G. A comparative analysis of gradient
1064 boosting algorithms. Artif Intell Rev. 2021; 54:1937–1967. doi: 0.1007/s10462-020-
1065 09896-5.
1066 43. Liboschik T, Fokianos K, Fried R. tscount: An R package for analysis of count
1067 time series following generalized linear models. J Stat Softw. 2017; 82(5): 1–51.
1068 doi: 10.18637/jss.v082.i05.
1069 44. Cox FE. History of the discovery of malaria parasites and their vectors. Parasit
1070 Vectors. 2010; 3:1-9.
1071 45. Smith M, Mackli MG, Thomas CJ. Hydrological and geomorphological controls
1072 of malaria transmission. Earth Sci Rev. 2013; 116: 109-127.
1073 46. Ikeda T, Behera SK, Morioka Y, Minakawa H, Hashizume M, Tsuzuki A, et al.
1074 Seasonally lagged effects climatic factor son malaria incidence in South Africa. Sci
1075 Rep. 2017; 7: 2458. doi: 10.1038/s41598-017-02680-6.
1076 47. Okeneye K, Gumel AB. Analysis of a temperature and rainfall dependant model
1077 for malaria transmission dynamics. Math Biosci. 2017; 287: 72-92.
1078 48. World Health Organization. Guidelines for malaria vector control. Geneva:
1079 World Health Organization; 2019.
1080 49. Drakou K, Nikolaou T, Vasquez M, Petric D, Michaelakis A, Kapranas A, et al.
1081 The Effect of Weather Variables on Mosquito Activity: A Snapshot of the Main
1082 Point of Entry of Cyprus. Int J Environ Res Public Health. 2020; 17: 1403. doi:
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
54
1083 10.3390/ijerph17041403.
1084 50. Grillet ME, Hernández-Villena JV, Llewellyn MS, et al. Venezuela’s
1085 humanitarian crisis, resurgence of vector-borne diseases, and implications for
1086 spillover in the region. Lancet Infect Dis. 2019; 19: e149-161.
1087 51. Briceño-León R, Guevara M, Terán I. Politics as Disease in Venezuela. Vector
1088 control before and after the Bolivarian revolution. In: Locating Zika. Social Change
1089 and Governance in an Age of Mosquito Pandemics. Kevin Bardosh Editor.
1090 Routledge, London. 238 p. doi: 10.4324/9780429456558.
1091 52. Batista GEAPA, Monard MC. An analysis of four missing data treatment
1092 methods for supervised learning. Appl Artif Intell. 2003; 17:5-6, 519-533. doi:
1093 10.1080/713827181.
1094 53. Moreno JE, Rubio-Palis Y, Páez E, Pérez E, Sánchez V. Abundance, biting behavior
1095 and parous rate of anopheline mosquito species in relation to malaria incidence in gold-
1096 mining areas of southern Venezuela. Med Vet Entom. 2007; 21: 339-349. doi:
1097 10.1111/j.1365-2915.2007.00904.x.
1098 54. Dos Santos F, Xu M, Bravo de Guenni L, Lourenço-de-Oliveira R, Rubio-Palis
1099 Y. Characterization of larval habitats of Anopheles (Nyssorhynchus) darlingi and
1100 associated species in malaria areas in western Brazilian Amazon. Mem Inst
1101 Oswaldo Cruz. 2024; 119: e240116.
1102 55. NOAA. Physical Sciences Laboratory. Monthly Climate/Ocean Indices (Time-
1103 Series) at PSL. Available from: https://psl.noaa.gov/data/timeseries/month.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
55
1104 56. Poveda G, Quiñones ML, Vélez ID, Rojas W, Rúa-Uribe G, Ruíz CD et al.
1105 Desarrollo de un sistema de alerta temprana para la malaria en Colombia.
1106 Universidad Internacional de Andalucía, Sevilla, España. 2008; 181 p.
1107 57. Kreppel K, Caminade C, Govella N, Morse AP, Ferguson HM, Baylis M. Impact
1108 of ENSO 2016-2017 on regional climate and malaria vector dynamics in Tanzania.
1109 Environ Res Lett. 2019; 14: 075009.
1110 58. Mandal S, Sarkar RP, Sinha S. Mathematical models of malaria – a review. Mal
1111 J. 2011; 10: 202. Available from: https://www.malariajournal.com/content/10/1/202.
1112 59. Yiga V, Nampala H, Tumwiine J. Analysis of the modelo n the effect of
1113 seasonal factor son malaria transmission dynamics. J Appl Math. 2020; doi:
1114 10.1155/2020/8885558.
1115 60. World Health Organization. Manual on Practical Entomology in Malaria. Part I
1116 and II. Geneva: World Health Organization; 1975.
1117 61. Martens WJM, Niessen LW, Rotmans J, Jetten TH, McMichael AJ. Potential
1118 impact of global climate change on malaria risk. Environ Health Perspect. 1995;
1119 103: 458-464.
1120 62. Yadav CP, Hussain SSA, Mullick R, Rahi M, Sharma A. Climate zones are a
1121 key component of the heterogeneous presentation of malaria and should be added
1122 as a malariometric for the planning of malaria elimination. PLOS Glob Public
1123 Health. 2023; 3(6): e0001878. doi:10.1371/journal.pgph.0001878.
1124 63. Ma X, Liu Z. Application of a novel time-delayed polynomial grey model to
1125 predict the natural gas consumption in China. J Comput Appl Math. 2017; 324: 17-
1126 24. doi: 10.1016/j.cam.2017.04.020.
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
56
1127 64. Rubio-Palis Y. Observaciones sobre el patrón de actividad hematofágica del
1128 vector de malaria Anopheles darlingi en poblaciones del sur de Venezuela. Bol Dir
1129 Malariol Sal Amb. 1995; 35(2): 66-70.
1130 65. Rosa-Freitas MG, Broomfield G, Priestman A, Milligan PJM, Momen H,
1131 Molyneux DH. Cuticular hydrocarbons, isoenzymes and behavior of three
1132 populations of Anopheles darlingi from Brazil. J Amer Mosq Control Assoc. 1992;
1133 8:357-366.
1134 66. Moreno M, Saavedra MP, Bickersmith SA, Lainhart W, Tong C, Alava F, et al.
1135 Implications for changes in Anopheles darlingi biting behaviour in three
1136 communities in the peri-Iquitos region in Amazonian Perú. Malar J. 2015; 14:290.
1137 doi: 10.1186/s12936-015-0804-2.
1138 67. Orjuela LI, Ahumada ML, Avila I, Herrera S, Beier JC, Quiñones ML. Human
1139 biting activity, spatial-temporal distribution and malaria vector role of Anopheles
1140 calderoni in the southwest of Colombia. Malar J. 2015; 14:256. doi:
1141 10.1186/s12936-015-0764-6.
1142
1143 Supporting information
1144 S1 Fig. El Niño 3.4 index for the period 2009 – 2016.
1145 S2 Fig. Root mean square prediction error (RMSE) after 1,000 iterations of
1146 the random sampling approach [leave-one-out cross-validation (LOOCV)] for
1147 each Anopheles species, showcasing the impact of lagging/no-lagging of
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
57
1148 climatic variables using an approach with linear regression imputation. All=
1149 all Anopheles species collected; darl.1= Anopheles darlingi; goel.6= Anopheles
1150 goeldii; osw.5= Anopheles oswaldoi; other= other Anopheles species collected.
1151 S3 Fig. Root mean square prediction error (RMSE) for each species using a
1152 leave-one-out cross-validation (LOOCV) approach with stochastic linear
1153 regression imputation. All= all Anopheles species collected; darl.1= Anopheles
1154 darlingi; goel.6= Anopheles goeldii; osw.5= Anopheles oswaldoi; other= other
1155 Anopheles species collected.
1156 S4 Fig. Estimated total number of All Anopheles species abundance (ETNM)
1157 time series Gradient Boosting (GB) imputation method.
1158
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
.CC-BY 4.0 International licenseavailable under a
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
The copyright holder for this preprintthis version posted April 17, 2026. ; https://doi.org/10.64898/2026.04.15.718796doi: bioRxiv preprint
Text is read by the "Ask this paper" AI Q&A widget below.
Extraction quality varies by source — PMC NXML preserves structure
cleanly, OA-HTML may include some navigation residue, and OA-PDF can
have broken hyphenation. The publisher copy
(via DOI)
is the canonical version.