### Summary

We analysed trends in the frequencies of target-site insecticide resistance mutations across space and time in three African malaria vector species: *An. gambiae*, *An. coluzzii*, and *An. arabiensis*. We use spatiotemporal modelling approaches that apply both Bayesian statistics and machine learning methods in order to predict mutation frequencies jointly across the three species over a spatial grid of approximately 5 km resolution. Our model predictions are based on surveillance data that records observed frequencies in mosquitoes sampled widely throughout west and northeast Africa. The machine learning methods predict the proportions of each allele in each mosquito sample, and they are informed by 99 potential predictor variables that represent environmental and biological processes which may influence selection for resistance. A Bayesian multinomial metamodel then combines predictions across the multiple machine learning models in order to make more accurate and robust predictions (a methodology known as stacked generalisation [46]). Using the metamodel, we predict the frequencies of each mutation in all grid cells within our nine selected countries for all years in the period 2005–2017.

### Vgsc allele frequency data

Our models are informed by a database containing frequencies of *Vgsc* mutations in mosquito samples belonging to the *Anopheles gambiae* species complex collected from within western and eastern Africa over the period 2005–2017 (Additional File 1: Figures S2-S6). This database is an updated version of a publicly available data set containing *Vgsc* allele frequencies [47] and collates data sets from multiple contributors, including published and unpublished sources. The database records the number of mosquitoes tested in each sample, together with the frequencies of the *Vgsc*-995 L, *Vgsc*-995F, and *Vgsc*-995S alleles in the sample. Some, but not all, data sets in the database record the *Vgsc* genotype of the sampled mosquitoes. The database also records information about the mosquito species tested, the molecular screening methods used for species identification and *Vgsc* allele identification, and the geographic coordinates of the sample collection location. We only included samples that are representative of the *An. gambiae* population sampled at each place and time (i.e. randomly sampled from the population). We also only included samples that contained five or more mosquitoes. The final data set included 2418 samples distributed across 27 countries.

We developed predictive maps of *Vgsc* allele frequencies for a focal selection of countries which had the highest number of samples, excluding those countries for which the spatial distribution of samples was strongly clustered (Additional File 1: Figures S2 and S3). In selecting countries for inclusion in our mapping analysis, we subdivided the African continent into western and eastern regions, with Cameroon and countries further west of Cameroon falling within our western region and countries that lie east of the Central African Republic falling within our eastern region. Within the western region, we selected the five countries with the greatest number of samples (Additional File 1: Figure S4), excluding Senegal because of a tight clustering of sampling locations around the border with The Gambia (Additional File 1: Figure S2). In the eastern region, we selected all countries that had samples that were included in our modelling analysis (Additional File 1: Figure S5), excluding Tanzania due to a strong spatial clustering of the sampling observations (Additional File 1: Figure S3). Sudan is the most data-rich country included in our study (Additional File 1: Figure S5), but it covers a large spatial area and the sampling locations are all located in a region in the eastern part (Additional File 1: Figure S3). Therefore, we developed predictive maps only for a region in the east of Sudan that does not extend further west than a longitude of 29.5° E or further north of 17° N. In the case of Ethiopia, we excluded the region east of a longitude of 44° E because we have no samples located in this region.

We included one central African country, the Democratic Republic of Congo (DRC), in our mapping analysis. Although the *Vgsc* allele frequency data is sparse throughout the country (Additional File 1: Figures S2, S3 and S5), we included the DRC because it covers a region that is rarely studied. In the case of the DRC, our modelling analysis is restricted to predicting the frequency of the *Vsgc*-995F mutation only, and we do not predict *Vsgc*-995S frequencies (see below). We excluded the data on *Vsgc*-995S frequencies from the DRC analysis because most studies from the DRC only perform an assay capable of detecting L995F, which can lead to erroneous genotypes when both resistant alleles are present, which appears typical in DRC (Loonen 2020 [48], Lynd et al. 2018).

### Potential predictor variables

Our set of predictors is similar to that described in Hancock et al. [5, 9] and includes 99 variables describing environmental characteristics that could potentially be related to the development and spread of insecticide resistance in populations of *Anopheles gambiae* complex mosquito species. These variables describe the coverage of insecticide-based vector control interventions, agricultural land use [49, 50], and the environmental fate of agricultural insecticides [51], other types of land use [49, 52,53,54], climate [49, 55, 56], and relative species abundance. A detailed description of this set of predictor variables is provided in Additional File 1: Table S3. Our vector control intervention data includes a variable estimating ITN coverage in terms of the proportion of people who slept under a net the preceding night, at each ~ 5 km pixel location for each year [57, 58]. Relative species abundance is represented by a variable estimating the abundance of *An*. *arabiensis* relative to the abundance of *An*. *gambiae* and *An*. *coluzzii* [21]. For all variables, we obtained spatially explicit data on a grid with a 2.5 arc-minute resolution (which is approximately 5 km at the equator) covering sub-Saharan Africa. For variables for which temporal data were available at an annual resolution, we included time-lagged representations with lags of 0, 1, 2, and 3 years.

### Stacked generalization ensemble modelling approach

We used stacked generalization to develop a model ensemble that combines the predictions generated by multiple machine learning models [46, 59]. Stacked generalization uses a meta-model, or “generalizer”, that learns a weighted combination of the predictions across each model in the ensemble, where the predictions of each model are the out-of-sample predictions derived from *K-*fold cross validation. The predictions produced by the generalizer correct for the biases of each model and are expected to have improved prediction accuracy relative to any of the individual models included in the ensemble [46, 59, 60].

#### Machine learning models

Our model ensemble included three different machine learning models that predicted the frequencies of the *Vgsc*-995 L, *Vgsc*-995F, and *Vgsc*-995S at each pixel within our mapped countries for each year within the period 2005–2017. The three machine learning models were an extreme gradient boosting (XGB) model, a random forest (RF) model, and a neural network (NN) model. These models were chosen due to their demonstrated high predictive performance [60, 61], which derives from their ability to represent non-linear relationships and high-level interactions across the model features [5, 60]. The XGB model was implemented using the R package xgboost [62], and the RF and NN models were implemented using the sklearn [63] and keras packages [64] in Python. The label for these models was a categorical variable corresponding to whether the *Vgsc*-995 L, *Vgsc*-995F, or *Vsgc*-995S mutation was detected across all the alleles screened in each sample. All *Vgsc* allele frequency observations from the 27 countries in our data set were used to inform the model (see above). The models predict the expected frequencies of each allele at each mapped pixel. The features used in the models included the 99 environmental predictor variables together with the 1-, 2-, and 3-year lags for those variables that vary on a yearly time step. A factor variable representing the mosquito species (*An. gambiae, An. coluzzii*, *An. arabiensis*, or *An. gambiae s.l*) was also included as a feature, where the *An. gambiae s.l.* category describes individuals within samples for which species within the *Anopheles gambiae* complex were not identified. Finally, the year in which the bioassay and allele frequency samples were collected was also included as a feature. For each machine learning model, parameter tuning was performed using out-of-sample validation by subdividing the data into training, validation, and test subsets (see Additional File 1).

We developed an additional model ensemble that predicted only the frequency of *Vsgc*-995F, which we used to develop predictive maps of the *Vsgc*-995F frequency for the DRC. This model ensemble included the three machine learning models as described above, and the label was a categorical variable corresponding to whether the *Vsgc*-995F mutation was detected across all the alleles screened in the sample. The label included the full data set containing the *Vsgc*-995F frequencies in the 2418 samples. The features used were the same as those used in the above models, and parameter tuning was performed as described above.

#### Model stacking and multinomial logistic regression

We use a Bayesian multinomial logit regression model as our meta-model to combine the out-of-sample predictions obtained from performing *K-*fold cross-validation on each of the three machine learning models in the model ensemble [65,66,67] (see www.r-inla.org). The multinomial logit model represents observations where the sampling unit corresponds to one of a set of mutually exclusive alternatives j ∈ {1, …, J}; in our case *J* = 3, with the alternatives being the *Vgsc*-995 L, *Vgsc*-995F, or *Vgsc*-995S marker (we do not account for diploid genotypes in our model). Our observations *y*_{ij} are the numbers of *Vgsc*-995 L alleles (*j* = 1), *Vsgc*-995F alleles (*j* = 2), and *Vsgc*-995S alleles (*j* = 3) in sample *i,* with *i* = 1,….,*N* samples in total*.* Our model has three covariates which are the out-of-sample predictions of the frequencies of each allele in each sample given by the three machine learning models, transformed using the empirical logit transform to avoid discontinuities at 0 and 1. We store these covariates in the matrices **X**^{1}, **X**^{2}, and **X**^{3}, which have dimension *N* × *J*, with each matrix containing the predictions of frequencies of the three alleles for one of the three machine learning models. Our multinomial logit model uses the following linear predictor:

$${V}_{ij}={\beta}_j^1{X}_{ij}^1+{\beta}_j^2{X}_{ij}^2+{\beta}_j^3{X}_{ij}^3$$

where there are three sets of three coefficients \({\beta}_j^1\), \({\beta}_j^2\)and \({\beta}_j^3\) (*j =* 1,2,3); we combine these into the vector Β. For each observation *i*, the expected probabilities of each alternative are:

$${p}_{ij}=\frac{g_{ij}\left(\mathrm{B}\right)}{G_i\left(\mathrm{B}\right)}$$

where *g*_{ij}(Β) = *exp* (*V*_{ij}) and \({G}_i=\sum_{j=1}^J{g}_{ij}\left(\mathrm{B}\right)\) (see Croissant [66] and www.r-inla.org). We use the multinomial-Poisson transformation [67], which gives the following expression for the Poisson likelihood [67]:

$$L\left({y}_{ij}|\mathrm{B},\phi \right)=\prod_i\prod_{j=1}^J{\left({g}_{ij}\left(\mathrm{B}\right)\mathit{\exp}\left({\phi}_i\right)\right)}^{y_{ij}}\mathit{\exp}\left(-{g}_{ij}\left(\mathrm{B}\right)\mathit{\exp}\left({\phi}_i\right)\right)$$

where *ϕ*_{i} are *N* additional parameters that need to be estimated in order to use the multinomial-Poisson transformation. Posterior distributions of the parameters Β and *ϕ*_{i} are obtained by fitting the model using the R-INLA package [65] (see www.r-inla.org), with the coefficients *Β* as fixed effects and the intercepts *ϕ*_{i} as an independent (iid) random effect. Our implementation constrains each of the nine coefficients to be positive \(\left({\beta}_j^q\ge 0,\forall j,q,q=1,2,3\right)\) [60]. Once the parameter estimation has been performed, the final set of predictions given by the model ensemble are obtained by replacing the elements of **X**^{1}, **X**^{2}, and **X**^{3} with the in-sample predictions of the machine learning models obtained by fitting each of these models to all the data (all the labels and the corresponding sets of features). For our second model ensemble for predicting only *Vsgc-*995F frequencies, the formulation of the meta-model is the same as described above, with *J* = 2.

### Posterior validation

To assess the ability of our model to accurately represent the data, we performed posterior validation of our model ensemble using 10-fold out-of-sample cross-validation. Specifically, the data were divided into 10 subsets (or “test” sets, using random sampling without replacement), and 10 successive model fits were performed, each withholding a different test set. The test sets were withheld from each of the three machine learning models included in the ensemble, as well as from the multinomial logit metamodel. The root mean squared error (RMSE) across all (withheld) *Vgsc* allele frequency observations confirmed that the model ensemble delivered higher prediction accuracy than each of the three machine learning model constituents (Additional File 1: Table S1).

### Insecticide resistance bioassay data

To analyse relationships between our predicted resistance allele frequencies and resistance phenotypes observed in field vector populations, we utilised a database of insecticide resistance bioassay data [5] including samples tested over the period 2005–2017. All species included in the samples are from the *Anopheles gambiae* complex and the composition of sibling species is unknown for the majority of samples. The data record the number of mosquitoes in the sample and the proportional sample mortality resulting from the bioassay, as well as variables describing the mosquitoes tested, the sample collection site, and the bioassay conditions and protocol. We selected the bioassay results for standard diagnostic dose WHO susceptibility tests performed using deltamethrin for all samples collected within the five countries included in our analysis (see the “Results” section), resulting in 159 results for Burkina Faso, 297 results for Benin, 184 results for Cameroon, 134 results for Ethiopia– and 256 results for Sudan. The bioassay data set included only two bioassay results for Equatorial Guinea and 22 bioassay results for Uganda, so we excluded these countries from our analysis of associations between our mapped *Vgsc* allele frequencies and the prevalence of insecticide resistance phenotypes. Susceptibility tests have a high measurement error; Hancock et al. [5] estimated that the measurement error associated with the sample proportional mortality had a standard deviation (sd) = 0.25 for bioassays performed using deltamethrin. Therefore, we used the predicted mean mortality to deltamethrin for *Anopheles gambiae* complex mosquitoes obtained from a series of annual predictive maps [5], using the predicted value for each sample collection location and year in our analysis.

### Regression models of associations between resistance allele frequencies and mortality following exposure to deltamethrin

We assessed associations between the predicted mean mortality following exposure to deltamethrin and the predicted frequency of the *Vsgc*-995F allele. Mean mortality measurements represent the entire *Anopheles gambiae* complex, so we combined our species-specific predictions of *Vsgc*-995F frequencies across *An. gambiae*, *An. coluzzii*, and *An. arabiensis* to estimate the *Vsgc*-995F frequency in the *An. gambiae* complex for each sample collection location and year*, f*_{C, i}:

$${f}_{C,i}={R}_{a,i}{f}_{a,i}+\left(1-{R}_{a,i}\right)\left({f}_{g,i}+{f}_{z,i}\right)/2$$

(1)

where *R*_{a,i} is the abundance of *An. arabiensis* at location *i* relative to the combined abundance of *An. gambiae* and *An. coluzzii*, and *f*_{a, i}, *f*_{g, i}, and *f*_{z, i} are the predicted frequencies of the *Vsgc*-995F allele in *An. arabiensis*, *An. gambiae*, and *An. coluzzii* at location *i*, respectively. Values of the relative abundance of *An. arabiensis* at each geographic location were obtained from the maps developed by [21]. We do not have spatially explicit estimates of the relative abundances of *An. gambiae* and *An. coluzzii*, so we used the mean frequency across these two species in our calculation. We excluded Kenya from our regression analysis because frequencies of *Vsgc*-995F are low at our sampled locations (observed *Vsgc*-995F frequencies are less than 0.07 across 90% of samples). We tested the accuracy of our estimated *Vgsc*-995F frequencies for the *An. gambiae* complex (Eq. 1) against 797 of the observed *Vgsc*-995F sample frequencies in our data set that were representative of the *An. gambiae* complex [47] and found a good level of accuracy (Additional File 1: Figure S15).

We fitted OLS linear regression models to predict mean mortality to deltamethrin using *f*_{C, i} as a covariate. Before model fitting, we applied the empirical logit transformation to both the independent variable and the covariate. To allow for serial autocorrelation in the data, we calculated Newey-West robust standard errors [68,69,70], using the sandwich package in R [70,71,72], specifying automated calculation of the bandwidth parameter.

### Importance of potential explanatory variables

In order to identify which of our potential predictor variables were having the most impact on our modelled *Vgsc* allele frequencies, we calculated measures of the importance of each predictor variable for each of the machine-learning models used in our model ensemble. It is important to note that variable importance measures cannot be used to infer causality, and they can be difficult to interpret when predictor variables are correlated. For XGB, we used the gain measure calculated for each variable using the xgboost package [62], which is the fractional total reduction in the training error gained across all of that variable’s splits. For RF, we use the Gini importance, which is calculated using the sklearn package [63]. The Gini importance measures the influence of a variable in discriminating between classes in a classification algorithm [73]. For NN, we use the permutation importance, again calculated using the sklearn package. The permutation importance of a variable is obtained by randomly shuffling the values of the variable across all observations and recalculating the model score, which in our case is the prediction error across all data points.

### Independent conditional expectation (ICE) analysis across varying ITN coverage

We studied how variation in ITN coverage impacted our model-predicted resistance allele frequencies using ICE analysis. For a single chosen location in each country, we calculated the ICE [27] of the model predicted *Vsgc*-995F frequency with varying ITN coverage for the year 2005. The ICE simply calculates the predicted response value from the model across a range of a focal predictor variable, keeping all other predictor variables fixed at their original values. This can be used to explore how the focal covariate influences the model predictions, by examining the shape and magnitude of the relationship. It is important to be aware, however, that the variation in the focal covariate is artificial and does not represent the actual variation in that particular covariate over space or time. Our ICE calculations represent variation in the model predictions for a single location and year only. The selected location within each country was chosen at random from the *Vgsc* allele frequency sampling locations for that country (the coordinates of each location are shown in Additional File 1: Table S4). We used our model ensemble to calculated predicted *Vsgc*-995F frequencies across values of the ITN coverage in the year 2005 from zero to one in intervals of 0.1.