Global population genetic structure and demographic trajectories of the black soldier fly, Hermetia illucens

Background The black soldier fly (Hermetia illucens) is the most promising insect candidate for nutrient-recycling through bioconversion of organic waste into biomass, thereby improving sustainability of protein supplies for animal feed and facilitating transition to a circular economy. Contrary to conventional livestock, genetic resources of farmed insects remain poorly characterised. We present the first comprehensive population genetic characterisation of H. illucens. Based on 15 novel microsatellite markers, we genotyped and analysed 2862 individuals from 150 wild and captive populations originating from 57 countries on seven subcontinents. Results We identified 16 well-distinguished genetic clusters indicating substantial global population structure. The data revealed genetic hotspots in central South America and successive northwards range expansions within the indigenous ranges of the Americas. Colonisations and naturalisations of largely unique genetic profiles occurred on all non-native continents, either preceded by demographically independent founder events from various single sources or involving admixture scenarios. A decisive primarily admixed Polynesian bridgehead population serially colonised the entire Australasian region and its secondarily admixed descendants successively mediated invasions into Africa and Europe. Conversely, captive populations from several continents traced back to a single North American origin and exhibit considerably reduced genetic diversity, although some farmed strains carry distinct genetic signatures. We highlight genetic footprints characteristic of progressing domestication due to increasing socio-economic importance of H. illucens, and ongoing introgression between domesticated strains globally traded for large-scale farming and wild populations in some regions. Conclusions We document the dynamic population genetic history of a cosmopolitan dipteran of South American origin shaped by striking geographic patterns. These reflect both ancient dispersal routes, and stochastic and heterogeneous anthropogenic introductions during the last century leading to pronounced diversification of worldwide structure of H. illucens. Upon the recent advent of its agronomic commercialisation, however, current human-mediated translocations of the black soldier fly largely involve genetically highly uniform domesticated strains, which meanwhile threaten the genetic integrity of differentiated unique local resources through introgression. Our in-depth reconstruction of the contemporary and historical demographic trajectories of H. illucens emphasises benchmarking potential for applied future research on this emerging model of the prospering insect-livestock sector. Supplementary Information The online version contains supplementary material available at 10.1186/s12915-021-01029-w.


Additional file 2 -Table of contents
. Microsatellite marker characteristics of the novel Hermetia illucens population genetics tool kit.
GenBank Accession numbers and primer sequences (F: forward; R: reverse) for the 15 new loci, including superscript numbers for individual dyes ( 1 : FAM; 2 : ATTO532; 3 : ATTO550; 4 : ATTO565) used for forward primer labelling. Total primer concentrations in final PCR reaction volumes are specified (the same for forward and reverse), as are effective concentrations of labelled primers. Repeat motifs, including number of arrays for the sequenced specimen, are listed for each locus, together with documented ranges of fragment lengths in the entire dataset. Effective multiplexed combinations (MP 1 -3) are indicated but can easily be customised due to identical cycling conditions. Standard quality assurance measures ensure the repeatability of unambiguous allele scoring across different laboratories.  Countries are listed alphabetically within subcontinents. Numbers of sampled countries per subcontinent are indicated, as well as sampled states, provinces or territories within countries (where informative). Numbers in brackets specify numbers of samples from any given country, highlighted in italics if including wild populations. For more details see Table S1, Additional file 1 (in several cases more detailed information could not be disclosed for captive populations because several commercial providers wished to stay anonymous).   Table S6. Diversity and pairwise differentiation of globally inferred Hermetia illucens genetic clusters.
Genetic variance explained according to provenance (wild or captive) nested within subcontinents. Significant variance components (p < 0.01) are highlighted in bold.
More details on patterns according to analyses that specifically test for subcontinental origin within different provenances, as well as provenances within different subcontinents, are given in Table 3 and the main text. There, for instance, the substantial structure revealed particularly for Europe reflects independent introductions of genetically distinct captive populations to Europe (notably from North America and Asia) that are contrasted by highly diverse wild populations found in different European regions (see Table 3b).

Source of variance
Proportion of variance (%)

Between subcontinents 2.4
Between provenances within subcontinents 8.9 Between populations within provenances 14.3 Between individuals within populations 6.3 Within individuals 68.1 Table S8. Model-based estimates of contrasts and significance levels for populationspecific allelic richness.
Three separate linear mixed models on allelic richness were performed, including locus as random effect and the following fixed effects: A) provenance status (wild vs. captive), B) subcontinent of origin, and C) provenance nested within subcontinent (see Table panels). Model assumptions were fulfilled, so model-based estimates are reported on the response scale for post-hoc Tukey contrasts in a), b) and c).
Rarefaction was applied throughout based on the lowest number of individuals across populations (5 diploid specimen). For the most complex model (c), besides all pairwise contrasts for exclusively captive and exclusively wild-sourced populations within subcontinents, pairwise contrasts of wild versus captive populations are only reported for inter-continental comparisons (shaded in grey). Raw data are in Table S3. *** p < 0.0001; ** p < 0.01; * p < 0.05; . p < 0.1.
Overall model statistics:  Table S9. Genetic isolation by distance for selected hierarchical groupings.

Mantel-test-inferred statistics based on Pearson's product-moment correlation. Pairwise genetic (FST /
(1 -FST)) and geographic (log kilometres) distances were analysed for hierarchical groupings of interest according to subcontinental origin and/or provenance status. Significance, based on 9999 permutations, is indicated in bold.  Table S10. Specifications on the global distribution of Hermetia illucens genetic clusters presented in Figure 4.

Analysed group # Populations Isolation by distance
Summary of the major geographic regions of interest. Listed are sampled countries (alpha-3 code, ISO 3166-1) and states/territories (where relevant), provenances (wild vs. captive), and numbers of individuals and populations as represented by the pie charts depicted in Figure 4. Details about inferred underlying demography for each major geographic region as identified by roman numbers in Figure 1 are described in the sub-table below. More information about several commercial captive samples (beyond continental scales) could not be disclosed. For additional details see also Tables S1, S4, S6 and S11-13, as well as Figures S4-8 I: Dominated by introduced domesticated strains from North America (XX) and Europe (XVI), yet also comprising captive stocks derived from naturalised wild populations of cluster 16 and introgressants of cluster 11, respectively (see III); pure locally naturalised clusters 11 and 13 were virtually absent across African captive populations. II: Mostly cluster 13, going back to a historic introduction from a single source in central-east South America (XXIV); genetically homogenous across entire western Africa, except for few indications of introgression from recently introduced domesticated strains. III: Admixture between south-east African cluster 11 (IV) and domesticated captive populations plus a separate single founder population of cluster 16, either from North America directly (XXI or XXII) or via Mediterranean Europe (XVIII or XIX). IV: Pure cluster 11, going back to a secondary admixture event between previously established western African cluster 13 (II) and early Polynesian colonisers (primarily admixed group XI); genetically homogenous across entire south-eastern Africa. V: Asian captive populations comprise a large fraction of regionally naturalised origins (VI), but mostly more recently derived hybrids and introgressants (cluster 5 and 4, respectively) due to rare yet highly influential introductions of domesticated strains from North America (XX). VI: Pure cluster 6, directly deriving from Polynesian primary founders of the Australasian range expansion (XI); genetically homogenous naturalised lineage across eastern and south-eastern continental and insular Asia. VII: Largely cluster 6 (VI), yet Sunda Islands are characterised by widespread introgression into locally naturalised populations from commonly farmed domesticated strains of North American origin (XX). VIII: Pure cluster 7, derived from admixture between Asian naturalised cluster 6 (VII) and eastern Australian naturalised cluster 8 (IX), thus representing the youngest distinct group of the Australasian serial colonisation; genetically homogenous across western Australia. IX: Pure cluster 8, diverged from cluster 9 (X) that previously colonised nearby Pacific Islands; genetically homogenous across eastern Australia. X: Pure cluster 9, diverged from Asian naturalised cluster 6 (VI) upon colonising Australasia from northern Polynesian areas (XI) in a counter-clockwise manner; Pacific Islands thus feature differentiated start and endpoints of this range expansion. XI: Admixed cluster memberships originating from independent introductions from western North America (XXI) and northwestern South America (XXIII); constituted the primarily admixed Polynesian bridgehead population that mediated serial colonisations across entire Australasia and secondary admixture events worldwide. XII: Northern Mexico appears to be the region where North American cluster 16 displaces Central American cluster 15 (see also XIII). XIII: Cluster 15 dominates Central America where it derived from cluster 14 (XIV) during northward range expansions (north of the Panamanian isthmus); found across provenances. XIV: Cluster 14 (XXIII) that further expanded its range into southern Central America without notable genetic changes (Panama only). XV: Pure cluster 9; inferred admixed ancestry primarily involving south-eastern North American and central-eastern South American lineages, plus limited gene flow from Central America. XVI: North American introductions of domesticated strains (comparatively diverse cluster 2 around 2005 and more recent ones, notably cluster 1) and a regional breeding program (cluster 3), plus recent introductions of captive stocks from Africa (indistinguishable) and Asian hybrids (cluster 5); absence of pure or introgressed European naturalised clusters in captivity. XVII: Unique cluster 10, going back to an admixture event between independent south-eastern African (IV) and eastern Australian (IX) introductions; genetically homogenous across western and central Europe.   Table showing analysis ID and description, the number of samples used for each analysis, the number of tested models and corresponding simulations, and the range of uniform prior distributions for each of the model parameters. If more than a single population was available from a given region/genetic group of interest, populations included in the analyses were selected based on the following criteria: (1) membership of all individuals to a genetic cluster of interest (Fig. 1, Table 2), (2) preference of wild populations from a given cluster/region, (3) preference of populations with comparatively high allelic richness (Table S3) that group basal in a given cluster within the same characteristic clade (Fig. 3, Figure  S5), and (4) whenever available, knowledge on complementary mitochondrial data based on a separate study (Ståhls et al. 2020). Individual analyses regarding non-native colonisation routes were largely built on available documentation records (see Introduction and references cited in this context). Individual analyses are presented in a sequential order as increasing information content was incorporated to build models for subsequent analyses focussing on supposedly successive colonisation events across non-native ranges (if indicated). See Tables S12 and S13 for further details.
Individual analyses addressed key questions and focal genetic groups including sample sets as follows:  Table S12. Posterior probabilities of demographic models inferred from ABC analyses.
Posterior probabilities and 95% confidence intervals for demographic models compared in seven separate analyses (sub-tables A-G). Significantly best-fitting models are highlighted with asterisks. For some analyses moderate but non-significant support was also received for particular competing models, besides the significantly best supported model. Specific additional details and interpretations of these selected cases are provided in the following: B) North American captive: A posterior probability of 0.47 may indicate that the significantly best fitting model, i.e. direct descent from south-east North American wild populations, does not capture all demographic events that shaped the genetic relationship among the populations. This may be influenced by specific breeding-mediated effects that are difficult to model relative to included wild populations. However, inspecting the still moderately supported alternative model revealed that the source of uncertainty in these ABC analyses was the genetic relationship of south-east North American wild populations to other American clusters, rather than their direct ancestry of North American captive strains. This is presumably due to occasional gene flow into south-east North America via the Caribbean (see below). Overall, the significantly best-explaining model and a moderately supported alternative model both favoured south-eastern North America as the direct source of North American captive populations. In contrast, models that considered a different American origin for the North American captive strains were hardly supported. C) Caribbean: A posterior probability of 0.46 may indicate that the significantly best fitting model, i.e. admixture between populations from central-eastern South America and south-eastern North America, does not capture all demographic events that shaped the genetic relationship among the populations. In agreement with this, two models accounting for gene flow between central-eastern South America and central regions of Central America in the Caribbean, as well as between the latter and south-eastern North America, also received lower non-significant support. This suggests multiple introductions from all three areas into the Caribbean, whereas Central American ancestors appear to have left comparatively limited genetic footprints.

Analysis
D) Australasia: The support of the significantly best-fitting model was high. Nevertheless, the only competing model that received additional moderate (but non-significant) support is notable in that it considered southern Polynesian populations (New Zealand) as basal and northern Polynesian populations (Hawaii) as the oldest descendent, while the remaining topology was the same as in the selected model. This provides strong support of the chronology of the serial colonisation of the Australasian region overall: starting in Polynesia and successively reaching Asia first, then eastern Australia, and finally western Australia. However, in addition to the major counter-clockwise colonisation of entire Australasia, the possibility that there might have been some translocations also in a clockwise manner specifically across the Pacific Islands may not be excluded. E) Polynesia: A posterior probability of 0.5 may indicate that the significantly best fitting model, i.e. admixture between populations from north-western South America and western North America, does not capture all demographic events that shaped the genetic relationship among the populations. Indeed, the model that considered admixture between central-east South American and western North American origins also received non-zero, but non-significant support. This may indicate multiple independent historic introductions from South America to Polynesia in addition to (presumably later) introductions from North America.
F) South-east Africa: In addition to the significantly best fitting model of an admixture scenario between west African and Polynesian populations, the model that considered a common ancestor of west African and central-east South American populations contributing to admixture with Polynesian origins as the source of cluster 11 also received non-zero yet non-significant support. This overall strongly supports a west African ancestry, and may indicate that present west African populations have diverged genetically since their original introduction. This may have taken place in the course of a geographic shift from a potentially more southwards region colonised first within Africa, given that South African records of BSF pre-date those from west Africa.

Table S13. Estimates of posterior distributions of population genetic parameters inferred from ABC analyses.
Summary statistics for the posterior distributions of population genetic parameters and the mutation model for the microsatellite markers. Sub-tables A-G correspond to the seven separate analyses and sample selections (N1-N6) as described in Table S11. Additional parameters for the shown best fitting models are denoted as follows: t1-t5: individual splits going back in time (estimated generation times), see also Figure S6; r: admixture rate between two indicated sources, see also  Weighted linear regression statistics (see Methods for details) of cross-locus inbreeding coefficients FIS for three groups of populations (see Table S3), specified according to captive populations belonging to clusters 1 -4 (domesticated), global wild populations (wild), and wild-derived captive populations grouping in clusters 5 -16 (captive wild-derived).
The overall model indicated that inbreeding coefficients significantly differed among the three groups (F2 = 16.43, p < 0.001). Estimates of post-hoc Tukey contrasts are reported on the response scale.  Locus IDs and motif lengths as in Table S2, plus the maximum expected range of motif copies as used for the simulations. Figure S1: Discriminatory power of the novel microsatellite marker set for Hermetia illucens genotyping.

Group comparison
Based on the present dataset of 2,862 unique black soldier fly multilocus genotypes, boxplots of the numbers of distinct multilocus genotypes (y-axis) for increasing numbers of any arbitrarily combined loci (x-axis) were calculated. Accordingly, eight arbitrarily chosen loci of the novel microsatellite set resulted in a median discriminatory power of > 97% for distinguishing unique individual multilocus genotypes in the present dataset.

Figure S2. Significant deviations from Hardy-Weinberg equilibrium of individual microsatellite loci tested within populations.
All 150 black soldier fly populations are shown individually (rows; see Table S3) for all 15 loci (columns). P-values were adjusted by accounting for 15 simultaneous tests within populations, i.e. with an α-level of p = 0.0033, instead of a global adjustment (which might have been too conservative). Nonsignificant values are highlighted in light blue, significant values are highlighted in dark-blue.

Figure S3. Computational details on genetic cluster analyses and retaining discriminatory functions for visualisation.
A) Inference of the optimal number of clusters of worldwide Hermetia illucens populations using the snapclust function of the adegenet package based on Kullback Information Criterion (KIC) goodnessof-fit statistics for model selection (convergence was verified and only the informative range of K between 1-40 is shown). B) Evaluation of the optimal number of discriminatory functions to retain in discriminant analyses of principal components (DAPC) of the 16 genetically distinct clusters (Figure 1) based on the optim.a.score function of the adegenet package to interpolate optimally performing alphascores (here: 65 PCs). Further, we assessed PCs using the xvalDapc function of the adegenet package, as exemplarily depicted in panels C) and D). First, a larger range in looser intervals was inspected (6 runs with 30 replicates each) and then individual PCs were inspected within a more plausible range of 20-150 (six runs with 50 replicates each). Here, the optimal number of PCs identified ranged from 68-78, and lowest mean square errors (MSE) and values of highest mean assignment success (HMS) across runs both identified 68 as the optimal number of PCs retained in the analysis shown in Figure 1 (MSE: 0.088; HMS: 0.914).

A B
C D Figure S4. Global population genetic patterns of Hermetia illucens according to provenance (wild vs. captive) nested within subcontinent of origin.
Discriminant Analysis of Principal Components based on all 2,862 black soldier fly multilocus genotypes. Individual genotypes were analysed within original population samples (see Table S3). Captive origins are labelled as asterisks, while wild-sourced origins are labelled as dots. Colour allocation is based on original subcontinent origins, following Figures 1, 2 & 3 (see also Table S4). Panel A shows axes 1 and 2, and panel B shows axes 2 and 3. Optimal number of principal components retained were cross-validated, and the relative explained variance for the first three axes is indicated (see smaller integrated boxes).

Figure S5. Neighbour-joining tree based on population pairwise FST across 150 Hermetia illucens populations.
Population provenance is indicated by symbols on each tip label, while colours indicate subcontinental origins as specified in the legend (congruent with Fig. 2 and 3). Labelled clades refer as follows to the groups that were mainly identified by cluster analysis (Fig. 1, Table 2): i) Exclusively captive clusters 1 -3 (domesticated North American origin); ii) Exclusively (but one) captive clusters 4 -5 (hybrids and introgressants between North American domesticated origins and Asian naturalised wild populations of cluster 6); iii) Clusters 15 and 16 (indigenous Central and North American populations); iv) Clusters 12, 13 and 14 (indigenous South American and southern Central American, as well as naturalised west African populations); v) Cluster 11 (south-east African naturalised populations); vi) Cluster 6 (naturalised Asian populations, mixed provenances); vii) Clusters 7, 8, 9 and 10 (Australian and western European naturalised populations and Caribbean).

Figure S6. Demographic inference with ABC.
Graphical results of competing demographic models of seven independent analyses are depicted in panels A-G (see Tables S11-S13 for details of addressed models and further results). In each panel, barplots on the left show posterior probabilities with error bars denoting 95% confidence intervals. The significantly best-fitting model is highlighted with an asterisk. Plots on the right show the topologies of the best-fitting model for each of the seven independent ABC analyses. Abbreviations in populationlabels refer to subcontinents (AF: Africa; AS: Asia; AU: Australia; CA: Central America; EU: Europe; NA: North America; SA: South America), and specific genetic clusters (1 -16) are highlighted by coloured numbers based on Figure 1.

Figure S7. Detection of hybrids and backcrosses: a west African case of introgression.
Inferences of hybrids and backcrosses based on ancestry coefficients for two west African wild populations that were distinct from the regionally naturalised cluster 13 and exhibited admixture between cluster 2 and 16 according to Figures 1 & 3. Captive populations of cluster 2 were introduced to a farming facility in their vicinity two years prior to sampling. Thus, the two wild populations suspected to be admixed were compared to their two traceable European sources and to the two geographically closest wild populations assigned to the west African naturalised cluster 13 (as most probable parental groups). Independent of previous cluster allocations for each sampled population (based on the majority of individuals), all individuals were re-assigned to either parental group, F1 hybrids (balanced admixture) or backcrosses with either parental group (0.75 : 0.25 admixture).

Figure S8. Detection of hybrids and backcrosses: case-specific analyses of central-east African populations and predominantly farmed populations from Asia.
Inferences on hybrids and backcrosses based on ancestry coefficients for potentially introgressed groups of interest relative to recently introduced domesticated clusters 1, 2 and 3 ( Figure 1). A: all African populations allocated to cluster 16 (by the majority of individuals), including south-east African naturalised cluster 11 as the most probable second parental group; B: entire clusters 4 and 5 (all but one population reared in captivity), including the Asian naturalised cluster 6 as the most probable second parental group.
Independently of previous cluster allocations for each sampled population, all individuals were reassigned to either parental group, F1 hybrids (balanced admixture) or backcrosses towards either parental group (0.75 : 0.25 admixture). Since cluster membership of original populations was based on the majority of respective individuals, some individuals from populations of supposedly parental ancestry were re-allocated to admixed groups in this analysis. This was particularly pronounced in European captive populations, where recent introgression from both African and Asian origins appears to be frequent; see Figure 1C and main text in the paper.