Rapid, large-scale species discovery in hyperdiverse taxa using 1D MinION sequencing

Srivathsan, Amrita; Hartop, Emily; Puniamoorthy, Jayanthi; Lee, Wan Ting; Kutty, Sujatha Narayanan; Kurina, Olavi; Meier, Rudolf

doi:10.1186/s12915-019-0706-9

Research article
Open access
Published: 29 November 2019

Rapid, large-scale species discovery in hyperdiverse taxa using 1D MinION sequencing

Amrita Srivathsan¹,
Emily Hartop^2,3,
Jayanthi Puniamoorthy¹,
Wan Ting Lee¹,
Sujatha Narayanan Kutty^1,4,
Olavi Kurina⁵ &
…
Rudolf Meier ORCID: orcid.org/0000-0002-4452-2885^1,4

BMC Biology volume 17, Article number: 96 (2019) Cite this article

9201 Accesses
80 Citations
42 Altmetric
Metrics details

Abstract

Background

More than 80% of all animal species remain unknown to science. Most of these species live in the tropics and belong to animal taxa that combine small body size with high specimen abundance and large species richness. For such clades, using morphology for species discovery is slow because large numbers of specimens must be sorted based on detailed microscopic investigations. Fortunately, species discovery could be greatly accelerated if DNA sequences could be used for sorting specimens to species. Morphological verification of such “molecular operational taxonomic units” (mOTUs) could then be based on dissection of a small subset of specimens. However, this approach requires cost-effective and low-tech DNA barcoding techniques because well-equipped, well-funded molecular laboratories are not readily available in many biodiverse countries.

Results

We here document how MinION sequencing can be used for large-scale species discovery in a specimen- and species-rich taxon like the hyperdiverse fly family Phoridae (Diptera). We sequenced 7059 specimens collected in a single Malaise trap in Kibale National Park, Uganda, over the short period of 8 weeks. We discovered > 650 species which exceeds the number of phorid species currently described for the entire Afrotropical region. The barcodes were obtained using an improved low-cost MinION pipeline that increased the barcoding capacity sevenfold from 500 to 3500 barcodes per flowcell. This was achieved by adopting 1D sequencing, resequencing weak amplicons on a used flowcell, and improving demultiplexing. Comparison with Illumina data revealed that the MinION barcodes were very accurate (99.99% accuracy, 0.46% Ns) and thus yielded very similar species units (match ratio 0.991). Morphological examination of 100 mOTUs also confirmed good congruence with morphology (93% of mOTUs; > 99% of specimens) and revealed that 90% of the putative species belong to the neglected, megadiverse genus Megaselia. We demonstrate for one Megaselia species how the molecular data can guide the description of a new species (Megaselia sepsioides sp. nov.).

Conclusions

We document that one field site in Africa can be home to an estimated 1000 species of phorids and speculate that the Afrotropical diversity could exceed 200,000 species. We furthermore conclude that low-cost MinION sequencers are very suitable for reliable, rapid, and large-scale species discovery in hyperdiverse taxa. MinION sequencing could quickly reveal the extent of the unknown diversity and is especially suitable for biodiverse countries with limited access to capital-intensive sequencing facilities.

Introduction

In 2011, the former president of the Royal Society, Robert May, wrote that “[w]e are astonishingly ignorant about how many species are alive on earth today, and even more ignorant about how many we can lose [and] yet still maintain ecosystem services that humanity ultimately depends upon.” [1]. Little has changed since then and > 80% of all extant animal species remain unknown to science [2]. Most of these unknown species belong to hyperdiverse and species-rich invertebrate clades. They are ubiquitous, contain most of the multicellular animal species, and often occur in great abundance. However, research on the species diversity of such clades is underdeveloped because it requires the examination of large numbers of specimens. These specimens have to be grouped into species before they can be either identified (if they belong to a known species) or described (if they are unknown to science).

In invertebrates, species discovery often starts with obtaining specimens via bulk sampling methods. In insects, one of the most widely used methods is Malaise trapping. Such traps routinely collect thousands, or even tens of thousands, of specimens per site and week; i.e., sorting all specimens to species level virtually never happens and the world’s natural history museums store billions of unsorted specimens. Species-level sorting is usually restricted to a few taxa with small to moderate numbers of specimens. It is accomplished in two stages. The first is grouping specimens into easily identifiable major taxa (e.g., major groups of beetles, flies, wasps). This type of pre-sorting is usually accomplished by parataxonomists with basic training in morphology (e.g., students). The main challenge is the second sorting stage; i.e., sorting to species level. This work is best carried out by taxonomic experts whose techniques are, however, only effective for taxa that have fairly small numbers of specimens and species. In contrast, large, hyperdiverse and abundant taxa are ill-suited because they require dissection and microscopic study of many specimens. An alternative to species-level sorting by taxonomists is a hybrid approach that combines rapid pre-sorting to “morpho-species” by parataxonomists with subsequent verification of the morpho-species via DNA barcodes that are obtained for a few specimens for each morpho-species [3]. DNA barcodes are only obtained for few specimens because it would be too time-consuming and expensive to generate them for all specimens using the traditional DNA barcoding pipelines that require formal DNA extractions and Sanger sequencing [4]. Unfortunately, this widely used hybrid approach has three problems. Firstly, species-level sorting by parataxonomists is very imprecise [5, 6]. Secondly, small-scale DNA barcoding tends to overlook morphologically cryptic species. Thirdly, the hybrid approach requires a lot of manpower for morpho-species sorting.

An alternative approach to species discovery is the ‘reverse workflow’ of Wang et al. (2018) [4]. Here, every specimen in a sample is DNA barcoded with minimal or no damage to the specimen [4, 7, 8] using simplified DNA extraction protocols and Illumina sequencing [9]. After barcoding, the specimens are grouped into molecular operational taxonomic units (mOTUs) that in most cases represent species [4]. The confirmation of these mOTUs as species comes last. Taxonomic experts use morphology to study a subset of the specimens that were pre-sorted to putative species based on DNA sequences. The selection of the specimens can be guided by the genetic distance between individuals [3]. This “reverse workflow” has the advantage that species-level sorting relies on DNA sequencing which can be automated. It also associates morphologically dissimilar males, females, and immature specimens that belong to the same species [7]. However, barcoding all specimens in a sample is unrealistically expensive with traditional Sanger sequencing. The implementation of the reverse workflow thus requires more cost-effective sequencing solutions that are now provided by high-throughput sequencing platforms (e.g., Illumina, Nanopore, PacBio: [4, 8, 10,11,12,13]). For example, tens of thousands of specimens can be barcoded on a single lane of Illumina HiSeq with the total cost of a barcode being as low as 0.17 USD (including PCR cost, see discussion in Wang et al., 2018 [4]). However, due to read length restrictions, barcodes obtained with Illumina are < 400 bp and new solutions for obtaining full-length barcodes based on PacBio [10] or MinION [14] sequencing have only recently emerged.

Unfortunately, barcoding with Illumina and PacBio sequencing has some downsides. Firstly, both technologies are only cost-effective if > 10,000 specimens are simultaneously barcoded because the cost of flowcells is high. Secondly, sequencing must usually be outsourced; i.e., amplicon pools have to be shipped to sequencing facilities. This is not a major concern in developed countries, but it is often a problem for species discovery research in countries that lack capital-intensive, high-throughput sequencing facilities or have restrictive regulations with regard to the export of genetic material. It would thus be desirable to have alternative sequencing techniques that are fast, scalable, cost-effective, and require low initial investment. Such solutions would be particularly useful if barcoding could be accomplished under field conditions and/or by citizen scientists [15,16,17,18].

Oxford Nanopore’s MinION has the potential to be such a solution. It is a low-cost, portable device and delivers real-time sequencing. However, it unfortunately still generates error-prone data (ca. 10–15% [19]) at a fairly high cost per base pair. Therefore, its use and reliability for large-scale specimen barcoding remains poorly explored. A first step toward the use of MinION for barcoding was the recent demonstration that 500 DNA barcodes can be obtained using one flowcell of MinION and 1D² sequencing [14]. The study increased the throughput of one MinION flowcell by one order of magnitude compared to existing protocols. However, the scale was arguably still not sufficient for large-scale species discovery where thousands of specimens have to be processed. Furthermore, the experiment used 1D² sequencing, which requires complicated and time-consuming library preparation techniques and access to computer servers for base-calling. Here, we test whether the more straightforward, but less accurate, 1D sequencing can be used for large-scale species discovery.

Improved species discovery techniques are particularly needed for hyperdiverse clades of invertebrates that have many species in the tropics. A good example are insects whose diversity is concentrated in four hyperdiverse insect orders: Coleoptera (beetles), Diptera (midges and flies), Hymenoptera (bees, wasps, and ants), and Lepidoptera (moths and butterflies). Species estimates for all Insecta vary between 3 and 13 million (reviewed by Stork, 2018 [20]) with only ca. 1,000,000 currently described [21]. Historically, Coleoptera has been considered the most species-rich order of insects which is said to have led the evolutionary biologist J. B. S. Haldane to remark that the creator must have had an “inordinate fondness for beetles.” [22]. However, it now appears that the impression that Coleoptera is the most species-rich order may have been due to an inordinate fondness of taxonomists for beetles. Recent studies suggest that Diptera and Hymenoptera may be more species-rich. For example, Forbes et al. [23] proposed that Hymenoptera contained more species than either Diptera or Coleoptera based on parasite host ratios for Microhymenoptera. Similarly, a large barcoding study of Canadian insects found that Hymenoptera and Diptera together accounted for two thirds of the 46,937 molecular operational units found (in the form of BINs or Barcode Index Numbers [24]). The study predicted that one dipteran family alone, gall midges (Cecidomyiidae), may have 16,000 species in Canada. Once extrapolated to a worldwide scale, the authors estimated that 1.8 million of the 10 million predicted insect species could be cecidomyiids [25]; i.e., a single family of Diptera would far surpass the number of described beetle species. Other studies similarly hint at the extraordinary richness of Diptera. For example, the Zurqui All Diptera Biodiversity Inventory (ZADBI) of a single site in Costa Rica was heavily reliant on specimens collected with two Malaise traps over 1 year [26]. Only 41,001 specimens (a small fraction of the hundreds of thousands collected) were studied by taxonomic experts [27]. These specimens belonged to 4332 species of Diptera, of which 800 were Cecidomyiidae and 404 Phoridae [27], the fly family of focus here.

Phoridae, or scuttle flies, is a family of true flies with approximately 4300 described species [28]. Currently, only 466 species of phorids have been described for the Afrotropical Region [28] while Henry Disney, a world expert on the family, has recorded 75 species of phorids in his suburban garden in Cambridge alone [29]. Similarly, the BioSCAN project in Los Angeles recorded up to 82 species in city backyards [29]. These numbers make it very likely that the Afrotropical fauna is very large and currently vastly understudied. But not all phorid taxa are equally poorly sampled. The main obstacle to understanding phorid diversity is Megaselia Rondani which contains > 1600 of the 4300 described species. This makes Megaselia “one of the largest, most biologically diverse and taxonomically difficult genera in the entire animal kingdom” [30]. In groups like Megaselia, the obstacles to completing species discovery with traditional methods appear insurmountable. Extremely large numbers of specimens are routinely collected which can belong to very large numbers of species. This makes sorting such samples into species-level units using traditional workflows very labor-intensive. Rare and new species are often hidden among very large numbers of common and described species. The rare species cannot be found without the microscopic study of thousands of specimens for which prodigious notes have to be taken. Detailed drawings of male genitalia must be prepared (essential for Megaselia)—often based on dissections and slide mounts. This traditional workflow thus discourages all but the most tenacious taxonomists from taking up the study of hyperdiverse genera within insects.

Here, we test whether 1D MinION sequencing can help to reveal phorid diversity more comprehensively by relegating the sorting to species level to sequencing. MinION sequencing is here applied to ca. 30% of the phorid specimens that were collected in a single Malaise trap in Kibale National Park, Uganda. We describe how we processed ~ 8700 specimens, obtained ~ 7000 accurate barcodes, and found > 650 putative species. All this was accomplished using a workflow that would take less than a month.

Results

MinION-based DNA barcoding

The experiment was designed to obtain full-length COI barcodes via tagged amplicon sequencing for two sets of specimens. A total of 8699 phorid flies were processed (Set 1: 4275; Set 2: 4519; 95 specimens were duplicated in both sets) (Fig. 1). In order to assess amplification success rates, a subset of PCR products for each of the ninety-two 96-well plates were verified with agarose gels. Amplification success rates were estimated to be 86% and 74% for the two sets of specimens (80.7% overall); i.e., we estimated that > 3600 and > 3300 DNA barcodes should be obtainable via MinION sequencing given that gels tend to underestimate amplification success rates for weak amplicons that cannot be reliably visualized with commercial dyes (Table 1). The PCR products for each set were pooled and sequenced using MinION (set 1: 7,035,075; set 2: 7,179,121 1D nanopore reads). Both sets were sequenced in two MinION runs. The first run for each set was based on the pooled PCR products for all specimens in the set. It generated 3,069,048 and 4,853,363 reads, respectively. The results of the first run were used to estimate coverage for each PCR product. Products with weak coverage (≤ 50×) were re-pooled and resequenced (set 1: 2172 amplicons; set 2: 2211 amplicons). This added 3,966,027 and 2,325,758 reads to each set and improved the coverage of many low-coverage barcodes (Fig. 2).

Table 1 Number of reads and barcodes generated via MinION sequencing

Full size table

The combined data were processed using an improved version of a bioinformatics pipeline introduced in Srivathsan et al. [14]. The improvements led to a higher demultiplexing rate (14% increase for set 1: 898,979 vs. 787,239 reads; 9% increase for set 2: 647,152 vs. 593,131 reads) and faster demultiplexing (10× using 4 cores: demultiplexing in 9 min vs 87 min for one of the datasets).

Assessment of demultiplexing accuracy

We indirectly assessed the accuracy of the demultiplexing pipeline by testing whether reads would be incorrectly demultiplexed into bins belonging to unused tag combinations. This happened for a very small proportion of reads (0.23%: 2054 of 900,698 reads in set 1; 0.44%: 2837 of 649,587 reads in set 2). Note that such low error rates are unlikely to yield poor quality barcodes given that the average coverage per amplicon was 210× (set 1) and 143× (set 2). Surprisingly, 37% and 69% of the incorrectly demultiplexed reads were due to one tag: GTCCAACTTCAGT although the edit distances between all tag-pairs were high (≥ 5 bp); i.e., it is currently unclear whether the underperforming tag was due to a primer synthesis issue, systematic sequencing bias, or a wet-lab problem (Additional file 1: Figure S1). Out of caution, we provided four additional tag sequences that can be used as replacements (Additional file 2).

Barcode calling

Demultiplexing all data and calling preliminary barcodes generated 3797 and 3476 preliminary “MAFFT barcodes” with ≥ 5× coverage and < 1% ambiguous bases. These barcodes were subjected to correction using RACON [31] which yielded the same number of “RACON barcodes”. We overall obtained 7221 MAFFT and RACON barcodes. These preliminary barcodes still contained indel and substitution errors that were corrected with an amino acid correction pipeline that was first implemented in Srivathsan et al. [14]. It yielded 7178 AA-corrected MAFFT barcodes (“MAFFT+AA”) and 7194 AA-corrected RACON barcodes (“RACON+AA”). This pipeline rejects barcodes that have five or more consecutive indel errors so that there are fewer corrected than uncorrected barcodes. Finally, the two sets of corrected barcodes were consolidated. This yielded a set of 7155 consolidated, final barcodes. During this process, MAFFT+AA and RACON+AA barcodes are aligned but only those consolidated barcodes are accepted that do not require the insertion of indels because AA-corrected barcodes are expected to be indel-free. The overall barcoding success rate was thus 82.3% (7155 barcodes for 8699 specimens). This was close to the expected 80.7% success rate based on gel electrophoresis; i.e., MinION sequencing consistently produced sequence data for successfully amplified products.

A subsequent contamination check via BLAST revealed that of the 7155 barcodes, 96 barcodes were unlikely to be phorid flies (< 1.5%). These included 53 barcodes with matches to Wolbachia, Rickettsia, nematodes, human, and occasionally insects from other families (e.g., Drosophila, Hemipyrellia). Another 43 were incorrectly pre-sorted by parataxonomists and did not belong to Phoridae. After removal of these, we retained 7059 confirmed phorid barcodes. Lastly, we inspected the reads obtained for the 92 negative PCR controls (1 per microplate). Five negatives yielded MAFFT barcodes. Four of these had a > 97% match to non-phorids (two humans, one fish, one mollusk) and were eliminated. One low coverage (13×) negative survived all filters and matched phorid COI. It was removed after ascertaining that it did not impact the accuracy of the remaining barcodes obtained from this plate. This could be tested by comparing the MinION barcodes for the plate with Illumina barcodes obtained from different PCR products for the same DNA extraction plate (see below).

Comparison of MinION barcodes with Illumina barcodes

Illumina barcodes were obtained for 6251 of the 7059 specimens with MinION barcodes using a different set of primers that amplified a 313-bp subset of the full-length barcodes; i.e., comparison with MinION sequencing is based on 48% of the MinION sequence. The comparisons showed that the uncorrected MAFFT and RACON barcodes had an accuracy of 99.61% and 99.51% (Table 2). Correction of these barcodes with the amino acid correction pipeline improved the accuracy considerably (> 99.9% in all cases). The barcodes were corrected after optimizing a parameter that is here called “namino” because it specifies the length of the AA motifs that is used for correction. Overall, namino = 2 was found to optimize overall accuracy while minimizing the number of inaccurate barcodes. We found that MAFFT+AA barcodes were more accurate than RACON+AA barcodes, but MAFFT+AA barcodes contained a much higher number of ambiguous nucleotides (Fig. 3). When RACON+AA and MAFFT+AA barcodes were consolidated, the resulting “consolidated barcodes” were found to be highly accurate (99.99%) and containing few ambiguous bases (median = 0.3%, average = 0.46%). These accuracy rates were obtained after excluding < 0.5% specimens that had > 3% divergence with corresponding Illumina barcodes. Such barcode discrepancies are likely due to wet-lab errors (e.g., amplification of residual contaminating signals, see details in methods). Note that such errors are regularly observed in large-scale barcoding projects. For examples, a recent study by Hebert et al. [10] using PacBio Sequel for DNA barcoding found that 1.5–1.6% of the specimens had high abundances of non-target sequences.

Table 2 Accuracy of MinION as assessed by Illumina barcodes. The MinION barcodes were trimmed to the 313 bp that were sequenced using Illumina. The overall optimal strategy is “Consolidated (namino = 2)”. Optimal congruence values are highlighted in bold

Full size table

Comparison of MinION and Illumina barcodes at a mOTU level

Given that the barcodes were obtained for the purpose of species richness estimates, we compared the mOTU richness estimated based on the different barcode sets against those obtained with Illumina barcodes. For this purpose, we trimmed the MinION barcode sets to the 313-bp fragment that was sequenced using Illumina. mOTU richness was very similar (Table 2). However, comparison of mOTU numbers alone does not imply that the same specimens were grouped into mOTUs obtained with the MinION and Illumina barcodes. One also has to assess whether the contents of the mOTUs are identical. We thus calculated the match ratio for the datasets (3% clustering threshold). We found that all five barcode sets (MAFFT, RACON, MAFFT+AA, RACON+AA, and consolidated barcodes, namino = 2) had high match ratios (> 0.95). The consolidated and RACON barcodes performed best with match ratios of > 0.98 (consolidated barcodes: 0.991, RACON: 0.981). However, upon closer inspection the multiple sequence alignment (MSA) for the RACON barcodes contained indels while the consolidated barcodes are insertion-free and contain next to no deletions. The only exceptions were single bp deletions found in the first 20 bps of the barcode for 3/7059 specimens. The largest number of indels was found in the MSA of uncorrected RACON barcodes which indicated that the RACON barcodes retained a fair number of indel errors; i.e., RACON barcodes may not be of sufficient quality for submission to sequence databases. We thus recommend the usage of consolidated barcodes. This recommendation is based on maximizing per base accuracy (see below), yielding high-quality alignments, and revealing very similar mOTU diversity and composition (high match ratio) when compared to Illumina barcodes.

Given the different length of MinION and Illumina barcodes, we also compared the mOTUs obtained by full-length MinION barcodes (658 bp) with the mOTUs obtained with Illumina barcodes for those specimens for which both types of data were available. The match ratio was again high (0.951). For incongruent clusters, we analyzed at which distance threshold they would become congruent. We found that all clusters were congruent within the 1.9–3.7% range; i.e., the remaining 345 bp are not showing a major deviation from the signal obtained from the 313-bp fragment (Additional file 3). We next characterized if there was an increase in error in the 345-bp stretch of the MinION sequence that could not be directly compared to Illumina sequence: if this were the case, we would expect that spurious base calls would increase genetic distances for specimens. However, we found the opposite: in 18 of 21 cases, the threshold was lowered, i.e., the 345 additional nucleotides reduced the minimum distance in the cluster (Additional file 3).

Species richness estimation

After these quality checks, we proceeded to characterize the diversity of phorid flies based on the MinION barcodes of highest accuracy based on comparison with Illumina; i.e., the consolidated barcodes (namino = 2). We obtained a mean of 660 mOTUs when the thresholds were varied from 2 to 4% (2%: 705, 3%: 663, 4%: 613). These thresholds are widely used in the literature, but also supported by empirical data from GenBank. GenBank has 12,072 phorid sequences with species-level identifications belonging to 106 species. The intraspecific variability is overwhelmingly < 3% (> 95% of pairwise distances) and the match ratios between mOTUs and species identifications from GenBank are maximized for clustering thresholds of 2–3% (Additional file 1: Figure S2 and S3). In addition to clustering the barcodes based on a priori thresholds, we also used species delimitation based on Poisson Tree Processes (PTP) to estimate the number of species for the phorids from the trap. It yielded even higher richness estimate of 747 putative species than the threshold-based methods. Lastly, we used species accumulation and Chao 1 curves (mOTUs at 3%) to estimate the full phorid diversity of the Ugandan site. We find that the curves have yet to reach a plateau, but the shape of the curves suggests an estimated diversity of ~ 1000 species of Phoridae at a single field site in Uganda, collected by one Malaise trap (Fig. 4).

Paralogy check

We found that the Illumina barcodes were translatable which would not be expected for sequences obtained for old nuclear copies of mitochondrial DNA sequences (NuMTs). In addition, the congruence between the mOTUs estimated based on sequences for two different amplicons of different lengths and different primer specificity is very high. This would not be expected if NuMTs were regularly amplifying. We also scrutinized the read sets for Illumina amplicons for the presence of secondary phorid signal. We found such signal in 7% (30) of the 406 mOTUs with multiple specimens. Such signal can be caused by paralogs or low-level lab contamination when small amounts of template from one well contaminate the PCR reaction in another well. We suspect that much of the secondary signal is caused by the latter, but it is arguably more important that the level of secondary signal is sufficiently low that it could not significantly lower the overall species richness estimate of the site even if all secondary signal was caused by paralogy (Additional file 4).

Congruence with morphology

We conducted a morphological check of 100 randomly selected clusters (> 1500 specimens). We found that 6 of the 100 clusters contained, among other specimens, a single misplaced specimen. There was one cluster of four specimens that appeared to consist of a mixture of three morpho-species. This implies that 9 of the > 1500 examined barcoded specimens were misplaced due to lab contamination. This morphological check took ca. 30 h. mOTUs based on barcodes are expected to lump those species that recently speciated and split species with well-differentiated populations [32]. This means that taxonomists working with mOTUs should check for signs of lumping and splitting in closely related taxa. This requires morphological examination of a subset of specimens whose selection is guided by genetic information. This is aided by keeping closely related mOTUs physically together. In the case of phorids, this can be done by slide mounting representative specimens from the sub-clusters. This is here illustrated by describing one species based on a complex cluster.

New species description

During the morphological work, a distinctive new species of Megaselia was found. A mOTU-specific haplotype network was constructed and informed on which specimens should be studied based on morphology. The new species is here described. To continue reducing redundancy and ambiguity in species descriptions, the description of this species excludes the character table from the method previously established for Megaselia [33,34,35] and uses a molecular and photographic description. Photographs are a key element in descriptions for large, diverse groups [36], where verbose descriptions require much time while remaining insufficiently diagnostic. Most characters that would have been in table form are clearly visible in the photographs provided.

Megaselia sepsioides Hartop sp. n.

urn:lsid:zoobank.org:pub: ED268DF2-A886-4C31-A4FB-6271C382DECE

DNA barcode for UGC0005996 (GenBank accession: MN403533)

actttatattttatttttggagcttgagctggaatagtaggtacttccttaagaatcataattcgtgctgaattaggacacccaggagcacttattggtgatgaccaaatttataatgtgattgttactgcacatgcttttattataattttttttatagtaatacctattataataggaggttttggtaattgacttgtacctttaatattaggagccccagatatggcattccctcgaatgaataatataagtttttgaatattacctccttctttaactcttttattagccagaagtatagtagaaaatggagctggaactggttgaacagtttatcctcctttatcttctagaatcgctcatagtggagcttctgttgatttagcaattttctctcttcatttagctggaatttcatctattttaggagctgtaaattttattacaacaattattaatatacgatcatcaggtattacatttgaccgaatacctctatttgtttgatctgtaggtattacagctttattgctactcttatcacttcctgttttagctggtgctattacaatactattaacagaccgaaattttaatacttcattttttgacccagcaggaggaggagatccaattttataccaacatttattc.

Diagnosis

Well characterized by the following combination of characters: with unique semi-circular expansion with modified peg-like setae on the forefemur (Fig. 5b), hind tibia strongly constricted (Fig. 5d, e), and abdomen narrow and elongate. Three haplotypes were examined; variations in setation were observed between the main cluster and two haplotypes (Figs. 6 and 7). Only single specimens of the two distinct haplotypes were available; more specimens would be necessary to determine if these are eventually recognized as distinct species or fall within a continuum of intraspecific variation.

Material examined

Holotype. ♂, UGANDA: Kamwenge, Kibale National Park (00° 33′ 54.2″ N 30° 21′ 31.3″ E, 1530 m), iii-xii.2010, Olavi Kurina & Swaibu Katusabe (LKCNHM UGC0005996).

Paratypes. 7 ♂, UGANDA: Kamwenge, Kibale National Park (00° 33′ 54.2″ N 30° 21′ 31.3″ E, 1530 m), iii-xii.2010, Olavi Kurina & Swaibu Katusabe (LKCNHM: UGC0012899, UGC0012244, UGC0012568, UGC0003003, UGC0005864, UGC0012937, UGC0012971).

Distribution

Known from a single site in Kibale National Park, Uganda.

Biology

Unknown.

Etymology

Name suggested by Yuchen Ang for the sepsid-like (Diptera: Sepsidae) foreleg modification.

Discussion

Remarkably high diversity of Phoridae in Kibale National Park

The full extent of the world’s biodiversity is poorly understood because many hyperdiverse taxa are data-deficient. One such hyperdiverse clade is phorid flies. We here reveal that even a modest amount of sampling (one Malaise trap placed in Kibale National Park, Uganda) can lead to the discovery of > 650 putative species. This diversity constitutes 150% of the described phorid diversity of the entire Afrotropical region (466: [28]). Note that the ca. 7000 barcoded specimens covered in our study only represent 8 one-week samples obtained between March 2010 and February 2011. There are an additional 44 weekly samples that remain un-sequenced. We thus expect the diversity from this single site to eventually exceed 1000 species. This prediction is supported by a formal species-richness estimation based on the available data (Fig. 4). Such extreme diversity from a single site raises the question of whether these numbers are biologically plausible and/or whether they could be caused by unreliable data or species delimitation methods.

We would argue that the high numbers are both biological plausible and analytically sound. If a single garden in a temperate city like Cambridge (UK) can have 75 species and the urban backyards of Los Angeles up to 82, observing 10–15 times of this diversity at a site in a tropical National Park does not appear unrealistic. Our proposition that there are > 1000 species at a single site in Kibale National Park is further supported by the results of the Zurqui survey in Costa Rica which revealed 404 species of Phoridae without completing the species discovery process [26]. Furthermore, we are confident that our species richness estimate is not an artifact of poor data quality because the high species richness estimate is supported by both Illumina and MinION barcodes generated independently using different primer pairs. It is furthermore stable to modifications of sequence clustering thresholds which are widely used across Metazoa and here shown to be appropriate for phorid flies based on the available GenBank data [37]. Lastly, we checked 100 randomly chosen mOTUs for congruence between molecular and morphological evidence. We find that 93% of the clusters and > 99% of specimens are congruently placed (six of the seven cases of incongruence involved single specimens). This is also in line with congruence levels that we observed previously for ants and odonates [4, 7].

The high species richness found in one study site inspired us to speculate about the species diversity of phorids in the Afrotropical region. This is what Terry Erwin did when he famously estimated a tropical arthropod fauna of 30 million species based on his explorations of beetle diversity in Panama [38]. Such extrapolations are arguably useful because they raise new questions and inspire follow-up research. Speculation is inevitable given that it remains remarkably difficult to estimate the species richness of diverse taxa [39]. This is particularly so for the undersampled Afrotropical region which comprises roughly 2000 squares of 100 km² size. In our study, we only sampled a tiny area within one of these squares and observed > 650 species which likely represent a species community that exceeds > 1000 species. Note that Malaise traps only sample some phorid species because many specialist species (e.g., termite inquilines) rarely enter such traps. In addition, the estimated 1000 species that can be caught in a single Malaise trap are also only a subset of the species occurring in the remaining habitats of the same 100-km² grid cell. Overall, it seems thus likely that the 100 km² will be home to several thousand species of phorids. If we assume that on average each of the two thousand 100-km² grid cells of the Afrotropical Region has “only” 100 endemic phorid species, the endemic phorids alone would contribute 200,000 species of phorids to the Afrotropical fauna without even considering the contributions by the remaining species with a wider distribution. What is even more remarkable is that most of the diversity would belong to a single genus. We find that 90% of the newly discovered species and specimens in our sample belong to the genus Megaselia as currently circumscribed. Unless broken up, this genus could eventually have > 100,000 Afrotropical species. All these estimates would only be lower if the vast majority of phorid species had very wide distributions and/or the average number of species in 100-km² grid cells would be more than one order of magnitude lower than observed here. However, we consider this unlikely given that many areas of the Afrotropical region are biodiverse and cover a wide variety of climates and habitats which increases beta diversity.

Unfortunately, most of this diversity would have remained overlooked using the traditional taxonomic workflow because it is not well suited for taxa with high species diversity and specimen abundances. This means that the phorid specimens from the Kibale National Park Malaise trap would have remained in the unsorted residues for decades or centuries. Indeed, there are thousands of vials labeled “Phoridae” shelved in all major museums worldwide. Arguably, it is these unprocessed samples that make it so important to develop new rapid species-level sorting methods. We here favor sorting with “NGS barcodes” [4] because it would allow biologists to work through the Malaise trap residues in collections. These residues will almost certainly contain several times the number of species that have already been described. We foresee that there will be two stages to species discovery with NGS barcodes. The first is species-level sorting which can yield fairly accurate estimates of species diversity and abundance [4, 8]. Many biodiversity-related questions can already be addressed based on these data. The second phase is the refinement of mOTUs based on morphology with subsequent species identification (described species) or species description (new species). Given the large number of new species, this will require optimized “turbo-taxonomic” methods. Fortunately, new approaches to large-scale species description are being developed and there are now a number of publications that describe ~ 100 or more new species [36, 40,41,42].

MinION sequencing and the “reverse workflow”

MinION barcodes can be obtained without having to invest heavily into sequencing facilities. Furthermore, laboratories that are fully equipped for barcoding specimens with MinION can be mobile and operate under difficult conditions in the field or field stations [15,16,17,18]. The technology is thus likely to become important for the “democratization” of biodiversity research because the data are generated quickly enough that they can be integrated into high school and citizen scientist initiatives. Based on our data, we would also argue that MinION is now suitable for widespread implementation of the “reverse workflow” where all specimens are sequenced before mOTUs are assessed for consistency with morphology. Reverse workflow differs from the traditional workflow in that it relies on DNA sequences for sorting all specimens into putative species while the traditional workflow starts with species-level sorting based on morphology; only some morpho-species are subsequently examined with a limited amount of barcoding. We would argue that the reverse workflow is more suitable for handling species- and specimen-rich clades because it requires less time than high-quality sorting based on morphology which often involves genitalia preparations and slide mounts. For example, even if we assume that an expert can sort and identify 50 specimens of unknown phorids per day, the reverse workflow pipeline would increase the species-level sorting rate by > 10 times (based on the extraction and PCR of six microplates per day). In addition, the molecular sorting can be carried out by lab personnel trained in amplicon sequencing while accurate morpho-species sorting requires highly trained taxonomic experts. Yet, even highly trained experts are usually not able to match morphologically disparate males and females belonging to the same species (often one sex is ignored in morphological sorting) while the matching of sexes (and immatures) is an automatic and desirable by-product of applying the reverse workflow [7]. All these benefits can be reaped rapidly. One lab member can amplify the barcode of 600–1000 specimens per day (2–2.5 weeks for 8000 specimens). Obtaining DNA sequences requires ca. 1 week because it involves two cycles of pooling and sequencing on two MinION flowcells followed by two cycles of re-pooling and sequencing of weak amplicons. The bioinformatics work requires less than 1 week.

One key element of the reverse workflow is that the vials with specimens that have haplotype distances < 5% are physically kept together. This helps when assessing congruence between mOTUs and morphology. Indeed, graphical representations of haplotype relationships (e.g., haplotype networks) can guide the morphological re-examination as illustrated in our description of Megaselia sepsioides (Fig. 7). Here, eight specimens belonging to seven haplotypes were dissected and the most dissimilar haplotypes were studied in order to test whether the data support the presence of one or two species. Variations in setation were observed (Fig. 6), but they were consistent with intraspecific variation. Note that the morphological examination of clusters was straightforward because the use of QuickExtract™ for DNA extractions ensures morphologically intact specimens.

Large-scale species discovery using MinION 1D reads

Our results suggest that MinION’s 1D sequencing yields data of sufficient quality for producing DNA barcodes that can be used for large-scale species discovery. Through the development of new primer-tags, re-pooling of low-coverage amplicons, and an improved bioinformatics workflow, we here increased the barcoding capacity of a MinION flowcell by 700% from 500 specimens (Srivathsan et al., [14]: 1D² sequencing) to ~ 3500 specimens. This is achieved without a drop in accuracy because the error correction pipeline is effective at eliminating most of the (ca. 10%) errors in the 1D reads . Indeed, even the initial estimates of barcodes (“MAFFT” & “RACON”) have very high accuracy (> 99.5%) when compared to Illumina data, while the accuracy of consolidated barcodes is even higher (> 99.9%). Note that the accuracy of the barcodes obtained in this study is even higher than what was obtained with 1D² sequencing in Srivathsan et al. (99.2%) [14]. We suspect that this is partially due to improvements in MinION sequencing chemistry and base-calling, but our upgraded bioinformatics pipeline also helps because it increases coverage for the amplicons. This is welcome news because 1D library preparations are much simpler than the library preps for 1D². In addition, 1D² reads are currently less suitable for amplicon sequencing [43]. Overall, the accuracy of our MinION barcodes is also comparable with what can be obtained with PacBio Sequel ([10]: Additional file 1: Figure S5). This platform also allows for obtaining full-length barcodes, but the cost of purchasing and maintaining a PacBio sequencer is very high.

Using the workflow described here, MinION barcodes can be generated rapidly and at a low sequencing cost of < 0.35 USD per barcode. Molecular cost of PCR is 0.16 USD per reaction [4] while QuickExtract™ reagent costs 0.06 per specimen. These properties make MinION a valuable tool for species discovery whenever a few thousand specimens (< 5000) must be sorted to species. Even larger-scale barcoding projects are probably still best tackled with Illumina short-read or PacBio’s Sequel sequencing [4, 10, 11] because the barcoding cost is even ower. However, both require access to expensive sequencing instruments, sequencing is thus usually outsourced, and the users usually have to wait for several weeks in order to obtain the data. This is not the case for barcoding with MinION, where most of the data are collected within 10 h of starting a sequencing run. Another advantage of the MinION pipeline is that it only requires basic molecular lab equipment including thermocyclers, a magnetic rack, a Qubit, a custom-built computational device for base-calling ONT data (“MinIT”), and a laptop (total cost of lab < USD 10,000). Arguably, the biggest operational issue is access to a sufficiently large number of thermocyclers given that a study of the scale described here involves amplifying PCR products in 92 microplates (=92 PCR runs).

Our new workflow for large-scale species discovery is based on sequencing the amplicons in two sequencing runs. The second sequencing run can reuse the flowcell that was used for the first run. Two runs are desirable because they improve overall barcoding success rates. The first run is used to identify those PCR products with “weak” signal (=low coverage). These weak products are then resequenced in the second run. This dual-run strategy overcomes the challenges related to sequencing large numbers of PCR products: the quality and quantity of DNA extracts are poorly controlled and PCR efficiency varies considerably. Pooling of products ideally requires normalization, but this is not practical when thousands of specimens are handled. Instead, one can use the real-time sequencing provided by MinION to determine coverage, and then boost the coverage of low-coverage products by preparing and resequencing a separate library that contains only the low-coverage samples. Given that library preparations only require < 200 ng of DNA, even a pool of weak amplicons will contain sufficient DNA. This ability to re-pool within days of obtaining the first sequencing results is a key advantage of MinION. The same strategy could be pursued with Illumina and PacBio but it would take a long time to obtain all the results because one would have to wait for the completion of two consecutive runs.

Methods

Sampling

Samples were collected by a single Townes-type Malaise trap [44], in the Kibale National Park, close to Kanyawara Biological Station in the evergreen primeval forest at an altitude of 1513 m (00° 33′ 54.2″ N 30° 21′ 31.3″ E) (Fig. 4). Kibale National Park is characterized as a fragment of submontane equatorial woodland [45]. Temperatures in Kibale range from 16 to 23 °C (annual mean daily minimum and maximum, respectively) [46]. The Malaise trap was checked every week when the collecting bottle with the material was replaced by an employee of the National Park, Mr. Swaibu Katusabe [47]. Subsequently, the material was collected and transferred in accordance with approvals from the Uganda Wildlife Authority (UWA/FOD/33/02) and Uganda National Council for Science and Technology (NS 290/ September 8, 2011), respectively. The material was thereafter sorted to higher-level taxa. Target taxa belonging to Diptera were sorted to family and we here used the phorid fraction. The sampling was done over several months between 2010 and 2011. For the study carried out here, we only barcoded ca. 30% of the phorid specimens. The flies were stored in ethyl alcohol at − 20–25 °C until extraction.

DNA extraction

DNA was extracted using whole flies. The fly was first taken out from the vial and washed in Milli-Q® water prior to being placed in a well of a 96-well PCR plate. DNA extraction was done using 10 μl of QuickExtract™ (Lucigen) in a 96-well plate format, and the whole fly was used to extract DNA. The reagent allows for rapid DNA extraction by incubation (no centrifugation or columns are required). The solution with the fly was incubated at 65 °C for 15 min followed by 98 °C for 2 min. No homogenization was carried to ensure that the intact specimen was available for morphological examination.