1D MinION sequencing for large-scale species discovery: 7000 scuttle flies (Diptera: Phoridae) from one site in Kibale National Park (Uganda) revealed to belong to >650 species

Background More than 80% of all animal species remain unknown to science. Most of these species live in the tropics and belong to animal taxa that combine small size with high specimen abundance and large species richness. For such clades, using morphology for species discovery is slow because large numbers of specimens must be sorted using detailed microscopic investigations. Fortunately, species discovery could be greatly accelerated if DNA sequences could be used for species-level sorting. Morphological verification of "molecular operational taxonomic units" (mOTUs) delineated with DNA sequences could then be based on inspecting a small subset of specimens. However, this approach requires cost-effective and low-tech DNA barcoding techniques because well-equipped, well-funded molecular laboratories are not readily available in many biodiverse countries.

Results We here document how MinION sequencing can be used to reveal the extent of the undiscovered biodiversity in specimen-rich taxa such as Phoridae, a hyper-diverse family of flies (Diptera). We sequenced 7,059 specimens collected in a single Malaise trap in Kibale National Park, Uganda, over the short period of eight weeks. We discovered >650 species, which exceeds the number of phorid species currently described for the entire Afrotropical region. The barcodes were obtained using a low-cost MinION pipeline that increases the barcoding capacity per flowcell from 500 to 3,500 barcodes. This was achieved by adopting 1D sequencing, re-sequencing weak amplicons on a used flowcell, improving demultiplexing, and introducing parallelization. Comparison with Illumina data revealed that the MinION barcodes are very accurate (99.99% accuracy, 0.46% Ns) and thus yield very similar putative species (match ratio: 0.991). Morphological examination of 100 mOTUs also confirmed good congruence with molecular clusters (93% of mOTUs; >99% of specimens) and revealed that 90% of the putative species belong to a neglected megadiverse genus, i.e., Megaselia. We demonstrate for one species how the molecular data can guide the description of a new species (Megaselia sepsioides sp. nov.).

Conclusions We conclude that low-cost MinION sequencers are very suitable for reliable, rapid, and large-scale species discovery in hyperdiverse taxa. MinION sequencing can reveal the extent of the unknown diversity quickly and is especially suitable for biodiverse countries with limited access to capital-intensive sequencing facilities.


INTRODUCTION
In 2011 the former president of the Royal Society, Robert May, wrote that "[w]e are astonishingly ignorant about how many species are alive on earth today, and even more ignorant about how many we can lose [and] yet still maintain ecosystem services that humanity ultimately depends upon." [1] Little has changed since 2011 and >80% of all extant animal species remain unknown to science [2]. Most of these unknown species belong to hyper-diverse and species-rich invertebrate clades. They are ubiquitous, contain most of the multicellular animal species, and often occur in great abundance. However, research on the species diversity of such clades is slow because it requires the examination of large numbers of specimens. These specimens have to be grouped into species before they can be either identified (if they belong to a known species) or described (if they are unknown to science).

revealing very similar mOTU diversity and composition (high match ratio) when compared to Illumina barcodes.
Given the different lengths of MinION and Illumina barcodes, we also compared the mOTUs obtained with full-length MinION barcodes (658 bp) with the mOTUs obtained with Illumina barcodes for those specimens for which both types of data were available. The match ratio was again high (0.951). For incongruent clusters, we analysed at which distance threshold they would become congruent. We found that all clusters were congruent within the 1.9-3.7% range; i.e., the remaining 345 bp do not show a major deviation from the signal obtained from the 313 bp fragment (Additional File 3). We next tested whether there was an increase in error in the 345 bp stretch of the MinION sequence that could not be directly compared to the Illumina sequence: if this were the case, we would expect spurious base calls to increase genetic distances between specimens. However, we found the opposite: in 18 of 21 cases, the threshold was lowered, i.e., the 345 additional nucleotides reduced the minimum distance in the cluster (Additional File 3).

Species richness estimation
After these quality checks, we proceeded to characterize the diversity of phorid flies based on the consolidated barcodes (namino=2). We obtained a mean of 660 mOTUs when the thresholds were varied from 2-4% (2%: 705, 3%: 663, 4%: 613). These thresholds are widely used in the literature, but are also supported by empirical data from GenBank. GenBank has 12,072 phorid sequences with species-level identifications belonging to 106 species. The intraspecific variability is overwhelmingly <3% (>95% of pairwise distances) and the match ratios between mOTUs and species identifications from GenBank are maximized for clustering thresholds of 2-3% (Additional File 1: Fig S2, S3). In addition to clustering the barcodes based on a priori thresholds, we also used species delimitation based on Poisson Tree Processes (PTP) to estimate the number of species for the Ugandan phorids. It yielded an even higher richness estimate (747 putative species) than the threshold-based methods. We then used species accumulation and Chao 1 curves (mOTUs at 3%) to test whether the diversity of the Ugandan site had been exhaustively sampled. We find that all curves have yet to reach a plateau, and the shape of the curves suggests an estimated diversity of ~1,000 species of Phoridae at a single field site in Uganda, collected by one Malaise trap (Fig. 4).
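To illustrate the kind of richness estimate referred to above, the following is a minimal sketch of a Chao 1 calculation from per-mOTU specimen counts. It assumes the bias-corrected form of the estimator; the text does not state which variant or software was used, and the function name chao1 and the example counts are hypothetical.

```python
def chao1(abundances):
    """Bias-corrected Chao 1 richness estimate.

    abundances: iterable with the number of specimens in each mOTU.
    ASSUMPTION: the bias-corrected formula is used here for illustration;
    the study does not specify the exact variant.
    """
    counts = [n for n in abundances if n > 0]
    s_obs = len(counts)                      # observed number of mOTUs
    f1 = sum(1 for n in counts if n == 1)    # mOTUs with one specimen
    f2 = sum(1 for n in counts if n == 2)    # mOTUs with two specimens
    # S_obs + F1(F1 - 1) / (2(F2 + 1)); defined even when F2 == 0
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical toy sample: six mOTUs, three singletons, two doubletons.
example_counts = [1, 1, 1, 2, 2, 5]
estimate = chao1(example_counts)
```

Many singleton mOTUs relative to doubletons push the estimate well above the observed count, which is the pattern expected when accumulation curves have not yet reached a plateau.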

Paralogy check
We found that the Illumina barcodes were translatable, which is not expected for sequences obtained from old NuMTs. In addition, the congruence between mOTUs estimated from sequences for two different amplicons of different lengths and different primer specificity is very high. This would not be expected if NuMTs were regularly amplifying well. We also scrutinized the read sets for Illumina amplicons for the presence of secondary phorid signal. We found such signal in 7% (30) of the 406 mOTUs with multiple specimens. Such signal can be caused by paralogs or by low-level lab contamination when small amounts of template from one well contaminate the PCR reaction in another well. We suspect that much of the secondary signal is caused by the latter, but it is arguably more important that the level of secondary signal is sufficiently low that it would not significantly lower the species richness estimate even if all secondary signal were caused by paralogy (Additional File 4).

Congruence with morphology
We conducted a morphological check of 100 randomly selected clusters (>1,500 specimens). We found that 6 of the 100 clusters contained, among other specimens, a single misplaced specimen. There was one cluster of four specimens that appeared to consist of a mixture of three morpho-species. This implies that 9 of the >1,500 examined barcoded specimens were misplaced due to lab contamination. This morphological check took ca. 30 hours. mOTUs based on barcodes are expected to underestimate species numbers for taxa that speciated recently and to overestimate them for species with deep splits [32]. This means that taxonomists working with mOTUs should check for signs of lumping and splitting in closely related taxa. This requires morphological examination of a subset of specimens whose selection is guided by genetic information. This is aided by keeping closely related mOTUs physically together. In the case of phorids this can be done by slide-mounting representative specimens from the sub-clusters. This is here illustrated by describing one species based on a complex cluster.

New Species Description
During the morphological work, a distinctive new species of Megaselia was found. A mOTU-specific haplotype network was constructed and informed on which specimens should be studied based on morphology. The new species is here described. To continue reducing redundancy and ambiguity in species descriptions, the description of this species has excluded the character

Well characterized by the following combination of characters: with unique semi-circular expansion with modified peg-like setae on the forefemur (Fig. 5, b), hind tibia strongly constricted

in setation were observed between the main cluster and two haplotypes (Fig. 6, 7). Only single specimens of the two distinct haplotypes are available; more specimens will be necessary to determine if these are eventually removed as distinct species or fall within a continuum of intraspecific variation.

Known from a single site in Kibale National Park, Uganda.

DISCUSSION
Remarkably high diversity of Phoridae in Kibale National Park

The full extent of the world's species-level biodiversity is poorly understood because many hyperdiverse taxa are data-deficient. One such hyperdiverse clade is phorid flies. We here reveal that even a modest amount of sampling (one Malaise trap placed in Kibale National Park, Uganda) can lead to the discovery of >650 putative species. This diversity constitutes 150% of the described phorid diversity of the entire Afrotropical region (466: [28]). Note that the ca. 7,000 barcoded specimens covered in our study only represent 8 one-week samples obtained between March 2010 and February 2011. There are an additional 44 weekly samples that remain un-sequenced. We thus expect the diversity from this single site to eventually exceed 1,000 species. This prediction is supported by a formal species-richness estimation based on the available data (Fig. 4). Such extreme diversity from a single site raises the question of whether these numbers are biologically plausible and/or whether they could be caused by unreliable data or species delimitation methods.

We would argue that they are both biologically plausible and analytically sound. If a single garden in a temperate city like Cambridge (UK) can have 75 species and the urban backyards of Los Angeles 82, observing 10-15 times this diversity at a site in a tropical National Park does not appear unrealistic. Our proposition that there are >1,000 species at a single site in Kibale National Park is further supported by the results of the Zurqui survey in Costa Rica, which revealed 404 species of Phoridae without completing the species discovery process [26]. Furthermore, we are confident that our species richness estimate is not an artefact of poor data quality because the high species richness estimate is supported by both Illumina and MinION barcodes generated independently using different primer pairs. It is furthermore stable to modifications of sequence clustering thresholds, which are widely used across Metazoa and here shown to be appropriate for phorid flies based on the available GenBank data [37]. Lastly, we checked 100 randomly chosen mOTUs for congruence between molecular and morphological evidence. We find that 93% of the clusters and >99% of specimens are congruently placed (six of the seven cases of incongruence involved single specimens). This is in line with congruence levels that we observed previously for ants and odonates [4,7].

The high species richness found in one study site inspired us to speculate about the species diversity of phorids in the Afrotropical region. This is what Terry Erwin did when he famously estimated a tropical arthropod fauna of 30 million species based on his explorations of beetle diversity in Panama [38]. Such extrapolations are arguably useful because they raise new questions and inspire follow-up research. Speculation is inevitable given that it remains remarkably difficult to estimate the species richness of diverse taxa [39]. This is particularly so for the undersampled Afrotropical region, which comprises roughly 2,000 squares of 100 km² size. In our study, we only sampled a tiny area within one of these squares and observed >650 species, which likely represent a species community that exceeds 1,000 species. Note that Malaise traps only sample a subset of a local phorid fauna because many specialist species (e.g., termite inquilines) are rarely collected in such traps. Of course, the estimated 1,000 species that can be caught in a single Malaise trap are also only a subset of the species occurring in the remaining habitats in the same 100 km². Overall, it thus seems likely that the 100 km² will be home to several thousand species of phorids. If we assume that on average each of the two thousand 100 km² squares of the Afrotropical Region has "only" 100 endemic phorid species, the endemic phorids alone would contribute 200,000 species of phorids to the Afrotropical fauna without even considering the contributions by the remaining species with a wider distribution. What is even more remarkable is that most of the diversity would belong to a single genus. We find that 90% of the newly discovered species and specimens in our sample belong to the genus Megaselia as currently circumscribed. Unless broken up, this genus could eventually have >100,000 Afrotropical species. All these estimates would only be lower if the vast majority of phorid species had very wide distributions and/or the average number of species in 100 km² squares were more than one order of magnitude lower than observed here. However, we consider this somewhat unlikely given that many areas of the Afrotropical region are biodiverse and cover a wide variety of climates and habitats, which increases beta diversity.

Unfortunately, most of this diversity would likely not have been discovered using the traditional taxonomic workflow because it is not well suited for taxa with high species diversity and specimen abundances. This means that the phorid specimens from the Kibale National Park Malaise trap would have remained in the unsorted residues for decades or centuries. Indeed, there are thousands of vials labelled "Phoridae" shelved in all major museums worldwide. Arguably, it is these unprocessed samples that make it so important to develop new rapid species-level sorting methods. We here favour sorting with "NGS barcodes" [4] because it allows biologists to work on taxa that contain a very large proportion of the species on our planet and in our natural history museums. We predict that there will be two stages to species discovery with NGS barcodes. The first is species-level sorting, which can yield fairly accurate estimates of species diversity and abundance [4,8]. Many biodiversity-related questions can already be addressed based on these data. The second stage is the refinement of mOTUs based on morphological testing with subsequent species identification (described species) or species description (new species). Given the large number of new species, this will require optimized "turbo-taxonomic" methods. Fortunately, new approaches to large-scale species description are being developed and there are now a number of publications that describe ~100 or more new species [36, 40-42].

MinION sequencing and the "reverse workflow"
MinION barcodes can be obtained without having to invest heavily into sequencing facilities, and MinION laboratories can be mobile and even operate under field conditions [15-18]. The technology is thus likely to become important for the "democratization" of biodiversity research because the data are generated quickly enough that they can be integrated into high school and citizen scientist initiatives. Based on our data, we would argue that MinION is now suitable for wide-spread implementation of the "reverse workflow" where all specimens are sequenced first before mOTUs are assessed for consistency with morphology. The reverse workflow differs from the traditional workflow in that it relies on DNA sequences for sorting all specimens into putative species, while the traditional workflow starts with species-level sorting based on morphology; only some morpho-species are subsequently examined with a limited amount of barcoding. We would argue that the reverse workflow is more suitable for handling species- and specimen-rich clades because it requires less time than high-quality sorting based on morphology, which often involves genitalia preparations and slide-mounts. For example, even if we assume that an expert can sort and identify 50 specimens of unknown phorids per day, the reverse workflow pipeline would increase the species-level sorting rate by >10 times (based on the extraction and PCR of six microplates per day). In addition, the molecular sorting can be carried out by lab personnel trained in amplicon sequencing, while accurate morpho-species sorting requires highly specialized taxonomic experts. Yet, even highly trained taxonomic experts are usually not able to match morphologically disparate males and females belonging to the same species (often one sex is ignored in morphological sorting), while the matching of sexes (and immatures) is an automatic and desirable by-product of applying the reverse workflow [7]. All these benefits can be reaped rapidly. One lab member can amplify the barcodes of 600-1,000 specimens per day (2-2.5 weeks for 8,000 specimens). Obtaining DNA sequences requires ca. one week because it involves two cycles of pooling and sequencing on two MinION flowcells followed by two cycles of re-pooling and sequencing of weak amplicons. The bioinformatics work requires less than one week.

One key element of the reverse workflow is that vials with specimens that have haplotype distances <5% are physically kept together. This helps when assessing congruence between mOTUs and morphology. Indeed, graphical representations of haplotype relationships (e.g., haplotype networks) are the guide for the morphological re-examination as illustrated in our description of Megaselia sepsioides (Fig. 7). The eight specimens belonged to seven haplotypes. The most dissimilar haplotypes were dissected in order to test whether the data are consistent with the presence of one or two species. Variations in setation were observed (Fig. 6)

higher than what was obtained with 1D² sequencing in Srivathsan et al. (99.2%) [14]. We suspect that this is partially due to improvements in MinION sequencing chemistry and base-calling, but our upgraded bioinformatics pipeline also helps because it increases coverage for the amplicons. These findings are welcome news because 1D library preparations are much simpler than the library preps for 1D². In addition, 1D² reads are currently less suitable for amplicon sequencing

while QuickExtract™ reagent costs 0.06. These properties make MinION a valuable tool for species discovery whenever a few thousand specimens must be sorted to species (<5,000). Even larger-scale barcoding projects are probably still best tackled with Illumina short-read or PacBio's Sequel sequencing [4, 10, 11] because the barcoding cost is even lower. However, both require access to expensive sequencing instruments, sequencing is thus usually outsourced, and the users usually have to wait for several weeks in order to obtain the data. This is not the case for barcoding with MinION, where most of the data are collected within 10 hours of starting a sequencing run. Another advantage of the MinION pipeline is that it only requires basic molecular lab equipment including thermocyclers, a magnetic rack, a Qubit, a custom-built computational device for base-calling ONT data ("MinIT"), and a laptop (total cost of lab <USD 10,000). Arguably, the biggest operational issue is access to a sufficiently large number of thermocyclers given that a study of the scale described here involved amplifying PCR products in 92 microplates (=92 PCR runs).

Our new workflow for large-scale species discovery is based on sequencing the amplicons in two sequencing runs. The second sequencing run can re-use the flowcell that was used for the first run. Two runs are desirable because they improve overall barcoding success rates. The first run is used to identify those PCR products with "weak" signal (=low coverage). These weak products are then re-sequenced in the second run. This dual-run strategy overcomes the challenges related to sequencing large numbers of PCR products: the quality and quantity of DNA extracts are poorly controlled and PCR efficiency varies considerably. Pooling of products ideally requires normalization, but this is not practical when thousands of samples are handled. Instead, one can use the real-time sequencing provided by MinION to determine coverage and then boost the coverage of low-coverage products by preparing and re-sequencing a separate library that contains only the low-coverage samples. Given that library preparations only require <200 ng of DNA, even a pool of weak amplicons will contain sufficient DNA. This ability to re-pool within days of obtaining the first sequencing results is a key advantage of MinION. The same strategy could be pursued with Illumina and PacBio, but it would take a long time to obtain all results because one would have to wait for the completion of two consecutive runs.

Swaibu Katusabe). Subsequently, the material was collected and transferred in accordance with approvals from the Uganda Wildlife Authority (UWA/FOD/33/02) and Uganda National Council for Science and Technology (NS 290/ September 8, 2011), respectively. The material was thereafter sorted to higher-level taxa. Target taxa belonging to Diptera were sorted to family and we here used the phorid fraction. The sampling was done over several months between 2010 and 2011. For the study carried out here, we only barcoded ca. 30% of the phorid specimens. The flies were stored in ethyl alcohol at -20 to -25°C until extraction.

DNA extraction
DNA was extracted from whole flies. Each fly was first taken out of the vial and washed in Milli-Q water prior to being placed in a well of a 96-well PCR plate. DNA extraction was done using 10 µl of QuickExtract™ (Lucigen) in a 96-well plate format. The reagent allows for rapid DNA extraction by incubation (no centrifugation or columns are required). The solution with the fly was incubated at 65°C for 15 min followed by 98°C for 2 min. No homogenization was carried out, to ensure that the intact specimen was available for morphological examination.

I. Polymerase Chain Reactions (PCRs)
Each plate with 96 QuickExtract™ extracts (95 specimens and 1 control, with the exception of one plate with no negative control and one partial plate) was subjected to PCR in order to amplify the 658 bp fragment of COI using LCO1490 5' GGTCAACAAATCATAAAGATATTGG 3' and HCO2198 5' TAAACTTCAGGGTGACCAAAAAATCA 3' [48]. This primer pair has had high PCR success rates for flies in our previous study [14] and hence was chosen for phorid flies. Each PCR product was amplified using primers that included a 13 bp tag. For this study, 96 thirteen-bp tags were newly generated in order to allow for upscaling of barcoding; these tags allow for multiplexing >9,200 products in a single flowcell of MinION through unique tag combinations (96x96 combinations). To obtain these 96 tags, we first generated one thousand tags that differed by at least 6 bp using BarcodeGenerator [49]. However, tag distances of ≥6 bp are not sufficiently distinct because they do not take into account MinION's propensity for creating errors in homopolymer stretches and other indel errors. We thus excluded tags with homopolymeric stretches that were >2 bp long. We next used a custom script to identify tags that differed from each other by indel errors. Such tags were eliminated recursively to ensure that the final sets of tags differed from each other by ≥3 bp errors of any type (any combination of insertions/deletions/substitutions). This procedure yielded a tag set with the edit-distance distribution shown in Additional File 1: Fig S1 [minimum edit distance (as calculated by the stringdist module in Python) of 5 nucleotides (38.5%), which is much higher than Nanopore error rates]. Lastly, we excluded tags that ended with "GG" because LCO1490 starts with this motif. Note that longer tags would allow for higher demultiplexing rates, but our preliminary results on PCR success rates suggested that the use of long tags reduced amplification success (one plate: 7% drop).

The reaction mix for all amplifications was as follows: 10 µl Mastermix (CWBio), 0.16 µl of 25 mM MgCl2, 2 µl of 1 mg/ml BSA, 1 µl each of 10 µM primers, and 1 µl of DNA. The PCR conditions were 5 min initial denaturation at 94°C followed by 35 cycles of denaturation at 94°C (30 sec), annealing at 45°C (1 min), and extension at 72°C (1 min), followed by a final extension at 72°C (5 min). For each plate, a subset of 7-12 products was run on a 2% agarose gel to ensure that PCRs were successful. Of the 96 plates studied, 4 plates were excluded from further analyses as they had <50% amplification success, and one plate was inadvertently duplicated across the two runs.
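The tag-filtering criteria described above (no homopolymers >2 bp, no tags ending in "GG", and a minimum pairwise edit distance) can be sketched as follows. This is not the custom script used in the study: the greedy selection order, the function names, and the example tags are assumptions made for illustration, and the edit distance is a plain Levenshtein implementation rather than the stringdist module.

```python
import re

# Three or more identical bases in a row, i.e. a homopolymer >2 bp long.
HOMOPOLYMER = re.compile(r"(.)\1\1")


def edit_distance(a, b):
    """Levenshtein distance counting insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion from a
                           cur[j - 1] + 1,             # insertion into a
                           prev[j - 1] + (ca != cb)))  # substitution/match
        prev = cur
    return prev[-1]


def passes_sequence_rules(tag):
    """Reject tags with homopolymers >2 bp or ending in 'GG'."""
    return not HOMOPOLYMER.search(tag) and not tag.endswith("GG")


def select_tags(candidates, min_dist=3):
    """Greedily keep tags that are >= min_dist edits from all kept tags.

    ASSUMPTION: a greedy first-come pass stands in for the recursive
    elimination described in the text.
    """
    kept = []
    for tag in candidates:
        if not passes_sequence_rules(tag):
            continue
        if all(edit_distance(tag, k) >= min_dist for k in kept):
            kept.append(tag)
    return kept


# Hypothetical 13 bp candidates (not the tags used in the study).
candidates = ["ACACACACACACA", "ACACACACACACT", "GTGTGTGTGTGTG",
              "ACGTTTACGTACG", "ACGTACGTACAGG"]
selected = select_tags(candidates)
```

In this toy input the second tag is one substitution away from the first, the fourth contains "TTT", and the fifth ends in "GG", so only the first and third tags survive.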

II. MinION sequencing
We developed an optimized strategy for nanopore sequencing during the study. For the initial experiment (set 1), we sequenced amplicons for 4,275 phorid flies. For this, all plates were grouped by amplicon strength as judged by the intensity of products on agarose gels and pooled accordingly (5 strong pools + 2 weak pools). The pools were cleaned using either 1X Ampure beads (Beckman Coulter) or 1.1X Sera-Mag beads (GE Healthcare Life Sciences) in PEG and quantified prior to library preparation. The flowcell was sequenced for 48 hours and yielded barcodes for ~3,200 products, but we noticed a lack of data for products for which amplification bands could be observed on the agarose gel. We thus re-pooled products with low coverage (<=50X), prepared a new library, and sequenced them on a new flowcell. The experiment was successful. However, in order to reduce sequencing cost and improve initial success rates, we pursued a different strategy for the second set of specimens (4,519 specimens). In the first sequencing run, we stopped the sequencing after 24 hours. The flowcell was then washed using ONT's flowcell wash kit and prepared for reuse. The results from the first 24 hours of sequencing were then used to identify amplicons with weak coverage. They were re-pooled, a second library was prepared, and this library was sequenced on the pre-used and washed flowcell.

Pooling of weak products with <=50X coverage was done as follows: we located (1) specimens with <=10X coverage (set 1: 1,054, set 2: 1,054) and (2) samples with coverage between 10X and 50X (set 1: 1,118, set 2: 1,065). Lastly, we also created a (3) third pool of specimens with "problematic" products that were defined as those that were found to be of low accuracy during comparisons with Illumina barcodes and those that had high levels of ambiguous bases (>1% ambiguous bases during preliminary barcode calling). Very few amplicons belonged to this category (set 1: 68, set 2: 92). In order to efficiently re-pool hundreds of specimens across plates, we wrote a script that generates visual maps of the microplates that illustrate the wells where the weak products are found (available in the GitHub repository for miniBarcoder).

were uploaded onto a computer server and base-calling was carried out using Guppy 2.3.5+53a111f. No quality filtering criteria were used. Our initial work with Albacore suggested that quality filtering improved demultiplexing rate but overall more reads could be demultiplexed without the filtering criterion.
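The re-pooling categories described above (<=10X, 10X to 50X, and "problematic") and the idea of a visual plate map can be sketched as follows. This is not the script from the miniBarcoder repository: the function names, the text symbols, the precedence of the coverage checks over the "problematic" flag, and the treatment of wells without data as zero coverage are all assumptions made for illustration.

```python
def assign_pool(coverage, flagged=False):
    """Return the re-pooling category for one amplicon.

    coverage: number of demultiplexed reads for the product.
    flagged:  True if the product was judged "problematic" (low accuracy
              or >1% ambiguous bases); ASSUMPTION: coverage is checked first.
    """
    if coverage <= 10:
        return "weak_le10"
    if coverage <= 50:
        return "weak_10_50"
    if flagged:
        return "problematic"
    return "ok"


def plate_map(coverage_by_well, rows="ABCDEFGH", cols=12):
    """Render a 96-well plate as text, marking wells that need re-pooling.

    coverage_by_well maps well names such as "A1" to read counts; wells
    without an entry are treated as zero coverage (an assumption).
    """
    symbols = {"weak_le10": "X", "weak_10_50": "x", "problematic": "?", "ok": "."}
    lines = ["  " + " ".join(f"{c:>2}" for c in range(1, cols + 1))]
    for r in rows:
        cells = []
        for c in range(1, cols + 1):
            cov = coverage_by_well.get(f"{r}{c}", 0)
            cells.append(f"{symbols[assign_pool(cov)]:>2}")
        lines.append(f"{r} " + " ".join(cells))
    return "\n".join(lines)


# Hypothetical coverage values for three wells of one plate.
demo_map = plate_map({"A1": 100, "A2": 5, "A3": 30})
```

Printing demo_map gives one header row and eight plate rows, so the wells to be picked for each weak pool can be read off at the bench.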

III. Data analyses for MinION barcoding
We attempted to demultiplex the data for set 1 using minibar [50]; however, it was found to demultiplex only 1,039 barcodes for the 4,275 specimens (command used: python ../minibar/minibar.py -F -C -e 2 minibar_demfile_1 phorid_run1ab.fa_overlen599 > out ). This success rate was so low that we discontinued the use of minibar. Instead, we analysed the data using an improved version of miniBarcoder [14]. This pipeline starts with a primer search with glsearch36, followed by identifying the "sequence tags" in the flanking nucleotide sequences, before the reads are demultiplexed based on tags. For the latter, errors of up to 2 bp are allowed. These "erroneous" tags are generated by "mutating" the original 96 tags to account for all possible insertions/deletions/substitutions. The sequence tags are matched with this set of input tags + tag mutants. This speeds up demultiplexing as it does not have to align each tag. When comparing the performance of miniBarcoder with minibar, we found that using 4 cores, miniBarcoder could demultiplex data in <10 minutes while minibar required 74 minutes on one core. Both pipelines demultiplexed similar numbers of reads: 898,979 reads using miniBarcoder and 940,568 HH reads using minibar (56,648 reads in multiple samples). The demultiplexed reads were aligned using MAFFT v7 (--op 0) [51]. In order to improve speed, we only used a random subset of 100 reads from each demultiplexed file for alignment. Based on these alignments, a majority rule consensus was called to obtain what we call "MAFFT barcodes".

Other studies usually incorporate a step of clustering of data at low thresholds (e.g. 70% by Maestri et al. 2019 [18]) in order to account for the read errors produced by MinION. The subsequent analysis is then carried out on the cluster that has the largest number of reads. We deviate from this approach because it requires high coverage.
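The tag-mutant lookup used for demultiplexing, as described earlier in this section, can be sketched as follows. This is an illustrative reconstruction rather than the miniBarcoder code: the function names are hypothetical, and discarding mutants that can be reached from more than one tag is an assumption about how ambiguity might be handled.

```python
BASES = "ACGT"


def one_edit(seq):
    """All sequences reachable from seq by at most one edit."""
    out = set()
    for i in range(len(seq)):
        out.add(seq[:i] + seq[i + 1:])              # deletion
        for b in BASES:
            out.add(seq[:i] + b + seq[i + 1:])      # substitution
    for i in range(len(seq) + 1):
        for b in BASES:
            out.add(seq[:i] + b + seq[i:])          # insertion
    return out


def mutant_table(tags, max_edits=2):
    """Map every tag and every mutant within max_edits to its source tag.

    ASSUMPTION: mutants reachable from more than one tag are dropped.
    """
    table = {}
    ambiguous = set()
    for tag in tags:
        variants = {tag}
        for _ in range(max_edits):
            variants |= {m for v in variants for m in one_edit(v)}
        for v in variants:
            if v in table and table[v] != tag:
                ambiguous.add(v)
            table[v] = tag
    for v in ambiguous:
        del table[v]
    return table


def demultiplex(flank, table):
    """Return the tag for an observed flanking sequence, or None."""
    return table.get(flank)


# Hypothetical tags (not the tags used in the study).
table = mutant_table(["ACGTACGTACGTA", "TGTGTGTGTGTGT"])
```

Because every allowed error pattern is enumerated once up front, assigning a read reduces to a dictionary lookup, which is consistent with the speed advantage noted in the text over aligning each tag to each read.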
In our barcoding pipeline, we assess congruence for each base pair by first eliminating the MAFFT gap-opening penalty (--op 0); this allows for all data to be used when calling the consensus [14]: setting the gap-opening penalty to zero essentially treats indels and substitutions similarly and staggers the alignment. Any position at which the most frequent base is found in <50% of the reads is called as an ambiguity; i.e., the majority-rule criterion is applied for each site instead of filtering at the read level, which is based on averages across all bases. This site-specific approach maximizes data use by allowing barcodes to be called at much lower

order to fix the remaining indel errors. The correction takes advantage of the fact that COI sequences are translatable; i.e., an amino acid-based error correction pipeline can be used (details can be found in Srivathsan et al. [14]). Applying this pipeline to MAFFT and RACON barcodes, respectively, yields MAFFT+AA and RACON+AA barcodes. Lastly, these barcodes can be consolidated into "consolidated barcodes".

The version of the pipeline in Srivathsan et al. [14] was modified as follows:

a. Tackling 1D reads for massively multiplexed data: We developed ways for correcting for the increased number of errors in 1D reads by identifying objective ways for quality assessments based on the MinION data and publicly available data (GenBank): (1) The GraphMap max error was increased from 0.05 to 0.15 to account for error rates of 1D reads. (2)

The novel steps introduced in this study are highlighted in green and the scripts available in miniBarcoder for analyses are further indicated.
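The site-wise majority-rule consensus described in this section can be sketched as follows. This is a minimal illustration, not the miniBarcoder implementation: the function name is hypothetical, the input is assumed to be equal-length aligned reads (e.g. parsed from a MAFFT alignment), and dropping columns whose majority state is a gap is an assumption.

```python
from collections import Counter


def majority_consensus(aligned_reads, min_fraction=0.5):
    """Call a per-column majority-rule consensus from aligned reads.

    A column whose most frequent state is found in < min_fraction of the
    reads is called "N". ASSUMPTION: a column whose majority state is a
    gap ("-") is omitted from the barcode.
    """
    if not aligned_reads:
        return ""
    length = len(aligned_reads[0])
    if any(len(r) != length for r in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        state, count = Counter(c.upper() for c in column).most_common(1)[0]
        if count / len(column) < min_fraction:
            consensus.append("N")       # no state reaches the threshold
        elif state != "-":
            consensus.append(state)     # majority base
        # majority gap: column is skipped
    return "".join(consensus)


# Hypothetical toy alignment of four reads.
toy_alignment = ["ACGT-A", "ACGTTA", "ACCT-A", "TCGT-A"]
toy_barcode = majority_consensus(toy_alignment)
```

In the toy alignment, the single-read substitutions and the single-read insertion are outvoted at their respective sites, which illustrates why site-level voting can use all reads instead of discarding reads with a high average error.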