MinION barcodes: biodiversity discovery and identification by everyone, for everyone

DNA barcodes are a useful tool for discovering, understanding, and monitoring biodiversity, all of which are critical at a time when biodiversity loss is a major problem for many countries. However, widespread adoption of barcodes requires cost-effective and simple barcoding methods. We here present a workflow that satisfies these conditions. It was developed via “innovation through subtraction” and thus requires minimal lab equipment, can be learned within days, reduces the barcode sequencing cost to <10 cents, and allows fast turnaround from specimen to sequence by using the real-time sequencer MinION. We first describe cost-effective and rapid procedures for obtaining tagged amplicons. We then demonstrate how a portable MinION device can be used for real-time sequencing of tagged amplicons in many settings (field stations, biodiversity labs, citizen science labs, schools). Small projects can use the flow cell dongle (“Flongle”) while large projects can rely on MinION flow cells that can be stopped and re-used after collecting sufficient data for a given project. We also provide amplicon coverage recommendations that are based on several runs of MinION flow cells (R10.3), which suggest that each run can generate >10,000 barcodes. Next, we present novel software, ONTbarcoder, which overcomes the bioinformatics challenges posed by the sequencing errors of MinION reads. This software is compatible with Windows 10, Macintosh, and Linux, has a graphical user interface (GUI), and can generate thousands of barcodes on a standard laptop within hours based on two input files (FASTQ, demultiplexing file). We document that MinION barcodes are virtually identical to Sanger and Illumina barcodes for the same specimens (>99.99%). Lastly, we demonstrate how rapidly MinION data have improved by comparing the performance of sequential flow cell generations.
Overall, we assert that barcoding with MinION is the way forward for government agencies, universities, museums, and schools because it combines low consumable and capital cost with scalability. Biodiversity loss is threatening the planet, and the use of MinION barcodes will help enable an army of researchers and citizen scientists, which is necessary for effective biodiversity discovery and monitoring.

includes 12 flow cells thus increasing the cost of the first 12 runs to ~$170 USD. The turnaround time is fast, so the MinION Flongle is arguably the best sequencing option for small barcoding projects with >50 barcodes. Full MinION flow cells also have fast turnaround times, but the minimum run cost is ca. 1000 USD, so this option only becomes more cost-effective than Flongle when >1800 amplicons are sequenced. As shown later, one regular MinION flow cell can comfortably sequence 10,000 amplicons. This is a similar volume to what has been described for PacBio (Sequel) (Hebert, Braukmann et al. 2018), but the high instrument cost for PacBio means that sequencing usually has to be outsourced, leading to longer wait times. By far the most cost-effective sequencing method for barcodes is Illumina's NovaSeq sequencing. The fixed costs for library and lanes are high (3000-4000 USD), but each flow cell yields 800 million reads which can comfortably sequence 800,000 barcodes at a cost of <$0.01 USD per barcode. This high capacity means that the 6 million publicly available barcodes in BOLD Systems could have been sequenced on just <8 NovaSeq flow cells for ~50,000 USD. However, Illumina sequencing can only be used for mini-barcodes of up to 420 bp length (using 250 bp PE sequencing using SP flow cell). "Full-length" COI barcode (658 bp) can only be obtained by sequencing two amplicons. Note that while Illumina barcodes are shorter than "full-length" barcodes, there is no evidence that mini-barcodes have a negative impact on species delimitation or identification as long as the mini-barcode is >250 bp in length (Yeo, Srivathsan et al. 2020).

it is sufficient to combine only 1 μl per PCR product. The pool can be cleaned using several PCR clean-up methods.
We generally use SPRI bead-based clean-up, with Ampure (Beckman Coulter) beads but Kapa beads (Roche) or the more cost-effective Sera-Mag beads (GE Healthcare Life Sciences) in PEG (Rohland and Reich 2012) are also viable options (Srivathsan, Hartop et al. 2019). We recommend the use of a 0.5X ratio for Ampure beads for barcodes longer than 300 bp since it removes a larger proportion of primers and primer dimers. However, this ratio is only suitable if yield is not a concern (e.g., pools consisting of many and/or high concentration amplicons). Increasing the ratio to 0.7-1X will improve yield but render the clean-up less effective. Amplicon pools containing large numbers of amplicons usually require multiple rounds of clean-up, but only a small subset of the entire pool needs to be purified because most library preparation kits require only small amounts of DNA. Note that the success of the clean-up procedures should be verified with gel electrophoresis, which should yield only one strong band of expected length. After the clean-up, the pooled DNA concentration is measured in order to use an appropriate amount of DNA for library preparation. Most laboratories use a Qubit, but less precise techniques are probably also suitable. Obtaining a cleaned amplicon pool according to the outlined protocol is not time consuming. However, many studies retain "old Sanger sequencing habits". For example, they use gel electrophoresis for each PCR reaction to test whether an amplicon has been obtained and then clean and measure all amplicons one at a time for normalization (often with very BioAnalyzer, Qubit: (Seah, Lim et al. 2020)). This is presumably done to obtain a pool of amplicons where each has equal representation. However, reads are cheap while individual clean-ups and measurements for each PCR product are expensive. Furthermore, weak products that failed to yield a barcode can be re-sequenced (Srivathsan, Hartop et al. 2019).

Another strategy is normalization based on the band strength of a few PCR products per plate as determined by gel electrophoresis. We used this strategy in the current study, but have since determined that pooling without such normalization yields nearly identical success rates based on the same number of reads (>9600 amplicons: 76.8% vs. 77.2%).
Oxford Nanopore Technologies (ONT) instruments sequence DNA by passing single-stranded DNA through a nanopore. This creates current fluctuations which can be measured and translated into a DNA sequence via basecalling (Wick 2019). The sequencing devices are small and inexpensive, but the read accuracy is only moderate (85%-95%) (Wick 2019, Silvestre-Ryan and Holmes 2021). This means that data analysis requires specialized bioinformatics pipelines. The nanopores used for sequencing are arranged on flow cells, with

cells and 100 ng for the Flongle and used ligation-based kits (see Table 1 for details). We generally followed kit instructions, but excluded the FFPE DNA repair mix in the end-repair reaction, as this is mostly needed for formalin-fixed, paraffin-embedded samples. The

to 1x for all steps as DNA barcodes are short whereas the recommended ratio in the manual is for longer DNA fragments. The libraries were loaded and sequenced with a MinION Mk 1B. Data capture involved a MinIT or a Macintosh computer that meets the IT specifications recommended by ONT. The bases were called using Guppy (versions provided in Table 2

Sequencing. Six amplicon pools were sequenced (Table 1)

analyzing the read length distribution in the FASTQ file. Only those reads that meet the read length threshold are demultiplexed (default=658 bp, corresponding to the metazoan COI barcode). Technically, the threshold should be the amplicon length plus the length of both tagged primers, but ONT reads have indel errors such that they are occasionally too short and we therefore advise specifying the amplicon length as threshold. Reads that are twice the expected fragment length are split into two parts. Splitting is based on the user-given fragment size, primer and tag lengths, and a window size to account for indel errors (default=100 bp).
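The length filter and the splitting of over-long reads described above can be illustrated with a minimal Python sketch. It is not ONTbarcoder's actual code: the function name `filter_and_split`, the rule "split at the midpoint once a read is within one window of twice the product length", and the example product length of 740 bp are all assumptions made for illustration.

```python
def filter_and_split(reads, min_len=658, product_len=None, window=100):
    """Keep reads that meet the length threshold; halve reads that look doubled.

    min_len     -- read length threshold (the text advises the amplicon length)
    product_len -- amplicon plus both tagged primers (hypothetical parameter)
    window      -- tolerance for indel errors, default 100 bp as in the text
    """
    if product_len is None:
        product_len = min_len
    kept = []
    for read in reads:
        if len(read) < min_len:
            continue  # too short to contain the full amplicon
        if len(read) >= 2 * product_len - window:
            # Assumed rule: a read about twice the product length holds two
            # amplicons, so cut it at the midpoint.
            mid = len(read) // 2
            kept.extend([read[:mid], read[mid:]])
        else:
            kept.append(read)
    return kept


reads = ["A" * 600, "C" * 700, "G" * 1500]
lengths = [len(r) for r in filter_and_split(reads, min_len=658, product_len=740)]
print(lengths)
```

In this toy run the 600 bp read is discarded, the 700 bp read is kept, and the 1500 bp read is cut into two 750 bp parts.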
matched against the tags from the user-provided tag combinations (demultiplexing file). In order to account for sequencing errors, not only exact matches are accepted, but also matches to "tag variants" that differ by up to 2 bp from the original tag (substitutions/insertions/deletions). Note that accepting tag variants does not lead to demultiplexing error because all tags differ by >4 bp, so a read within 2 bp of one tag cannot also be within 2 bp of another. All reads thus identified as belonging to the same specimen are pooled into the same bin. To increase efficiency, demultiplexing is parallelized and the search space for primers and tags is restricted to user-specified parts of each read.

b. Barcode calling: Barcode calling uses the reads within each specimen-specific bin to reconstruct the barcode sequence. The reads are aligned to each other and a consensus

"Consensus by Similarity" and "Consensus by barcode comparison". The user can opt to only use some of these methods.

"Consensus by Length" is the main barcode calling mode. Alignment must be efficient in order to obtain high-quality barcodes at reasonable speed for thousands of amplicons. ONTbarcoder delivers speed by using an iterative approach that gradually increases the number of reads ("coverage") that is used during alignment. However, reconstructing barcodes based on few reads could lead to errors which are here weeded out by using four Quality Control (QC) criteria. The first three QC criteria are applied immediately after the consensus sequence has been called: (1) the barcode must be translatable, (2) it has to match the user-specified barcode length, and (3) the barcode has to be free of ambiguous bases ("N").
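The first three QC criteria can be written as a short check. The sketch below is an illustration under stated assumptions rather than ONTbarcoder's implementation: it assumes the invertebrate mitochondrial genetic code (NCBI translation table 5, where TAA and TAG are the stop codons), and it treats a barcode as translatable if at least one of the three forward reading frames is free of stop codons. The function names are hypothetical.

```python
# Assumption: invertebrate mitochondrial code (NCBI table 5); stops are TAA and TAG.
STOP_CODONS = {"TAA", "TAG"}


def is_translatable(seq):
    """True if at least one forward reading frame contains no stop codon."""
    seq = seq.upper()
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def passes_basic_qc(barcode, expected_len=658):
    """Apply QC criteria 1-3: translatable, expected length, no ambiguous bases."""
    return (
        len(barcode) == expected_len
        and "N" not in barcode.upper()
        and is_translatable(barcode)
    )


# Toy sequences, far shorter than a real 658 bp barcode.
print(passes_basic_qc("ATGGCC", expected_len=6))       # open frame, right length
print(passes_basic_qc("ATGGCN", expected_len=6))       # contains an ambiguous base
print(passes_basic_qc("TAACTAACTAA", expected_len=11)) # stop codon in every frame
```

A different genetic code or a check restricted to one known reading frame would only change `STOP_CODONS` and the frame loop.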
To increase the chance of finding a barcode that meets all three criteria, we subsample the reads in each bin by read length (thus the name "Consensus by Length"); i.e., initially only those reads closest to the expected length of the barcode are used. For example, if the user specified coverage=25x for a 658 bp barcode, ONTbarcoder would only use the 25 reads that have the closest match to 658 bp. The fourth QC measure is only applied to barcodes that have already met the first three QC criteria. A multiple sequence alignment (MSA) is built for the barcodes obtained from the amplicon pool, and any barcode that causes the insertion of gaps in the MSA is rejected. Note that if the user suspects that barcodes of different length are in the amplicon pool, the initial analysis should use the dominant barcode length. The remaining barcodes can then be recovered by re-analyzing all data or only the failed read bins ("remaining", see below) and bins that yielded barcodes that had to be "fixed". These bins can be reanalyzed using a different pre-set barcode length.

"Consensus by Similarity". The barcodes that failed the QC during the "Consensus by Length" stage are often close to the expected length and have few ambiguous bases, and/or cause few gaps in the MSA. These "preliminary barcodes" can be improved through

Such reads often differ considerably from the signal of the consensus barcode and ONTbarcoder identifies them by sorting all reads by similarity to the preliminary barcode.
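The two read-selection strategies, by length and by similarity to a preliminary barcode, can be sketched in a few lines of Python. This is an illustration only. ONTbarcoder computes similarities with edlib; to stay dependency-free, the sketch substitutes a plain dynamic-programming edit distance. The function names, and the definition of divergence as edit distance divided by preliminary barcode length, are assumptions. The defaults of 100 reads and <10% divergence are the values stated in the text.

```python
def edit_distance(a, b):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def select_by_length(reads, target_len=658, coverage=25):
    """'Consensus by Length': take the reads closest to the expected length."""
    return sorted(reads, key=lambda r: abs(len(r) - target_len))[:coverage]


def select_by_similarity(reads, preliminary, max_reads=100, max_divergence=0.10):
    """'Consensus by Similarity': keep the most similar reads below the cutoff."""
    scored = []
    for read in reads:
        # Assumed definition: divergence = edit distance / barcode length.
        divergence = edit_distance(read, preliminary) / len(preliminary)
        if divergence < max_divergence:
            scored.append((divergence, read))
    scored.sort(key=lambda pair: pair[0])
    return [read for _, read in scored[:max_reads]]


# Toy data: five reads of different lengths, coverage of 3.
reads = ["A" * n for n in (600, 650, 658, 660, 700)]
print([len(r) for r in select_by_length(reads, target_len=658, coverage=3)])
```

Sorting by divergence before truncating means that lowering `max_reads` always drops the least similar reads first.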

Only the top 100 reads (this default can be changed) that differ by <10% from the preliminary barcode are retained and used for calling the barcodes again using the same techniques described previously (including the same QC criteria). This distance threshold accounts for errors generated by MinION but excludes highly erroneous or contaminating reads. This improvement step converts many preliminary barcodes found during "Consensus by Length" into barcodes that pass all four QC criteria by filling/removing indels or resolving an ambiguous base.

parameters can be modified in the "parfile" supplied with the software which will help with adjusting the values given the rapidly changing nanopore technology. All remaining MSAs in the pipeline (e.g., of preliminary barcodes) use MAFFT's default settings. All read and sequence similarities are determined with the edlib Python library under the Needleman-Wunsch ("NW") setting, while primer search uses the infix option ("HW"). All consensus sequences are called from within the software. This is initially done based on a minimum frequency of 0.3 for each position. This threshold was empirically determined based on datasets where MinION barcodes can be compared to Sanger/Illumina barcodes. The threshold is applied as follows. All sites where >70% of the reads have a gap are deleted.

For the remaining sites, ONTbarcoder accepts those consensus bases that are found in at least 30% of the reads. If no base or multiple bases reach this threshold, an "N" is inserted. To avoid reliance on a single threshold, ONTbarcoder allows the user to vary the consensus calling threshold between 0.2 and 0.5 for all barcodes that fail the QC criteria at 0.3 frequency. However, barcodes called at different frequencies are only accepted if they pass the first three QC criteria and are identical. If no such barcode is found, the 0.3 frequency consensus barcode is used for further processing.

reference. The barcode comparisons are conducted using the edlib library. The barcodes in the sets are compared and classified into three categories: "identical" where sequences are a perfect match and lack ambiguities, "compatible" where the sequences only differ by ambiguities, and "incorrect" where the sequences differ by at least one base pair. Several output files are provided: a summary sheet, a FASTA file each for "identical", "compatible", and the sequences only found in one dataset. Lastly, there is a folder with FASTA files containing the different barcodes for each incompatible set of sequences. This module can be used for either comparing set(s) of barcodes to reference sequences, or for comparing barcode sets against each other. It furthermore allows for pairwise comparisons and comparisons of multiple sets in an all-vs-all manner. This module was used here to get the final accuracy values presented in Table 3.
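The three comparison categories can be illustrated with a small Python function. It is a simplified sketch, not the module's code: it assumes the two barcodes are already aligned and of equal length (ONTbarcoder itself relies on edlib for the comparison), treats only "N" as an ambiguity, and labels sequences of unequal length "incorrect". The function name `classify_pair` is hypothetical.

```python
def classify_pair(seq_a, seq_b):
    """Classify two barcodes as 'identical', 'compatible', or 'incorrect'."""
    a, b = seq_a.upper(), seq_b.upper()
    if len(a) != len(b):
        # Assumption of this sketch: a length difference counts as a conflict.
        return "incorrect"
    has_ambiguity = False
    for x, y in zip(a, b):
        if x == "N" or y == "N":
            has_ambiguity = True  # an ambiguous base cannot contradict anything
            continue
        if x != y:
            return "incorrect"    # at least one true base difference
    return "compatible" if has_ambiguity else "identical"


print(classify_pair("ACGT", "ACGT"))
print(classify_pair("ACNT", "ACGT"))
print(classify_pair("ACTT", "ACGT"))
```

The three calls fall into the "identical", "compatible", and "incorrect" categories, in that order.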

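The consensus-calling rule described earlier (delete sites where >70% of reads have a gap, then accept a base only if exactly one base reaches the minimum frequency) can be sketched as follows. This is a minimal illustration, not ONTbarcoder's code: it assumes the reads are already aligned to equal length with "-" as the gap character, and that frequencies are computed over all reads in the column, gaps included. The function name `call_consensus` is hypothetical.

```python
from collections import Counter


def call_consensus(aligned_reads, min_freq=0.3, max_gap_frac=0.7):
    """Call a consensus from equal-length aligned reads ('-' marks a gap)."""
    n_reads = len(aligned_reads)
    consensus = []
    for column in zip(*aligned_reads):
        counts = Counter(column)
        if counts.get("-", 0) / n_reads > max_gap_frac:
            continue  # site is mostly gaps: delete it
        # Bases that reach the minimum frequency (assumed denominator: all reads).
        winners = [base for base, count in counts.items()
                   if base != "-" and count / n_reads >= min_freq]
        # Exactly one qualifying base is accepted; none or several yield "N".
        consensus.append(winners[0] if len(winners) == 1 else "N")
    return "".join(consensus)


reads = ["ACG-T", "ACG-T", "ACGAT", "ATG-T"]
print(call_consensus(reads))
```

In the toy alignment the fourth column is a gap in three of four reads (75%) and is dropped, so the call is `ACGT`. Re-running with `min_freq` set to other values between 0.2 and 0.5 corresponds to the alternative thresholds mentioned in the text.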
Performance of flow cells (R10.3, Flongle) and high-accuracy basecalling
The pools used to test the new ONT products contained amplicons for 191-9,932 specimens and were run for 15-49 hours (Table 2). The fast5 files were basecalled using Guppy in MinIT under the high accuracy (HAC) model. Basecalling large datasets under HAC is currently still very slow and took 12 days in MinIT for the Palaearctic Phoridae (658 bp) dataset (Table 2) but the reads yielded high demultiplexing rates for three of the four R10.3 MinION datasets (= 30-49%). The exception was the Palaearctic Phoridae (313 bp) dataset (15.5%). Flongle datasets showed overall also lower demultiplexing rates (17-21%).

Overall, we thus predict that most users will, at most, try to multiplex 10,000 amplicons in the same MinION flow cell so that the sequencing cost per specimen would be 0.06-0.10 USD depending on the bulk purchase of flow cells. However, we also predict that large-scale biodiversity projects will switch to sequencing with PromethION, a larger sequencing unit that can accommodate up to 48 flow cells. This will lower the sequencing cost by more than

1000 USD, but we recommend purchase of Mk1C unit (currently 4900 USD) for easy access to a GPU that is required for high accuracy basecalling. Note also that obtaining flow cells at low cost often requires collaboration between several labs because it allows for buying flow cells in bulk.

sensitive samples that could degrade before reaching a lab. However, for the time being it is unlikely to help substantially with tackling the challenges related to large-scale biodiversity discovery and monitoring because obtaining few MinION barcodes per flow cell is too expensive for most researchers in biodiverse countries.
Additionally, the bioinformatic pipelines that were developed for these small-scale projects were not suitable for large-

ONTbarcoder evolved from miniBarcoder, whose barcodes have been assessed for

in the immediate future and readers are advised to watch out for developments. Fortunately, these changes will only further improve MinION barcodes that are already highly accurate and cost-effective.

specimens remains essential for discovering and describing species as it preserves individual voucher specimens associated with the barcode which can be used for further research. Taxonomic research can be guided by examination of putative species units