De novo long-read assembly of a complex animal genome

Eukaryotic genome assembly remains a challenge in part because of the prevalence of complex DNA repeats. This is a particularly acute problem for holocentric nematodes because of the large number of satellite DNA sequences found throughout their genomes. These have been recalcitrant to most genome sequencing methods. At the same time, many nematodes are parasites and some represent a serious threat to human health. There is a pressing need for better molecular characterization of animal and plant parasitic nematodes. The advent of long-read DNA sequencing methods offers the promise of resolving complex genomes. Using Nippostrongylus brasiliensis as a test case, applying improved base-calling algorithms and assembly methods, we demonstrate the feasibility of de novo genome assembly matching current community standards using only MinION long reads. In doing so, we uncovered an unexpected diversity of very long and complex DNA repeat sequences, including massive tandem repeats of tRNA genes. The method has the added advantage of preserving haplotypic variants and so has the potential to be used in population analyses.


Introduction
Human hookworm infections by the parasitic nematodes Necator americanus and Ancylostoma duodenale continue to be a major global health problem. Next generation sequencing techniques open the door to molecular epidemiological monitoring of nematode and helminth parasites in endemic areas. Such studies are, however, hampered by the heterogeneous nature of parasite populations and by the intrinsically complex genome structures of nematodes [1].
The highly portable nature of MinION sequencers makes them well-suited for field studies, and their capacity to generate long sequence reads offer the promise of overcoming both these obstacles. Nippostrongylus brasiliensis is a gastrointestinal nematode that infects rodents. It is widely used as a model for human hookworm (e.g. ref. [2]). Previous attempts to assemble the N. brasiliensis genome using short DNA reads have resulted in a highly fragmented sequence ( Table 1); almost 30% of predicted protein coding genes (6276/22796) are on contigs that are less than 10kb long. We took N. brasiliensis as a test case to evaluate the possibility of generating a genome sequence de novo from a heterogeneous population.
de novo genome assembly based on long DNA reads relies on either hybrid strategies incorporating, for example, short-read DNA sequencing [3][4][5][6] or non-hybrid long-read only methods [7-11] (see [12] for a review). There have now been successful chromosome-scale assemblies of large genomes, primarily using the PacBio single-molecule realtime (SMRT) sequencing platform to generate contigs (e.g. [13]), combined with long-range linking information (e.g. [14,15]) but the approach can still be challenging. We therefore chose a simpler system to evaluate the feasibility of assembling a genome for N. brasiliensis using improved analysis methods.

Results and Discussion
Whole-genome amplification was used to generate DNA from a single adult worm. This was then sequenced with a single run on a MinION Mk1b (see Methods for details). The original base-calling produced 1722835 reads totaling 4.9 Gb of sequence. These reads were processed using a default Canu [16] v1.4 assembly (including error correction and read trimming) giving an unpolished assembly of 4581 contigs for 117 Mb of assembled genomic sequence ( Table 1). We then re-called the same raw nanopore data using Albacore, the production base-caller that recently implemented a transducer algorithm for homopolymer detection, previously a persistent limitation to analysis of MinION reads [17]. Overall, we obtained 2074871 called reads for 6.5 Gb of sequence. This represents a substantial increase (>30%) in length over the previous version. These reads were fed into an up-dated version of Canu (v1.5), with improved read correction and consensus calling, and more accurate graph information in the assembly output, giving an unpolished assembly of 3280 contigs and 119 Mb of assembled genomic sequence. Not only were there local improvements in the calling of homopolymer sequences (Figure 1A), but the contigs showed a much improved length distribution, and a much higher proportion of graphs were resolved (Figure 1B, C), in part because of a improved resolution of repeat sequences (Figure 2A-C).
These results encouraged us to sequence DNA from a population of worms resident in the small intestines of mice. We used 4 sample preparation methods that gave very different results both in terms of yield, which varied more than 2-fold in total, up to 4.7 Gb for a single run, and in read length distribution (Figure 3). In all cases, Albacore out-performed the older base-caller, although interestingly the improvement was far from uniform, varying from less than 10% to greater than 30% in total reads called ( Table 2). As a further gauge of the accuracy of read calling, we compared a random subset of the longer reads with the current N. brasiliensis reference genome. While overall there was generally a good equivalence, especially given the potential genetic difference between the two samples (see Methods), there were numerous instances where the reads were substantially longer than the corresponding sequence in the reference genome. Upon further examination, some of these were revealed to reflect the presence of very long stretches of complex tandem repeats (or VeCTRs) that had been compacted in the WTSI reference genome (Figure 1D, E).
We combined the reads from the 4 samples and fed them into Canu v1.5. This gave an assembly of close to 350 Mb, with a maximum contig length of >2 Mb and an N50 of 210 kb. A total of 251.6 Mb (72.5%) of the sequence, including all of the longest (>1.5 Mb) contigs, was contained within 2594 non-branched sub-graphs of the Canu assembly graph, with 2413 of them made from a single contig, and the others containing an average of 2.15 linked contigs.
The remaining 95.6 Mb was captured by a total of 780 contigs in 89 non-trivial sub-graphs. A small number of these sub-graphs had very complex structures ( Figure S1). Inspection showed this to result from the presence of long non-tandem repeat sequences. Indeed this, together with real haplotypic sequence diversity (see below), was the most common cause of assembly ambiguity. Resolving such structures would require a greater depth of coverage and/or even longer reads.
The final assembly contained a rich diversity of repeat sequences (Figure 2). The repeat with the longest unit length (535 bp) corresponds to a region with 10 tandem copies ( Figure 2D) of an 5S rRNA gene interspersed with an snRNA gene, the source of the spliced leader RNA that is added to many transcripts. The gene sequences and this arrangement are well-conserved in Caenorhabditis elegans, where these pairs are repeated over a region of 16 kb (V:17,118,000..17,132,000; WS260). In the WTSI assembly, on the other hand, equivalent sequences, in single copy, were found at contig ends. There were also transposon-associated repeats, the repeats that correspond to reiterated amino acid motifs in proteins, and the dispersed satellite DNA sequences found in holocentric nematodes [18]. With regard to the other VeCTRs, their length and the constituent repeated sequences are diverse. There was, for example, no sequence similarity among the 5 most compressible VeCTRs ( Figure S2). The longest of them comprised close to 150 copies of a ca. 200 bp repeat, with the (conserved) tRNA-Trp gene, followed by sequence currently unique to N. brasiliensis. Similarly, the shortest of the 5 corresponded to 90 copies of the tRNA-Ser gene interspersed with a N. brasiliensisspecific 80 bp spacer. In C. elegans, there are more than 600 tRNA genes. Some are clustered in small groups but never with such a massively repeated organization. The other 3 VeCTRs contain sequence that is not conserved. Remarkably, VeCTRs have recently been found in the genome of C. elegans, where they account for more than 1 Mb of sequence omitted from the current reference genome (E. Schwarz, personal communication) that was assembled primarily by Sanger sequencing of inserts from cosmid libraries [19]. Since they are not amenable to either short-read NGS or traditional cloning, they may have been overlooked in other species and potentially have a specific but as yet unknown biological role.
Returning to the analysis of the non-trivial sub-graphs, among the less complex ones, 28 linked just 3 contigs. Of these, 2 were due to the presence of VeCTRs and in 2, one contig matched the extremities of 2 non-overlapping contigs, but in a redundant manner ("Plain Link") ( Figure   4A, B). The others showed structures compatible with separate haplotypes. Of these, 12 had 2 contigs with homologous end sequences converging on a common contig ("contained"), a configuration that can also arise when base-calling errors are high. The remaining 12 had stronger support for being genuine haplotypes, including 3 sub-graphs with a bubble structure indicative of haplotype resolution, as supported by examination of the underlying reads ( Figure   4). While attempting to identify a primary genomic haplotype might have some use in determining the true contiguity of the assembly, this assembly represents a community sample and preserves the observed read variation.
In addition to this capacity to uncover haplotypes within a population, overall, there was an impressive increase in the quality of the genome assembly compared to the current WTSI reference genome by standard contiguity criteria ( Figure 1F, Table 1). The WTSI genome scored substantially higher (  [20]. On the other hand, BLAST searches for the "missing" USCO genes identified plausible orthologues in most cases (results not shown). This suggested that even if the assembly continuity was improved, the local sequence quality was not sufficiently high to allow correct gene prediction by BUSCO. We therefore made use of a set of RNA-seq reads generated from our N. brasiliensis strain (GLG et al., unpublished) to correct the sequence of protein-coding genes, while not touching the genome assembly. We tried several methods, and settled on an approach based on genome-guided Trinity that gave the best results as judged by the markedly improved BUSCO scores (with 88% complete; Table 3). Indeed, following this correction the genome scored substantially higher than the WTSI reference, and close to the score of 91% complete BUSCO genes for the de novo assembly from the same RNA-seq reads.
This indicates that the RNA-seq reads cover a substantial fraction of the transcriptome, and that the genome also has excellent coverage of most of the expressed N. brasiliensis genes. There was nonetheless an elevated proportion of fragmented USCOs (see below). We also noted a high proportion of duplicated USCOs. Inspection revealed that some of these were bone fide lineage-specific expansions. For example, the analysis uncovered 3 distinct loci encoding isoforms of fructose 1,6-bisphosphatase (PFAM: PF00316), as predicted also from the WTSI assembly. Pairs of USCOs were also found on homologous contigs. There were 4 such examples in the 12 "heterozygous branch" sub-graphs alone; this presumably reflects haplotypic variants.
While generating a complete high-quality annotation was beyond the scope of this study, we made use of expert knowledge regarding Carbohydrate-Active Enzymes (CAZymes) to provide a complementary insight into the predicted genes compared to the WTSI set [21]. Of the 62 different domain architectures among the 158 well-predicted CAZYme proteins from our assembly, only 36 were represented among the set of 96 such proteins in the WTSI reference proteome. Our assembly included a further 16 domain architectures represented among proteins that were flagged as being incorrectly predicted (N-or C-terminal fragments; (Table   S1). Manual inspection revealed that in most cases, this was a consequence of the presence of non-tandem repeats (of which predicted transposable elements were a subset), where the repetition created ambiguity for transcript assembly (Figure 5). These structures, that also contributed to the presence of fragmented USCOs mentioned above, were even more of a problem for gene prediction in the WTSI assembly; they not only contributed to the fragmentation of USCOs and CAZymes (Table S1), but also introduced contig breaks that could not be bridged in the absence of long reads.
With a draft genome in hand, we then returned to the assembly we had generated from a single worm. Whole-genome amplification methods can suffer from chimera formation and amplification bias. The proportion of WGA reads that mapped over 90% of their length to the WGA assembly was lower than that of WGA reads mapping to the final assembly (41.9% vs 53.3%; Table 4), with a similar distribution of mapped read lengths, indicating that chimerism was not a significant problem here. Since the WGA reads that only mapped to the final assembly (in regions of low sequence coverage) represented 11% of the total number, sequencing depth was not a major limiting factor. As the stringent pairwise mapping also indicated that the WGA assembly captured a third (32.4%) of the predicted genome, it appears that genome amplification was indeed biased; future attempts should use alternate amplification (e.g. primer-free) approaches.
Together, our results support the conclusion that a de novo assembly of a high quality can be obtained using only long reads, even from a heterogeneous population, using a very modest sequencing depth (24X after trimming and correction). They also reveal the need for further improvements in resolving ambiguous contig architectures and in transcript-directed gene structure prediction. Nevertheless, since haplotypic variation could be detected even without RNA-seq-directed sequence correction, they clearly show the potential for using this approach to profile parasite populations, opening the way for detailed molecular epidemiological studies.

Canu 1.5
Changes to Canu relative to v1.4 are documented in the v1.5 release notes (available at https://github.com/marbl/canu/releases/tag/v1.5). Briefly, one major change was a switch in the alignment algorithm used to generate consensus for the corrected reads and final contigs. This improved the corrected reads by slightly increasing their identity and splitting fewer reads. Canu also started using the raw read overlaps to estimate corrected read lengths and used that information to inform its selection of reads for correction. As with all such analysis packages, bugs continue to be identified and corrected. The first release of v1.5, for example, included a bug that erroneously removed heterozygous edges such that in the type of sub-graphs shown in Figure 4A, the results presented would potentially correspond to lower bounds of complexity.
This and other issues have been corrected in recent development versions of Canu and the latest public release.

Nematode cultures
The N. brasiliensis strain used in this study was originally sourced from Lindsey Dent (University

DNA Preparation
Five samples of 25 mg each of frozen worms, cultured, harvested and purified as previously described [22] were sent to Simon Mayes (Oxford Nanopore Technologies) and subject to different methods of sample preparation (see Table S2).

Direct lysis, DNeasy:
Intact worms were added without prior disruption directly into Qiagen buffer G2 (30 mins at 56°C). The lysate was processed using a Qiagen DNeasy kit (ATL / AL, 30 min at 56°C). This yielded 700 ng DNA, 17700 ng RNA. 4. Pipette squashed, G2, Tip20: Worms were mashed against the side of the sample tube using a pipette tip. The cells then underwent chemical lysis in the Qiagen buffer G2 (120 min at 56°C) and lysate was purified using the Qiatip 20 anion exchange column. The DNA was fragmented using a Covaris G-tube prior to library preparation. This yielded 1100ng DNA.
All samples were subsequently processed using the ONT 1D Ligation Sequencing Kit (SQK-LSK108), ligating the ONT adapter mix onto end-prepped and dA-tailed DNA. Each prepared library was loaded onto a different MinION flow cell for sequencing. This is the WGA assembly presented in Figure 1B.

Whole-Genome Amplification and Assembly Experimentation
The WGA reads were then re-called using the updated basecaller (Albacore 1. In order to define the region of raw nanopore sequences for adapter exclusion, the >10k reads were mapped to 50M reads that had been generated by the Wellcome Trust Sanger Institute and that had been used for the existing WTSI assembly: An alternative, less-stringent overlap was also attempted (with trimReadsCoverage=2), but resulted in a less complete assembly. The results of analysis surrounding the different assemblies suggested that the default Canu assembly process of correction, trimming, then assembly produced the best outcome, namely the WGA assembly presented in Figure 1C.

Genome Assembly
Raw reads and called FASTQ files were obtained from ONT (called by workflow "1D RNN Basecalling 450bps FastQ v1.121") following sequencing. The raw reads were re-called using Bowtie2 was used in local mode to map RNA-seq reads to the assembled genome contigs: This produced the final assembly described in the paper. At this stage, we had an assembled genome, but validation of the genome was difficult. We decided to carry out a draft genomeguided transcriptome assembly with Trinity. To evalaute the completeness of the genome, we focused on expressed genes and used a set of Illumina RNA-seq reads to perform a genomeguided transcriptome assembly using Trinity.
The RNA-seq reads were remapped to the corrected assembly for genome-guided Trinity: The assembly that Trinity generated had similar completeness (as measured by BUSCO) to a de novo assembly generated using the same RNA-seq reads (see Table 3). The assembly was, however, very large, with over 350k contigs (see Table 1), most likely due to isoform fragments being included in the transcriptome. We carried out additional filtering steps, reducing the number of assembled contigs while maintaining similar BUSCO completeness scores.

Transcript Filtering
The estimated numbers of mapped RNA-seq reads (as predicted by Salmon [DOI 10.1038/nmeth.4197]) were used to filter transcripts, because the genome-guided assembly was based on RNA-seq expression. RNA-seq reads were mapped to the Bowtie2/Pilon genome-guided assembly, and a threshold of 50 reads for true expression of complete BUSCO genes was chosen for filtering transcripts from the genome-guided Trinity transcriptome. The longest ORF from each transcript was extracted to generate a protein sequence for further collapsing using cd-hit to remove protein sequences that had at least 98% identity to a longer protein, leaving 56980 transcripts. Each of these steps are outlined below.
The RNA-seq reads were mapped to the Trinity-generated transcripts using Salmon : The longest isoform for each gene was identified, producing an isoform-collapsed transcriptome subset, which was clustered at the protein level by CDHIT at 90% identity:    Figure 2. Analysis of repeat sequences within different assemblies. SATFIND [18] was used on contigs > 2.5kb. The total length of each region of repeated DNA sequence is plotted against the repeat's unit length, for the assemblies from DNA amplified from a single worm (WGA), using Canu v1.4 (A) or Canu v1.5 (B), for the WTSI genome assembly (C) and our final assembly (D). The orange and red lines are at 150 bp (typical maximum read length for an Illumina HiSeq run) and 650 bp, respectively. Any VeCTRs with a unit length longer than 150 bp would not be identifiable as repetitive sequence on an Illumina sequencer. Any VeCTRs with a region length longer than 650 bp will be collapsed into a shorter region if only non-mate-paired 150 bp reads are used for the assembly. (E) Alignment of the region corresponding to the longest repeat unit length in (D), with each base represented as a colored line. Although the resolution of repeats is greatly improved compared to the WTSI assembly, such sequences still present a challenge for assembly, as evidenced by the fact that the 5' end of this sequence corresponds to a contig end (tig00023164; coordinates on the left). The color code is as in Figure 1A.  Table S2). The curve for the sequences from the DNA amplified from a single worm (WGA) is shown for comparison.

Figure 4. Classification of sub-graphs and identification of a haplotype signature. (A)
Bandage plots for the 28 simple GFA sub-graphs made of 3 contigs. The box on the right is an enlarged view of a "heterozygous branch" sub-graph with a total length of 451 kb. (B) Dot plots of LAST all-against-all minimum-distance sequence comparisons between the 3 constituent contigs, delineated by the gray lines, for representatives for each of the sub-graph classes. The sum of the contig lengths is indicated. The 451 kb sub-graph is that enlarged in (A). (C) Multiple alignment of contig sequences from this sub-graph and corresponding region of the DNA reads that map to them, 300 bp either side of the defining deletion. The color code is as in Figure 1A.