Canu 1.5
Changes to Canu relative to v1.4 are documented in the v1.5 release notes (available at https://github.com/marbl/canu/releases/tag/v1.5). Briefly, one major change was a switch in the alignment algorithm used to generate consensus for the corrected reads and final contigs. This improved the corrected reads by slightly increasing their identity and splitting fewer reads. Canu also started using the raw read overlaps to estimate corrected read lengths and used that information to inform its selection of reads for correction. As with all such analysis packages, bugs continue to be identified and corrected. The first release of v1.5, for example, included a bug that erroneously removed heterozygous edges such that in the type of subgraphs shown in Fig. 4a, the results presented would potentially correspond to lower bounds of complexity. This and other issues have been corrected in recent development versions of Canu and the latest public release.
Nematode cultures
The N. brasiliensis strain used in this study was originally sourced from Lindsey Dent (University of Adelaide) and has been maintained at the Malaghan Institute for 22 years by serial passage through female Lewis rats [31]. This work was overseen and approved by the Animal Ethics Committee of Victoria University of Wellington. The strain used for the WTSI reference genome is not fully documented. As far as can be established, it derives from a line that was maintained serially at the National Institute for Medical Research at Mill Hill by Bridget Ogilvie from the early 1970s, and then by Rick Maizel at Imperial College. Murray Selkirk continued to maintain the strain at Imperial after R. Maizel moved to the University of Edinburgh, where he established parallel lines. These Edinburgh lines were supplemented on a number of occasions with cultures from Imperial College. The standard cycle involved injection of 4000–6000 L3 larvae into Sprague–Dawley rats, but was otherwise similar to that used at the Malaghan (R. Maizel, personal communication).
DNA preparation
Five samples of 25 mg each of frozen worms, cultured, harvested and purified as previously described [31], were sent to Simon Mayes (ONT) and subjected to different methods of sample preparation (see Additional file 4: Table S2):
1. Sonication, DNeasy: Worms were disrupted by sonication and then processed using a Qiagen DNeasy kit (ATL/AL, 30 min at 56 °C). This yielded 900 ng of DNA and 1800 ng of RNA.
2. Sonication, G2, Tip20: Worms were disrupted by sonication, then lysed using Qiagen buffer G2 (30 min at 56 °C). The lysate was purified using a Qiatip 20 anion exchange column. This yielded 800 ng of DNA.
3. Direct lysis, DNeasy: Intact worms were added directly into Qiagen buffer G2 without prior disruption (30 min at 56 °C). The lysate was processed using a Qiagen DNeasy kit (ATL/AL, 30 min at 56 °C). This yielded 700 ng of DNA and 17,700 ng of RNA.
4. Pipette squashed, G2, Tip20: Worms were mashed against the side of the sample tube using a pipette tip. The cells then underwent chemical lysis in Qiagen buffer G2 (120 min at 56 °C), and the lysate was purified using a Qiatip 20 anion exchange column. The DNA was fragmented using a Covaris g-TUBE prior to library preparation. This yielded 1100 ng of DNA.
All samples were subsequently processed using the ONT 1D Ligation Sequencing Kit (SQK-LSK108), ligating the ONT adapter mix onto end-prepped and dA-tailed DNA. Each prepared library was loaded onto a different MinION flow cell for sequencing.
Genome assembly
Only single sequencing runs were performed for each sample. The MinION is a random-access sequencer, and our previous detailed research on the N. brasiliensis mitochondrial genome [31] and current research on mouse cDNA (DE, unpublished) have demonstrated that coverage is surprisingly uniform. If the same sample were sequenced multiple times, the precise reads obtained would differ, but the accumulated experience of the many labs using the MinION shows that this would not significantly change the genomic distribution of reads, nor the systematic errors in base-calling and consensus assembly. Raw reads and called FASTQ files were obtained from ONT following sequencing (called by workflow “1D RNN Basecalling 450 bps FastQ v1.121”); base-calling was performed remotely at the time of sequencing by Metrichor. The specific base-calling version ID was reported as:
- chimaera version = 1.23.4
- dragonet version = 1.23.0
- event_detection = Analyses/EventDetection_000
- name = ONT Sequencing Workflow
- segmentation = Analyses/Segment_Linear_000
- time_stamp = 2017-Jan-21 07:55:13
The raw reads were re-called using Albacore 1.1.0, which implements a time-domain correction (transducer) for homopolymeric regions in the called sequence; a similar correction is available in the open-source Scrappie base caller (https://github.com/nanoporetech/scrappie):
## re-call each run directory with Albacore 1.1.0
for x in [CFED]_??????; do echo ${x}; read_fast5_basecaller.py -t 6 -i ${x} -s called_${x} -o fastq -c r94_450bps_linear.cfg; done
Illumina reads for the WTSI genome assembly were retrieved from the Sequence Read Archive (accession ERR063640). These reads were processed into k-mer counts using Jellyfish v2.2.6 [32], and the resulting histogram was used in GenomeScope (see below):
jellyfish count -C --bf-size 20G -t 6 -s 2G -m 21 <(zcat ERR063640_1P.fastq.gz) <(zcat ERR063640_2P.fastq.gz) -o ERR063640_mer_counts.jf
jellyfish histo ERR063640_mer_counts.jf > ERR063640_mer_counts.histo
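GenomeScope was run via its web interface (see below); the same model can also be fitted locally with the genomescope.R script from the GenomeScope repository. A minimal sketch (our addition; the 100 bp read length here is our assumption, not taken from the original methods):
## hedged: local GenomeScope fit on the 21-mer histogram (read length assumed 100 bp)
Rscript genomescope.R ERR063640_mer_counts.histo 21 100 genomescope_ERR063640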
Using the FASTQ files, rather than just reads from the called pass bin, allowed the maximum amount of sequence information to be recovered. Reads were trimmed by 65 bp at each end to exclude adapters (see below), then Canu v1.5 was run with default parameters. The genome size was estimated at 205 Mb using GenomeScope (qb.cshl.edu/genomescope/) and the WTSI Illumina reads. Long-read sequencing of a heterogeneous population would be expected to yield a larger assembled genome size due to the separation of haplotypes, so we made a conservative estimate of 300 Mb. The assumed genome size alters how many reads are selected for the final Canu assembly: if the coverage of corrected reads is greater than 40×, only the longest reads are used (up to 40× coverage). In our case, the coverage was below 40× regardless of the parameters used, so this choice would not be expected to alter the outcome (a rough coverage check is sketched after the commands below). Within Canu, assembly parameters are adjusted when read coverage is below 20× (e.g. corMinCoverage=0), as previously described [33]:
## trim reads
pv called_[CFED]_*_albacore_1.1.0.fq.gz | zcat | ~/scripts/fastx-fetch.pl -t 65 | gzip > called_CFED_65bptrim_albacore_1.1.0.fq.gz
## run Canu
~/install/canu/canu-1.5/Linux-amd64/bin/canu -nanopore-raw called_CFED_65bptrim_albacore_1.1.0.fq.gz -p Nb_ONTCFED_65bpTrim_t1 -d Nb_ONTCFED_65bpTrim_t1 genomeSize=300m
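As a rough sanity check (our addition, not part of the original pipeline), the input coverage implied by the assumed 300 Mb genome size can be estimated directly from the trimmed reads:
## hedged sketch: total called bases divided by the assumed 300 Mb genome size
zcat called_CFED_65bptrim_albacore_1.1.0.fq.gz | awk 'NR % 4 == 2 { bases += length($0) } END { printf "%.1fx\n", bases / 3e8 }'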
Bowtie2 was used in local mode to map RNA-seq reads to the assembled genome contigs:
bowtie2 -p 10 --local -x Nb_ONTCFED_65bpTrim_t1.contigs.fasta -1 ../1563-all_R1_trimmed.fastq.gz -2 ../1563-all_R2_trimmed.fastq.gz | samtools sort > 1563_vs_uncorrected_NOCFED.bam
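Note that Bowtie2 requires a prebuilt index with the basename given to -x, and Pilon (below) requires an indexed BAM; neither step is shown in the original pipeline. A minimal sketch:
## build the Bowtie2 index assumed above, and the BAM index Pilon will need
bowtie2-build Nb_ONTCFED_65bpTrim_t1.contigs.fasta Nb_ONTCFED_65bpTrim_t1.contigs.fasta
samtools index 1563_vs_uncorrected_NOCFED.bam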
Pilon was used to correct the assembly based on the RNA-seq mapping, with structural reassembly disabled (in case it collapsed introns):
java -Xmx40G -jar ~/install/pilon/pilon-1.22.jar --genome Nb_ONTCFED_65bpTrim_t1.contigs.fasta --frags 1563_vs_uncorrected_NOCFED.bam --fix snps,indels --output BT2Pilon_NOCFED --gapmargin 1 --mingap 10000000 --threads 10 --changes 2>BT2Pilon_NOCFED.stderr.txt 1>BT2Pilon_NOCFED.stdout.txt
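Because --changes was supplied, Pilon also writes a per-edit report alongside the corrected FASTA; a quick tally of applied edits (our addition, assuming Pilon's standard <output>.changes naming):
## hedged: count the SNP/indel edits Pilon applied (one change per line)
wc -l < BT2Pilon_NOCFED.changes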
Contigs that were entirely composed of homopolymer sequences were identified using grep and removed from the assembly:
## identify homopolymer (and binary division-rich) regions
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl > hplength_BT2Pilon_NOCFED.txt
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl -mode YR > hplength_YR_BT2Pilon_NOCFED.txt
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl -mode SW > hplength_SW_BT2Pilon_NOCFED.txt
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-hplength.pl -mode MK > hplength_MK_BT2Pilon_NOCFED.txt
## example grep hunt for repeated sequence
cat BT2Pilon_NOCFED.fasta | grep -e '^[AT]\{80\}' -e '^>' | grep --no-group-separator -B 1 '^[AT]\{80\}' | ~/scripts/fastx-length.pl
## exclude contigs and sort by length
pv BT2Pilon_NOCFED.fasta | ~/scripts/fastx-fetch.pl -v tig00010453 tig00024413 tig00024414 tig00023947 | ~/scripts/fastx-sort.pl -l > Nb_ONTCFED_65bpTrim_t1.contigs.hpcleaned.fasta
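As a quick check on the cleaned assembly (our addition, not part of the original pipeline), contig counts and total length can be summarised from a samtools index:
## hedged: summarise the cleaned assembly using its faidx index
samtools faidx Nb_ONTCFED_65bpTrim_t1.contigs.hpcleaned.fasta
awk '{ total += $2 } END { print NR, "contigs,", total, "bases" }' Nb_ONTCFED_65bpTrim_t1.contigs.hpcleaned.fasta.fai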
This produced the final assembly described in the paper. At this stage we had an assembled genome, but validating it was difficult. To evaluate the completeness of the genome, we focused on expressed genes and used a set of Illumina RNA-seq reads to perform a genome-guided transcriptome assembly with Trinity.
The RNA-seq reads were remapped to the corrected assembly for genome-guided Trinity:
bowtie2 -p 10 -t --local --score-min G,20,8 -x BT2Pilon_NOCFED_hpcleaned.fasta --rf -X 15000 -1 ../1563-all_R1_trimmed.fastq.gz -2 <(pv ../1563-all_R2_trimmed.fastq.gz | zcat) 2>bowtie2_1563_vs_BNOCFED_hp.summary.txt | samtools sort > bowtie2_1563_vs_BNOCFED_hp.bam
## Trinity assembly; assume introns can be up to 15 kb in length
~/install/trinity/trinityrnaseq-Trinity-v2.4.0/Trinity --CPU 10 --genome_guided_bam bowtie2_1563_vs_BNOCFED_hp.bam --genome_guided_max_intron 15000 --max_memory 40G --SS_lib_type RF --output trinity_BNOCFED
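The transcript count mentioned below can be verified directly (our addition; Trinity-GG.fasta is Trinity's default output name for genome-guided assemblies):
## hedged: count assembled transcripts in the genome-guided Trinity output
grep -c '^>' trinity_BNOCFED/Trinity-GG.fasta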
The assembly that Trinity generated had similar completeness (as measured by BUSCO) to a de novo assembly generated using the same RNA-seq reads (see Table 3). The assembly was, however, very large, with over 350 k contigs (see Table 1), most likely due to isoform fragments being included in the transcriptome. We carried out additional filtering steps, reducing the number of assembled contigs while maintaining similar BUSCO completeness scores.
Transcript filtering
Our RNA-seq data is publicly available from the European Nucleotide Archive of the European Bioinformatics Institute (https://www.ebi.ac.uk/ena/data/view/ERS1809079), where details of the samples can be found.
The estimated numbers of mapped RNA-seq reads (as predicted by Salmon [34]) were used to filter transcripts, because the genome-guided assembly was based on RNA-seq expression. Salmon uses a pseudo-alignment approach and was chosen because it corrects for reads mapping to the same sequence on different isoforms and/or genes, whereas alternative alignment-based approaches are designed around the expectation of a relatively complete and non-repetitive genome. Since this analysis was only used to estimate the expression distribution, however, the specific mapping strategy is not expected to greatly influence the outcome. RNA-seq reads were mapped to the Bowtie2/Pilon genome-guided assembly, and a threshold of 50 reads for true expression of complete BUSCO genes was chosen for filtering transcripts from the genome-guided Trinity transcriptome. The longest open reading frame from each transcript was extracted to generate a protein sequence, which was then collapsed using cdhit to remove protein sequences with at least 98% identity to a longer protein, leaving 56,980 transcripts. Each of these steps is outlined below.
The RNA-seq reads were mapped to the Trinity-generated transcripts using Salmon:
## create Salmon index
~/install/salmon/Salmon-0.8.2_linux_x86_64/bin/salmon index -t Trinity-BNOCFED.fasta -i Trinity-BNOCFED.fasta.sai
## quantify transcript coverage with Salmon
~/install/salmon/Salmon-0.8.2_linux_x86_64/bin/salmon quant -i Trinity-BNOCFED.fasta.sai -1 ../../1563-all_R1_trimmed.fastq.gz -2 ../../1563-all_R2_trimmed.fastq.gz -p 10 -o quant/1563-all_quant -l A
The expression of BUSCO genes was used to set a credible signal cutoff (Additional file 7: Figure S4). A threshold of 50 mapped reads was chosen, as 98% of complete BUSCO sequences had more than 50 mapped reads. The transcripts were subsetted based on their expression scores:
## Subset transcripts based on a threshold of 50 counts
pv Trinity-GG.fasta | ~/scripts/fastx-fetch.pl -i HighCount_Transcripts_Num50.txt > HighCount50_TBNOCFED.fasta
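The HighCount_Transcripts_Num50.txt list used above is not generated by any command shown; a minimal sketch of how it could be derived from Salmon's quant.sf output (whose columns are Name, Length, EffectiveLength, TPM and NumReads):
## hedged sketch: names of transcripts with at least 50 estimated mapped reads
awk 'NR > 1 && $5 >= 50 { print $1 }' quant/1563-all_quant/quant.sf > HighCount_Transcripts_Num50.txt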
The longest Met to Stop open reading frame was identified for each transcript for protein-based clustering with cdhit:
pv HighCount50_TBNOCFED.fasta | getorf -find 1 -noreverse -sequence /dev/stdin -outseq /dev/stdout | ~/scripts/fastx-isofilter.pl -o > longest_MetStopORF_HC50_TBNOCFED.fasta
## run cdhit
cdhit -T 10 -c 0.98 -i longest_MetStopORF_HC50_TBNOCFED.fasta -o cdhit_0.98_LMOHC50_TBNOCFED.prot.fasta
## identify names of longest representative proteins for each cluster
grep '^>' cdhit_0.98_LMOHC50_TBNOCFED.prot.fasta | perl -pe 's/^>//' > cdhit_0.98_LMOHC50_TBNOCFED.prot.names.txt
## fetch transcripts associated with the representative proteins
pv HighCount50_TBNOCFED.fasta | ~/scripts/fastx-fetch.pl -i cdhit_0.98_LMOHC50_TBNOCFED.prot.names.txt > cdhit_0.98_LMOHC50_TBNOCFED.tran.fasta
The longest isoform for each gene was identified, producing an isoform-collapsed transcriptome subset, which was clustered at the protein level by cdhit at 90% identity:
## extract longest protein for each transcript
cat cdhit_0.98_LMOHC50_TBNOCFED.prot.fasta | ~/scripts/fastx-isofilter.pl > LI_CD98LMOHC50_TBNOCFED.prot.fasta
## cluster at 90% identity via cdhit (-c 0.9 is cdhit's default, made explicit here)
cdhit -c 0.9 -i LI_CD98LMOHC50_TBNOCFED.prot.fasta -o CDLI_CD98LMOHC50_TBNOCFED.prot.fasta
## find associated transcripts
grep '^>' CDLI_CD98LMOHC50_TBNOCFED.prot.fasta | perl -pe 's/^>//' > CDLI_CD98LMOHC50_TBNOCFED.names.txt
cat cdhit_0.98_LMOHC50_TBNOCFED.tran.fasta | ~/scripts/fastx-fetch.pl -i CDLI_CD98LMOHC50_TBNOCFED.names.txt > CDLI_CD98LMOHC50_TBNOCFED.tran.fasta
BUSCO was run on the collapsed transcripts to provide one measure of genome completeness:
## run BUSCO on isoform-collapsed transcripts in long genome mode
python ~/install/busco/BUSCO.py -i ../CDLI_CD98LMOHC50_TBNOCFED.tran.fasta -o BUSCO_longgeno_CDLI_CD98LMOHC50_TBNOCFED_nematodes -l \
~/install/busco/nematoda_odb9 -m geno -c 10 --long
## run BUSCO on isoform-collapsed transcripts in transcript mode
python ~/install/busco/BUSCO.py -i ../CDLI_CD98LMOHC50_TBNOCFED.tran.fasta -o BUSCO_tran_CDLI_CD98LMOHC50_TBNOCFED_nematodes -l \
~/install/busco/nematoda_odb9 -m tran -c 10
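BUSCO writes a one-line completeness summary for each run; it can be pulled out directly (our addition, assuming BUSCO v3's run_<name>/short_summary_<name>.txt layout):
## hedged: show the C/S/D/F/M summary line from each BUSCO run
grep 'C:' run_BUSCO_*_CDLI_CD98LMOHC50_TBNOCFED_nematodes/short_summary_*.txt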
This filtered set had very similar BUSCO scores to the original genome-guided Trinity assembly, despite the number of transcripts being reduced to about one-sixth of the original count, and the total transcriptome length to less than one-third of its original size. The number of duplicated BUSCO genes in the filtered set suggests that it could probably be made smaller still with looser cd-hit-est clustering, although such a reduction risks clustering distinct gene copies together.
The missing USCOs were remapped to our assembled genome to determine whether any of the corresponding genes were actually present in the genome:
blastx -num_threads 10 -query <(pv /mnt/gg_nanopore/gringer/ONT_Jan17/Nb_ONTCFED_65bpTrim_t1/Nb_ONTCFED_65bpTrim_t1.contigs.fasta) -db missing_busco_list_intersectionBUSCO_nematodes.fasta -outfmt 6 -evalue 1e-3 > BLAST_NOCFED_vs_BUSCO_missing_intersection.tsv
Hits for the same contig/USCO combination were merged using a custom script, and sequence extracted from the assembly to cover the entire matched region:
(~/scripts/blastx_boundaries.pl BLAST_NOCFED_vs_BUSCO_missing_intersection.tsv | while read a b; do samtools faidx /mnt/gg_nanopore/gringer/ONT_Jan17/Nb_ONTCFED_65bpTrim_t1/Nb_ONTCFED_65bpTrim_t1.contigs.fasta ${b} | perl -pe "s/^>/>${a}-/"; done) > BLAST_NOCFED_vs_BUSCO_missing_intersection.fasta
Whole-genome amplification and assembly
DNA extracted from a single adult worm was amplified using a Qiagen REPLI-g Midi kit. Raw reads and called FASTQ files were obtained from ONT (called by workflow “1D RNN Basecalling 450 bps FastQ v1.121”) following sequencing. Reads were filtered to exclude contaminating DNA using OneCodex (http://onecodex.com); reads that mapped to any DNA in the OneCodex database were excluded from the read set:
## keep read names reported as unclassified (third field 0) in BOTH the RefSeq
## and OneCodex databases (uniq -d retains names present in both lists)
(zcat OneCodex_RefSeq_132394.fastq.gz.results.tsv.gz | awk '{if($3 == 0){print $1}}'; zcat OneCodex_OCD_132394.fastq.gz.results.tsv.gz | awk '{if($3 == 0){print $1}}') | sort | uniq -d | gzip > OCunmapped_names_132394.txt.gz
## fetch those reads, then drop any that also appear in the ONT-mapped name list
pv 132394.fastq.gz | zcat | ~/scripts/fastx-fetch.pl -i OCunmapped_names_132394.txt.gz | ~/scripts/fastx-fetch.pl -v -i ONTmapped_names_132394.txt.gz | gzip > OCunmapped_ONTunmapped_132394.fastq.gz
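A quick count of the reads surviving the contaminant filter (our addition):
## hedged: count reads remaining after OneCodex/ONT filtering (4 lines per read)
zcat OCunmapped_ONTunmapped_132394.fastq.gz | awk 'END { print NR / 4, "reads" }'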
Filtered reads from the WGA sample were called using Albacore 1.1.0:
read_fast5_basecaller.py -o fastq -i A_132394 -t 10 -s called_A_132394 -c r94_450bps_linear.cfg
Reads longer than 10 kb were extracted for subsequent analysis:
pv called_A_132394_albacore_1.1.0.fq.gz | zcat | ~/scripts/fastx-fetch.pl --min 10000 | gzip > 10k_called_A_132394_albacore_1.1.0.fq.gz
To define the adapter-containing regions of raw nanopore sequences for exclusion, 50 M Illumina reads that had been generated by WTSI and used for the existing WTSI assembly were mapped to the >10 kb nanopore reads:
bowtie2 -p 10 --no-unal --no-mixed --local -x 10k_called_A_132394_albacore_1.1.0.fa -1 <(pv ~/bioinf/MIMR-2017-Jan-01-GBIS/GLG/ONT/aws/Sampled_50M_ERR063640.R1.fq.gz | zcat) -2 ~/bioinf/MIMR-2017-Jan-01-GBIS/GLG/ONT/aws/Sampled_50M_ERR063640.R2.fq.gz | samtools sort > WTSI_Sampled_50M_vs_10k_called_A_132394.bam
## find position of first mapped Illumina read for each nanopore read
pv WTSI_Sampled_50M_vs_10k_called_A_132394.bam | samtools view - | awk '{print $3,$4}' | sort -k 1,1 -k 2,2n | sort -u -k 1,1 | gzip > firstHit_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz
## determine position of last mapped Illumina read
pv WTSI_Sampled_50M_vs_10k_called_A_132394.bam | samtools view - | awk '{print $3,$4}' | sort -k 1,1 -k 2,2rn | sort -u -k 1,1 | gzip > lastHit_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz
## count positions of first reads
zcat firstHit_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz | awk '{print $2}' | sort -n | uniq -c | gzip > firstBase_counts_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz
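The resulting distribution can then be inspected for the adapter offset (our addition; the counts file is uniq -c output, i.e. count followed by position):
## hedged: show the most frequent first-hit positions (expect a peak at 28-32 bases)
zcat firstBase_counts_WTSI_Sampled_50M_vs_10k_called_A_132394.txt.gz | sort -k1,1rn | head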
When we mapped the 5' ends of Illumina reads to the start of nanopore reads with Bowtie2, there was a common register shift of 28–32 bases, corresponding to the presence of adapter sequences in the nanopore reads. Reads were conservatively trimmed by 65 bases at each end to exclude adapters:
pv called_A_132394_albacore_1.1.0.fq.gz | zcat | ~/scripts/fastx-fetch.pl --min 1130 --max 1000000 | \
~/scripts/fastx-fetch.pl -t 65 | gzip > 65bpTrim_called_A_132394_albacore_1.1.0.fq.gz
Canu v1.5 was used to assemble the trimmed reads. Assembly was carried out in stages (producing an assembly at each stage) to determine whether particular stages were redundant:
## attempt assembly-only with Canu v1.5
~/install/canu/canu-1.5/Linux-amd64/bin/canu -assemble -nanopore-raw 65bpTrim_called_A_132394_albacore_1.1.0.fq.gz -p Nb_ONTA_65bpTrim_t1 -d Nb_ONTA_65bpTrim_t1 genomeSize=300m
## attempt assembly + correction
~/install/canu/canu-1.5/Linux-amd64/bin/canu -assemble -nanopore-corrected 65bpTrim_called_A_132394_albacore_1.1.0.fq.gz -p Nb_ONTA_65bpTrim_t2 -d Nb_ONTA_65bpTrim_t2 -correct genomeSize=300m
## attempt stringent trim with corrected reads
~/install/canu/canu-1.5/Linux-amd64/bin/canu -trim-assemble -p Nb_ONTA_65bpTrim_t3 -d Nb_ONTA_65bpTrim_t3 genomeSize=300m -nanopore-corrected Nb_ONTA_65bpTrim_t2/Nb_ONTA_65bpTrim_t2.correctedReads.fasta.gz trimReadsOverlap=500 trimReadsCoverage=5 obtErrorRate=0.25
An alternative, less stringent overlap trimming was also attempted (with trimReadsCoverage=2) but resulted in a less complete assembly. Analysis of the different assemblies suggested that the default Canu process of correction, trimming, then assembly produced the best outcome, namely the WGA assembly presented in Table 1.
Canu includes a normalization step that drops the shortest reads if coverage exceeds a predefined threshold, but that threshold was not triggered for our assembly. We did not attempt k-mer-based normalization, since it would not resolve incomplete genome sequence coverage: selective amplification combined with random sampling by the sequencer means that poorly amplified regions were unlikely to be present in the sequenced reads at all. We did notice that an assembly combining both WGA and unamplified DNA samples produced a more fragmented genome, which is why a combined assembly is not presented here. It is possible that k-mer normalization of the WGA reads might improve such a hybrid assembly.