Combined use of Oxford Nanopore and Illumina sequencing yields insights into soybean structural variation biology

Background
Structural variants (SVs), including deletions, insertions, duplications, and inversions, are relatively long genomic variations implicated in a diverse range of processes, from human disease to ecology and evolution. Given their complex signatures, tendency to occur in repeated regions, and large size, discovering SVs from short reads is challenging compared to single-nucleotide variants. The increasing availability of long-read technologies has greatly facilitated SV discovery; however, these technologies remain too costly to apply routinely to population-level studies. Here, we combined short-read and long-read sequencing technologies to provide a comprehensive population-scale assessment of structural variation in a panel of Canadian soybean cultivars.

Results
We used Oxford Nanopore long-read sequencing data (~12× mean coverage) for 17 samples to both benchmark SV calls made from Illumina short-read data and predict SVs that were subsequently genotyped in a population of 102 samples using Illumina data. Benchmarking results show that variants discovered using Oxford Nanopore can be accurately genotyped from the Illumina data. We first used the genotyped deletions and insertions for population genetics analyses and show that results are comparable to those based on single-nucleotide variants. We observe that the population frequency and distribution within the genome of deletions and insertions are constrained by the location of genes. Gene Ontology and PFAM domain enrichment analyses also confirm previous reports that genes harboring high-frequency deletions and insertions are enriched for functions in defense response. Finally, we discover polymorphic transposable elements among the deletions and insertions and report evidence of the recent activity of a Stowaway MITE.

Conclusions
We show that structural variants discovered using Oxford Nanopore data can be genotyped with high accuracy from Illumina data. Our results demonstrate that long-read and short-read sequencing technologies can be efficiently combined to enhance SV analysis in large populations, providing a reusable framework for their study in a wider range of samples and non-model species.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12915-022-01255-w.

1 Description and testing of the SV refinement pipeline

One of the main objectives of this project is to use Illumina data to genotype SVs that were originally discovered using Oxford Nanopore sequencing data. This would enable the use of Illumina data to carry out population-scale genotyping of variants that were initially discovered in a smaller set of samples sequenced with long-read technologies.
Oxford Nanopore data, however, has a high error rate. This makes it challenging to accurately describe breakpoints and sequence content (for insertions) and may negatively impact our ability to genotype the variants discovered with this technology using short reads.
In order to address this issue, we assembled a pipeline that can be used to refine the SVs discovered by Sniffles [24]. By refinement, we mean that the position of the breakpoints (start and end position for deletions, and single location for insertions) and the sequence content (for insertions) can be updated to improve the quality of the variants. The current version of the pipeline ignores SVs other than deletions or insertions due to difficulties in assembling and aligning them properly.
The pipeline is based purely on Oxford Nanopore data (i.e., it does not require Illumina data for polishing or error correction, for example) and could therefore potentially be used for SV refinement with any Oxford Nanopore or even PacBio dataset.

Details of the steps in the pipeline
The pipeline performs a local assembly of the Oxford Nanopore reads aligned to the region of interest and then uses AGE [40] to align this assembly to the appropriate region of the reference genome. This alignment is then used to update the breakpoints and inserted sequence of the variant if certain quality thresholds are met. This alignment pipeline is largely inspired by the guidelines suggested by the authors of AGE in their newer LongAGE paper [41] and by the authors of wtdbg2 [42], and consists of the following steps:

1. For every insertion or deletion, Oxford Nanopore reads overlapping the region of interest are extracted from the NGMLR [24] BAM file. The region of interest is a window spanning ± 200 nucleotides from the SV breakpoints (start and end positions for deletions, and single location for insertions).
Once the alignment is completed, the results are parsed and used to extract the location of the regions excised by AGE from both the assembled contig and the reference genome. In the case of a deletion, we expect a fragment of the reference genome to be excised, whereas in the case of an insertion, we expect a fragment of the assembled contig to be excised. From this information, the location of the excised fragment is transformed to reference-based coordinates and the inserted sequence is extracted from the assembled contig (using the reverse complement if applicable). The following metrics are then computed in order to compare the (refined) deletion/insertion suggested by the AGE alignment to the original (raw) one defined by Sniffles:

- For deletions, the reciprocal overlap between the raw deletion and its refined counterpart. The reciprocal overlap is the minimum of the two following values: the proportion of deletion 1 covered by deletion 2, and the proportion of deletion 2 covered by deletion 1. This overlap is computed using functions from the GenomicRanges R package [44].
- For insertions, the offset between the two insertions, i.e., the absolute physical distance in base pairs between the positions of the raw and refined insertions.
- For insertions, the Levenshtein (edit) distance between the two inserted sequences, as computed by the R function adist. From this distance, a distance ratio is computed, corresponding to the Levenshtein distance between the two insertions divided by the length of the longer insertion. This provides a relative measure of the difference in sequence content between the two insertions.

In addition to the metrics related to the excised fragments themselves, the percentage of identical nucleotides and the percentage of nucleotides affected by gaps in the aligned flanking regions are also extracted from the output of AGE.
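The three comparison metrics can be illustrated in Python (a minimal sketch of the definitions; the pipeline itself uses GenomicRanges and adist in R, and the function names here are ours):

```python
def reciprocal_overlap(start1, end1, start2, end2):
    """Minimum of the two mutual overlap proportions between two deletions."""
    overlap = max(0, min(end1, end2) - max(start1, start2))
    if overlap == 0:
        return 0.0
    return min(overlap / (end1 - start1), overlap / (end2 - start2))

def insertion_offset(pos1, pos2):
    """Absolute distance in bp between raw and refined insertion positions."""
    return abs(pos1 - pos2)

def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def distance_ratio(seq1, seq2):
    """Edit distance divided by the length of the longer inserted sequence."""
    return edit_distance(seq1, seq2) / max(len(seq1), len(seq2))
```

For example, a raw deletion at 100-200 and a refined one at 150-250 have a reciprocal overlap of 0.5 and would therefore sit exactly at the update threshold used for deletions.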
Note that SVs longer than 50 kb are excluded from the pipeline because of the high memory and computing time requirements of aligning such variants using AGE. Although a newer version of AGE called LongAGE [41], which requires less memory, has recently been released, we found it unsuitable for our needs due to cryptic error messages and slower speed.

Parameters determining whether SVs are updated or not
Our objective is to update the position of the breakpoints and the sequence content such that the updated SVs are closer to the actual sequence, and thus both easier to genotype from Illumina data and more representative of reality. However, we must also avoid updating a variant in a way that is inaccurate (as would occur if, for example, AGE identified a deletion or insertion different from the target one). Therefore, we established conditions to decide whether or not an SV should be updated. Based on analyses carried out during the development of the pipeline, we settled on the following thresholds:

- Reciprocal overlap ≥ 0.5 (for deletions)
- Offset ≤ 50 (for insertions)
- Edit distance ratio ≤ 0.5 (for insertions)
- Percent identity of the alignment ≥ 85
- Percentage of gaps in the assembly ≤ 15
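The update decision can be expressed as a simple predicate over the metrics described earlier (an illustrative sketch; the argument names are ours, not the pipeline's):

```python
def should_update(sv_type, *, pct_identity, pct_gaps,
                  recip_overlap=None, offset=None, dist_ratio=None):
    """Return True if a refined SV passes all quality thresholds
    and should replace the raw Sniffles call."""
    # Alignment-level thresholds apply to both SV types
    if pct_identity < 85 or pct_gaps > 15:
        return False
    if sv_type == "DEL":
        # Deletions are compared by reciprocal overlap only
        return recip_overlap is not None and recip_overlap >= 0.5
    if sv_type == "INS":
        # Insertions must be close in position AND in sequence content
        return (offset is not None and offset <= 50
                and dist_ratio is not None and dist_ratio <= 0.5)
    return False  # other SV types are not refined by the pipeline
```

When the predicate returns False, the original Sniffles representation of the variant is kept unchanged.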

Testing the effect of refined SVs on genotyping performance
The main objective of the SV refinement pipeline described above is to improve the genotyping of SVs with Illumina data. In order to verify whether the pipeline indeed improved genotyping, we compared the genotyping precision and sensitivity of raw and refined SVs using three different genotypers: BayesTyper, vg, and Paragraph.

Application of the pipeline
For these testing purposes, we applied the SV refinement pipeline described above to the first four samples for which we obtained Oxford Nanopore data. SVs discovered using Sniffles were filtered as described in the main text (notably, only SVs genotyped as homozygous were kept) and then used as input to the pipeline. Insertions and deletions were refined if they met the required thresholds, which were as described above, except that the minimum reciprocal overlap was set to 0.7 and the maximum distance ratio to 0.45. Table S10 shows the number of variants that could or could not be refined for each sample. For all four samples, a majority of SVs subjected to the pipeline met the conditions for refinement.
There are three main reasons why a variant may not be refined:

- The proposed refined variant did not meet one or several quality thresholds when compared to the raw variant.
- The size of the variant was > 50 kb.
- No assembly or alignment was produced by the pipeline.

In any of these cases, we simply keep the original representation of the SV as called by Sniffles.

Input SV datasets
In order to assess the effect of refining SVs on genotyping performance, we generated one input dataset of raw variants and another input dataset of refined variants.
In order to generate input datasets for genotyping, the SV sets obtained from the four samples were merged using SVmerge [45] with parameters -maxdist 15 -reldist 0.1 -relsizediff 0.1 -relshift 0.1 -seqspecific. For refined variants, the default random selection behavior of SVmerge when combining several variants together was modified to systematically favor variants that had been updated by the pipeline.
Following this, the VCF files were coordinate-sorted, normalized (using bcftools norm, [46]) and duplicate records were removed.
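The post-merge cleanup can be pictured as coordinate-sorting the records and dropping exact duplicates keyed on position and alleles (a schematic stand-in for the bcftools-based step, not the actual commands used):

```python
def sort_and_dedup(records):
    """records: iterable of (chrom, pos, ref, alt) tuples.
    Returns coordinate-sorted records with duplicate entries removed,
    mirroring the sort / normalize / deduplicate sequence applied to the VCFs."""
    seen = set()
    out = []
    for rec in sorted(records, key=lambda r: (r[0], r[1])):
        key = (rec[0], rec[1], rec[2], rec[3])
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```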

Genotyping methods
The raw and refined SV datasets were both used as input to three genotyping programs, BayesTyper, vg and Paragraph.
BayesTyper

Briefly, k-mers were counted from the adapter-trimmed fastq files using KMC [47] and a Bloom filter was applied using the bayesTyperTools makeBloom utility.
In parallel, bayesTyperTools combine was run to make the VCF files suitable for input to BayesTyper.
Genotyping was then carried out using bayesTyper cluster and bayesTyper genotype sequentially with default parameters except for the use of the mitochondrial and chloroplastic genome sequences as decoy sequences.
Precision-recall curves for BayesTyper were generated by varying the genotype quality (GQ) threshold necessary to make a call.
vg

Variation graphs were created from the VCF files using vg construct and then converted to PackedGraph format using vg convert -p. The graphs were then indexed prior to mapping using vg index -L -x to generate the XG index and vg index -g to generate the GCSA index. GCSA indexing was carried out on a version of the graph that had been pruned with vg prune.
Reads were mapped separately for each sample using vg map. These mappings were then used to call the SVs explicitly defined in the input VCF files using vg pack -Q5 and vg call -v.
Precision-recall curves for vg were generated by varying the genotype quality (GQ) threshold necessary to make a call.
Paragraph

VCF input files were prepared for input to Paragraph using methods similar to those described in the main text, i.e., variants located less than 500 nucleotides away from chromosome ends were removed and variants were padded with reference nucleotides where required. Samples were genotyped separately using the multigrmpy utility of Paragraph.
Precision-recall curves for Paragraph were generated by varying the minimum number of reads (DP) threshold necessary to make a call.
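For all three genotypers, the precision-recall curves follow the same logic: sweep a quality threshold (GQ or DP) and recompute precision and sensitivity at each cutoff. A schematic version, with exact genotype matching standing in for sveval's overlap-based comparison:

```python
def precision_recall_curve(calls, truth, thresholds):
    """calls: dict mapping variant ID -> (genotype, quality score).
    truth: dict mapping variant ID -> true genotype.
    Returns a list of (threshold, precision, sensitivity) tuples."""
    curve = []
    for t in thresholds:
        # Keep only calls whose quality meets the current threshold
        kept = {vid: gt for vid, (gt, q) in calls.items() if q >= t}
        tp = sum(1 for vid, gt in kept.items() if truth.get(vid) == gt)
        precision = tp / len(kept) if kept else 1.0
        sensitivity = tp / len(truth) if truth else 0.0
        curve.append((t, precision, sensitivity))
    return curve
```

Raising the threshold trades sensitivity for precision, which traces out the curve from right to left.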

Benchmarks: raw versus refined variants
Benchmarks were carried out using the sveval R package [30] following the methods described in the main text.
BayesTyper

Figure S26 shows the benchmarking results obtained for BayesTyper. Overall, the refined deletions resulted in higher sensitivity but similar precision. The benefits of SV refinement were obvious for larger deletions, whereas results were sometimes worse for deletions in the range of 100-1,000 nucleotides. The improvement brought by refinement is more obvious for insertions than for deletions, most likely because the refinement pipeline also improved the sequence content of the insertions. The poor overall sensitivity obtained with BayesTyper when genotyping Oxford Nanopore SVs is probably due to the small tolerance of this program to errors in SV breakpoint location or sequence content, as has been previously documented [30]. Nevertheless, the improvement in sensitivity obtained by using refined SVs suggests that the SV refinement pipeline indeed improves the quality of the variants.
vg

Figure S27 shows the benchmarking results obtained for vg. Refining the deletions does seem to have improved the results, although only very slightly. For insertions, however, the improvement resulting from refinement is obvious, both in terms of sensitivity and precision. Again, this suggests that the breakpoint refinement pipeline improved the accuracy of the sequence content of insertions, such that they could be more accurately genotyped by vg.
Paragraph

Figure S28 shows the benchmarking results obtained for Paragraph. Using refined variants results in a slight increase in sensitivity for Paragraph, both for deletions and smaller insertions (< 1,000 bp). The apparent drop in precision for insertions ≥ 1,000 bp is likely due to the fact that the Oxford Nanopore data represented a poor ground truth dataset for these samples in this size range, as explained in the main text. It is therefore expected that an increase in sensitivity from refining the variants will result in an apparent loss of precision in this size range.

Conclusion on the SV refinement pipeline
Overall, the results show that the breakpoint refinement pipeline results in higher genotyping sensitivity than using the raw variants as called by Sniffles. While the results are not immediately obvious when using Paragraph, the improvements obtained using BayesTyper and vg suggest that the quality of the SVs is indeed higher following breakpoint refinement. This is probably due to the high tolerance of Paragraph to errors in SV sequence (as documented in [29]) compared to the two other tools. However, even if the improvement in genotyping performance using Paragraph is not obvious, downstream analyses should still benefit from more accurate SV breakpoints and sequence.

Software availability
The SV refinement pipeline described here is available from https://github.com/malemay/breakpoint_refinement.

2 Detailed methods on the DNA transposable element analysis

2.1 Description of the analysis pipeline

To study potential DNA transposable element (TE) activity in more detail, we extracted the SVs corresponding to all DNA TEs in the SoyTEdb database [48] that were matched by at least 3 SVs. This represented 51 TEs in the database and 237 individual SVs.
For each of these SVs, we extracted the Oxford Nanopore reads at the location of the SV and assembled them individually for each sample following methods similar to those used for the breakpoint refinement pipeline. Briefly, reads overlapping positions ± 200 bp from the SV were used for assembly with wtdbg2 and polishing with wtpoa-cns after re-aligning the reads to the assembly with minimap2.
The assemblies were then used for multiple alignment across samples in order to identify TE boundaries as well as target site duplication (TSD) and terminal inverted repeat (TIR) sequences. We used the pairwiseAlignment function from the Bioconductor Biostrings package [49] to individually align all assemblies to a section of the reference genome spanning ± 500 bp from the putative TE insertion site. This pairwise alignment was done only to extract the part of the assemblies that was of interest.

We then used the aligned sequences obtained from the pairwise alignment as input for multiple alignment. Multiple alignment was done with the MAFFT [50] suite of tools using the command ginsi --reorder, which performs global multiple alignment.
We processed the multiple alignments so as to keep only high-quality alignments showing obvious TE insertion polymorphism. First, we manually checked all alignments and kept only those showing an insertion/deletion polymorphism of the expected length. Second, for a given multiple alignment, several of the sequences often showed very poor similarity to the reference genome, possibly because of issues with their assembly. We removed such sequences by filtering out those with less than 75% identity to the reference in the first and last 500 bp of the alignment (the nucleotides flanking the insertion/deletion site). Typically, only short TEs (< 500 bp, essentially MITEs) remained following these filtering steps because the longer ones were too challenging to assemble and align properly.
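The flank-identity filter can be sketched as follows (assuming pre-aligned, gap-padded sequences of equal length; function names and the gap character convention are ours):

```python
def flank_identity(aligned_seq, aligned_ref, flank=500):
    """Percent identity over the first and last `flank` columns of an
    alignment, i.e. the regions flanking the insertion/deletion site."""
    cols = list(zip(aligned_seq, aligned_ref))
    flanks = cols[:flank] + cols[-flank:]
    matches = sum(1 for a, b in flanks if a == b and a != "-")
    return 100 * matches / len(flanks)

def keep_sequence(aligned_seq, aligned_ref, min_identity=75):
    """Apply the 75% flank-identity threshold used to discard
    poorly assembled sequences."""
    return flank_identity(aligned_seq, aligned_ref) >= min_identity
```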
A total of 72 SV events representing 25 individual entries in the TE database remained following filtering. The sequences that remained typically differed only in the presence or absence of the TE sequence. Sequences that contained the TE could be analyzed further to identify TSD and TIR sequences from the TE insertion sequence. Of the 72 SVs, 70 had at least one sequence bearing the insertion after filtering. In total, 301 insertion-bearing sequences could be queried for TSD and TIR sequences; 31 of these were extracted from insertions present in the reference, whereas the remaining 270 were extracted from the local assemblies of the Oxford Nanopore reads.
We analyzed the insertion sequences using the GenericRepeatFinder software [51] with the following command: grf-main -i input.fasta -o . -c 1 -p 30 -s 10 --seed_mismatch 8 --min_tr 12. The parameters --min_space and --max_space were adjusted to be -60 and +20 relative to the expected SV size, respectively. The minimum (--min_tsd) and maximum (--max_tsd) TSD lengths were both set to 2 for the DTT superfamily and 3 for the DTH superfamily, while these values were 5 and 10, respectively, for the DTM superfamily. We used relatively relaxed parameters for seed matching because the presence of TSD and TIR sequences was expected and we wanted to maximize the probability of finding them.
We parsed the output of GenericRepeatFinder to obtain the locations and sequences of the TSDs and TIRs. In many cases, several potential TSD/TIR sequences could be identified from a given input sequence. For DTT TEs, only entries with a TSD matching the expected TA sequence were kept. For DTH TEs, only entries with a TSD matching either TAA or TTA were kept. Despite these filters, several TSD/TIR candidates often remained for a given sample. Therefore, the candidates were ranked following an automated procedure according to their proximity to the expected insertion location and their similarity to the expected insertion size. For TEs in the DTM superfamily, the highest priority was given to longer TSDs. For every sample, we considered the TSD/TIR entry with the highest priority as the only match for that sample.
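The candidate-ranking step can be sketched as a sort over the criteria described above (an illustrative sketch; the field names and function signatures are ours, not GenericRepeatFinder's):

```python
def rank_candidates(candidates, expected_pos, expected_size, prefer_long_tsd=False):
    """candidates: list of dicts with keys 'pos', 'size', 'tsd'.
    Rank by proximity to the expected insertion location, then by
    similarity to the expected insertion size; when prefer_long_tsd
    is True (DTM superfamily), longer TSDs take highest priority."""
    def key(c):
        primary = -len(c["tsd"]) if prefer_long_tsd else 0
        return (primary, abs(c["pos"] - expected_pos), abs(c["size"] - expected_size))
    return sorted(candidates, key=key)

def best_candidate(candidates, expected_pos, expected_size, **kw):
    """Return the single highest-priority TSD/TIR entry, or None."""
    ranked = rank_candidates(candidates, expected_pos, expected_size, **kw)
    return ranked[0] if ranked else None
```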
We annotated the multiple alignments with the putative TSD/TIR sequences and visually checked for consistency between these sequences and the location of the insertion. Our analysis relied entirely on the automatically identified TSD/TIR sequences; samples for which the best matching TSD/TIR sequence was not consistent were discarded and were not analyzed further. In total, we identified high-quality TSD/TIR sequences for 42 individual SVs. For each of the TIR sequences identified, we computed the percentage of matching nucleotides between the two inverted repeats and averaged them over all insertion sequences for a given SV.
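The TIR similarity metric, i.e. the percentage of matching nucleotides between the two inverted repeats, can be computed by comparing the 5' TIR to the reverse complement of the 3' TIR (a sketch assuming equal-length TIRs; helper names are ours):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def tir_match_percent(tir5, tir3):
    """Percent of positions where the 5' TIR matches the reverse
    complement of the 3' TIR (the repeats are inverted copies)."""
    target = revcomp(tir3)
    matches = sum(1 for a, b in zip(tir5, target) if a == b)
    return 100 * matches / len(tir5)

def mean_tir_match(pairs):
    """Average the per-sequence match percentages over all insertion
    sequences available for a given SV."""
    values = [tir_match_percent(t5, t3) for t5, t3 in pairs]
    return sum(values) / len(values)
```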
Some of the TIR sequences were identified directly from the reference sequence and may thus have been of higher quality than the error-prone assemblies generated from the Oxford Nanopore data. Therefore, we only show the results obtained from the de novo-assembled sequences in figure 6c of the main text. This results in 40 individual SVs collectively representing 17 sequences in the SoyTEdb database.

Detailed analysis of a 480-bp Stowaway MITE insertion
A closer look at the multiple alignments revealed that, in almost all cases, the sequences without the insertion presented a single occurrence of the TSD sequence. This observation is consistent with a scenario where the TE never inserted into the sequence, rather than having excised from it. In a single case, however, that of a 480-bp insertion of a Tc1-Mariner (DTT) element (actually a Stowaway MITE) at position 2,257,090 of chromosome Gm04, visual analysis of the multiple alignment revealed that three different alleles are segregating at the insertion site:

- The reference allele (no insertion at the target position)
- A 480-bp insertion that corresponds to the TE insertion
- A 6-bp insertion of nucleotides TACGAG

We speculate that this last allele results from the excision of the TE insertion. Indeed, the TA part of the insertion corresponds to the expected TSD sequence for a Tc1-Mariner element. The remaining CGAG nucleotides would likely have been inserted during the repair of the DNA break following excision of the TE. Interestingly, this insertion is by far the one for which the similarity between the two TIR sequences was highest among those studied, at 96.3%.
To investigate this question further, we analyzed the SNV patterns surrounding the insertion site. If the TACGAG insertion indeed resulted from the excision of the TE, then the haplotypes carrying the TE insertion and the 6-bp insertion should be more similar to each other than to the reference sequence, as the excision would have occurred within the genetic background of the TE insertion.
The SNVs with FILTER = PASS among those discovered by Platypus [52] were extracted from the VCF file of the 102 samples sequenced by Illumina to perform this analysis. The analysis of linkage disequilibrium (LD) using PLINK [53] in a 200-kb window surrounding the insertion location at Gm04:2,257,090 revealed a high-LD block spanning positions 2,220,398 to 2,259,326 that included the location of the insertion. The SNVs in that 39-kb window were therefore extracted from the Platypus VCF file and used for downstream analyses.
By combining the information obtained from Platypus and SV genotyping with Paragraph, we were able to classify samples into three groups depending on their status at the TE insertion locus:

- absent: absence of the TE (reference allele)
- present: presence of the TE insertion (as genotyped by Paragraph)
- excised: presence of the 6-bp insertion putatively left by TE excision (as genotyped by Platypus)

The number of samples in each group was as follows: absent: 71, present: 9, excised: 14. All genotype calls for the TE insertion were homozygous and were therefore easy to classify as "present" when homozygous for the alternate allele. For the 6-bp insertion, however, some genotype calls were heterozygous. In all such cases, visual analysis of the aligned Illumina reads confirmed that these were most likely mis-genotyped and were actually homozygous for the alternate allele. Therefore, samples that were heterozygous at that site were classified as "excised" in addition to those that were homozygous for the alternate allele. Samples called as homozygous for the reference allele at the 6-bp insertion position were classified as "absent". Finally, eight samples for which missing genotype calls did not allow unambiguous classification were discarded and not used for further analysis.
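The three-way classification can be expressed as a small decision function over the two genotype calls (an illustrative sketch; the genotype encodings and function name are ours):

```python
def classify_sample(te_genotype, sixbp_genotype):
    """te_genotype: call for the 480-bp TE insertion (Paragraph).
    sixbp_genotype: call for the 6-bp TACGAG insertion (Platypus).
    Genotypes are '0/0', '0/1', '1/1', or None for missing.
    Heterozygous 6-bp calls are treated as 'excised' because read
    inspection showed them to be mis-genotyped homozygotes."""
    if te_genotype == "1/1":
        return "present"
    if sixbp_genotype in ("0/1", "1/1"):
        return "excised"
    if te_genotype == "0/0" and sixbp_genotype == "0/0":
        return "absent"
    return None  # ambiguous due to missing calls: sample discarded
```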
Using this classification, we computed the alternate allele frequency within each of the three allele classes of the 156 SNVs in the 39-kb LD block. These results are visualized in figure 6d of the main text. They show clear similarity between the "present" and "excised" haplotypes, which is indeed consistent with the 6-bp insertion being a remnant from the excision of the 480-bp TE.
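Computing per-class alternate allele frequencies amounts to averaging allele dosages within each group (a sketch assuming diploid dosage encoding; data structures are ours):

```python
def alt_allele_frequencies(genotypes, classes):
    """genotypes: dict sample -> list of alt allele counts (0, 1 or 2),
    one entry per SNV. classes: dict sample -> class label
    ('absent', 'present' or 'excised').
    Returns {class label: [alt allele frequency per SNV]}."""
    freqs = {}
    for label in set(classes.values()):
        members = [s for s, c in classes.items() if c == label]
        n_snvs = len(next(iter(genotypes.values())))
        # Frequency = total alt dosage / total allele count (2 per diploid sample)
        freqs[label] = [
            sum(genotypes[s][i] for s in members) / (2 * len(members))
            for i in range(n_snvs)
        ]
    return freqs
```

Comparing the resulting frequency vectors across the three classes is what reveals the similarity between the "present" and "excised" haplotypes.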

Subsampling analysis of four samples sequenced using Oxford Nanopore sequencing
To investigate the effect of the relatively modest average sequencing depth (12X) of the Oxford Nanopore data used in this study, we performed a subsampling analysis of the data for four cultivars.
We adopted a subsampling approach in order to simulate lower coverage from the four cultivars that were sequenced at > 16X depth (Alta, Maple Isle, AC2001, OAC Lakeview). For each of the cultivars, we sampled 20%, 40%, 60% and 80% of the aligned reads using samtools view --subsample [46]. Five replicates of each cultivar/sampling fraction combination were generated using a different randomly chosen seed for each sample. Each of the 80 subsampled BAM files was processed to generate a catalog of SV calls and benchmark the results using the same methods as described in the main text:

- SV calling with Sniffles using parameters --min_support 3 --min_seq_size 1000 --min_homo_af 0.7
- Filtering of variants by removing those meeting any of the following conditions:
  - Variants other than insertions, deletions, duplications and inversions
  - Variants called as heterozygous
  - Variants overlapping any N in the reference sequence or containing any N in their sequence
  - Variants located on unanchored scaffolds or organellar genomes
  - Insertions located less than 20 bp away from any N in the reference sequence
  - Variants smaller than 50 bp or larger than 500 kb
- Benchmarking of each filtered SV dataset with the sveval R package, using the SVs called on the complete (not subsampled) dataset for the corresponding cultivar as a ground truth

Breakpoint refinement was not performed on the SV datasets generated from these subsamples for reasons of computational efficiency. For consistency, the complete (not subsampled) truth SV sets against which the subsampled datasets were compared were also not breakpoint-refined. However, we believe that the impact of breakpoint refinement on these comparisons would have been minimal and should not affect the conclusions drawn from these analyses.
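The variant filters applied before benchmarking can be collected into a single predicate (schematic; the field names are ours, and the reference-N checks are reduced to precomputed flags):

```python
def keep_variant(sv):
    """sv: dict describing one Sniffles call. Returns True if it passes
    all filters applied before the subsampling benchmark."""
    if sv["type"] not in {"INS", "DEL", "DUP", "INV"}:
        return False
    if sv["genotype"] != "1/1":                  # heterozygous calls removed
        return False
    if sv["overlaps_N"] or sv["contains_N"]:     # N in reference or SV sequence
        return False
    # Unanchored scaffolds and organellar genomes are excluded
    if sv["chrom"].startswith("scaffold") or sv["chrom"] in {"mito", "chloro"}:
        return False
    if sv["type"] == "INS" and sv["dist_to_N"] < 20:
        return False
    return 50 <= sv["length"] <= 500_000
```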
Results of the subsampling analyses are presented in figure S7 (deletions), figure S8 (insertions), figure S9 (duplications) and figure S10 (inversions). In the context of this analysis, "sensitivity" represents the proportion of SVs in the complete dataset found in a given subsample. Similarly, "precision" represents the proportion of SVs found in a given subsample that actually exist in the complete dataset.
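Under these definitions, sensitivity and precision reduce to set comparisons between the subsample call set and the full-coverage call set. A sketch using 50% reciprocal overlap as the matching criterion (sveval's actual matching is more elaborate):

```python
def matches(a, b, min_ro=0.5):
    """a, b: (chrom, start, end). True if reciprocal overlap >= min_ro."""
    if a[0] != b[0]:
        return False
    overlap = max(0, min(a[2], b[2]) - max(a[1], b[1]))
    if overlap == 0:
        return False
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1])) >= min_ro

def sensitivity_precision(subsample_svs, full_svs):
    """Sensitivity: fraction of full-dataset SVs recovered in the subsample.
    Precision: fraction of subsample SVs confirmed by the full dataset."""
    found = sum(1 for t in full_svs if any(matches(c, t) for c in subsample_svs))
    confirmed = sum(1 for c in subsample_svs if any(matches(c, t) for t in full_svs))
    sensitivity = found / len(full_svs) if full_svs else 1.0
    precision = confirmed / len(subsample_svs) if subsample_svs else 1.0
    return sensitivity, precision
```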
The results obtained for deletions (figure S7) show that sensitivity exceeds 75% at a sequencing depth of 10X across all size classes and samples. The precision of deletion discovery is near-perfect across all size classes and does not vary with sequencing depth. These results indicate that, for deletions, a sequencing depth of 10X is sufficient to detect the majority of variants and that the occurrence of false positive calls does not increase in samples sequenced at low depth.
The results obtained for insertions (figure S8) show that sensitivity exceeds 75% at a sequencing depth of 10X for insertions in the range of 50-1,000 bp but decreases for larger insertions. Therefore, it appears that the effect of sequencing depth becomes more important as insertions increase in size. The precision of insertion discovery varies between 75% and close to 100% depending on insertion size, with the exception of insertions larger than 10 kb for a single sample (AC2001). These results suggest that insertion discovery is less accurate than deletion discovery. However, we observe the same pattern of independence between sequencing depth and precision as was observed for deletions, with the same exception as noted above. This again indicates that the accuracy of insertion calls from Oxford Nanopore data does not decrease in samples that are sequenced at lower depth.
The results obtained for duplications (figure S9) show that duplication discovery is more sensitive to sequencing depth than other SV types, with sensitivity below 75% at a sequencing depth of 10X. This result is not surprising given that duplication calling relies on observing higher coverage on some genomic interval compared to average, and such differences can be assessed more confidently in deeply sequenced samples. This result is also consistent with our observation that the apparent precision of duplication genotyping from Illumina data depends strongly on Oxford Nanopore sequencing depth (see figure S6). The precision of duplication discovery seen in figure S9 is also poorer than for other SV types and decreases slightly with decreasing sequencing depth, consistently with difficulties in discovering and genotyping duplications as documented elsewhere in this paper.
Finally, the results obtained for inversions (figure S10) show sensitivity values similar to those observed for duplications, with sensitivity reaching between 50% and 75% at a sequencing depth of 10X. Inversion discovery appears to be highly sensitive to sequencing depth, as the relationship between sensitivity and sequencing depth does not seem to plateau. Precision rates are generally excellent (> 90%) and do not depend on sequencing depth, which indicates that inversion discovery will generally be accurate in samples sequenced at lower depth. One exception to this result is smaller inversions (50 -100 bp), as the small number of inversions observed in this size class (see table S1) caused stochastic variation in precision rates depending on the cultivar and subsample.
The observation that the precision rates of deletion, insertion and inversion discovery do not vary with sequencing depth is not surprising. Indeed, one would not expect a subsample of the reads to discover variants that were not already discovered from the full set of reads. However, Sniffles determines the coordinates of SVs by combining the information from all reads supporting a given SV. Therefore, for some variants located in regions where reads are difficult to align properly, subsets of reads may describe variants differently from the complete set of reads. Similarly, variants that are supported by only a few misaligned reads might be called as homozygous from a sample of reads, but as heterozygous from the complete set and therefore filtered out. Both these situations would lead to false positive calls in the subsampled dataset relative to the full dataset, as assessed by the sveval package. Despite the possibility of those situations occurring, our results show that such false positive calls are rare for deletions, insertions, and inversions. Therefore, it is safe to call these SVs from low-depth Oxford Nanopore sequencing data without worrying about the impact of false positive calls.

Table S1: Number of filtered variants per SV type and size class discovered by Sniffles for 17 samples subjected to Oxford Nanopore sequencing. These SVs were used as a truth set for the benchmarks presented in this paper.

Table footnotes:
a Observed difference between the mean allele frequencies of the SVs overlapping the corresponding genic features
b p-value computed from a one-sided comparison to the distribution of mean differences obtained from the randomizations (10,000 iterations). The significance threshold was set to α = 0.05 ÷ 6 = 0.008 to correct for multiple testing.
c cds: coding sequence
d upstream5kb: within 5 kb upstream of a gene

Table footnotes:
a The length of a single sequence read
b The number of sequenced spots. Each spot corresponds to one forward and one reverse read.
c The average sequencing depth computed from reads mapped to assembled chromosome pseudomolecules (Gm01 to Gm20)
d An "X" indicates that the sample belongs to the core set from which most Oxford Nanopore samples were selected
e An "X" indicates that the sample was sequenced using Oxford Nanopore Technologies sequencing

Table S8: DNA extraction and preparation methods for Oxford Nanopore sequencing

Table footnotes:
a SVs that passed all filters and had their breakpoints and/or sequence content updated by the pipeline
b Did not pass one or several thresholds applied when comparing the proposed refined SV to the raw SV
c SV length above the maximum permissible length (50 kb used here)
d No assembly or no alignment produced by the pipeline

Figure S1: Graphical summary of the methods used in this paper. The various steps of the analysis are color-coded depending on the data output by that step (see legend at the bottom of the figure). Words in italics in parentheses indicate the name of the program used to compute that part of the analysis, where applicable. Arrows between steps indicate that one step is a prerequisite to the step it points to. An exception is the dotted arrow indicating that both the predicted SV genotypes and the ground-truth Oxford Nanopore SV dataset are necessary for benchmarking. Benchmarks were performed for three different candidate SV datasets genotyped using Illumina data: SVs discovered from Illumina data only, SVs discovered from Oxford Nanopore data only, and the combination of SVs discovered from the Illumina and Oxford Nanopore data. Population-scale analyses were performed on the SV dataset based on combined Illumina and Oxford Nanopore SVs. ONT: Oxford Nanopore Technologies, SV: structural variant, SNV: single-nucleotide variant, GO: Gene Ontology, BP: Biological Process, TIR: terminal inverted repeat, TSD: target site duplication, MITE: miniature inverted-repeat transposable element, PCA: principal component analysis.
table S7 and represent the effective sequencing depth computed from mapped reads. With the exception of one outlier, heterozygosity rates higher than 0.1 are attributable to shallow sequencing, which can lead to genotyping errors that inflate apparent heterozygosity. Heterozygous calls in inbred lines may arise from various causes, including genotype-calling errors in low-depth samples (as shown here), copy number variants generating spurious heterozygous genotype calls [6], or genuine residual heterozygosity.

Figure S3: Genotyping sensitivity and precision of (A) deletions and (B) insertions discovered from the Illumina data in non-repeat regions. Only SVs with less than 20% overlap with regions annotated as repeats were used for these benchmarks. Each line and color represents one of 17 samples. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call. The asterisks indicate a minimum of 2 supporting reads; points to the left of these for a given line represent increasingly stringent filtering thresholds. Note that no results were available for deletions larger than 10 kb because these all overlapped repeat regions.

inversions discovered from Illumina data (at a minimum threshold of 2 supporting reads) for 17 samples as a function of the average Oxford Nanopore sequencing depth of that sample.

Figure S7: Effects of sequencing depth on the sensitivity and precision of deletion discovery from Oxford Nanopore sequencing data. Oxford Nanopore reads of four cultivars were subsampled at fractions of 20%, 40%, 60%, and 80%, with five replicates of each cultivar/subsampling fraction combination. SVs were called on each of these subsamples, and (A) sensitivity and (B) precision were computed by comparing the resulting SVs to the SVs called on the full (unsampled) dataset for the corresponding cultivar.
Points indicate the median value for a given cultivar/subsampling fraction combination, while vertical error bars span the minimum to the maximum value of the five replicates. Asterisks indicate the value (1 by definition) obtained for the full dataset. Error bars are not visible where the minimum and maximum values are nearly identical.

Figure S8: Effects of sequencing depth on the sensitivity and precision of insertion discovery from Oxford Nanopore sequencing data. Oxford Nanopore reads of four cultivars were subsampled at fractions of 20%, 40%, 60%, and 80%, with five replicates of each cultivar/subsampling fraction combination. SVs were called on each of these subsamples, and (A) sensitivity and (B) precision were computed by comparing the resulting SVs to the SVs called on the full (unsampled) dataset for the corresponding cultivar. Points indicate the median value for a given cultivar/subsampling fraction combination, while vertical error bars span the minimum to the maximum value of the five replicates. Asterisks indicate the value (1 by definition) obtained for the full dataset. Error bars are not visible where the minimum and maximum values are nearly identical.

Figure S9: Effects of sequencing depth on the sensitivity and precision of duplication discovery from Oxford Nanopore sequencing data. Oxford Nanopore reads of four cultivars were subsampled at fractions of 20%, 40%, 60%, and 80%, with five replicates of each cultivar/subsampling fraction combination. SVs were called on each of these subsamples, and (A) sensitivity and (B) precision were computed by comparing the resulting SVs to the SVs called on the full (unsampled) dataset for the corresponding cultivar. Points indicate the median value for a given cultivar/subsampling fraction combination, while vertical error bars span the minimum to the maximum value of the five replicates.
Asterisks indicate the value (1 by definition) obtained for the full dataset. Error bars are not visible where the minimum and maximum values are nearly identical.

Figure S10: Effects of sequencing depth on the sensitivity and precision of inversion discovery from Oxford Nanopore sequencing data. Oxford Nanopore reads of four cultivars were subsampled at fractions of 20%, 40%, 60%, and 80%, with five replicates of each cultivar/subsampling fraction combination. SVs were called on each of these subsamples, and (A) sensitivity and (B) precision were computed by comparing the resulting SVs to the SVs called on the full (unsampled) dataset for the corresponding cultivar. Points indicate the median value for a given cultivar/subsampling fraction combination, while vertical error bars span the minimum to the maximum value of the five replicates. Asterisks indicate the value (1 by definition) obtained for the full dataset. Error bars are not visible where the minimum and maximum values are nearly identical.

Figure S11: Effect of filtering variants for the minimum number of ALT alleles in homozygous genotype calls (homozygous ALT count) on the genotyping performance of (A) deletions and (B) insertions discovered from Illumina data. Each line and color represents a different sample. The curve for a given sample was obtained by varying the filtering threshold for a variant to be considered for benchmarking. The asterisks mark a homozygous ALT count threshold of 4; points to the left of these for a given line represent increasingly stringent filtering thresholds.

Figure S12: Effect of filtering variants for the minimum number of ALT alleles in homozygous genotype calls (homozygous ALT count) on the genotyping performance of (A) duplications and (B) inversions discovered from Illumina data. Each line and color represents a different sample.
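The subsampling design used in the depth benchmarks (fractions of 20% to 80%, five replicates each) can be sketched as follows. The function name, seed, and the use of read IDs in place of FASTQ records are assumptions for illustration; a real pipeline would subsample the reads themselves (e.g. with a tool such as seqtk) before calling SVs on each subsample.

```python
import random

def subsample_replicates(read_ids, fractions=(0.2, 0.4, 0.6, 0.8), n_reps=5, seed=42):
    """Draw read IDs without replacement at each fraction, n_reps times.

    Returns {fraction: [replicate_1, ..., replicate_n]} where each replicate
    is a list of read IDs to keep for one SV-calling run.
    """
    rng = random.Random(seed)  # fixed seed so replicates are reproducible
    n = len(read_ids)
    return {
        f: [rng.sample(read_ids, int(f * n)) for _ in range(n_reps)]
        for f in fractions
    }

# Toy example: 1,000 reads of one cultivar
reads = [f"read_{i}" for i in range(1000)]
subsets = subsample_replicates(reads)
# subsets[0.2] holds five independent draws of 200 reads each
```

Sensitivity and precision for each replicate would then be obtained by comparing the SVs called on the subsample to those called on the full read set, and the five replicates summarized as a median with min-max error bars, as in the legends above.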
The curve for a given sample was obtained by varying the filtering threshold for a variant to be considered for benchmarking. The asterisks mark a homozygous ALT count threshold of 6 (for duplications) or 10 (for inversions); points to the left of these for a given line represent increasingly stringent filtering thresholds.

Figure S13: Effect of filtering variants for the minimum number of supporting calling tools on the genotyping performance of (A) deletions and (B) insertions discovered from Illumina data. Each line and color represents a different sample. The curve for a given sample was obtained by varying the minimum number of calling tools required to support a variant for it to be considered for benchmarking. The asterisks mark a minimum of one supporting calling tool; points to the left of these for a given line represent increasingly stringent filtering thresholds.

Figure S18: Genotyping sensitivity and precision of (A) deletions and (B) insertions discovered from the Oxford Nanopore data in non-repeat regions. Only SVs with less than 20% overlap with regions annotated as repeats were used for these benchmarks. Each line and color represents one of 17 samples. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call. The asterisks indicate a minimum of 2 supporting reads; points to the left of these for a given line represent increasingly stringent filtering thresholds. Note that no results were available for deletions larger than 10 kb because these all overlapped repeat regions.

Figure S19: Genotyping sensitivity and precision of (A) duplications and (B) inversions discovered from the Oxford Nanopore data. Each line and color represents one of 17 samples. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call.
The asterisks indicate a minimum of 2 supporting reads; points to the left of these for a given line represent increasingly stringent filtering thresholds.

Figure S20: Genotyping sensitivity and precision of (A) duplications and (B) inversions discovered from the Oxford Nanopore data in non-repeat regions. Only SVs with less than 20% overlap with regions annotated as repeats were used for these benchmarks. Each line and color represents one of 17 samples. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call. The asterisks indicate a minimum of 2 supporting reads; points to the left of these for a given line represent increasingly stringent filtering thresholds.

Figure S21: Genotyping sensitivity and precision of (A) deletions and (B) insertions in the merged dataset of SVs discovered from the Illumina and Oxford Nanopore data. Each line and color represents one of 17 samples. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call. The asterisks indicate a minimum of 2 supporting reads; points to the left of these for a given line represent increasingly stringent filtering thresholds.

Figure S22: Genotyping sensitivity and precision of (A) deletions and (B) insertions in non-repeat regions in the merged dataset of SVs discovered from the Illumina and Oxford Nanopore data. Only SVs with less than 20% overlap with regions annotated as repeats were used for these benchmarks. Each line and color represents one of 17 samples. The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call. The asterisks indicate a minimum of 2 supporting reads; points to the left of these for a given line represent increasingly stringent filtering thresholds.
Note that no results were available for deletions larger than 10 kb because these all overlapped repeat regions.

PCAs were computed based on (A) SVs discovered from Illumina and Oxford Nanopore data and subsequently genotyped with Illumina data using Paragraph, and (B) SNVs called by Platypus from Illumina data. The first three principal components are shown for each PCA. The color of each point indicates the population to which the cultivar belongs according to the fastStructure analysis with SNVs, following the same color scheme as in Figure 4 of the main text. Samples with a maximum q-value below 0.6 (as inferred from fastStructure) are considered admixed and are labeled with asterisks.

Figure S24: Neighbour-joining phylogenetic trees of the population of 102 Canadian soybean cultivars derived from (A) SVs discovered from Illumina and Oxford Nanopore data and subsequently genotyped with Illumina data using Paragraph, and (B) SNVs called by Platypus from Illumina data. Tip label colors indicate the population to which the cultivar belongs according to the fastStructure analysis with SNVs, following the same color scheme as in Figure 4 of the main text. Node labels indicate the number of trees (out of 100) in the bootstrap analysis that support a given node/grouping.

Figure S26: Genotyping sensitivity and precision of raw and refined (A) deletions and (B) insertions discovered by Sniffles from the Oxford Nanopore data for four samples when using BayesTyper as a genotyper. Each line represents data for a single sample whose SV breakpoints have either been refined (refined SVs) or not (raw SVs). The points correspond to different filtering thresholds on the genotype quality (GQ) required to support a genotype call.

Figure S27: Genotyping sensitivity and precision of raw and refined (A) deletions and (B) insertions discovered by Sniffles from the Oxford Nanopore data for four samples when using vg as a genotyper.
Each line represents data for a single sample whose SV breakpoints have either been refined (refined SVs) or not (raw SVs). The points correspond to different filtering thresholds on the genotype quality (GQ) required to support a genotype call.

Figure S28: Genotyping sensitivity and precision of raw and refined (A) deletions and (B) insertions discovered by Sniffles from the Oxford Nanopore data for four samples when using Paragraph as a genotyper. Each line represents data for a single sample whose SV breakpoints have either been refined (refined SVs) or not (raw SVs). The points correspond to different filtering thresholds on the minimum number of Illumina reads required to support a genotype call.
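The threshold sweeps behind these sensitivity/precision curves can be sketched in a few lines: each genotype call carries a support value (read count or GQ), and raising the minimum required support trades sensitivity for precision. The call representation and the numbers below are illustrative, not taken from the actual benchmarks.

```python
# Sketch of a filter sweep producing one sensitivity/precision curve.
# Each call is (sv_id, support, is_correct); support plays the role of the
# supporting-read count (or GQ) filtered on in the figures above.

def sweep(calls, truth_ids, thresholds=range(1, 6)):
    """Return [(threshold, sensitivity, precision), ...] for each threshold."""
    out = []
    for t in thresholds:
        kept = [c for c in calls if c[1] >= t]          # apply the filter
        tp = sum(1 for c in kept if c[2])               # correct calls kept
        sens = tp / len(truth_ids) if truth_ids else 1.0
        prec = tp / len(kept) if kept else 1.0
        out.append((t, sens, prec))
    return out

# Toy data: three correct calls, one spurious low-support call, one missed SV
calls = [("sv1", 5, True), ("sv2", 2, True), ("sv3", 1, False), ("sv4", 3, True)]
truth_ids = {"sv1", "sv2", "sv4", "sv5"}
curve = sweep(calls, truth_ids)
# raising the threshold from 1 to 2 drops the spurious call: precision rises
# from 0.75 to 1.0 while sensitivity stays at 0.75
```

Plotting precision against sensitivity for each threshold reproduces the shape of the per-sample curves, with the asterisked point corresponding to the default threshold.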