Illumina sequencing and read processing
Sample selection and acquisition of the Illumina sequencing data have been described in previous work [6]. Briefly, 102 Canadian soybean cultivars and breeding lines were selected to encompass the full range of genetic variation found among Canadian short-season germplasm and were sequenced on the Illumina HiSeq 2500 platform. Paired-end reads ranging from 100 to 126 nucleotides in length were obtained, depending on the sample (see Additional file 1: Table S7 for metadata on the Illumina sequencing data). These sequencing data are available on the NCBI Sequence Read Archive (SRA) under BioProject accession number PRJNA356132 [62].
All reads were adapter- and quality-trimmed using bbduk from the BBTools suite v. 38.25 [63]. We aligned reads using bwa mem v. 0.7.17-r1188 [64] with default parameters. Paired-end alignment mode was used except for reads that were left unpaired following adapter and quality trimming, which were aligned in single-end mode. We used a reference genome consisting of assembly version 4 of the Williams82 reference cultivar [65] concatenated with reference mitochondrion and chloroplast sequences retrieved from SoyBase [66]. Reads aligned in paired-end and single-end modes were then merged, sorted, and indexed using samtools v. 1.8 [46], and read groups were added using bamaddrg [67]. The sorted and indexed BAM files were used as input for all downstream analyses requiring mapped reads.
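For illustration, the trimming and alignment steps for a single sample might resemble the minimal sketch below. File names, thread counts, and the bbduk trimming thresholds shown are assumptions for illustration only; bwa mem was run with default parameters as stated above.

```bash
# Minimal sketch of read trimming and alignment for one sample.
# File names, thread counts, and bbduk trimming thresholds are illustrative.
SAMPLE=sample01
REF=Gmax_v4_plus_organelles.fa

# Adapter/quality trimming; unpaired survivors go to a separate file (outs=)
bbduk.sh in1=${SAMPLE}_R1.fastq.gz in2=${SAMPLE}_R2.fastq.gz \
  out1=${SAMPLE}_trim_R1.fastq.gz out2=${SAMPLE}_trim_R2.fastq.gz \
  outs=${SAMPLE}_trim_unpaired.fastq.gz \
  ref=adapters ktrim=r k=23 mink=11 hdist=1 qtrim=rl trimq=20 minlen=35

# Align paired and unpaired reads separately, then merge, sort, and index
bwa mem -t 8 $REF ${SAMPLE}_trim_R1.fastq.gz ${SAMPLE}_trim_R2.fastq.gz | \
  samtools sort -o ${SAMPLE}_pe.bam -
bwa mem -t 8 $REF ${SAMPLE}_trim_unpaired.fastq.gz | \
  samtools sort -o ${SAMPLE}_se.bam -
samtools merge ${SAMPLE}.bam ${SAMPLE}_pe.bam ${SAMPLE}_se.bam
samtools index ${SAMPLE}.bam
```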
Structural variation discovery from short reads
We called SVs on all 102 samples using four different tools: AsmVar [68], Manta [69], SvABA [70], and LUMPY-based [71] smoove [72]. We selected this combination of tools based on the complementarity of their SV detection approaches, widespread use within the community, and performance reported in published benchmarks. Manta and LUMPY (on which smoove is based) performed consistently well across various published benchmarks [18,19,20]. SvABA also performed well in the benchmarks published by Kosugi et al. [20]. Finally, we also chose AsmVar because of its purely assembly-based approach (contrasting with the other tools) and its use in a previous large-scale variation discovery project [73]. When calling SVs, we limited the analysis to deletions, insertions, inversions, and duplications since their alterations in sequence can be precisely represented to allow genotyping from the alignment of sequence reads. While we did not explicitly study translocations, alterations in sequence that result from a translocation may still appear in the dataset as an insertion at one genomic location and a deletion at another, even if they are not annotated as such. Similarly, we did not explicitly study copy number variants, but variants defined under this umbrella term are still included in our dataset as deletions and duplications.
AsmVar calls SVs by comparing de novo genome assemblies to a reference genome. Prior to assembly, we merged reads that were still paired after trimming using FLASH v. 1.2.11 [74]. The rationale behind this was that the short size of the inserts in our sequencing data allowed several of the read pairs to be merged into longer sequences. Reads were grouped into three libraries (single-end reads from bbduk, single-end reads merged by FLASH, and paired-end reads left unmerged by FLASH) and assembled with SOAPdenovo2 v. 2.04 [75] using the sparse_pregraph and contig commands, and a k-mer size of 49. Contigs were not further assembled into scaffolds because we aimed to only call SVs whose sequence was entirely resolved. The resulting contigs were aligned to the reference genome using LAST v. 1047 [76] by first calling the lastal command with options -D1000 -Q0 -e20 -j4 and then the last-split command with options -m 0.01 -s30. Variants were called on the LAST alignments using ASV_VariantDetector from the AsmVar tool suite (version of 2015-04-16) with default parameters. The pipeline was run on each sample independently and results were subsequently concatenated to obtain a single AsmVar VCF file. Variants with a FILTER tag other than “.” were filtered out from the resulting call set.
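As an illustration, the read-merging and contig-to-reference alignment steps of this pipeline might look like the sketch below. File names and the lastdb seeding scheme are assumptions, and the SOAPdenovo2 assembly step (which requires a library configuration file) and the ASV_VariantDetector call are not shown.

```bash
# Sketch of the AsmVar pre-processing and alignment steps for one sample.
# File names and the lastdb seeding scheme are assumptions; the SOAPdenovo2
# assembly and ASV_VariantDetector steps are not shown.
SAMPLE=sample01
REF=Gmax_v4_plus_organelles.fa

# Merge overlapping read pairs into longer single-end sequences
flash ${SAMPLE}_trim_R1.fastq.gz ${SAMPLE}_trim_R2.fastq.gz -t 8 -o ${SAMPLE}_flash

# Align the assembled contigs to the reference with LAST, then resolve
# split alignments with last-split (options as reported above)
lastdb -uNEAR ref_lastdb $REF
lastal -D1000 -Q0 -e20 -j4 ref_lastdb ${SAMPLE}_contigs.fa | \
  last-split -m 0.01 -s30 > ${SAMPLE}_contigs.maf
```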
We ran Manta v. 1.6.0 with default parameters in 10 batches of 10 or 11 randomly grouped samples because it did not scale well to the whole population. We used the candidate SVs (and not the genotype calls themselves) identified by each run for further processing and filtered them by removing unresolved breakends (SVTYPE=BND). The filtered variants were then converted from symbolic alleles (i.e., DEL, DUP, INS) to sequence-explicit ALT alleles using bayesTyperTools convertAllele v. 1.5 [31] and combined into a single VCF file using bcftools merge (version 1.10.2-105) [46].
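A single Manta batch, followed by allele conversion and merging across batches, might be run along the lines of the sketch below; batch composition, output paths, and the convertAllele flag spellings are assumptions.

```bash
# Sketch of one Manta batch run followed by allele conversion and merging.
# Batch composition, paths, and convertAllele flag spellings are assumptions.
REF=Gmax_v4_plus_organelles.fa

configManta.py --referenceFasta $REF --runDir manta_batch01 \
  --bam sample01.bam --bam sample02.bam --bam sample03.bam   # ...10-11 BAMs per batch
manta_batch01/runWorkflow.py -m local -j 16

# Convert symbolic ALT alleles of the candidate SVs to explicit sequences,
# then merge the 10 batches into a single VCF
bayesTyperTools convertAllele \
  -v manta_batch01/results/variants/candidateSV.vcf.gz -g $REF -o manta_batch01_explicit
bcftools merge -Oz -o manta_candidates.vcf.gz \
  manta_batch01_explicit.vcf.gz manta_batch02_explicit.vcf.gz   # ...all batches
```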
We ran SvABA v. 1.1.3 separately on all samples using the command svaba run with options --germline -I -L 6. SvABA produces two different variant sets: one for indels, which are already coded as sequence-explicit alleles, and another for SVs, which are coded as paired breakends. We therefore classified SVs into defined types (DEL, DUP, INV) based on breakpoint orientation and converted them to sequence-explicit ALT alleles using an in-house R script. The resulting sequence-explicit variants were merged using bcftools merge.
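For instance, the per-sample SvABA call might look as follows; the analysis identifier, thread count, and output file names in the comments are assumptions.

```bash
# Sketch of one per-sample SvABA run in germline mode.
# The analysis ID (-a), thread count (-p), and output names are assumptions.
REF=Gmax_v4_plus_organelles.fa
SAMPLE=sample01

svaba run -t ${SAMPLE}.bam -G $REF -a ${SAMPLE}_svaba -p 8 --germline -I -L 6
# ${SAMPLE}_svaba.svaba.indel.vcf would hold sequence-explicit indels;
# ${SAMPLE}_svaba.svaba.sv.vcf would hold breakend-coded SVs to be
# reclassified and converted before merging with bcftools merge.
```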
We ran smoove v. 0.2.4 on all samples using a series of commands. First, smoove call was run separately on each sample using default parameters. The variants identified across samples were then merged into a unified set of sites using smoove merge, genotyped in each sample using smoove genotype with options -x -d, and combined into a single population-level VCF file using smoove paste. Symbolic alleles (<DEL>, <DUP>, and <INV> alleles) were converted to explicit sequence representation using bayesTyperTools convertAllele.
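The smoove workflow for one sample and the subsequent population-level steps might be sketched as below; directory and file names, thread counts, and the use of --genotype during per-sample calling are assumptions.

```bash
# Sketch of the smoove workflow (per-sample calling, site merging, joint
# genotyping, pasting). Names, paths, and thread counts are assumptions.
REF=Gmax_v4_plus_organelles.fa

# 1) Per-sample calling
smoove call --name sample01 --fasta $REF --outdir smoove/ -p 4 --genotype sample01.bam

# 2) Merge sites across all samples
smoove merge --name merged --fasta $REF --outdir smoove/ smoove/*-smoove.genotyped.vcf.gz

# 3) Genotype each sample against the merged sites (options -x -d as above)
smoove genotype -x -d --name sample01-joint --fasta $REF --outdir smoove/ \
  --vcf smoove/merged.sites.vcf.gz sample01.bam

# 4) Paste per-sample VCFs into a single population VCF
smoove paste --name cohort smoove/*-joint-smoove.genotyped.vcf.gz
```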
A series of common filters was applied to the SV output of all four tools before using them for downstream analyses. Specifically, we removed variants spanning less than 50 bp or more than 500 kb, variants located on unanchored scaffolds or organellar genomes, and any variant that was not classified as a deletion, insertion, duplication, or inversion. The 500-kb threshold was chosen to ensure computational efficiency in downstream applications, but SVs larger than this are also expected to be very rare in soybean germplasm [32]. We also split multiallelic records into biallelic ones and standardized the representation of all alleles using bcftools norm.
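The allele standardization step, for example, can be performed with bcftools norm as sketched below; file names are assumptions, and the size, type, and location filters described above (which require inspecting SVTYPE/SVLEN annotations or allele lengths) are not shown.

```bash
# Sketch of multiallelic splitting and allele normalization with bcftools norm.
# File names are assumptions; the size/type/location filters are not shown.
REF=Gmax_v4_plus_organelles.fa
bcftools norm -f $REF -m-any -Oz -o calls.norm.vcf.gz calls.vcf.gz
bcftools index -t calls.norm.vcf.gz
```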
Oxford Nanopore sequencing
We selected 17 samples for Oxford Nanopore sequencing among those already sequenced with Illumina. Sixteen of these were randomly selected from a subset of 56 lines belonging to a core set of Canadian soybean germplasm, while the remaining sample (CAD1052/OAC Embro) had been selected and sequenced before the others based on its higher Illumina sequencing depth. Although sample selection did not explicitly maximize the number of potential SVs assessed, we did verify that the resulting set covered the range of variation found in Canadian soybean germplasm based on an existing phylogenetic tree [6].
Our sample preparation and sequencing protocols evolved throughout the project as we gained experience with Oxford Nanopore sequencing. Therefore, we outline our latest methods here, but more details regarding the procedures used for each sample can be found in Table S8 (Additional file 1). Accessions selected for sequencing were germinated in Jiffy peat pellets (Jiffy Group, Zwijndrecht, Netherlands) on the benchtop. Young trifoliate leaves were collected between 2 and 3 weeks after germination, flash frozen in liquid nitrogen upon harvest, and stored at − 80 °C until DNA extraction. Single trifoliate leaves weighing between 20 and 60 mg were used for each extraction. Liquid nitrogen-frozen leaves were pulverized on a Qiagen TissueLyser instrument (Qiagen, Hilden, Germany) with metal beads for four cycles of 30 s each at 30 Hz. The resulting powder was immediately transferred to a CTAB buffer (2% CTAB, 0.1 M Tris-HCl pH 8, 0.02 M EDTA pH 8, 1.4 M NaCl, 1% (m/v) PVP) and incubated at 60 °C in a water bath for 45 min. The lysate recovered after centrifugation at 3500 rcf for 10 min was then subjected to an RNase A treatment for another 45 min at 60 °C, followed by the addition of an equal volume of 24:1 chloroform:isoamyl alcohol to the sample and stirring to an emulsion. Following centrifugation at 3500 rcf for 15 min, the supernatant was recovered and mixed with a 0.7 volume of cold isopropanol. This mix was stored at – 80 °C for 20 min and centrifuged at 3500 rcf for 30 min, after which the liquid was removed. Tubes were rinsed twice with cold 70% ethanol, with a centrifugation step after each addition of ethanol. After the last rinsing, tubes were left to dry for 3 min after which pellets were resuspended in 100 μl elution buffer (Tris-HCl 0.01 M and EDTA 0.001 M, pH 8) at 37 °C for an hour, and then stored at 4 °C until use.
Samples were size-selected using the Short Read Eliminator kit of Circulomics (Circulomics, Baltimore, MD, USA) following the manufacturer’s instructions. The size-selected DNA resuspended in the SRE kit’s EB buffer was then purified using SparQ magnetic beads and resuspended in ddH2O. Typically, between 500 ng and 1 μg of this DNA was used for Oxford Nanopore library preparation using the SQK-LSK109 genomic DNA ligation kit (Oxford Nanopore Technologies, Oxford, UK). The library was prepared according to the manufacturer’s instructions except for the following details: (1) DNA fragmentation was not performed prior to library preparation, (2) 80% ethanol was used instead of 70% ethanol, (3) the bead elution time following DNA repair and end-prep was increased from 2 to 10 min, (4) the bead elution time following adapter ligation and clean-up was increased from 10 to 15 min and carried out in a water bath set to 37 °C. Typically, between 150 and 400 ng of the prepared library quantified using a Qubit fluorometer (Thermo Fisher Scientific, Waltham, MA, USA) were used as input to a FLO-MIN106D flowcell (R9 chemistry) and run on a MinION for 48 to 72 h using default voltage settings. While most accessions were sequenced on a single flow cell, three accessions for which the initial yield was low (< 9 Gb) were sequenced a second time (using DNA from a different plant) to provide sufficient data for downstream analyses. More details regarding the Oxford Nanopore sequencing of the samples can be found in Table S9 (Additional file 1).
Structural variation discovery from Oxford Nanopore data
Raw FAST5 sequencing files were basecalled on a GPU using Oxford Nanopore Technologies’ guppy basecaller v. 4.0.11 with parameters --flowcell FLO-MIN106 --kit SQK-LSK109. Basecalled FASTQ files obtained from a single flow cell were concatenated into a single file which was used for downstream analyses. Adapters were trimmed using Porechop v. 0.2.4 [77] with the option --discard_middle. Adapter-trimmed reads were aligned using NGMLR v. 0.2.7 [24] with the option -x ont. The resulting alignments were sorted and indexed using samtools.
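As an illustration, basecalling, adapter trimming, and alignment of one flow cell's output might be run as sketched below; directory names, thread counts, and the GPU device string are assumptions.

```bash
# Sketch of Oxford Nanopore basecalling, trimming, and alignment for one run.
# Directory names, thread counts, and the GPU device are assumptions.
REF=Gmax_v4_plus_organelles.fa
SAMPLE=sample01

guppy_basecaller -i ${SAMPLE}_fast5/ -s ${SAMPLE}_basecalled/ \
  --flowcell FLO-MIN106 --kit SQK-LSK109 -x cuda:0
cat ${SAMPLE}_basecalled/*.fastq > ${SAMPLE}.fastq

porechop -i ${SAMPLE}.fastq -o ${SAMPLE}_trimmed.fastq --discard_middle --threads 8

ngmlr -t 8 -x ont -r $REF -q ${SAMPLE}_trimmed.fastq -o ${SAMPLE}_ont.sam
samtools sort -o ${SAMPLE}_ont.bam ${SAMPLE}_ont.sam
samtools index ${SAMPLE}_ont.bam
```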
At this stage, we merged the BAM files of samples that were sequenced on two different flow cells and called SVs using Sniffles v. 1.0.11 [24]. We ran Sniffles with parameters --min_support 3 (minimum number of reads supporting a variant = 3, default = 10), --min_seq_size 1000 (minimum read segment length for consideration = 1000, default = 2000), and --min_homo_af 0.7 (minimum alternate allele frequency to be considered homozygous = 0.7, default = 0.8). We chose relaxed parameters compared to the defaults because our samples are inbred cultivars and heterozygosity should therefore be nearly non-existent. In order to evaluate the impact of sequencing depth on SV calling with Sniffles, we performed a subsampling analysis using some of the samples and report the detailed methods and results in Additional file 1 (Supplemental Methods and Supplemental Figures S7 to S10).
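A single-sample Sniffles call with these thresholds might look like the following sketch; the input and output file names are assumptions.

```bash
# Sketch of a per-sample Sniffles (v1) call with the relaxed thresholds above.
# Input/output file names are assumptions.
sniffles -m sample01_ont.bam -v sample01_sniffles.vcf \
  --min_support 3 --min_seq_size 1000 --min_homo_af 0.7
```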
We applied a series of filters to the SVs in order to remove spurious calls that could affect downstream analyses. Any variants called on organellar genomes or unanchored scaffolds were filtered out, along with any variants smaller than 50 nucleotides or larger than 500 kb. We only retained deletions, insertions, inversions, and duplications for further analyses, discarding unresolved breakpoints (SVTYPE=BND) as well as other complex types such as DEL/INV, DUP/INS, INVDUP, and INV/INVDUP. We removed variants called as heterozygous since heterozygous genotype calls are very likely to be spurious in these inbred lines. To avoid calling artifactual variants in ambiguous regions of the genome (stretches of “N” due to imperfectly assembled regions of the reference genome), we also removed variants that overlapped any “N” in the reference as well as any insertion located less than 20 nucleotides away from any “N” in the reference.
The locations of SVs and the insertion sequences reported by Sniffles are necessarily imperfect, as they are based on error-prone Oxford Nanopore reads (on average 8–10% error rate based on the percent identity of our alignments). We therefore developed a pipeline to refine the breakpoint location and the sequence content of the deletions and insertions found by Sniffles. Duplications and inversions were not considered for SV refinement because the inherent complexity of these variants made it difficult to accurately assemble them from our data. We briefly describe the pipeline here, but more details can be found in Additional file 1 (Supplemental Methods, Table S10 and Figures S26 to S28). Our breakpoint refinement pipeline starts by locally assembling all reads that were mapped by NGMLR to positions ± 200 bp from the location of the SV using wtdbg2 v. 2.5 [42]. The same reads are then aligned to the assembled sequence using minimap2 v. 2.17-r974 [43] to polish the assembly sequence using the consensus module of wtdbg2. The resulting polished assembly is subsequently aligned to the local region of the reference genome using AGE (commit 6fa60999, github.com/abyzovlab/AGE) [40]. The coordinates of the SV and the insertion sequence content are then optionally updated based on the information provided by the AGE alignment. When the alignment did not suggest suitable replacement coordinates or insertion content for a given SV, we instead used its representation as initially defined by Sniffles for downstream analyses. Following breakpoint refinement, the representation of the alleles was standardized using bcftools norm.
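One SV locus might be processed along the lines of the sketch below. The region coordinates, the genome-size estimate passed to wtdbg2, and the AGE binary name and mode flag are assumptions; the polishing recipe follows the consensus approach documented for wtdbg2.

```bash
# Sketch of local re-assembly, polishing, and realignment around one SV.
# Region, wtdbg2 genome-size estimate, and the AGE invocation are assumptions.
REF=Gmax_v4_plus_organelles.fa
REGION=Gm18:1234000-1236000   # hypothetical SV locus +/- 200 bp

# Extract the reads overlapping the locus and assemble them with wtdbg2
samtools view -b sample01_ont.bam $REGION | samtools fastq - > locus_reads.fq
wtdbg2 -x ont -g 5k -i locus_reads.fq -fo locus
wtpoa-cns -i locus.ctg.lay.gz -fo locus.ctg.fa

# Polish the contig with the same reads via minimap2 alignments
minimap2 -ax map-ont locus.ctg.fa locus_reads.fq | samtools sort -o locus.ctg.bam -
samtools view -F 0x900 locus.ctg.bam | \
  wtpoa-cns -d locus.ctg.fa -i - -fo locus.polished.fa

# Realign the polished contig to the local reference region with AGE
samtools faidx $REF $REGION > locus_ref.fa
age_align -indel locus_ref.fa locus.polished.fa > locus_age.txt
```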
Structural variant genotyping and benchmarking
We genotyped SVs on all 102 Illumina samples with Paragraph v. 2.4a [29] using three different candidate SV sets. The first candidate SV set included only variants discovered from the Illumina data and was used to assess the performance of SV discovery from Illumina data alone. The second candidate SV set included only variants discovered from Oxford Nanopore data and was similarly used to assess the performance of genotyping those variants with Illumina data. The third and last candidate SV set comprised variants discovered using both Illumina and Oxford Nanopore data and was used for the population-scale analyses on population genetics, location of variants relative to gene models, and polymorphic TEs. Despite the superior performance of long-read data for SV discovery, we decided to also include variants discovered from the Illumina data in the final SV set as they encompassed all samples.
For genotyping SVs discovered from Illumina data, the VCF files of all discovery tools (AsmVar, Manta, SvABA, smoove) were merged together using SVmerge (commit 6a18fa3d2, github.com/nhansen/SVanalyzer) [45] with parameters -maxdist 15 -reldist 0.2 -relsizediff 0.1 -relshift 0.1. Parameters were chosen in order to merge slightly differing representations of alleles that were putatively identical from a biological point of view while preserving true allele diversity at a given position.
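For example, this merging step might be invoked as sketched below; the file-of-files listing the per-tool VCFs and the output prefix are assumptions.

```bash
# Sketch of merging the four Illumina-based call sets with SVmerge.
# The file-of-files and output prefix are assumptions; parameters as above.
REF=Gmax_v4_plus_organelles.fa
ls asmvar.vcf.gz manta.vcf.gz svaba.vcf.gz smoove.vcf.gz > illumina_vcfs.fof

SVmerge --ref $REF --fof illumina_vcfs.fof --prefix illumina_merged \
  -maxdist 15 -reldist 0.2 -relsizediff 0.1 -relshift 0.1
```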
SVs discovered from Oxford Nanopore data were also merged across samples using SVmerge with the same parameters as described above. However, for Oxford Nanopore variants, we modified SVmerge’s default behavior, which selects an allele randomly from a given SV cluster. Instead, we forced the random selection to be made among the alleles that had been refined by the SV refinement pipeline, if any, in order to favor alleles whose representation was expected to be closer to the true underlying sequence.
For the last candidate SV set combining Illumina and Oxford Nanopore variants, the two datasets described above were merged using SVmerge. The default behavior of SVmerge was again overridden, this time by systematically selecting an allele discovered from the Illumina data whenever an SV cluster contained alleles found by both Illumina and Oxford Nanopore sequencing. Despite the greater power of Oxford Nanopore data for discovering SVs, our reasoning was that if a variant was discovered by both sequencing technologies, then its Illumina representation was likely more precise given the higher basecalling accuracy of that technology. While many false positive SV calls are likely to have made their way into the candidate SV dataset, our approach was to provide Paragraph with as exhaustive a candidate set as possible and to rely on the accuracy of its genotyping model to avoid genotyping false positive SVs. Under this assumption, SVs that simply do not exist in the dataset should be genotyped as homozygous for the reference allele in all samples. Consistent with this expectation, 32.3% of the SVs in the raw output of Paragraph had an alternate allele frequency of 0.
The methods used for genotyping were identical for all three candidate SV sets. We prepared the VCF files for input to Paragraph by removing variants located less than 1 kb away from chromosome ends and padding the allele representations as required by Paragraph. We genotyped the 102 Illumina samples aligned by bwa mem following the recommendations outlined by Paragraph for population-scale genotyping, i.e., the variants were genotyped independently for each sample with multigrmpy, setting the -M option to 20 times the average sequencing depth for the sample.
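A per-sample genotyping run might look like the sketch below, where -M is set to 20 times the sample's average depth as recommended; the file names, the manifest column layout, and the example depth of 12× are assumptions.

```bash
# Sketch of a per-sample Paragraph genotyping run with multigrmpy.
# File names, manifest columns, and the example depth (12x) are assumptions;
# -M is set to 20x the sample's average depth as described above.
REF=Gmax_v4_plus_organelles.fa

printf "id\tpath\tdepth\tread length\nsample01\tsample01.bam\t12\t125\n" > sample01_manifest.txt

multigrmpy.py -i candidate_svs_padded.vcf.gz -m sample01_manifest.txt \
  -r $REF -o paragraph_sample01/ -M 240
```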
We compared the genotyping results of the three genotyped datasets against the Oxford Nanopore SV set in order to assess genotyping sensitivity and precision. For this analysis, the set of variants called from the Oxford Nanopore data by Sniffles and subsequently refined was considered to be the ground truth. Structural variant calls made from Oxford Nanopore data may themselves be erroneous, especially for smaller variants [2], so treating the Oxford Nanopore dataset as the ground truth is necessarily imperfect but nevertheless provides a sound basis for comparison for our purposes.
We compared the SV genotype calls to the ground truth set using the R package sveval v. 2.0.0 [30]. For each of the 17 samples for which Oxford Nanopore data was available, we compared the genotype calls made by Paragraph to the SVs identified in the Oxford Nanopore data for that sample. SVs genotyped as homozygous for the alternate allele by Paragraph and present in the Nanopore set were considered true positives, while SVs genotyped as homozygous for the alternate allele by Paragraph but absent from the Nanopore set were considered false positives. Note that, for benchmarking purposes, we essentially ignored heterozygous genotype calls made by Paragraph since the truth set only contained homozygous calls, as expected for inbred lines. Sensitivity was defined as the ratio of the number of true positive calls to the total number of SVs in the truth set, and precision as the ratio of the number of true positive calls to the sum of true and false positive calls. We computed sample-wise precision-recall curves for various SV size classes and SV types by using a range of read count thresholds (number of reads required to support a genotype call) to filter the Paragraph genotype calls. We required sveval to explicitly compare insertion sequences by setting ins.seq.comp = TRUE, but we otherwise used default settings. We extended sveval’s functionality by also assessing duplications under the same overlap conditions as the package already provides for deletions and inversions. Benchmarks were performed both on the complete set of SVs and on a subset of SVs located in non-repeat regions in order to assess the contribution of repeats to genotyping errors. An SV was defined as belonging to a repetitive region if it had a 20% or higher overlap with regions in the repeat annotation for the Williams82 assembly version 4 retrieved from Phytozome [78].
For the SVs discovered by Illumina, we computed additional precision-recall curves by filtering the SVs in the dataset genotyped by Paragraph based on two different metrics of SV quality: (1) the number of times the alternate allele is observed in homozygous genotype calls across the whole population (referred to hereafter as the homozygous ALT count) and (2) the number of calling tools (out of a maximum of four) that originally reported the SV. The more stringent homozygous ALT count was used instead of alternate allele frequency as a measure of the frequency of the SV in the population since true SVs are expected to be homozygous for the alternate allele in these inbred lines. Note that both of these quality measures (homozygous ALT count and the number of tools supporting an SV) effectively filter SV records and not individual genotype calls. The objective of these analyses was to see whether filtering on SV frequency or calling-tool support for variants could result in a higher-quality dataset.
Population genetics analyses
We used the set of merged Illumina and Oxford Nanopore SVs genotyped by Paragraph to evaluate whether SV calls could replicate population genetics analyses (structure, PCA, and phylogenetic tree) made from SNV calls. We applied methods similar to Torkamaneh et al. [6] in order to compute population structure for the 102-sample population. We called SNVs using Platypus v. 0.8.1.1 [52] with parameters --minMapQual=20 --minBaseQual=20 --maxVariants=10 --filterReadsWithUnmappedMates=0 --filterReadsWithDistantMates=0 --filterReadPairsWithSmallInserts=0. We filtered Platypus calls to keep only biallelic SNVs located on any of the 20 reference chromosomes. We only retained SNVs with a minor allele frequency ≥ 0.05, a proportion of missing sites ≤ 0.4, and a heterozygosity rate ≤ 0.1. The resulting 1.27 M SNVs were converted to PLINK BED format [53] and used as input to fastStructure v. 1.0 [79] using K = 5, as determined by Torkamaneh et al. [6]. A PCA was also computed on those SNVs using PLINK v. 1.90b5.3 with default parameters. Finally, we used the same SNV dataset to build a neighbor-joining phylogenetic tree with the PHYLIP package v. 3.697 [80] using pairwise distances computed with PLINK. We computed support for each node in the tree using a bootstrapping approach by sampling with replacement from the VCF file of SNVs 100 times and computing a new neighbor-joining tree from each such sample. The number of bootstrap trees supporting each node was then assessed using the prop.clades function of the ape R package v. 5.4.1 [81].
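For illustration, the SNV calling and the PLINK/fastStructure steps might be run as sketched below; file names, the Platypus invocation path, and the use of --allow-extra-chr for the non-standard soybean chromosome names are assumptions, and the SNV filtering and PHYLIP tree-building steps are not shown.

```bash
# Sketch of SNV calling and of the PCA/structure inputs for the SNV dataset.
# File names, the Platypus invocation, and --allow-extra-chr are assumptions;
# SNV filtering and the PHYLIP steps are not shown.
REF=Gmax_v4_plus_organelles.fa

python Platypus.py callVariants --bamFiles=bam_list.txt --refFile=$REF \
  --output=platypus_snvs.vcf --minMapQual=20 --minBaseQual=20 --maxVariants=10 \
  --filterReadsWithUnmappedMates=0 --filterReadsWithDistantMates=0 \
  --filterReadPairsWithSmallInserts=0

# Convert the filtered SNVs to PLINK BED format and compute a PCA
plink --vcf snvs_filtered.vcf.gz --allow-extra-chr --make-bed --out snvs
plink --bfile snvs --allow-extra-chr --pca --out snvs_pca

# Population structure with fastStructure at K = 5
python structure.py -K 5 --input=snvs --output=snvs_structure
```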
The same population genetics analyses were also performed on the population-scale dataset of Illumina/Oxford Nanopore SVs genotyped with Paragraph. For these analyses, we filtered SV genotype calls by setting those with less than two supporting reads to missing. We also removed duplications, inversions, and records with a homozygous ALT count < 4 or a proportion of missing sites ≥ 0.4. The resulting filtered SV dataset was used to compute population structure with fastStructure, PCA with PLINK, and neighbor-joining tree with PLINK and PHYLIP using the same methods as described for SNVs. To test for correlation between the pairwise distance matrices computed from SNVs and SVs, we performed a two-sided Mantel test with 999 permutations using the mantel.test function of the ape R package.
Since the statistical methods implemented by the programs used above do not depend on an evolutionary model of changes in DNA sequence but rather simply on genotype calls, these methods could be applied to the SV dataset in exactly the same way as for SNVs.
Potential impact on genes
We annotated deletions and insertions based on their overlap with various gene features. We retrieved the positions of the gene models for Williams82 assembly 4 from Phytozome [78] and determined for each SV whether it overlapped any of the following genic features: coding sequences, non-coding genic sequences, and regions 5 kb upstream of genes. These categories were mutually exclusive, such that an SV overlapping both coding and non-coding sequences was only labeled as “coding sequences.” Similarly, an SV was only labeled as “5 kb upstream” if it did not overlap any genic sequences. The SVs that overlapped none of the features described above were labeled as “intergenic.”
We first used these annotations to assess whether SVs were over- or underrepresented within particular genic features by comparing the observed proportions of deletions and insertions overlapping each feature to what would be expected by chance. We used three different measures of the random expectation for the proportion of SVs overlapping genic features. The first measure was a naive comparison to the proportion of the genome corresponding to each genic feature. This comparison is, however, biased because repetitive regions (which are largely non-genic) are less effectively queried for SVs than non-repetitive genic regions. Therefore, we also replicated the analysis excluding repetitive regions, which provided a second measure of random expectation. Finally, we performed a randomization test by estimating the distribution of the proportions of SVs that would be expected to overlap each genic feature by random chance. This was done by shuffling the start positions of SVs 5000 times within the 100-kb genome-tiling bins in which they are located and annotating them with the genic features overlapped. We used 100-kb bins tiled along the whole genome instead of shuffling the positions genome-wide to take into consideration the heterogeneity of the genome while allowing SVs to be repositioned in a gene-agnostic manner.
We also used the genic feature annotations to study differences in mean alternate allele frequencies of SVs depending on the features they overlapped. We averaged the frequencies of insertions and deletions overlapping each of the four genic features and computed the difference between the mean SV frequencies for each of the six possible pairwise combinations of features. SVs with a frequency of 1 in the population were excluded from this analysis because they might be due to errors in the reference assembly. Statistical significance was assessed using a randomization test by shuffling the genic feature annotations 10,000 times to get a distribution of mean SV frequency differences between feature groups under a random scenario. We computed one-sided p-values by comparing the observed values to the random distributions thus generated, using a significance threshold of α = 0.05 / 6 = 0.0083 to compensate for multiple testing.
Finally, we carried out enrichment analyses of GO [82] Biological Process terms and PFAM domains [83] to assess whether high-frequency gene-impacting SVs were associated with particular biological functions. We identified insertions and deletions with an alternate allele frequency ≥ 0.5 and < 1 among those overlapping coding sequences and found 546 genes overlapped by such SVs. These genes constituted our gene set of interest for the enrichment analyses. We used the GOstats Bioconductor package v. 2.56.0 [54] along with GO and PFAM annotations for Williams82 assembly version 4 retrieved from Soybase on April 20, 2021, to test this gene set for over- and underrepresentation of particular GO Biological Process terms or PFAM protein domains. We only tested GO terms and PFAM domains that were represented by at least 20 and 10 genes, respectively. For the GO terms, we used the conditional test as implemented in GOstats and the GO.db annotation package v. 3.12.1 [84]. We applied a Bonferroni correction to the p-values of both the GO and PFAM enrichment tests by multiplying the p-values by the number of terms/domains tested.
Transposable elements
We annotated TEs among the discovered SVs using the SoyTEdb database [48] downloaded from SoyBase [66]. We queried the deleted or inserted sequences of all deletions and insertions ≥ 100 bp against SoyTEdb using blastn v. 2.11.0+ [85] with default parameters. Any queried sequence that aligned to a TE in the database over at least 80% of the query length and at least 80% of the TE sequence length was considered a match and annotated accordingly with the classification of the best-matching TE. All alignments that matched these criteria had an extremely small E-value (< 10⁻⁸⁰), and therefore no additional filtering on E-value was needed.
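A minimal sketch of this BLAST step is shown below; file names are assumptions, and the output columns are chosen only to make the 80%/80% length checks possible downstream.

```bash
# Sketch of querying inserted/deleted SV sequences (>= 100 bp) against SoyTEdb.
# File names are assumptions; output columns support the 80%/80% length checks.
makeblastdb -in SoyTEdb.fa -dbtype nucl -out soytedb
blastn -query sv_sequences.fa -db soytedb \
  -outfmt "6 qseqid sseqid pident length qlen slen evalue bitscore" \
  -out sv_vs_soytedb.tsv
# A hit is retained when the alignment length covers >= 80% of both the query
# (qlen) and the TE (slen); the best such hit provides the TE classification.
```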
The annotated SVs were then used to determine both the proportion of polymorphic TEs belonging to each category and the physical location of polymorphic TEs in the genome. We also computed the proportions of TEs ≥ 100 bp in each category within the reference repeat annotation from Phytozome and compared those to the estimated proportions in the SV dataset. The estimated number of polymorphic TEs within various LTR retrotransposon families and DNA TE types was also compared to the number of non-reference TEs found by Tian et al. [38] to check whether our results were consistent with previous reports.
Soybean DNA TEs have received little attention compared to retrotransposons, which are more prevalent and polymorphic in this species (e.g., [38, 86]). DNA TEs that have TIRs typically transpose using a “cut and paste” mechanism. This mechanism generates a TSD upon insertion into the genome and, upon excision, leaves this TSD behind along with possible additional nucleotides resulting from DNA repair [87]. In order to study the dynamics of polymorphic DNA TEs within our population, we devised a pipeline based on local assembly and multiple sequence alignment of the DNA TE insertions. Briefly, the pipeline locally assembles Oxford Nanopore reads surrounding the sites of polymorphic DNA TEs for all samples using wtdbg2 and aligns these assemblies to each other using MAFFT v. 7.475 [50] before identifying TIR and TSD sequences with Generic Repeat Finder v. 1.0 [51]. For more details on the pipeline, see Supplemental Methods (Additional file 1). Our goal with this pipeline was to determine whether the insertion/deletion polymorphisms at various sites were due to novel TE insertion, TE excision, or a combination of both phenomena. We applied this pipeline to SVs that were annotated as TIR DNA TEs and whose best-matching sequence in the SoyTEdb database was matched by at least three SVs. We limited ourselves to TE sequences that were matched by at least three SV events under the assumption that TEs present in multiple copies were more likely to have been recently active. For insertions for which both TIR and TSD sequences were unambiguously identified, we computed the proportion of matching nucleotides in the alignment of the two terminal repeats and averaged the values across all local assemblies bearing the insertion in order to obtain a single value for that SV.
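The local assembly and cross-sample alignment steps of this pipeline might resemble the sketch below; the locus coordinates, file naming scheme, and wtdbg2 genome-size estimate are assumptions, and the Generic Repeat Finder step is not shown.

```bash
# Sketch of local assembly and cross-sample alignment at one polymorphic
# DNA TE locus. The locus, file names, and wtdbg2 genome-size estimate are
# assumptions; the Generic Repeat Finder step is not shown.
REGION=Gm02:4560000-4575000   # hypothetical polymorphic TE locus

for BAM in *_ont.bam; do
  S=$(basename $BAM _ont.bam)
  samtools view -b $BAM $REGION | samtools fastq - > ${S}_locus.fq
  wtdbg2 -x ont -g 20k -i ${S}_locus.fq -fo ${S}_locus
  wtpoa-cns -i ${S}_locus.ctg.lay.gz -fo ${S}_locus_raw.fa
  # Prefix contig names with the sample name so assemblies stay distinguishable
  awk -v s=$S '/^>/{print ">" s "_" ++n; next} {print}' ${S}_locus_raw.fa > ${S}_locus.fa
done

# Align all local assemblies against each other with MAFFT
cat *_locus.fa > te_locus_all.fa
mafft --auto te_locus_all.fa > te_locus_aligned.fa
```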
Software used
Unless otherwise stated, all statistical analyses and data manipulation were conducted in R version 3.5.0 or 4.0.3 [88] and Bioconductor version 3.8 or 3.12 [89]. Analyses made use of the Bioconductor packages Biostrings v. 2.58.0 [49], GenomicRanges v. 1.42.0 [44], Rsamtools v. 2.6.0 [90], rtracklayer v. 1.50.0 [91], and VariantAnnotation v. 1.36.0 [92]. All scripts used for the analyses described in this paper are available on GitHub [93].