Plant materials
The TDr96_F1 line used for WGS was selected from F1 progeny obtained from an open-pollinated D. rotundata breeding line (TDr96/00629) grown under field conditions in the experimental fields of the International Institute of Tropical Agriculture (IITA) in Nigeria. F1 seeds from TDr96/00629 and those obtained from the cross between the parental lines TDr97/00917 and TDr99/02627 used for RAD-seq were germinated on wet paper towels in darkness at 28 °C. After germination, the seeds were transferred to soil (Sakata Supermix A [34]) and grown at 30 °C with a 16-h/8-h photoperiod in a greenhouse at Iwate Biotechnology Research Institute (IBRC) in Japan. Fresh leaf samples were collected for DNA extraction. Additionally, to resequence the F1 progeny used for QTL-seq analysis, lyophilized leaf samples obtained from plants that were grown and phenotyped under field conditions at IITA were used for DNA extraction.
Determination of chromosome number and ploidy level
For chromosome observation, root tips of TDr96_F1 plants generated by in vitro propagation of nodal explants were sampled and fixed in acetic acid-alcohol (1:3 ratio) for 24 h without pretreatment. Fixed root tips were stained with a 1% aceto-carmine solution for 24 h. Samples were prepared by the squash method and analyzed under an Olympus BX50 optical microscope (Olympus Optical Co, Ltd., Tokyo, Japan [35]) at 400× magnification.
Estimation of D. rotundata genome size
The genome size of TDr96_F1 (D. rotundata) was estimated both by FCM and k-mer analyses. FCM analysis was carried out using nuclei prepared from fresh leaf samples of TDr96_F1 and a japonica rice (Oryza sativa L.) cultivar of known genome size (~380 Mb [36]), which served as an internal reference standard. Nuclei were isolated and stained with propidium iodide (PI) simultaneously and analyzed using a Cell Lab Quanta™ SC Flow Cytometer (Beckman Coulter, CA [37]) following the manufacturer’s protocol. The ratio of G1 peak means [yam (281.7):rice (188.7) = 1.493] was used to estimate the genome size of D. rotundata to be ~ 570 Mb (380 Mb × 1.5). k-mer analysis-based genome size estimation [10] was performed with TDr96_F1 PE reads with an average size of ~ 230 bp and a total length of 16.77 Gb (16,771,579,510 bp) using ALLPATHS-LG [11]. k-mer frequency analysis, with the k-mer size set to 25, generated values for k-mer coverage (Kc = 25.66) and mean read length (Rl = 228.8), which were used to estimate the genome size of TDr96_F1 to 579 Mb as follows:
$$ \mathrm{Genome}\kern0.17em \mathrm{Size}=\mathrm{Total}\;\mathrm{PE}\;\mathrm{read}\kern0.17em \mathrm{length}\;\left(\mathrm{bp}\right)\div \mathrm{Read}\kern0.17em \mathrm{coverage}\;\left(\mathrm{Rc}\right) $$
$$ \mathrm{Read}\kern0.17em \mathrm{coverage}\;\left(\mathrm{Rc}\right)=\left[k- mer\;\mathrm{coverage}\;\left(\mathrm{Kc}\right)\times \mathrm{Rl}\left]\div \right[\mathrm{Rl}\hbox{--} k\hbox{--} mer\;\mathrm{length}\;\left(\mathrm{k}\right)+1\right] $$
Whole-genome sequencing
For WGS, genomic DNA was extracted from fresh TDr96_F1 leaf samples using a NucleoSpin Plant II Kit according to the manufacturer’s protocol (Macherey-Nagel GmbH & Co. KG [38]) with slight modifications. Homogenized samples were washed with 0.1 M 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES) buffer to remove contaminating polysaccharides. Just before use, 120 mg polyvinylpyrrolidone (PVP), 90 mg l-ascorbic acid, and 200 μl 2-mercaptoethanol (ME) were added to 10 ml HEPES buffer, and 1 ml of the mixture was used to wash each sample; washing was repeated three times. Additionally, 10 μl 2-ME and 5 μl of 30% polyethylene glycol (PEG)-20000 were added to 1 ml of PL1 buffer (provided with the NucleoSpin Plant II Kit), and twice the recommended volume of buffer (800 μl) was used for cell lysis. Libraries for PE short reads and MP jump reads of various insert sizes including 2, 3, 4, 5, 6, and 8 kb were constructed using an Illumina TruSeq DNA LT Sample Prep Kit and a Nextera Mate Pair Sample Prep Kit, respectively. The PE library was sequenced on the Illumina MiSeq platform, while the MP libraries were sequenced on the HiSeq 2500 platform. Library construction and sequencing of the 20- and 40-kb MP jump sequences were carried out by Eurofins Genomics (Operon [39]) and Lucigen [40], respectively. The 20-kb and 40-kb jump libraries were sequenced on the MiSeq and HiSeq 2500 platforms, respectively. BAC libraries were constructed by Lucigen, and BAC-end sequencing was carried out by Genaris [41] using Sanger sequencing. A total of 30,750 clones corresponding to 3072 Mb of sequence and 5.4× genome coverage were constructed. Of these, 9984 clones were used for BAC-end sequencing, generating a 13.6-Mb sequence in PE fasta format, which was converted to 50-bp PE short reads corresponding to a 0.46-Gb sequence and ~ 0.8× coverage of the estimated 470-Mb D. rotundata genome (Additional file 1: Table S2 and Additional file 2: Figure S2).
De novo assembly
All TDr96_F1 sequence reads in fastq format were filtered for quality using the FASTX-Toolkit version 0.0.13 [42]. For PE reads and MP short jump reads with insert sizes ranging from 2 to 8 kb, only those having sequence reads with a Phred quality score of ≥ 30 (i.e., ≥ 90% of the reads) were retained. Adapter trimming and removal of MP reads with the wrong insert sizes were performed using an in-house pipeline of scripts written in Perl and C++. Quality filtering of the long jump sequences (20-, 40-, and 100-kb insert sizes) was carried out by the suppliers. For the initial de novo assembly, short PE reads and MP jump reads with 2- to 40-kb insert sizes (Additional file 1: Table S2) were assembled using ALLPATHS-LG assembler version R49856 [11]. Further scaffolding of the assembly generated by ALLPATHS-LG was performed using the 100-kb jump MP fastq reads obtained by BAC-end sequencing and the SSPACE PREMIUM 2.3 scaffolding tool with default parameters [13].
Constructing organelle genome sequences
De novo assembly of the D. rotundata mitochondrial genome sequence was performed using mitochondrial DNA isolated from TDr96_F1 leaf samples according to the method of Terachi and Tsunewaki [43] with the following minor modifications. Fresh green leaves (ca. 150 g) were homogenized in 1.5 L of homogenization buffer containing 0.44 M mannitol, 50 mM Tris-HCl (pH 8.0), 3 mM ethylenediaminetetraacetic acid (EDTA), 5 mM 2-ME, 0.1% (w/v) bovine serum albumin, and 0.1% (w/v) PVP. Following DNaseI treatment, the mitochondrial fraction was collected from the interface between 1.30 M and 1.45 M of a sucrose gradient. Mitochondrial DNA was purified by EtBr/CsCl centrifugation at 80,000 rpm for 6 h at 20 °C in a Beckman TLA 100.3 rotor. The DNA band was collected and purified by ethanol precipitation. The resulting mitochondrial DNA (15 ng) was amplified using a REPLI-g Mini Kit (Qiagen, Cat. no. 150023) and used for library construction. The library was sequenced on an Illumina MiSeq sequencer, and the resulting PE reads were assembled de novo using DISCOVAR De Novo [29], generating D. rotundata mitochondria contigs. For scaffolding, MP reads with insert sizes of 2, 3, 4, 5, 6, 8, and 20 kb obtained from D. rotundata genomic DNA (gDNA) were aligned to the D. rotundata mitochondrial contigs. MP reads showing 100% alignment were selected and used for scaffolding of D. rotundata mitochondrial contigs by SSPACE [13] (Additional file 2: Figure S5). To reconstruct the D. rotundata chloroplast genome sequence, the PE reads of TDr96_F1 were aligned to the recently published D. rotundata chloroplast genome sequence [16] (GenBank ID = NC_024170.1) by Burrows-Wheeler alignment (BWA) [44], and chloroplast-derived sequences were identified, amounting to 5,403,420 reads (14.74% of the total size of PE reads generated for TDr96_F1 [Table 1]) matching the assembled 155.4-kb chloroplast genome of D. rotundata.
Evaluation of the completeness of the genomic assembly
To evaluate the completeness of the D. rotundata genome assembly, the assembly was checked for the presence of 248 highly conserved core eukaryotic genes [45] using CEGMA version 2.4 with default parameters [14] (Additional file 1: Table S4). To further assess the completeness of the genome, the successor to CEGMA, Benchmarking Universal Single-Copy Orthologs (BUSCO), was used to check for the presence of 956 BUSCOs with version 1.1.b1 [15] using the early access plant dataset (Additional file 1: Table S5).
Annotation of transposable elements (TEs)
Legacy repetitive sequences, including transposons, were predicted using CENSOR 4.2.29 [46] with the following options: show_simple, nofilter, and mode rough using the Munich Information Center for Protein Sequences (MIPS) Repeat Element Database [47]. Following identification, the repeat elements were classified using mips-REcat [47]. Repetitive sequences were later improved by remodeling using RepeatModeler 1.0.8 [48] and masked with RepeatMasker 4.0.5 [49]. Using the National Center for Biotechnology Information (NCBI) database, one of three other options was used to generate interspersed RepeatModeler-based, interspersed Rebase-based, and Low complexity repeats: “nolow”, “nolow, species Viridiplantae”, and “noint”, respectively. Repeat element content and other statistics were compared between the D. rotundata and A. thaliana TAIR10 [50], B. distachyon v3.1 [51], and O. sativa v7_JGI 323 [52] genomes using the RepeatModeled and RepeatMasked references (Table 1).
RNA-seq
Total RNA was extracted using leaf, stem, flower, and tuber samples collected from a greenhouse-grown TDr96_F1 plant using a Plant RNeasy Kit (Qiagen [53]) with slight modifications. RLC buffer was used for lysis after the addition of 5 μl 30% PEG-20000 and 10 μl 2-ME to 1 ml of buffer. The RNA samples were treated with DNase (Qiagen) to remove contaminating genomic DNA. Two micrograms of total RNA was used to construct complementary DNA (cDNA) libraries using a TruSeq RNA Sample Prep Kit V2 (Illumina) according to the manufacturer’s instructions. The libraries were used for PE sequencing using 2× 100 cycles on the HiSeq 2500 platform in high-output mode. Illumina sequencing reads were filtered by Phred quality score, and reads with a quality score of ≥ 30 (≥90% of reads) were retained (Additional file 1: Table S12). Only one RNA-seq experiment was carried out per tissue/organ (indicated as sample in Additional file 1: Table S12).
Prediction of protein-coding genes
The legacy gene models were generated previously using the legacy repeat-masked reference genome and three approaches: ab initio, ab initio supported by evidence-based prediction, and evidence-based prediction. The ab initio prediction was carried out with FGENESH 3.1.1 [54]. The ab initio supported by evidence-based prediction was performed with AUGUSTUS 3.0.3 [55] using the maize5 training set and a hint file as the gene model support information. To construct the hint file, TopHat 2.0.11 [56] was used to align RNA-seq reads from tuber, flower (young), leaf (young), stem, leaf (old), and flower (old) samples to the D. rotundata reference genome, and Cufflinks 2.2.1 [57] was used to generate gene models from these data. The evidence-based predictions using the Program to Assemble Spliced Alignments (PASA) [58] were generated in a Trinity [59] assembled transcriptome from the RNA-seq data. JIGSAW 3.2.9 [60] was used to select and combine the gene models obtained using the three approaches with the weighting values assigned to the results from FGENESH, AUGUSTUS, and PASA of 10, 3, and 3, respectively. In total, 21,882 consensus gene models were predicted. These gene models were further improved upon using the MAKER [61] pipeline (Additional file 2: Figure S14). Publicly available ESTs and protein sequences from related plant species were aligned to the genome using GMAP [62] and Exonerate 2.2.0 [63], respectively. De novo and reference-guided transcripts were assembled from RNA-seq data from all 18 tissues using Bowtie 1.1.1 [64], Trinity 2.0.6 and SAMtools 1.2.0 [65], and Trinity 2.0.6 and TopHat 2.1.0, respectively. Both sets of assembled transcripts were used to build a comprehensive transcript database using PASA (Additional file 1: Table S13). High-quality non-redundant transcripts from PASA were used to generate a training set for AUGUSTUS 3.1. Gene models were predicted twice using the genome, improved repeat sequences, assembled transcripts, EST and protein alignments, the AUGUSTUS training set, and a legacy set of 21,882 gene models obtained previously using MAKER 2.31.6 [61], retaining all legacy gene models or querying them with new evidence and discarding those that could not be validated. From both MAKER runs, 21,894 and 76,449 gene models were predicted, respectively. A consensus set of gene models from both MAKER outputs was obtained using JIGSAW 3.2.9 [60] at a 1:1 ratio. In total, 26,198 consensus gene models were predicted in the D. rotundata genome. The corresponding amino acid sequences were also predicted for these gene models. To confirm these gene models, the RNA-seq reads were aligned to the CDSs (coding sequences) of the predicted genes using BWA [44] with default parameters. Accordingly, 85.8% of the gene models could be aligned by at least a single RNA-seq read. Functional annotation of the amino acid sequences was performed using the in-house pipeline, AnnotF, which compares Blast2GO [66] and InterProScan [67] functional terms.
Comparative genomics
Pairwise orthology relationships were determined with Inparanoid [68,69,70] using the longest protein-coding isoform for each gene in Arabidopsis thaliana (TAIR10) [50], Oryza sativa japonica (v7.0) [52], Brachypodium distachyon (v3.1) [71], Musa acuminata (v2) [72], Elaeis guineensis (EG5) [73], and Phoenix dactylifera (DPV01) [74]. Orthology clusters across all seven species were determined using Multiparanoid [75]. Sequences for the 12 classes of lectins were obtained from UniProt [76] for the proteomes of A. thaliana (up000006548), B. distachyon (up000008810), and O. sativa (up000059680). Protein alignments for B-lectin class protein sequences from all three of these species and D. rotundata were generated using the program Multiple Alignment using Fast Fourier Transform (MAFFT) [77]. Maximum likelihood trees were constructed based on the concatenated alignments of all 378 B-lectin proteins using RAxML [78] 8.0.2 with 1000 bootstraps. Enrichment of tuber-specific genes was detected using TopHat 2.1.0 to align RNA-seq data from each of the 12 tissues to the genome, with one biological replicate for each tissue. HTSeq 0.6.1 [79] was used to generate raw counts. Then the Bioconductor package DESeq2 1.14.1 [80] was used to compare raw counts of the three tuber tissues against all the other nine tissues (Additional file 1: Table S12) to determine tuber-enriched gene expression based on a log2 fold change > 0 and Benjamini–Hochberg [25] adjusted P value < 0.05.
Gene enrichment analysis of orthology clusters was performed with GOATOOLS [81], using the Holm significance test, and the false discovery rate was adjusted using the Benjamini–Hochberg procedure [25]. The list of enriched genes was filtered for redundant Gene Ontology (GO) terms using REVIGO [82]. For the species phylogeny, protein alignments for each gene with a 1:1 orthologous relationship across all monocot species were generated with MAFFT using the longest protein isoform. Maximum likelihood trees were constructed based on the concatenated alignments of 2381 orthologous protein-coding genes using RAxML 8.2.8 [78] with a JTT + Γ model and 1000 bootstraps.
SynMAP [83] using BLASTZ [84] alignments, DAGchainer [85] (options -D 30 and -A 2), and no merging of syntenic blocks were used as part of the CoGe platform [86] to identify syntenic blocks between the hard-masked pseudo-chromosomes of D. rotundata and scaffolds/contigs of Oryza sativa japonica (A123v1.0), Spirodela polyrhiza (v0.01), and Phoenix dactylifera L. (v3). A syntenic path assembly was then carried out on each of the same three species in SynMap using synteny between the scaffolds/contigs against D. rotundata pseudo-molecules. The syntenic path assembly is a reference-guided assembly that uses the synteny between two species to order and orientate contigs. This approach highlights regions of conservation that were otherwise too shuffled to be clearly observed. Self-self synteny analysis of D. rotundata pseudo-chromosomes was carried out using SynMap Last alignments with default parameters and syntenic gene pair synonymous rate change calculated by CodeML [87].
RAD-based linkage mapping and scaffold anchoring
RAD-seq was performed as previously described [88] with a minor modification. Genomic DNA was digested with the restriction enzymes PacI and NlaIII to prepare libraries used to generate PE reads by Illumina HiSeq 2500 (Additional file 2: Figure S6). Approximately 822.7-Mb and 250.4-Mb sequence reads covering 22.9% and 5.3% of the estimated 504-Mb D. rotundata genome sequence, excluding gap regions, at average depths of 7.2× and 9.8× were generated for the parental lines and F1 individuals, respectively (Additional file 2: Figure S7).
Library preparation and sequencing
For library construction, 1 μg DNA obtained from the two parental lines (TDr97/00917-P1 and TDr99/02627-P2) and the 150 F1 individuals was digested with PacI, which recognizes 5’-TTAATTAA-3’, and a biotinylated adapter-1 was ligated to the digested DNA fragments. The adapter-1-ligated DNA fragments were digested with a second enzyme, NlaIII (5’-CATG-3’). After collecting the biotinylated fragments using streptavidin-coated magnetic beads, adapter-2 was ligated to the NlaIII-digested ends. The adapter-ligated DNA was amplified using primers containing sample-specific index sequences, adapter-1 (F) and adapter-2 (R) sequences, and sequences corresponding to the P7 and P5 primers for Illumina sequencing library preparation (Additional file 2: Figure S6). The PCR products were pooled in equal proportions, purified, and subjected to PE sequencing on the Illumina HiSeq 2500 platform. Detailed information about the primers (P7 and P5) used for Illumina library preparation are given in Additional file 1: Table S20.
Identification of parental line-specific heterozygous markers
RAD-tags were aligned to the D. rotundata reference genome using BWA [44]. The aligned data were converted to SAM/BAM files using SAMtools [65], and the RAD-tags with mapping quality < 60 or containing insertions/deletions in the alignment data were excluded from analysis. Low mapping positions including those with only a single RAD-tag and a mapping quality score of < 30 were also excluded. SNP-index values [28] were calculated at all SNP positions. For linkage mapping, two types of heterozygous markers (SNP-type and presence/absence-type) were identified (Additional file 2: Figure S8). The SNP-type heterozygous markers were defined based on SNP-index patterns of the parental line RAD-tags. For example, positions with SNP-index values ranging from 0.2 to 0.8 in P1 but homozygous in P2 with SNP-index values of either 0 or 1 were defined as P1-specific heterozygous SNPs. A similar procedure was followed to identify P2-specific heterozygous SNP markers. The selected markers were filtered using depth information at all positions. To increase the accuracy of the selected markers, their segregation (1:1 ratio) was confirmed in 150 F1 individuals obtained from a cross between P1 and P2. If the segregation ratio was out of the confidence interval (P < 0.05) hypothesized by the binomial distribution, B(n = number of individuals, P = 0.5), the markers were excluded from further analysis. Only one marker was selected per 10-kb interval based on the number of F1 individuals represented and tag coverage. A total of 1105 and 990 P1- and P2-heterozygous SNP markers were selected, respectively (Additional file 1: Table S7).
The presence/absence-type markers were defined based on the alignment depth of parental line RAD-tags. First, genomic positions that could be aligned by RAD-tags from only one of the parental lines were identified. Additionally, aligned tags should be heterozygous for that particular region. Similar to the SNP-type markers, the segregation patterns of candidate presence/absence-type markers in the F1 progeny were confirmed, and only those that segregated at a 1:1 ratio (as confirmed by binomial distribution filter) were retained. In the F1 progeny, positions with sequencing depths of ≥ 3 and 0 were defined as heterozygous and homozygous, respectively. For a given candidate position/marker, if the number of F1 individuals defined as homozygous or heterozygous was less than 120, the marker was excluded from further analysis. Only one heterozygous position was selected as a marker within a given 10-kb interval. In total, 221 and 282 positions were selected as P1- and P2-specific presence/absence-type heterozygous markers, respectively (Additional file 1: Table S7).
Linkage mapping
To developing parental line-specific linkage maps, P1-Map and P2-Map, recombination fraction (rf) values between all pairs of markers on a given scaffold were calculated for both parents using the recombination pattern of the 150 F1 individuals. To minimize incorrect mapping, scaffolds were divided at positions where rf values exceeded 0.25 from the initial marker position (Additional file 2: Figure S9). Only two flanking (distal) markers per scaffold were selected, corresponding to 477 and 493 P1- and P2-specific markers, respectively. These markers were used to develop P1 and P2 linkage maps according to the pseudo-testcross method [18] using the backcross model of R/qtl [89]. Due to the use of the pseudo-testcross method, the initial maps contained both the coupling and repulsion-type markers. Consequently, the genetic distance in linkage groups was larger than expected. To avoid the effect of repulsion-type markers when calculating genetic distances, these markers were converted to coupling-type markers. If a marker showed a high logarithm of odds (LOD) score and an rf value > 0.5, it was defined as repulsion type and was therefore converted to the coupling-type genotype. This conversion was carried out gradually by changing the threshold of the LOD score from 10 to 5, and then to 3. After converting all repulsion markers to coupling markers, linkage maps were developed using markers showing LOD score > 3 and rf value < 0.25. Accordingly, a total of 21 and 23 linkage groups, each with a minimum of three markers, were generated for P1- and P2-Maps, respectively (Additional file 1: Table S8 and Additional file 2: Figure S10).
Anchoring scaffolds
To develop chromosome-scale pseudo-molecules, TDr96_F1 scaffold sequences were anchored onto the two parental-specific linkage maps using the selected RAD markers. To combine the two maps, the number of scaffolds shared between all possible linkage group (LG) pairs corresponding to the two maps was determined (Additional file 2: Figures S11, S12). LG pairs that shared the largest number of scaffolds were combined using the same scaffolds. Each combined LG represented a pseudo-chromosome, which was designated/numbered according to the P1-Map LG designation (see Fig. 2 and Additional file 2: Figure S11). After combining the two maps to construct the pseudo-chromosomes, P1- and P2-specifc scaffolds were ordered according to their original order in their respective LGs. If the order of scaffolds could not be decided because the order was similar in both the P1- and P2-Maps, the order in P1-LG was adopted (Fig. 2). Finally, the ordered scaffolds were connected by 1000 nucleotides of “N” into a single fasta file for each pseudo-chromosome (Additional file 2: Figure S12).
QTL-seq analysis
DNA samples obtained from the two parental lines, TDr97/00917 (P3, female) and TDr97/00777 (P4, male), as well as samples pooled in equal amounts from 50 male (male-bulk) and 50 female (female-bulk) F1 individuals obtained from the cross between P3 and P4 were subjected to WGS. Libraries for sequencing were constructed from 1-μg DNA samples with a TruSeq DNA PCR-Free LT Sample Preparation Kit (Illumina) and were sequenced via 76 cycles on the Illumina NextSeq 500 platform. Short reads in which more than 20% of sequenced nucleotides exhibited a Phred quality score of < 20 were excluded from further analysis. To perform QTL-seq analysis of F1 progeny, two types of analyses are required. In the first analysis, the SNP index and ΔSNP index are calculated at P4-specific heterozygous positions. The second analysis is performed using P3-specific heterozygous positions. To identify P4-specific heterozygous positions, the P3 “reference sequence” was first developed by aligning P3 reads to the reference genome sequence of D. rotundata and replacing nucleotides of the D. rotundata reference genome sequence with nucleotides of P3 at all SNP positions showing an SNP index of 1 (Additional file 2: Figure S17c). SNP detection, calculation of SNP index, and replacement of SNPs were carried out via step 2 of QTL-seq pipeline version 1.4.4 [90]. Short reads obtained from both the male and female parents were then aligned to the “reference sequence” and heterozygous SNP positions between the two were extracted. A SNP was defined as heterozygous if the same position showed an SNP-index value ranging from 0.4 to 0.6 in one parent and a value of 0 in the second parent. Of the selected markers/positions, only those having enough depth in both parents (>15) were used for analysis of SNP-index values in the bulk-sequenced samples. P3-specific heterozygous positions were identified similarly using the P4 “reference sequence.”
After identifying P4- and P3-specific heterozygous positions, the Illumina reads from the two bulk-sequenced samples (male and female bulks) were aligned to the reference sequences using BWA [44] and subjected to Coval filtering [91] as previously described. When the P3 reference sequence was used for alignment, the SNP-index values were calculated only at all of the P4-specific heterozygous positions. By contrast, when the P4 reference sequence was used for alignment, the SNP-index values were calculated only at the P3-specific heterozygous positions. In both cases, positions with shallow depth (< 6) in either of the two samples were excluded from analysis. The ∆SNP index was calculated by subtracting the SNP-index values of the male bulk from those of the female bulk. To generate confidence intervals of the SNP-index value, an in silico test simulating the application of QTL-seq to DNA bulked from 50 randomly selected F1 individuals was performed as described previously [28] (Additional file 2: Figure S22). The simulation test was repeated 10,000 times depending on the alignment depth of short reads to generate confidence intervals. These intervals were plotted for all SNP positions analyzed. Finally, sliding window analysis was applied to SNP-index, ∆SNP-index, and confidence interval plots with a 1-Mb window size and a 50-kb increment to generate SNP-index graphs (Additional file 2: Figure S18).
Identification of putative W-region by de novo assembly of female and male parental genomes and mapping of bulked DNA from female and male F1 progeny
DNA samples obtained from the two parental lines, TDr97/00917 (P3, female) and TDr97/00777 (P4, male), were separately subjected to de novo assembly. Libraries for sequencing were prepared with a TruSeq DNA PCR-Free LT Sample Preparation Kit (Illumina) and were sequenced for 251 cycles on the Illumina MiSeq platform. Contigs were generated using the DISCOVAR De Novo assembler [29], resulting in P3-DDN and P4-DDN, respectively. Separately, whole-genome resequencing of bulked DNA was performed on bulked DNA samples obtained from 50 female F1 (Female-bulk.fastq) and 50 male F1 (Male-bulk.fastq) progeny, all derived from a cross between P3 and P4. Two reference sequences, P3-DDN and P4-DDN, were combined to generate P3-DDN/P4-DDN. Short reads from the female and male bulks were separately mapped to P3-DDN/P4-DDN using the alignment software BWA [44]. After mapping, the MAPQ scores of the aligned reads were obtained. Under our conditions, if a short read was mapped to a unique position of the reference sequence, the MAPQ score was 60, whereas if the read was mapped to multiple positions, MAPQ was < 60. Since two reference sequences (P3-DDN and P4-DDN) were fused to generate P3-DDN/P4-DDN, most genomic regions were represented twice. Therefore, most short reads mapped to two or more positions, leading to a MAPQ score < 60. The reads that mapped to the P3-DDN/P4-DDN with MAPQ = 60 were judged to be located in either P3- or P4-specific genomic regions. After finding these P3- or P4-specific genomic regions, the depth of short reads that covered the regions for Female-bulk.fastq and Male-bulk.fastq, respectively, was evaluated. If the depth of Female-bulk.fastq was high and the depth of Male-bulk.fastq was 0 or close to 0, such genomic regions were retained as putative W-regions (Fig. 6 and Additional file 2: Figure S20).
DNA markers linked to sex
The primer sequences used for amplification of sex-linked markers sp1 and sp16, as well as the control Actin gene fragment (Dr-Actin), were as follows:
PCR primers for sp1 fragment:
-
sp1-F; 5’-GATCTGGCTTCCTCCATCTTG-3’
-
sp1-R; 5’-GCTTGGGTGGTTAGTTTATTGTTTG-3’
-
PCR primers for sp16 fragment:
-
sp16-F; 5’-AATGTGTTTAACAGGGTGAATTC-3’
-
sp16-R; 5’-GAATTCAGCCGAATATACTTATTC-3’
-
PCR primers for Dr-Actin gene fragment:
-
Dr-Actin-F; 5’-CAGGGAAAAGATGACCCAAATC-3’
-
Dr-Actin-R; 5’-CCATCACCAGAATCCAGCAC-3’
PCR was performed using the following conditions: 30 cycles of 98 °C for 10 s, 55 °C for 30 s, and 72 °C for 1 min. For CAPS analysis of the sp1 marker, the amplified DNA was digested with EcoRI. All PCR products were electrophoresed on 1.5% agarose gels.
Identification of SSR markers
Approximately 4,932,582 bp simple sequence repeat (SSR) motif-containing sequences were predicted in the D. rotundata genome. Within this region, SSR sequences with enough flanking regions were identified and evaluated for use in primer design. Accordingly, 134,101 SSR-containing sequences, excluding those with single base repeats, were identified. The SSR information for these sequences was analyzed using GMATo [92] version 1.2 Build 20130106 with the following parameters: m (minimum motif length) = 2, x (maximum motif length) = 10, and r (minimum repeat number) = 10. The necessary information was obtained for 22,164 SSR-containing sequences in the assembled genome, 12,724 (57.4%) of which were anchored to the genetic map (Additional file 1: Table S19). Primer pairs were designed for 1000 of these sequences using Primer3 [93] software release 2.3.6 with the following parameters: product size = 100–500, primer length = 18–22 bp (optimum 20 bp), GC content = 40–60% (optimum 50%), and Tm = 57–63 (optimum = 60.0).