De novo assembly of cauliflower and cabbage genomes
The inbred lines cauliflower Korso_1401 (hereafter Korso) and pointed cabbage OX-heart_923 (hereafter OX-heart) were selected for genome sequencing (Additional file 1: Figure S1). Approximately 70.0 Gb PacBio sequences were generated for each accession, covering about 120× of the Korso and OX-heart genomes, which had estimated sizes of 566.9 Mb and 587.7 Mb, respectively (Additional file 1: Figure S2). These PacBio reads were de novo assembled into contigs and errors in the assembled contigs were corrected using both PacBio long reads and Illumina short reads (~ 100 Gb for each accession). In addition, a genome map was assembled from 242.2 Gb cleaned BioNano optical map data for Korso and used to connect the assembled contigs. Furthermore, 285.8 and 453.0 million cleaned Hi-C read pairs, among which 58.6 and 140.0 million were valid, were used for pseudochromosome construction for Korso and OX-heart, respectively. The final genome assemblies of Korso and OX-heart comprised 615 and 973 contigs, respectively, with cumulative lengths of 549.7 Mb and 565.4 Mb, and N50 sizes of 4.97 Mb and 3.10 Mb (Additional file 2). A total of 544.4 Mb and 539.1 Mb, accounting for 99.0% and 95.3% of the Korso and OX-heart assemblies, respectively, were clustered into nine pseudomolecules. The Hi-C heatmaps (Additional file 1: Figure S3) and the good synteny between Korso and OX-heart assemblies and the broccoli HDEM assembly [8] (Additional file 1: Figure S4) supported their chromosome-scale structures.
Around 99.8% of the Illumina genomic reads could be mapped back to the Korso and OX-heart assemblies, with 99.6% of the assemblies covered by at least 5 reads. Based on the alignments, the estimated base error rates of the Korso and OX-heart assemblies were 1.23 × 10−5 and 5.6 × 10−5, respectively (Additional file 3). BUSCO analysis [19] showed that 97.2% and 96.5% core conserved plant genes were completely assembled in Korso and OX-heart. In addition, up to 98.0% of the RNA-Seq reads could be mapped to the assemblies (Additional file 4). Together, these results demonstrated the high quality of the Korso and OX-heart assemblies.
Genome annotation and comparative genomics
Approximately 60.7% and 62.0% of the Korso and OX-heart assemblies were annotated as repetitive elements, respectively, similar to that (60.5%) in the B. oleracea var. italica HDEM genome assembly (Additional file 5). Full-length long terminal repeat retrotransposons (LTR-RTs) were then extracted from the Korso, OX-heart, and B. rapa (V3.0) [20] genomes (Additional file 6). Insertion time estimation of these intact LTR-RTs unraveled two LTR-RT bursts that occurred in Korso and OX-heart, around 0.2 and 1.5 million years ago (mya), respectively (Additional file 1: Figure S5). In contrast, in B. rapa, most of the LRT-RT formed recently, with more than 30% of the identified intact LTR-RTs younger than 0.2 mya, compared to 16.3% and 15.9% in Korso and OX-heart, respectively.
The high-quality Korso and OX-heart assemblies allowed us to precisely identify the centromere locations. The determined positions of centromeres on each chromosome in both genomes (Additional file 1: Figure S6) were consistent with the previously determined centromere locations using fluorescent in situ hybridization (FISH) analysis [21]. As expected, repetitive elements were enriched in the centromere regions. Different repeat families displayed clearly different patterns on the chromosomes, e.g., Copia-type LTRs were mainly in centromeres, while Gypsy-type LTRs were in pericentromeric regions (Fig. 1a).
A total of 60,640 and 62,232 protein-coding genes were predicted from Korso and OX-heart genomes, respectively, using an integrated strategy combining ab initio, transcript-based and homology-based predictions. Among these predicted genes, 70.9% and 76.4% were supported by transcriptome evidence, and 91.0% and 90.0% had homologs in other plant species.
Synteny analysis of Korso, OX-heart, B. rapa and A. thaliana genomes confirmed the whole genome triplication (WGT) and subsequent sub-genome divergence in Brassica species [4, 14, 22] (Additional file 1: Figures S7 and S8). Based on these syntenic relationships, we identified the triplicated regions within Korso and OX-heart genomes and divided them into three subgenomes based on their retained gene densities (Fig. 1). As previously reported in B. rapa [22] and B. oleracea [4], the three subgenomes of Korso and OX-heart, LF (the least fractionated), MF1 (the medium fractionated), and MF2 (the most fractionated), showed the same biased retention pattern of duplicated genes during diploidization [23] (Additional file 1: Figure S8a,b). Duplicated gene copies left in different subgenomes displayed diverged gene expression patterns, with the copies located in LF generally having higher expression levels than those in MF1 and MF2 (Additional file 1: Figure S8c,d).
We compared protein sequences of predicted genes from four B. oleracea accessions (cauliflower Korso, pointed cabbage OX-heart, broccoli HDEM and kale like rapid cycling TO1000), three other Brassica species, B. rapa, B. nigra, and the C subgenome of B. napus, five other Brassiaceae species (Aethionema arabicum, Arabidopsis thaliana, Capsella rubella, Thellungiella salsuginea, and Schrenkiella parvula), and two outgroups (grape and papaya). A phylogenetic tree was constructed using 1638 single-copy orthologous genes, which indicated that cabbage and the common ancestor of cauliflower and broccoli diverged about 1.68 mya, the extant B. oleracea and the donor of B. napus C subgenome diverged about 2.27 mya, and Brassica diverged from other Brassiaceae species about 16.18 mya (Fig. 1b).
SVs between genomes of Korso and OX-heart
By taking advantage of the high-quality genome assemblies of Korso and OX-heart, we were able to identify high-confidence SVs through direct genome comparison combined with PacBio long read mapping. The Korso and OX-heart assemblies displayed very high collinearity indicating the balanced rearrangements (inversions and translocations) were not profound between them (Additional file 1: Figure S4). Therefore, in this study, we focused on the unbalanced SVs, mainly indels. A total of 119,156 SVs were identified between genomes of Korso and OX-heart, with sizes ranging from 10 bp to 667 kb with a clear bias to the relatively short ones, and these SVs were more likely to overlap with different types of repetitive sequences except the satellite sequences (Additional file 7 and Additional file 1: Figure S9).
SVs in gene bodies and promoter regions can affect the function or expression of the corresponding genes. The SV regions accounted for 14.5% and 15.0% of the total genome sizes of Korso and OX-heart, 10.0% and 11.3% of the gene regions, and 5.9% and 6.6% of the coding sequences, respectively, suggesting a functional constraint against the occurrence of SVs in genes, especially in coding regions, while no obvious restriction of SVs in promoter regions was detected (Additional file 8). More than half of the annotated genes in Korso (58.5%) and OX-heart (58.6%) were affected by at least one SV in their gene bodies or promoter regions, with a functional enrichment in diverse biological processes, such as cellular component organization, response to stress and stimulus, signal transduction, cell differentiation, embryo development, gene expression and epigenetic regulation, and flower and meristem development (Additional file 1: Figure S10). We detected several previously described SVs in B. oleracea, including the two indels in BoFLC3 related to subtropical adaptation of broccoli [24], and the two indels in BoFRIa related to winter annual or biennial habit of cauliflower and cabbage [25].
Population dynamics of SVs in different B. oleracea morphotypes
Cabbage and cauliflower represent two extreme morphotypes of the B. oleracea species, and identifying genomic variations underlying the formation of their unique phenotypes (e.g., leafy head and curd) would provide novel insights into the molecular regulation of these important traits as well as important information for facilitating breeding. The high-quality SVs that we identified between Korso and OX-heart provided a valuable reference to investigate their dynamics in different morphologically diverged B. oleracea accessions. For this purpose, we performed genome resequencing of 163 B. oleracea accessions, including 89 cauliflower, 65 cabbage, and 9 broccoli accessions. We also collected resequencing data of an additional 108 B. oleracea accessions reported in Cheng et al. [14], including 15 cauliflower, 39 cabbage, 24 broccoli, 18 kohlrabi, four Chinese kale, two curly kale, two kale, two brussels sprout, and two wild B. oleracea accessions (Additional file 9). Among these 271 accessions, 211 were sequenced to a depth of more than 10×. The 119,156 high-quality reference SVs were genotyped in these 271 accessions based on the alignments of genome sequencing reads to the Korso and OX-heart genomes. To assess the accuracy of our SV genotyping, we genotyped the reference SVs in Korso and OX-heart by mapping their Illumina short reads to both these genomes, respectively. More than 86% of SVs could be genotyped, while only 0.1% were falsely genotyped (Additional file 10), suggesting high sensitivity and accuracy of our genotyping. The SV genotyping rate in each accession ranged from 41.3% to 80.2%, with 187 (69.0%) and 254 (93.7%) accessions having a genotyping rate greater than 70% and 60%, respectively (Additional file 1: Figure S11 and Additional file 9). In total, 89,882 (75.4%) SVs were successfully genotyped in more than 50% of the 271 accessions.
SV allele frequency variations among different groups of B. oleracea are mainly a result of domestication for different desirable traits and adaptation to different environments. As expected, SV loci with the homozygous Korso alleles were prevalent in cauliflower accessions, taking up an average of 82.3% of the genotyped SVs in each accession, whereas in cabbage accessions, the homozygous OX-heart alleles were prevalent, with an average frequency of 61.7% (Fig. 2a and Additional file 9). Phylogenetic and principal component analyses (PCA) using the SVs clearly divided cauliflower, cabbage, broccoli, and kohlrabi accessions into different groups (Fig. 2b, c), which were concordant with the patterns revealed by SNP data in our analysis based on the same 271 accessions (Additional file 1: Figure S12) and the previous report based on 119 accessions [14], further supporting that our SV detection and genotyping were highly reliable.
To identify SVs potentially related to the specific traits of cauliflower or cabbage, we extracted a total of 49,904 SVs with significantly different allele frequencies between cauliflower and cabbage populations (Fig. 3a). Among these SVs, 49,285 (98.8%) had significantly higher allele frequencies of Korso genotypes in cauliflowers than in cabbages, while only 550 represented higher OX-heart allele frequencies in cauliflowers than in cabbages. These potentially selected SVs were distributed across the chromosomes without conspicuous hotspots (Additional file 1: Figure S13). Such prevalence of selected SVs across the genome is consistent with the relatively large divergence time (~ 1.68 mya) between the two highly specialized B. oleracea morphotypes and their independent evolution and domestication history (Fig. 1b).
In Korso and OX-heart genomes, 21,111 and 21,400 genes, respectively, overlapped with at least one selected SV in their gene bodies or promoter regions, with 6059 and 6344 overlapping with selected SVs in CDS regions. GO enrichment analyses of these genes with selected SVs revealed that those related to signal transduction, response to stimulus, cell differentiation, cell cycle, embryo development, cell growth and cell death, and flower development were significantly overrepresented (Fig. 3b), some of which showed potential associations with the distinct phenotypes of cauliflower and cabbage, such as flower development.
Selected SVs provide insights into the evolution of cauliflower curd formation
The curd of cauliflower is composed of a spirally iterative pattern of primary inflorescence meristems with floral primordia arrested in their development [26, 27]. The first insight in genetic control of the curd-like structure was achieved through characterization of the Arabidopsis ap1 and cal double mutant with a cauliflower curd phenotype [28]. Subsequently, several studies indicated that the genetic nature of the cauliflower curd appears more complex [29,30,31]. Here, we retrieved a total of 294 genes harboring selected SVs in their promoters or gene regions and whose homologs in Arabidopsis have been reported to function in flowering time and floral development, meristem maintenance and determination, organ size control, and shoot or inflorescence architecture (Additional file 11). In addition, RNA-Seq analysis of five stages from vegetative shoot apical meristem (SAM) to enlarged curd was conducted to reveal the potential roles of SVs in curd formation and development.
Transition from vegetative to generative development
The first stage of curd initiation corresponds with the switch from vegetative to generative development (Fig. 4a). Timely transition to the generative stage in cauliflower is essential for curd formation, while for cabbage a prolonged vegetative stage is needed for the proper development of the leafy head. The MADS box transcription factor FLC, a flowering time integrator in the vernalization and autonomous pathways, acts as a repressor of flowering [32, 33]. Several studies have demonstrated the roles of FLC paralogues in flowering time in diverse B. oleracea morphotypes [24, 34,35,36,37]. A 3371-bp insertion (SV_b_92666a) in the promoter of BoFLC1.1 in Korso was found under strong differential selection, present in 99% and 88% of the cauliflower and broccoli accessions, respectively, while only in 9% of cabbage accessions (Fig. 4b). BoFLC1.1 and its two tandem paralogs (BoFLC1.2 and BoFLC1.3), as well as BoFLC3 were all significantly down-regulated at the transition stage (Fig. 4c). The Korso allele of BoFLC3 contains a 263-bp deletion (SV_w_24534) and a 49-bp insertion (SV_w_24533) in the first intron. The effect of the structure of the FLC first intron on flowering time has been reported in Arabidopsis and cruciferous crops [24, 38, 39]. We found that at these two SV loci, the Korso alleles were predominant in cauliflower (86.7% and 86.4%) and broccoli (96.9% and 92.9%), but rare in the cabbage accessions (9.7% and 8.7%) (Fig. 4b).
The FLC function is activated by FRI [40], which has been identified as a candidate gene in the QTL region for temperature-dependent timing of curd induction in cauliflower [41]. Two FRI homologs, BoFRI1 and BoFRI2, were identified in both Korso and OX-heart genomes. A 743-bp deletion (SV_b_96002) in the promoter region of BoFRI1 characterized the Korso allele. Most of the cauliflowers (98.0%) contained the homozygous Korso genotype, while the majority of cabbages (87.0%) harbored the homozygous OX-heart genotype (Fig. 4b). For BoFRI2, two insertions (12- and 21-bp, SV_w_31837, and SV_w_31838) were identified in its coding region, both displaying significant differences of genotype frequencies between cauliflower and cabbage (Fig. 4b). These two indels have been found to be related to winter annual or biennial habit of cauliflower and cabbage [25]. FES and SUF can form a putative transcription activator complex with FRI to promote FLC expression [42, 43]. The B. oleracea homologs BoFES1.1 and BoSUF4.2 harbored selected SVs in cauliflowers compared to cabbages, and their expression was significantly down-regulated from the vegetative to the transition stage in cauliflower, similar to that of BoFLC1s and BoFLC3 (Additional file 11). Other genes involved in regulating FLC expression, including those involved in epigenetic modification such as the PRC1 and PRC2 complex components BoVIN3, BoVIL2.3, and BoVRN1.1, also harbored selected SVs (Additional file 11). Together, these results suggested that the FLC-related autonomous and vernalization pathways might be affected by the differential SVs between cauliflower and cabbage, possibly contributing to their different timing of switch to the generative stage.
Inflorescence meristem proliferation
The main process following curd initiation is the continuously regular spiral proliferation of undetermined inflorescence meristems that form the curd. Stem cell maintenance and meristem proliferation play key roles in this process. WUSCHEL acts as an auxin response rheostat to maintain apical stem cells in Arabidopsis [44]. We identified a 12-bp in-frame deletion (SV_w_83072) in the second exon and a 21-bp insertion (SV_w_83073) in the first intron of BoWUS2 in Korso. All sampled cauliflower and broccoli accessions had the homozygous Korso genotypes for both SVs, while the Korso alleles were rare (4%) in cabbage (Fig. 4b). The expression of BoWUS2 was significantly up-regulated from vegetative to curd formation, with the highest expression at the curd formation stage (Fig. 4c), implying that these two SVs could play roles in the curd formation.
MP/ARF5 together with ANT and AIL play key roles in auxin-dependent organ initiation and phyllotactic patterning [45, 46]. Selected SVs in promoters and gene regions of their homologs in B. oleracea, BoMP2, BoMP3, BoANT, BoAIL5, BoAIL6, and BoAIL7, were identified (Additional file 11). A 23-bp insertion (SV_w_71238) in the promoter of BoMP2 in Korso is under strong selection in cauliflower (96.1% and 0% in cauliflower and cabbage accessions, respectively). An 11-bp insertion and a 23-bp deletion (SV_w_92482 and SV_w_92481) in the CDS and a 14-bp deletion (SV_w_92433) in the intron of BoMP3 in Korso are under strong selection in cauliflower (95.6%, 96.1%, and 96.2% in cauliflower and 12%, 14%, and 13.3% in cabbage accessions, respectively) (Fig. 4b). Same as BoWUS2, the highest expression of BoMP2 and BoMP3 was also observed at the curd formation stage (Fig. 4c).
Curd maintenance and floral arrest
Cauliflower curd is composed of thousands of inflorescence meristems with floral meristems arrested in development. A large substitution (SV_b_70950) (~ 11.4 kb in OX-heart and ~ 7.7 kb in Korso) in the promoter region of the floral meristem identity (FMI) gene BoCAL was identified under strong selection. Almost all cauliflower (99.0%) and the majority of broccoli (87.5%) accessions shared the Korso allele, while most cabbage accessions (79.2%) harbored the OX-heart allele at this locus (Fig. 4b), suggesting its potential role in curd formation.
Several other FMI genes including BoAP1.2, BoFUL1, BoFUL3, and BoSEP3 were also affected by selected SVs (Additional file 11), and all had relatively low expression at the vegetative, transition and curd formation stages, but significantly higher expression at the curd enlargement stage (Additional file 1: Figure S14). Studies in Arabidopsis suggest that an antagonistic interaction between the inflorescence meristem identity (IMI) gene TFL1 and FMI genes regulates the developmental fate transitions [47,48,49]. The nearly opposite expression pattern of BoTFL1.2 compared to that of FMI genes indicated its repression role (Additional file 1: Figure S14). While no selected SVs were identified in BoTFL1.2, a 13-bp deletion (SV_w_84836) in the intron of its positive regulator BoAGL14 [50] was found under strong selection (Fig. 4b), and BoAGL14 showed the same expression pattern as BoTFL1.2 (Additional file 1: Figure S14), suggesting potentially important roles of both BoTFL1.2 and BoAGL14 in floral identity arrest and inflorescence proliferation for curd formation and maintenance. SVP is a key negative regulator of floral transition [51, 52]. A 420-bp (SV_w_74120) insertion in the promoter of BoSVP1 was found in Korso and 98.1% of other cauliflower and all broccoli accessions, while only in 25.2% of the cabbage accessions (Fig. 4b). BoSVP1 was significantly up-regulated from vegetative to transition stage and kept high expression levels throughout the curd formation (Fig. 4c), indicating its repressor role in flower bud development, as reported in Arabidopsis [53]. A cauliflower curd-specific gene, BoCCE1, was reported to have a potential role in the control of meristem development/arrest [29, 54]. Here, we identified a 1505-bp insertion (SV_b_67089a) covering the entire BoCCE1 gene body in Korso. Genotyping of this insertion revealed that the BoCCE1 gene was present in most cauliflower accessions (97.1%), but absent in most cabbage (86.5%) and broccoli (78.1%) accessions, suggesting a possible role of BoCCE1 in floral arrest, as broccoli buds are arrested at later developmental stages compared to cauliflower buds (Fig. 4b, c).
Curd enlargement and spiral growth
Genes involved in organ size regulation, cell division and expansion, cell cycle, etc. can regulate the curd weight. CYP78A5 (KLU) has been identified in Arabidopsis to prevent proliferation arrest and promote organ growth [55, 56]. The high expression of cauliflower BoCYP78A5 was exclusively detected in curds (Additional file 11), especially at the curd formation and enlargement stages (Fig. 4c). A 2775-bp substitution (SV_b_76292) in the promoter of BoCYP78A5, present in 98.1% of cauliflower accessions while only in 8.2% of cabbage accessions (Fig. 4b), might contribute to the curd-specific expression of BoCYP78A5. BoARL2 (or CDAG1) has been proved to play a role in the promotion of cauliflower curd size [57]. BoARL2 was highly expressed in curd, with the highest expression during the curd enlargement phase of Korso (Fig. 4c). A 269-bp deletion (SV_w_38468) was detected in the promoter of BoARL2, and was present in all cauliflower and most broccoli (96.9%) accessions, while in only 41.7% of cabbage accessions (Fig. 4b).
The spiral arrangement of inflorescences is typical for cauliflower curds [27]. Transcription of the DRNL gene marks lateral organ founder cells in the peripheral zone of the inflorescence meristem [58]. We identified a 258-bp deletion (SV_w_30645) in the promoter of BoDRNL1, which was present in all cauliflower and broccoli accessions while only in 40.6% of cabbage accessions (Fig. 4b). BoDRNL1 was specifically expressed in the curd with the highest expression at curd formation and enlargement stages (Fig. 4c), implying its potential role in determining curd architecture. Selected SVs were also identified in the alpha-tubulin gene BoTUA2 and four BoTUA3 genes (Fig. 4b and Additional file 11), whose homolog in Arabidopsis causes helical growth [59].