- Research article
- Open Access
A genome-wide survey of copy number variations reveals an asymmetric evolution of duplicated genes in rice
BMC Biology volume 18, Article number: 73 (2020)
Copy number variations (CNVs) are an important type of structural variations in the genome that usually affect gene expression levels by gene dosage effect. Understanding CNVs as part of genome evolution may provide insights into the genetic basis of important agricultural traits and contribute to the crop breeding in the future. While available methods to detect CNVs utilizing next-generation sequencing technology have helped shed light on prevalence and effects of CNVs, the complexity of crop genomes poses a major challenge and requires development of additional tools.
Here, we generated genomic and transcriptomic data of 93 rice (Oryza sativa L.) accessions and developed a comprehensive pipeline to call CNVs in this large-scale dataset. We analyzed the correlation between CNVs and gene expression levels and found that approximately 13% of the identified genes showed a significant correlation between their expression levels and copy numbers. Further analysis showed that about 36% of duplicate pairs were involved in pseudogenetic events while only 5% of them showed functional differentiation. Moreover, the offspring copy mainly contributed to the expression levels and seemed more likely to become a pseudogene, whereas the parent copy tended to maintain the function of ancestral gene.
We provide a high-accuracy CNV dataset that will contribute to functional genomics studies and molecular breeding in rice. We also showed that gene dosage effect of CNVs in rice is not exponential or linear. Our work demonstrates that the evolution of duplicated genes is asymmetric in both expression levels and gene fates, shedding a new insight into the evolution of duplicated genes.
Natural variations are the basis of genetic diversity and genome evolution. The detection of natural variations and evaluation of their genetic effects are the keys to understand and interpret the formation of biological phenotypes. Natural variations generally include single nucleotide polymorphisms (SNPs), small InDels (no more than 50 bp), and structural variations (SVs). Copy number variations (CNVs), including deletion and duplication, typically ranged from 1 kb to several Mb , are important source of structural variations [2,3,4]. Many methods have been developed to detect CNV, such as fluorescence in situ hybridization (FISH), quantitative polymerase chain reaction (qPCR), and microarray. However, these methods are not suitable to detect CNVs in natural population, due to the low throughput or the low resolution and sensitivity. With the advantages of next-generation sequencing (NGS) technologies, new approaches and algorithms have been developed to detect novel CNVs in recent years [5, 6]. These methods are mainly based on the individual or combination of the following strategies: read-pair (RP), split read (SR), read depth (RD), de novo assembly (AS) [7,8,9]. The complexity of crop genomes and the structure and distribution of CNVs, make it a challenge to comprehensively and accurately detect CNVs among different germplasms of crop.
The CNVs occurred in the regulatory sequence region will change the gene expression of their flanking regions; nevertheless, the CNVs occurred in the gene region usually show the dosage effect on gene expression, thus affecting the biological phenotype. The dosage effect of CNVs was more obviously observed in human [10,11,12,13], and mice , as genome-wide analysis suggested that 85–95% of detected CNVs were associated with changes in gene expression [10, 14]. However, very few genome-wide analyses of CNVs [15,16,17,18,19,20] and only a few examples of CNVs contributing to phenotypic variation [21,22,23,24,25,26,27] have been reported in crops, but these works were mainly focused on the biological function of a single CNV. Therefore, a large-scale CNV data set with high accuracy will be beneficial to understand the dynamic of genome evolution, provide an insight into the genetic basis of important agricultural traits, and contribute to the crop breeding in the future.
Here, we reported a large-scale analysis of the correlation between CNVs and gene expression levels and revealed CNV’s contribution to genetic diversity of germplasms in rice. We generated genomic and transcriptomic data of 93 accessions of rice and developed a new pipeline, which could comprehensively detect genome-wide CNVs with high accuracy. The correlation analysis between gene copy number and expression level found that approximately 13.1% of genes showed significant correlations. Moreover, the analysis of the expression levels and evolutionary fates of different copies revealed an asymmetric evolution of duplicated genes.
Detection of copy number variations in 93 rice accessions
A total of 93 rice accessions including representative landraces and modern cultivars (Additional file 1: Table S1, Fig. 1a, b) were selected for whole-genome resequencing with average depths about 50× and generated a total of 2.06 Tb of raw reads. Using the Nipponbare RefSeq  (version 7.0) as reference, the coverage of these accessions’ resequencing data ranged from 82.81% to 96.06%. The rice root samples grown in hydroponic culture for 35 days were collected for RNA-Seq. The data volume of each sample was above 5 Gb (ranged from 5.03 to 9.86 Gb) and 576 Gb raw RNA-seq data were generated from the 93 accessions in total. The rate of uniquely mapped reads ranged from 79.64% to 90.95% (Additional file 1: Table S1).
To call CNVs in genomes, we integrated CNVnator  as well as Delly  and developed a new algorithm (named as CtgRef-CNV), which combined read depth and de novo assembly methods (Additional file 2: Figure S1). We firstly assembled the genome of each accession by CtgRef-CNV and then mapped the NGS reads from each accession to its own assembled genome to obtain the depth data. Based on these depth data and alignment results (copy number of each alignment block) between the query genome and the reference genome, we calculated the depth of each fragment of the reference genome, which were used to call CNVs. We then determined the boundary of a CNV using a similar strategy of CNVnator  (see “Methods”). By mapping reads to its own genome, the CtgRef-CNV reduces the potential mapping bias due to the difference between a query genome and the reference genome. In addition, the CtgRef-CNV uses the transformed depth data to call CNVs, which could avoid the disadvantages of the assembly-based method in detecting multi-copy regions with few sequence differences. Then we filtered the CNVs detected by these three methods with strict standards to obtain final CNV data set of each sample. In our filter standards, the read depth, split reads, and discordant read pairs were taken into account. For duplications, the percentage of high-depth areas should be no less than 50%. For deletions, the coverage should be no more than 50%, and the number of the split reads/discordant read pairs within the upstream and downstream 500 bp of the two breakpoints should be no less than 5 (For more details see “Methods”).
The de novo assembled genome sizes of 93 accessions ranged from 317.1 to 406.8 Mb, with contig N50 sizes of 85 accessions larger than 10 kb. The BUSCO results showed the genome completeness of 89 accessions was higher than 90% (Additional file 1: Table S1). To uncover the CNVs landscape among different rice accessions, we called CNVs from these samples using CNVnator, Delly, and our CtgRef-CNV. Compared with the Nipponbare reference genome, the CNV number in the genome of each accession varied from 2000 to 8000 (Fig. 1b), which showed there were significant genomic differences between the rice subspecies. The number of CNVs in Oryza sativa indica (Xian) group  was significantly higher than that in O. sativa japonica (Geng) group  (Fig. 1a, b, and Additional file 2: Figure S3a, P value < 2.2 × 10− 16), suggesting that the number of variations within subspecies was less than that within subspecies in terms of CNV, which is consistent with the SV results in the Zhang et al.  and Zhou et al. . For each 500-kb non-overlap window in the Nipponbare reference genome, more than 7 CNVs on average were detected (Additional file 2: Figure S2), and the distribution of CNVs on either different chromosomes or different chromosome regions was not even (Additional file 2: Figure S2, and Additional file 2: Figure S3c). We detected a total of 32,051 CNVs (Additional file 3: Table S2) from these 93 accessions, of which 120 CNVs were larger than 100 kb. The length showed a typical L-shaped distribution (Additional file 2: Figure S3b).
The validation of CNVs
Several CNVs had been found to be associated with variations of important agronomic traits in rice. For example, we detected duplications harbored GL7 (LOC_Os07g41200)  in seven accessions, and duplication events occurred at the promoter of IPA1 (LOC_Os08g39890)  in two accessions, respectively (Additional file 4: Table S3). All these duplication events were confirmed by PCR experiments (Fig. 1c–f). To further evaluate the validity of our CNV results, 10 random CNV loci were verified in 15 accessions by qPCR (see “Methods”). According to the experimental results, the accuracy of our CNVs was nearly 95% (Additional file 2: Figure S4 and Additional file 4: Table S4). These results indicated that our method was of high accuracy. Moreover, in the phylogenetic tree analysis and principal component analysis (PCA) based on CNVs, the two subspecies are both essentially separated, which are in accordance with the phylogenetic result of SNPs (Additional file 2: Figure S5). In addition, we found that the genes with an extreme high copy number (no less than 10) in more than 5 accessions were common multi-copy genes and most (92.12%) of them showed no less than 10 types of copy number and also most (84.24%) genes did not show obvious population differentiation (VST ≤ 0.2, Additional file 4: Table S5). Except for the genes of unknown function, these multi-copy genes mainly (64.81%, Additional file 4: Table S5) consisted of coding genes of subunits of ribosomal proteins, ATP synthases, cytochromes, and some components of the transcription initiation complex and photosystems. These results confirmed that our approach could comprehensively detect CNVs with high sensitivity.
The power evaluation of CtgRef-CNV pipeline
To validate the accuracy difference among CtgRef-CNV, CNVnator, and Delly, three copy number matrices of the 10 random CNV loci in the 15 accessions (Additional file 4: Table S6) were compared with the qPCR results (Additional file 4: Table S4) separately. According to the copy numbers verified by qPCR, we genotyped these 150 loci into DEL (deletion), DUP (duplication), and CN1 (normal type) to calculate the accuracy of each type of loci in the results of three software (Additional file 4: Table S6, see “Methods”). We found that the DUP accuracy of CNVnator was the highest (93.75%), while Delly had the highest accuracy in DEL detection (96.88%), and the accuracies of DUP (62.50%) and DEL (76.56%) of CtgRef-CNV were both between those of the other two software (Additional file 4: Table S6, and Table S7). In the detection of CN1 loci, there was little difference in the accuracies of the three software (Additional file 4: Table S6, and Table S7). Then, combining with the percentages of the three types of loci in these 15 accessions, we calculated the weighted accuracies of the three software (see “Methods”). And we found that our CtgRef-CNV had a higher weighted accuracy (88.93%) than Delly (86.26%) or CNVnator (84.84%, Additional file 4: Table S7), which indicates that the CtgRef-CNV is recommendable when using NGS data to call CNV.
We also assessed the difference of CNV results identified by CtgRef-CNV, CNVnator, and Delly. To analyze the overlap of high accuracy CNVs between three software, the filtered results of 15 accessions (Additional file 4: Table S8) by our rigid standards were selected as test data. The identification of the same CNV was referred to the method in the Wang et al. study . We found that 10.16~18.97% (totaling 4939) of the CNVs generated by CNVnator, and 22.88~56.33% (totaling 7789) of CNVs generated by Delly, were overlapped with those identified by CtgRef-CNV (Additional file 4: Table S8). And the number of overlapped CNVs between CNVnator and Delly was much higher (22,462; Additional file 4: Table S8), which may be due to that these two algorithms were built on the results of reference-based read mapping. In addition, we also selected two CNV sets without preference from 15 accessions, and screened their accuracies by the IGV software . The first set was the CNVs detected by CtgRef-CNV but not detected by CNVnator, among which 90.67% were also not detected by Delly, and the accuracy was 83.33% (Additional file 4: Table S9). The other set were the CNVs detected by CNVnator but not detected by CtgRef-CNV, among which 68% were also detected by Delly, and its accuracy was 86% (Additional file 4: Table S10). Moreover, the qPCR verification rate of integrative results was nearly 95% (mentioned above; Additional file 2: Figure S4 and Additional file 4: Table S4), which was much higher than that of each software (84.84%, 86.26%, or 88.93%; Additional file 4: Table S7). These results suggested that the integration of multiple algorithms is important for improving the CNV calling results based on NGS data. So, we used the integrative result of the three software as our final CNV set of each rice accession.
The comparisons with published CNVs in rice
The accurate calling of CNVs is very important in genomics area, and many CNV data sets have been published in rice [31, 35,36,37,38]. In total, 641 CNVs were detected between Guangluai-4 and Nipponbare by CGH array . We found that 302 out of their 641 CNVs were also detected by this study (Additional file 4: Table S11). Moreover, our results detected more than 7000 CNVs between Guangluai-4 and Nipponbare, which were not found in previous work . These results indicated our method using NGS data could detect CNVs more comprehensively than the previously used array-based comparative genomic hybridization (CGH) technology. In 2017, a near complete reference genome of indica rice variety Shuhui-498 was assembled and 9909 presence variations (PVs, ≥ 500 bp) in the Nipponbare were identified, compared to the other 17 assembled rice genomes . And 65.34% (6475) of these PVs was overlapped with 49.35% (6833/13,847) of our core DELs (Additional file 4: Table S12). There were 25,380 and 5813 genes identified as “Core” and “Dispensable” genes, respectively, in a pan-genome analysis (Additional file 2: Figure S6) . Using the same criteria, most of the “Core” (23,941) genes were overlapped with the core genes identified in our work (Additional file 2: Figure S6). In addition, we extracted 9632 CNVs no shorter than 1 kb from the genomic variations of 3010 rice accessions  and found that 84.94% (8181) of them were overlapped with our core CNVs (Additional file 4: Table S13). Recently, the SVs of 3000 rice genomes were also analyzed by another group  and 183,943 CNVs (≥ 1 kb) were chosen from their SV results. By the method described in the Wang et al. , totally 52,883 core CNVs were generated. We found that 60.26% (31,865) of those core CNVs were overlapped with our core CNV set (Additional file 4: Table S14). All these results suggested that our CNV set is a valuable supplement to rice genome variation data set.
The impact of CNVs on gene expression
One of the main effects of CNVs is to cause the alternation of gene expression levels [21, 23, 24, 39], by disrupting the gene, and affecting regulatory regions [10, 22, 25, 40, 41]. In this paper, considering the impact of short-read sequencing biases on the boundary identification of CNVs, only the genes, the coverage of which regions were more than 50% by CNVs, were used for correlation analysis. If a gene showed the same copy number in different accessions, the expression levels (TPMs) of the gene in the corresponding accessions would be grouped together. A copy number matrix of 14,435 genes in the 93 accessions was generated and only 2642 genes were selected for the correlation analysis between expression level and copy number by our strict standards (see “Methods”). A significantly positive correlation means that the expression level increases with the increase of copy number, while significantly negative correlation is that the expression level decreases with the increase of copy number (an adjusted P value < 0.05, see “Methods”).
Surprisingly, 82.32% of analyzed genes (Additional file 5: Table S15, and Additional file 6: Table S18) showed no significant correlations between the expression level and copy number and 13.17% of genes were significantly positive correlation (Additional file 5: Table S16 and Additional file 6: Table S18). Moreover, we found that 4.50% of genes showed negative correlation (Additional file 5: Table S17 and Additional file 6: Table S18). All the correlation results were further confirmed by the results of dosage effect analysis (Fig. 2a–c). For the GL7 locus, a significant correlation was detected (Fig. 2d), which is consistent with previous work . Analysis of variance revealed significant differences in expression levels among different copy numbers of approximately 75% of the correlated genes (P value < 0.05, Additional file 6: Table S18).
For positively correlated genes, the rate of increase in the expression level of duplicated genes mainly (account for > 81%) varied from −45% to 185% (Fig. 2e), by making a comparison between the expression level of genes and normal copy number (CN = 1). For each additional gene copy, the rate of the increase (comparing with CN1) in expression level also mainly (account for > 87%) varied from −45% to 185% (Fig. 2e) and this rate decreased as the copy number increased (Additional file 6: Table S19). While on the whole genome, the effect of duplication (not polyploidization) on gene expression was bidirectional, and the growth rate showed an approximately symmetric distribution on both sides of the vertical axis (Additional file 2: Figure S7), which reflected the robustness of gene expression regulation in vivo. In addition, we also found that the effect of tandem duplications on expression was generally stronger than that of dispersed ones, except when the copy number was 5 (Fig. 2f). These results indicated that the gene dosage effect is not exponential or linear, which is consistent with a previous report .
The fates of duplicated genes
The main evolution consequences of duplicated genes are pseudogenization, neofunctionalization (Neo-), subfunctionalization (Sub-), and undifferentiating (Non-) [43,44,45,46,47,48]. The duplicate pairs, both can be assembled from NGS reads, were selected for further analysis. And in total, we identified 8163 gene pairs from assembled contigs of 93 accessions. By using the method of protein domain identification, we found that approximately 5.39% (440/8163) of duplicate pairs experienced the evolutionary divergence of gene function (Neo-/Sub-) and that approximately 36.46% (2976/8163) and 58.15% (4747/8163) of duplicate pairs experienced pseudogenization and undifferentiation, respectively (Additional file 6: Table S20). In our analysis, six stages (including the “Recent,” Fig. 3a) were used to determinate the age of gene duplication events, separated by “Ks (synonymous nucleotide substitution rate) > 0” and four another Ks values corresponding to the four speciation events during the divergence of the Oryza genera, according to previous study . The Ks distribution of the non-pseudogenetic pairs (5338) implied that duplication events occurred throughout the six stages, especially in the recent past (more than 35%, Fig. 3a, Table 1). In each stage, the number of undifferentiating duplicate pairs was absolutely superior (more than 80%, except for 3/5 in the stage II, Table 1). Among the duplicate pairs of the stage I, eight duplicates subfunctionalized, indicating that functional differentiation can also be achieved in a short time. Interestingly, in the most recent two periods (stage I and II), all the functional differentiation pairs were subfunctionalization, but no neofunctionalization was observed. However, in other stages, the proportion of neofunctionalization pairs showed a growing trend with the aging of duplication (Table 1), confirming that subfunctionalization is an intermediate state of neofunctionalization .
To analysis the relationship between selective constraint and functional differentiation of duplicated genes, the Ka (nonsynonymous nucleotide substitution rate), Ks, and Ka/Ks ratio of different groups (Neo-, Sub-, Non-, Gene-Ψ, and Ψ-Ψ) were calculated using the Nei-Gojobori  method. The values of the duplicate pairs involved in pseudogene (Gene-Ψ, and Ψ-Ψ) were extremely higher than those of other duplicate pairs (all the P values were < 2.2 × 10− 16, Fig. 3c–e). 84.18% (2895/3439) of Ka/Ks values of functional duplicate pairs (Neo-, Sub-, and Non-) were less than 1, among which the undifferentiated pairs (Non-) were significantly higher than the differentiated pairs (Neo- and Sub-, P values: 7.292 × 10− 9 and < 2.2 × 10− 16, Fig. 3e), indicating that they were subject to different degrees of selection constraints, strengthened after functional differentiation. Compared with the neofunctionalizated pairs, the subfunctionalizated pairs accumulated significantly fewer nonsynonymous mutations (P value = 6.06 × 10− 4, Fig. 3c) in a slightly longer time (no significance, P value = 0.05929), so selection constraints on the subfunctionalizated pairs were significantly stronger (P value = 8.658 × 10− 4, Fig. 3e), and their sequences were more conservative.
The asymmetric evolution of duplicated genes
Identifying the parent/offspring copies from gene pairs offers the opportunity to characterize the divergence of duplicated genes. A method based on conserved collinearity blocks in population (see “Methods”) was performed, and produced a total of 3129 distinguished duplicate pairs. The duplicated genes with more than two copies (18 groups) were discarded. Our results showed that the proportion of pseudogene copies in the offspring copy (43.30%) was much higher than that in the parent copy (15.56%, Fig. 3i). And correspondingly, in pseudogenes, the proportion of offspring copies (38.22%) was much higher than that of parent copies (13.73%, Fig. 3b). In addition, 77.92% (2424/3111) of the parent copies failed to change their function (Non- in the Fig. 3i). Thus, the parent copies mainly maintained the function of the original genes, especially when the offspring copies became pseudogenes, the percentage of the parent copies kept the original function intact was further increased (89.92%, Additional file 2: Figure S8). So, our results confirmed that the offspring copy is more likely to become pseudogene in rice and duplicated genes are asymmetrical in their evolutionary fates.
The copy-specific variations (CSVs), same to the singly unique nucleotides (SUNs) , were used to assign gene expression level. So the duplicate pairs without sequence divergence were excluded from our copy expression analysis. We totally obtained 548 duplicate pairs, whose copy-specific expression level could be split in the root RNA-Seq data from root tissue (Additional file 6: Table S21). Our results showed that the expression level of the major expressing copy was significantly higher than that of the minor one (Fig. 3f, P value < 1.0 × 10− 7), and most (97.57%) of these major-minor-expressed duplicate pairs could be traced back to the stages before the divergence of Asian rice (> 0.55 mya, Ks > 0.0072, Additional file 2: Figure S9). More interestingly, we found that the expression levels of offspring copies were also significantly higher expressed than that of the parent copies (Fig. 3g, P value = 4.44 × 10− 5). In the other way, the percentage of major copy in the offspring copies (64.84%) was higher than that in parent copies (35.16%, Fig. 3h). Hence, our results supported that the offspring copy is more likely to primarily express after the duplication event and duplicated genes are also asymmetrical in their expression levels. In conclusion, the asymmetric evolution of duplicated genes in rice is reflected in both the evolutionary fates and the expression levels.
Copy number variations reflect the evolution of genes and genomes, leading to local or population-specific adaptations to different environments and enriching population diversity. In recent years, several CNVs have been reported that they have contribution to the formation of the phenotypic diversity of important agronomic traits in rice, such as the duplication of GL7 , deletions in qSW5/GSE5 , duplication of the promoter of IPA1 , and deletion of sh1 . However, all these studies focused on CNVs of individual locus. In order to uncover the relationship between CNVs and domestication on a genome-wide scale, we developed methods that could detect CNVs with high accuracy. Based on NGS data, the available methods for detecting CNVs have different disadvantages [7,8,9]. The read-pair (RP) methods are less effective in the low-complexity regions with repeat and are less accurate in detecting the actual copy number. The read depth (RD) methods fail to identify the precise breakpoints of CNVs. The split read (SR) methods rely on the read length and have low sensitivity in the low-complexity regions of the reference genome. The assembly-based (AS) methods consume a lot of computing resources and are difficult to accurately identify multiple copies with similar sequences. For CNV detection, a common strategy is to combine different methods. Here, we developed a pipeline integrating CNVnator (RD) , Delly (RP and SR) , and CtgRef-CNV (AS and RD), which add the advantages and avoid disadvantages of different methods. The shortcoming of our approach is that the filtering standards are strict and some real CNVs lacking reliable evidence (e.g., split reads, paired-end reads) to support may be discarded. Therefore, considering the high accuracy of results in rice, we believe that our method is powerful for the CNV analysis of other diploid species based on NGS data.
A genome-wide detection of CNVs had been performed previously in rice by CGH technology. They detected 641 CNVs between Guangliai-4 and Nipponbare . Using deep NGS data, the first pan-genome of cultivated and wild rice was constructed, and a total of 10,872 presence-absence variations (PAVs) of genes in 67 accessions were also provided . Then, in the 3000 Rice Genomes Project, 93,683 structural variations (SVs, containing 22,427 CNVs) and 48,098 gene PAVs were called among 453 high-coverage rice accessions . In this study, we provided a high reliability CNVs set of 93 representative rice accessions, among which many genes had a gradient in copy number of the population. Therefore, our results may be beneficial to identify candidate genes regulating important agronomic traits. Our work enriches the understanding of the genetic basis of the formation and domestication of rice important phenotypes and provides insights into breeding of elite rice varieties.
Our large-scale analyses on the dosage effect of CNVs and gene expression in rice revealed that dosage effect was not exponential or linear and that the effect of tandem duplications on expression was generally stronger than that of sporadic ones. However, to our surprise 82.32% of analyzed genes (2175 genes) show no significant correlation between copy number and expression level and 4.5% of them (119 genes) showed negative correlation. There may be several possibilities for these results. First, copy number may not a dominant factor in affecting gene expression . Second, instead of increasing their copy numbers, the alternative splicing of genes could be a more effective approach to adaption to abiotic stress in rice . Third, the promoter or enhancer pairs of some duplicated genes may be differentiated, leading to differentiation of expression patterns [56,57,58], which can also weaken the effect of copy number on expression level. Moreover, the expressions of some genes are regulated by a negative feedback mechanism ; thus, the transcription levels could not always increase significantly with the increase of copy number. In addition, many trans-eQTLs (distance > 1 Mb, or on different chromosomes) were reported to be the predominant source of expression variation and contribute ~ 2-fold more to gene expression variance than local eQTLs [60,61,62]. So the trans-eQTL is also an important factor that influences expression level of genes, and our resequencing and transcriptomic data will provide great help for the identification of trans-eQTLs and local eQTLs in rice.
Gene duplication is an important source of the origin of novel genes. It is widely believed that a novel gene is to become a pseudogene due to the accumulation of inactivating variations, but there are few large-scale analyses to reveal the fate of the two copies produced by gene duplication. Results from the Drosophila showed that neofunctionalization mainly occurred in the offspring copy, while the parent copy tended to retain the original function . Our results support that parent copies tended to retain their original function, but the proportion of neofunctionalization in the parent copy or in the offspring copy showed no significant difference. More importantly, our results suggest that the offspring copies were more likely to be pseudogenized, thus exhibiting a different evolutionary fate from the parent copies. In addition, our work illustrates whether the evolution of duplicated genes is symmetrical or asymmetrical. Early works indicated that duplicated genes were found to evolve symmetrically, based on the comparison on the evolution rate in 39 organisms . Latter, a wide range of cross-species or genome-wide evidence supported that the evolution of duplicated genes were asymmetrical. Their conclusions were based on the difference between two copies of duplicated genes on selection pressure , rate of evolution [65,66,67,68,69,70,71], and expression patterns [58, 65, 67, 68, 72,73,74,75]. Here, we detected the difference on the expression level and evolutionary fate between parent and offspring copies and proposed an asymmetric evolutionary model for the fate and expression of duplicated genes (Fig. 4). It is the offspring copy that tends to be more highly expressed and more likely to become pseudogene, which also reflects the difference in selection constraints between parent and offspring copies. The functional differentiation of duplicated genes provides an opportunity for the formation of new traits. Thus, to connect the evolved new functions of duplicated genes with certain traits in given landraces and cultivars would shed new light on the molecular design in crop breeding.
Many important traits were reported to be affected by CNV. Herein, we developed a comprehensive pipeline and strict standards to call CNVs and provided a high accuracy CNV set for rice functional genomics research and molecular breeding application. Combined with transcriptome results, approximately 13% of the genes showed a significant correlation between their expression levels and copy numbers and gene dosage effect in rice was not exponential or linear. Based on the analyses on the evolutionary fates and expression levels of duplicated genes, we confirmed that maintenance of ancestral gene function and pseudogenization were the two main evolutionary fates and provided a new perspective for asymmetric evolution: the offspring copy mainly contributed to the expression levels and seemed more likely to become a pseudogene, whereas the parent copy tended to maintain the function of ancestral gene.
Plant growth, DNA, and RNA sequencing
The seeds of 93 rice accessions were grown according to Sun et al. , with some modifications: After germinating for 2 days at 28 °C, the seeds with consistent shoot were selected to place into bottom-cut PCR plate and grown for 7 days with nutrient solution varying from one quarter to half strength. The 7-day-old seedlings of uniform size and vigor were transplanted into holes in a cystosepiment placing over the top of the pots. The plants were grown for 3 weeks with half strength nutrient solution which was replaced with fresh solution every 3 days. The full chemical composition of the International Rice Research Institute (IRRI) nutrient solution contained 1.25 mM NH4NO3, 0.3 mM KH2PO4, 0.35 mM K2SO4, 1.0 mM CaCl2, 1.0 mM MgSO4·7H2O, 0.5 mM Na2SiO3, 20.0 μM Fe-EDTA, 9.0 μM MnCl2, 0.39 μM (NH4)6Mo7O24, 20.0 μM H3BO3, 0.77 μM ZnSO4, and 0.32 μM CuSO4 (pH 5.4–5.6). Plants were grown in a green house, with 70% relative humidity under fluorescence while light (150–200 μM m-2S-1) at 16-h light/8-h dark temperatures of 30/28 °C photoperiod.
For 93 rice accession genome resequencing, high-purity genomic DNA was extracted using QIAamp DNA Mini Kit (Qiagen). For genome resequencing, the DNA qualification, library construction, and resequencing was executed at Novogene (Beijing). High-throughput DNA Sequencing Pair-end sequencing was performed on Illumina® NovaSeq platform, with the read length of 150 bp at each end and average of 20 Gb sequencing data for each library.
The total RNA was extracted using TRIzol® Reagent (Invitrogen, Lot: 180702). All roots of the five seedlings of each accession were sampled for total RNA extraction. For rice transcriptome sequencing, the RNA qualification, library preparation, and RNA-Seq was executed at Berry Genomics Corporation (Beijing). After clustering of the index-coded samples performing on a cBot Cluster Generation System through Illumina Cluster Kit according to the manufacturer’s instruction, the library preparations were sequenced on an Illumina® Hiseq2500 platform and 125-bp paired-end reads were generated. The transcriptomic sequencing data of each accession was an average of 5 Gb.
RNA-Seq data analysis
The Trimmomatic  package (version 0.32) was used to obtain clean reads. The adapters, Ns, and low-quality bases were removed, and the trimmed reads with a length less than 36 bp were also dropped. The Nipponbare RefSeq  (version 7.0) was used as the reference. All the clean reads were mapped to the reference using Tophat2  (version 2.1.1) with default parameters. Next, reads with more than one reported alignments were excluded and TPM (transcripts per million) values were calculated with TPMCalculator  (version 0.0.3). Genes with TPM ≥ 0.5 were regarded as expressed genes.
Resequencing reads filtering, mapping, and SNP calling and filtering
The Nipponbare RefSeq and its annotation were downloaded from MSU Rice Genome Annotation Project . Reads containing adaptor sequences and low-quality reads were removed using cutadapt  (version 1.5) and SolexaQA  (version 3.1.3) according to the following criteria: (i) the Ns percent of one end > 5%, (ii) average quality score Q < 20, (iii) length < 75 bp after trimming. The filtered reads were mapped to the Nipponbare RefSeq (version 7.0) using BWA-MEM  (version 0.7.15-r1140). The mapped reads were sorted, and PCR duplicates were removed by SAMtools  (version 1.5). The mapped reads were also removed with a mapping quality < 30. Variants were called by SAMtools and BCFtools  (version 1.6). The raw SNPs were filtered using the following criteria: (i) the QUAL ≥ 100; (ii) raw read depth varies from 1000 to 50,000; (iii) MAF ≥ 5%; and (iv) biallelic sites.
Copy number variations calling
Copy number variations were identified using our comprehensive pipeline (as previously mentioned), which consists of CNVnator  (version 0.3.4), Delly  (version 0.7.3; just for deletion calling), and CtgRef-CNV. Only CNVs with length no less than 1000 bp were selected to perform subsequent analyses.
For CNVnator, each accession was tested by several bin sizes from 100 to 1000 bp to make sure the ratio of average of RD (read depth) signal to standard deviation is between 4 and 5. For Delly package, only its DEL results were used in this project.
For the CtgRef-CNV pipeline, data was processed as follows:
De novo assembly of each genome. The contigs were assembled using SOAPdenovo2  package (version 2.04). The gaps in the draft assembly results were filled by the GapCloser  (version 1.12). The N50 length of the assemblies was evaluated after the small contigs of < 200 bp were excluded. The completeness of assembled genomes was evaluated by the BUSCO , using the “liliopsida_odb10” dataset as the reference.
Read mapping to its contigs. The clean reads were mapped to the contigs ≥ 1000 bp using the same package as mentioned above. The depth data was calculated through a non-overlapping sliding window method (the window size was 250 bp in our study) and corrected using the same method in the previous study .
Contig-reference alignment blocks. The contigs in (ii) were aligned with the Nipponbare RefSeq (version 7.0) by means of the nucmer program in the package MUMmer (version 3.23)  with the parameter “--maxmatch -c 90 -l 40.” The short alignment contig fragment contained in another one was filtered when its length rate (short/long) was less than 0.8. For two overlapping alignment contig fragments, (1) both were retained if the overlapping rate (overlap/longer) was no less than 0.7; otherwise, (2) the shorter alignment was removed if the length rate (short/long) was < 0.2.
Depth data of the reference. Based on the contig-reference alignment blocks, we transformed the contig depth into the reference depth result. If one contig was aligned to multiple regions in the reference, the mapped read depth of it would be subdivided equally to each aligned regions in the reference. On the other hand, for one region in the reference, its depth was accumulated by the subdivided depth from mapped contigs. Then the average depth of non-overlapping bins and chromosomes were calculated.
CNV calling. The bins with CN_index (Average_depthbin /Average_depthChr) < 0.5 or ≥ 1.5 were selected for the candidates of deletion and duplication, respectively. The adjacent bins were linked to intervals if the CNV types were the same. Then the intervals were extended through the same approach as described in CNVnator algorithm. These extended intervals were candidated CNVs to be filtered.
Filtering and integrating CNVs
The raw CNV results were filtered depending on calling method. For CNVnator CNV calls, regions with a copy number estimate between 0.8 and 1.4 were removed firstly . A credible deletion should meet the following three conditions: (i) at least 5 discordant read pairs (with an insert size close to the size of the interval) or split reads supported within 500 bp upstream and downstream of the breakpoint; (ii) the coverage and (iii) the dp10_cvrg (coverage of sites with a depth ≥ 10) to be no more than 50%. For duplications, they should ensure as follows: (i) the coverage and (ii) the dp10_cvrg to be no less than 80%, and (iii) the dup_cvrg (coverage of sites with a CN_index ≥ 1.4) to be no less than 50%. For Delly (DEL) results, the copy number estimate should be less than 0.5, and other restrictions were the same as above. When filtering the results of CtgRef-CNV, the restrictions were a little stricter. The CNV calls with a copy number estimate between 0.4 and 1.7 were removed. In addition to the above conditions, the algn_cvrg (coverage of the alignment blocks) of a deletion should be < 50%, and ≥ 90% for a duplication. The filtered CNV calls were firstly merged by accession and integrated by the reference to the standard of same SVs reported by Wang et al. .
The gene annotation of CNVs
The transposon genes are firstly filtered from the gff3 file in the MSU Rice Genome Annotation Project (release 7) . And a candidate CNV gene was defined if no less than 50% of its gene body was covered by a CNV. In order to improve the accuracy of gene annotation, the gene with copy number of 0 was recalled to normal type (CN = 1) if the read coverage was ≥ 50% and the CN_index was ≥ 0.5. All the copy number matrices of accessions were merged into the population matrix.
The qPCR verification of CNV results
The qPCR analysis was used to identify the relative copy number. Ten loci were randomly selected at 10 chromosomes among 15 accessions (Additional file 4: Table S5). The primers (Additional file 6: Table S22) were designed using Primer3  (version 0.4.0) website. The OsACTIN2 (LOC_Os05g36280) was used as inner reference gene. For each sample, the same amount of genomic DNA was used as template, according to the protocol of the ChamQ™ SYBR® Color qPCR Master Mix (Q441-02, Vazyme Biotech Co., Ltd., Nanjing, China), using the ABI StepOnePlus Real-Time PCR System, with three technical replicates. Amplification reactions were initiated with a denaturing step (95 °C for 10 min), followed by 40 cycles of denaturing (95 °C for 15 s), annealing, and extension (60 °C for 35 s). Data were analyzed by 2−(∆∆Ct) method  to obtain relative copy number.
The verification of the reported tandem duplication of GL7 and the promoter of IPA1
The PCR analysis was performed using specific primers of duplication detection of GL7  and the promoter of IPA1  (Additional file 6: Table S22) on a T100™ Thermal Cycler (Bio-Rad) according to the manufacturer’s instructions. Briefly, in a 30 μL reaction system including 2 μL diluted DNA, 1.5 μL primers (10 μM/L), 4 μL dNTP (2.5 mM/L), 3 μL 10 × PCR buffer (Mg2+ plus), and 0.3 μL rTaq (5 U/μL, TaKaRa, R001B), amplification reactions were initiated with a denaturing step (98 °C for 2 min), followed by 30 cycles of denaturing (98 °C for 10 s), annealing (55 °C for 30 s), and extension (72 °C for 1 min). After PCR amplification, the products were detected by 1% agarose gel.
The phylogenetic analysis and PCA
The filtered SNPs were used to calculate nucleotide diversity (π) and F-statistics (FST), then the SNPs in the sliding windows with a FST ≤ 0.4 and a π no less than 0.1 × π_aver (the genome-wide average π) were used for phylogenetic analysis. The matrix of CNVs was used for genotyping by splitting duplications and deletions. If a duplication occurred in an accession, we assigned it to 1; otherwise, it was assigned to 0. It was the same for deletions. All the genotype values of CNVs were merged to construct the Neighbor-Joining tree by the APE package  (version 5.2) in R. The topological robustness was assessed by bootstrap analysis with 1000 replicates. For the PCA, CNVs with a copy number type > 3 or MAF < 0.03 were firstly removed. The copy number matrix was transformed into the plink format using our custom Perl script. Then the PCA was performed using the PLINK  (version 1.9).
The calculation of weighted accuracy
In this study, we used the weighted accuracies to estimate the comprehensive accuracies of CNVnator, Delly, and CtgRef-CNV. And it was determined as follows: Rw= ∑Pt ∗ Rt, where Pt and Rt are the percentages and accuracies of the DUP (CN ≥ 2), DEL (CN = 0), and CN1 types, respectively. The percentage of each type of loci was computed based on the copy number matrix of the 15 accessions. And the accuracy depended on the called and qPCR results. For the DEL, and CN1 loci, only if the CNcall (copy number called by software) of a locus in an accession was equal to the CNqPCR (copy number verified by qPCR), we treated it correct. And for the DUP locus, if the CNcall was 0, or 1, we said it was completely wrong; however, if the CNcall was no less than 2, we would treat it partially (not equal to the CNqPCR) or completely (equal to the CNqPCR) correct and the number of correct DUPs would be accumulated by CNMin/CNMaj, where the CNMin and CNMaj were minor and major values of the array: [CNcall, CNqPCR], respectively. In particular, if the CNcall of a DUP locus was the same as its CNqPCR, the number of correct DUPs would also be accumulated by 1.
The correlation analysis
The CNVs were annotated genes sample by sample, and then all the gene copy number matrix were integrated to the total matrix of population. Only the genes impacted by CNV (minimum 50% gene model overlap) and with a change fold (max/min) no less than 1.1 were selected to carry out the correlation analysis between copy number and gene expression level. The TPM outliers (out of the range of μ ± 3σ) and its copy number were filtered. Then the copy number less than three replicates was removed. If the remaining copy number was of only two types, and the smaller copy number was equal to 0, these genes were also discarded before the correlation analysis. To minimize the impact of population structure, the copy numbers of genes filtered by the above standards were used to calculate the VST values  and only the genes with VST values no more than 0.4 were performed for the next analyses. A significance level of 0.05 was set for the corrected t-test P values (Benjamini–Hochberg method ).
The copy-specific expression
The duplicate pairs, of which the copy number is in agreement with the number of duplicates in the assembly contigs, were selected. The pseudogenes and the coding sequences (CDSs) were predicted using GeneWise  (version 2.4.1). Then the paired CDSs were aligned by ClustalW2  (version 2.0.12) to identify copy-specific variations (CSVs), similar to the method described . The CSVs were used to count copy-specific reads in the RNA-Seq data to calculate the expression level of each copy. Within each pair, we classified the copy with higher expression level as “major” copy and its partner as the “minor” copy. The expression difference between “major” copy and “minor” copy was defined as the expression difference of the corresponding duplicate pair. If the expression level of the major copy was more than two times as that of minor one, we said that there was a phenomenon of copy dominance expression in this pair.
The Ka, Ks, and the divergence time calculations
The CDSs of the non-pseudogenic duplicate pairs used in copy-specific expression analysis were translated into proteins. The number of nonsynonymous substitutions per nonsynonymous site (Ka) and the number of synonymous substitutions per synonymous site (Ks) were calculated by the yn00 program in the PAML4 packages  using the Nei-Gojobori  method. The alignment involved in pseudogene was performed by MASCE  (version 2.03) software. The time since divergence (T) of duplicates was calculated as T = Ks/2μ, where μ corresponds to the absolute substitution per synonymous site per year, and here, we use substitution rate of the grass Adh sequences (6.5 × 10− 9) .
The prediction of the parent and offspring copies
Based on the alignment position of each duplicate pair in the Nipponbare genome, four adjacent genes (two upstream and two downstream) were aligned with the assembled genomes of accessions with the normal copy number. Given the short length of assembled contigs using the NGS reads, no less than two adjacent genes were assembled into the same contig with the target gene which was the effective evidence that it was the same colinearity near the target gene between the accession and the Nipponbare. If there were more than 10 accessions (or most of the accessions) supporting the gene position in the Nipponbare, the colinearity was considered to be conserved. Then the same criterion was enforced in all the alignment results. The copy in the colinearity block was identified as the parent copy and the others were the offspring copies. However, for tandem duplicate pairs, it might be impossible to distinguish which one was the parental copy and which was the offspring.
The identification of neofunctionalization and subfunctionalization
All the protein sequences of the non-pseudogenic duplicate pairs and the corresponding references were used for the domain analysis through the InterProScan  (version 5.21-60.0), with an E-value no more than 1.0 × 10− 5. A copy contained fewer domains than its reference counterpart which meant there was a subfunctionalization event, whereas a copy containing more or new domains indicated a neofunctionalization event. If the duplicate pair had the same domains, it was not possible to determine whether there was neofunctionalization or subfunctionalization event without the evidence of expression patterns.
Availability of data and materials
The raw resequencing data of the 93 accessions have been deposited to the SRA at the NCBI under the BioProject IDs PRJNA522896  and PRJNA535372. And for the raw transcriptome data, the BioProject ID at the NCBI was PRJNA539946. All the SRA accessions are listed in the Additional file 1: Table S1. The scripts used in the CtgRef-CNV pipeline are available under the link: https://github.com/flzh628/CtgRef-CNV.
Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437–55.
Iafrate AJ, Feuk L, Rivera MN, Listewnik ML, Donahoe PK, Qi Y, Scherer SW, Lee C. Detection of large-scale variation in the human genome. Nat Genet. 2004;36(9):949–51.
Sebat J, Lakshmi B, Troge J, Alexander J, Young J, Lundin P, Maner S, Massa H, Walker M, Chi M, et al. Large-scale copy number polymorphism in the human genome. Science. 2004;305(5683):525–8.
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, et al. Global variation in copy number in the human genome. Nature. 2006;444(7118):444–54.
Meyerson M, Gabriel S, Getz G. Advances in understanding cancer genomes through second-generation sequencing. Nat Rev Genet. 2010;11(10):685–96.
Alkan C, Coe BP, Eichler EE. Genome structural variation discovery and genotyping. Nat Rev Genet. 2011;12(5):363–76.
Zhao M, Wang Q, Wang Q, Jia P, Zhao Z. Computational tools for copy number variation (CNV) detection using next-generation sequencing data: features and perspectives. BMC Bioinformatics. 2013;14(Suppl 11):S1.
Lye ZN, Purugganan MD. Copy number variation in domestication. Trends Plant Sci. 2019;24(4):352–65.
Pirooznia M, Goes FS, Zandi PP. Whole-genome CNV analysis: advances in computational approaches. Front Genet. 2015;6:138.
Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, de Grassi A, Lee C, et al. Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007;315(5813):848–53.
Alaei-Mahabadi B, Bhadury J, Karlsson JW, Nilsson JA, Larsson E. Global analysis of somatic structural genomic alterations and their impact on gene expression in diverse human cancers. Proc Natl Acad Sci U S A. 2016;113(48):13768–73.
Chiang C, Scott AJ, Davis JR, Tsang EK, Li X, Kim Y, Hadzic T, Damani FN, Ganel L, Montgomery SB, et al. The impact of structural variation on human gene expression. Nat Genet. 2017;49(5):692–9.
Ma X, Liu Y, Liu Y, Alexandrov LB, Edmonson MN, Gawad C, Zhou X, Li Y, Rusch MC, Easton J, et al. Pan-cancer genome and transcriptome analyses of 1,699 paediatric leukaemias and solid tumours. Nature. 2018;555(7696):371–6.
Henrichsen CN, Vinckenbosch N, Zollner S, Chaignat E, Pradervand S, Schutz F, Ruedi M, Kaessmann H, Reymond A. Segmental copy number variation shapes tissue transcriptomes. Nat Genet. 2009;41(4):424–9.
Saintenac C, Jiang D, Akhunov ED. Targeted analysis of nucleotide and copy number variation by exon capture in allotetraploid wheat genome. Genome Biol. 2011;12(9):R88.
Mace ES, Tai S, Gilding EK, Li Y, Prentis PJ, Bian L, Campbell BC, Hu W, Innes DJ, Han X, et al. Whole-genome sequencing reveals untapped genetic potential in Africa's indigenous cereal crop sorghum. Nat Commun. 2013;4:2320.
McHale LK, Haun WJ, Xu WW, Bhaskar PB, Anderson JE, Hyten DL, Gerhardt DJ, Jeddeloh JA, Stupar RM. Structural variants in the soybean genome localize to clusters of biotic stress-response genes. Plant Physiol. 2012;159(4):1295–308.
Li YH, Zhou G, Ma J, Jiang W, Jin LG, Zhang Z, Guo Y, Zhang J, Sui Y, Zheng L, et al. De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nat Biotechnol. 2014;32(10):1045–52.
Maldonado dos Santos JV, Valliyodan B, Joshi T, Khan SM, Liu Y, Wang J, Vuong TD, de Oliveira MF, Marcelino-Guimaraes FC, Xu D, et al. Evaluation of genetic variation among Brazilian soybean cultivars through genome resequencing. BMC Genomics. 2016;17:110.
Li B, Kremling KAG, Wu P, Bukowski R, Romay MC, Xie E, Buckler ES, Chen M. Coregulation of ribosomal RNA with hundreds of genes contributes to phenotypic variation. Genome Res. 2018;28(10):1555–65.
Cook DE, Lee TG, Guo X, Melito S, Wang K, Bayless AM, Wang J, Hughes TJ, Willis DK, Clemente TE, et al. Copy number variation of multiple genes at Rhg1 mediates nematode resistance in soybean. Science. 2012;338(6111):1206–9.
Zhang ZH, Mao LY, Chen HM, Bu FJ, Li GC, Sun JJ, Li S, Sun HH, Jiao C, Blakely R, et al. Genome-wide mapping of structural variations reveals a copy number variant that determines reproductive morphology in cucumber. Plant Cell. 2015;27(6):1595–604.
Wang Y, Xiong G, Hu J, Jiang L, Yu H, Xu J, Fang Y, Zeng L, Xu E, Xu J, et al. Copy number variation at the GL7 locus contributes to grain size diversity in rice. Nat Genet. 2015;47(8):944–8.
Goettel W, Ramirez M, Upchurch RG, An YQ. Identification and characterization of large DNA deletions affecting oil quality traits in soybean seeds through transcriptome sequencing analysis. Theor Appl Genet. 2016;129(8):1577–93.
Zhang L, Yu H, Ma B, Liu GF, Wang JJ, Wang JM, Gao RC, Li JJ, Liu JY, Xu J, et al. A natural tandem array alleviates epigenetic repression of IPA1 and leads to superior yielding rice. Nat Commun. 2017;8:14789.
Shen RX, Wang L, Liu XP, Wu J, Jin WW, Zhao XC, Xie XR, Zhu QL, Tang HW, Li Q, et al. Genomic structural variation-mediated allelic suppression causes hybrid male sterility in rice. Nat Commun. 2017;8:1310.
Soyk S, Lemmon ZH, Sedlazeck FJ, Jimenez-Gomez JM, Alonge M, Hutton SF, Van Eck J, Schatz MC, Lippman ZB. Duplication of a domestication locus neutralized a cryptic variant that caused a breeding barrier in tomato. Nat Plants. 2019;5(5):471–9.
Kawahara Y, de la Bastide M, Hamilton JP, Kanamori H, McCombie WR, Ouyang S, Schwartz DC, Tanaka T, Wu J, Zhou S, et al. Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice. 2013;6(1):4.
Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21(6):974–84.
Rausch T, Zichner T, Schlattl A, Stutz AM, Benes V, Korbel JO. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012;28(18):I333–I9.
Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, Li M, Zheng T, Fuentes RR, Zhang F, et al. Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature. 2018;557(7703):43–9.
Zhang J, Chen LL, Xing F, Kudrna DA, Yao W, Copetti D, Mu T, Li W, Song JM, Xie W, et al. Extensive sequence divergence between the reference genomes of two elite indica rice varieties Zhenshan 97 and Minghui 63. Proc Natl Acad Sci U S A. 2016;113(35):E5163–E71.
Zhou Y, Chebotarov D, Kudrna D, Llaca V, Lee S, Rajasekar S, Mohammed N, Al-Bader N, Sobel-Sorenson C, Parakkal P, et al. A platinum standard pan-genome resource that represents the population structure of Asian rice. Sci Data. 2020; 7(1):113.
Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative genomics viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform. 2013;14(2):178–92.
Yu P, Wang C, Xu Q, Feng Y, Yuan X, Yu H, Wang Y, Tang S, Wei X. Detection of copy number variations in rice using array-based comparative genomic hybridization. BMC Genomics. 2011;12:372.
Du H, Yu Y, Ma Y, Gao Q, Cao Y, Chen Z, Ma B, Qi M, Li Y, Zhao X, et al. Sequencing and de novo assembly of a near complete indica rice genome. Nat Commun. 2017;8(1):15324.
Zhao Q, Feng Q, Lu H, Li Y, Wang A, Tian Q, Zhan Q, Lu Y, Zhang L, Huang T, et al. Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nat Genet. 2018;50(2):278–84.
Fuentes RR, Chebotarov D, Duitama J, Smith S, De la Hoz JF, Mohiyuddin M, Wing RA, McNally KL, Tatarinova T, Grigoriev A, et al. Structural variants in 3000 rice genomes. Genome Res. 2019;29(5):870–80.
Somerville MJ, Mervis CB, Young EJ, J. SE, del Campo M, Bamforth S, Peregrine E, Loo W, Lilley M, Perez-Jurado LA, et al. Severe expressive-language delay related to duplication of the Williams-Beuren locus. N Engl J Med 2005; 353(16):1694–1701.
Lee JA, Madrid RE, Sperle K, Ritterson CM, Hobson GM, Garbern J, Lupski JR, Inoue K. Spastic paraplegia type 2 associated with axonal neuropathy and apparent PLP1 position effect. Ann Neurol. 2006;59(2):398–403.
Kleinjan DA, van Heyningen V. Long-range control of gene expression: emerging mechanisms and disruption in disease. Am J Hum Genet. 2005;76(1):8–32.
Veitia RA, S. B, Birchler JA. Gene dosage effects: nonlinearities, genetic interactions, and dosage compensation. Trends Genet 2013; 29(7):385–393.
Ohno S. Evolution by gene duplication. Berlin: Springer-Verlag; 1970.
Hughes AL. The evolution of functionally novel proteins after gene duplication. Proc Biol Sci. 1994;256(1346):119–24.
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J. Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999;151(4):1531–45.
He X, Zhang J. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics. 2005;169(2):1157–64.
Assis R, Bachtrog D. Neofunctionalization of young duplicate genes in Drosophila. Proc Natl Acad Sci U S A. 2013;110(43):17409–14.
Wang J, Tao F, Marowsky NC, Fan C. Evolutionary fates and dynamic functionalization of young duplicate genes in Arabidopsis genomes. Plant Physiol. 2016;172(1):427–40.
Stein JC, Yu Y, Copetti D, Zwickl DJ, Zhang L, Zhang C, Chougule K, Gao D, Iwata A, Goicoechea JL, et al. Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza. Nat Genet. 2018;50(2):285–96.
Rastogi S, Liberles DA. Subfunctionalization of duplicated genes as a transition state to neofunctionalization. BMC Evol Biol. 2005;5:28.
Nei M, Gojobori T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 1986;3(5):418–26.
Sudmant PH, KJ O, Antonacci F, Alkan C, Malig M, Tsalenko A, Sampas N, Bruhn L, Shendure J, Project G, et al. Diversity of human copy number variation and multicopy genes. Science. 2010;330(6004):641–6.
Duan P, Xu J, Zeng D, Zhang B, Geng M, Zhang G, Huang K, Huang L, Xu R, Ge S, et al. Natural variation in the promoter of GSE5 contributes to grain size diversity in rice. Mol Plant. 2017;10(5):685–94.
Choi JY, Zaidem M, Gutaker R, Dorph K, Singh RK, Purugganan MD. The complex geography of domestication of the African rice Oryza glaberrima. PLoS Genet. 2019;15(3):e1007414.
EZ, Wang L, Zhou J. Splicing and alternative splicing in rice and humans. BMB Rep. 2013;46(9):439–47.
Chen KN, Zhang YB, Tang T, Shi SH. Cis-regulatory change and expression divergence between duplicate genes formed by genome duplication of Arabidopsis thaliana. Chin Sci Bull. 2010;55(22):2359–65.
Zou C, Lehti-Shiu MD, Thomashow M, Shiu SH. Evolution of stress-regulated gene expression in duplicate genes of Arabidopsis thaliana. PLoS Genet. 2009;5(7):e1000581.
Zou Y, Huang W, Gu Z, Gu X. Predominant gain of promoter TATA box after gene duplication associated with stress responses. Mol Biol Evol. 2011;28(10):2893–904.
Eckardt NA. Positive and negative feedback coordinate regulation of disease resistance gene expression. Plant Cell. 2007;19(9):2700–2.
Albert FW, Bloom JS, Siegel J, Day L, Kruglyak L. Genetics of trans-regulatory variation in gene expression. eLife. 2018;7:e35471.
Grundberg E, Small KS, Hedman ÅK, Nica AC, Buil A, Keildson S, Bell JT, Yang T-P, Meduri E, Barrett A, et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nat Genet. 2012;44(10):1084–9.
Wright FA, Sullivan PF, Brooks AI, Zou F, Sun W, Xia K, Madar V, Jansen R, Chung W, Zhou Y-H, et al. Heritability and genomics of gene expression in peripheral blood. Nat Genet. 2014;46(5):430–7.
Kondrashov FA, Rogozin IB, Wolf YI, Koonin EV. Selection in the evolution of gene duplications. Genome Biol. 2002;3(2):RESEARCH0008.
Han MV, Demuth JP, McGrath CL, Casola C, Hahn MW. Adaptive evolution of young gene duplicates in mammals. Genome Res. 2009;19(5):859–67.
Conant GC, Wagner A. Asymmetric sequence divergence of duplicate genes. Genome Res. 2003;13(9):2052–8.
Zhang P, Gu Z, Li WH. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol. 2003;4(9):R56.
Cusack BP, Wolfe KH. Not born equal: increased rate asymmetry in relocated and retrotransposed rodent gene duplicates. Mol Biol Evol. 2006;24(3):679–86.
Liu SL, Baute GJ, Adams KL. Organ and cell type-specific complementary expression patterns and regulatory neofunctionalization between duplicated genes in Arabidopsis thaliana. Genome Biol Evol. 2011;3:1419–36.
Zhou Z, Zhou J, Su Z, Gu X. Asymmetric evolution of human transcription factor regulatory networks. Mol Biol Evol. 2014;31(8):2149–55.
Miao Z, Zhang T, Qi Y, Song J, Han Z, Ma C. Evolution of the RNA N6-Methyladenosine Methylome mediated by genomic duplication. Plant Physiol. 2020;182(1):345–60.
Johnson DA, Thomas MA. The monosaccharide transporter gene family in Arabidopsis and rice: a history of duplications, adaptive evolution, and functional divergence. Mol Biol Evol. 2007;24(11):2412–23.
Huminiecki L, Wolfe KH. Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res. 2004;14(10a):1870–9.
Lan X, Pritchard JK. Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals. Science. 2016;352(6288):1009–13.
Tirosh I, Barkai N. Comparative analysis indicates regulatory neofunctionalization of yeast duplicates. Genome Biol. 2007;8(4):R50.
Guschanski K, Warnefors M, Kaessmann H. The evolution of duplicate gene expression in mammalian organs. Genome Res. 2017;27(9):1461–74.
Sun H, Tao J, Liu S, Huang S, Chen S, Xie X, Yoneyama K, Zhang Y, Xu G. Strigolactones are involved in phosphate- and nitrate-deficiency-induced root development and auxin transport in rice. J Exp Bot. 2014;65(22):6735–46.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29(7):644–52.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013;14(4):R36.
Vera Alvarez R, Pongor LS, Marino-Ramirez L, Landsman D. TPMCalculator: one-step software to quantify mRNA abundance of genomic features. Bioinformatics. 2019;35(11):1960–2.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnetjournal. 2011;17(1):10–2.
Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics. 2010;11:485.
Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25(14):1754–60.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. Genome project data processing S. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25(16):2078–9.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience. 2012;1(1):18.
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.
Yoon S, Xuan Z, Makarov V, Ye K, Sebat J. Sensitive and accurate detection of copy number variants using read depth of coverage. Genome Res. 2009;19(9):1586–92.
Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. Versatile and open software for comparing large genomes. Genome Biol. 2004;5(2):R12.
Hardigan MA, Crisovan E, Hamilton JP, Kim J, Laimbeer P, Leisner CP, Manrique-Carpintero NC, Newton L, Pham GM, Vaillancourt B, et al. Genome reduction uncovers a large dispensable genome and adaptive role for copy number variation in asexually propagated Solanum tuberosum. Plant Cell. 2016;28(2):388–405.
Untergasser A, Cutcutache I, Koressaar T, Ye J, Faircloth BC, Remm M, Rozen SG. Primer3-new capabilities and interfaces. Nucleic Acids Res. 2012;40(15):e115.
Livak KJ, Schmittgen TD. Analysis of relative gene expression data using real-time quantitative PCR and the 2−ΔΔCT method. Methods. 2001;25(4):402–8.
Paradis E, Claude J, Strimmer K. APE: analyses of phylogenetics and evolution in R language. Bioinformatics. 2004;20(2):289–90.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75.
Benjamini Y, Hochberg Y. Controlling the false discovery rate - a practical and powerful approach to multiple testing. J R Stat Soc B. 1995;57:289–300.
Birney E, Clamp M, Durbin R. GeneWise and Genomewise. Genome Res. 2004;14(5):988–95.
Larkin MA, Blackshields G, Brown NP, Chenna R, McGettigan PA, McWilliam H, Valentin F, Wallace IM, Wilm A, Lopez R, et al. Clustal W and Clustal X version 2.0. Bioinformatics. 2007;23(21):2947–8.
Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24(8):1586–91.
Ranwez V, Douzery EJP, Cambon C, Chantret N, Delsuc F. MACSE v2: toolkit for the alignment of coding sequences accounting for frameshifts and stop codons. Mol Biol Evol. 2018;35(10):2582–4.
Gaut BS, Morton BR, McCaig BC, Clegg MT. Substitution rate comparisons between grasses and palms: synonymous rate differences at the nuclear gene Adh parallel rate differences at the plastid gene rbcL. Proc Natl Acad Sci U S A. 1996;93(19):10274–9.
Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, McWilliam H, Maslen J, Mitchell A, Nuka G, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014;30(9):1236–40.
Wang Y, Shang L, Yu H, Zeng L, Hu J, Ni S, Rao Y, Li S, Chu J, Meng X, et al. A strigolactone biosynthesis gene contributed to the green revolution in rice. Mol Plant. 2020;13:1–10. https://doi.org/10.1016/j.molp.2020.03.009.
We thank the National Medium Rice Genebank at China National Rice Research Institute, Chinese Academy of Agricultural Sciences for providing seeds of rice accessions, and the High-Performance Computing Center (Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences) for computing and bioinformatics support. We also thank Jiangsu Collaborative Innovation Center for Modern Crop Production for their support.
This work was supported by Grant from National Key Research and Development Program of China (2016YFD0101801) and Grants from the Science Technology and Innovation Committee of Shenzhen Municipality (JSGG20160429104101251, JCYJ20170303154319837, JCYJ20170412155447658). G.X. and J.R. are supported by the Agricultural Science and Technology Innovation Program of CAAS.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file 4: Table S3. The tandem duplications identified around GL7 loci and the promotor of IPA1. a equal to Average_depthCNV/Average_depthChr. b the read with a insert size close to the length of the CNV. c the region within 500 bp upstream and downstream of the breakpoint were used to detect the discordant reads. Table S4. The copy number matrix of the 10 CNVs identified by qPCT. Table S5. Genes with high copy number (≥10) in more than 5 accessions. Table S6. The copy number matrices of 10 loci in 15 accessions called by CNVnator, Delly, and CtgRef-CNV. The DUP, DEL, and CN1 loci, were marked as yellow, purple, and green, respectively. Table S7. The weighted accuracies of three softwares based on the qPCT results of 10 loci in 15 accessions. Table S8. The comparisons of CNV results generated by three methods. Table S9. The accuracy of the selected 150 CNVs detected by CtgRef-CNV but not detected by CNVnator. Table S10. The accuracy of the selected 150 CNVs detected by CNVnator but not detected by CtgRef-CNV. Table S11. The comparison of CNVs between Yu et al. and present study. Table S12. The comparison of CNVs between Du et al. and present study. Table S13. The comparison of CNVs between Wang et al. (3 K rice genomes) and present study. Table S14. The comparison of CNVs between Fuentes et al. (3 K rice genomes) and present study.
Additional file 6: Table S18. The statistics of correlation analysis. a for each gene, the TPM values were divided into groups by different copy numbers. Difference analysis were performed in R. If there was/were significant difference between at least two groups of TPM values, this gene was defined as Exp_diff gene. Table S19. The changes in the average rate of increase with the copy number gradient of positively correlated genes. Values are means ± s.d.. Table S20. The fate of duplicated genes. a corresponds to the pairs at least one copy pseudogenized. b-d correspond to the pairs at least one copy neo-functionalized, sub-functionalized, and neo-functionalized/sub-functionalized, respectively. e corresponds to the pairs whose copy numbers were in line with the assembly results of 93 rice accessions. Table S21. The number of copy specific reads. Table S22. Primers used in this study.
About this article
Cite this article
Zhao, F., Wang, Y., Zheng, J. et al. A genome-wide survey of copy number variations reveals an asymmetric evolution of duplicated genes in rice. BMC Biol 18, 73 (2020). https://doi.org/10.1186/s12915-020-00798-0
- Copy number variation
- Gene expression
- Duplicated gene
- Evolutionary fate
- Asymmetric evolution