In the RT model of intron loss, recombination between intron-containing genomic DNA and cDNA reverse transcribed from mature mRNA results in the loss of introns from the gene [12, 13]. Previous support for the RT model includes the loss of adjacent introns, a 3'-side bias of intron loss, a 5'-side bias of extant introns, and a bias of intron loss toward highly expressed and germline-expressed genes. However, conflicting results have been reported for adjacent intron loss and the positional bias of lost introns (see Background). The 5'-side bias of extant introns is less disputable [13, 49, 50], but alternative explanations have been posited that have not been disproved. Although germline expression of IL genes has not been widely reported [24, 30], no conflicting evidence is available. However, we propose an alternative explanation. A large number of studies have revealed an association between transcription and DNA damage, including DNA double-strand breaks [51–54]. In addition, the NHEJ repair of double-strand breaks was recently suggested to cause intron loss. Germline-expressed genes might therefore have a higher frequency of intron loss resulting from the repair of transcription-associated DNA damage. Similarly, the bias of intron loss toward highly expressed genes may also be explained by the NHEJ model.
In this study, we adopted a new method to test the RT model based on the process of reverse transcription shared by intron loss (as proposed by the RT model) and the formation of PPs. IL genes and parental genes of PPs share properties that may facilitate reverse transcription, such as being translated on free cytoplasmic ribosomes. More importantly, we found a positive correlation between the frequency of intron loss and the abundance of PPs. Our results strongly indicate that reverse transcription is a necessary step in intron loss. IL genes in mammals were found to be highly expressed. We also found that the IL genes of mice and rats have significantly higher expression levels than NIL genes (Additional file 4). The abundance of PPs is correlated with the expression level of their parental genes, especially in rodents, whose PPs are relatively young [55, 56]. Therefore, the shared feature of high expression between IL genes and parental genes of PPs suggests a common mechanism (that is, reverse transcription). In mammals, highly expressed genes provide more substrates for reverse transcription, which in turn leads to both a high frequency of intron loss and a high abundance of PPs. It should be noted that the correlations of gene expression level with the frequency of intron loss and the abundance of PPs are not necessarily applicable to all species. Because gene expression and genome-wide RT activity evolve rapidly, the present-day expression level available for analysis does not necessarily reflect the level at the time of intron loss and PP formation. A previous study showed that the correlation between PP abundance and the expression level of the parental genes of PPs is stronger for young pseudogenes than for old ones.
Besides direct recombination of cDNA with genomic DNA, the RT model has another sub-model: recombination or gene conversion of genomic DNA by intronless PPs. If a PP reciprocally recombines with genomic DNA, the intron lost from the functional gene should appear in the PP. If the mechanism is gene conversion of genomic DNA by intronless PPs, the exonic sequences flanking lost introns should be more similar to PPs than exonic regions that are unlikely to have been converted. By searching for these evolutionary traces, we attempted to test this sub-model with our dataset of intron loss. Unfortunately, no convincing results were obtained.
There is another possible but less likely explanation for the correlation between intron loss frequency and PP abundance. Highly expressed genes may be more likely than lowly transcribed genes to lose introns, because intron loss reduces metabolic load and the probability of mis-splicing. Because highly expressed genes have also generated more PPs, intron loss frequency and PP abundance would then be linked only superficially, through high expression level. The metabolic load of introns was previously proposed to be a selective force for intron length reduction in highly expressed genes [57–60]. However, the energetic cost of a long intron in a highly expressed gene was found to be too trivial to act as a selective force for intron loss or intron size reduction in organisms with small effective population sizes, such as humans and mice. The splicing of each pre-mRNA molecule has a certain probability of error. A highly expressed gene, which has a large number of pre-mRNA molecules to be spliced, is thus expected to yield more mis-spliced products. However, a recent study revealed that the frequency of splicing error is positively correlated with intron length but not with gene expression level, probably because highly expressed genes generally have small introns.
Using the abundance of PPs to test the RT model has limitations. Mammalian genomes have an especially high content of PPs and abundance of retrotransposons. In some organisms, transposable elements are subject to constant turnover. If PPs have the same fate as retrotransposons in these organisms, the abundance of PPs would not reflect the affinity of the mRNA molecules of parental genes for RT. As a consequence, the abundance of PPs would not correlate with the frequency of intron loss, even if intron loss occurred by the mechanism proposed by the RT model. In spite of this, we examined whether a correlation between the frequency of intron loss and the abundance of PPs exists in Drosophila and Arabidopsis thaliana using previously published datasets of intron loss [27, 64]. None of the IL genes were found to have produced any PPs. Among the NIL genes, only a very small proportion produced PPs (0.23% for Drosophila and 0.73% for Arabidopsis). Fisher's exact tests showed that the differences between IL genes and NIL genes are not significant in either Drosophila or Arabidopsis (P > 0.6 in both cases). Judged from this result alone, the RT model would not appear to be the major mechanism of intron loss in either Drosophila or Arabidopsis. However, previous studies suggested that the RT model might be the major mechanism of intron loss in Drosophila but not in Arabidopsis [21, 24, 27]. Given the very low percentage of genes that are parental genes of PPs in Drosophila and Arabidopsis, PP analysis cannot support a reliable conclusion on the RT model in these species.
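The comparison of IL and NIL genes with Fisher's exact test can be illustrated with a minimal sketch. The gene counts below are hypothetical, chosen only so that no IL gene has a PP and about 0.23% of NIL genes do; they are not the counts used in this study. The function name `fisher_exact_two_sided` is likewise an illustrative assumption, and the two-sided P value follows the common convention of summing the probabilities of all tables that are no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: IL genes, NIL genes. Columns: with PPs, without PPs.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum over all tables that are as likely as, or less likely than, the observed one
    p = sum(q for q in (prob(x) for x in range(lo, hi + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Hypothetical counts (assumption, not data from this study):
# 100 IL genes, none with PPs; 10,000 NIL genes, 23 with PPs (0.23%).
p_value = fisher_exact_two_sided(0, 100, 23, 9977)
print(f"two-sided P = {p_value:.3f}")
```

With so few parental genes of PPs, observing zero PPs among a small set of IL genes is the most probable outcome under the null hypothesis, which is why a test of this kind has little power to detect a difference.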
The RT model, the genomic deletion model and the double-strand-break repair model attempt to describe how intron loss occurs. In evolution, a new mutation may be fixed, eliminated or randomly lost, depending on its effect on the fitness of the host organism and on the population size of the organism. Unless intron losses are neutral, and thus randomly lost or fixed, the mutational models alone cannot fully account for the pattern of intron loss. The failure of the mutational models to account for the patterns of intron loss in many studies may also be explained by the selective fixation of some special cases of intron loss [65–67]. Mice and rats have more frequent intron loss than humans [30, 68] but a similar abundance of PPs. Under the assumption that introns are slightly deleterious, and thus that intron loss is selectively favored, this difference could be explained by the difference in the efficiency of natural selection between humans and rodents.