 Research article
 Open access
 Published:
Gene birth in a model of nongenic adaptation
BMC Biology volume 21, Article number: 257 (2023)
Abstract
Background
Over evolutionary timescales, genomic loci can switch between functional and nonfunctional states through processes such as pseudogenization and de novo gene birth. Particularly, de novo gene birth is a widespread process, and many examples continue to be discovered across diverse evolutionary lineages. However, the general mechanisms that lead to functionalization are poorly understood, and estimated rates of de novo gene birth remain contentious. Here, we address this problem within a model that takes into account mutations and structural variation, allowing us to estimate the likelihood of emergence of new functions at nonfunctional loci.
Results
Assuming biologically reasonable mutation rates and mutational effects, we find that functionalization of nongenic loci requires the realization of strict conditions. This is in line with the observation that most de novo genes are localized to the vicinity of established genes. Our model also provides an explanation for the empirical observation that emerging protogenes are often lost despite showing signs of adaptation.
Conclusions
Our work elucidates the properties of nongenic loci that make them fertile for adaptation, and our results offer mechanistic insights into the process of de novo gene birth.
Background
An organism’s genome consists of several types of loci that are known to perform diverse functions: these functional loci include genes, gene regulatory regions, and sequences maintaining chromosome structure [1]. However, measuring the proportions of the functional parts of genomes is a subject that is fraught with disagreements, which stem from problems with our conception of what constitutes a functional genomic locus [2]. While evidence of selection at a genomic locus is a strong indicator of function, problems arise when one considers the emergence of functionality in a previously nonfunctional locus. The subject of emergence of functionality is especially relevant for the study of de novo gene birth, whereby novel genes are born from previously nongenic loci [3]. Specifically, de novo gene birth involves a previously nongenic locus gaining the ability to express a functional product. Hereafter, our definition of functional product includes both proteins and RNA.
De novo gene birth is a widespread phenomenon: well known examples of de novo genes exist across multiple evolutionary lineages, such as plants [4], fungi [5], and animals [6], including humans [7]. At the same time, measuring the rate of de novo gene birth is not straightforward [3]: firstly, evolutionary conservation forms a basic criterion for a sequence to even be considered a “gene,” and candidates for de novo emerged genes necessarily show some level of conservation. Secondly, the only conclusive criteria for proving de novo emergence is to identify the noncoding ancestral sequence and identify enabler mutations that first allowed expression of this sequence [8]. These two criteria are at odds with each other, since the likelihood that traces of the homologous ancestral sequence are lost increases with time and, hence, with conservation level. Therefore, strict criteria for choosing candidate “genes” and strict criteria to ascertain “de novo emergence” lead to an undercount [9]. On the other hand, genomic analyses based on synteny [10] and phylostratigraphy [11] find indirect evidence of de novo gene emergence based on the lack of a homologous ancestral coding sequence. These indirect counts are most likely to be overcounts, and thus the proposed prevalence of de novo gene birth remains contested [12]. In parallel, new methods to detect de novo genes based on machine learning techniques are also being developed [13].
Our understanding of the range of possible mechanisms of de novo gene birth comes from several wellknown case studies. The proposed mechanisms are typically expressed as a series of substeps involving gains and losses of stable transcription, open reading frames [3], and sequences for nuclear export of RNA [14]. Case studies have also uncovered how certain genomic regions, such as introns [15], and certain tissues like the brains and testes of animals [3] favor the emergence of de novo genes. These mechanisms serve as very useful descriptors of the conditions under which de novo gene birth becomes likely. Nevertheless, in order to formulate a predictive theory of de novo gene birth, it is necessary to connect these complex descriptions with empirically measured rates and effects of simple genomic changes such as mutations and structural variations. In this work, we attempt to bridge this gap: we envisage that contributions of genomic loci to organismal fitness, which is often measured in terms of the organism’s growth rate, can be used as a correlate of the functionality of these loci. For example, in [16], the physiological activity of putative yeast protogenes was demonstrated when overexpression of these genes impacted growth rate. In this study, we use an evolutionary model to explore how the contribution of a nonfunctional genomic locus to organismal fitness might increase over time due to accumulating mutations, and investigate the implications of this process for de novo gene birth.
The effects of novel, spontaneous mutations on growth rate have been experimentally measured for various organisms [17]. The outcomes of these studies are represented as DFEs (distributions of fitness effect), which are distributions of the magnitude of mutational effects on growth rate. In our model, we sample biologically reasonable distributions of fitness effect of mutations using recently measured DFE parameters for Chlamydomonas reinhardtii [18] as reference. Notably, observations in [18] indicate that the DFE of specific regions of the C reinhardtii genome, such as exons, introns or intergenic sequences, are similar to each other and to the DFE of the whole genome. However, in general, the DFE is known to vary across different regions of the genome [19] and across different species [20]. Particularly, we expect most mutations incident on nonfunctional genomic loci to be neutral. Accordingly, we sample a wide range of DFEs in the vicinity of the DFE reported in [18], and a large number of the sampled DFEs mostly produce neutral mutations.
Over a time scale of millions of years, in addition to small mutations (< 50 bp), one can also expect large structural variations (from 50 bp up to several megabases) to impact the evolution of genomic loci [21]. While the rate of structural variation is estimated to be hundreds of times slower than the rate of small mutations, its effect is likely to be much larger [22]. Of particular importance to our question is the possibility that the entire genomic locus under consideration gets deleted. Therefore, we test in our model whether an initially nonfunctional locus can survive in the face of locus deletion for long enough to allow for adaptation.
Finally, we consider the particular case of de novo gene birth, whereby a previously nonfunctional locus gains the ability to express a functional product (protein/RNA). Recent studies report how new genes gain expression [23] and functionality [24] over time. Measurements from mutational scans of proteinencoding genes indicate that the overall fitness contribution of a gene is a combination of the adaptive value of the gene product and its expression level [25]. We envisage that equivalently, during the process of de novo gene birth, mutational fitness effects can be decomposed into the effect on adaptive value and the effect on expression level. In the model, we use the DFE, together with empirical measurements of mutational effects on expression, to extract a scenario of the evolution of the adaptive value of the de novo gene product.
We find that the DFE alone is insufficient to determine the fate of nongenic loci, and the interplay of mutation rate and selection plays a crucial role. We tested the predictions of our model under two mutation rate regimes: in nature, most organisms have a mutation rate in the range \(10^{11}\)\(10^{8}\) mutations per basepair per generation (Additional file 1: Table 1 [18, 26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46]―the lower end of mutation rates corresponds to the germline of ciliate Tetrahymena thermophila [35] and the higher end corresponds to the plant Arabidopsis thaliana [44]). This range falls under the low mutation rate regime, whereas population dynamics in laboratory growth competition experiments such as in [47], where multiple variants of a gene compete against each other for a small number of generations, are comparable to the high mutation rate regime.
Our model produced contrasting pictures for these regimes: In the high mutation rate regime, we find that a wide range of biologically reasonable DFEs allow functionalization of nongenic loci. Moreover, this gain of functionality occurred despite the antagonistic effects of locus deletion, particularly for the C reinhardtii DFE parameters. In the special case of de novo gene birth, mutational effects on adaptive value had a shorttailed distribution, thus implying that the rare, extreme mutations that are characteristic of DFEs are driven solely by mutational effects on expression level.
On the other hand, in the more natural low mutation rate regime, we find that conditions that allow functionalization of nongenic loci are much more stringent, and populations are very sensitive to the effects of locus deletion. Moreover, in the case of de novo gene birth, mutational effects on adaptive value had a longtailed distribution; this implies that in the low mutation rate regime, the rare, extreme mutations in DFEs are driven by both mutational effects on expression level and adaptive value.
However, under both the low and high mutation rate regimes, we find that mutations in adaptive value, rather than expression level are the major drivers for the sustained fitness increase over evolutionary time. Our results can be tested experimentally using high throughput mutational scans on random initial sequences; such experiments stand to offer quantitative insights into the process of de novo gene birth.
Model of nonfunctional locus adaptation
We set up a population genetic framework to model wellmixed finite populations of fixed size N, composed of asexually reproducing, haploid individuals. Fitness of an individual represents growth rate in the exponential phase, which is equivalent to the quantities considered in experiments that measure DFEs (e.g., [18]). In this work, for any individual i, we consider the evolution of the fitness contribution F(i) of a single locus in its genome. We are interested in the probability that the locus persists in the population and that its fitness contribution increases above some large, predetermined threshold. In this study, we use a fitness threshold of 0.1, which is much larger than the typical effect sizes of incident mutations.
In particular, we examine the special case of de novo gene birth, where the functional product of a gene can be either protein or RNA. We decompose the fitness contribution into two quantities: functionality, or adaptive value of the expression product (A(i)), and its expression level (E(i)). Note that this decomposition represents the limiting case where the whole fitness contribution of the de novo gene is due to the gene product. It is likely that for emerging genes, the relative contribution of the gene sequence due to its role in gene regulation, or chromosome arrangement is comparable to the contribution of the expression product, and this relative contribution is itself an evolvable feature of the locus. Our definition of fitness of the product is not tied to any specific function, and we assume that \(F(i) = A(i) \times E(i)\). We note that genes exert their effect on organismal fitness within the context of a molecular interaction network and never work in isolation [48]. Formally, in our definition, adaptive value A represents the maximal contribution to organismal fitness made by the expression product, conditional upon the underlying cellular molecule interaction network. Hence, A is a simple measure that incorporates the effects of all other genes. In the model, this maximal fitness contribution is realized at the maximal expression level, \(E = 1\). And \(\Delta A\) is the change in the maximal fitness contribution of a variant due to a mutation. Notice here that while this decomposition is fairly straightforward for unicellular organisms, the case of multicellular organisms is more complicated: in general, we can expect different cell types to express the same gene at different levels. Therefore, in multicellular organisms, fitness is a function of the expression profile of the gene across all cell types. Hence, we limit our discussion of de novo gene birth to unicellular organisms.
The locus of interest is nongenic, with initial adaptive value \(A_0(i) = 0\), and an initial expression level \(E_0(i) = 10^{3}\) for all individuals. The initial expression level captures leaky expression of intergenic regions [49], which is estimated to be 1000fold smaller than the level of highly expressed genes [50].
A single timestep is the time it takes for one mutation to occur in the locus. Now, the size of a single locus is very small compared to the whole genome. Hence, it is reasonable to assume that the rest of the genome accrues many more mutations in the time it takes for a single mutation to be incident on our locus of interest. The duration of a timestep for different organisms is equivalent to the inverse of the average mutation rate measured for these organisms: For a locus of \({\sim }100\) base pairs, a single model timestep can range between \(10^{2}10^{5}\) years for different organisms (Fig. 1B, see also Additional file 1: Table 1). That is, a single timestep in the model encompasses many generations. Hence, populations at different timesteps of the model are nonoverlapping; the population at timestep \(t+1\) is composed entirely of the offspring of individuals in the timestep t (Fig. 1A). Offspring incur mutations at each timestep, which affect the locus fitness (\(\Delta F(i)\)).
In the case of de novo gene birth, \(\Delta F(i)\) can be decomposed into mutational effects on adaptive value, \(\Delta A(i)\), and expression level, \(\Delta E(i)\):
Offspring can also incur structural variations, which in the model leads to the deletion of the locus in that individual. The probability of locus deletion d represents the rate of structural variation relative to mutation rate.
In this study, we separately investigate two qualitatively different scenarios:

When mutation rate is high, and the number of generations between two model timesteps is \(<{2000}\), we model the probability that an individual leaves an offspring in the next timestep as being proportional to the fitness F(i) of the locus.

When mutation rate is low, and the number of generations between two model timesteps is \(>{2000}\), a single variant gets fixed in the population before the next mutation arrives. This is in contrast with the high mutation rate regime, where each model timestep is initialized with multiple variants (Additional file 1: Section 2). Here, we model the probability of fixation of an individual i at each model timestep as being proportional to its fitness F(i). This increase in fixation probability with fitness is consistent with studies that infer evolutionary outcomes in simpler situations with homogeneous populations and a single mutant [51] (we describe how we update populations with high and low mutationrates in the “Method to update population fitness” section).
For simplicity, we do not consider the compounding effects of competition with established genes and genetic interactions such as linkage and epistasis in the main manuscript (see Additional file 1: Section 3 for a discussion of evolution in the presence of other nongenic loci and established genes, and the effect of diminishing returns epistasis). Whenever \(F(i) \le 1\), we consider the locus lethal and such individuals cannot produce offspring. We update populations for 1000 timesteps, equivalent to 0.1–100 million years, depending on the organism and size of the locus (Additional file 1: Table 1).
Fitness effects of mutations (\(\Delta F(i)\)) are drawn from the characteristic DFE of the locus (Fig. 1C). Multiple studies indicate that longtails are important features of DFEs, which can be captured by the general form of longtailed gamma distributions [52, 53]. Therefore, we choose to follow [18] and represent DFEs as twosided gamma distributions and characterize them using four parameters: (i) average effect of beneficial mutations p, (ii) fraction of beneficial mutations f, (iii) average effect of deleterious mutations n, and (iv) the shape parameter s, where distributions with lower s are more longtailed. In our simulations, we first decide whether a new mutation is beneficial or deleterious by using f as the probability of beneficial mutations. We then draw mutational effect sizes: we use a gamma distribution with parameters p and s for beneficial mutations, and a gamma distribution with parameters n and s for deleterious mutations (“Method to update population fitness”).
Mutations in the model represent the mutation types included in [18], which were singlenucleotide mutations and short indels (insertions or deletions of average length \(\le 10\) bp) [54]. Note that the quantity reported in experimental studies is the DFE across the whole genome. We account for differences in DFEs across species and locations along the genome by sampling widely across these four parameters p, f, n, s. Consistent with the expectation that mutations at nongenic loci should be mostly neutral, the parameters we sample include a large number of DFEs which produce a majority of neutral mutations (Additional file 1: Section 4).
We also use empirical measurements to estimate the distribution of mutational effects on expression. Studies indicate that mutational effects on expression from established promoters follow a heavytailed distribution [55]. More relevant to our study of de novo gene birth are the recent measurements of mutational effects on expression from random sequences [46], which follow a power law distribution, \(\textrm{Pr}(\Delta E) \sim \Delta E^{2.25}\). We assume in the model that each mutation has two components: a component that affects the adaptive value of the product (\(\Delta A(i)\)) and a component that affects expression level (\(\Delta E(i)\)). At each time step, we use the above power law distribution to draw \(\Delta E(i)\). We then calculate values of mutational effects \(\Delta A(i)\) using Eqs. (1) and (2), given distributions of mutational effects on fitness and on expression level (“Method to update expression level and adaptive value”; see also Additional file 1: Section 5 for possible deviations from the powerlaw \(\Delta E\) distribution due to the very small initial values E(i)).
In all, we survey 432 parameter sets―108 DFE parameters p, f, n, s, and 4 parameter values for d, representing the probability of locus deletion―(Fig. 1C). We simulated replicate populations by performing 100 independent stochastic simulations with the same set of parameters for population sizes \(N={100}, {1000}\) (“Surveying the space of DFE and locus deletion parameters in populations of various sizes”). Simulating replicate populations allows us to estimate the probability that nongenic loci with properties comparable to model parameters can adapt and become functional.
At the end of each simulation, we trace the ancestry of each locus in each individual (“Tracing ancestry to find fixed mutations”) in order to track the fitness contribution of the last common ancestor of the locus: for populations in the high mutation rate regime, the ancestry of all individuals at some timestep t can be traced back to a single individual at some previous timestep \(tt_{fix}\), whereas, for populations in the low mutation rate regime, all individuals at timestep t are the progeny of the single variant that gets fixed in the timestep \(t1\). We count the number of replicate populations in which the locus is still retained at timestep \(t = {1000}\), and the last common ancestor of individuals at \(t={1000}\) is fitter than the predetermined fitness threshold of 0.1 (Fig. 1A).
Results
Tuning the number of neutral mutations
Many studies report the abundance of deleterious mutations compared to beneficial mutations, but there is plenty of evidence to the contrary as well [56]. We do not have extensive empirical measurements of the DFE of nongenic loci, nevertheless, since these loci are not expected to be expressed at high levels, it makes intuitive sense to assume that most nongenic mutations should be nearly neutral. In order to reconcile the above factors, we explore a wide range of DFEs around the DFE reported for C reinhardtii in [18].
To test the diversity of DFEs we sample in this study, we count the number of positiveeffect, negativeeffect, and neutral mutations that the different DFEs produce. Theoretically, we expect mutational effects of magnitude \(10^{3}\) to be neutral for a population of size N = 1000. We use a more stringent condition and count mutations with effect size \(< 0.5 \times 10^{3}\) as effectively neutral (we also demonstrate neutrality of these mutations in Additional file 1: Section 4). Thus, we count mutations of effect \(\ge 0.5 \times 10^{3}\) as positive effect mutations, mutations with effect \(\le 0.5 \times 10^{3}\) as negative effect mutations, and mutations with effect between \(0.5 \times 10^{3}\) and \(0.5 \times 10^{3}\) as neutral mutations.
In particular, we measured how the parameters of the model control the fractions of positive, neutral, and negative effect mutations (Fig. 2A) and the magnitudes of positive and negative effect mutations (Fig. 2B). Importantly, we find that the fraction of neutral mutations decreases with the shape parameter (s).
Out of the 108 DFEs sampled, 50% of the different DFEs produced more neutral mutations than positive or negative effect mutations. In 52% DFEs, negative effect mutations were more numerous than positive effect mutations, and in 70% DFEs, average effect size of negative mutations was larger than that of positive mutations. Eighteen out of 108 DFEs had more neutral mutations than positive or negative mutations and produced larger and more numerous negative effect mutations than positive effect mutations, whereas only 1 DFE had fewer neutral mutations than either positive or negative mutations and smaller and fewer negative effect mutations than positive effect mutations. Overall, the parameters we sample cover a broad range of DFEs; many, but not all, of the sampled DFEs match our intuitive beliefs and produce mainly neutral and deleterious mutations.
Conducive parameters for adaptation
We want to understand which of the 108 DFE parameter sets (p, n, f, s) that we test are conducive to adaptation. In order to do this, we simulate the evolution of loci for which incident mutations follow a given DFE in populations of size \(N={1000}\). We perform this simulation in 100 replicate populations for each of the 108 DFEs. These simulations do not include locus deletion (i.e., \(d=0\)). We also test how mutation rate influences the fate of nongenic loci: We perform two separate sets of these 10,800 simulations corresponding to population dynamics at high and low mutation rates respectively.
In the model, we keep track of the ancestors of all the individuals at any given timestep. After a sufficient number of timesteps, we find that all individuals at some timestep t come from a single common ancestor: In the low mutation rate regime this common ancestor occurs in the immediately previous generation \(t1\), while in the high mutation rate regime, the common ancestor can be traced back to \(\sim 100\) timesteps ago. We call this most recent common ancestor of all individuals at timestep t the “last common ancestor.” Equivalently, we say that the variant of the locus present in this “last common ancestor“ is fixed in the population at time t. In this analysis, we check whether or not the fitness contribution of the variant in the last common ancestor of all individuals at timestep \(t={1000}\) has crossed the threshold of \(F=0.1\). We call the fraction of replicate populations in which the fitness threshold is crossed the conducivity of the corresponding DFE. And we call a given set of DFE parameters conducive if its \(conducivity\ge 0.5\); i.e., the fitness of the last common ancestor for individuals at timestep \(t={1000}\) crosses 0.1 in at least 50 out of 100 replicate populations.
High mutation rate: rare beneficial mutations are sufficient for adaptation
We find that a majority (80 out of 108) of DFE parameter sets are conducive (Fig. 3A grey bars, see Additional file 1: Fig. S8 for \(N={100}\)). Moreover, the bimodality of the grey histogram in Fig. 3A indicates that DFE parameters tend to either be highly conducive or highly repressive to adaptation. As one can anticipate, the conducive DFE parameter sets tend to have high values for the magnitude (p) and the frequency of beneficial mutations (f) and low values for the magnitude of deleterious mutations (n) (Fig. 3A, inset―blue bars, see also Additional file 1: Fig. S8(A), inset). In particular, for the DFE parameters closest to C reinhardtii, 97% of the N = 1000 replicate populations (\(52\%\) of N = 100 replicate populations) crossed the fitness threshold. Parameters representing nonconducive DFEs showed the opposite trends; they tend to have low values for the magnitude (p) and the frequency of beneficial mutations (f) and high values for the magnitude of deleterious mutations (n) (Additional file 1: Fig. S9). Nonconducive DFEs are highly repressive, and for 24 out of the 28 nonconducive DFEs, none of the \(N={1000}\) replicate population crossed the fitness threshold.
We also compared the number and magnitudes of positive and negative effect mutations: 21% of DFEs had more numerous and larger positive effect mutations, and 100% of these DFEs were conducive. However, 45% of the DFEs had more numerous and larger negative effect mutations, and 49% of these very stringent DFEs were also conducive. Among the stringent DFEs, conducivity was strongly negatively correlated with the number of negative effect mutations (Pearson’s correlation coefficient \(R = {0.60}\), pvalue = \({4.5 \times 10^{6}}\)).
In a principal components analysis (PCA) of the proportions and magnitudes of positive, negative, and neutral effect mutations, the first three components capture a large part (77.5%) of the variance in modelgenerated data (Additional file 1: Section 6.3). DFE conducivity correlates well with all three principal components (PCs). In particular, the first PC is highly correlated with the shape parameter (s) (Additional file 1: Table 2). This is in line with our observation that a substantial number of model loci where positive effect mutations were small and rare could adapt and cross the fitness threshold of 0.1 (Additional file 1: Fig. S6(B)). We find that parameters with low values of the shape parameter (s) can compensate for low frequencies and small average effect size of beneficial mutations (Additional file 1: Figs. S11(A), S12). Notably, while the conducivity of DFEs is strongly negatively correlated with the proportion of negative effect mutations (Pearson’s correlation coefficient \(R = {0.68}\), pvalue = \({4.9 \times 10^{16}}\)), it is positively correlated with both the proportion of positive effect mutations and neutral mutations (R = 0.35, (pvalue = \({1.8 \times 10^{4}}\)), R = 0.3 (pvalue = \({1.6 \times 10^{3}}\)) respectively.
Overall, we find that in the high mutation rate regime, even very rare largeeffect beneficial mutations are sufficient for adaptive evolution. We further test our conclusion in an alternative scenario: we use a nonconducive DFE (\(p=0.001, n=0.005, f=0.25, s=0.1\)), augmented by extremely rare (probability = 0.0001), large effect positive mutations (\(\Delta F = 0.15\)). Consistent with our findings, the evolution of the locus under this scenario leads to adaptation (Additional file 1: Section 7).
Low mutation rate: adaptation requires frequent and large beneficial mutations
In this regime, 43 out of 108 DFE parameter sets are conducive (Fig. 3A dull green bars). While the histogram of DFE conducivity is bimodal, there are many more parameters with intermediate conducivity. Consistent with the case of high mutation rates, conducive DFE parameter sets tend to have high values for the magnitude and the frequency of beneficial mutations (p and f) and low values for the magnitude of deleterious mutations (n) (Fig. 3A, insetgreen bars); correspondingly, parameters representing nonconducive DFEs tend to have low values for the magnitude (p) and the frequency of beneficial mutations (f), and high values for the magnitude of deleterious mutations (n) (Additional file 1: Fig. S9).
However, in the low mutation rate regime, the requirements for gene birth are much more strict, and for the DFE parameters closest to C reinhardtii, none of the replicate populations crossed the fitness threshold (Fig. 3B, right panel). Overall, 24% of DFEs that had more numerous and larger positive effect mutations were conducive, and only 1.5% of the very stringent DFEs that had more numerous and larger negative effect mutation were conducive.
Moreover, in the principal components analysis, DFE conducivity correlates well with the second and third, but not the first PC (Additional file 1: Table 2). In line with this, the shape parameter (s), which is highly correlated with the first PC, does not compensate for small and low frequency beneficial mutations (Additional file 1: Table 2, Fig. S11(B)). Consistently, the conducivity of DFEs is strongly negatively correlated with the proportion of negative effect mutations (Pearson’s correlation coefficient \(R = 0.58\), pvalue = \({4.8 \times 10^{11}}\)) and positively correlated with the proportion of positive effect mutations (\(R = 0.50\), pvalue = \({3.3 \times 10^{8}}\)), but shows weak and nonsignificant correlations with the number of neutral mutations (\(R = 0.1\), pvalue = 0.3). Thus, in contrast with the high mutation rate regime, rare beneficial mutations are not sufficient to drive adaptation in the low mutation rate regime.
Population dynamics and fitness trajectories
We next looked at the trajectories of population averaged fitness in order to find out whether there are significant regularities in the dynamics of adaptation.
High mutation rate: fitness trajectories transition from a mutation dominated to a selection dominated phase
We find that the fitness trajectories of populations where the fitness threshold is crossed have a typical shape: first, the average population fitness decreases to reach a minimum value. Beyond this, the average population fitness increases roughly linearly (Fig. 3B, left panel). We intuit that these two phases can be explained by the opposing effects of new incident mutations and of selection: in the first phase where the population average fitness decreases, fitness is likely dominated by the effects of new mutations which are more likely to be deleterious/neutral (in 52% DFEs, negative effect mutations were more numerous than positive effect mutations, and in 70% DFEs, average effect size of negative mutations was larger than that of positive mutations). In the second phase, population average fitness increases because the effects of selection become apparent after a sufficient amount of time has passed and dominate over the effect of new, incoming mutations.
These fitness trajectories are reminiscent of the dynamics of learning through adaptive strategies in gambling problems, where an initial phase of loss of capital due to the cost of learning is followed by recovery [57]. Although, at the level of a single organism, the fitness effects of new mutations are independent of the history of mutations that came before, natural selection endows populations with a form of memory [58, 59].
Two numbers indicate the point in the trajectory at which selection leads to consistent improvement in fitness: minimum average fitness and time at which minimum fitness is achieved (Fig. 3B, left panel). As expected, populations with lower minimum fitness spend longer in the decreasing fitness regime (Pearson correlation coefficient between minimum fitness and time of minimum fitness = − 0.87, with pvalue = \(5.5 \times 10^{34}\); see Supplementary Fig. S9). We interpret the time of minimum fitness as representing the amount of time it takes for the effect of selection to start dominating over the effect of new incident mutations. In the high mutation regime, DFE parameters, especially p and f, are significantly correlated with time of minimum fitness (Pearson’s \(R = 0.41\) (pvalue = \({9.3 \times 10^{6}}\)) and \(0.52\) (pvalue = \({1.9 \times 10^{8}}\)), respectively; see Additional file 1: Table 3)).
Low mutation rate: fitness trajectories always governed by incident mutations
In simulations with low mutation rates, a single variant gets fixed in the population at each model timestep. Therefore, although the absolute values of fitness vary across model timesteps, the distribution of relative fitness remains unchanged and are governed by the fitness effects of incident mutations. Hence, in the low mutation rate regime, “learning” occurs during the many rounds of selection in the generations intervening two mutational timesteps and is not visible at the timescale of model timesteps. Consistently, even for conducive DFE parameters, fitness trajectories do not have a selection dominated phase of persistent fitness increase (Additional file 1: Section 9).
Dynamics of adaptation, locus loss, and retention
We next test the effect of large structural variations that can delete the entire locus in an individual on the chances that the locus can adapt and become fixed in the population. Note that we are concerned with the general case of gain of functionality by a locus, which may or may not be due to expression of a product (protein or RNA). In the particular case where the locus expresses a protein product, there are many ways to erase functionality, such as by disrupting mRNA export from the nucleus, ribosome binding site, or truncating gene products through nonsense mutations. Here, we only study the effect of locus deletion, which affects all loci irrespective of their particular mode of function.
High mutation rate: mutations drive adaptation despite the effect of locus deletion
When \(d>0\), The effect of locus deletion can be understood in terms of a competition between two subpopulations: the subpopulation that has lost the locus, and therefore lacks any fitness contribution from it, and the subpopulation that retains it (Additional file 1: Section 10). The probability that the subpopulation that has lost the locus takes over increases with time of minimum fitness as calculated for the case where \(d=0\): the longer the average fitness remains negative, the more probable is the loss of the locus from the whole population. Therefore, fewer replicate populations with DFEs such that minimum fitness is reached later go on to cross the fitness threshold of 0.1 (Fig. 3C, Additional file 1: Fig. S13(B)). As a consequence, the number of conducive parameter sets (out of 108 sets overall) reduces from 80 at \(d=0\) to 74 at the plausible value of \(d=0.005\) and to 69 and 50 at the inordinately high values of \(d=0.01\) and 0.05, respectively. Particularly, the DFE closest to C reinhardtii, for which minimum fitness and time of minimum fitness averaged across all replicate populations are \(0.035\) and 55.7 respectively, is still conducive at \(d=0.005\) (Fig. 3C, red dotted line).
Low mutation rate: structural variations lead to locus loss despite adaptation
In this regime, a single variant gets fixed at every time step, which results in populations being very sensitive to locus deletion. The dynamics can still be understood in terms of subpopulations, which now compete for fixation during the many intervening generations within one model time step: At every time step, the locus gets deleted in a small fraction of the population, and there is a finite probability that an individual from this subpopulation takes over and becomes fixed in the population. This probability is proportional to the relative fitness of the subpopulation that has lost the locus (Additional file 1: Section 2).
Hence, adaptation in this regime requires consistent fixation in successive timesteps of variants from the subpopulations that contain the locus. Additionally, the larger the magnitude of beneficial mutations that are fixed in subsequent timesteps, the faster the decrease in relative fitness of the subpopulation that has lost the locus. Consistent with this picture, many populations where the fitness of the locus does not grow fast enough lose the locus despite having crossed the threshold of 0.1 (Fig. 3D, brown points): for the plausible value of \(d=0.005\), for 29 DFE parameter sets \(>50\%\) replicate populations cross the fitness threshold of 0.1, and for 9 of these 29 parameter sets, the locus is subsequently lost from \(>95\%\) of the replicate populations. This result is in agreement with the observations in Drosophila, where protogenes that get fixed in populations are very often lost [60] despite showing signs of adaptation [61].
For \(d=0.005\), all 29 DFE parameters for which the locus is retained (Fig. 3D, black points) have more numerous positive effect mutations than negative effect mutations, and \(75\%\) of these have larger positive effect mutations than negative effect mutations. For the high value of \(d=0.01\), 20 DFE parameter sets are conducive to adaptation, and the locus is subsequently lost from \(>95\%\) of the replicate populations for all of these parameter sets. For \(d=0.05\), there are no conducive DFE parameter sets.
Functionality and expression as drivers of adaptation
In the model, we draw mutational fitness effect (\(\Delta F\)) and mutational effect on expression level (\(\Delta E\)) independently from their respective distributions. We then derive \(\Delta A\) using the relation in Eq. (2). Note that the distribution of \(\Delta A\) we report is a population measure which depends on the standing fitness variation. Hence, we expect the distinct population dynamics in the high and low mutation rate regimes to lead to distinct distributions of \(\Delta A\). The shape of the distribution of \(\Delta A\) that we derive here is an important indicator and reflects how single small mutations affect the functionality of newly emerging genes.
We also ask whether fitness increase during de novo gene birth is driven more by changes in adaptive value or expression level. As a measure of the strength of driving, we use correlations between the trajectory of population averaged fitness and the trajectories of population averaged adaptive value and expression level (although instantaneous values of fitness, expression level, and adaptive value have a simple relationship (\(F = A \times E\)), the trajectories of these quantities over evolutionary time also include the effect of selection and heritability, and their correlations are therefore nontrivial). To illustrate, although \(\Delta F(i)\) and \(\Delta E(i)\) for mutations are drawn from independent distributions, the process of selection effectively links fitness and expression level and imposes correlations between their evolutionary trajectories (compare Additional file 1: Figs. S19(A) and S20). The correlations between the trajectories of fitness, expression level, and adaptive value measure the degree to which the direction of changes in average fitness levels (increase/decrease) reflect the directions of change in average expression levels and adaptive values in the population.
High mutation rate: functionality drives sustained adaptation, while expression drives extreme mutational events
In this regime, our decomposition of fitness into expression level and adaptive value yields an exponential distribution for mutational effects on adaptive value (Fig. 4A, Additional file 1: Section 12). The shorttailed nature of \(\Delta A\) distribution implies that mutations with large effects on adaptive value are highly unlikely. Therefore, the extreme mutational effects on fitness, which underlie the long tails of DFEs, are likely to be due to large changes in expression level. We anticipate that our result on the \(\Delta A\) distribution can be measured experimentally; for instance, one could generate random mutants of known protogenes, and measure the fitness effects of these mutations using techniques demonstrated in [16]. We also expect that dynamics in the high mutation rate regime should be more suited to capture conditions in laboratory experiments (as in [47]).
At the same time, correlations between the trajectories of fitness, expression level, and adaptive value indicate that increase in fitness was driven more by the adaptive value than by expression level: the distribution of Pearson’s correlation coefficients for adaptive value is sharply peaked at 1, whereas that of expression level is spread broadly (Fig. 4B, Additional file 1: Fig. S19).
Overall, we find that in the high mutation rate regime, sustained adaptation during gene birth is driven more by the product’s adaptive value rather than its expression level, while extreme mutational events, which become important in facilitating adaptation when most beneficial mutations are small and infrequent (i.e., small f and p), are more likely to be due to changes in expression level.
Low mutation rate: expression and adaptive value both drive extreme mutational events
In this regime, our decomposition of fitness into expression level and adaptive value yields a heavytailed powerlaw distribution for mutational effects on adaptive value (i.e., a distribution with a powerlaw tail: Fig. 4C). Therefore, in contrast to the high mutation rate regime, extreme mutational effects on fitness are driven by large changes in both expression level and adaptive value. However, correlations between the trajectories of fitness, expression level, and adaptive value indicate that similar to the high mutation rate regime, increase in fitness was driven more by the adaptive value than by expression level (Fig. 4D).
Discussion
A majority of studies in genomics and genetics are concerned with the function and evolutionary course of known genes and their regulation. Recent discoveries have shifted our focus towards the evolution of nongenic loci, particularly, experimental studies that demonstrate the adaptive potential of random sequences [62,63,64,65]. Furthermore, genomics studies that indicate the frequent occurrence of de novo gene birth demonstrate a need for general, theoretical investigations of the evolution of nongenic loci [11]. In this work, we attempt to describe the process of functionalization of nongenic genomic loci in a simple population genetic model. In terms of the Pittsburg model proposed in [2], we investigate the process by which a genomic locus with physiological implications (contribution to organismal fitness) transforms into a locus with evolutionary implications. We explore the space of biologically reasonable effects of spontaneous mutations and identify features of genomic loci that are conducive to functionalization. We also demonstrate how the outcomes of nongenic evolution depend on mutation rate.
Our results present a contrasting picture for organisms with high mutation rates versus those with low mutation rates: in organisms with high mutation rates, our model predicts that a wide range of parameters that govern mutational fitness effects (DFE) are conducive to locus functionalization. Conducive DFEs include many conservative DFEs, where positive effect mutations are small and few. Adaptation in this regime was also robust to the antagonistic effects of structural variation that leads to locus deletion.
On the other hand, mutation rates for most organisms fall in the range \({10^{11}}\)\({10^{8}}\) per basepair per generation and are thus in the low mutation rate regime. In this regime, conditions for locus functionalization were much more stringent. Importantly, populations in this regime are sensitive to structural variations that cause locus deletion, and even adaptive loci are often lost. Thus, our model supports the view that gain for functionality by most nongenic sequences should be rare. This is in congruence with the observation that most de novo genes are born in regions adjacent to established genes, where sequences have an increased chance of expression though readthrough [66] and perhaps also a lower probability of deletion due to linkage.
In the special case where nongenic adaptation leads to de novo gene emergence, we use a simple relation to resolve the fitness contribution of de novo genes into its adaptive value and expression level. Using the distributions for fitness and expression level, we could infer the distribution of adaptive value, which is difficult to measure directly. We envisage that such a method, where information about measurable quantities can be used to infer constraints on distributions of quantities that are difficult to measure, can be extrapolated to get a deeper view into the mechanism of de novo gene birth: expression level can be further broken down into various molecular mechanisms such as the accessibility and affinity of the locus to polymerases, and the adaptive value can be expressed as a composite of the stability, foldability, and interactions of expression products.
Our results also suggests that the process of adaptation is likely to be different for de novo genes and established genes: we find that in the case of de novo gene birth, the increase in fitness was driven more by the adaptive value than by expression level in both the low and the high mutation rate regimes. This effect is likely to be a special feature of de novo gene birth, where initially both adaptive value and expression levels are very low, whereas in the case of established genes, evolution of expression level is known to play a role in adaptation [67,68,69].
In this study, we present a minimal framework to study nongenic adaptation. Below, we discuss some important limitations of this framework, and suggest directions in which the model can be expanded which are particularly relevant to the study of de novo genes:

Initial conditions: In our study, the locus of interest initially has no adaptive value and expression level corresponding to pervasive, leaky expression. However, induced expression studies on random sequences indicate that intergenic loci can be expected to have nonzero adaptive value [65]. Additionally, some nongenic regions where de novo genes have been found, such as 3′ ends of established genes, have higher expression levels than others [66]. These results warrant a wider sampling of initial values of adaptive values and expression levels.

RNA versus protein de novo genes: A de novo gene in our model can represent either RNA or protein gene, and both protein and RNA products of genes can contribute to organismal fitness [70]. However, protein and RNA gene emergence are expected to be different: (a) sequence requirements for translation are more stringent than the requirements for transcription; a protein product necessarily requires a ribosome binding site, start and stop codons, and the correct reading frame. This makes the probability that a locus first starts expressing a protein lower and increases the probability that mutations can destroy the product. Therefore, the forms of \(\Delta E\) and deletion probability d are likely to be different for proteins versus RNA. (b) ’Enabler mutations’ that allow a noncoding region to first start producing a protein product can be expected to have a large effect on fitness. Thus, improved models should consider protein and RNA gene emergence separately. This limitation also points to empirical measurements that could aid in the formulation of improved theoretical studies: for example, for de novo genes whose enabler mutations are known (e.g., [9, 71]), fitness effects of these enabler mutations can be experimentally measured.

Fitness contribution of DNA sequence versus expression product: Our decomposition of fitness contribution into adaptive value (A(i)), and expression level (E(i)) of the gene product represents the limiting case where the whole fitness contribution of the de novo gene is due to the gene product. However, especially for emerging genes, it is likely that a nonnegligible part of the fitness contribution comes from the DNA sequence itself due to its role in gene regulation or chromosome organization. While current experimental efforts are focused on the molecular basis of functionality of de novo gene products (such as in [16, 66]), delineating the relative contribution to fitness by the gene product versus the DNA sequence constitutes an important direction for future experiments. It would also be important to understand how this relative contribution of expression product versus gene sequence evolves during gene birth. In future versions of the model, this aspect can be incorporated in the form of a variable \(r \in [0,1]\), such that
$$\begin{aligned} F_{t+1}(i) = F_{t}(i) + \Delta F_{t}(i) ~~~~\text {} from (1)\\ \mathrm {and,~~}r_{t+1}(i) \times F_{t+1}(i) = (A_{t}(i) + \Delta A_{t}(i)) \cdot (E_{t}(i) + \Delta E_{t}(i)) ~~~~\text {}~(2^*) \end{aligned}$$where r represents the fraction of the fitness contribution that can be attributed to the gene product. In the current model, we explore a space where \(r=1\).

Genetic interactions: For simplicity, we consider genomic loci as independent, and do not include genetic interactions such as linkage and epistasis. However, many de novo genes are found in the vicinity of wellestablished genes [8]. Moreover, evolutionary trajectories of loci that are genetically linked to strongly selected, wellestablished genes can deviate significantly from our model prediction and hence warrant a separate treatment [72]. In future studies, one could extend the present model to investigate the evolutionary trajectories of two or more interacting loci, some of which represent established genes and have high initial fitness contribution (see Additional file 1: Section 3 for a discussion of a multilocus models). Alternatively, our model can also be used to represent mutation scan experiments, such as in [46] where the genomic background is kept constant. In this case, the timesteps in the model represent rounds of experiments involving mutagenesis and artificial selection.

Multicellular organisms: The same gene can have different expression levels in different celltypes of multicellular organisms, thus the adaptive value of a gene is now a function of its expression profile across different celltypes. Therefore, the decomposition of fitness contribution of a gene into its expression level and adaptive value is not straightforward for multicellular organisms.

Other important processes: In this work, we study the role of mutations caused by small indels and polymorphisms, for which quantitative DFE measurements are available. However, evidence also indicates a role for other molecular processes such as transposition, in de novo gene birth [60]. Thus, a more complete model would include fitness effects of transposition; although transposition is a less frequent event than indels/ polymorphisms, and we do not have extensive quantitative measurements of fitness effects.

DFE sampling: We sample a broad range of DFEs around an experimentally measured distribution of mutational fitness effects, which makes it likely that we cover the space of real DFEs. However, the distribution of real DFEs across genomic loci and different organisms is not wellknown, and our sampling of DFEs is not expected to be identical to the spread of real DFEs.
In addition, the generality of our results is likely to be limited due to dearth of relevant data. Most importantly, we use experimental measurements of DFE and mutational effects on expression that are taken from different organisms: in different organisms, distinct mechanisms produce mutations; therefore, the frequencies of different mutation types and its effects may vary, although the longtailed nature of DFEs [52, 53] and mutational effects on expression [46, 55] have both been observed in independent studies, measurements performed in the same organism could provide important details, for instance, the correlations between the effect of a mutation on expression and on fitness. Secondly, the DFE of loci remain constant in our model, while mutational fitness effects are known to vary over evolutionary time due to various causes, such as change of environment or diminishing returns epistasis [73,74,75]. An extended model that includes a consideration of DFE variability would provide valuable insight into the robustness of our results (Additional file 1: Section 3.3).
We anticipate that our results can be tested and the shortcomings of our model can be addressed through experiments, especially mutational scans such as those in [46]: for example, one could design experiments that monitor the fitness effects of mutations on random sequences and concomitantly detect expression from these random sequences. Alternatively, the evolution of adaptive value of expression products can be directly examined in experiments where random sequences are placed under constitutive, high expression promoters (such as in [62, 65]); in this case, the fitness effects of mutations directly correspond to the adaptive value of the product. These experiments, together with theoretical approaches like ours, provide us with means to test and compare the adaptive potential of nonfunctional genomic sequences and the general mechanisms of de novo gene birth across various organisms.
Conclusions
Our study represents the first population model to analyze the mechanism of de novo gene birth within the context of nongenic adaptation. Specifically, we characterize nongenic sequences in terms of their DFE: an empirically measurable quantity. We demonstrate how the DFE, a property of the genomic locus, and mutation rate, an important population dynamical quantity, interplay to influence the fixation of emerging genes. In its current minimal form, our model is already able to capture intriguing phenomena, such as the loss of adaptive protogenes [60, 61]. We discuss how our model can be expanded to include essential features relevant to de novo gene birth, such as the transcriptional piggybacking of emerging genes on nearby established genes [66], as well as the importance of untangling the fitness contributions of the DNA sequence, RNA product, and the protein product of newly emerging genes. Our framework can be used to explore a rich space of processes relevant to nongenic adaptation, and in conjunction with experiments such as mutational scans of intergenic sequences, it can significantly contribute to the discovery of mechanisms of de novo gene birth.
Methods
Surveying the space of DFE and locus deletion parameters in populations of various sizes
We look at populations of sizes \(N = 1000\) and scan across DFEs with parameters \(p = [0.001, 0.003, 0.005]\), \(f = [0.25, 0.5, 0.75]\), \(n = [{0.001}, 0.005, 0.01]\), and \(s = [0.1, 0.3, 0.6, 0.9]\). We look at locus deletion probabilities \(d = [0, 0.005, 0.01, 0.05]\). We perform simulations for each parameter set in both high and lowmutation rate regimes.
In addition, we also look at populations of size \(N=100\), for which we scan across parameters \(p = [0.001, 0.003, 0.005]\), \(f = [0.25, 0.5, 0.75]\), \(n = [0.001, 0.005, 0.01]\), \(s = [0.3, 0.6, 0.9]\), and \(d = [0, 0.005, 0.01, 0.05]\). For \(N=100\) populations, we test only the high mutation rate regime.
For each parameter set, we simulate 100 replicate systems. In all, we look at 118,800 systems: 86,400 \(N={1,000}\) systems and 32,400 \(N=100\) systems. All codes used to generate and analyze data are written in Python3.6.
Relative fitness of individuals
For a population of size N, fitness of individuals at timestep t are stored in the realvalued vector \(F_t\) of length N, where the fitness of any individual i is \(F_t(i)\). We also keep track of the individuals that have lost the locus due to deletion in the binary vector \(L_t\) of length N, such that \(L_t(i) = 1\) implies that individual i contains the locus at timestep t, and \(L_t(i) = 0\) implies individual i has lost the locus. Here, \(L_t(i)=0\) automatically implies \(F_t(i)=0\).
In the model, only individuals with fitness \(> 1\) are viable and capable of producing progeny. And individuals in the current population that produce progeny are selected on the basis of their normalized relative fitness.
Let \(\textrm{minfit}_t\) be the minimum fitness among viable individuals in \(F_t\).
We define \(\textrm{allfit}_t = \sum _{j} \left( 1 + F_t(j)  \textrm{minfit}_t \right)\), for j such that \(F_t(j) >1\). The normalized relative fitness of individuals is then given by the binary vector \(\textrm{relfit}_t\) of length N, where
Therefore, even if \(F_t(i)=0\), \(\textrm{relfit}_t(i)\) can be nonzero if \(\textrm{minfit}_t <0\).
Mutation dynamics
Progeny of the current population incur mutations. The mutation effects are drawn from 2sided gamma distributions governed by the parameters p (average effect of beneficial mutations), f (fraction of beneficial mutations), n (average effect of deleterious mutations), and s (shape parameter). The values of fitness effects of mutations incurred by each individual at timestep t is stored in a real valued vector \(\textrm{mut}_t\) of length N, where
Here, \(\Gamma \left( \kappa ,\theta \right)\) represents a number drawn from the gamma distribution with shape parameter \(\kappa\) and scale parameter \(\theta\), and \(\textrm{Ber}(f)\) is the Bernoulli random variable which equals 1 with probability f.
Progeny can also lose the locus with probability d. Thus, the updated fitness levels of the population is given by \(F_{t+1}(i) = 0\), if \(Anc_{t+1}(i)\) did not contain the locus or if the individual loses the locus in the current time step. Otherwise, \(F_{t+1}(i) = F_t\left( \textrm{Anc}_{t+1}(i) \right) + \textrm{mut}_t(i)\).
Method to update population fitness
In the model, the dynamics of selection and mutations interplay and effect how variants of the locus are inherited across model timesteps. We keep track of the ancestry of each individual across timesteps. We do this in order to trace back the last common ancestor. The conducivity of DFE parameters in the model depends upon whether or not the fitness contribution of the variant of the locus in the last common ancestor of individuals at \(t={1000}\) crosses the threshold of 0.1. Let \(\textrm{Anc}_{t+1} \in \mathbb {N}^{NX1}\) be the list of individuals chosen from the current timestep t to leave progeny. In other words, \(\textrm{Anc}_{t+1}\) is the list of ancestors of the population at timestep \(t+1\).

For populations with high mutation rate (\({<{2000}}\) generations between mutational timesteps): \(\Pr (\textrm{Anc}_{t+1}(j) = i) \propto \textrm{relfit}_t(i), \forall i,j \le N\), that is, more than one individual from timestep t leaves offspring in timestep \(t+1\).

For populations with low mutation rate (\({>{2000}}\) generations between mutational timesteps): a single variant gets fixed in the population, that is, \(\textrm{Anc}_{t+1}\) contains a single individual. The probability that an individual i is fixed in the population during the intervening generations between two mutational timesteps is proportional to relative fitness \(\textrm{relfit}_t(i)\) (see Additional file 1: Section 2).
Method to update expression level and adaptive value
In the model, we assume \(F(i) = A(i) \times E(i)\) for any individual i. For a population of size N, expression levels of the locus at timestep t are stored the real valued vector \(E_t\) of length N, where the expression level of some individual i is \(E_t(i)\). For an individual that has lost the locus due to deletion, \(L_t(i)=0\), which automatically implies \(E_t(i)=0\).
Initially, the expression level of the locus across the population is distributed around 0.001 and reflects leaky expression. At each time step, a mutation is incurred in the locus of interest by each individual in the population. The effect of each mutation has two independent components: one that effects expression level (\(\Delta E\)) and one that effects adaptive value (\(\Delta A\)). Therefore, at each time step, the expression levels and adaptive values across the population change as individuals are selected and their progeny incur mutations.
The effect of mutations on expression level incurred by each individual at timestep t is stored in real valued vector \(\Delta E_t\) of length N. The magnitude of \(\Delta E_t(i)\) are drawn from a power law distribution such that \(Pr(\Delta E_t(i) = x) = x^{2.25}\) for \(x\ge 0\). We assume that a \(\Delta E_t(i)\) is negative with probability 0.5.
The updated expression levels of the population are therefore given by \(E_{t+1}(i) = 0\), if \(Anc_{t+1}(i)\) did not contain the locus or if the individual loses the locus in the current time step. If the individual does contain the locus, \(E_{t+1}(i) = E_t\left( \textrm{Anc}_{t+1}(i) \right) + \Delta E_t(i)\).
Note that the values of expression level in the model are bounded within [0.001, 1] corresponding to leaky expression and maximal possible expression respectively. In the simulation, whenever \(E_{t+1}(i) < 0.001\) or \(E_{t+1}(i)>1\), we reset it to 0.001 and 1, respectively. Since the initial expression levels are very low, \(E_{t+1}(i)\) never crossed 1 in any simulation. In a run of 1000 time steps, \(E_{t+1}(i)\) crosses 0.001 on average 40 times (Additional file 1: Fig. S7).
We then calculate the corresponding changes in the adaptive value of the locus at each time step: \(A_t(i) = F_t(i)/E_t(i)\). From this, we can calculate the change in adaptive value due to mutation as \(\Delta A_t(i) = A_{t+1}(i)  A_t\left( Anc_{t+1}(i)\right)\).
Tracing ancestry to find fixed mutations
This method allows us to identify the mutant that gets fixed in the population at timestep \(t = 1000\). In the model, we count and report the number of replicate populations in which the fitness contribution of this fixed mutant crosses the threshold of 0.1.
For populations with high mutation rate: in order to find the fitness value of the mutant that gets fixed in the population at timestep t, we start with the list of ancestors of individuals \(\textrm{Anc}_t\) at timestep t. Let \(X_t = \{i, \forall i \in \textrm{Anc}_t\}\) be the set of unique ancestor identities. We then recursively find \(X_{tn} = \{i, \forall i \in \{\textrm{Anc}_{tn}(j),\, \forall j \in X_{tn+1}\}\}\) as the set of unique ancestor identities for \(n = 1,2,3...t_0\), where \(X_{tt_0}\) is the first singleton set encountered.
This set contains the individual which occurred at timestep \(tt_01\), whose locus variant is inherited by every individual at timestep t. The fitness value of the mutant fixed in the population at timestep t is then \(F_{tt_01}(i), \;\textrm{where}\; i \in X_{tt0}\).
For populations with low mutation rate: the fitness value of the mutant that gets fixed in the population at timestep t is simply \(\textrm{Anc}_t\).
Availability of data and materials
All data generated or analyzed during this study are included in this published article, its supplementary information files and publicly available repositories. Codes and data generated in this study are available at the Dryad repository: mani, somya (2023), data for ‘Gene Birth in a Model of Nongenic Adaptation’, Dryad, Dataset, https://doi.org/10.5061/dryad.fbg79cnxx.
Abbreviations
 DFE:

Distribution of fitness effects
References
Consortium EP, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57.
Keeling DM, Garza P, Nartey CM, Carvunis AR. The meanings of ‘function’ in biology and the problematic case of de novo gene emergence. Elife. 2019;8:e47014.
Van Oss SB, Carvunis AR. De novo gene birth. PLoS Genet. 2019;15(5):e1008160.
Takeda T, Shirai K, Kim Yw, HiguchiTakeuchi M, Shimizu M, Kondo T, et al. A de novo gene originating from the mitochondria controls floral transition in Arabidopsis thaliana. Plant Mol Biol. 2023;111(12):189–203.
Cai J, Zhao R, Jiang H, Wang W. De novo origination of a new proteincoding gene in Saccharomyces cerevisiae. Genetics. 2008;179(1):487–96.
Zhuang X, Yang C, Murphy KR, Cheng CHC. Molecular mechanism and history of nonsense to sense evolution of antifreeze glycoprotein gene in northern gadids. Proc Natl Acad Sci. 2019;116(10):4400–5.
Li CY, Zhang Y, Wang Z, Zhang Y, Cao C, Zhang PW, et al. A humanspecific de novo proteincoding gene associated with human brain functions. PLoS Comput Biol. 2010;6(3):e1000734.
McLysaght A, Hurst LD. Open questions in the study of de novo genes: what, how and why. Nat Rev Genet. 2016;17(9):567–78.
Zile K, Dessimoz C, Wurm Y, Masel J. Only a single taxonomically restricted gene family in the Drosophila melanogaster subgroup can be identified with high confidence. Genome Biol Evol. 2020;12(8):1355–66.
Vakirlis N, Carvunis AR, McLysaght A. Syntenybased analyses indicate that sequence divergence is not the main source of orphan genes. Elife. 2020;9:e53500.
Tautz D, DomazetLošo T. The evolutionary origin of orphan genes. Nat Rev Genet. 2011;12(10):692–702.
Casola C. From de novo to “de nono”: the majority of novel proteincoding genes identified with phylostratigraphy are old genes or recent duplicates. Genome Biol Evol. 2018;10(11):2906–18.
Casola C, Owoyemi A, Pepper AE, Ioerger TR. Accurate identification of de novo genes in plant genomes using machine learning algorithms. bioRxiv. 2022. https://doi.org/10.1101/2022.11.01.514720.
An NA, Zhang J, Mo F, et al. De novo genes with an lncRNA origin encode unique human brain developmental functionality. Nat Ecol Evol. 2023;7:264–78. https://doi.org/10.1038/s41559022019256.
Gubala AM, Schmitz JF, Kearns MJ, Vinh TT, BornbergBauer E, Wolfner MF, et al. The Goddard and Saturn genes are essential for Drosophila male fertility and may have arisen de novo. Mol Biol Evol. 2017;34(5):1066–82.
Vakirlis N, Acar O, Hsu B, Coelho NC, Van Oss SB, Wacholder A, et al. De novo emergence of adaptive membrane proteins from thyminerich genomic sequences. Nat Commun. 2020;11(1):1–18.
Katju V, Bergthorsson U. Old trade, new tricks: insights into the spontaneous mutation process from the partnering of classical mutation accumulation experiments with highthroughput genomic approaches. Genome Biol Evol. 2019;11(1):136–65.
Böndel KB, Kraemer SA, Samuels T, McClean D, Lachapelle J, Ness RW, et al. Inferring the distribution of fitness effects of spontaneous mutations in Chlamydomonas reinhardtii. PLoS Biol. 2019;17(6):e3000192.
Racimo F, Schraiber JG. Approximation to the distribution of fitness effects across functional categories in human segregating polymorphisms. PLoS Genet. 2014;10(11):e1004697.
Huber CD, Kim BY, Marsden CD, Lohmueller KE. Determining the factors driving selective effects of new nonsynonymous mutations. Proc Natl Acad Sci. 2017;114(17):4465–70.
Mérot C, Oomen RA, Tigano A, Wellenreuther M. A roadmap for understanding the evolutionary significance of structural genomic variation. Trends Ecol Evol. 2020;35(7):561–72.
Trost B, Loureiro LO, Scherer SW. Discovery of genomic variation across a generation. Hum Mol Genet. 2021;30(R2):R174–86.
Majic P, Payne JL. Enhancers facilitate the birth of de novo genes and gene integration into regulatory networks. Mol Biol Evol. 2020;37(4):1165–78.
Zhang W, Landback P, Gschwend AR, Shen B, Long M. New genes drive the evolution of gene interaction networks in the human and mouse genomes. Genome Biol. 2015;16(1):1–14.
Shen X, Song S, Li C, et al. Synonymous mutations in representative yeast genes are mostly strongly nonneutral. Nature. 2022;606:725–31. https://doi.org/10.1038/s4158602204823w.
CassidyHanley DM. Tetrahymena in the laboratory: strain resources, methods for culture, maintenance, and storage. Methods Cell Biol. 2012;109:237–76.
Milo R, Jorgensen P, Moran U, Weber G, Springer M. BioNumbersthe database of key numbers in molecular and cell biology. Nucleic Acids Res. 2010;38(suppl_1):D750–3.
Petersen J, Russell P. Growth and the environment of Schizosaccharomyces pombe. Cold Spring Harb Protoc. 2016;2016(3):pdb–top079764.
Harris EH. The Chlamydomonas sourcebook: introduction to Chlamydomonas and its laboratory use: volume 1, vol. 1. Academic Press; 2009.
Ishida M, Hori M. Improved isolation method to establish axenic strains of Paramecium. Jpn J Protozool. 2017;50(1–2):1–14.
Fey P, Kowal AS, Gaudet P, Pilcher KE, Chisholm RL. Protocols for growth and development of Dictyostelium discoideum. Nat Protoc. 2007;2(6):1307–16.
Silva RR, Moraes CA, Bessan J, Vanetti MCD. Validation of a predictive model describing growth of Salmonella in enteral feeds. Braz J Microbiol. 2009;40(1):149–54.
FernándezMoreno MA, Farr CL, Kaguni LS, Garesse R. Drosophila melanogaster as a model system to study mitochondrial biology. In: Mitochondria. Springer; 2007. p. 33–49.
Koornneef M, Scheres B. Arabidopsis thaliana as an Experimental Organism. In: eLS, (Ed.). 2001. https://doi.org/10.1038/npg.els.0002031.
Long H, Winter DJ, Chang AYC, Sung W, Wu SH, Balboa M, et al. Low basesubstitution mutation rate in the germline genome of the ciliate Tetrahymena thermophila. Genome Biol Evol. 2016;8(12):3629–39.
Zhu YO, Siegal ML, Hall DW, Petrov DA. Precise estimates of mutation rate and spectrum in yeast. Proc Natl Acad Sci. 2014;111(22):E2310–8.
Lee H, Popodi E, Tang H, Foster PL. Rate and molecular spectrum of spontaneous mutations in the bacterium Escherichia coli as determined by wholegenome sequencing. Proc Natl Acad Sci. 2012;109(41):E2774–83.
Farlow A, Long H, Arnoux S, Sung W, Doak TG, Nordborg M, et al. The spontaneous mutation rate in the fission yeast Schizosaccharomyces pombe. Genetics. 2015;201(2):737–44.
Ness RW, Morgan AD, Colegrave N, Keightley PD. Estimate of the spontaneous mutation rate in Chlamydomonas reinhardtii. Genetics. 2012;192(4):1447–54.
Sung W, Tucker AE, Doak TG, Choi E, Thomas WK, Lynch M. Extraordinary genome stability in the ciliate Paramecium tetraurelia. Proc Natl Acad Sci. 2012;109(47):19339–44.
Saxer G, Havlak P, Fox SA, Quance MA, Gupta S, Fofanov Y, et al. Whole Genome Sequencing of Mutation Accumulation Lines Reveals a Low Mutation Rate in the Social Amoeba Dictyostelium discoideum. PLoS ONE. 2012;7(10):e46759. https://doi.org/10.1371/journal.pone.0046759.
Lind PA, Andersson DI. Wholegenome mutational biases in bacteria. Proc Natl Acad Sci. 2008;105(46):17878–83.
Keightley PD, Trivedi U, Thomson M, Oliver F, Kumar S, Blaxter ML. Analysis of the genome sequences of three Drosophila melanogaster spontaneous mutation accumulation lines. Genome Res. 2009;19(7):1195–201.
Ossowski S, Schneeberger K, LucasLledó JI, Warthmann N, Clark RM, Shaw RG, et al. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science. 2010;327(5961):92–4.
Eckmann JP, Tlusty T. Dimensional reduction in complex living systems: where, why, and how. BioEssays. 2021;43(9):2100062.
Vaishnav ED, de Boer CG, Molinet J, Yassour M, Fan L, Adiconis X, et al. The evolution, evolvability and engineering of gene regulatory DNA. Nature. 2022;603(7901):455–63.
Dickinson BC, Packer MS, Badran AH, Liu DR. A system for the continuous directed evolution of proteases rapidly reveals drugresistance mutations. Nat Commun. 2014;5(1):5352.
Brazhnik P, De La Fuente A, Mendes P. Gene networks: how to put the function in genomics. Trends Biotechnol. 2002;20(11):467–72.
Clark MB, Amaral PP, Schlesinger FJ, Dinger ME, Taft RJ, Rinn JL, et al. The reality of pervasive transcription. PLoS Biol. 2011;9(7):e1000625.
Hebenstreit D, Fang M, Gu M, Charoensawan V, van Oudenaarden A, Teichmann SA. RNA sequencing reveals two major classes of gene expression levels in metazoan cells. Mol Syst Biol. 2011;7(1):497.
Lieberman E, Hauert C, Nowak MA. Evolutionary dynamics on graphs. Nature. 2005;433(7023):312–6.
EyreWalker A, Keightley PD. The distribution of fitness effects of new mutations. Nat Rev Genet. 2007;8(8):610–8.
Cotto O, Day T. A null model for the distribution of fitness effects of mutations. Proc Natl Acad Sci. 2023;120(23):e2218200120.
Ness RW, Morgan AD, Vasanthakrishnan RB, Colegrave N, Keightley PD. Extensive de novo mutation rate variation between individuals and across the genome of Chlamydomonas reinhardtii. Genome Res. 2015;25(11):1739–49.
HodginsDavis A, Duveau F, Walker EA, Wittkopp PJ. Empirical measures of mutational effects define neutral models of regulatory evolution in Saccharomyces cerevisiae. Proc Natl Acad Sci. 2019;116(42):21085–93.
Bao K, Melde RH, Sharp NP. Are mutations usually deleterious? A perspective on the fitness effects of mutation accumulation. Evol Ecol. 2022;36(5):753–66.
Despons A, Lacoste D, Peliti L. Adaptive strategy in Kelly’s horse races model. arXiv preprint arXiv:2201.03387. 2022.
Watson RA, Szathmáry E. How can evolution learn? Trends Ecol Evol. 2016;31(2):147–57.
Libchaber A, Tlusty T. Walking droplets, swimming microbes: on memory in physics and life. C R Mécanique. 2020;348(6–7):545–54.
Grandchamp A, Kühl L, Lebherz M, Brüggemann K, Parsch J, BornbergBauer E. Population genomics reveals mechanisms and dynamics of de novo expressed open reading frame emergence in Drosophila melanogaster. Genome Res. 2023;33(6):87290.
Zhao L, Saelao P, Jones CD, Begun DJ. Origin and spread of de novo genes in Drosophila melanogaster populations. Science. 2014;343(6172):769–72.
Hayashi Y, Sakata H, Makino Y, Urabe I, Yomo T. Can an arbitrary sequence evolve towards acquiring a biological function? J Mol Evol. 2003;56(2):162–8.
Yona AH, Alm EJ, Gore J. Random sequences rapidly evolve into de novo promoters. Nat Commun. 2018;9(1):1–10.
Lagator M, Sarikas S, Steinrueck M, ToledoAparicio D, Bollback JP, Guet CC, Tkačik G. Predicting bacterial promoter function and evolution from random sequences. Elife. 2022;11:e64543. https://doi.org/10.7554/eLife.64543.
Bhave D, Tautz D. Effects of the expression of random sequence clones on growth and transcriptome regulation in Escherichia coli. Genes. 2022;13(1):53.
Rich A, Acar O, Carvunis AR. Massively integrated coexpression analysis reveals transcriptional regulation, evolution and cellular implications of the noncanonical translatome. bioRxiv. 2023. https://doi.org/10.1101/2023.03.16.533058.
Fraser HB. Gene expression drives local adaptation in humans. Genome Res. 2013;23(7):1089–96.
Nourmohammad A, Rambeau J, Held T, Kovacova V, Berg J, Lässig M. Adaptive evolution of gene expression in Drosophila. Cell Rep. 2017;20(6):1385–95.
Blanc J, Kremling KA, Buckler E, Josephs EB. Local adaptation contributes to gene expression divergence in maize. G3. 2021;11(2):jkab004.
Wu DD, Zhang YP. Evolution and function of de novo originated genes. Mol Phylogenet Evol. 2013;67(2):541–5.
Guerzoni D, McLysaght A. De novo genes arise at a slow but steady rate along the primate lineage and have been subject to incomplete lineage sorting. Genome Biol Evol. 2016;8(4):1222–32.
Cutter AD, Jovelin R. When natural selection gives gene function the cold shoulder. BioEssays. 2015;37(11):1169–73.
Sane M, Diwan GD, Bhat BA, Wahl LM, Agashe D. Shifts in mutation spectra enhance access to beneficial mutations. Proceedings of the National Academy of Sciences. 2023;120(22):e2207355120.
Wünsche A, Dinh DM, Satterwhite RS, Arenas CD, Stoebel DM, Cooper TF. Diminishingreturns epistasis decreases adaptability along an evolutionary trajectory. Nat Ecol Evol. 2017;1(4):1–6.
Aggeli D, Li Y, Sherlock G. Changes in the distribution of fitness effects and adaptive mutational spectra following a single first step towards adaptation. Nat Commun. 2021;12:5193. https://doi.org/10.1038/s41467021254407.
Acknowledgements
We thank John McBride, Luca Peliti, and AnneRuxandra Carvunis for very helpful discussions. We also thank anonymous reviewers of the manuscript for their deep evaluation of the model and its results.
Funding
This work was supported by the Institute for Basic Science (Grant IBSR020D1).
Author information
Authors and Affiliations
Contributions
SM: conceived the project, designed research, developed models, performed simulations, analysed data and wrote the manuscript. TT: conceived the project, supervised research, and wrote the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Figures S1S21, Tables 13. Table 1. Estimates of one model timestep in various species. Figure S1. The distribution of relative fitnesses and absolute fitness values found to fix in the detailed simulation is statistically similar to the distribution of relative fitnesses and absolute fitness values of individuals simply picked with a probability proportional to their relative fitness. Figure S2. Comparison of adaptation of loci in the main model versus the multilocus model. Figure S3. Comparison of adaptation of loci in the main model versus a model where de novo gene evolution occurs in the presence of an independently evolving established gene. Figure S4. Trajectories of population averaged fitness subject to diminishing returns epistasis. Figure S5. Neutrality of mutants with fitness effect \(=0.5 \times 10^{3}\) in N = 1000 populations. Figure S6. Fraction of positive effect mutations (fitness effect > \(0.5 \times 10^{3}\)). Figure S7. Adjustments to \(\Delta E\) in a run of 1000 time steps for a population of 1000 individuals. Figure S8. Parameters conducive to adaptation in N = 100 vs N = 1000 populations in the high mutation rate regime. Figure S9. Nonconducive DFEs. Table 2. Pearson’s correlation coefficients between DFE parameters (rows) and PC1, PC2 and PC3 N = 1000 populations. Figure S10. Three Principal Components explain DFE conducivity. Figure S11. Tradeoff between the shape parameter and frequency and size of beneficial mutations for N = 1000 populations. Figure S12. Tradeoff between the shape parameter and frequency and size of beneficial mutations for N = 100 populations in the high mutation rate regime. Figure S13. Time of minimum average fitness effects probability of retention of the locus in N = 100 populations in the high mutation rate regime. Figure S14. Selection is essential for sustained fitness increase in the high mutation rate regime. Table 3. Pearson’s correlation coefficients between DFE parameters (rows) and time of minimum fitness for N = 100 and N = 1000 populations in the high mutation rate regime. Figure S15. Rare, largeeffect positive jumps lead to sudden fitness increase in trajectories of adaptation. Figure S16. Scatter plot showing the distribution of minimum average fitness and the time of minimum average fitness for populations which eventually crossed the fitness threshold in the high mutation rate regime. Figure S17. Trajectory of population average fitness in the low mutation rate regime. Figure S18. Effect of locus deletion in N = 1000 populations in the high mutation rate regime. Figure S19. Histograms for correlations between expression level, adaptive value and fitness trajectories for populations where the fitness threshold of 0.1 was crossed. Figure S20. Histogram for correlations between random expression level (E rand ) and fitness trajectories for populations where the fitness threshold of 0.1 was crossed. Figure S21. Scale of \(\Delta A\) exponential fits as a function of the model parameter p.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Mani, S., Tlusty, T. Gene birth in a model of nongenic adaptation. BMC Biol 21, 257 (2023). https://doi.org/10.1186/s12915023017455
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12915023017455