Comparative analysis of protein coding sequences from human, mouse and the domesticated pig

Background The availability of abundant sequence data from key model organisms has made large scale studies of molecular evolution an exciting possibility. Here we use full length cDNA alignments comprising more than 700,000 nucleotides from human, mouse, pig and the Japanese pufferfish Fugu rubrices in order to investigate 1) the relationships between three major lineages of mammals: rodents, artiodactyls and primates, and 2) the rate of evolution and the occurrence of positive Darwinian selection using codon based models of sequence evolution. Results We provide evidence that the evolutionary splits among primates, rodents and artiodactyls happened shortly after each other, with most gene trees favouring a topology with rodents as outgroup to primates and artiodactyls. Using an unrooted topology of the three mammalian species we show that since their diversification, the pig and mouse lineages have on average experienced 1.44 and 2.86 times as many synonymous substitutions as humans, respectively, whereas the rates of non-synonymous substitutions are more similar. The analysis shows the highest average dN/dS ratio in the human lineage, followed by the pig and then the mouse lineages. Using codon based models we detect signals of positive Darwinian selection in approximately 5.3%, 4.9% and 6.0% of the genes on the human, pig and mouse lineages respectively. Approximately 16.8% of all the genes studied here are not currently annotated as functional genes in humans. Our analyses indicate that a large fraction of these genes may have lost their function quite recently or may still be functional genes in some or all of the three mammalian species. Conclusions We present a comparative analysis of protein coding genes from three major mammalian lineages. Our study demonstrates the usefulness of codon-based likelihood models in detecting selection and it illustrates the value of sequencing organisms at different phylogenetic distances for comparative studies.


Background
Large scale sequencing projects of many different species allow us to investigate phylogenetic issues in much more detail and to identify whether certain genes have had an extraordinary evolution in one or more species and thus gain insight into the actions of natural selection. Despite the sequencing of an increasing number of mammalian genomes and the implementation of more sophisticated evolutionary models using maximum likelihood and Bayesian methodology, the branching order within the mammalian phylum is still not completely resolved. The main reason for this uncertainty is that the diversification of these orders occurred over a short period of time, making the inference of branching order a difficult problem. One of the highly debated issues concerns the relative order of branching among primates, artiodactyls and rodents [1][2][3][4][5][6][7][8][9]. Here, the Japanese pufferfish Fugu rubrices is used as an outgroup to estimate the branching order of the three species relative to each other.
Codon based models [10,11] allow for powerful analysis of protein coding nucleotide sequences. Evolutionary hypotheses may be tested using likelihood ratio tests between nested models. For an introduction to the practical use of these models see [12], for a more thorough review of the methodology see [13]. The parameter of primary interest is the ratio of nonsynonymous to synonymous substitutions (ω), also known as the dN/dS ratio. The dN/dS ratio measures the relative importance of evolutionary forces that have shaped a particular protein. A dN/dS ratio significantly larger than one strongly suggests that positive Darwinian selection has acted on the protein. Different extensions to the basic codon model exist, and these can be divided into three main categories: (1) Lineage-specific models that average ω over sites but differentiate between lineages [14]; (2) site-specific models that average ω over lineages but differentiate over sites [15]; (3) branch-site specific models that combine the two previous extensions by allowing ω to vary over sites in all background lineages, but allow for a different value of ω in one or more pre-specified lineages [16]. The models we use here and their relationships are shown in Table 1. Numerous studies have shown the ability of the site-specific and the branch-site specific models to detect positive selection in cases where the branch-specific models did not, indicating that averaging over sites is generally a more serious problem than averaging over lineages and that in many cases using a branch-site specific model increases the power to detect positive selection [17][18][19][20][21][22].
In a recent study of cDNA trios of human, mouse and chimpanzee a codon based branch-site specific model was used to search for human genes that have undergone positive selection since our divergence from other primates [23]. Here, a similar search is done on a different phylogenetic level using a collection of porcine genes. While the study by Clark and colleagues concentrates on the divergence between humans and chimpanzees (branch a in Figure 1) our study searches for genes that have undergone positive selection since the divergence of primates, artiodactyls and rodents. Several recent studies have shown that some of the branch-site specific models under certain conditions might have a high false positive rate when used to detect positively selected sites [24,25]. This problem has recently been addressed by Yang and colleagues with the implementation of a new Bayes empirical Bayes (BEB) method for predicting positively selected sites. This new method is much better at avoiding false positives while still retaining a high sensitivity (Z. Yang, pers. comm.). Here we use the new and improved BEB version of the branch-site specific model originally presented in [23] to detect genes that may have been influenced by positive selection. The parameters used are (κ) transition / transversion ratio, (τ) branch length, (ω) dN/dS ratio, (p) fraction of codons that fall into the specified ω category.

Results
The distribution of sequence lengths of the 1120 threespecies alignments is shown in Figure 2. Since the full length cDNAs were assembled from random ESTs, there is a bias towards assembling relatively short genes. Therefore the subset of genes used in this analysis is not a random sample from the pig genome. This decreases the power of our evolutionary tests, since short alignments have less power when testing for positive selection, but we do not anticipate any other systematic bias in our results.

Mammalian phylogeny
The relative branching order of the three mammalian species was investigated with the individual genes as well as with a concatenated super gene. Using the empirical amino acid substitution model of Whelan and Goldman [26] we maximized the likelihood under the three conflicting topologies shown in Figure 3a-c. In 123 of the 988 alignments all amino acids are identical in the three mammalian species giving us no information to discriminate between the three topologies. Of the remaining 865 alignments 245 favour topology A, while 440 and 180 favour topology B and topology C respectively. A concatenated super gene of all 988 alignments clearly favoured topology B over topology A, which again has a higher likelihood than topology C, consistent with the results from the individual gene comparisons (Table 2.).
We used the baseml program of PAML to compare the three topologies in a nucleotide based framework. Different nucleotide based substitution models were used to maximize the likelihood on the three topologies for each of the three codon positions separately. The results of using different models of nucleotide evolution were highly similar so here we only discuss the results obtained with the HKY85 model [27]. The results based on the third codon position shows that Fugu is too distantly related to the three mammals to be informative in placement of the root of the mammals (results not shown). The first and second codon positions do not show such saturation and should therefore be useful in comparing the three topologies. Consistent with the results based on the amino acid substitution model we see that topology B is favoured in most genes, followed by topology A and topology C, respectively. The actual numbers from the second codon position are 215, 386 and 179 in favour of topology A, topology B and topology C respectively and 208 alignments are uninformative. The corresponding numbers for the first codon position are 215, 545, 175 and 53 (Table 2.).
The internal branch is rather short in all cases. Therefore in the remaining analyses we treat the mouse, human, pig split as a trifurcation. Depending on which topology is actually the right one, the only bias introduced by treating the topology as a star tree, as shown in Figure 3d, is a minor overestimation of the branch length of the species that actually roots the other two.

The rates of evolution
The three-species alignments were used to estimate the synonymous and nonsynonymous substitution rates of the three branches under the free ratio model, see Table 3. Figure 4a-f shows the distribution of the synonymous and nonsynonymous branch lengths for each gene in all three species. The synonymous rates are significantly different between the three species. The average synonymous substitution rate, estimated using the concatenated super gene, is approximately 2.86 times larger in mouse compared to pig, and approximately 1.44 times larger in pig than in human. The nonsynonymous rates are more similar among the three species. The corresponding values for the nonsynonymous rates are 2.08 and 1.17 respectively. Table 3 shows the mean, median and variance of both the synonymous and nonsynonymous rate distributions as well as the values obtained from the concatenated super gene. The average values from the individual genes are highly similar to the results obtained from the concatenated super gene.

Positive Darwinian selection
The dN/dS ratios on the three different lineages were estimated under the free ratio model (Figure 4g-i). Most genes in all three species have an average dN/dS ratio very close to zero with the average dN/dS ratio higher in human than in pig, which again is higher than in the mouse lineage.
Phylogenetic tree of key mammalian species Figure 1 Phylogenetic tree of key mammalian species. A schematic drawing showing the topologies considered in our study compared to a recent study on human, chimpanzee and mouse trios [23]. Branch a shows the branch considered in the study by Clark et al (2003) while branch a+b represents the evolutionary time scale studied here.
The one ratio model averages over sites and lineages, which makes this an extremely conservative method of detecting positive selection. Only four of the 1120 threespecies alignments have an average dN/dS ratio larger than one, see Table 4, and of those only one is significantly larger than one (XM_165930). The free ratio model allows each lineage to have its own dN/dS ratio. This model has slightly more power than the one ratio model due to its ability to find lineage specific signals. The likelihood ratio test (LRT) of these two models should not be considered as a stringent test for positive selection, but more as a test for different selective forces among lineages. The LRT shows that 154 genes have significantly different dN/dS ratios among lineages at the 5% significance level, 73 at 1% and 41 at the 0.1% level of significance. Table 5 shows the 24 genes that have a dN/dS ratio larger than one in one or more lineages as well as the result from each gene of a LRT that tests whether the estimated value of ω is significantly larger than one. As with the one ratio model only one gene shows a result significantly larger than one. The gene is the same one as reported with the one ratio model (XM_165930) and the lineage with a dN/ dS ratio significantly larger than one is the lineage leading to pig.
Several studies have shown that averaging over sites is more conservative when searching for positive selection than is averaging over lineages. The branch-site specific model A and model B [16] were originally designed to search for genes where only a small fraction of codons in a specific foreground lineage has evolved under positive selection. Several studies have shown that the original models are prone to predicting false positives under certain conditions, and one should therefore be very careful drawing conclusions from studies based on those models.
Here we use a new and improved version of a branch-site model developed for the analyses of human, chimpanzee and mouse gene trios [23]. The new model we use here is implemented in PAML v. 3.14 and uses the new and improved Bayes empirical Bayes approach to predict Distribution of sequence alignment lengths  Table 6.
The molecular function of the genes predicted to be under positive selection was determined using the Panther server [28] and the NCBI server using the newest build of the human genome. Both annotation servers are updated on a regular basis when new information becomes available.
During the course of this study the annotation of several genes changed. Of our 1120 alignments 188 are currently not annotated as functional genes indicating that they might possibly be pseudogenes in human; see the Discussion for more details on this subject. The proportion of genes that we report to have undergone Conflicting mammalian phylogenies Figure 3 Conflicting mammalian phylogenies. A schematic drawing of the three conflicting bifurcating topologies (a-c) as well as a multifurcating alternative (d). The divergence times shown in (a) are million years from present [31].

Star topology
positive selection in the human lineage at the 5% level of significance can therefore be viewed as either 58/1120 5.2% or 43/931 ~4.6%, indicating that possible pseudogenes are only slightly overrepresented in the genes predicted to have undergone adaptive evolution. The genes predicted to have been under positive selection in the pig and mouse lineage show a similar trend.
Several different models have been developed that allow for heterogeneity of ω over sites in an alignment. We used the M4 model [15] which allows each codon to fall into one of 5 categories corresponding to ω equal to 0, 1/3, 2/ 3, 1 and 3. The first category represents the fraction of codons that have evolved under strong purifying selection allowing no nonsynonymous changes to occur. The next two categories represent different intensities of purifying selection. The category with ω = 1 represents neutrally evolving sites, while the last category with ω = 3 represents codons that have evolved under positive selection. The results of this analysis on the concatenated super gene can be seen in Table 7. Only 1.6 % of all codons appear to have evolved under positive selection, and approximately 69 % have been under strong functional constraints.

Codon usage bias
The concatenated super gene was also used to investigate the patterns of codon usage in the three species; the results of this investigation are summarized in Table 8. A test for equal codon distributions is rejected in all three pair wise comparisons (P < 0.0001, 60 d.f.). Using nucleotide frequencies to estimate the codon equilibrium frequencies fits the data poorly, so does the equal frequency model (Table 9). For a description of the codon equilibrium fre-quency models, see the Methods. The F3 × 4 model was extended with one extra parameter that accounts for CpG avoidance at the second and third codon position. Since all changes in the second position of a codon are nonsynonymous, the frequency of NCG codons is     it is still better to use the actual codon frequencies, but with individual genes the number of codons can sometimes be so small that the use of actual codon counts can be problematic. We also implemented a similar model that incorporated the avoidance of CG in first and second position by introducing an additional parameter but this does not improve the fit of the model significantly (results not shown). This is probably caused by the fact that all four codons with CG in the first and second position code for the same amino acid, Arginine. Arginine has six different codons and the two codons without a CG pair (AGA and AGG) are generally favoured over the other four (Table 8), but this tendency is apparently accounted for when modelling nucleotide frequencies at the three codon positions, so here we only present the model that accounts for CpG avoidance at the second and third codon position. Table 9 shows that the choice of codon equilibrium frequency model has detectable effects on the parameter estimates. Most striking is the apparent overestimation of the transition/transversion ratio and the dN/ dS ratio when the model is less parameter-rich.

Discussion
The phylogeny of the early mammalian radiation has been extensively debated over the last two decades. The classical view based on fossil evidence states that all major orders of placental mammals first appear right after the Cretaceous-Tertiary (KT) boundary approximately 65 million years ago [29]. This sudden appearance of all major placental orders is known as the mammalian radiation. With the use of molecular data this late radiation has been challenged and it is now widely accepted that the radiation of the placental orders probably occurred many million years before the KT boundary [29][30][31]. Molecular data have also been used to investigate the relative branching orders of many of the larger clades of placental  mammals [1][2][3][4][5][6][7]9,30]. One of the issues that have been debated extensively is the placement of Rodentia in the placental tree. Some studies favour a basal placement of the rodents [1,[3][4][5]32,33] while other studies favour a sister relationship between primates and rodents [6][7][8].
Recently strong evidence based on insertions, deletions and ancient transposable elements in favour of a sister relationship of primates and rodents has been reported [2,34].
The incongruence of single gene phylogenies was investigated in a recent study of eight yeast species [35]. The phylogeny commonly believed to be correct is completely resolved when concatenating 20 or more randomly chosen genes to form a super gene. A concatenated multi gene approach was also shown to resolve single gene incongruences in a recent study on green algae [36]. Here we use 988 full cDNA alignments comprising 672,918 nucleotides to investigate the branching order of the three mammalian species. We present results based on both sin-  The concatenated super gene is used to estimate the distribution of dN/dS ratios over sites. Each codon is allowed to fall into one of the five predefined dN/dS ratio classes. The branch lengths are expressed as number of substitutions per codon.
gle gene phylogenies and a concatenated super gene. All genes including the concatenated super gene were analysed with both nucleotide and amino acid based substitution models. All methods favour a primate-artiodactyls clade with rodents as an outgroup but with a relatively short internal mammalian branch, indicating that the mammalian radiation happened within a short period of time. The different methods used in this study have very different assumptions but they all show the same general results. The HKY85 model takes into account differences in nucleotide frequencies and transition/transversion biases and allows for differences in substitution rates among the lineages. However, it is still possible that complexities unaccounted for such as non-stationarity and irreversibility of the substitution process have created biases that lead to long-branch attraction of Fugu and Mouse and an erroneous conclusion. Furthermore, the incongruence between our analysis and many recent studies is also affected by the following. (1) The choice of outgroup; bony fishes are believed to have diverged approximately 450 million years ago [31], making saturation effects in synonymous sites a real problem. We are therefore forced to only consider nonsynonymous sites or amino acid replacements in the phylogenetic analyses. The recently completed genome sequence of the chicken (Gallus gallus) shows that the average value of dS between human and chicken genes is approximately 1.66 [37], which indicates that many genes may still be too distantly The frequencies are expressed as a percentage of the 240,048 codons in each of the three species. Human(H), Mouse(M), Pig(P). Stop codons are not allowed in the analyses (*). related for synonymous sites to avoid problems with saturation. A marsupial species would provide a much better outgroup when available [3,32]. (2) Taxon sampling; by only using three species the variance of the parameter estimates can be quite high and the power to discriminate between two conflicting topologies quite low. The sequencing of more species will lessen this problem. (3) Overly simplistic evolutionary models; here we use only nucleotide and amino acid based models. If a more closely related outgroup was available the use of more complex codon based models could be beneficial in resolving the apparent conflict. Several extensions have been made to the codon models during the past few years. One obvious extension to the codon models is a model that incorporates CG avoidance within and over codon boundaries. This will clearly improve the fit of the data to the model and therefore give more accurate parameter estimates. Including context dependencies over codon boundaries and information about protein structure have also been shown to increase the fit of the models to protein coding data and therefore should result in better parameter estimates [38,39]. (4) Gene trees and species trees can be different; the split between the three groups probably occurred within a very short period of time, allowing for the possibility that different genes actually have different phylogenies due to ancient polymorphisms at the time of the speciation. Using even larger number of genes and a sufficiently sophisticated model should lessen this problem [35,36].
The rate of synonymous substitution was estimated to be almost three times higher in rodents than in other mammals, in agreement with previous investigations that also showed an elevated rate in rodents [40][41][42]. This has historically often been explained by a generation time effect. Species that have short generation times experience more generations in the time span we consider and consequently they will experience more neutral substitutions over time. The fact that the pig, which has a generation time intermediate between mouse and humans, has an intermediate rate of synonymous substitutions, seems to agree with this theory. For a more thorough discussion of the generation time hypothesis in mammals see [43]. The nearly neutral theory of molecular evolution predicts that the generation time effect should be smaller for non-synonymous substitutions [42,44,45]. The simple argument is that animals with short generation times such as rodents often have a very large effective population size. In a population with a large effective population size slightly deleterious mutations will be removed from the gene pool more effectively than in a population with a small effective population size, where genetic drift will reduce the efficiency of natural selection. Figure 4g-h shows the distribution of the dN/dS ratio in the three lineages. The average dN/dS ratio is highest in humans suggesting a small effective population size, while it is smallest in mouse suggesting a larger effective population size.
Previous studies of the occurrence of positive selection based on pair wise comparisons have revealed a very low occurrence of positive selection. In a study of 3595 alignments only 17 genes showed evidence of positive selection [46]. The branch specific models used here only find one gene where the dN/dS ratio is significantly larger than one. The gene reported is XM_165930. XM_165930 was originally annotated as being similar to cold shock domain protein A, but it has recently been removed from Genbank as a result of standard genome annotation processes.
Codon based branch-site models similar to the ones used here were used in a paper based on a three way comparison among chimpanzees, humans and mice [23]. They report that approximately 1.6 % of all the genes studied have been undergoing positive selection in the lineage leading to modern humans. Using a similar criterion our study indicates that approximately 3.0 % of the genes studied have been undergoing positive selection on the lineage leading to humans; the corresponding numbers for pig and mouse are 2.0 % and 2.2 % respectively. When comparing these two studies it is important to consider the following three things: (1) the relatively short average length of the genes studied here decreases the power of the models to detect positive selection; (2) the use of the new BEB method for detecting positively selected sites should reduce the number of false positives, making our estimates more conservative and more accurate; (3) our study deals with a completely different phylogenetic level, covering a much longer time span than the study by Clark and colleagues.
The multiple testing and the small number of taxa used in a study like this imply that the results presented should not be taken as conclusive evidence for positive selection, but more as an approach to searching among the thousands of genes to look for genes that may have evolved in a biologically interesting manner. Comparative approaches such as the one we use here can only be a first step towards showing that positive Darwinian selection may be a key part in the evolution of many different gene families. Further experimental and computational analyses must then be used to investigate the suggested candidates more thoroughly.
During the course of our investigation a large fraction of the genes were re-annotated as putative pseudogenes: 188/1120 ~16.8%. However, all these genes have uninterrupted reading frames in all three species; only a tiny fraction of all codons seems to have evolved in a neutral-like fashion (ω~1), and the distributions of the synonymous as well as the nonsynonymous rates of these putative pseudogenes are almost identical to the distributions of the remaining genes (results not shown). The only difference is a slight increase in the dN/dS ratio in the human lineage, which is actually due to a few genes that experience an unusually high dN/dS ratio. Omitting these genes from the analysis removes the observed differences completely. Thus, if all these genes are indeed pseudogenes in human, the loss of function must have occurred quite recently and they may not be pseudogenes in pig and mouse.

Conclusions
The collection of a large set of pig cDNA sequences has enabled us to study long term evolutionary trends in mammalian genes. Our results indicate that the codon models are able to detect evolutionary signals indicating adaptive evolution in several genes. Our phylogenetic investigation of the primate, rodent, artiodactyl split disagree with most recent findings in favouring a primate, artiodactyl clade with rodents as an outgroup. Our study indicates that several genes that are not classified as genes in the most recent human annotation might after all be real genes; or at least they have become pseudogenes very recently, and the orthologous genes in mouse and pig might still be functional. This shows the potential of comparative methods in identifying functional regions of the genome.

cDNA alignment
Complete cDNAs from the domesticated pig Sus scrofa were assembled at the Danish Institute of Agricultural Sciences from cDNA libraries from 100 different tissues constructed in the following way. Total RNA was purified from selected tissues using RNeasy (Qiagen) and Poly(A+) mRNA was selected using Oligotex (Qiagen). Directional cloneable cDNA was synthezised from Poly(A+) mRNA using the cDNA Synthesis Kit (Stratagene) and was ligated into Eco RI/Xho I digested pTrueBlue (GenomicsOne) followed by electrotransformation into E. coli XL1-Blue MRF' (Stratagene). 5'-EST sequencing was performed using standard protocols (Applied Biosystem). The sequences were trimmed to the longest open reading frame and the termination codons were removed.
Homologues sequences from human, mouse and the Japanese pufferfish Fugu rubrices were obtained with the blastall program with default parameters; the E-score was set to 10 -8 . We constructed two different datasets, one with and one without Fugu rubrices. Individual alignments were made using ClustalW version 1.83 with default parameters [47]. We kept the pig reading frame intact in the alignments by removing any columns where the align-ment gave rise to gaps in the pig sequence. Alignments that resulted in premature stop codons, or were shorter than 30 codons, were removed. We used the one ratio model to estimate the total branch length of the tree as well as the synonymous branch lengths. These distributions were used to detect peculiar genes where one or more sequences might not be a true orthologue, and all outliers were thereafter removed from the dataset. This analysis gave 1120 alignments of mouse, human and pig, and of these 988 also included Fugu. The 1120 original cDNAs from Sus scrofa have been deposited in Genbank with the following accession numbers: AY609387-AY610506.

Phylogeny and rates of evolution
Nine hundred and eighty-eight four-species alignments were concatenated into a super gene. The three topologies were compared using the super gene as well as each individual gene. Both nonsynonymous nucleotide substitutions and amino acid substitutions were investigated with PAML v. 3.14 [48]. The nonsynonymous substitutions were represented by the first and second codon positions of all codons, and the three different topologies were investigated with baseml using the HKY85[27] model (model = 4) of nucleotide substitutions. The likelihood was then maximized under the three different topologies using all the individual genes as well the concatenated super gene. The codeml program with the codons translated to amino acids (seqtype = 3) were also used to investigate the three topologies. We used different models of amino acid evolution to maximize the likelihood under the three topologies and since the results were highly similar we only present the results from the empirical method of Whelan and Goldman (model = 2, aaratefile = wag.dat) [26].
Using the 1120 three species alignments, the synonymous and nonsynonymous rates of evolution were estimated with the codeml program (seqtype = 1) using the free ratio model (model = 2) with the transition/transversion ratio estimated from the data (fix_kappa = 0).

Investigation of selection
The different tests for positive Darwinian selection are all based on extensions of the basic codon based likelihood model [11]. Likelihood ratio tests (LRTs) were used to compare nested models where one allows for positive selection and the other does not. The probability that a codon i substitutes into another codon j during the time interval t is determined by the rate matrix Q = (q ij ) with entries for i ≠ j, with corresponding substitution probability matrix given by exp(Qt). Here π j is the equilibrium codon frequency of codon j, κ is the transition/transversion ratio and ω is the dN/dS ratio. All parameters are estimated independently for each gene. The star topology of the three species is used to estimate the branch lengths (τ human , τ pig , τ mouse ) for synonymous and non-synonymous substitutions.
Positive selection was tested in two different ways. Test 1 averages over sites but differentiates among lineages. The LRT compares the free ratio model where all three lineages have a different value of ω estimated from the data with the one ratio model where all three lineages share a common value of ω [14]. We note that this test is more a test of variable dN/dS ratios among lineages than a test for positive selection. The free ratio model has three parameters for ω and the one ratio model only one. The LRT statistic is calculated as 2 times the differences in maximum log likelihood and is asymptotically distributed as a χ 2 distribution with 2 degrees of freedom. The genes found in one or more lineages evolving with a dN/dS ratio > 1 are compared to a nested model where the dN/dS ratio is fixed at 1 in the lineages shown to have a dN/dS ratio larger than one to see whether the result can be attributed to natural selection or just relaxation of selective pressures.
Test 2 is based on a new and improved version of the branch-site method presented in [23]. We will refer to this model as model A. The LRT is based on a comparison of the neutral model and model A. The neutral model assumes two categories of sites, a proportion p 1 of sites where ω 1 are estimated from the data and is forced to lie between 0 and 1, and a proportion p 2 of neutrally evolving sites where ω 1 = 1 (p 1 + p 2 = 1). Model A furthermore allows a pre-specified branch to have a proportion of sites that evolve with a different value of ω estimated from the data. This value cannot be smaller than 1. The LRT follows a χ 2 distribution with 2 degrees of freedom. If the value of ω in the foreground lineage is estimated to be equal to one the model collapses to the neutral model.
PAML v. 3.14 [48] was used to estimate likelihood and parameters under each model. Codon equilibrium frequencies can be estimated from data using either simple proportions in the full data set (the CT model with 60 parameters), assuming equal frequencies (Fequal model), multiplying overall counts of nucleotide frequencies (F1 × 4 model, 3 parameters) or counts of nucleotide frequencies for each codon position (F3 × 4 model, 9 parameters). The codon table (CT) was used for analysis of the concatenated super gene and the F3 × 4 model was used on the individual genes.