Accelerated gene evolution and subfunctionalization in the pseudotetraploid frog Xenopus laevis
© Hellsten et al; licensee BioMed Central Ltd. 2007
Received: 30 January 2007
Accepted: 25 July 2007
Published: 25 July 2007
Ancient whole genome duplications have been implicated in the vertebrate and teleost radiations, and in the emergence of diverse angiosperm lineages, but the evolutionary response to such a perturbation is still poorly understood. The African clawed frog Xenopus laevis experienced a relatively recent tetraploidization ~40 million years ago. Analysis of the considerable amount of EST sequence available for this species together with the genome sequence of the related diploid Xenopus tropicalis provides a unique opportunity to study the genomic response to whole genome duplication.
We identified 2218 gene triplets in which a single gene in X. tropicalis corresponds to precisely two co-orthologous genes in X. laevis, the largest such collection published from any duplication event in animals. Analysis of these triplets reveals accelerated evolution or relaxation of constraint in the peptides of the X. laevis pairs compared with the orthologous sequences in X. tropicalis and other vertebrates. In contrast, single-copy X. laevis genes do not show this acceleration. Duplicated genes can differ substantially in expression levels and patterns. Based on molecular and biological function ontologies, we find no significant difference in gene content between the duplicated set and the single-copy set.
These results support a scenario in which duplicate genes are retained through a process of subfunctionalization and/or relaxation of constraint on both copies of an ancestral gene.
Gene duplication followed by functional divergence is widely recognized as an important mechanism for the evolution of novelty [1, 2]. On a small scale, local tandem duplications can rapidly produce new gene families, such as the Hox cluster in animals, the olfactory receptors in vertebrate genomes, and numerous other examples in plants [5, 6], protists and other lineages. Recently duplicated genes have a strong tendency to become pseudogenes, and will generally be lost due to disabling mutations unless positive selection preserves the duplicate loci. Based on the divergence of surviving gene pairs in diverse genomes, the typical lifetime of duplicated genes in a diploid background has been estimated to be several million years.
On a grander scale, entire genomes can be duplicated by polyploidization so that the cells of the resulting organism find themselves with two copies of every gene. Again, there is presumably a strong tendency towards rapid differential loss due to mutation of superfluous copies, and the long-term effect on the genome is elimination of most of the duplicate loci. In the case of polyploidy, the population dynamic and stoichiometric effects are different from the case of a localized duplication in a diploid background. Loss of a copy of a locally-duplicated gene simply restores the pre-duplication genome. In contrast, in the case of whole genome duplication the polyploid population is presumably reproductively isolated from its diploid brethren, and inactivation/loss of one of a pair of duplicate sequences puts that gene at half the copy number of the remaining loci, at least in the early stages of rediploidization. As haploinsufficiency is relatively rare, reduced copy number is not by itself an overwhelming impediment to large scale loss, as is evident from analysis of surviving duplicates in the Arabidopsis, rice, teleost, and yeast genomes [9, 11–13].
Early thoughts on the selective forces leading to duplicate gene retention centered on divergence in protein function. This suggests that one or both copies could acquire novel and/or complementary biochemical functions that would render both copies indispensable. It was further recognized that novel or complementary organismal functions could arise from differential regulatory mutations [14, 15]. Thus, if duplicate genes become expressed in different cell types or developmental stages, they might become indispensable and resistant to loss even if their associated peptides remain interchangeable. Through this mechanism, novel spatiotemporal roles can emerge, with numerous individual examples of cis- or trans-regulatory subfunctionalization known, for example, in teleost fish.
The well-studied amphibian Xenopus laevis has a chromosome number (2N = 36) and genome size (~3 Gb) roughly double those of its congener Xenopus (formerly Silurana) tropicalis (2N = 20, ~1.5 Gb) [16, 17]. This difference is attributed to a merger of two diploid progenitors ~40 million years ago [16, 18–20]. Allotetraploidy is suggested by the ease with which modern Xenopus species can form hybrids via unreduced gametes. However, we cannot rule out an autotetraploid origin. In this latter case, the duplicated pairs would be identical at the duplication event, whereas in the allotetraploid case such pairs would represent orthologs from the speciation event of the progenitors and might have separated at slightly different epochs prior to their last common ancestor, depending on the level of polymorphism at speciation. However, the differences in measurable terms are subtle, and in the following we refer to polyploidization events as genome duplications regardless of their origin. The X. laevis genome duplication is significantly more recent than the teleost-specific duplication (~350 million years ago (Mya)) [11, 21] and the ancient vertebrate-specific duplications (> 500 Mya) [22, 23]. However, it is older than the typical lifetime of duplicated genes in a diploid background (several million years). Thus, by comparing X. laevis and X. tropicalis gene pairs, we can analyze an animal gene complement relatively soon after rediploidization, taking advantage of large-scale genome sequence data.
Results and Discussion
To study the evolution of duplicate gene pairs in X. laevis relative to their unique orthologs in X. tropicalis, we identified 20223 X. laevis open reading frames (ORFs) from an assembly of over half a million expressed sequence tags (ESTs) and transcripts, and compared them with each other and with a set of 24957 predicted transcripts from the X. tropicalis genome project (PM Richardson et al, unpublished results). Over half of the X. laevis ORFs in our set appear to be complete, that is, with a plausible start and stop codon.
From this analysis we identified 9574 likely X. laevis-X. tropicalis (LT) orthologous genes. A simple molecular clock estimate puts the divergence of the X. laevis and X. tropicalis lineages at ~50 Mya, and the genome duplication event at ~40 Mya, consistent with mitochondrial data and a previous analysis of a dozen duplicated genes.
Guided by Figure 1, we conservatively identified pairs of X. laevis paralogs for 2218 of the LT genes. These define high confidence LLT triplets such that (a) the X. laevis pair arose during the whole genome duplication event and is retained in the modern pseudotetraploid genome within the expressed gene dataset, and (b) the single X. tropicalis gene is the unique ortholog. X. laevis paralogs are arbitrarily designated L1 and L2; both are "co-orthologs" of the corresponding X. tropicalis gene. This set represents the largest collection of such triplets from any whole genome duplication event in animals: three to four times larger than in teleost fish [26, 27] and four to five times larger than in previous work on Xenopus [28, 29]. Zebrafish duplicates from the much older teleost genome duplication show near-saturation at the synonymous codon positions (Figure 1a) [27, 30].
How many of the ancient duplicated X. laevis gene pairs have subsequently lost one of the copies? This number cannot be accurately determined with only a partial collection of X. laevis genes based on ESTs. Nevertheless, we can crudely estimate a likely loss range of 50–75%, as discussed in the Methods section.
X. laevis paralogs show an enhanced rate of amino acid change relative to X. laevis-X. tropicalis orthologs. Table columns: amino acid substitutions per site; nucleotide transversions per synonymous site; rate of amino acid substitution per unit nucleotide change; Corr 4DTv (σ).
To test whether the acceleration found in X. laevis is a feature of retained gene duplicates or simply a feature of all genes in that lineage, we compared genes possessing observed paralogs with apparent single copy genes by identifying two mutually exclusive sets of orthologs from the five species. Set 5A consists of the original sextuplets with one of the X. laevis paralogs randomly removed from each gene. The 5B quintuplets each have only a single X. laevis gene with no known recent (4DTv < 0.2) paralogs. Significantly accelerated evolution in X. laevis peptides is found only in genes with a confirmed paralog (Figure 3). For the X. laevis genes without recent observed paralogs, the normalized peptide to nucleotide ratio is 1.11 ± 0.027, much closer to the ratio of 1 seen between the other species. Owing to the incompleteness of the EST-derived X. laevis gene set, we expect some of the 5B genes to have paralogs that are not observed in the available X. laevis expressed gene set. The observed ratio can be explained if ~20% of the 5B genes have as yet undetected paralogs with the same pattern of evolution as those in 5A.
To study the peptide evolution in X. laevis paralogs further, we identified 148401 highly-constrained positions in the six-way LLTHMR multiple alignments, defined as positions with an identical amino acid in human, mouse, rat and at least two of the three frog orthologs. The vast majority of these sites (97.1%) were identical across all six peptides, but 4272 sites (around five residues per peptide) varied in just a single frog sequence. Of these, 26% (1090/4272) occurred in X. tropicalis, with ~37% in each of the X. laevis paralogs. Thus, even at highly-conserved positions, duplicate X. laevis genes appear to be accepting additional substitutions that are eliminated by purifying selection in other species. Similar observations have been made in a number of previous studies (see, for example, Koonin and references therein).
Gene content in X. tropicalis genes within LLT triplets compared to the reference set of all X. tropicalis genes with X. laevis orthologs. Table columns: number within reference set; number within LLT triplets; expected based on reference; over- or under-represented (±). Rows include: select regulatory molecule; double-stranded DNA binding.
Differential expression levels measured using the four largest X. laevis EST sets show that a significant fraction of doublets are differentially expressed. Table columns: ESTs hitting probes; number of probes hit; N ≥ 16; P ≤ 0.01.
A striking example is skp1a, whose amino acid sequence is 100% identical in all three frog peptides. This peptide is therefore under strong selection across its entire length. One paralog is expressed in the kidney and multiple head structures, whereas the other paralog is either not expressed there or only weakly so. These data support studies of other gene pairs in X. laevis (and zebrafish) that show subdivided expression patterns relative to single copy counterparts in mammals.
The duplication of an entire genome is a spectacular natural experiment in which tens of thousands of genes are effectively duplicated synchronously, so that each gene has a matched "paralogous" partner with a highly similar or identical sequence and chromosomal context. Subsequent divergence, loss, and rearrangement then gradually erode the signs of duplication. Whole genome duplication can be a powerful evolutionary force, but the polyploidies and subsequent rediploidization that occurred early in the vertebrate and teleost lineages are so ancient (~500 Mya and ~350 Mya, respectively) that the immediate evolutionary response is obscured in modern genomes. Genome tetraploidization occurred more recently in the evolution of X. laevis and with extensive genomic and cDNA sequencing available this provides a unique opportunity to analyze a genome in the process of reacting to a recent tetraploidization.
We identify more than 2200 cases in which a single gene in X. tropicalis possesses precisely two co-orthologous genes in X. laevis, both of which have survived until the present – the largest such collection of orthologs from an animal whole genome duplication. Analysis of such triplets reveals an accelerated evolution, or relaxation of constraint, in the peptides of the X. laevis duplicates compared to their orthologs in X. tropicalis and other vertebrates. In contrast, X. laevis genes for which only one duplicate is retained do not appear to show such acceleration. This is a subtle effect for any single gene, affecting on average only ~1–2 amino acids per peptide, and can only be confidently established by means of the large number of genes available for analysis. The relaxed constraint experienced by retained duplicates is consistent with overlapping/redundant biochemical functions.
The response to genome duplication, however, is more complex than simply relaxing sequence constraints. In one notable example, duplicate X. laevis genes produce identical peptides that are also identical to their (single) X. tropicalis ortholog. In this case, and in other examples studied with in situ hybridization, the X. laevis duplicates were found to be expressed in different patterns during development. We looked for other examples of differential gene expression by considering EST counts in deeply sequenced cDNA libraries, and found that a significant fraction (about one third to one half) of duplicate genes show divergent expression levels in specific tissues. These results are consistent with the subfunctionalization model for the retention of duplicated genes [14, 15], in which paralogs acquire complementary coding and/or cis-regulatory mutations that leave both copies subject to purifying selection. These changes must occur rapidly, as the lifetime of truly redundant duplicates would be short (a few million years) due to (a) the ease with which single nucleotide mutations across a gene can generate a null allele, and (b) the expected nearly neutral selection on such a null allele in the presence of a second locus of identical function.
While whole genome duplications are found in the ancestry of vertebrates, teleost fishes, yeasts, and multiple angiosperm lineages, there are relatively few cases in which a duplicated genome has a natural unduplicated sister sequence that can provide a recent comparative reference. For example, tetrapods can serve as a sister taxon for the study of the teleost duplication, but with a divergence of ~450 million years; for Arabidopsis, the related taxa all share either more ancient duplications or their own unique duplications that complicate analysis.
The X. tropicalis/X. laevis system provides an ideal testing ground for ideas about whole genome duplication, as the timing of the X. laevis tetraploidization is neither "too recent" compared with the lifetime of a duplicated locus, nor "too ancient" for measures of nucleotide variation to have reached saturation. The X. tropicalis genome is available in draft form (Richardson et al, unpublished results). As we have shown, the divergence of the two X. laevis sub-genomes is extensive, comparable to the divergence between mouse and rat. This suggests that whole genome shotgun approaches would successfully capture the genic regions of the X. laevis genome and provide a unique comparative reference for the study of genome evolution.
Identification of X. laevis ORFs from DFCI (TIGR) gene indices
We downloaded 39724 tentative clusters (TCs) from the X. laevis TIGR gene index version 9.0 (now known as the DFCI indices). All open reading frames (ORFs) in the 5' to 3' direction at least 150 nucleotides long were extracted, translated, and compared against the annotated set of X. tropicalis genes (JGI, version 4.1; Department of Energy Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA) using BLASTP with default settings, retaining hits of E-value 1e-10 or better.
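As a rough illustration of the extraction step, the sketch below scans the three forward reading frames of a transcript and reports stop-free stretches of at least 150 nucleotides. Treating an ORF as a stop-to-stop stretch (so that 5'-truncated ORFs are kept) and the function name `forward_orfs` are assumptions made for illustration; the source does not specify how ORFs were delimited.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def forward_orfs(seq, min_len=150):
    """Return (start, end, frame) for stop-free stretches >= min_len nt.

    Coordinates are 0-based and end-exclusive, and exclude the stop codon.
    Only the three forward (5'->3') frames are scanned.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = frame
        # Walk codon by codon; a stop codon closes the current stretch.
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos - start >= min_len:
                    orfs.append((start, pos, frame))
                start = pos + 3
        # Stretch running to the 3' end without a stop: keep whole codons only.
        end = start + ((len(seq) - start) // 3) * 3
        if end - start >= min_len:
            orfs.append((start, end, frame))
    return orfs
```

Selecting the first ATG within a reported stretch, as described in the text for complete ORFs, would be a separate step applied to these coordinates.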
In more than 95% of the cases where an X. laevis TC had sequence similarity to an X. tropicalis gene, the longest ORF was also the ORF that showed the best BLAST score. In these cases, the longest ORF was selected. In 20% of the remaining 5% the longest ORF still showed similarity, in which case it was selected. Hence, the longest ORF is picked in about 96% of all cases in which the TC has sequence similarity to X. tropicalis. In cases where no ORF with sequence similarity exists, the longest ORF was picked, provided that it is at least 300 bases long. Such ORFs are not used in the present analysis. Otherwise, no ORF is annotated for the TC.
In the relatively few remaining cases, we adopted the following heuristics.
In about half the cases in which the longest ORF does not show sequence similarity but a shorter ORF does, the shorter ORF starts immediately at the 5' end, suggesting that the TC is incomplete in the 5' end. In such cases, the incomplete ORF was selected. If the ORF with similarity did not start at the 5' end, we chose the longest ORF if this was longer than 300 bases and the shorter ORF was not. We used this rationale because transposons and low-complexity regions within UTRs occasionally trigger a short ORF with similarity. If the TC has a relatively long ORF, we would suspect that to be the 'real' gene.
In the few remaining cases where both the longest ORF and the homologous ORF are shorter than 300 bases (but longer than 150 bases), we selected the homologous ORF, suspecting that a frame shift or sequencing error could have truncated this ORF.
Many TCs are incomplete at the 5' end. Hence, if the longest ORF started right at the 5' end, we included the entire CDS, even if the translated ORF did not start with a methionine. If the ORF was internal to the TC (i.e., three nucleotides immediately 5' of the ORF start translate into a stop codon), we interpret the gene as complete with 5' UTR. We report only on the CDS from the first ATG if it is longer than 150 nucleotides, unless the translated ORF has clear hits to a X. laevis gene at least 20 amino acids upstream of the first methionine, in which case the entire frame will be reported. The latter scenario could conceivably result from a sequencing error.
This annotation procedure resulted in 24674 candidate transcripts and peptides, 20825 of which show significant (< 1e-10) similarity to human genes. A total of 11711 (47.4%) were deemed partial by the above criteria. Some of the transcripts might be alternatively spliced versions of the same gene, which we identified by having evolutionary distances of 0, or close to 0. To reduce the number of shorter forms of alternatively spliced genes we applied the following filtering procedure. From the all-against-all Smith-Waterman alignment of the peptides described below, we evaluated 4DS distances, i.e., the fraction of four-fold degenerate third codon positions showing a nucleotide substitution. For all pairwise alignments with at least 25 conserved four-fold degenerate codon positions and not a single substitution observed, the shorter of the transcripts was marked as a short alternative splice form, and excluded from further analysis. A total of 1777 transcripts were filtered out in this manner, leaving 22897 X. laevis genes, 19211 of which showed similarity to human genes, 19598 to X. tropicalis genes, and 20223 to either X. tropicalis or human. These 20223 peptides and corresponding CDS sequences were used in subsequent analysis.
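The splice-form filter can be sketched as follows. This is a minimal illustration that assumes the pairwise counts of conserved four-fold degenerate sites and substitutions have already been computed; the function name, the tuple layout and the tie-breaking rule are hypothetical.

```python
def short_splice_forms(pairs, lengths, min_sites=25):
    """Return the set of transcript IDs flagged as short alternative forms.

    pairs   : iterable of (id_a, id_b, n_4d_sites, n_4d_substitutions)
    lengths : dict mapping transcript ID -> transcript length
    """
    flagged = set()
    for id_a, id_b, n_sites, n_subs in pairs:
        # Enough conserved 4D sites and not a single substitution observed.
        if n_sites >= min_sites and n_subs == 0:
            # Flag the shorter transcript; on a tie, id_b is flagged (arbitrary).
            shorter = id_a if lengths[id_a] < lengths[id_b] else id_b
            flagged.add(shorter)
    return flagged
```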
Identification of LLT orthologous triples
We aim to identify unambiguous sets of L1–L2-T triplets where L1 and L2 are the only two known recent copies in X. laevis and have an evolutionary distance consistent with originating from the whole genome duplication epoch, whereas the X. tropicalis version T does not have any known recent paralogs. We first performed all-against-all double affine Smith-Waterman alignments of the peptides in X. laevis and X. tropicalis using a TimeLogic DeCypher system (Active Motif, Inc., 1914 Palomar Oaks Way, Suite 150, Carlsbad, CA 92008) with BLOSUM62 scoring matrix, gap opening penalty -15, gap extension penalty -2 until gap size 10, with no additional extension penalties. We identified the conserved four-fold degenerate amino acids within the alignments, extracted the corresponding codons in the underlying DNA sequence and calculated the 4DTv distance ($D_{4DTv}$) between each aligning pair as the fraction of four-fold degenerate (4D) third codon positions in which transversions are observed to have occurred. This provides a measure of the evolutionary distance between genes that is largely independent of the gene family, unlike measures based on peptides. $D_{4DTv}$ ranges from 0 for recently duplicated peptides to ~0.5 for paralogs that are so old that third codon nucleotides have essentially been randomized. Assuming that transversions occur independently, with equal probability at all 4D sites, we can correct for multiple substitutions using the simple formula:
$$D_{4DTv,\mathrm{corr}} = -\frac{1}{2}\ln\left(1 - 2D_{4DTv}\right)$$
In addition, we calculated the fraction of 4D sites that had experienced any substitution, transition or transversion, $D_{4D}$. This distance measure gives better resolution for recent paralogs.
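The two distance measures and the multiple-hit correction can be illustrated with a short sketch. It assumes a pair of codon-aligned, gap-free coding sequences and the standard genetic code; requiring both codons to share the same four-fold degenerate prefix is a simplification of "conserved four-fold degenerate amino acids", and the function names are hypothetical.

```python
import math

# Codon prefixes (first two bases) whose third position is four-fold
# degenerate in the standard genetic code.
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}
PURINES = {"A", "G"}

def fourfold_distances(cds_a, cds_b):
    """Return (n_sites, D_4DTv, D_4D) for two codon-aligned, gap-free CDSs."""
    n_sites = n_tv = n_any = 0
    for i in range(0, min(len(cds_a), len(cds_b)) - 2, 3):
        a, b = cds_a[i:i + 3].upper(), cds_b[i:i + 3].upper()
        # Simplification: require the same four-fold prefix in both codons.
        if a[:2] != b[:2] or a[:2] not in FOURFOLD_PREFIXES:
            continue
        if a[2] not in "ACGT" or b[2] not in "ACGT":
            continue  # skip undetermined bases
        n_sites += 1
        if a[2] != b[2]:
            n_any += 1
            # A transversion exchanges a purine and a pyrimidine.
            if (a[2] in PURINES) != (b[2] in PURINES):
                n_tv += 1
    if n_sites == 0:
        return 0, float("nan"), float("nan")
    return n_sites, n_tv / n_sites, n_any / n_sites

def correct_4dtv(d):
    """Multiple-hit correction; diverges as d approaches 0.5 (saturation)."""
    if d >= 0.5:
        return float("inf")
    return -0.5 * math.log(1.0 - 2.0 * d)
```

The correction is small for recent pairs (a raw distance of 0.05 becomes roughly 0.053) and grows without bound as the raw distance approaches the saturation value of 0.5.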
Of the 9905 mutual best hitting X. laevis-X. tropicalis pairs, 9574 (almost 97%) have 4DTv ≤ 0.2. These genes are almost certainly truly orthologous pairs. Of these, 843 have one or more recent paralogs in X. tropicalis, defined as having 4DTv < 0.2 to a homologous X. tropicalis gene. We eliminated these genes from consideration, as the functional evolution is more difficult to interpret when multiple paralogs are present. For each of the remaining 8731 pairs, we identified an unambiguous LLT triplet if the X. laevis gene was a member of one of the 2875 doublets previously identified. This method resulted in 2218 unambiguous LLT triplets used in the study. The CDS and peptide sequences of these triplets, along with identifiers mapping the X. laevis genes to their corresponding TCs, are available in Additional file 1. The sequence similarity between a pair of X. laevis CDS sequences in a triplet is typically ~93%, whereas in the less conserved corresponding UTR regions it is no more than 85–87%, with several gaps in the alignments. Clearly, paralogs from the duplication event are sufficiently distinguishable for correct assembly of the EST clusters. In addition, the distinct UTR regions allow for selection of unique probes for our in situ hybridizations, as described later.
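The selection logic can be sketched in a few lines. This is an illustrative outline rather than the pipeline used in the study: the input structures (a score table keyed by gene pair, a list of doublets, a table of 4DTv distances) and both function names are hypothetical.

```python
def mutual_best_hits(scores):
    """scores: dict mapping (laevis_id, tropicalis_id) -> alignment score."""
    best_for_l, best_for_t = {}, {}
    for (l, t), s in scores.items():
        if l not in best_for_l or s > scores[(l, best_for_l[l])]:
            best_for_l[l] = t
        if t not in best_for_t or s > scores[(best_for_t[t], t)]:
            best_for_t[t] = l
    # Keep a pair only if each gene is the other's best hit.
    return [(l, t) for l, t in best_for_l.items() if best_for_t[t] == l]

def llt_triplets(ortholog_pairs, doublets, d4dtv, t_with_paralogs, max_d=0.2):
    """Combine L-T ortholog pairs with X. laevis doublets into (L1, L2, T)."""
    partner = {}
    for a, b in doublets:
        partner[a], partner[b] = b, a
    triplets = []
    for l, t in ortholog_pairs:
        # Drop distant pairs and X. tropicalis genes with recent paralogs.
        if d4dtv[(l, t)] > max_d or t in t_with_paralogs:
            continue
        if l in partner:
            triplets.append((l, partner[l], t))
    return triplets
```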
Estimate of the fraction of retained duplicate genes
We made two rough boundary estimates of the fraction of originally duplicated genes that have been retained in modern X. laevis. First, we have seen in the previous section that of 8731 L-T orthologs, 2218 were found to have a second L co-ortholog, which would suggest a retention fraction of $f = 2218/8731 = 0.25$. However, this must be a minimum estimate, as some co-orthologs will inevitably be missed owing to the incompleteness of the X. laevis gene set. At the other extreme, we can assume that for any L-T orthologous pair, the probability $p_{\mathrm{miss}}$ of missing an existing co-ortholog because of incompleteness is $1 - N_{EST,L}/N_{tot,L}$, where $N_{EST,L}$ is the number of X. laevis genes in our EST-based set and $N_{tot,L}$ is the total (unknown) number of genes in the X. laevis genome. The latter can be expressed in terms of the gene count $N_{tot,T}$ of the X. tropicalis genome if we assume that the two genomes differ mainly in the presence of duplicate genes. In that case $N_{tot,L} = (1+f)\,N_{tot,T}$, where $f$ is the retention fraction. Combining this with the expression for $p_{\mathrm{miss}}$ above, and using the approximation $N_{EST,L} = N_{tot,T} = 20000$ genes, we get $p_{\mathrm{miss}} = f/(1+f)$. The total number of L-T orthologs with retained co-orthologs, corrected for incompleteness, is then $8731 f = 2218 + (8731-2218)\,p_{\mathrm{miss}}$. Substituting $p_{\mathrm{miss}}$ and solving for $f$ gives $f = 0.5$; that is, half the original duplicates are still present. This is likely to be an upper estimate, as the calculation of $p_{\mathrm{miss}}$ assumes that any gene has an equal probability of being in the X. laevis EST set. In reality, once we have observed one co-ortholog in this set, the other co-ortholog, if it exists, could well have a larger-than-average probability of being included, because the set is biased towards highly expressed genes.
Based on these estimates, we conclude that at least 25% and at most 50% of the duplicated genes in X. laevis have been retained. Interestingly, from the study of the quintuplets in the results section, we argued that we could account for the observed patterns of acceleration if 20% of the 5B (single-copy) genes had undetected co-orthologs. This would be consistent with a retention rate of f ~ 40%.
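The arithmetic behind the two bounds can be checked directly. The sketch below uses only the counts quoted in the text; the variable names are ours, and the last calculation is one plausible way to arrive at the ~40% figure, not necessarily the authors' own.

```python
import math

N_LT = 8731    # L-T ortholog pairs without recent X. tropicalis paralogs
N_LLT = 2218   # pairs with an observed second X. laevis co-ortholog

# Lower bound: assume no co-ortholog was missed.
f_min = N_LLT / N_LT

# Upper bound: solve N_LT*f = N_LLT + (N_LT - N_LLT) * f/(1+f).
# Multiplying through by (1+f) cancels the linear terms and leaves
# N_LT * f**2 = N_LLT, so f = sqrt(N_LLT / N_LT).
f_max = math.sqrt(N_LLT / N_LT)

# Cross-check: suppose 20% of the apparently single-copy genes have an
# undetected co-ortholog (the figure from the quintuplet analysis).
f_mid = (N_LLT + 0.20 * (N_LT - N_LLT)) / N_LT

print(round(f_min, 3), round(f_max, 3), round(f_mid, 3))
```

Under these assumptions the three values come out at about 0.25, 0.50 and 0.40, matching the bounds and the intermediate estimate given in the text.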
Multiple sequence alignment and peptide evolution analysis
We performed multiple sequence alignments of the LLT triplets using the ClustalW program with default settings, and extracted blocks of gap-free aligning sequence flanked by fully conserved amino acids and allowing no more than four consecutive positions of non-conserved amino acids within each block. A total of 2135 of the triplets had at least 50 amino acids in such highly-conserved blocks, which concatenated into 513188 amino acid residues for which combined P-distances (i.e., fractions of differing amino acids) and 4DTv distances could be evaluated. The results are shown in Table 1.
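One way to read the block definition is sketched below: a block is a gap-free run of alignment columns that begins and ends on a fully conserved column and contains no run of more than four consecutive non-conserved columns. This is an interpretation of the description above, and the function name and input format are assumptions.

```python
def conserved_blocks(aligned, max_var_run=4):
    """Return (start, end) column ranges (end-exclusive) of conserved blocks.

    aligned: list of equal-length aligned peptide strings ('-' marks a gap).
    """
    blocks = []
    start = last_cons = None   # first / most recent conserved column in block
    var_run = 0                # current run of non-conserved columns
    for col, residues in enumerate(zip(*aligned)):
        is_gap = "-" in residues
        is_cons = not is_gap and len(set(residues)) == 1
        if is_cons:
            if start is None:
                start = col
            last_cons, var_run = col, 0
            continue
        var_run += 1
        # A gap, or too long a variable run, closes the open block.
        if start is not None and (is_gap or var_run > max_var_run):
            blocks.append((start, last_cons + 1))
            start = None
    if start is not None:
        blocks.append((start, last_cons + 1))
    return blocks
```

The P-distance between two sequences would then be the fraction of differing residues summed over the columns inside these blocks.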
Where $P_{\mathrm{Bin}}(N, N_1+N_2)$ is the binomial probability function. This method can only detect significantly skewed (i.e., ~10 or more AA changes) evolution of peptides. That is, we do not have the statistical power to identify cases where a single change at a strategic site changes the function of the peptide.
To evaluate the relative expression of members in X. laevis doublets, we aligned the nucleotide sequences of the 2135 confirmed doublets using BLASTn with an E-value cutoff of 1e-100. If the aligning sequence, stripped of gaps, was longer than 199 bases, we used this sequence pair as a probe set against which ESTs from any library could be aligned. By this method we were able to construct 2070 pairs of probes. The members of each pair are sufficiently distinct from each other (mean and median ~92.7% identity) that the correct match for a given EST can be identified unambiguously. As quite a few ESTs contain undetermined bases, and SNPs could be present, we do not always see a 100% match. We define any hit to a member of a probe set with better than 98.5% identity as a match.
To test whether X. laevis pairs differed significantly in expression level, we performed a statistical analysis similar to that performed to detect asymmetric evolution in peptides. For each pair of EST hits (N1, N2), where N1 and N2 are the number of ESTs compatible with probe 1 and probe 2, respectively, we calculated the probability of the observed result or a more extreme one under the hypothesis that the two genes in the probe pair are equally expressed, i.e., have an equal probability of being assigned an EST. This probability, evaluated using the normal approximation to the binomial distribution, constitutes a p-value for each of these 130 probe pairs.
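A minimal sketch of such a test is given below. It assumes a two-sided test without continuity correction, neither of which is stated in the text, and the function name is hypothetical.

```python
import math

def expression_p_value(n1, n2):
    """Two-sided p-value for H0: both paralogs are equally likely to yield an EST."""
    n = n1 + n2
    if n == 0:
        return 1.0
    # Under H0, n1 ~ Binomial(n, 0.5): mean n/2, standard deviation sqrt(n)/2.
    z = abs(n1 - n / 2) / (math.sqrt(n) / 2)
    # Two-sided normal tail: P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))
```

For example, 12 versus 4 ESTs gives z = 2 and a p-value near 0.046, whereas 16 versus 0 gives z = 4 and a p-value well below 0.01. With small counts the normal approximation is coarse, so an exact binomial tail would be the more careful choice.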
Identification of 301 candidate doublets from zebrafish whole genome duplication
The zebrafish doublets shown in Figure 1a were determined as follows. The Ensembl models v. 24.4.1 were aligned to each other and to the Ensembl models v. 26.35.1 for human on TimeLogic DeCypher™ using the same parameter settings as for the frog alignments, and 4DTv distances were determined for each pair with 25 or more 4D codon sites. A single-linkage clustering of paralogs hitting each other with a P score < 1e-20 was then performed, and all clusters with more than eight members were rejected as promiscuous genes. On the remaining set, we performed a mutual-best-hit search that excluded (a) hits with 4DTv distance < 0.25 (recent paralogs), and (b) hits between genes on the same chromosome within 5 megabases of each other. Such hits arise from tandem duplications or recent paralogs and hence are not candidates for the zebrafish whole genome duplication. From the remaining pairs, we removed (a) pairs whose two members had different orthologs in human, as determined by mutual best hits (paralogs preceding the human-fish lineage split), and (b) pairs with no human orthologs (and hence undatable). In the remaining cases, we performed multiple sequence alignments of the human-Zf1-Zf2 triplets and calculated the P-distances in conserved, gap-free blocks. We then retained the pairs in which the Zf1-Zf2 P-distance was shorter than either human-Zf1 or human-Zf2, as these are likely to be the result of a duplication event that happened after the human-Zf split. The 4DTv distance distribution for the 301 remaining pairs is shown in Figure 1a.
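The single-linkage clustering step can be illustrated with a small union-find sketch. The input is assumed to be the list of gene pairs that already passed the score cutoff; the function name and data layout are hypothetical.

```python
from collections import defaultdict

def single_linkage_clusters(hits, max_size=8):
    """Cluster genes linked by any hit; drop clusters larger than max_size.

    hits: iterable of (gene_a, gene_b) pairs that passed the score cutoff.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in hits:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                 # merge the two clusters

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    # Clusters with more than max_size members are treated as promiscuous.
    return [members for members in clusters.values() if len(members) <= max_size]
```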
Comparison to other vertebrates
We compared the sequence evolution rates of the LLT triplets to human, mouse, and rat genes in the following manner. For each of these three species, we downloaded the set of Ensembl gene models and, using only the longest gene at each locus, we identified blocks of conserved synteny between each pair of species using a Perl implementation of the following algorithm. For the first pairwise alignment of genes in the proteomes of the two species, the gene locations on the chromosomes are recorded and a one-pair segment of conserved synteny is defined. Each subsequent gene pair either defines a new segment or, if the genes in both species are located within a specified maximum distance from a gene pair in an existing segment, is added to that segment. If a pair can be added to two segments, these segments are joined into a larger segment of conserved synteny. After traversing all alignments, we have a set of conserved syntenic regions, on which we can impose a minimum member limit (typically three pairs) to remove spurious regions. In the vertebrates, regions of conserved synteny can extend over several hundred genes. A gene in one species can, and usually does, form part of more than one block of conserved segments. However, the longest such block usually defines the orthologous region, whereas smaller blocks are remnants of either ancient genome duplications or recent segmental duplications. For the purpose of this study, we retained only the strictest set of orthologs, confirmed by the longest block of conserved synteny covering the area, and excluded all genes found to be members of a tandemly duplicated family, in order to avoid mis-identified orthologs. For human-mouse, ~95% of the synteny-confirmed orthologous pairs are also mutual best hits to each other. A total of 9852 X. tropicalis genes have synteny-confirmed orthologs with at least one human, mouse, or rat gene, and 5475 have synteny-confirmed orthologs in all three.
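The segment-building procedure described above can be sketched as follows. This is a simple quadratic-time illustration in Python rather than the original Perl code; the function name, the input layout and the use of a single `max_dist` for both species are assumptions.

```python
def synteny_segments(pairs, max_dist, min_pairs=3):
    """Group aligned gene pairs into segments of conserved synteny.

    pairs: iterable of ((chrom_a, pos_a), (chrom_b, pos_b)) gene locations
           in species A and species B.
    """
    def near(p, q):
        # Same chromosome and within max_dist in BOTH species.
        return all(x[0] == y[0] and abs(x[1] - y[1]) <= max_dist
                   for x, y in zip(p, q))

    segments = []
    for pair in pairs:
        touching, rest = [], []
        for seg in segments:
            (touching if any(near(pair, q) for q in seg) else rest).append(seg)
        # No neighbour: new one-pair segment. One neighbour: extend it.
        # Several neighbours: the new pair bridges them, so join them all.
        merged = [pair] + [q for seg in touching for q in seg]
        segments = rest + [merged]
    # Impose the minimum member limit to remove spurious regions.
    return [seg for seg in segments if len(seg) >= min_pairs]
```

Selecting, for each gene, the longest segment that covers it would then give the synteny-confirmed ortholog assignment described in the text.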
The 4DTv distributions for orthologous pairs defined in this manner are shown in Figure 1b. They peak around characteristic values that reflect the evolutionary distance between the species. By this measure, X. laevis-X. tropicalis orthologs and the X. laevis doublets are at an evolutionary distance intermediate between those of mouse-rat and mouse-human.
In 1039 of the LLT triplets, the X. tropicalis gene had synteny-confirmed orthologs in human, mouse, and rat. These triplets were used to construct clusters of six genes containing two X. laevis co-orthologs and their corresponding single X. tropicalis, human, mouse, and rat orthologs.
After multiple sequence alignment, 904 of the sextuplets showed conserved blocks of at least 50 amino acids among all six peptides, as defined above for the triplets.
Test for EST artifact in peptide evolution
To rule out the possibility that the higher rate of peptide evolution in X. laevis is simply an artifact caused by EST sequencing errors, we performed the same analysis on the subset of 339 sextuplets for which the X. laevis doublets were both based on TCs assembled from 12 or more ESTs. For such clusters, sequencing errors associated with individual ESTs will generally be corrected by overlapping ESTs used in the consensus sequence. The ratio of peptide evolution to 4DTv stayed the same in this subset, and also in an even more restricted subset of 158 doublets with 24 or more ESTs (data not shown).
Description of triplets selected for in situ hybridizations
In some instances, paralog probes in X. laevis detected no significant expression differences and were set aside for this analysis (data not shown). However, as shown in Figure 4, some probes identified different expression patterns for the two paralogs in X. laevis (also indicating that they were a paralog-specific probe set). To confirm expression patterns in each case, over three dozen embryos were stained for each probe in three different in situ hybridization experiments. Expression patterns shown in Figure 4 are representative and were consistently seen across all of the embryos analyzed.
We thank the IMAGE consortium and Cambridge University and the Wellcome Trust/Sanger Institute and Wellcome Trust/Cancer Research UK Gurdon Institute Xenopus tropicalis EST project for reagents. R.M.H. is supported by the National Institutes of Health (GM66684). MKK is supported by a K08-HD42550 award from the National Institute of Child Health and Human Development.
- Ohno S: Evolution by Gene Duplication. 1970, Berlin: Springer Verlag
- Koonin EV: Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005, 39: 309-338. 10.1146/annurev.genet.39.073003.114725.
- Ferrier DE, Holland PW: Ancient origin of the Hox gene cluster. Nat Rev Genet. 2001, 2: 33-38. 10.1038/35047605.
- Young JM, Trask BJ: The sense of smell: genomics of vertebrate odorant receptors. Hum Mol Genet. 2002, 11: 1153-1160. 10.1093/hmg/11.10.1153.
- Arabidopsis Genome Initiative: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature. 2000, 408: 796-815. 10.1038/35048692.
- Blanc G, Wolfe KH: Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell. 2004, 16: 1679-1691. 10.1105/tpc.021410.
- Eichinger L, Pachebat JA, Glockner G, Rajandream MA, Sucgang R, Berriman M, Song J, Olsen R, Szafranski K, Xu Q, et al: The genome of the social amoeba Dictyostelium discoideum. Nature. 2005, 435: 43-57. 10.1038/nature03481.
- Lynch M, Conery JS: The evolutionary fate and consequences of duplicate genes. Science. 2000, 290: 1151-1155. 10.1126/science.290.5494.1151.
- Wolfe KH: Yesterday's polyploids and the mystery of diploidization. Nat Rev Genet. 2001, 2: 333-341. 10.1038/35072009.
- Deutschbauer AM, Jaramillo DF, Proctor M, Kumm J, Hillenmeyer ME, Davis RW, Nislow C, Giaever G: Mechanisms of haploinsufficiency revealed by genome-wide profiling in yeast. Genetics. 2005, 169: 1915-1925. 10.1534/genetics.104.036871.
- Wittbrodt J, Meyer A, Schartl M: More genes in fish?. Bioessays. 1998, 20: 511-515. 10.1002/(SICI)1521-1878(199806)20:6<511::AID-BIES10>3.0.CO;2-3.
- Paterson AH, Bowers JE, Chapman BA, Peterson DG, Rong J, Wicker TM: Comparative genome analysis of monocots and dicots, toward characterization of angiosperm diversity. Curr Opin Biotechnol. 2004, 15: 120-125. 10.1016/j.copbio.2004.03.001.
- Postlethwait J, Amores A, Cresko W, Singer A, Yan YL: Subfunction partitioning, the teleost radiation and the annotation of the human genome. Trends Genet. 2004, 20: 481-490. 10.1016/j.tig.2004.08.001.
- Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J: Preservation of duplicate genes by complementary, degenerative mutations. Genetics. 1999, 151: 1531-1545.
- Ferris SD, Whitt GS: Evolution of the differential regulation of duplicate genes after polyploidization. J Mol Evol. 1979, 12: 267-263. 10.1007/BF01732026.
- Bisbee CA, Baker MA, Wilson AC, Haji-Azimi I, Fischberg M: Albumin phylogeny for clawed frogs (Xenopus). Science. 1977, 195: 785-787. 10.1126/science.65013.
- Hirsch N, Zimmerman LB, Grainger RM: Xenopus, the next generation: X. tropicalis genetics and genomics. Dev Dyn. 2002, 225: 422-433. 10.1002/dvdy.10178.
- Kobel HR: Allopolyploid speciation. The Biology of Xenopus. Edited by: Tinsley RC, Kobel HR. 1996, Oxford: Clarendon Press, 390-401.
- Evans BJ, Kelley DB, Tinsley RC, Melnick DJ, Cannatella DC: A mitochondrial DNA phylogeny of African clawed frogs: phylogeography and implications for polyploid evolution. Mol Phylogenet Evol. 2004, 33: 197-213. 10.1016/j.ympev.2004.04.018.
- Graf JD, Kobel HR: Genetics of Xenopus laevis. Methods Cell Biol. 1991, 36: 19-34.
- Amores A, Force A, Yan YL, Joly L, Amemiya C, Fritz A, Ho RK, Langeland J, Prince V, Wang YL, Westerfield M, Ekker M, Postlethwait JH: Zebrafish hox clusters and vertebrate genome evolution. Science. 1998, 282: 1711-1714. 10.1126/science.282.5394.1711.
- McLysaght A, Hokamp K, Wolfe KH: Extensive genomic duplication during early chordate evolution. Nat Genet. 2002, 31: 200-204. 10.1038/ng884.
- Gu X, Wang Y, Gu J: Age distribution of human gene families shows significant roles of both large- and small-scale duplications in vertebrate evolution. Nat Genet. 2002, 31: 205-209. 10.1038/ng902.
- Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J: The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res. 2005, 33: D71-74. 10.1093/nar/gki064.
- Hughes MK, Hughes AL: Evolution of duplicate genes in a tetraploid animal, Xenopus laevis. Mol Biol Evol. 1993, 10: 1360-1369.
- Postlethwait J, Amores A, Cresko W, Singer A, Yan YL: Subfunction partitioning, the teleost radiation and the annotation of the human genome. Trends Genet. 2004, 20: 481-490. 10.1016/j.tig.2004.08.001.
- Christoffels A, Koh EGL, Chia J-M, Brenner S, Aparicio A, Venkatesh B: Fugu genome analysis provides evidence for a whole-genome duplication early during the evolution of ray-finned fishes. Mol Biol Evol. 2004, 21: 1146-1151. 10.1093/molbev/msh114.
- Chain FJJ, Evans BJ: Multiple mechanisms promote the retained expression of gene duplicates in the tetraploid frog Xenopus laevis. PLoS Genet. 2006, 2: e56. 10.1371/journal.pgen.0020056.
- Morin RD, Chang E, Petrescu A, Liao N, Griffith M, Chow W, Kirkpatrick R, Butterfield YS, Young AC, Stott J, et al: Sequencing and analysis of 10,967 full-length cDNA clones from Xenopus laevis and Xenopus tropicalis reveals post-tetraploidization transcriptome remodeling. Genome Res. 2006, 16: 796-803. 10.1101/gr.4871006.
- Jaillon O, Aury JM, Brunet F, Petit JL, Stange-Thomann M, Mauceli E, Bouneau L, Fischer C, Osouf-Costaz C, Bernot A, et al: Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature. 2004, 431: 946-957. 10.1038/nature03025.
- Kimura M, Ohta T: On some principles governing molecular evolution. Proc Natl Acad Sci USA. 1974, 71: 2848-2852. 10.1073/pnas.71.7.2848.
- Jordan IK, Wolf YI, Koonin EV: Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evol Biol. 2004, 4: 22. 10.1186/1471-2148-4-22.
- Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechanya A: PANTHER: a library of protein families and subfamilies indexed by function. Genome Res. 2003, 13: 2129-2141. 10.1101/gr.772403.
- Thomas PD, Kejariwal A, Guo N, Mi H, Campbell MJ, Muruganujan A, Lazareva-Ulitsky B: Applications for protein sequence-function evolution data: mRNA/protein expression analysis and coding SNP tools. Nucleic Acids Res. 2006, 34: W645-W650. 10.1093/nar/gkl229.
- Takahashi N, Tochimoto N, Ohmori SY, Mamada H, Itoh M, Inamori M, Shinga J, Osada S, Taira M: Systematic screening for genes specifically expressed in the anterior neuroectoderm during early Xenopus development. Int J Dev Biol. 2005, 49: 939-951. 10.1387/ijdb.052083nt.
- DFCI indices. [http://compbio.dfci.harvard.edu/tgi]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410.
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
- Hubbard TJP, Aken BL, Beal K, Ballester B, Caccamo M, Chen Y, Clarke L, Coates G, Cunningham F, Cutts T, et al: Ensembl. Nucleic Acids Res. 2007, 2007 Jan 1, doi:10.1093/nar/gkl996
- Harland RM: In situ hybridization: an improved whole-mount method for Xenopus embryos. Methods Cell Biol. 1991, 36: 685-695.
- Khokha MK, Chung C, Bustamante EL, Gaw LW, Trott KA, Yeh J, Lim N, Lin JC, Taverner N, Amaya E, et al: Techniques and probes for the study of Xenopus tropicalis development. Dev Dyn. 2002, 225: 499-510. 10.1002/dvdy.10184.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.