A phylogenomic framework and timescale for comparative genomics and evolutionary developmental biology of tunicates

Background Tunicates are the closest relatives of vertebrates and are widely used as models to study the evolutionary developmental biology of chordates. Their phylogeny, however, remains poorly understood and to date, only the 18S rRNA nuclear gene and mitogenomes have been used to delineate the major groups of tunicates. To further resolve their evolutionary relationships and provide a first estimate of their divergence times, we used a transcriptomic approach to build a supermatrix consisting of 258 evolutionarily conserved orthologous genes from representative species of all major tunicate lineages. Results Phylogenetic analyses using site-heterogeneous CAT mixture models of amino acid sequence evolution resulted in a strongly supported tree topology resolving the relationships among four major tunicate clades: 1) Appendicularia, 2) Thaliacea + Phlebobranchia + Aplousobranchia, 3) Molgulidae, and 4) Styelidae + Pyuridae. Notably, the morphologically derived Thaliacea are confirmed as the sister-group of the clade uniting Phlebobranchia + Aplousobranchia within which the precise position of the model ascidian genus Ciona remains uncertain. Relaxed molecular clock analyses accommodating the accelerated evolutionary rate of tunicates reveal ancient diversification (∼450-350 million years ago) among the major groups and allow comparing their evolutionary age with respect to the major vertebrate model lineages. Conclusions Our study represents the most comprehensive phylogenomic dataset so far available for tunicates. It offers a reference phylogenetic framework and first tentative timescale for tunicates, allowing a direct comparison with vertebrate model species in future comparative genomics and evolutionary developmental biology studies.


Conclusions.
Our study represents the most comprehensive phylogenomic dataset so far available for tunicates. It offers a reference phylogenetic framework and first tentative timescale for tunicates, allowing a direct comparison with vertebrate model species in future comparative genomics and evolutionary developmental biology studies. 3

Background
Large-scale phylogenetic analyses of tunicate genomic data from a handful of model species have identified this marine chordate group as the closest relative of vertebrates [1][2][3][4][5]. This discovery has had profound implications for comparative genomics and evolutionary developmental biology studies aimed at understanding the origins of chordates and vertebrates [6][7][8]. Indeed, the new chordate phylogeny implies that the tunicate body plan is evolutionarily derived and has become secondarily simplified from more complex chordate ancestors [2,3].
The key phylogenetic position of tunicates within chordates has prompted the selection of model species such as Ciona robusta (formerly Ciona intestinalis type A [9]) for which a full genome has been sequenced early in the history of comparative genomics to provide insight into vertebrate-specific whole genome duplications [10]. Since then, genome sequences have been assembled for additional species that are widely used as models in comparative genomics and evolutionary developmental biology [11] including Ciona savignyi [12], Oikopleura dioica [5], Botryllus schlosseri [13], Molgula occidentalis, M. occulta and M. occulata [14], Phallusia mammillata [15] and Halocynthia roretzi [15]. The available genomic data have notably revealed a stunning contrast in the evolutionary rate of nuclear proteincoding genes between tunicates and vertebrates [3,16]. This accelerated evolution of tunicate genes is also coupled with extensive structural rearrangements observed in their genomes [5,17,18]. This contrast is even more pronounced for mitochondrial genomes, which are particularly fast evolving and highly rearranged in tunicates with respect to other deuterostomes in which they are widely conserved [5,19,20]. The reasons behind the rapid rate of genomic evolution in tunicates remain unclear [16,21,22] and contrasts with the unusual conservation level of embryonic morphologies between all ascidian species studied so far [7].
The phylogenetic position of thaliaceans is key for understanding the evolution of developmental modes within tunicates [29]. Compared to their closest relatives that are mostly solitary and sessile, the three groups of thaliaceans (salps, doliolids and pyrosomes) are pelagic with complex life cycles including solitary and colonial phases. Their unique lifestyle also seems to be associated with spectacular differences in their embryology, such as the loss of a well-developed notochord in the larva of most thaliaceans, with the exception of only a few doliolid species [29]. Based on our current understanding of tunicate evolution, thaliaceans may have evolved from a sessile ascidian-like ancestor and therefore can serve as a model to understand how the transition from a benthic to a pelagic lifestyle has led to drastic modifications in the morphology, embryology, and life cycle of these tunicates [29].
Coloniality is another remarkable feature of the thaliaceans, which shows some similarities with the coloniality in ascidians, even tough this trait probably evolved independently in the two groups [29]. It is noteworthy that doliolids have polymorphic colonies [30], a trait that is absent in colonial ascidians. A reliable phylogeny positioning thaliaceans with regard to colonial ascidians is thus necessary to understand the evolution of these unique features.
Outstanding questions in chordate evolution include the identification of the determinants of the rapid rate of genome evolution in tunicates and the emergence of vertebrates [11,31]. A prerequisite to address these issues is to reconstruct a reliable phylogenetic framework and timescale to guide future comparative evolutionary genomic and evolutionary studies of chordate development. Moreover, given that the fossil record of tunicates is deceptively scarce and controversial [32][33][34], a molecular timescale for chordates would allow comparing tunicate evolution to the well-calibrated vertebrates [35] for the first time. A phylogenetic and timing context is notably critical for the identification and interpretation of both conserved and divergent developmental features of tunicates compared to model vertebrate species in the context of their fast rate of genomic evolution [11].
Here, we use new transcriptomic data obtained through high-throughput sequencing technologies (Roche 454 and Illumina HiSeq) to build the first tunicate phylogenomic dataset including all major tunicate groups. This dataset consists of 258 orthologous nuclear genes for 63 taxa including representative deuterostome species and all major chordate lineages.
Using phylogenetic analyses based on the best-fitting site-heterogeneous CAT mixture model of amino acid sequence evolution, we inferred well-resolved phylogenetic relationships for the major clades of tunicates. Our molecular dating analyses based on models of clock relaxation accounting for variation in lineage-specific evolutionary rates provide a first tentative timescale for the emergence of the main tunicate clades allowing a direct comparison with vertebrate model systems.

Transcriptome data collection
Live tunicate specimens were ordered from Gulf Specimen Marine Laboratories, Inc.
(Panacea, FL, USA) and the Roscoff Biological Station (France) services, and collected in

Phylogenomic dataset assembly
We built upon a previous phylogenomic dataset [39] to select a curated set of 258 orthologous markers for deuterostomes. Alignments were complemented with sequences from the NCBI databases using a multiple best reciprocal hit approach implemented in the newly designed Forty-Two software [40]. Because 454 DNA sequence reads are characterized by sequencing errors typically disrupting the reading frame when translated into amino acids, alignments were verified by eye using the program ED from the MUST package [41]. Ambiguously aligned regions were excluded for each individual protein using Gblocks with medium default parameters [42] with a few subsequent manual refinements using NET from the MUST package to relax the fact that this automated approach is sometimes too conservative. This manual refinement step restored only 418 amino acid sites (i.e., 0.6 % of the total alignment length). Potential cross-contaminations between our samples were also dealt with at the alignment stage by performing BLAST searches of each sequence against a taxon-rich reference database maintained for each curated gene alignment, and were further looked for by a visual examination of each individual gene phylogeny.

Phylogenetic analyses
Bayesian cross-validation [45] implemented in PhyloBayes 3.3f [46] was used to compare the fit of site-homogeneous (LG and GTR) and site-heterogeneous (CAT-F81 and CAT-GTR) models coupled with a gamma distribution (Γ 4 ) of site-rate heterogeneity. Ten replicates were considered, each one consisting of a random subsample of 10,000 sites for training the model and 2,000 sites for computing the cross-validation likelihood score. Under sitehomogeneous LG+Γ 4 and GTR+Γ 4 models, 1,100 sampling cycles were run and a burn-in of 100 samples was used, and under site-heterogeneous models CAT-F81+Γ 4 and CAT-GTR+Γ 4 , 3,100 sampling cycles were run and the first 2,100 samples were discarded as burn-in.
Bayesian phylogenetic reconstruction under the best-fitting CAT-GTR+Γ 4 mixture model [47] was conducted using PhyloBayes_MPI 1.5a [48]. Two independent Markov Chain Monte Carlo (MCMC) starting from a randomly generated tree were run for 6,000 cycles with trees and associated model parameters being sampled every cycle. The initial 1,000 trees sampled in each MCMC run were discarded as burn-in after checking for convergence in both likelihood and model parameters, as well as in clade posterior probabilities using bpcomp (max_diff < 0.3). The 50% majority-rule Bayesian consensus tree and the associated posterior probabilities (PP) were then computed from the remaining combined 10,000 (2 x 5,000) trees using bpcomp.
We further assessed the robustness of our phylogenomic inference by applying a gene jackknife resampling procedure [3]. A hundred jackknife replicates constituted of 130 alignments drawn randomly out of the total 258 protein alignments were generated. The 100 resulting jackknife supermatrices were then analysed using PhyloBayes_MPI under the second best-fitting CAT-F81+Γ 4 model instead of the best-fitting CAT-GTR+Γ 4 and for 2,000 sampling cycles in order to reduce computational burden. After removing the first 200 sampled trees of each chain as the burn-in, a majority-rule consensus tree was obtained for each replicate using the 1,800 trees sampled from the posterior distribution. A consensus tree was then obtained from the 100 jackknife-resampled consensus trees. The support values displayed by this Bayesian consensus tree are thus gene Jackknife Support percentages (JS). High values indicate nodes that have high posterior probability support in most jackknife replicates and are thus robust to gene sampling. We verified convergence of MCMCs in each Jackknife replicate by checking that varying the burn-in value did not affect the JS percentages obtained in the final consensus.

Molecular dating
Molecular dating analyses were performed in a Bayesian relaxed molecular clock framework using PhyloBayes 3.3f [46]. In all dating calculations, the tree topology was fixed to the majority-rule consensus tree inferred in previous Bayesian analyses (Fig. 1). This tree has been rooted between Xenambulacraria (Xenoturbellida + Ambulacraria) and Chordata following the results of Philippe et al. [39] showing that Xenacoelomorpha (acoelomorphs + xenoturbellids) were related to Ambulacraria (hemichordates + echinoderms) within Deuterostomia. These results have been recently challenged by two studies claiming support for a more external position of Xenacoelomorpha as a sister-group to Nephrozoa (Protostomia + Deuterostomia) [49,50]. However, this newly proposed position is still debated as it might be the result of a long-branch attraction (LBA) artefact caused by the very long branches of acoelomorphs in phylogenomic trees [51,52]. Hence, our choice to root our trees according to Philippe et al. [39]. Dating analyses were conducted using the best-fitting siteheterogeneous CAT-GTR+Γ 4 mixture model and a relaxed clock model with a birth-death prior on divergence times combined with soft fossil calibrations following Lartillot et al. [46].
Given the lack of trustable fossils within tunicates, we used 12 calibration intervals defined within vertebrates [53,54] and one within echinoderms [55] In order to select the best-fitting clock model, we compared the auto-correlated lognormal (LN) relaxed clock model [56] with the uncorrelated gamma (UGAM) model [57], and a strict molecular clock (CL) model. These three clock models were compared against each other using the same prior settings (see above) in a cross-validation procedure as implemented in PhyloBayes following Lepage et al. [58]. However, to reduce the computational burden, the CAT-F81+Γ 4 mixture model was used instead of CAT-GTR+Γ 4 .
The cross-validation tests were performed by dividing the original alignment in two subsets with 90% of sites for the learning set (59,934 sites) and 10% of sites for the test set (6,659 sites). The overall procedure was repeated over 10 random splits for which a MCMC was run on the learning set for a total 4,000 cycles sampling posterior rates and dates every cycle.
The first 3,000 samples of each MCMC were excluded as the burn-in for calculating the cross-validation scores averaged across the 10 replicates.
The final dating calculations were conducted under both LN and UGAM relaxed-clock models and the CAT-GTR+Γ 4 mixture model of sequence evolution by running MCMCs for a total 25,000 cycles sampling posterior rates and dates every 10 cycles. The first 500 samples of each MCMC were excluded as the burn-in after checking for convergence in both likelihood and model parameters using readdiv. Posterior estimates of divergence dates and associated 95% credibility intervals were then computed from the remaining 2,000 samples of each MCMC using readdiv.

A reference phylogenetic framework for model tunicates
The evolutionary relationships of tunicates have long been a matter of debate. This is mainly because tunicates are characterized by an overall accelerated rate of evolution in their nuclear and mitochondrial genomes compared to other deuterostome species. This large lineage-specific variation in evolutionary rates among tunicates [16] could result in LBA artefacts, which hamper the reliable reconstruction of their phylogenetic relationships [59][60][61]. Another contributing factor to our limited understanding of tunicate evolution is the uneven availability of genome data across different tunicate lineages. To address these limitations, we used (i) a wider taxon sampling encompassing all major tunicate lineages including two divergent thaliaceans, (ii) numerous nuclear genes to reduce stochastic error, and (iii) powerful site-heterogeneous models that generally offer the best fit to phylogenomic data and have the advantage to be least sensitive to LBA and other potential phylogenetic artefacts [39,62,63]. Accordingly, the results of our Bayesian cross-validation tests showed that the CAT-GTR+Γ 4 mixture model offered the best statistical fit to the data (ΔlnL = 1,506 ± 98 compared to LG+Γ 4 ), followed by the CAT- The results from this first phylogenomic study including all tunicate lineages were in line with recent studies [20,[25][26][27][28] demonstrating that ascidians (Class Ascidiacea) form a paraphyletic group. Our results showed that phlebobranchs and aplousobranchs are undoubtedly closer to thaliaceans than to stolidobranchs (Fig. 1), and that a thorough taxonomic revision of the tunicate classes is necessary. It seems clear that the use of the Ascidiacea class should be abandoned in favour of more meaningful classification schemes.
Even though the position of Thaliacea was not always statistically supported, it consistently appeared as the sister-group of phlebobranchs + aplousobranchs in previous studies [20,[24][25][26]28], except for a recent genome-scale study in which the positioning of Salpa thompsoni most likely suffered artefactual LBA attraction towards the fast-evolving appendicularians [66]. The robust phylogenetic position of thaliaceans found here indicates that they evolved from a sessile ancestor and their study can provide valuable information on the morphological transformations associated with the transition to the pelagic lifestyle [29].
The monophyly of the clade uniting phlebobranchs and aplousobranchs has never been challenged and thus we suggest to re-use the term Enterogona to define this group as originally proposed by Perrier [65] and subsequently redefined by Garstang [67]. The close relationship between thaliaceans and enterogones has also been supported by all previous molecular studies, as well as by morphological observations. The gonad position and the shared paired ontogenetic rudiment of the atrial cavity and opening might constitute two of their anatomical synapomorphies [68]. Lastly, we also confirmed the previously reported monophyly of stolidobranchs (= Pleurogona), with molgulids being the sister-group to styelids + pyurids.
Finally, our phylogenomic study casted new light on two recurring issues in tunicate phylogenetics. First, phlebobranchs have been repeatedly found to be paraphyletic, albeit usually with no statistical support [25][26][27][28]69], and the phylogenetic affinities among its members remains unclear. Notably, the traditional position of Ciona as a phlebobranch ascidian was challenged by Kott [70], who placed the genus within aplousobranchs on the basis of morphological characters. More recently, Turon & López-Legentil [69] and Shenkar et al. [28] found that Ciona was closer to aplousobranchs than to other phlebobranchs using mitochondrial DNA. These results are in agreement with the tree topology obtained in the present study, although it was not statistically supported. The positioning of the model Ciona genus and the phylogenetic relationships of phlebobranchs need to be the focus of additional phylogenomic studies including a denser taxon sampling. Second, although the position of appendicularians as sister-clade to all other tunicates was well supported here and in all previous tunicate phylogenomic studies [2,3], the extremely long branch of Oikopleura dioica coupled with our current inability to completely alleviate a potential LBA artefact -even with complex site-heterogeneous mixture models (see [63]) -prevent us from considering this species' phylogenetic position as conclusive. The long appendicularian branch should be subdivided with the inclusion of additional divergent species in future phylogenomic analyses to definitively settle this point.

Evolutionary rate variations and molecular clock models
As observed in previous phylogenomic studies of chordates [2][3][4], the Bayesian phylogram estimated under the best-fitting CAT-GTR+Γ 4 mixture model revealed marked branch length heterogeneity (Fig. 1). The tunicate branch lengths not only were much longer than those of all the other deuterostome clades, but also displayed strong variations within tunicates. From the ancestral node of Olfactores, the tunicate median evolutionary rate as measured in terms of branch length was of 1.53 amino acid substitution per site compared to the vertebrate median evolutionary rate that was 0.65. From the ancestral vertebrate node, the average of branch lengths is 0.35 ± 0.05 amino acid replacements per site. In contrast, from the ancestral node of tunicates -excluding the super fast-evolving Oikopleura dioica -the average of branch lengths was 0.69 ± 0.19. For the proteins here combined for phylogenomic purposes, tunicates (to the exception of Oikopleura dioica) displayed a twicehigher rate of amino acid substitutions than vertebrates.
Such substitution rate variation among lineages -within tunicates, and between tunicates and other deuterostomes -needs to be accounted for in molecular dating analyses by using models of clock relaxation [56]. The selection of the clock model is often arbitrary and appears mostly dependent of the software choice, with an overwhelming majority of studies relying on the BEAST software [71] using an uncorrelated gamma (UGAM, also known as UCLN) model of clock relaxation. However, it has been shown that autocorrelated rate models, such as the autocorrelated log-normal model (LN), often provide a better fit with phylogenomic data [58,72]. Consequently, we compared the fit of both the UGAM and LN models to the fit of a strict molecular clock (CL) model for our dataset using cross-validation tests under the CAT-GTR+Γ 4 model. As expected given the large lineage specific rate variation, both relaxed clock models largely outperformed the strict clock model The use of a relaxed clock model allowed us to perform evolutionary rate comparisons in terms of number of substitutions per site per million years for the 63 terminal taxa considered (Fig. 2). The box plots clearly showed that tunicates evolved faster than other groups, especially compared to vertebrates that were the slowest evolving. On average, tunicates evolved 6.25 faster than vertebrates (two-tailed t test; t = 4.542, p = < 0.001*), 2.08 times faster than cephalochordates (two-tailed t test not applicable with only one cephalochordate), and 2.45 times faster than the outgroups (two-tailed t test; t = 1.711, p = 0.099 ns ) here included. The evolutionary rate variation was also much more pronounced within tunicates than within other groups, even when the very fast evolver Oikopleura dioica was excluded. For instance, the colonial species Diplosoma listerianum and Salpa fusiformis evolved considerably faster than the solitary species Ciona spp. and Styela plicata. This confirmed earlier observations based on a reduced number of taxa and substitution rate estimations on 35 housekeeping genes [16], once again underlining the peculiar genomic evolution of tunicates that might find its root in elevated mutation rates and pervasive molecular adaptation [21,22]. Even though the difference in fit between the two relaxed clock models was not significant for our dataset, in general LN provided more consistent dating estimates than UGAM with respect to the mean divergence dates of numerous vertebrate groups reported in the latest phylogenomic study of jawed vertebrates [35]. Notably, as observed in a previous phylogenomic study of tetrapods [73], the application of the UGAM relaxed clock model provided unrealistically recent estimates with respect to the maximum node age for the origin of turtles (LN Fig. 3 and Additional file 3: Figure S1). The UGAM model also tended to systematically provide much wider 95% credibility intervals than LN with several of them actually spanning hundreds of millions of years (Table 1; Fig. 1; Additional file 3: Figure   S1). Given the uncertainty associated with the dating results obtained using the UGAM model of clock relaxation, we focused our discussion below on results obtained with the more robust autocorrelated LN model, which we considered as our currently most reliable dating estimates.

A tentative timescale for tunicate evolution within chordates
The Bayesian chronogram obtained using the LN relaxed molecular clock model and the site-heterogeneous CAT-GTR+Γ 4 mixture model of amino acid sequence evolution is presented in Figure 3. This phylogenomic timescale showed that major tunicate clades appeared early in chordate evolutionary history. The earliest split between appendicularians and all other tunicates was dated back to ca.  Fig. 3). Even more recent divergences such as the ones between congeneric species within Ciona and Molgula occurred more than 100 Mya. Our estimated divergence dates in tunicates were associated with fairly large 95% credibility intervals, probably because of the lack of internal fossil calibrations within tunicates, in contrast to the well-calibrated vertebrates. It has recently been pointed that given the uncertainty associated with molecular dating estimates, building evolutionary narratives would be premature for early animal evolution [74]. In our case, we argue that in the absence of a trustable tunicate fossil record [33], our tentative molecular timescale constitutes the first and only currently available approach to provide a much-needed relative comparison of divergence times between the major lineages of tunicates and vertebrates.
The relatively ancient origins of the different tunicate lineages revealed by our molecular dating estimates have two broader implications. First, there seems to be a larger gap than previously thought between tunicate and vertebrate taxonomic ranks which exacerbates the inadequacy of their direct comparison. For example, when a vertebrate genus usually spans less than 40 million years [75], a tunicate genus (e.g. Molgula) can span up to two hundred million years ( Fig. 3; Table 1). The meaning of Linnean categorical ranks and their temporal inconsistencies among clades have been largely discussed [75], as recently illustrated by the debate around the taxonomic status of the main Chordate lineages [76][77][78]. The parallel we draw here between tunicates and vertebrates should help tunicate developmental biologists to interpret their results in light of the large divergences that might exist between tunicate model species despite their classification in the same genus. Second, the ancient age of their major divergence events can heavily complicate orthology assessment among tunicates, as well as between tunicates and vertebrates, thus reducing the quality of genome annotations. Furthermore, the fast-paced molecular evolution of tunicates prevents the identification of some genes by simple similarity methods (e.g. BLAST), even when orthologs do exist in databases. For instance, in terms of evolutionary depth, a comparative study of the genus Molgula is roughly equivalent to a comparative study among turtles representing about 180 Myr of evolution. In terms of amino acid sequence divergence, the differences are much more pronounced between Molgula occidentalis / Molgula tectiformis (88.1% similarity) than between Phrynops / Chrysemys (98.0% similarity; Fig. 3, Table 1).
From an evo-devo perspective, the phylogenetic framework and tentative timescale presented here lead to an apparent paradox. Like most nematodes [79], the embryos of each ascidian species develop in a stereotyped manner, based on the use of invariant cell lineages [7]. Unlike nematodes however, ascidian stereotyped cell lineages are shared between evolutionarily distant species such as Ciona robusta (Enterogona) and Halocynthia roretzi (Pleurogona) [11]. The extreme morphological conservation of ascidian embryogenesis therefore contrasts with the high rates of protein divergence observed in their genomes. This paradox raises questions about the underlying mechanisms involved in developmental regulation of these animals with highly dynamic genomes. In this context, our reference phylogenetic tree and divergence date estimates among tunicate lineages could be used as an evolutionary framework to select model species sufficiently close to one another (i.e. retaining sufficient phylogenetic information) for future comparative genomic analyses assessing orthology by gene tree reconciliation and estimating evolutionary rate variations among gene ontology categories.

Conclusion
This study represents the first large-scale phylogenomic analysis including all major tunicate lineages based on transcriptomic data. The resulting phylogenetic framework and tentative timescale constitute a necessary first step towards a better understanding of tunicate systematics, genomics, and development, and in a broader context, of chordate evolution and developmental biology.

Declarations
Ethics approval and consent to participate. Not applicable.

Consent for publication. All authors agree with the publication of this article.
Availability of data and material. Raw Figure S1. The Bayesian chronogram has been obtained using an uncorrelated gamma (UGAM) relaxed molecular clock model using PhyloBayes under the CAT-GTR+Γ 4 mixture model, with a birth-death prior on the diversification process, and 13 soft calibration constraints. Node bars indicate the uncertainty around mean age estimates based on 95% credibility intervals. Plain node bars indicated nodes used as a priori calibration constraints.
Numbers at nodes refer Table 1.  Average rate (x1000) Outgroups Cephalochordates Tunicates Vertebrates q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q