Major trends
Our analyses indicate that progress on the vertebrate tree of life has been remarkably rapid over the last 16 years, with approximately one quarter of the nodes in the vertebrate tree of life now resolved with at least 50% bootstrap support (Figure 1c). The increase in resolution has been faster than linear (linear r² = 0.970; second-order polynomial r² = 0.998; P-value for the increase in r² = 4.13 × 10⁻¹²) and appears to be proportional to the increase in the number of publications on phylogenetics.
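The comparison of linear and second-order polynomial fits can be sketched as follows. This is a minimal illustration, not the authors' analysis: the yearly resolution values are synthetic placeholders, the helper names (`solve`, `polyfit`, `r_squared`, `f_for_added_term`) are hypothetical, and the nested-model F statistic is one common way to test whether an added term significantly increases r² (the source does not name the test it used).

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients ordered from the constant term upwards.
    """
    k = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(a, b)


def r_squared(xs, ys, coeffs):
    """Coefficient of determination for a fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    ss_res = sum(
        (y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )
    return 1.0 - ss_res / ss_tot


def f_for_added_term(r2_small, r2_big, n_obs, n_params_big):
    """F statistic for adding one term to a nested model (1 numerator df)."""
    return (r2_big - r2_small) / ((1.0 - r2_big) / (n_obs - n_params_big))


# Synthetic placeholder series: years since the start of the survey window and
# the fraction of nodes resolved. These are NOT the values behind Figure 1c.
years = list(range(17))
resolution = [0.0008 * t * t + 0.002 * t + 0.002 * (-1) ** t for t in years]

r2_linear = r_squared(years, resolution, polyfit(years, resolution, 1))
r2_quadratic = r_squared(years, resolution, polyfit(years, resolution, 2))
f_stat = f_for_added_term(r2_linear, r2_quadratic, len(years), 3)
print(r2_linear, r2_quadratic, f_stat)
```

A P-value would follow from comparing `f_stat` with an F distribution on 1 and n − 3 degrees of freedom; that step is omitted here to keep the sketch free of external dependencies.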
When the 100 clades are pooled into major vertebrate lineages (classes or similar taxonomic levels), several important trends emerge (Figure 2). Among high-diversity clades (Figure 2a), phylogenetic progress has been most rapid for tetrapods and, in particular, for mammals and birds. However, progress has been even more striking for relatively low-diversity clades (those containing < 2% of vertebrate diversity) such as crocodilians and turtles (Figure 2b). Crocodilia, with only 23 species, is particularly well resolved. We recovered 80% of the nodes in its tree (Figure 2b), 60% of which were well supported at a bootstrap level of 95% (Table 1). This trend is general across the clades that we sampled. Large clades, on average, receive less research effort per species than small clades, resulting in datasets with large amounts of missing data (Figure 3) and comparatively low resolution (Figure 2a and 2b). Among the high-diversity vertebrate lineages, we found three relatively distinct levels of progress. Birds and mammals have undergone the most rapid phylogenetic progress, with approximately 40% of each group's tree of life resolved (Figure 2a), followed by amphibians and squamate reptiles (~30%) and ray-finned fishes (~15%). Our estimate of 40% resolution of the mammal tree of life is similar to that of a recent species-level analysis of most mammal species [12], suggesting that our automated supermatrix approach is performing reasonably well. The study by Bininda-Emonds et al. [12] found a 46% resolved supertree for Mammalia, but it included all the deep-level nodes in the mammal tree, which tend to be better resolved than the tip-level nodes and thus increase the overall resolution.
Another major trend in the accumulation of vertebrate phylogenetic knowledge is that marine clades are the least well characterized of all major vertebrate lineages. Ray-finned fishes are extremely diverse, containing over half of all vertebrate species, and this may explain their low (~14%) proportion of resolved nodes. However, the cartilaginous fishes (sharks, rays and skates) and lampreys are both species-poor (with ~1200 and ~40 species, respectively) and have a similarly low resolution (Figure 2b), suggesting that phylogenetic progress on marine 'fishes' has lagged behind the remaining vertebrates, regardless of species diversity per se. The cetaceans (whales, dolphins and porpoises) represent the only exception to this trend in our dataset, with 100% species-level sampling and 61% of nodes in their tree of life resolved (Additional Files 2 and 3).
Although progress has increased steadily for virtually all clades, it has improved dramatically for a few groups. Phylogenetic resolution in the amphibians was accumulating at the same slow rate as ray-finned fishes throughout the early period of molecular systematics, but then saw the most rapid phylogenetic progress of any diverse clade beginning in about 2003 (Figure 2a). This increase was probably due to a large influx of funding and several prominent studies on amphibian systematics in the last several years [16, 20–25]. Both amphibian and bird research received major National Science Foundation funding from the Assembling the Tree of Life initiative (in 2004 for the amphibians and in 2002 and 2003 for birds [26]), which was immediately followed by rapid increases in phylogenetic progress.
What dataset features are associated with phylogenetic resolution?
As vertebrate phylogenetics moves forward, we can ask which gene- and taxon-sampling features are most strongly associated with phylogenetic resolution; such insights should aid the community in allocating resources for future research. For example, among the 100 clades that we analysed, species sampling in GenBank varied between 6% (for the Paracanthopterygii) and 100% (for Cetacea, Crocodilia, Lemuriformes and Perissodactyla) of described species (Additional File 2). Regardless of the measure (gene, taxon or character sampling), the amount of effort has been uneven across clades (Figure 3, Table 1, Additional File 2), resulting in a correspondingly uneven distribution of phylogenetic information (Figure 2, Table 1, Additional File 2). In order to assess the effects of sampling patterns on phylogenetic resolution, we performed a multiple regression of phylogenetic resolution on the proportion of species sampled in a clade, the total number of species in the clade, the average number of characters per species and dataset density (the proportion of non-missing data) for the 100 clades (r² = 0.93, P = 9.3 × 10⁻⁵⁵). The proportion of species sampled in a clade has the greatest effect on phylogenetic resolution (partial regression coefficient of 0.80), followed by dataset density (partial regression coefficient of 0.22). Both clade size and number of characters have smaller, but significant, impacts (partial regression coefficients of 0.10 and 0.11, respectively). Dataset density co-varies with clade size and number of characters (Figure 3), with exceptionally large clades having both fewer characters and lower dataset densities.
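A multiple regression of this kind can be sketched as below. The sketch assumes that the reported partial regression coefficients are standardized (fitted to z-scored variables), which the text does not state explicitly. The clade rows are invented placeholders rather than values from Additional File 2, and the function names are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def standardize(values):
    """Convert a column to z-scores (mean 0, sample standard deviation 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def standardized_regression(predictor_columns, response):
    """Return (standardized partial regression coefficients, r squared)."""
    zx = [standardize(col) for col in predictor_columns]
    zy = standardize(response)
    k = len(zx)
    # Normal equations on z-scores: (Z'Z) beta = Z'y
    a = [[sum(p * q for p, q in zip(zx[i], zx[j])) for j in range(k)] for i in range(k)]
    b = [sum(p * q for p, q in zip(zx[i], zy)) for i in range(k)]
    betas = solve(a, b)
    r2 = sum(beta * rhs for beta, rhs in zip(betas, b)) / (len(response) - 1)
    return betas, r2


# Invented placeholder rows, one per clade (NOT data from the study):
# proportion of species sampled, clade size, characters per species, density
clades = [
    (0.06, 9000, 800, 0.10), (1.00, 23, 5000, 0.60),
    (0.50, 300, 2000, 0.35), (0.80, 90, 1500, 0.50),
    (0.20, 2500, 1200, 0.15), (0.35, 700, 3000, 0.25),
    (0.95, 40, 2500, 0.70), (0.65, 150, 4000, 0.30),
]
resolution = [0.05, 0.80, 0.35, 0.55, 0.12, 0.25, 0.75, 0.45]  # placeholder response

columns = [list(col) for col in zip(*clades)]
betas, r2 = standardized_regression(columns, resolution)
print(betas, r2)
```

Because all variables are z-scored, the coefficients are unit-free and their magnitudes can be compared directly, which is what makes a statement such as "the proportion of species sampled has the greatest effect" meaningful across predictors measured on very different scales.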
Are we concentrating on the best genes?
While confidence in the tree of life will ultimately come from rigorously analysed multiple-marker datasets, the first approximations will most likely come from datasets that are sparsely sampled at the gene and character level but taxonomically enriched. After taxon sampling, the density of these matrices appears to have the largest effect on the resolution of the resulting trees. Researchers can therefore help to increase the data density of the 'vertebrate matrix' by focusing on a common set of markers, in addition to clade-specific markers. Ideally, this common set should comprise markers that are both highly informative and already widely used.
We examined sampling efforts by identifying those genes that have been most heavily sampled for phylogenetic studies and asking whether those genes also carry a strong phylogenetic signal. We selected the most heavily studied subset of the 100 clades (defined as clades that had been sampled for more than 20 genes, n = 37), analysed each of the genes for these 37 clades independently and asked which genes had received the most sequencing effort (measured by the number of taxa that had been sequenced) and which genes provided the most resolution (measured by the relative amounts of resolution found in phylogenies derived from each gene). The most frequently sampled genes were (in decreasing order) the mitochondrial markers cytochrome b and the 12S and 16S ribosomal RNAs. These mitochondrial genes were among the top five genes in terms of taxon sampling in 89%, 76% and 73% of the clades, respectively. The most frequently sampled nuclear genes were recombination activating gene 1 (RAG-1), β-fibrinogen and myoglobin, which were among the top five genes in terms of taxon sampling in 22%, 19% and 16% of the clades, respectively.
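The "top five in X% of clades" tally described above can be sketched as follows. The function name `top_n_share`, the clade labels and the taxon counts are all hypothetical; the example uses n = 2 only so that the toy table stays small.

```python
from collections import Counter


def top_n_share(taxa_per_gene_by_clade, n=5):
    """Fraction of clades in which each gene ranks among the n most-sampled genes.

    taxa_per_gene_by_clade maps clade name -> {gene name: number of taxa sequenced}.
    Ties are broken alphabetically by gene name (an arbitrary choice for this sketch).
    """
    tally = Counter()
    for gene_counts in taxa_per_gene_by_clade.values():
        ranked = sorted(gene_counts, key=lambda gene: (-gene_counts[gene], gene))
        tally.update(ranked[:n])
    total_clades = len(taxa_per_gene_by_clade)
    return {gene: hits / total_clades for gene, hits in tally.items()}


# Invented placeholder counts (NOT taken from GenBank or from the study)
example = {
    "clade_A": {"cytb": 40, "12S": 35, "16S": 30, "RAG1": 12},
    "clade_B": {"cytb": 22, "RAG1": 25, "16S": 9},
    "clade_C": {"12S": 18, "cytb": 17, "myoglobin": 5},
}
shares = top_n_share(example, n=2)
for gene, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{gene}: top 2 in {share:.0%} of clades")
```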
The most highly resolved gene trees in our dataset were derived from the mitochondrial NADH dehydrogenase subunit 2 (ND2) gene and control region, and from the nuclear genes β-fibrinogen, aldolase B, myoglobin, growth hormone 1 and RAG-1. Thus, among the nuclear genes, the most heavily sampled genes were also among the most resolved, although this analysis also identifies growth hormone 1 and aldolase B as strong candidates for additional sequencing effort. In the mitochondrial data, the genes with the most resolving power were not among the most heavily sampled. Further, the most heavily sampled mitochondrial genes did not rank near the top of the mitochondrial genes in terms of resolving power: cytochrome b and the 12S and 16S ribosomal RNAs rank 8th, 9th and 10th (out of 16), respectively. Despite these results, an attractive target for additional mitochondrial sequencing in vertebrate phylogenetics is, perhaps, cytochrome oxidase I. This gene is already the target of massive DNA bar-coding efforts and ranked well in our analysis, in terms of both species sampling (fourth out of 16) and phylogenetic resolution (fifth out of 16).
We checked whether these results were being driven by a correlation between the number of species sampled for a gene and phylogenetic resolution, and found no significant relationship (r² = 0.0015, regression slope = 9.7 × 10⁶, P = 0.15). However, the analysis is based on averages across several clades, and it is well known that rates of molecular variation vary across the tree of life. Further, these recommendations apply to studies at relatively shallow levels of diversity (generally at the family level and below). Thus, these genes appear to be attractive starting points for a common gene set, although it is unlikely that they will be the most informative genes in all clades and certainly not at all phylogenetic scales.
Future progress on the tree of life
Extrapolations based on the last 16 years provide a framework for discussing future progress on the vertebrate tree of life. Like any extrapolation, these projections are assumption-laden and necessarily approximate, but they are instructive. If current trends continue, we predict essentially complete species-level sampling before 2020 (Figure 4). This assumes that all species are equally easy to sample and that no new vertebrate species will be described, both of which are incorrect assumptions. Presumably the last few species of many clades will be those that are rare and/or secretive, difficult to collect for political reasons, or recently extinct. Even so, current trends predict that most species will have sequenced DNA available in roughly another decade. Although a single sequence for a single specimen is a far cry from a complete phylogeny, our projections suggest that we will have at least a rough phylogenetic placement, based on molecular data, for most vertebrates in the near future.
Although projections for the resolution of the tree itself imply slower progress, they still suggest that an essentially fully resolved vertebrate tree of life is within reach. Again, these projections are based on strong assumptions, as some parts of the tree will be more difficult to resolve than others (for example, those with short branches or incongruent gene trees). If much of the tree is difficult, and our progress to date is largely made up of the 'easy' parts, future progress will take much longer than our projections. Alternatively, if we assume that our progress to date has been an unbiased sample with respect to easy versus difficult nodes, then the current rate of progress suggests that we will understand a majority of the vertebrate tree, with strong support, in the next three decades (Figure 4).
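The logic of such a projection can be illustrated with a deliberately simple sketch: fit a trend to the yearly values and solve for the year in which it reaches a target. A straight line is used here purely for brevity; it is an assumption of this example rather than the model behind Figure 4, and the yearly fractions are placeholders. The function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def year_reaching(target, years, values):
    """Year at which the fitted line reaches `target`, or None if it never rises to it."""
    slope, intercept = linear_fit(years, values)
    if slope <= 0:
        return None  # a flat or declining trend never reaches a higher target
    return (target - intercept) / slope


# Placeholder series (NOT the data behind Figure 4): fraction of species
# with at least one sequence available, by year.
years = list(range(1993, 2010))
fraction_sampled = [0.02 + 0.035 * (year - 1993) for year in years]

projected_year = year_reaching(1.0, years, fraction_sampled)
print(projected_year)
```

The caveats in the text map directly onto this sketch: if the remaining species or nodes are harder than those already covered, the true curve flattens and any fitted trend will project completion too early.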