The Holy Grail of phylogenetic research is to reconstruct the evolutionary relationships for all of life, a total currently, if vaguely, estimated at somewhere between 3 and 100 million species [1]. Hundreds of years of systematic research have arguably yielded a reasonable idea of how the main branches of the Tree of Life are arranged, at least for eukaryotic organisms. Nevertheless, numerous problematic branches remain, as does the question of how the main branches come off the tree trunk. Much of the challenge going forward will be to fill in the scaffold formed by these major branches to provide a complete evolutionary picture of the approximately 1.7 million (and counting) described species on the planet.
In our attempts to derive the Tree of Life, the limiting factor has always been the amount of data available to us. Prior to the molecular revolution, phylogenetic data were comparatively limited, with only morphology being generally available (ignoring early molecular data sources such as DNA-DNA hybridization or immunogenetic and serological data). These data were sufficient to provide us with a general overview, but the resolution was often limited. So, for example, whereas the main groupings, or orders, of eutherian mammals were relatively uncontroversial, their relationships to one another were not. Similarly, the large morphological differences between the animal phyla made them easy to distinguish, but often difficult to place relative to one another.
The growing abundance of DNA sequence data, whether in the form of individual genes, expressed sequence tags (ESTs), or whole genome data, has brought a wealth of new information into play, sometimes contradicting classical hypotheses and often providing more resolution than morphology could alone. The past 15 years have witnessed an explosive growth in sequencing effort and in public databases of sequence information such as GenBank and its sister databases EMBL and DDBJ. Indeed, the amount of information in GenBank is staggering. As of April 2011, the nearly 200 million sequence records in the traditional and whole genome divisions comprised nearly 320 billion bases for almost 250,000 species. The growing use of next-generation and next-next-generation sequencing technologies promises to accelerate the growth rate even further.
Despite this, the amount of molecular data remains limited and our data matrices are very sparse, even for well-sampled groups such as green plants or mammals [2]. Paradoxically, sparse and limited as the data are, they are still stretching the limits of what we can currently process, from the point of view of both data collection and actual phylogenetic analysis. The phylogenomic pipeline developed by Peters and colleagues [3] represents the latest in a series of automated solutions (for example, [4–7], in addition to those listed in [3]) to both of these problems, all of which are geared to facilitate large-scale phylogenomic analyses using publicly available sequence data. Using their pipeline, they were able to construct a comprehensive molecular tree of over 1,100 species of Hymenoptera (bees, ants, wasps, and sawflies; Figure 1), presenting the state of the art with respect to hypotheses of evolutionary relationships within the group (Figure 2).
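To make the notion of a sparse data matrix concrete, the following minimal sketch computes the occupancy of a taxon-by-gene matrix, that is, the fraction of cells for which a sequence is actually available. It is an illustration only and is not taken from the pipeline of Peters and colleagues [3]; the function name `matrix_occupancy` is hypothetical, and the assignment of genes to species in the toy data is invented for the example.

```python
from typing import Dict, Set


def matrix_occupancy(sampled: Dict[str, Set[str]], genes: Set[str]) -> float:
    """Return the fraction of taxon-by-gene cells that hold a sequence.

    `sampled` maps each taxon to the set of genes sequenced for it;
    `genes` is the full set of genes (columns) in the matrix.
    """
    total_cells = len(sampled) * len(genes)
    if total_cells == 0:
        return 0.0
    # Count only genes that belong to the matrix columns.
    filled = sum(len(taxon_genes & genes) for taxon_genes in sampled.values())
    return filled / total_cells


# Toy data: which genes are available for which species is an
# assumption made purely for illustration, not real sampling data.
toy_matrix = {
    "Apis mellifera": {"COI", "28S", "EF1a"},
    "Nasonia vitripennis": {"COI", "28S"},
    "Formica rufa": {"COI"},
    "Vespula vulgaris": {"28S"},
}
all_genes = {"COI", "28S", "EF1a", "wingless"}

occupancy = matrix_occupancy(toy_matrix, all_genes)
missing = 1.0 - occupancy
print(f"occupancy: {occupancy:.4f}, missing: {missing:.4f}")
```

In this toy case 7 of the 16 cells are filled, so more than half of the matrix is missing data. The same calculation applied to a matrix with thousands of species and genes typically gives a far lower occupancy, which is the sparseness referred to above.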
Taken together, these phylogenomic pipelines epitomize the revolution that the combination of molecular sequence data and bioinformatics has wrought on phylogenetic research in the past 20 years. Our evolutionary trees are larger and more complete than ever, giving hope that the Tree of Life might be realised soon. However, obstacles still stand in our way.