The aim of the present investigation was to develop a bioinformatics pipeline for retrieving, processing, filtering, editing and analyzing large amounts of sequence data from GenBank in a phylogenetic context. Instead of using supertree approaches to explore existing data (see, for example, [19, 49]), we relied on a direct reanalysis of the sequence data. Smith et al.  presented an alternative approach that they called a "mega-phylogeny approach", which also directly uses sequence data. It includes an a priori selection of gene regions of interest and an a priori separation of sequences into alleged monophyla with the aims of reducing the size of the supermatrix and improving alignment quality. A number of taxon-specific studies have also made use of GenBank sequence data, but those studies focused on specific genes (see, for example, [51, 52]). We intended to avoid a priori decisions. In our pipeline, we suggest solutions for almost any obstacle that may appear along the way from sequence retrieval to tree reconstruction under the ML optimality criterion. In various regards, our approach is an extension and improvement of earlier efforts [2, 4]. It offers an extended degree of automation in steps such as downloading from GenBank, sorting of sequences and translating and backtranslating sequences [steps I, b.II, b.V and VI] (Figure 1). Also, our approach includes improved quality management, such as by automatically checking the GenBank sequences for strand polarity and annotation, by masking problematic alignment regions and by handling compositional heterogeneity [steps b.III, VII and XI] (Figure 1). Our data selection steps [for example, steps III, IX and XV] (Figure 1) guarantee standardized levels of the density of the data set and of sequence overlap between included species. By choosing a minimum sequence overlap of 100 positions, we attempted to find a reasonable compromise between sequence overlap and number of species in the analysis. 
A larger overlap would have led to a significant decrease in the number of species in our phylogenetic tree. Furthermore, the present study is an update in terms of tree reconstruction facilities. We have, for the first time, applied an ML algorithm to such a large amount of GenBank data [step XVII] (Figure 1). Our approach is more general and independent of the taxonomic group. Finally, our bioinformatics solution is transparent and user-friendly. We provide all new scripts with respective comments and detailed manuals as part of this publication so that the pipeline is ready for use by anybody interested. In the following paragraphs, we discuss the results of our exemplary pipeline run with Hymenoptera data.
Data set and analysis
One of the main characteristics of data sets when combining sequence data from independently conducted investigations is data scarcity; that is, the lack of data overlap. Data distribution in supermatrices is unbalanced, and, as a consequence, there is a huge amount of missing data. However, data sets do not necessarily have to be complete to provide phylogenetic information. In fact, there is evidence that even with very low coverage, reliable phylogenetic estimates can be obtained (see, for example, ). The sheer proportion of missing data is not decisive as long as the number of characters scored is sufficient to correctly place the taxa in the tree . Accordingly, we tried to cope with the problem of data scarcity by ensuring a minimum sequence overlap between taxa and a standardized data set density [steps III, IX, XIII, XIV and XV] (Figure 1). Still, our Hymenoptera data matrix is very large and exhibits very low coverage (1.5%). This is a direct consequence of the characteristics of the original sequence information present in GenBank. A large number of species for which only few sequences are available contrasts with a small number of species for which the transcriptome, the mitochondrial genome or even the entire nuclear genome have been sequenced. By combining all of these data in a single analysis, this data set will inevitably become large and unbalanced and will suffer from low overlap between taxa. Irrespective of the fact that sequencing is getting cheaper and faster and that phylogenomic data will rapidly increase the size of data sets, the data characteristics described herein are still expected to prevail in the near future. The challenge is to find optimal subsets for phylogenetic analysis in order to explore available information and to subsequently identify and fill the most severe gaps via target-specific sequencing. 
Accordingly, one of the goals of our approach has been to identify unstable nodes and to suggest future foci for molecular phylogenetic studies in Hymenoptera, so that further sequencing can proceed effectively, economically and quickly.
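The two quantities discussed above, the coverage of the supermatrix and the sequence overlap between taxa, can be illustrated with a minimal Python sketch. It is not taken from the pipeline scripts: the function names, the toy matrix and the choice of "-" and "?" as missing-data symbols are assumptions made for illustration, and the pairwise check is only one plausible way to apply an overlap threshold such as the 100 positions used here.

```python
from itertools import combinations

# Assumption: these symbols mark missing data in the aligned supermatrix.
MISSING = frozenset("-?")


def coverage(matrix):
    """Return the fraction of matrix cells that hold an actual character."""
    total = sum(len(seq) for seq in matrix.values())
    filled = sum(1 for seq in matrix.values() for c in seq if c not in MISSING)
    return filled / total if total else 0.0


def overlap(seq_a, seq_b):
    """Count aligned positions at which both sequences have data."""
    return sum(
        1 for a, b in zip(seq_a, seq_b) if a not in MISSING and b not in MISSING
    )


def pairs_below_overlap(matrix, min_overlap=100):
    """List taxon pairs whose shared positions fall below the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(matrix), 2)
        if overlap(matrix[a], matrix[b]) < min_overlap
    ]


# Hypothetical toy matrix: taxon name -> aligned sequence of equal length.
toy_matrix = {
    "taxon_A": "ACGT--",
    "taxon_B": "--GTAC",
    "taxon_C": "ACGTAC",
}
```

With a real supermatrix, `coverage` would return a value comparable to the 1.5% reported above, and `pairs_below_overlap` would point to taxon pairs that are connected by too few shared positions.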
For tree reconstruction, we performed supermatrix ML analyses. To the best of our knowledge, this is the largest set of eukaryotic real data studied using ML analysis. Past studies that utilized very large data sets applied supertrees or parsimony analyses. For example, McMahon and Sanderson  and Thomson and Shaffer  applied maximum parsimony analyses with supermatrices in their pipelines, but stated that they based this decision mainly on speed and computational capacity. However, with the latest program version of RAxML, which implements partitioned analysis, rapid bootstrap functions and the ability to run analyses in parallel, even very large data sets can be analyzed in a reasonable amount of time. In the next few years, systematic biologists' access to multicore computers will get easier and broader, and high-performance computing (HPC) will become routine. At the moment, subsets should be constrained in size to allow ML analysis. During our work, we set an approximate maximum of 1,500 taxa and 100,000 sites. Phylogenetic analyses of subsets of this size take a maximum of two weeks on a fully parallelized HPC unit such as the one that we used. Unless one wants to analyze data sets that are significantly larger than ours, there is no computational or speed argument left for using supertree or parsimony methods instead of ML analyses. Accordingly, our approach was designed to prepare data for ML analysis. However, if a user wants to apply other algorithms for tree reconstruction (for example, maximum parsimony) or to adjust parameters (for example, to seek an extension of exploration of tree space or a comparison between inferred trees), the supermatrix produced by our pipeline can be used just as well (after step XVI) (Figure 1).
The phylogeny of Hymenoptera
We have restricted our results and discussion to (1) new contributions to the phylogeny of major lineages within Apocrita and to the monophyly and phylogeny of Proctotrupomorpha, (2) the recovery of some noncontroversial relationships and (3) the diagnosis of persistent problems and possible solutions. Phylogenetic relations within Hymenoptera are far too numerous and complex to be exhaustively discussed. The complete trees in Additional files 5 and 6 can be consulted for lower systematic level relationships.
In the following subsections, we repeatedly refer to single species as "misplaced". This means that their position as inferred in our trees clearly contradicts previous results from taxonomic as well as morphological and molecular phylogenetic studies. Accordingly, the phylogenetic positions of these taxa were considered artefacts and were excluded from discussion of topologies.
Major lineages within Apocrita
Within Apocrita, our analysis suggests a topology of Stephanoidea + (Ichneumonoidea + (Proctotrupomorpha + (Evanioidea + Aculeata))) (with misplacement of a single Vanhorniidae as sister to Stephanoidea being ignored) (Figure 3). Stephanoidea was inferred to be sister group to all other Apocrita in the morphological analyses of Vilhelmsen et al. . Our analysis gives additional support for this relationship. The Ichneumonoidea are monophyletic in our trees. (Misplacement of a single Trigonalidae as sister to Braconidae is ignored.) Ichneumonoidea has been suggested as sister group to Aculeata by Rasnitsyn , a relationship that found only moderate support from Vilhelmsen et al.  and was not retrieved by most recent analyses (see, for example, [16, 21, 24, 55, 56]). Our trees corroborate the results of most analyses cited above and suggest a rejection of the clade Aculeata + Ichneumonoidea. Instead, we found Evanioidea to be sister group to Aculeata in our trees. A sister group relationship of Evanioidea and Aculeata has been suggested only by the combined morphological and molecular analysis by Sharkey et al. , and there are currently no convincing morphological synapomorphies that would support this clade. However, despite low branch support, we consider it quite possible that the Evanioidea are the long-sought sister group to the Aculeata and suggest further investigation of this particular clade. Rasnitsyn  introduced the supertaxon Evaniomorpha, which includes Evanioidea, Ceraphronoidea, Megalyroidea, Trigonaloidea and Stephanoidea. We argue against the monophyly of Evaniomorpha, as our data support Stephanoidea as sister taxon of the remaining Apocrita (corroborating Vilhelmsen et al. ). 
We cannot provide substantial information on the position of the superfamilies Ceraphronoidea, Megalyroidea and Trigonaloidea, because their representatives are either included solely in the extended, possibly less reliable tree 2 (Ceraphronoidea) or obviously misplaced (Megalyroidea and Trigonaloidea).
In our analyses, Proctotrupomorpha s.l. (that is, sensu Rasnitsyn 1988 ) was retrieved when again ignoring a few misplaced taxa. In tree 1, Proctotrupomorpha comprises Chalcidoidea, Platygastroidea and Cynipoidea (all of which are monophyletic, forming Cynipoidea + (Platygastroidea + Chalcidoidea)). In tree 2, more representatives of Proctotrupomorpha s.l. are present, and the inferred topology suggests the following relationships: Cynipoidea + (Platygastroidea + (Mymarommatoidea + (Diaprioidea + Chalcidoidea))). This contradicts the often proposed sister group relationship between Mymarommatoidea and Chalcidoidea (see, for example, [24, 57, 58]; but see the ambiguity in ). A sister group relationship between Diaprioidea and Chalcidoidea was retrieved in the molecular analyses of Castro and Dowton , although their taxon sampling lacked Mymarommatoidea, and it was also retrieved by Heraty et al. . Our study is one of the first to include Mymarommatoidea in a molecular phylogenetic analysis, but the position of Mymarommatoidea in our analysis is not well supported and the group is represented only in the less reliable tree 2. A position of Chalcidoidea outside Proctotrupomorpha was recently proposed by Sharanowski et al.  based on the analysis of 24 putative orthologous genes (derived from ESTs) from a small number of taxa. We regard this position as unlikely based on our own results and those of previous molecular studies that provided respective parts of our data set [16, 21, 56]. The most recent morphological or combined morphological and molecular analyses also contradict an origin of Chalcidoidea outside Proctotrupomorpha [17, 57].
Recovery of noncontroversial relationships
We evaluated the reliability of the inferred phylogenetic trees by the recovery of phylogenetic relationships that are largely considered noncontroversial. We found positive indications in tree 1. Specifically, our results are consistent with the generally accepted paraphyly of "Symphyta" (see, for example, ) and with the generally accepted monophyly of Apocrita and Aculeata (see, for example, [24, 28]) (with misplacement of one Megalyridae within Aculeata being ignored). Also, we retrieved the noncontroversially monophyletic superfamilies Apoidea, Chalcidoidea, Cynipoidea, Evanioidea, Ichneumonoidea and Siricoidea. However, two crucial taxa were not represented in tree 1: Xyelidae and Orussidae. When they are added to the data set to infer tree 2, they are misplaced. The Xyelidae are found as a sister group to Pamphilioidea (Figure 4). This position is not very likely, as the sister group relationship of Xyelidae and the remaining Hymenoptera is well supported [25–27]. The Orussidae, which have a key position within Hymenoptera evolution as sister group of Apocrita, are placed at the base of Apocrita along with some Proctotrupoidea taxa (Figure 4). However, the clade Orussidae + Apocrita is well established and supported by morphological and molecular data (see, for example, [13, 17, 18, 57]). This demonstrates the necessity of sequence overlap definitions and shows that the positions of reincluded taxa (indicated by asterisks in Figure 4 and Additional file 6) have to be discussed with caution. The backbone of the tree, with its major splits, however, remains largely unaffected by adding taxa that do not fulfill our overlap criteria.
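The kind of check described above, asking whether a group that is accepted as monophyletic is recovered as a clade while single misplaced taxa are ignored, can be sketched as follows. This is a hypothetical illustration rather than part of our pipeline: the rooted tree is represented as nested Python tuples, and the tree, the taxon labels and the function names are assumptions.

```python
def leaf_set(node):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def clades(node):
    """Yield the leaf set of every clade in a rooted tree of nested tuples."""
    yield leaf_set(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group, ignore=()):
    """Check whether some clade contains exactly `group`.

    Taxa listed in `ignore` (for example, taxa regarded as misplaced) are
    disregarded on both sides of the comparison.
    """
    skip = frozenset(ignore)
    target = frozenset(group) - skip
    return any(clade - skip == target for clade in clades(tree))


# Hypothetical toy tree; taxon X is nested inside the group {C, D, E}.
toy_tree = ("A", ("B", (("C", "X"), ("D", "E"))))
```

In this toy tree, the group C, D, E is not recovered as a clade unless X is passed in `ignore`, which mirrors the way single misplaced species are set aside in the discussion of our topologies.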
Diagnosis of persistent problems and possible solutions
With the aid of our trees, we identified several persistent problems in the Hymenoptera tree. While the available sequence data already cover all major lineages of Hymenoptera, they are unequally distributed and there is poor overlap among taxa. This contradiction between taxonomic breadth and genomic depth in the data of Hymenoptera is in accordance with the conclusions of Sanderson  in his evaluation of the phylogenetic signal in Eukaryota. The large amount of missing data and the low taxonomic overlap between mitochondrial and nuclear data in our sets call for a solution. To get more independent markers and to close the taxonomic gap between mitochondrial and nuclear data, we suggest EST studies (nuclear genes) for taxa with completely sequenced mitochondrial genomes and sequencing of mitochondrial genomes of those taxa for which we already have a large number of nuclear sequence data available.
An obvious problem for solving higher-level relationships within Hymenoptera is the underrepresentation of the small superfamilies Megalyroidea, Trigonaloidea, Ceraphronoidea and Mymarommatoidea. Another highly problematic issue concerns those families of Proctotrupoidea that we currently cannot map on the phylogenetic tree. Any additional data regarding these taxa in terms of species and genes will be of great value.
As extensive EST studies are still expensive, we also recommend target-specific amplification of nuclear coding genes. With the prospect of new primer design tools (J. Borner, C. Pick, T. Burmester, unpublished data), amplification and sequencing of a data set of, for example, 22 taxa (all superfamilies) and 50 nuclear coding genes can be accomplished in a reasonable amount of time and at reasonable cost. Taxon sampling should again be based on taxa with completely sequenced mitochondrial genomes.