A hidden Markov model-based classification of Rab protein sequences
We started our classification by collecting known Rab protein sequences from 21 widely diverse species. After aligning the sequences and removing those of low quality, we obtained a first set of approximately 500 sequences from which we reconstructed a phylogenetic tree and clustered similar sequences (see Methods for details). We extracted a conserved core Rab motif from the constructed alignment, partitioned it into sub-alignments according to the different initial sequence groups and constructed an HMM for each one. We then used these models to search genome and expressed sequence tag (EST) databases for further Rab sequences. With the expanded dataset we repeated the analysis and refined the set of specific HMMs. This process was repeated until no further improvement in the classification was achieved. Finally, we supplemented our dataset with data from further species, obtaining more than 7600 different Rab sequences from more than 600 species representing all major eukaryotic phyla. For 384 of these species we found genome projects listed in the Genomes On Line Database (GOLD) [21]. The analysis contained 248 metazoans (of which 131 were included in GOLD: 37 genomes were complete, 3 were draft, and 91 were incomplete), 166 fungi (25, 4, and 90, respectively), 81 plants (16, 1, and 30), 19 apicomplexans (8, 0, and 7), 20 heterokonts (7, 1, and 5), and 11 kinetoplastids (4, 0, and 4).
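The iterative search-and-refine loop described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: profile HMMs are replaced by simple k-mer sets, the re-clustering step is omitted, and the sequences, group names, threshold and function names are all hypothetical. The sketch only shows the control flow: build one model per group, search a database, add confident hits, and repeat until no new sequences are recruited.

```python
def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_model(seqs, k=3):
    """Toy stand-in for a profile HMM: the union of all k-mers in a group."""
    model = set()
    for seq in seqs:
        model |= kmers(seq, k)
    return model


def score(model, seq, k=3):
    """Fraction of the query's k-mers that are found in the model."""
    query = kmers(seq, k)
    return len(query & model) / len(query) if query else 0.0


def iterative_classification(seed_groups, database, threshold=0.5, max_rounds=10):
    """Grow each group from its seeds until a round recruits nothing new."""
    groups = {name: set(seqs) for name, seqs in seed_groups.items()}
    for _ in range(max_rounds):
        # Rebuild every model from the current (possibly expanded) groups.
        models = {name: build_model(seqs) for name, seqs in groups.items()}
        classified = set().union(*groups.values())
        added = 0
        for seq in database:
            if seq in classified:
                continue
            # Assign the sequence to the best-scoring model, if good enough.
            best = max(models, key=lambda name: score(models[name], seq))
            if score(models[best], seq) >= threshold:
                groups[best].add(seq)
                added += 1
        if added == 0:  # no further improvement: stop iterating
            break
    return groups


# Hypothetical toy data (not real Rab sequences).
seeds = {"groupA": ["AAAAAC"], "groupB": ["GGGGGT"]}
database = ["AAAACC", "ACCCCC", "GGGGTT", "TTTTTT"]
result = iterative_classification(seeds, database)
```

In this toy run, the sequence ACCCCC is recruited to group A only in the second round, after AAAACC has broadened the group A model, whereas TTTTTT matches neither model and remains unclassified.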
Identifying the Rabs of the last eukaryotic common ancestor
Our classification analysis identified a set of 20 basic Rab types (Figure 1). Deducing which of these Rab types were present in the LECA was complicated by the fact that the placement of the root in the eukaryotic tree of life is still under debate [22–24]. What is perhaps the most widely accepted hypothesis places the eukaryotic root between the bikonts and unikonts [23, 25, 26]. Other possible scenarios reported to date include rooting the tree within or close to Excavata [27, 28], and studies of rare genomic changes have led to the proposal of a root between Archaeplastida and all other eukaryotes [29]. We inspected the effect of these different hypotheses on the set of likely LECA Rabs, and found that the presence of the same 20 Rabs in the LECA was supported by two of the three trees (Figure 2). The exception was the tree based on the Archaeplastida outgroup, in which the number of LECA Rabs was reduced to 14 of these 20. However, it should be noted that in this model the position of the excavates was uncertain, with the authors stressing the need for 'extreme caution' [29], and if the excavates were placed with the Archaeplastida, then the number of LECA Rabs would again be 20.
It is also formally possible that Rabs have moved between kingdoms by horizontal gene transfer, but the trees for the 20 Rabs generally fit well with species evolution. In addition, horizontal gene transfer after endosymbiosis would probably have involved a non-unikont species as the donor and so would not have increased the number of Rab families in the bikonts. If one divides eukaryotic phyla into the four proposed supergroups (Unikonta, Excavata, Archaeplastida and SAR+CCTH [28]), then of the 20 putative LECA Rabs, 19 are present in at least three (Figure 3 and Additional file 1). The exception was Rab29, whose only occurrences outside of unikonts were found in Naegleria gruberi (an excavate) and Thecamonas trahens (a member of the phylum Apusozoa, whose relationship to other phyla is unclear) [23, 30]. Thus, Rab29 seems to be the only equivocal Rab in our proposed set of 20 LECA Rabs. Given these observations and caveats, we have based our further discussion on the cautious assumption that all 20 basic Rab types were in the LECA.
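The dependence of the inferred LECA set on the root position can be sketched with a simple rule: in the absence of horizontal gene transfer, a Rab is inferred to have been in the LECA if it occurs on both sides of the root. The code below is a simplified illustration of that reasoning rather than the analysis performed here. The rooting 'within or close to Excavata' is approximated as Excavata versus all other supergroups, Apusozoa are omitted because their placement is unclear, and 'RabHypo' is a hypothetical unikont-specific Rab added only for contrast. The presence data for Rab1 and Rab29 follow the text.

```python
SUPERGROUPS = {"Unikonta", "Excavata", "Archaeplastida", "SAR+CCTH"}

# Supergroups in which each Rab has been found.
PRESENCE = {
    "Rab1": {"Unikonta", "Excavata", "Archaeplastida", "SAR+CCTH"},
    "Rab29": {"Unikonta", "Excavata"},  # Excavata: Naegleria gruberi
    "RabHypo": {"Unikonta"},            # hypothetical, for illustration only
}

# Each rooting hypothesis names the supergroups on one side of the root;
# all remaining supergroups form the other side (a simplifying assumption).
ROOTS = {
    "unikont/bikont": {"Unikonta"},
    "Excavata": {"Excavata"},
    "Archaeplastida": {"Archaeplastida"},
}


def inferred_in_leca(presence, one_side):
    """True if the Rab occurs on both sides of the proposed root."""
    other_side = SUPERGROUPS - one_side
    return bool(presence & one_side) and bool(presence & other_side)


table = {
    rab: {root: inferred_in_leca(found_in, side) for root, side in ROOTS.items()}
    for rab, found_in in PRESENCE.items()
}

for rab, calls in table.items():
    print(rab, calls)
```

Under this rule Rab29 is assigned to the LECA for the unikont/bikont and Excavata rootings but not for the Archaeplastida rooting, which mirrors why the last hypothesis yields a smaller LECA set.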
For each of these 20 basic types we constructed HMMs. However, this basic set has diversified over time in all major phyla, and especially in metazoans. For the majority of these diversifications, it is clear from which LECA Rab they descended; however, we found a number of lineage-specific Rab types that have diverged substantially from their ancestors, making them difficult to place with one of the LECA Rabs. This problem is especially pronounced in protozoan lineages for which only a few species have been sequenced to date. For a few metazoan duplications, we used an approach based on sequence similarity to identify their ancestral Rab type (see Methods for details). To better distinguish basic Rabs from those that developed later, we supplemented the generated HMMs with 15 further models that recognize metazoan-specific types, and especially those that were difficult to place. Other eukaryotic lineages with extensive genome sequence coverage, such as fungi and plants, do not possess such diversified Rab sets, and so for these cases we did not need to generate further specific HMMs.
All our models showed at least 95% sensitivity and positive predictive value (see Additional file 2). Thus, the structure of the Rab family tree developed from our analysis appears to be robust, allowing considerable confidence in its implications. We have implemented a web interface called the Rab Database, which provides access to the collected information and allows searches with new proteins against our HMM-based classifiers (http://bioinformatics.mpibpc.mpg.de/rab/).
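For reference, sensitivity and positive predictive value for a single model can be computed from true and predicted labels as sketched below. The labels are hypothetical toy data, not results from Additional file 2, and the function name is illustrative.

```python
def sensitivity_and_ppv(true_labels, predicted_labels, rab_type):
    """Per-type sensitivity TP/(TP+FN) and positive predictive value TP/(TP+FP)."""
    tp = fn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == rab_type and predicted == rab_type:
            tp += 1  # correctly recognized member of this type
        elif truth == rab_type:
            fn += 1  # member of this type that the model missed
        elif predicted == rab_type:
            fp += 1  # sequence of another type wrongly assigned here
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Hypothetical labels for six sequences.
truth = ["Rab1", "Rab1", "Rab1", "Rab5", "Rab5", "Rab7"]
calls = ["Rab1", "Rab1", "Rab5", "Rab5", "Rab5", "Rab7"]

rab1_sens, rab1_ppv = sensitivity_and_ppv(truth, calls, "Rab1")
rab5_sens, rab5_ppv = sensitivity_and_ppv(truth, calls, "Rab5")
```

In this toy example the Rab1 model misses one true Rab1 sequence (lower sensitivity), and the same sequence becomes a false positive for the Rab5 model (lower positive predictive value).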
The history of Rab evolution shows some striking features, including an unexpectedly large number of different Rabs present in the LECA. These twenty LECA Rabs can be arranged into six supergroups, which has implications for the evolution of the membrane system prior to the LECA. During the subsequent divergence of eukaryotes many of the LECA Rabs have been lost in particular lineages, whereas other families have expanded, and in some cases it is possible to correlate this with loss or gain of particular cellular processes. These points are discussed in detail below.
The last eukaryotic common ancestor had a large repertoire of Rabs
Our analysis identifies a set of 20 Rab proteins that are likely to have been present in the LECA based on the arguments above (Figure 1, Figure 3). Of these proteins, Rab1, Rab2, Rab4, Rab5, Rab6, Rab7, Rab8, Rab11, Rab18, Rab21, Rab23, and Rab28 are known likely candidates, and Rab14, Rab32, and RabL4 have just recently been added to this list [16]. To these proteins we can now add Rab22, Rab24, Rab29, RabX1, and Rab7L1. This total of 20 is larger than that found in many extant eukaryotic phyla, such as plants and fungi. This indicates that the endomembrane system of the LECA was relatively complex, consistent with studies of proteins involved in other cellular processes such as the cytoskeleton, all of which have indicated that the LECA was a particularly complex and sophisticated cell [24, 31].
It is well established that the LECA must have had a Golgi apparatus, and the capacity for both endocytosis and phagocytosis [32]. This would be consistent with the roles that many of these Rabs have been reported to play in extant eukaryotes. Thus Rab1, Rab2, and Rab6 are on the Golgi, and Rab8 acts in Golgi to plasma membrane traffic, while Rabs 4, 5, 7, 11, 14, 21, and 22 are in the endosomal system, and are thus likely to have acted in endocytic and phagocytic processes of the LECA cell. These processes will have allowed uptake of food sources, as well as recycling of components to the cell surface, and possibly the fusion of a contractile vacuole to expel water [33, 34]. Likewise, RabL4 (also known as intraflagellar transport (IFT)27) and Rab23 are known to be involved in cilia/flagella formation or function in extant eukaryotes, consistent with other proteins specific to these structures being found in all eukaryotic kingdoms and thus present in the LECA [35].
Of the remaining seven LECA Rabs, Rab32 and its paralogs in extant eukaryotes (for example, Rab38) are well established to be involved in forming lysosome-related organelles (LROs) such as melanosomes, platelet dense granules and alveolar lamellar bodies [36, 37]. Rab29 and Rab7L1 are distantly related (see below), and the latter has been proposed to also have a role in granules derived from the endosomal system [38]. It can only be speculated as to how the LECA may have used LROs, but obvious possibilities include pigment granules to block sunlight, secretory granules to combat competitors, or granules that fused with phagosomes to aid killing of phagocytosed bacteria in the manner of neutrophils [39]. Of the few non-metazoans that have conserved this Rab, Phytophthora species contain a large number of granules in their motile zoospores, which are released during encystation to rapidly reform the cell wall, concomitant with a loss of motility [40].
This leaves four Rabs (Rab18, Rab24, Rab28 and RabX1) whose role even in extant eukaryotes is unclear, although their ancient origins suggest that they may have more fundamental roles than previously thought. Of these, Rab18 is the best characterized, with several reports suggesting a role on the endoplasmic reticulum (ER), either in lipid droplet formation or in Golgi to ER traffic [41, 42]. However, despite Rab18 being well conserved in many eukaryotic phyla, human patients lacking Rab18 do not show obvious defects in lipid storage or general secretion, and so the role of Rab18 remains unclear [43]. One possibility is that Rab40 has a partially redundant role with Rab18 in humans, as it seems to have arisen from Rab18 during metazoan evolution (see below). Of the remaining three LECA Rabs, Rab24 has been linked to autophagy, but cannot be obligatory for this process as it is conserved in only a very few non-metazoan phyla, and Rab28 has been linked to endosome function in trypanosomes, but little is known of its role in metazoans [44, 45]. Finally, RabX1 is conserved in only a few sequenced genomes across several phyla, including various invertebrates but not vertebrates. In Drosophila, it is expressed primarily in the nervous system, and a P-element insertion next to the gene perturbs development of the peripheral nervous system [46, 47].
Rab expansion during the period of evolution leading to the last eukaryotic common ancestor
Many eukaryotic-specific genes were already present in the LECA as families, which indicates that they must have duplicated and diverged in the period between the earliest eukaryote(s) and the LECA. The Rabs are a particularly extreme case of this, and our analysis allows insights into how this family emerged. During the iterative refinement of our analysis, it became apparent that the twenty LECA Rabs fell into a set of six larger supergroups, suggesting a primitive pre-LECA eukaryote with just six Rabs. The tree (Figure 1) shows the relationship between the different LECA Rabs, and provides statistical support for this observation. Interestingly, each of these six supergroups comprised Rabs that are mostly associated with one particular process, consistent with diversification from a simpler system in which there was a single Rab for each of the following: secretion (group I), early endosomes (group II), late endosomes (group III), recycling from endosomes to the surface (group IV), recycling from endosomes to Golgi (group V), and traffic associated with cilia/flagella (group VI).
Rab family evolution during the diversification of the eukaryotes
Comparing the LECA Rabs with those present in extant eukaryotic phyla revealed two striking patterns. Firstly, many Rabs have been lost in at least some lineages, despite being conserved in others, with relatively few Rabs seeming to be indispensable (Figure 4). Secondly, some Rab families have expanded greatly in particular kingdoms (Figure 3). In both cases, it was sometimes possible to correlate these genetic changes to changes in cell structure and function. Below we discuss these two aspects of Rab evolution in detail.
Rab losses during eukaryotic evolution
Although the LECA seems to have had at least 20 Rabs, most of these are clearly not essential for eukaryotic life, as many have been lost in one or more kingdoms (Figure 4). Indeed, of the twenty Rabs present in the LECA, only five seem near-indispensable, in that they are present in almost all well-characterized genomes reported to date. These are Rab1, Rab5, Rab6, Rab7 and Rab11, which correspond precisely to five of the six supergroups present in the LECA, with the remaining supergroup probably being that linked to cilia/flagella. This again suggests that membrane traffic is fundamentally underpinned by just five Rabs. Note that even these 'indispensable' five show some losses in a few very reduced or parasitic eukaryotes such as the microsporidian Encephalitozoon cuniculi, and although all five are present in the budding yeast Saccharomyces cerevisiae, remarkably three of the five (Rab5, Rab6 and Rab7) are not actually essential for the viability of this yeast. In addition to these core five Rabs, there are three Rabs that are only rarely lost in free-living eukaryotes: Rab2 and Rab18 (both lost in budding yeasts) and Rab8 (lost in kinetoplastids). It may be that these three can be lost because the 'indispensable' Rab in their group can sometimes take over their role, and, indeed, it has been shown for S. cerevisiae that a single chimeric Rab can perform the roles of both Rab1 and Rab8 [48, 49].
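The correspondence between the five near-indispensable Rabs and the supergroups can be checked with a small lookup. The group membership below is taken from the supergroup assignments used in this section, with the cilia/flagella-associated Rabs as group VI; the variable names are illustrative only.

```python
# Supergroup membership of the 20 LECA Rabs, as assigned in this section.
SUPERGROUPS = {
    "I": ["Rab1", "Rab8", "Rab18"],
    "II": ["Rab5", "Rab21", "Rab22", "Rab24", "RabX1"],
    "III": ["Rab7", "Rab23", "Rab29", "Rab32", "Rab7L1"],
    "IV": ["Rab2", "Rab4", "Rab11", "Rab14"],
    "V": ["Rab6"],
    "VI": ["Rab28", "RabL4"],
}

# The five Rabs present in almost all well-characterized genomes.
CORE_FIVE = {"Rab1", "Rab5", "Rab6", "Rab7", "Rab11"}

# For each supergroup, list the members that belong to the core five.
core_by_group = {
    group: [rab for rab in members if rab in CORE_FIVE]
    for group, members in SUPERGROUPS.items()
}

total_rabs = sum(len(members) for members in SUPERGROUPS.values())

for group, core in core_by_group.items():
    print(group, core if core else "no near-indispensable member")
```

Each of groups I to V contributes exactly one core Rab, and group VI contributes none.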
Consistent with the LECA having a large Rab repertoire, there is evidence that several eukaryotic kingdoms contain more Rabs than do many of that kingdom's individual species (Figure 4). For instance, basal fungi have already lost seven of the twenty LECA Rabs (Rab14, Rab21, Rab22, Rab24, Rab28, Rab29 and Rab7L1), but it is apparent that further Rabs were lost during the expansion of the fungal kingdom. All non-basal fungi have also lost Rab23, Rab32 and RabL4. In addition, Rab18 has been lost in all Saccharomycotina, while Rab2, Rab4 and RabX1 are still present in Yarrowia lipolytica but seem to have been lost in all later Saccharomycotina, including S. cerevisiae, leaving the latter with orthologs of only six of the LECA Rabs. For plants, only 14 out of the 20 LECA Rabs are present in the Chlorophyta. Interestingly, all more derived plants have lost three additional LECA members (Rab24, Rab28 and RabL4). In addition, angiosperms have also lost Rab23, and so even these complex multicellular eukaryotes lack almost half of the Rabs present in the LECA.
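The successive fungal losses listed above can be tallied with set arithmetic, which also recovers the six LECA Rabs retained in S. cerevisiae. This is only a bookkeeping sketch of the losses stated in the text; the variable names are illustrative.

```python
LECA_RABS = {
    "Rab1", "Rab2", "Rab4", "Rab5", "Rab6", "Rab7", "Rab8", "Rab11",
    "Rab14", "Rab18", "Rab21", "Rab22", "Rab23", "Rab24", "Rab28",
    "Rab29", "Rab32", "RabL4", "RabX1", "Rab7L1",
}

# Losses along the lineage leading to S. cerevisiae, in order.
LOSSES = [
    ("basal fungi", {"Rab14", "Rab21", "Rab22", "Rab24", "Rab28", "Rab29", "Rab7L1"}),
    ("non-basal fungi", {"Rab23", "Rab32", "RabL4"}),
    ("all Saccharomycotina", {"Rab18"}),
    ("Saccharomycotina after Y. lipolytica", {"Rab2", "Rab4", "RabX1"}),
]

remaining = set(LECA_RABS)
history = []
for stage, lost in LOSSES:
    remaining -= lost  # remove the Rabs lost at this stage
    history.append((stage, len(remaining)))

for stage, count in history:
    print(f"{stage}: {count} LECA Rabs retained")
print(sorted(remaining))
```

The retained set at the end of the tally is Rab1, Rab5, Rab6, Rab7, Rab8 and Rab11.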
In some cases, it is possible to correlate loss of particular sets of Rabs with changes in cellular organization. Rabs linked to cilia and flagella (RabL4 (IFT27) and Rab23) have been lost in those organisms that have also lost these structures, in particular most plants and fungi. This loss is often associated with the organism gaining a cell wall, and such a structure would also prevent phagocytosis of large objects. This provides a possible explanation for the loss of Rab14 in these lineages, as this Rab is recruited to phagosomes in both mammals and Dictyostelium [33, 50]. In other cases, genome compaction may have driven Rab loss, as particularly small Rab repertoires are present in organisms with compacted genomes such as the microalga Ostreococcus, or the budding yeasts, although it is also possible that this is an indirect consequence of a general simplification of the intracellular membrane-traffic systems of these organisms.
Rab expansions
While many Rabs show a history of extensive independent losses, there have also been many cases in which Rab families have expanded by gene duplication and diversification. Metazoans show expansions in twelve different LECA Rabs, plants in eight, heterokonts in five, apicomplexans in three, fungi in two and kinetoplastids in just one (Figure 3; Figure 4). We discuss these expansions briefly for each of the six supergroups of Rabs.
Group I: Rab1, Rab8, Rab18
This supergroup has a particularly complex history of expansions and losses. Rab1 is the 'indispensable' member of the supergroup, and seems to have been duplicated a number of times in metazoans (Rab19, Rab30, Rab33, Rab35, RabX6). The addition of Rab35 seems to predate the rise of metazoans, as we could also identify it in Capsaspora owczarzaki, which branched off from the pre-metazoan lineage after fungi but before choanoflagellates. RabX6 appeared in metazoans, but is one of the few Rabs that is lost in vertebrates. Rab1 has duplicated independently in most other phyla, including apicomplexans (Rab1B) and heterokonts (Rab1B) [51]. In angiosperms, there are three different Rab1 proteins, apparently expanded from one in Bryophyta.
Rab8 has probably the most complex history of the LECA Rabs. It has been independently triplicated in heterokonts and, similar to Rab1, it has been duplicated in angiosperms but not in Bryophyta. More strikingly, it has a large set of duplications in metazoans (Rab3, Rab10, Rab15, Rab26, Rab27, Rab34, Rab44, Rab45, RabX4). Rab44 and Rab45 are two of three Rabs that have an additional domain, with two EF hand motifs in the N-terminal region of the protein. Many of these Rab families have expanded further in vertebrates, but a few, such as RabX4, have been lost. Although RabX4 is present in insects and not in vertebrates, it is also present in a few genomes of more primitive metazoans (for example, Amphimedon queenslandica, Nematostella vectensis and Strongylocentrotus purpuratus), indicating that it arose early in metazoan evolution and was widely lost. In addition, Rab8 seems to have undergone another duplication in deuterostomes (Rab12) and two in vertebrates (Rab8b, Rab13).
By contrast, Rab18 seems to have a rather simple evolutionary history, being present in all major clades and only being lost in Saccharomycotina. It shows a duplication in angiosperms and also duplication in bilateria (Rab40), which expanded further in vertebrates (Rab40b, Rab40c) and then in primates (Rab40a, Rab40aL).
Group II: Rab5, Rab21, Rab22, Rab24 and RabX1
Rab5 is apparently indispensable, and also has the most complex evolutionary history of this supergroup. It seems to have been duplicated independently in basal fungi after the loss of their flagellum (Ypt52), and also in apicomplexans (Rab5B) and kinetoplastids (Rab5B). In addition, it has quadrupled in vertebrates (Rab5a, Rab5b, Rab5c, Rab17). Similarly, there are independent duplications in Saccharomycotina (Ypt10) and in Angiospermae (Rab5B).
Rab21 and Rab22 have no major expansions (Rab22 has a duplication in vertebrates), and have been lost in members of several phyla. Rab24 is rather unusual as it is well conserved in metazoans, where it has also been duplicated (Rab20), but it seems to have been lost in most other species. However, we could detect members of this group in a small but diverse group of species outside of unikonts, indicating that it was present in the LECA (Figure 3; see Additional file 1). The final member of this group, RabX1, is one of the few that seem to have been lost in vertebrates; however, it is well conserved in insects and nematodes, as well as being present in heterokonts and fungi (it is present in some Saccharomycotina species, but not in S. cerevisiae).
Group III: Rab7, Rab23, Rab29, Rab32 and Rab7L1
Rab7 seems to be indispensable, and is also the member of the family that shows the most expansions. There are four Rab7s in angiosperms, and metazoans have a Rab7 relative, Rab9. Previous analysis has dated the duplication that formed Rab9 to the rise of the metazoans [52]. However, we identified a conserved set of fungal proteins that must either be an independent duplication of fungal Rab7 or must belong to the Rab9 subfamily, in which case Rab9 appeared with the opisthokonts. Our phylogenetic analysis indicates that the latter possibility is more likely. In addition, we found this special fungal version of the protein not only in a wide variety of fungi, but also in very basal fungal species, strengthening our view that these sequences are indeed members of the Rab9 family.
Rab23 and Rab32 show extensive patterns of loss, which in the case of Rab23 correlate well to loss of cilia or flagella, consistent with functional work on this protein [35]. Rab23 shows no duplications, whereas Rab32 has been duplicated in vertebrates (Rab38). Rab7L1 and Rab29 have been independently lost in fungi, plants, apicomplexans, and kinetoplastids, but for Rab7L1, we identified a number of related sequences in diverse non-unikont species and in metazoans, including humans (Figure 3 and Additional file 1). As noted above, the assignment of Rab29 as a LECA Rab rather than being unikont-specific is based on Rab29 being in one excavate with a high score to our Rab29 model, with a Rab29-like sequence also being found in the phylogenetically elusive Apusozoa. Interestingly, Rab29 has also been lost in vertebrates. The relationship between Rab7L1 and Rab29 is not completely clear, as several metazoan species seem to have lost one or other group. However, we found sequences from choanoflagellates and from C. owczarzaki corresponding to both groups. We have used the names Rab29 and Rab7L1 for these two groups because they have been used for members of the two groups previously; however, this usage has not been consistent. Thus, if the distinction is confirmed by more genomes, adopting new names for these two proteins may help avoid confusion.
Group IV: Rab2, Rab4, Rab11 and Rab14
Rab11 seems to be the indispensable member of this supergroup, and is certainly the only Rab from this family that is still present in Saccharomycotina, where it has duplicated (Ypt31/32). Overall, Rab11 presents a unique pattern of duplications. Apicomplexa contain two Rab11 genes, and there has been a large expansion in plants. There are three different Rab11 genes in Bryophyta (Rab11A, Rab11C, Rab11D); these genes expanded to 10 versions in angiosperms, and Arabidopsis thaliana possesses 26 different Rab11 genes. In heterokonts, we can observe some independent duplications of Rab11 genes, including oomycetes, which have seven Rab11 genes.
Rab2 is present in all major eukaryotic phyla and has been independently duplicated in metazoa (Rab39), heterokonts (Rab2B) and angiosperms (Rab2B), but has been lost in all more derived Saccharomycotina species. By contrast, Rab4 and Rab14 have no major duplications (Rab4 has been duplicated in vertebrates) and have both been lost in several phyla.
Group V: Rab6
Rab6 seems to be indispensable, but we found no expansions in the genomes examined apart from one in Angiospermae and, like many other Rabs, an expansion in vertebrates (Rab6b). In the latter case, there was a later expansion to generate Rab41, which is present in primates and dolphins, and a further expansion due to a reactivated retrotransposed pseudogene, which seems to be specific to Hominidae (Rab6c, [53]).
Group VI: Rab28 and RabL4
Neither of these Rabs is indispensable, with many losses and no expansions apart from a duplication of Rab28 in heterokonts. RabL4 (also known as IFT27) has been clearly linked to the function of flagella and cilia, and its pattern of conservation fits well with the presence of these important but not ubiquitous structures [54]. Interestingly, the phylogenetic profile of Rab28 is rather similar to that of RabL4, suggesting that its function may also be linked to cilia or flagella. RabL4 has no prenylation site, and our analysis gives strong support to separating this group from the other Rabs (Figure 1). This may indicate that these groups are not classic Rab proteins, but would be better placed somewhere between the Ran and Rab supergroups of the Ras family.
What has driven the expansion of particular Rab families?
The striking expansion of Rabs in humans seems to have been caused by two separate processes: one at the start of metazoan evolution and a second that occurred at the appearance of vertebrates. Comparing the Rabs of metazoans with those of choanoflagellates, which are the closest relatives of metazoans, the appearance of 17 new Rab proteins can be seen in metazoans (Figure 5). Interestingly, the large majority (14) of these new Rab proteins appeared in group I, indicating an early diversification of the exocytic pathway. It seems likely that this is associated with several changes in cellular organization. Not only are metazoans multicellular, but they also form polarized cell sheets, hence proteins need to be trafficked to different domains on the cell surface. In addition, even early metazoans seem to have had neuroendocrine cells that communicated with their neighbors, presumably by regulated release of compounds stored in vesicles [55]. These extra processes will have required diversification of the Rab machinery that is involved in delivery to the cell surface, and indeed, many of the relatives of Rab8 are believed to be involved in such processes. Several members of this family seem to be expressed primarily in the brain, and in some cases have been clearly shown to function in neurotransmitter release (Rab3, Rab15, Rab27 and RabX4) [47]. It also seems possible that more complex patterns of exocytosis would have required a concomitant elaboration of pathways of endocytic recycling, which may account for the expansion of the Rab1 family to generate more Rabs associated with Golgi functions, such as Rab19/43, Rab30 and Rab33. By contrast, plants and fungi show no expansion of Rab8, suggesting that it was protein-mediated interactions between cells without cell walls, rather than multicellularity per se, which drove the Rab explosion in early metazoans.
The second major expansion in Rab numbers took place during the rise of the vertebrates, most likely through two rounds of whole genome duplications, expanding their set from 38 to 62 Rab proteins (with further duplications in primates, taking the total in humans to 66). In contrast to the first event, the effect of the expansion in vertebrates can be seen in all major groups, and indeed, many other vertebrate gene families show such an expansion from one gene to two, three or four paralogous genes (Figure 2; Figure 4). In the case of the Rabs these extra paralogs may have acquired subtly different roles, but in some cases, the paralogs are expressed in different sets of tissues, which may have contributed to their conservation [16, 56].
Rationalizing Rab expansions in other kingdoms is made more difficult by the more restricted number of genomes available. However, it is at least clear that in plants several Rab families have also expanded, with the most notable being Rab11, which diversified even in Bryophyta. This also explains how the Angiospermae were able to encompass such a large variety of Rab11 proteins (we found 10 different proteins as a general pattern, and A. thaliana has 26 Rab11-related proteins). Rab11 is involved in recycling back to the plasma membrane, and diversification of such routes to the plasma membrane seems likely to be important for cell plate formation during mitosis, for polarized growth of root hairs and pollen tubes, and for the targeted delivery of auxin transporters that underlies the spatial organization of some multicellular structures in plants [57–59].