Skip to main content
  • Research article
  • Open access
  • Published:

Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes



Genomes of bacteria and archaea (collectively, prokaryotes) appear to exist in incessant flux, expanding via horizontal gene transfer and gene duplication, and contracting via gene loss. However, the actual rates of genome dynamics and relative contributions of different types of event across the diversity of prokaryotes are largely unknown, as are the sizes of microbial supergenomes, i.e. pools of genes that are accessible to the given microbial species.


We performed a comprehensive analysis of the genome dynamics in 35 groups (34 bacterial and one archaeal) of closely related microbial genomes using a phylogenetic birth-and-death maximum likelihood model to quantify the rates of gene family gain and loss, as well as expansion and reduction. The results show that loss of gene families dominates the evolution of prokaryotes, occurring at approximately three times the rate of gain. The rates of gene family expansion and reduction are typically seven and twenty times less than the gain and loss rates, respectively. Thus, the prevailing mode of evolution in bacteria and archaea is genome contraction, which is partially compensated by the gain of new gene families via horizontal gene transfer. However, the rates of gene family gain, loss, expansion and reduction vary within wide ranges, with the most stable genomes showing rates about 25 times lower than the most dynamic genomes. For many groups, the supergenome estimated from the fraction of repetitive gene family gains includes about tenfold more gene families than the typical genome in the group although some groups appear to have vast, `open' supergenomes.


Reconstruction of evolution for groups of closely related bacteria and archaea reveals an extremely rapid and highly variable flux of genes in evolving microbial genomes, demonstrates that extensive gene loss and horizontal gene transfer leading to innovation are the two dominant evolutionary processes, and yields robust estimates of the supergenome size.


Comparative genomics of bacteria and archaea (collectively, prokaryotes) reveals extensive variation of gene repertoires, which is thought to reflect a highly dynamic regime of genome evolution [1]-[6]. Prokaryotic genomes present a wide variety of genome sizes, from approximately 150 kb in some intracellular endosymbionts of insects [7] to approximately 13 Mb in the largest cyanobacteria [8] and myxobacteria [9]. This broad range of genome sizes is reflected in the diversity of gene repertoires: only a tiny minority of prokaryotic genes are (nearly) universal whereas the great majority are present in small subsets of genomes [6],[10],[11]. Substantial differences in genome size and gene content are often evident between species within the same genus [12]-[14] and even between strains of the same species [15]-[17].

The enormous diversity of the gene repertoires among bacteria and archaea implies that prokaryotic genomes exist in a state of incessant flux, expanding through horizontal gene transfer (HGT), gene duplication and possibly de novo emergence of genes, and contracting via gene loss [6],[18]-[20]. Beyond the comparative genomic observations, estimates of the genome dynamics rates in prokaryotes have been obtained via explicit evolutionary reconstruction using maximum parsimony or maximum likelihood (ML) methods. These approaches typically employ the pattern of gene presence/absence in a set of species, which is mapped onto a guide phylogenetic tree [19]-[23]. All evolutionary reconstructions performed with widely different groups of bacteria and archaea infer various combinations of vertical inheritance, gene loss and gain. Averaged over long spans of evolution, gene loss appears to be a more common process than gene gain as shown for different groups of bacteria and archaea [19],[21],[23]-[25]. In obligate intracellular parasitic bacteria as well as in parasitic archaea, gene loss is the single dominant evolutionary process [7],[26]-[28]. However, genome reduction, often described as streamlining, also prevails in the evolution of bacterial saprophytes [24],[29] and some free-living microbes, particularly in marine environments [23],[25],[30],[31]. These findings are compatible with the observation of the overall deletion bias in the evolution of prokaryotic (as well as eukaryotic) genomes [32],[33]. Furthermore, gene loss has been reported to be a more uniform, `clock-like' process than gene gain, which shows a stronger episodic character [19],[29]. Taken together, these findings have inspired the concept of genome reduction as the `default' evolutionary process counterbalanced by episodes of gene gain, primarily via HGT [34].

The discovery of the diversity of the gene repertoires, even among bacteria and archaea with closely related nucleotide sequences, led to a conceptual shift in microbiology. Under the new view of the microbial world, the key unit of microbial evolution is not the genome of an individual bacterium or archaeon but rather the pangenome of a prokaryote species [17],[35]-[38]. The term `pangenome' has been used alternatively to describe either the superset of the genes present in the genomes of all sequenced isolates of a given species, or the entire pool of genes that are potentially available for acquisition to the given species (or an otherwise defined monophyletic group of genomes) over the course of its evolution. Hereinafter, to avoid ambiguity, we restrict the use of `pangenome' to denote the empirically detected superset of genes and use the term `supergenome' [39] to refer to the entire gene reservoir. Obviously, the supergenome of any microbial species cannot be characterized directly and can only be estimated from the analysis of samples of the relevant genomes. Such estimates have pointed to vast, `open' supergenomes for most prokaryotes because analysis of newly sequenced isolates did not show any signs of saturation of new gene discovery [36],[37],[40]. However, for a minority of bacteria, the supergenomes appear to be `closed', with new genomes adding few if any new genes [36],[37],[40]. Attempts to estimate microbial supergenome sizes have been made using either statistical approaches or explicit mathematical models of the evolutionary process. In particular, Snipen et al. [41] estimated the supergenome size for several bacteria using a binomial mixture approach [42] to approximate the gene frequency distribution in an analyzed set of genomes. This analysis, unlike the earlier approximations, yielded closed and relatively small supergenomes that were only several fold larger than a typical microbial genome. A recent model of microbial (pan)genome evolution by gene replacement, known as the Infinitely Many Genes model, under which the replacing genes are drawn from a formally infinite reservoir [43],[44], also suggested a close but much larger supergenome for the cyanobacterium Prochlorococcus. On the whole, the accuracy of the available supergenome estimates and the validity of the underlying models remain uncertain. Thus, delineation of supergenomes across the diversity of bacteria and archaea and elucidation of the factors that underlie the supergenome evolution are major tasks for evolutionary microbial genomics.

HGT is at present universally recognized as a major factor in the evolution of prokaryotes and a key source of innovation and adaptation to new environments and lifestyles [5],[18],[45]-[47]. However, attempts at quantification have yielded widely different estimates of the prevalence of HGT. Some early studies that involved a small number of genomes resulted in modest estimates, which implied a limited importance for HGT compared to vertical inheritance [21],[48],[49]. More recent phylogenomic analyses that included larger sets of genomes widely representative of the bacterial and archaeal diversity, generally reveal a much greater level of HGT [50]-[56]. For example, a quantitative assessment of the contributions of vertical inheritance and HGT to the evolution of prokaryotes based on the topological comparison of thousands of phylogenetic trees suggested that nearly two-thirds of evolutionary events originate from HGT [55]. Furthermore, evidence has been presented that HGT rather than gene duplication is the principal contributor to the evolution of gene families in prokaryotes [57].

We were interested in taking a comprehensive census of various events of genome dynamics across the diversity of prokaryotes. To obtain reliable rates of these events, we sought to analyze groups of multiple, closely related genomes so that robust identification of gene orthology and estimation of phylogeny and evolutionary distances would be ensured. The rates of gene dynamics were estimated for 35 clusters of prokaryotic genomes that make up an updated version of the collection of alignable tight genome clusters (ATGCs) [58] using a phylogenetic birth-and-death ML model [22],[23]. The results reveal extremely rapid genome dynamics, albeit with broad ranges of gene loss and gain rates among prokaryotic species, and indicate an overall tendency to genome contraction, which is partially compensated by gene gain via HGT. We show that the overall flux of genes is the defining parameter of genome dynamics and provide estimates of the supergenome size for diverse groups of prokaryotes.


Genome dynamics in prokaryotes: extensive gene family loss and gain dominate over family expansion and reduction

We employed an updated version of the ATGCs [58] to reconstruct the genome evolution for 34 groups of bacteria and one group of archaea (Additional file 1: Table S1 and Additional file 2: Figure S1). From the clusters of orthologous genes (COGs) that are associated with each ATGC, we derived the phyletic patterns (i.e. the patterns of presence/absence of gene families in each genome) as well as data on the number of members of each family including all paralogous genes. These patterns were mapped onto the phylogenetic tree of the respective ATGC (see Methods) and employed for the evolutionary reconstruction using Count, an ML method based on a phylogenetic gene birth-and-death model [22]. It should be emphasized that the COGs derived under this procedure account for the entire pangenome of each ATGC, and thus include genes shared between any number of organisms within the ATGC as well as genes unique to a single genome (singletons). Thus, no biases that could result from using a subset of genes preselected on certain criteria, such as, for example, the degree of sequence conservation, affect the estimates described below.

The rates of four types of elementary evolutionary event (hereinafter called genome dynamics events or GDEs) were analyzed: (i) gain of a gene family not present in the ancestor node (hereinafter, gain, for brevity), (ii) loss of all gene family members (loss), (iii) expansion of a gene family, i.e. addition of one or several family members (expansion) and (iv) reduction of a gene family, i.e. elimination of one or several family members (reduction). In mechanistic terms, gains are most likely to originate from HGT, and perhaps on rare occasions, from de novo gene birth; extreme divergence of duplicated genes that could lead to the appearance of a new family is highly unlikely on the short evolutionary scale of an ATGC. Gene family expansion is a combination of bona fide gene duplication and acquisition of a new member of a pre-existing family via HGT (here we do not attempt to distinguish between these two sources).

The number of GDEs of each type associated with each tree branch shows a strong significant positive correlation with the branch length (Figure 1, Additional file 2: Figure S2 and Additional file 3: Table S2). Thus, all these events appear to occur under a genomic clock, by analogy to the traditional molecular clock of sequence evolution [59]. The accuracy of the genomic clock was found to be the highest for gene family gain and the lowest for gene family loss (Figure 1) although this difference has to be taken with caution due to the limited number of data points (ATGCs). A bootstrap analysis (1,000 replicates) of the GDE rates shows that the estimated rates are not disproportionately affected by a small number of outliers and also supports the observation on the wider scatter of the loss rate compared to the gain rate (Additional file 2: Figure S3). Previous analyses performed on genomes representing diverse branches of bacteria, have suggested that gene loss was more of a clock-like process than gene gain, which showed a tendency to occur in isolated episodes [19]. The present results suggest that this trend is not manifest at the short evolutionary scale of the ATGCs, compatible with more recent observations of an apparent clock-like character of HGT, at least among universally conserved prokaryotic genes [60].

Figure 1
figure 1

The clock of genome dynamics. The figure shows the correlation of branch lengths and number of (a) gains, (b) losses, (c) expansions and (d) reductions. It excludes singletons, i.e., gains in the terminal branches of the tree. Both x and y axes are have a logarithmic scale. All P < 0.0001. BL, branch length or number of nucleotide substitutions per site.

The demonstration of the existence of the genomic clock justifies the estimation of the rates of gain, loss, expansion and reduction per unit of nucleotide substitution and in what follows, we primarily use this measure.

Table 1 shows the rates of each type of GDE for the 35 ATGCs. Five major trends are immediately apparent:

  1. i).

    The rates of gain and loss are approximately an order of magnitude greater than the rates of expansion and reduction.

  2. ii).

    The loss rate typically is nearly threefold higher than the gain rate.

  3. iii).

    The expansion rate is almost 1.5 times higher than the reduction rate.

  4. iv).

    The rates of all types of GDE vary within a broad range, spanning almost two orders of magnitude (Figure 2a); the ratios between the rates of different events vary within similar ranges (Figure 2b,c). This trend is independent of the number of species in an ATGC (Additional file 2: Figure S4).

  5. v).

    The rates of genome change are remarkably high, typically tens of thousands of GDEs per nucleotide substitution per site, or tens to hundreds of GDEs per substitution per gene (Table 1, Additional file 3: Table S2).

Table 1 Rates of the four types of genome dynamics events a
Figure 2
figure 2

Distributions of the genome dynamics rates across the ATGCs. (a) Rates of gain, loss, expansion and reduction per nucleotide substitution per site. (b) Loss/gain and reduction/expansion ratios. (c) Gain/expansion and loss/reduction ratios. G/E, gain/expansion; L/G, loss/gain; L/R, loss/reduction; R/E, reduction/expansion.

On the whole, the dominant process in prokaryote genome evolution is the loss of gene families, i.e. genome contraction. This finding provides definitive quantitative support for the previous, more qualitative conclusions of the importance of genome streamlining in evolution, particularly among prokaryotes [19],[21],[23],[25],[34],[61]. However, due to the high variation in the rates of different processes of genome evolution, this general trend is reversed in some of the analyzed groups of microbes (Table 1). In particular, despite the overall dominance of gene family loss, there are clear gainers among the analyzed bacteria, such as Enterobacter, Klebsiella, Campylobacter and Listeria. Furthermore, the switch from the loss mode of evolution to the gain mode appears to occur in the course of evolution of some relatively close-knit groups of bacteria (Table 1 and Figure 3). Specifically, the Enterobacteriaceae and Campylobacterales clades include both gainer and loser ATGCs (Table 1 and Figure 3). Rapid, extensive gene loss (loss + reduction to gain + expansion ratio >10) is particularly prominent in Legionella and Corynebacterium (Table 1 and Figure 3).

Figure 3
figure 3

Distribution of the gain, loss, expansion and reduction rates over the evolutionary tree of prokaryotes. The tree is from MicrobesOnline [62]. The areas of the circles are proportional to the rates of the respective events to a logarithmic scale. The numbers in parenthesis indicate the number of species in the ATGC. The ATGCs with episodes of rapid gene gain are denoted with *(<10% of branches) or **(>10% of branches). ATGC, alignable tight genome cluster.

Estimates of gene dynamics rates and phylogenetic depth

When the events are analyzed on individual tree branches, the rates of all four types of GDE strongly and negatively correlate with the phylogenetic depth of the respective branch (Figure 4). Most likely, this observation reflects the fact that Count only estimates the number of GDEs for those gene families that survived in at least one extant genome. Genes that were present at some point during the history of the ATGC but have been subsequently lost, are missed altogether so that the corresponding GDEs do not contribute to the calculations. These findings point to the high prevalence of such transient GDEs in microbial evolution and suggest that our reported estimates (Table 1) represent the low bound of the actual gene flux.

Figure 4
figure 4

Dependence of the rates of gains, losses, expansion and reductions on phylogenetic depth. (a) Gains, (b) losses, (c) expansions and (d) reductions per unit of branch length vs the phylogenetic depth. The figure excludes singletons, i.e., gains in the terminal branches of the tree are not represented. Both x and y axes have a logarithmic scale. The phylogenetic depth is measured in the number of nucleotide substitutions per site.

To estimate the extent to which Count underreports the number of GDEs, we used the dependence of the estimated rates on the depth of the branch mid-point. Within each ATGC, the observed rates were normalized to 100% at the depth of 0.0001 substitutions per site (Figure 4; Additional file 2: Figure S5). The results indicate that at a phylogenetic depth of 0.1 (the deepest branches among all ATGCs), Count might underestimate the rates by up to 40%. At the more typical tree branch depth, the expected deficit is much lower. Thus, we expect our estimates to be accurate within a factor of 2 at most. Furthermore, we show that the relative GDE rates are consistent independent of the phylogenetic depth (Additional file 2: Figure S6). In agreement with the trend observed for individual branches, the ATGC-wide estimate of the gene flux rate (overall GDE rate; see below) also shows significant negative correlation with the total phylogenetic tree depth, estimated as the mean root-to-leaves distance (Additional file 2: Figure S7).

Factors of microbial genome dynamics

Despite the substantial variability among individual ATGCs, the genome dynamics rates appeared to be (nearly) randomly scattered across the diversity of prokaryotes (Figure 3) and in particular showed no significant differences between the three major bacterial phyla represented by multiple ATGCs, namely Proteobacteria, Actinobacteria, and Firmicutes (Figure 5a). Thus, the trends of genome contraction (loss and reduction) and expansion (gain and expansion) appear to hold for most of the lineages across the entire bacterial domain.

Figure 5
figure 5

Dependence of the rates of gain, loss, expansion and reduction on bacterial taxonomy and lifestyle. (a) Rates of the four types of event for Actinobacteria, Firmicutes and Proteobacteria. (b) Rates of the four types of event for bacteria and archaea with three different lifestyles. FHA, facultative host-associated; FL, free-living; P, obligate intracellular parasite.

We also compared the rates of gain, loss, expansion and reduction between microbes with three lifestyles, free-living, facultative host-associated and obligate intracellular parasite (Figure 5b). Perhaps unexpectedly, given the typically much smaller genomes of the intracellular parasites, the overall relationship between the rates of the four types of event did not depend on the lifestyle: Loss > Gain > > Expansion > Reduction. Nevertheless, among the three groups, free-living bacteria present the highest rates of gain (not significant), expansion (P < 0.01 compared to parasites) and reduction (P < 0.01 compared to parasites), whereas obligate intracellular parasites and facultative host-associated bacteria show a modest but significantly higher rate of gene family loss than free-living bacteria (P < 0.05) (Figure 5b). Thus, on the whole, genomes of free-living prokaryotes appear to be more dynamic than genomes of intracellular parasites which is compatible with the greater exposure to HGT in extracellular compared to intracellular habitats.

Perhaps surprisingly, no connection was found to exist between the rates of the GDEs and the strength of the selection pressure on protein sequences estimated as the ratio of non-synonymous to synonymous substitution rates in protein-coding genes (dN/dS) [63], which shows a robust correlation with various ATGC-wide characteristics [64] (Additional file 2: Figure S8). Neither did we detect any dependence of the four GDE rates on the genomic GC content (Additional file 2: Figure S9), notwithstanding the strong positive correlation between the GC content and genome size [65],[66], or with the genome shuffling rate (see Additional file 2: Figure S10 and Methods for the details of the shuffling rate calculation).

Flux and balance in prokaryotic genome evolution

We further examined possible correlations between different types of GDEs. Strikingly, relatively high, statistically significant, positive correlation was shown to exist between all types of event (Figure 6). These findings suggest that the dynamics of genome evolution in prokaryotes is largely determined by the overall gene flux.

Figure 6
figure 6

Correlations between the rates of gain, loss, expansion and reduction.

To further investigate key factors of genome dynamics, we performed principal component analysis (PCA) of the rates of gain, loss, expansion and reduction (Figure 7a and Additional file 2: Figure S11). In this case, the PCA was remarkably efficient in revealing major trends of genome evolution. The first principal component explained approximately 64% and the second principal component approximately 19% of the variance in the GDE rates, indicating that each of these composite variables reflected a major trend of genome evolution. The loadings plot (Figure 7b) shows that all four rates (gain, loss, expansion and reduction) contributed to the first principal component with the same sign. Accordingly, this principal component appears to reflect the overall gene flux, which thus appears to be the key determinant of genome dynamics. The second principal component was dominated by gene family gain and loss, which contribute with opposite signs (Figure 7b). Thus, this component reflects the balance of family loss and gain.

Figure 7
figure 7

Principal component analysis of the rates of gains, losses, expansions and reductions. (a) XY-plot of the two first two principal components. (b) Principal component analysis loadings. Comp., component.

To account for the difference in the contributions of different GDEs to the net extent of the genome change (Loss > Gain > > Expansion > Reduction), we use the sum of all event rates as the measure of the total gene flux and the gain + expansion to loss + reduction ratio as the measure of the balance.

Gene flux but not the balance of the GDE positively and significantly correlates with the genome size (Figure 8 and Additional file 2: Figure S12). Combined with the observations on the transient character of many genomic events, this finding implies that the larger microbial genomes are products of recent and conceivably short-lived gene accretion.

Figure 8
figure 8

Correlation between gene flux and genome size. The horizontal axis shows the median number of genes in a genome in an ATGC. ATGC, alignable tight genome cluster; GDE, total gene flux (number of genome dynamics events per nucleotide substitution per site).

Evolutionary dynamics of different functional classes of genes

We further estimated the overall gene flux and the rates of each type of GDE for broad functional categories of genes as defined in the COGs [67],[68]. All rates showed largely consistent ranking of functional categories (Figure 9), in agreement with the overall positive correlation between them (Figure 6). Predictably, genes encoding protein components of the translation system, which constitute the great majority of the (nearly) universal genes in cellular life forms [69]-[71], make up the category with the lowest flux per gene (mostly static), closely followed by enzymes of nucleotide and coenzyme metabolism as well as molecular chaperones, which also tend to be highly conserved in evolution (Figure 9). In contrast, by far the most dynamic class included genes of mobile elements, followed by uncharacterized genes and genes involved in defense functions (Figure 9). It appears likely that numerous genes in the uncharacterized category actually are unidentified components of the mobilome or defense systems [72], suggesting that, as one might expect, these two categories jointly make up the most dynamic component of microbial genomes. The difference between the per gene flux rates of the mobilome components and the translation genes was approximately fourfold, and when the gain rates were compared, the difference was greater than an order of magnitude. These findings are generally compatible with the patterns of long-term gene conservation [6] and emphasize the heterogeneity of gene dynamics in bacterial and archaeal genomes.

Figure 9
figure 9

Genome flux by COG functional categories. (a) Flux. (b) Gain. (c) Loss. (d) Expansion. (e) Reduction. Designations of the functional categories (modified from [67]): C, energy production and conversion; D, cell division; E, amino acid metabolism and transport; F, nucleotide metabolism and transport; G, carbohydrate metabolism and transport; H, coenzyme metabolism; I, lipid metabolism; J, translation; K, transcription; L, replication and repair; M, membrane and cell wall structure and biogenesis; N, secretion and motility; O, post-translational modification, protein turnover and chaperone functions; P, inorganic ion transport and metabolism; Q, biosynthesis, transport and catabolism of secondary metabolites; R, general functional prediction only (typically, prediction of biochemical activity); S, function unknown; T, signal transduction; U, intracellular trafficking and secretion; V, defense systems; X, mobilome. COG, cluster of orthologous genes.

Supergenome size estimation

The results described above indicate that prokaryote genome evolution is an extremely dynamic process that involves rapid gain and loss of numerous gene families. This process can be naturally represented as sampling of a gene pool by the evolving genomes, which draw new gene families at random. We denote this gene pool the supergenome of an ATGC, to differentiate it from the pangenome, the empirically observed superset of genes of a group of genomes. The size of the supergenome is unknown but can be estimated from the number of families that have been gained multiple times: obviously, with a vast supergenome, the chance to draw the same family again is effectively nil, whereas with a supergenome only slightly exceeding the typical genome size of a given group, many families will be gained repeatedly. We developed an ML model to estimate the size of the supergenome from the number of repeated gains; the estimates were obtained for two models of genome evolution, namely the simplest conceivable model, with a uniform probability of drawing a gene from the supergenome and a more complex model with a power-law distribution of the drawing probabilities (see Methods for details).

The results obtained with the two approaches were consistent and showed a wide spread of estimated supergenome sizes, from approximately four genomic equivalents (hence numerous repeated gains) to effectively open supergenomes (no or very few repeated gains) (Table 2 and Figures 10a and 11). In ATGCs with closed supergenomes (Table 2), the characteristic size of the supergenome was estimated at about an order of magnitude larger than the typical number of families in a genome (Figure 10b). For these closed supergenomes, the estimates were highly reliable, with the confidence intervals typically less than 10% of the estimate (Table 2).

Table 2 Supergenome size estimates
Figure 10
figure 10

Comparison of genome, pangenome and estimated supergenome sizes. (a) Median genome vs supergenome size. (b) Density distribution of median genome, pangenome and supergenome size.

Figure 11
figure 11

Distribution of the median genome, pangenome and estimated supergenome sizes over the evolutionary tree of prokaryotes. The tree is from MicrobesOnline [73]. Areas of the circles are proportional to the number of genes in the respective genomes (median), pangenome, a006Ed supergenome. FHA, facultative host-associated; FL, free-living; O, open supergenome; P, obligate intracellular parasite.

The estimated supergenome size positively correlated with the mean genome size in an ATGC, indicative of a trend of genome growth dependent on the pool of available genes (Figure 10a). By contrast, and perhaps unexpectedly, the estimated supergenome size, expressed either as the number of families or relative to the genome size (in genomic units), does not correlate significantly with the gene flux or with gain, loss, expansion and reduction rates separately (Additional file 2: Figure S13). In other words, microbes with large supergenomes can, at least transiently, evolve in a relatively static regime and conversely microbes with rapidly evolving genomes can have small supergenomes.

Supergenome size estimates show a strong positive correlation with the total tree depth (Additional file 2: Figure S14a). This dependence is likely to stem from at least two factors. First, the deeper the divergence of the ATGC, the more variation in the history of the environments, and therefore, in the adaptive requirements and the available gene pool, is expected. Second, our supergenome estimate procedure is intrinsically dependent on the number of multiple gene gains, derived from the phyletic patterns. As shown above (Figure 4a; Additional file 2: Figures S5), the gene gain rate is underestimated in the deeper trees and branches; accordingly, the number of multiple gene gains is underestimated as well, resulting in inflation of the supergenome size estimates.

Supergenome size estimates also show a strong and significant negative correlation with the ATGC-wide estimate of the dN/dS ratio (Additional file 2: Figure S14b). This dependence might reflect genuine relationships between the characteristic population dynamics of the respective group, which affect the strength of the purifying selection on the protein-coding genes [64]. However, the dN/dS ratio estimates themselves are negatively correlated with the ATGC tree depth (Additional file 2: Figure S14c) and genome size [64] (Additional file 2: Figure S14d). Thus, the apparent connection between the supergenome size and the protein-level selection might be due, at least in part, to indirect effects.

There are clear connections between the obtained supergenome size estimates and the microbial lifestyle. Thus, nine of the eighteen free-living microbes in the analyzed set but only two of the seventeen host-associated microbes were estimated to possess open supergenomes (chi-squared, P = 0.015) (Table 2 and Figure 11). This substantial excess of open supergenomes among free-living organisms could be expected as the result of their greater exposure to diverse gene pools. Also in line with the lifestyles of the respective microbes, by far the smallest supergenomes were estimated for intracellular (Chlamydia) and extracellular (Mycoplasma) parasites with highly reduced genomes; intracellular parasites with somewhat larger genomes (Rickettsia) appeared to have larger supergenomes, suggestive of a distinct evolutionary history (Table 2 and Figure 11).

Previously published supergenome size estimates (mostly referred to as pangenome size estimates by the authors) can be broadly classified into three categories. Estimates based on the sampling curve use an approximation for the number of new genes brought into the pangenome by additional genomes with a (semi-arbitrary) function [35]-[37],[40]. If such a function converges to a finite total number of genes, a supergenome is considered closed and a quantitative estimate can be produced, otherwise the supergenome is considered open. Another category of estimates employs an explicit sampling model that assumes random independent sampling of genes from a common pool into individual genomes [41],[42]. The third category employs a tree-based model of evolution of a group of genomes where sampling is performed along the tree branches [43],[44]. The latter two approaches explicitly or implicitly fit the sampling model parameters to the observed distribution of gene frequencies in the analyzed set of genomes.

A comparison of the published estimates with those obtained in this work (Additional file 4: Table S3) shows that our estimates are consistently higher than those obtained with the models that assume that genomes are random independent collections of genes. The likely explanation is that the independence assumption inherent in these models leads to overestimates of the number of multiple gene gains by counting such gains for a family found in each genome within a clade, whereas the most likely scenario is that this family had been gained only once. Overestimation of the number of multiple gains necessarily leads to underestimation of the supergenome size. The estimates based on sampling curves yield open supergenomes for the majority of microbial groups. These approaches attempt to derive the exponent of the approximating power functions from the limited available samples, leading to much uncertainty. The fact that the infinitely many genes model could not be rejected, even in a tree-based analysis [44], might indicate that many microbial supergenomes are close to the closed/open boundary. Thus, it appears possible that neither the sampling curves nor the distributions of gene frequencies contain sufficient information to produce robust supergenome estimates. The approach employed here used the number of multiple gene gains directly inferred from the superposition of phyletic patterns of genes over a phylogenetic tree and, being independent of the assumptions of other models, could potentially improve the reliability of supergenome estimates, compatible with the narrow confidence intervals (Table 2).


The rapid dynamics of prokaryotic genome evolution

The analysis of prokaryotic genome dynamics described here unequivocally shows that rapid gene flux involving extensive loss of genes and families, partially balanced by gain of new gene families via HGT, is the principal mode of microbial evolution. Indeed, the estimated rates of gene family gain and loss in some groups of bacteria are such that multiple genes appear to come and go over the time required for a single nucleotide substitution to occur in an evolving gene. These findings are compatible with experimental results demonstrating bacterial genome contraction in real time [73].

Given that the great majority (typically, around 90%) of nucleotide substitutions in evolving microbial genomes are silent [64],[74], and even among those that affect protein sequences many are effectively neutral, it seems indisputable that rapid gene flux is the most important route of change in prokaryotic evolution. Notably, the present results show substantially greater rates of gene family loss and gain compared to family contraction and expansion rates. Thus, the gene flux is not only rapid and extensive but often leads to qualitative changes in the gene repertoires. In general terms, these results emphasize that prokaryotic evolution is largely driven not by small variations, such as single nucleotide substitutions, but by much more dramatic changes brought about by HGT and gene loss.

In agreement with previous observations made for larger evolutionary scales [19],[21], we found that on the microevolutionary perspective that is provided by the ATGC analysis, gene family loss prevails over gain, and the difference between the rates of loss and gain is often substantial. Why, then, do prokaryotic genomes not shrink out of existence? The answer is likely to be twofold. Some of the bacteria actually might be headed towards extinction as observed for the tiny genomes of some intracellular parasites [7],[26],[27]. However, the more common scenario would involve evolving prokaryotic lineages going through long phases of genome contraction, and our analysis caught most of them, punctuated by shorter bursts of extensive gene gain, which we detected in a few groups and which compensate for the gradual gene loss. The non-clock like character of gene gain inferred on a longer timescale [19] implies that such periods could be short but would involve massive amounts of genetic material, making detection of such episodes through comparative analysis of tight groups of microbes unlikely.

The median relative and to a large extent even absolute (per nucleotide substitution) rates of loss, gain, reduction and expansion are highly consistent between the major bacterial phyla, surprisingly do not depend on the ATGC-wide dN/dS estimates, and only weakly depend on the lifestyle (parasitic vs independent). However, these rates showed a clear link to gene function, in a good agreement with prediction made from the long-term conservation of different functional classes of prokaryotic genes.

The variation in the GDE rates among individual ATGCs is substantial, spanning nearly two orders of magnitude (Table 1). At present, it is unclear why some genomes rotate genes fast and others slowly, apparently irrespective of the size of the gene pools that are available for HGT. The weakness of taxonomic coherence and connection to the microbial lifestyle imply that the defining factors have to do with specific, local aspects of microbial ecology. Identification of these factors is a major challenge for future comparative genomic and experimental studies.

A limitation of the evolutionary reconstructions reported here is that the ML approach implemented in Count [22],[75] takes as input the gene family membership matrix, without explicitly exploiting information on the level of sequence similarity and phylogenetic relationships within individual families. In principle, more precise reconstructions taking into account this additional information are possible through the use of tree reconciliation approaches that compare gene trees to species trees [76]-[78]. Most of the available tree reconciliation algorithms are computationally prohibitive but recently, efficient, fast methods have been reported [79],[80]. Nevertheless, application of even this promising approach on the scale addressed in this work and statistical assessment of the results remain challenges for future studies. In practice, it appears that tree reconciliation has the potential to uncover cases where a gene in a particular lineage, although included in a COG, shows a phylogenetic position significantly different from that in the species tree, and hence could actually have been acquired via HGT. Such cryptic HGT events could be of two kinds: (i) displacement of an existing family member by a xenolog, i.e. a homolog from a distant lineage, known as xenologous gene displacement and (ii) acquisition of a `pseudoparalog',, i.e. an additional family member, again from a distant source [18]. Importantly, only events within families would be involved and the family gain estimates would remain unaffected. The findings of this work indicate that the contribution of pseudoparalog acquisition is small, given that the estimated family expansion rates are about an order of magnitude lower than the gain rates. The extent of xenologous gene displacement is unknown and remains an interesting target for further analysis. Regardless, it should be emphasized that the refinement of the GDE rate estimates that potentially could be obtained through tree reconciliation, can only lead to an upward reassessment of the rate of HGT. Thus, taking into account also the apparent underestimation by Count of events at deeper branches of the ATGC trees (see Discussion above and Figure 4), the GDE rates obtained here, even if strikingly high, should be considered as lower bound estimates.

The supergenomes

The supergenome size estimates from the rates of repeated gene gain showed, in a broad agreement with previous observations, that some of the microbes possess well-defined closed supergenomes whereas other supergenomes appeared to be open. The typical closed supergenome size was estimated to be about tenfold larger than the characteristic genome size in the respective ATGC, indicating a large but clearly limited gene pool available to genome dynamics. The supergenome size does not show significant correlation with the overall gene dynamics, but seems to be associated with the phylogenetic depth of the ATGC tree and with the lifestyle, as open supergenomes are substantially more common among free-living compared to host-associated microbes. The accelerating sequencing of microbial genomes will put these estimates to test without much delay.


The reconstruction of short-term GDEs shows that microbial genomes exist in a state of perennial flux, gaining, losing, expanding and contracting gene families. Typically, genome dynamics processes are rapid, with gains and losses of multiple gene families occurring within the time frame of a single nucleotide substitution per gene. Thus, gene flux is the dominant mode in microbial evolution such that microbes primarily differ from each other on the scale from static to highly dynamic. The rates of gene family gain and loss in most microbial groups are approximately an order of magnitude greater than the rates of expansion and contraction of pre-existing families, indicating that HGT is the principal source of new genes in prokaryote evolution. Overall, gene family loss notably prevails over gain, i.e. evolving genomes appear to spend more time contracting than expanding. It seems most likely that the gradual gene loss is compensated for by episodes of rapid gene gain; most such bursts are outside the evolutionary scale accessible through ATGCs although a few were detected. The absolute as well as relative rates of GDEs show remarkable variance among bacteria, spanning almost two orders of magnitude, and do not significantly depend on the ATGC-wide dN/dS estimates, the taxonomic affinity of microbes or their lifestyle. Conceivably, genome dynamics is highly sensitive to local ecological factors, the exact nature of which remains to be elucidated. The analysis of genome dynamics allowed us to estimate the size of microbial supergenomes, which in the majority of the analyzed microbial groups turned out to be large but closed, exceeding the characteristic genome size by about an order of magnitude, but for a minority of microbes appeared to be open.


The extended ATGC dataset

Genomic data was obtained from an updated version of the ATGC database [81] containing data from >4.5 million proteins present in >1,500 genomes of prokaryotes (approximately 60% of proteins and 62% of genomes from RefSeq as of June 2013) that met the same criteria as in the original ATGCs [58]. Specifically, these criteria include having at least 85% conserved synteny across any pair of genomes (alignable), and having synonymous substitution rate <1.5 (tight). Aside from the increase in the number and size of ATGCs due to the inclusion of new genomes of bacteria and archaea, the major difference between this and previous ATGC versions was the exclusion of lower-quality drafts of incomplete genomes. In addition, the pangenome of each ATGC is now represented by automatically derived COGs [67],[68],[82].

The COG construction was performed in two stages. First, COGs were constructed as clusters of bidirectional best-matching proteins, with the threshold e-value 1 ± 10−5 and protein coverage of 75% [82]. Second, proteins unassigned in the first stage were added to the cluster that they match best using the COGNITOR method [83], with the stringent threshold e-value 1 ± 10−20 and protein coverage of 75%.

The ATGCs also include pre-calculated dN/dS values [84] for all orthologous gene pairs from each pair of genomes. We analyzed 35 of the largest ATGCs (34 bacterial and one archaeal genomic cluster) that contained ten or more genomes (up to a maximum of 109; Additional file 2: Figures S1 and S15, Additional file 1: Table S1 and Additional file 3: Table S2). These selected ATGCs encompass many universal genes, i.e., genes that are present in all genomes within the ATGC.

Species trees

First, a concatenated alignment of all universal genes with conserved synteny among species was constructed for each ATGC from the alignments of the respective protein sequences that were generated using MUSCLE [85] and converted back to the alignments of the respective nucleotide sequences using an in-house script. The concatenated alignments were used to reconstruct a species tree for each ATGC using the program FastTree [86] under the General Time Reversible (GTR) nucleotide substitution model [87]. The program Count that was employed for evolutionary reconstruction as described below [22], requires rooted phylogenetic trees as an input. Accordingly, all trees were rooted using the least-squares modification of the mid-point method [88].

Phylogenetic birth-and-death analysis

The rates of gain, loss, expansion and reduction were estimated using the program Count [22]. This program requires two inputs, namely a matrix that contains the number of gene copies in each species and a rooted species tree, to calculate gain, loss, expansion and reduction rates. Count calculates these rates using a phylogenetic birth-and-death model that requires the following parameters: κ (rate of gene gain), λ (individual gene duplication rate) and μ (individual gene loss rate) (Additional file 2: Figure S16). Thus, a gene family of size n decreases at a rate and increases at a rate (κ + ). The parameters (κ,λ,μ) are different for each gene family and across edges of the species tree. These parameters are computed by Count using ML optimization [75]. It is recommended that the parameters are optimized iteratively, in several rounds of increasing computational complexity, such that in each round the rates from the previous round are used as the starting point [22]. We optimized the parameters through 11 rounds of increasing complexity. The first two rounds started with uniform rates of gain and expansion and in the subsequent rounds the number of discrete categories for the gamma distribution (for gain, loss and expansion) increased from one to two for each type of event (Additional file 5: Table S4). The parameter values obtained in the final round were used to estimate the numbers of gains, losses, expansions and reductions for all gene families at different branches of the species tree. This final analysis was performed using the `posteriors' option of Count, which analyzes and integrates several phylogenetic scenarios and calculates rates of gain, loss, expansion and reduction across all branches. The sum across all branches and across all families is taken as the estimate of the number of events across the entire history of a given group of organisms.

Estimates of the supergenome size

Supergenome size was estimated using two different methods. We implemented the currently widely adopted binomial mixture method as well as the capture/recapture method [41]. We also utilized the posterior gain probabilities computed by Count to estimate the size of the reservoir (supergenome) from which gains originate. Only gains of non-ancestral families were included in the analysis. A family was designated as ancestral if the posterior probability of this family being present in the root node genome (the ancestor of the analyzed group) was above 0.5. The number of these ancestral families was added to the estimate of the supergenome size with the implicit assumption that they were gained before the common ancestor of the group came into existence.

The simplest procedure for estimating the supergenome size is to assume that every time a family is gained, it is drawn at random from a well-mixed reservoir of size S and maximize the probability

L= C S P S K

that P distinct families are discovered as a result of K random independent samples from the reservoir, with respect to S. Here C S P is a binomial coefficient. The total number of gains K is estimated as the sum of Count reported gain probabilities over all branches and families plus the sum of the probabilities of presence at the root node. Having an exact expression for the probability allows one to estimate the confidence region of the estimated supergenome size S. In reality acquisitions of genes from the supergenome are not random uncorrelated events. The effect of these correlations on the supergenome size estimates is likely to be complex and even its sign cannot be determined a priori. In the absence of ancillary information on gene gain correlations, the assumption of random independent gains provides a useful null model.

An alternative, more complex model of the reservoir posits that the probability p i of gaining a family i can vary with i. The exact probability of observing P distinct families from K samples can no longer be computed explicitly. However, if we introduce the number X m of families gained exactly m times, we can compute its expectation.

g m = X m = C K m i = 1 S p i m 1 p i K m

where C K m is a binomial coefficient. Assuming that X m is composed of a large number of independent binomials, it has a Poisson distribution. Therefore if the empirically observed number of families gained m times is O m , the log likelihood of observing O1, O2, etc. is

lnL~ m = 1 K g m + O m ln g m ,

where g m is computed via Equation 2. The gain probabilities p i are parameterized using a power law distribution:

p i = A i α ,i=1,,S,

where A is a normalization constant that ensures that i = 1 S p i =1. The power law above is the simplest non-trivial one-parameter distribution with a broad range of gain probabilities. The maximization of the approximate likelihood in Equation 3 with respect to S and α yields the estimate of the supergenome size S and the gain probability distribution in the supergenome characterized by α. The observed numbers O m of multiple gains are computed by binning the total gain probabilities (including the ancestral branch gain probability), i.e. if the total Count reported gain probability for some family is between m − 0.5 and m + 0.5 for some integer m, O m is incremented.

Statistical analysis

Statistical analysis was performed in the R environment. Spearman rank correlations are reported. P values were obtained by a permutations test with 100,000 rounds of reshuffling. The input variables for the PCA were the total number of gains, losses, expansions and reductions per unit branch length in each ATGC (Table 1). These values were transformed into the logarithmic scale prior to the analysis. PCA was performed using the function princomp from the R statistical package. The sign of principal component 1 was inverted.

Synteny distance

Within each ATGC, the pairwise synteny distance (dY) between genomes is defined as 1 – F s where F s is the fraction of orthologs in syntenic genome segments [64]. Linear regression with a double logarithmic scale between pairwise synteny distance and nucleotide substitution distance was used to estimate the shuffling rate, which was reported for the nucleotide distance of 0.01 substitution/site.

Bootstrap analysis of the genome dynamics event rates

The robustness of the median estimated GDE rates was assessed using bootstrap sampling of the estimated values of each type of GDE for the 35 ATGCs. Median values of 1,000 replicates of all distributions were collected and plotted in Additional file 2: Figure S3. The probability density of each distribution was calculated using the function density of the R statistical package.

Additional files



alignable tight genome cluster


cluster of orthologous genes


genome dynamics event


horizontal gene transfer






maximum likelihood, PCA, principal component analysis


  1. Kolsto AB: Dynamic bacterial genome organization. Mol Microbiol. 1997, 24: 241-248.

    Article  PubMed  Google Scholar 

  2. Koonin EV, Galperin MY: Prokaryotic genomes: the emerging paradigm of genome-based microbiology. Curr Opin Genet Dev. 1997, 7: 757-763.

    Article  PubMed  Google Scholar 

  3. Casjens S: The diverse and dynamic structure of bacterial genomes. Annu Rev Genet. 1998, 32: 339-377.

    Article  PubMed  Google Scholar 

  4. Bellgard MI, Itoh T, Watanabe H, Imanishi T, Gojobori T: Dynamic evolution of genomes and the concept of genome space. Ann N Y Acad Sci. 1999, 870: 293-300.

    Article  PubMed  Google Scholar 

  5. Doolittle WF: Lateral genomics. Trends Cell Biol. 1999, 9: M5-M8.

    Article  PubMed  Google Scholar 

  6. Koonin EV, Wolf YI: Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 2008, 36: 6688-6719.

    Article  PubMed Central  PubMed  Google Scholar 

  7. McCutcheon JP, Moran NA: Extreme genome reduction in symbiotic bacteria. Nat Rev Microbiol. 2012, 10: 13-26.

    Google Scholar 

  8. Shih PM, Wu D, Latifi A, Axen SD, Fewer DP, Talla E, Calteau A, Cai F, Tandeau De Marsac N, Rippka R, Herdman M, Sivonen K, Coursin T, Laurent T, Goodwin L, Nolan M, Davenport KW, Han CS, Rubin EM, Eisen JA, Woyke T, Gugger M, Kerfeld CA: Improving the coverage of the cyanobacterial phylum using diversity-driven genome sequencing. Proc Natl Acad Sci USA. 2013, 110: 1053-1058.

    Article  PubMed Central  PubMed  Google Scholar 

  9. Schneiker S, Perlova O, Kaiser O, Gerth K, Alici A, Altmeyer MO, Bartels D, Bekel T, Beyer S, Bode E, Bode HB, Bolten CJ, Choudhuri JV, Doss S, Elnakady YA, Frank B, Gaigalat L, Goesmann A, Groeger C, Gross F, Jelsbak L, Jelsbak L, Kalinowski J, Kegler C, Knauber T, Konietzny S, Kopp M, Krause L, Krug D, Linke B, et al: Complete genome sequence of the myxobacterium Sorangium cellulosum. Nat Biotechnol. 2007, 25: 1281-1289.

    Article  PubMed  Google Scholar 

  10. Makarova KS, Aravind L, Galperin MY, Grishin NV, Tatusov RL, Wolf YI, Koonin EV: Comparative genomics of the Archaea (Euryarchaeota): evolution of conserved protein families, the stable core, and the variable shell. Genome Res. 1999, 9: 608-628.

    PubMed  Google Scholar 

  11. Charlebois RL, Doolittle WF: Computing prokaryotic gene ubiquity: rescuing the core from extinction. Genome Res. 2004, 14: 2469-2477.

    Article  PubMed Central  PubMed  Google Scholar 

  12. Kettler GC, Martiny AC, Huang K, Zucker J, Coleman ML, Rodrigue S, Chen F, Lapidus A, Ferriera S, Johnson J, Steglich C, Church GM, Richardson P, Chisholm SW: Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet. 2007, 3: e231-

    Article  PubMed Central  PubMed  Google Scholar 

  13. Gan HM, Hudson AO, Rahman AY, Chan KG, Savka MA: Comparative genomic analysis of six bacteria belonging to the genus Novosphingobium: insights into marine adaptation, cell-cell signaling and bioremediation. BMC Genomics. 2013, 14: 431-

    Article  PubMed Central  PubMed  Google Scholar 

  14. Thompson CC, Silva GG, Vieira NM, Edwards R, Vicente AC, Thompson FL: Genomic taxonomy of the genus Prochlorococcus. Microb Ecol. 2013, 66: 752-762.

    Article  PubMed  Google Scholar 

  15. Vejborg RM, Hancock V, Schembri MA, Klemm P: Comparative genomics of Escherichia coli strains causing urinary tract infections. Appl Environ Microbiol. 2011, 77: 3268-3278.

    Article  PubMed Central  PubMed  Google Scholar 

  16. Perna NT, Plunkett G, Burland V, Mau B, Glasner JD, Rose DJ, Mayhew GF, Evans PS, Gregor J, Kirkpatrick HA, Pósfai G, Hackett J, Klink S, Boutin A, Shao Y, Miller L, Grotbeck EJ, Davis NW, Lim A, Dimalanta ET, Potamousis KD, Apodaca J, Anantharaman TS, Lin J, Yen G, Schwartz DC, Welch RA, Blattner FR: Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature. 2001, 409: 529-533.

    Article  PubMed  Google Scholar 

  17. Kaas RS, Friis C, Ussery DW, Aarestrup FM: Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. BMC Genomics. 2012, 13: 577-

    Article  PubMed Central  PubMed  Google Scholar 

  18. Koonin EV, Makarova KS, Aravind L: Horizontal gene transfer in prokaryotes: quantification and classification. Annu Rev Microbiol. 2001, 55: 709-742.

    Article  PubMed  Google Scholar 

  19. Snel B, Bork P, Huynen MA: Genomes in flux: the evolution of archaeal and proteobacterial gene content. Genome Res. 2002, 12: 17-25.

    Article  PubMed  Google Scholar 

  20. Mirkin BG, Fenner TI, Galperin MY, Koonin EV: Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol. 2003, 3: 2-

    Article  PubMed Central  PubMed  Google Scholar 

  21. Kunin V, Ouzounis CA: The balance of driving forces during genome evolution in prokaryotes. Genome Res. 2003, 13: 1589-1594.

    Article  PubMed Central  PubMed  Google Scholar 

  22. Csuros M: Count: evolutionary analysis of phylogenetic profiles with parsimony and likelihood. Bioinformatics. 2010, 26: 1910-1912.

    Article  PubMed  Google Scholar 

  23. Csuros M, Miklos I: Streamlining and large ancestral genomes in Archaea inferred with a phylogenetic birth-and-death model. Mol Biol Evol. 2009, 26: 2087-2095.

    Article  PubMed Central  PubMed  Google Scholar 

  24. Makarova K, Slesarev A, Wolf Y, Sorokin A, Mirkin B, Koonin E, Pavlov A, Pavlova N, Karamychev V, Polouchine N, Shakhova V, Grigoriev I, Lou Y, Rohksar D, Lucas S, Huang K, Goodstein DM, Hawkins T, Plengvidhya V, Welker D, Hughes J, Goh Y, Benson A, Baldwin K, Lee JH, Díaz-Muñiz I, Dosti B, Smeianov V, Wechter W, Barabote R: Comparative genomics of the lactic acid bacteria. Proc Natl Acad Sci USA. 2006, 103: 15611-15616.

    Article  PubMed Central  PubMed  Google Scholar 

  25. Wolf YI, Makarova KS, Yutin N, Koonin EV: Updated clusters of orthologous genes for Archaea: a complex ancestor of the Archaea and the byways of horizontal gene transfer. Biol Direct. 2012, 7: 46-

    Article  PubMed Central  PubMed  Google Scholar 

  26. Merhej V, Raoult D: Rickettsial evolution in the light of comparative genomics. Biol Rev Camb Philos Soc. 2011, 86: 379-405.

    Article  PubMed  Google Scholar 

  27. Merhej V, Georgiades K, Raoult D: Postgenomic analysis of bacterial pathogens repertoire reveals genome reduction rather than virulence factors. Brief Funct Genomics. 2013, 12: 291-304.

    Article  PubMed  Google Scholar 

  28. Boscaro V, Felletti M, Vannini C, Ackerman MS, Chain PS, Malfatti S, Vergez LM, Shin M, Doak TG, Lynch M, Petroni G: Polynucleobacter necessarius, a model for genome reduction in both free-living and symbiotic bacteria. Proc Natl Acad Sci USA. 2013, 110: 18590-18595.

    Article  PubMed Central  PubMed  Google Scholar 

  29. Richards VP, Palmer SR, Pavinski Bitar PD, Qin X, Weinstock GM, Highlander SK, Town CD, Burne RA, Stanhope MJ: Phylogenomics and the dynamic genome evolution of the genus Streptococcus. Genome Biol Evol. 2014, 6: 741-753.

    Article  PubMed Central  PubMed  Google Scholar 

  30. Luo H, Csuros M, Hughes AL, Moran MA: Evolution of divergent life history strategies in marine Alphaproteobacteria. MBio. 2013, 4: 4-

    Article  Google Scholar 

  31. Swan BK, Tupper B, Sczyrba A, Lauro FM, Martinez-Garcia M, Gonzalez JM, Luo H, Wright JJ, Landry ZC, Hanson NW, Thompson BP, Poulton NJ, Schwientek P, Acinas SG, Giovannoni SJ, Moran MA, Hallam SJ, Cavicchioli R, Woyke T, Stepanauskas R: Prevalent genome streamlining and latitudinal divergence of planktonic bacteria in the surface ocean. Proc Natl Acad Sci USA. 2013, 110: 11463-11468.

    Article  PubMed Central  PubMed  Google Scholar 

  32. Mira A, Ochman H, Moran NA: Deletional bias and the evolution of bacterial genomes. Trends Genet. 2001, 17: 589-596.

    Article  PubMed  Google Scholar 

  33. Kuo CH, Ochman H: Deletional bias across the three domains of life. Genome Biol Evol. 2009, 1: 145-152.

    Article  PubMed Central  PubMed  Google Scholar 

  34. Wolf YI, Koonin EV: Genome reduction as the dominant mode of evolution. Bioessays. 2013, 35: 829-837.

    Article  PubMed Central  PubMed  Google Scholar 

  35. Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, Angiuoli SV, Crabtree J, Jones AL, Durkin AS, Deboy RT, Davidsen TM, Mora M, Scarselli M, Margarit y Ros I, Peterson JD, Hauser CR, Sundaram JP, Nelson WC, Madupu R, Brinkac LM, Dodson RJ, Rosovitz MJ, Sullivan SA, Daugherty SC, Haft DH, Selengut J, Gwinn ML, Zhou L, Zafar N, et al: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial `pan-genome'. Proc Natl Acad Sci USA. 2005, 102: 13950-13955.

    Article  PubMed Central  PubMed  Google Scholar 

  36. Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Curr Opin Genet Dev. 2005, 15: 589-594.

    Article  PubMed  Google Scholar 

  37. Tettelin H, Riley D, Cattuto C, Medini D: Comparative genomics: the bacterial pan-genome. Curr Opin Microbiol. 2008, 11: 472-477.

    Article  PubMed  Google Scholar 

  38. Lapierre P, Gogarten JP: Estimating the size of the bacterial pan-genome. Trends Genet. 2009, 25: 107-110.

    Article  PubMed  Google Scholar 

  39. Norman A, Hansen LH, Sorensen SJ: Conjugative plasmids: vessels of the communal gene pool. Philos Trans R Soc Lond B Biol Sci. 2009, 364: 2275-2289.

    Article  PubMed Central  PubMed  Google Scholar 

  40. Donati C, Hiller NL, Tettelin H, Muzzi A, Croucher NJ, Angiuoli SV, Oggioni M, Dunning Hotopp JC, Hu FZ, Riley DR, Covacci A, Mitchell TJ, Bentley SD, Kilian M, Ehrlich GD, Rappuoli R, Moxon ER, Masignani V: Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol. 2010, 11: R107-

    Article  PubMed Central  PubMed  Google Scholar 

  41. Snipen L, Almoy T, Ussery DW: Microbial comparative pan-genomics using binomial mixture models. BMC Genomics. 2009, 10: 385-

    Article  PubMed Central  PubMed  Google Scholar 

  42. Hogg JS, Hu FZ, Janto B, Boissy R, Hayes J, Keefe R, Post JC, Ehrlich GD: Characterization and modeling of the Haemophilus influenzae core and supragenomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biol. 2007, 8: R103-

    Article  PubMed Central  PubMed  Google Scholar 

  43. Baumdicker F, Hess WR, Pfaffelhuber P: The infinitely many genes model for the distributed genome of bacteria. Genome Biol Evol. 2012, 4: 443-456.

    Article  PubMed Central  PubMed  Google Scholar 

  44. Collins RE, Higgs PG: Testing the infinitely many genes model for the evolution of the bacterial core genome and pangenome. Mol Biol Evol. 2012, 29: 3413-3425.

    Article  PubMed  Google Scholar 

  45. Gogarten JP, Doolittle WF, Lawrence JG: Prokaryotic evolution in light of gene transfer. Mol Biol Evol. 2002, 19: 2226-2238.

    Article  PubMed  Google Scholar 

  46. Gogarten JP, Townsend JP: Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol. 2005, 3: 679-687.

    Article  PubMed  Google Scholar 

  47. O'Malley MA, Boucher Y: Paradigm change in evolutionary microbiology. Stud Hist Philos Biol Biomed Sci. 2005, 36: 183-208.

    Article  PubMed  Google Scholar 

  48. Kunin V, Goldovsky L, Darzentas N, Ouzounis CA: The net of life: reconstructing the microbial phylogenetic network. Genome Res. 2005, 15: 954-959.

    Article  PubMed Central  PubMed  Google Scholar 

  49. Ge F, Wang LS, Kim J: The cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLoS Biol. 2005, 3: e316-

    Article  PubMed Central  PubMed  Google Scholar 

  50. Dagan T: Phylogenomic networks. Trends Microbiol. 2011, 19: 483-491.

    Article  PubMed  Google Scholar 

  51. Dagan T, Martin W: The tree of one percent. Genome Biol. 2006, 7: 118-

    Article  PubMed Central  PubMed  Google Scholar 

  52. Dagan T, Martin W: Getting a better picture of microbial evolution en route to a network of genomes. Philos Trans R Soc Lond B Biol Sci. 2009, 364: 2187-2196.

    Article  PubMed Central  PubMed  Google Scholar 

  53. Bapteste E, Susko E, Leigh J, MacLeod D, Charlebois RL, Doolittle WF: Do orthologous gene phylogenies really support tree-thinking?. BMC Evol Biol. 2005, 5: 33-

    Article  PubMed Central  PubMed  Google Scholar 

  54. Bapteste E, O'Malley MA, Beiko RG, Ereshefsky M, Gogarten JP, Franklin-Hall L, Lapointe FJ, Dupre J, Dagan T, Boucher Y, Martin W: Prokaryotic evolution and the tree of life are two different things. Biol Direct. 2009, 4: 34-

    Article  PubMed Central  PubMed  Google Scholar 

  55. Puigbo P, Wolf YI, Koonin EV: The tree and net components of prokaryote evolution. Genome Biol Evol. 2010, 2: 745-756.

    Article  PubMed Central  PubMed  Google Scholar 

  56. O'Malley MA, Koonin EV: How stands the tree of life a century and a half after the origin?. Biol Direct. 2011, 6: 32-

    Article  PubMed Central  PubMed  Google Scholar 

  57. Treangen TJ, Rocha EP: Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genet. 2011, 7: e1001284-

    Article  PubMed Central  PubMed  Google Scholar 

  58. Novichkov PS, Ratnere I, Wolf YI, Koonin EV, Dubchak I: ATGC: a database of orthologous genes from closely related prokaryotic genomes and a research platform for microevolution of prokaryotes. Nucleic Acids Res. 2009, 37: D448-D454.

    Article  PubMed Central  PubMed  Google Scholar 

  59. Bromham L, Penny D: The modern molecular clock. Nat Rev Genet. 2003, 4: 216-224.

    Article  PubMed  Google Scholar 

  60. Creevey CJ, Doerks T, Fitzpatrick DA, Raes J, Bork P: Universally distributed single-copy genes indicate a constant rate of horizontal transfer. PLoS One. 2011, 6: e22099-

    Article  PubMed Central  PubMed  Google Scholar 

  61. Lynch M: Streamlining and simplification of microbial genome architecture. Annu Rev Microbiol. 2006, 60: 327-349.

    Article  PubMed  Google Scholar 

  62. MicrobesOnline. [], []

  63. Hurst LD: The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 2002, 18: 486-

    Article  PubMed  Google Scholar 

  64. Novichkov PS, Wolf YI, Dubchak I, Koonin EV: Trends in prokaryotic evolution revealed by comparison of closely related bacterial and archaeal genomes. J Bacteriol. 2009, 191: 65-73.

    Article  PubMed Central  PubMed  Google Scholar 

  65. Nakabachi A, Yamashita A, Toh H, Ishikawa H, Dunbar HE, Moran NA, Hattori M: The 160-kilobase genome of the bacterial endosymbiont Carsonella. Science. 2006, 314: 267-

    Article  PubMed  Google Scholar 

  66. Ran W, Kristensen DM, Koonin EV: Coupling between protein level selection and codon usage optimization in the evolution of bacteria and archaea. MBio. 2014, 5: 2-

    Article  Google Scholar 

  67. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003, 4: 41-

    Article  PubMed Central  PubMed  Google Scholar 

  68. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science. 1997, 278: 631-637.

    Article  PubMed  Google Scholar 

  69. Koonin EV: Comparative genomics, minimal gene-sets and the last universal common ancestor. Nature Rev Microbiol. 2003, 1: 127-136.

    Article  Google Scholar 

  70. Ciccarelli FD, Doerks T, von Mering C, Creevey CJ, Snel B, Bork P: Toward automatic reconstruction of a highly resolved tree of life. Science. 2006, 311: 1283-1287.

    Article  PubMed  Google Scholar 

  71. Puigbo P, Wolf YI, Koonin EV: Search for a tree of life in the thicket of the phylogenetic forest. J Biol. 2009, 8: 59-

    Article  PubMed Central  PubMed  Google Scholar 

  72. Makarova KS, Wolf YI, Koonin EV: Comparative genomics of defense systems in archaea and bacteria. Nucleic Acids Res. 2013, 41: 4360-4377.

    Article  PubMed Central  PubMed  Google Scholar 

  73. Nilsson AI, Koskiniemi S, Eriksson S, Kugelberg E, Hinton JC, Andersson DI: Bacterial genome size reduction by experimental evolution. Proc Natl Acad Sci USA. 2005, 102: 12112-12116.

    Article  PubMed Central  PubMed  Google Scholar 

  74. Jordan IK, Rogozin IB, Wolf YI, Koonin EV: Microevolutionary genomics of bacteria. Theor Popul Biol. 2002, 61: 435-447.

    Article  PubMed  Google Scholar 

  75. Csuros M, Miklos I: Mathematical framework for phylogenetic birth-and-death models. 2009, [], []

  76. Doyon JP, Ranwez V, Daubin V, Berry V: Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011, 12: 392-400.

    Article  PubMed  Google Scholar 

  77. Swenson KM, El-Mabrouk N: Gene trees and species trees: irreconcilable differences. BMC Bioinformatics. 2012, 13: S15-

    PubMed Central  PubMed  Google Scholar 

  78. Libeskind-Hadas R, Wu YC, Bansal MS, Kellis M: Pareto-optimal phylogenetic tree reconciliation. Bioinformatics. 2014, 30: i87-i95.

    Article  PubMed Central  PubMed  Google Scholar 

  79. Bansal MS, Alm EJ, Kellis M: Efficient algorithms for the reconciliation problem with gene duplication, horizontal transfer and loss. Bioinformatics. 2012, 28: i283-i291.

    Article  PubMed Central  PubMed  Google Scholar 

  80. Szollosi GJ, Rosikiewicz W, Boussau B, Tannier E, Daubin V: Efficient exploration of the space of reconciled gene trees. Syst Biol. 2013, 62: 901-912.

    Article  PubMed Central  PubMed  Google Scholar 

  81. Clusters of Orthologous Genes (COGs) in the ATGC. [], []

  82. Kristensen DM, Kannan L, Coleman MK, Wolf YI, Sorokin A, Koonin EV, Mushegian A: A low-polynomial algorithm for assembling clusters of orthologous groups from intergenomic symmetric best matches. Bioinformatics. 2010, 26: 1481-1487.

    Article  PubMed Central  PubMed  Google Scholar 

  83. Tatusov RL, Galperin MY, Natale DA, Koonin EV: The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28: 33-36.

    Article  PubMed Central  PubMed  Google Scholar 

  84. Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007, 24: 1586-1591.

    Article  PubMed  Google Scholar 

  85. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004, 32: 1792-1797.

    Article  PubMed Central  PubMed  Google Scholar 

  86. Price MN, Dehal PS, Arkin AP: FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One. 2010, 5: e9490-

    Article  PubMed Central  PubMed  Google Scholar 

  87. Rogers JS: Maximum likelihood estimation of phylogenetic trees is consistent when substitution rates vary according to the invariable sites plus gamma distribution. Syst Biol. 2001, 50: 713-722.

    Article  PubMed  Google Scholar 

  88. Wolf YI, Aravind L, Grishin NV, Koonin EV: Evolution of aminoacyl-tRNA synthetases – analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res. 1999, 9: 689-710.

    PubMed  Google Scholar 

Download references


The authors are supported by intramural funds of the US Department of Health and Human Services (to the National Library of Medicine).

Author information

Authors and Affiliations


Corresponding author

Correspondence to Eugene V Koonin.

Additional information

Competing interests

The authors declare no competing financial interests.

Authors' contributions

PP and AEL performed the research. DMK collected the data. YIW and EVK designed the study. PP, AEL, YIW and EVK analyzed the data. PP, AEL and EVK wrote the manuscript; all authors read and approved the final version.

Electronic supplementary material

Additional file 1: Table S1.: List of ATGCs and genomes. (DOCX 25 KB)


Additional file 2: Figure S1.: Scheme of the pipeline used in this study. We analyzed all ATGCs with ten or more species. The species tree was reconstructed with FastTree [87] from the concatenated alignment of universal COGs and rooted using the least-squares variation of the mid-point rooting [88]. The species tree and the phyletic distribution of COGs were used to calculate rates of genome dynamics with Count. Figure S2. Positive significant correlation of branch length (BL) with the number of (a) gains, (b) losses, (c) expansions and (b) reductions in individual tree branches. Figure S3. Distributions of the genome dynamics rates across the ATGCs and bootstrap analysis for (a) gains, (b) losses, (c) expansions and (b) reductions. Figure S4. Correlation of the rates of the four classes of GDE with the number of species in ATGCs. Figure S5. Relative genome dynamics by phylogenetic depth. Figure S6. Distributions of the genome dynamics rates across ATGCs. (a) Rates of gain, loss, expansion and reduction. (b) Rates of gain, loss, expansion and reduction considering only shallow branches (phylogenetic depth < 0.05). (c) Comparison of gain rates in all branches (solid green line) and in shallow branches only (dashed black line). (d) Comparison of loss rates in all branches (solid red line) and in shallow branches only (dashed black line). (e) Comparison of expansion rates in all branches (dashed green line) and in shallow branches only (dashed black line). (f) Comparison of reduction based in branches (dashed red line) and in shallow branches only (dashed black line). Figure S7. Correlation of overall genome dynamics with ATGC tree depth. Figure S8. Correlation of dN/dS and (a) gain, (b) loss, (c) expansion and (d) reduction rates in ATGCs. Figure S9. Correlation of GC content and (a) gain, (b) loss, (c) expansion and (d) reduction rates in ATGCs. Figure S10. Correlation of genome shuffling rate (dY) and (a) gain, (b) loss, (c) expansion and (d) reduction rates in ATGCs. Figure S11. Principal components analysis (PCA). (a) Values of the first and second components in the PCA across all ATGCs. (b) PCA loads. (c) Cumulative variance of the four components. Figure S12. Correlation of median number of protein coding genes (genome size) and (a) gain, (b) loss, (c) expansion and (d) reduction rates in ATGCs. Figure S13. Correlation of the estimated supergenome size (uniform model) and gene flux rates in ATGCs. (a) Supergenome size vs flux rate (all GDEs). (b) Supergenome size vs gain rate. (c) Supergenome size vs loss rate. (d) Supergenome size vs expansion rate. (e) Supergenome size vs reduction rate. Figure S14. Correlations between the estimated supergenome size (uniform model), dN/dS ratio, tree depth and average genome size in ATGCs. (a) Supergenome size vs tree depth. (b) Supergenome size vs dN/dS. (c) Tree depth vs dN/dS. (d) Genome size vs dN/dS. Figure S15. Frequency distribution of ATGCs by the number of species. Figure S16. Schematic representation of gains, losses, expansions and reductions. (a) Phylogenetic birth-and-death ML model used in Count. (b) Simple example to show that expansions and gains account for two possible evolutionary scenarios. (PPTX 1 MB)

Additional file 3: Table S2.: Characteristics of ATGCs. (PDF 66 KB)


Additional file 4: Table S3.: Comparison of the present supergenome size estimates with previously published estimates. (DOCX 17 KB)

Additional file 5: Table S4.: Parameters used in the Count analysis. (DOCX 24 KB)

Authors’ original submitted files for images

Rights and permissions

Open Access  This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit

The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Puigbò, P., Lobkovsky, A.E., Kristensen, D.M. et al. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol 12, 66 (2014).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: