Fig. 4 | BMC Biology

Fig. 4

From: To kill or to be killed: pangenome analysis of Escherichia coli strains reveals a tailocin specific for pandemic ST131

Pangenome plot of E. coli genomes for different numbers of genomes used for pangenome construction. The x-axis shows the number of genomes used for pangenome construction, and the y-axis shows the number of identified gene families. Each small filled circle gives the average number of genes identified across 100 random permutations of randomly selected genomes. The number of genomes tested ranges over 5, 10, ..., 50, 100, 150, 200, 250, ..., 1200, 1250, 1300, and finally 1324. The red and blue lines represent the CD-HIT and ProteinOrtho results, respectively. The pangenome, core genome and softcore genome lines are labelled accordingly. The procedures using CD-HIT and ProteinOrtho give the same or very similar softcore and core genome sizes, so the respective pairs of curves overlap. Tettelin et al. [41] demonstrated that the number N of distinct gene families (= pangenome size) computed from n genomes can be estimated with a power-law model (Heaps' law), N = k·n^(1 − α), where k and α are curve-fitting constants. The pangenome is said to be open (growing without bound as n increases) if α < 1. Otherwise (α ≥ 1), it is a closed pangenome. This pangenome appears to be open (for the ProteinOrtho data, α ≈ 0.7439 and k = 4206; for the CD-HIT curve, α = 0.7521 and k = 4221).
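The Heaps' law model in the caption can be illustrated with a minimal Python sketch. The constants k and α are taken from the caption; the function names (`heaps_size`, `is_open`, `fit_heaps`) are hypothetical, and the log-log least-squares fit is an assumption made for illustration, not necessarily the fitting procedure used by the authors. The data points fed to the fit are synthetic values generated from the model itself, not the measurements plotted in the figure.

```python
import math

# Curve-fitting constants (k, alpha) as reported in the caption.
CAPTION_FITS = {
    "ProteinOrtho": (4206.0, 0.7439),
    "CD-HIT": (4221.0, 0.7521),
}

def heaps_size(n, k, alpha):
    """Pangenome size predicted by the model N = k * n**(1 - alpha)."""
    return k * n ** (1.0 - alpha)

def is_open(alpha):
    """Open pangenome per the caption's criterion: alpha < 1."""
    return alpha < 1.0

def fit_heaps(ns, sizes):
    """Estimate (k, alpha) by ordinary least squares on the log-log form
    log N = log k + (1 - alpha) * log n.
    Assumption: this fitting approach is illustrative only."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), 1.0 - slope

# Synthetic demonstration: generate noise-free points from the model at a few
# of the genome counts listed in the caption, then recover the constants.
genome_counts = [5, 10, 50, 100, 500, 1000, 1324]
k_true, alpha_true = CAPTION_FITS["ProteinOrtho"]
synthetic_sizes = [heaps_size(n, k_true, alpha_true) for n in genome_counts]
k_est, alpha_est = fit_heaps(genome_counts, synthetic_sizes)

for method, (k, alpha) in CAPTION_FITS.items():
    status = "open" if is_open(alpha) else "closed"
    print(f"{method}: N(1324) = {heaps_size(1324, k, alpha):.0f} ({status})")
```

Because both reported α values are below 1, the exponent 1 − α is positive and the predicted N keeps increasing with n, which is what the "open" classification expresses.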
