Fig. 1 | BMC Biology

Fig. 1

From: Charting the diversity of uncultured viruses of Archaea and Bacteria

Flowchart summarizing the methodology used to establish GL-UVAB. The initial dataset consisted of NCBI RefSeq viral genomes and viral genomic sequences obtained through culture-independent approaches, for a total of 195,698 genomic sequences, from which 4,332,223 protein-encoding genes (PEGs) were identified. After the initial filtering, 6646 sequences were selected for phylogenomic reconstruction. Dice distances were calculated between the sequences in this set, and the resulting distance matrix was used for phylogenomic reconstruction through neighbor-joining. The obtained tree was used to identify lineages at three levels, based on minimum node depth: level 1 (node depth equal to or above 0.0014, and number of representatives equal to or above 20), level 2 (node depth equal to or above 0.0056, and number of representatives equal to or above 10), and level 3 (node depth equal to or above 0.0189, and number of representatives equal to or above 3). Lineage abundances were estimated in metagenomic datasets by read mapping. Lineage pan-genomes were determined by identifying clusters of orthologous genes. Finally, sequences that were not included in the original tree were assigned to the lineages by closest relative identification (CRI). Closest relatives were determined based on percentage of matched genes (minimum value of 70%) and average amino acid identity (minimum value of 50%).
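
The numeric cutoffs in the caption can be written down as a small sketch. The Python below is illustrative only and is not the authors' pipeline: the names `LEVEL_CRITERIA`, `Candidate`, `qualifying_levels`, and `closest_relative` are hypothetical, and two behaviors are assumptions that the caption does not specify: reporting every level whose cutoffs a node satisfies, and ranking passing candidates by percentage of matched genes and then by average amino acid identity (AAI).

```python
from dataclasses import dataclass
from typing import List, Optional

# Cutoffs quoted in the caption: (level, minimum node depth, minimum representatives).
LEVEL_CRITERIA = [
    (1, 0.0014, 20),
    (2, 0.0056, 10),
    (3, 0.0189, 3),
]

# CRI cutoffs quoted in the caption (both are percentages).
MIN_MATCHED_GENES_PCT = 70.0
MIN_AAI_PCT = 50.0


@dataclass
class Candidate:
    """A reference sequence compared against a query (hypothetical record)."""
    reference_id: str
    matched_genes_pct: float  # percentage of query genes matched in the reference
    aai_pct: float            # average amino acid identity over matched genes


def qualifying_levels(node_depth: float, n_representatives: int) -> List[int]:
    """Return every level whose depth and size cutoffs the node satisfies.

    Assumption: a node may satisfy more than one level; the caption does not
    say how such cases are resolved.
    """
    return [
        level
        for level, min_depth, min_reps in LEVEL_CRITERIA
        if node_depth >= min_depth and n_representatives >= min_reps
    ]


def closest_relative(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick the closest relative among candidates that pass both CRI cutoffs.

    Assumption: ties are broken by matched-gene percentage first, then AAI.
    Returns None when no candidate passes, leaving the query unassigned.
    """
    passing = [
        c for c in candidates
        if c.matched_genes_pct >= MIN_MATCHED_GENES_PCT and c.aai_pct >= MIN_AAI_PCT
    ]
    if not passing:
        return None
    return max(passing, key=lambda c: (c.matched_genes_pct, c.aai_pct))
```

Under these assumptions, a node with depth 0.006 and 12 representatives meets only the level 2 cutoffs, and a query whose best candidate has 90% matched genes but 45% AAI is not assigned through that candidate.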
