Studying the gut virome in the metagenomic era: challenges and perspectives

The human gut harbors a complex ecosystem of microorganisms, including bacteria and viruses. With the rise of next-generation sequencing technologies, we have seen a quantum leap in the study of human-gut-inhabiting bacteria, yet the viruses that infect these bacteria, known as bacteriophages, remain underexplored. In this review, we focus on what is known about the role of bacteriophages in human health and the technical challenges involved in studying the gut virome, of which they are a major component. Lastly, we discuss what can be learned from studies of bacteriophages in other ecosystems.


Introduction to the virome
With an estimated population of 10^31, viruses are the most numerous biological entities on Earth, inhabiting diverse environments ranging from the oceans to hydrothermal vents to the human body [1]. The human body is inhabited by both prokaryotic (mostly bacterial) and eukaryotic (mostly human) viruses. Researchers have historically focused on eukaryotic viruses because of their well-known impact on human health, including the influenza virus that causes seasonal flu epidemics and viruses with devastating health consequences, such as HIV and Ebola virus. However, increasing evidence suggests that prokaryotic viruses can also impact human health by affecting the structure and function of the bacterial communities that symbiotically interact with humans [2,3]. The viruses that infect bacteria, called bacteriophages, can play a key role in shaping community structure and function in ecosystems with high bacterial abundance [4,5], such as the human gut.
In recent years viruses have gained their own "-ome" and "-omics": the virome and (meta)viromics. These terms encompass all viruses inhabiting an ecosystem along with their genomes and the study of them, respectively. These viruses can be classified in many ways including on the basis of their host (Fig. 1). In this review we focus on bacteriophages, mainly in the human gut ecosystem, and discuss their role in human health. We then lay out the challenges associated with the study of the gut virome, the existing solutions to these challenges, and the lessons that can be learned from other ecosystems.

Bacteriophages: dynamic players in ecosystems
Bacteriophages are the most abundant group of viruses and are obligatory parasites propagating in bacterial hosts. The potential host range is phage-specific and can vary from only one bacterial strain to multiple bacterial species. During infection, a bacteriophage attaches to the bacterium surface and inserts its own genetic material into the cell. The bacteriophage then follows one of two main life cycles: a lytic cycle or a lysogenic cycle.
Lytic cycles are lethal to host cells and culminate in the production of new phages. Well-known examples of viruses with lytic cycles are the T7 and Mu phages that mainly infect Escherichia coli. These phages initially hijack the bacterial cell machinery to produce virions. Thereafter, the bacterial cell is lysed, releasing virions into the surrounding environment where they can infect new bacterial cells. They can thus play an important role in regulating the abundance of their host bacteria.
In contrast, a lysogenic cycle refers to phage replication that does not directly result in virion production. Instead, the phage genome is typically integrated into the host genome and replicated along with it. A temperate phage is a phage that has the ability to display lysogenic cycles. Under certain conditions, such as DNA damage and low nutrient conditions, these phages can excise themselves from the host genome and enter the lytic cycle [7]. This excision, called induction, may occur with the capture of specific parts of the bacterial genome. The ability of phages to transfer genes from one bacterium to another by means of lysogenic conversion or transduction (as reviewed in [8]) can lead to increased diversification of viral species and of their associated bacterial host species. These phenomena may cause the spread of toxins, virulence genes, and possibly antibiotic resistance genes through a bacterial population [8]. A well-known example of a temperate phage is the phage CTXφ of Vibrio cholerae, which alters the virulence of its bacterial host by incorporating the genes that code for the toxin that induces diarrhea [9]. Phages may thus serve as important reservoirs and transmitters of genetic diversity. The classification of phages based on their life cycle is a topic of much debate [10], and variations of life cycles such as pseudolysogeny and carrier states have been proposed [11,12].
In the human gut ecosystem, temperate bacteriophages dominate over lytic bacteriophages [13][14][15]. It is believed that the majority of bacterial cells have at least one phage inserted into their genome, the so-called prophage. Some prophages may be incorporated in bacterial genomes for millions of generations, losing their ability to excise from host genomes because of genetic erosion (degradation and deletion processes) [16]. These prophages, which are called cryptic or defective, have been shown to be important for the fitness of the bacterial host [17] and thus represent an essential part of a bacterial genome.

Major hallmarks of the human gut virome
The human gut virome develops rapidly after birth
During early development, the virome, like the bacteriome, is extremely dynamic [18][19][20]. In 2008 Breitbart et al., using direct epifluorescent microscopy, concluded that meconium (earliest infant stool) contained no phages [21]. Just 1 week later the infant stool contained 10^8 viral-like particles (VLPs) per gram of feces [21]. Similar to the bacteriome, the infant virome was found to be less diverse than that of adults [21]. The exact mechanism of the origin of phages in the infant gut has yet to be identified, although one hypothesis could be that the phages arise as a result of the induction of prophages from gut bacteria. Numerous other factors are also thought to shape the infant gut virome, including environmental exposures, diet, host genetics, and mode of delivery [15,19,20]. McCann et al. compared the virome of infants born via vaginal delivery to that of infants born via cesarean delivery and found that the alpha- and beta-diversity of the infant virome differed significantly between birth modes [19]. The authors were able to identify 32 contigs that were differentially abundant by birth mode, including several contigs bearing high levels of nucleotide homology to Bifidobacteria temperate phages. This was thought to reflect differential colonization by Bifidobacterium with birth mode. Furthermore, an increased abundance of the vertebrate ssDNA virus Anelloviridae was found in infants born via vaginal delivery, suggesting its vertical transmission from mother to baby [19]. The abundance of this virus had previously been shown to decrease after the age of 15 months [15], but it nonetheless remains highly prevalent in humans worldwide [22]. Diet may also play a role in colonization of the infant gut, as Pannaraj et al. showed that a significant proportion of bacteriophages were transferred from mothers to infants through breast milk [23].
Despite these interesting results, only a few studies to date have investigated the infant virome longitudinally. In 2015, Lim et al. conducted a longitudinal study of the virome and bacteriome in four twin pairs, from birth to 2 years, and found that the expansion of the bacteriome with age was accompanied by a contraction and shift in the bacteriophage composition [20].

The human gut virome consists mostly of bacteriophages
As in other environments, bacteriophages dominate over other viruses in the gut ecosystem. Transmission electron microscopy has shown that the human gut virome consists mostly of DNA bacteriophages from the order Caudovirales along with members of Myoviridae, Podoviridae, and Siphoviridae families (Fig. 2) [27,30]. Recently, the order Caudovirales was expanded to include Ackermannviridae and Herelleviridae [31]. In addition, crAssphage has been found to be a prevalent constituent of the human gut microbiome, possibly representing a new viral family (Fig. 2) [28,32,33]. This phage was recently found to be present in thousands of human-feces-associated environments around the world, confirming it as a strong marker for fecal contamination [34]. Highly divergent but fully colinear genome sequences from a few crAss-like candidate genera have been identified in all major groups of primates, suggesting that crAssphage has had a stable genome structure for millions of years [34]. This in turn suggests that the genome structure of some phages can be remarkably conserved in the stable environment provided by the human gut [34]. The abundance of eukaryotic viruses in the human gut is low; however, some studies report that small amounts are present in every faecal sample [35,36]. These amounts increase dramatically during viral gastrointestinal infections [14,37,38,39].

Fig. 1 Viruses can be classified based on various characteristics. These terms are used continuously throughout this manuscript. While all characters are important in determining taxonomic relationships, sequence comparisons using both pairwise sequence similarity and phylogenetic relationships have become one of the primary sets of characters used to define and distinguish virus taxa [6]
The human gut virome is temporally stable in each individual but shows large inter-individual diversity
A study by Minot et al. showed that approximately 80% of the phages in a healthy adult male were maintained over a period of 2.5 years (the entire duration of their study) [26]. This was recently also demonstrated by Shkoporov et al., who found that assemblies of the same or very closely related viral strains persist for as long as 26 months [40]. This compositional stability was further reflected in stable levels of alpha-diversity and total viral counts, suggesting that viral populations are not subject to periodic fluctuations [40]. In a longitudinal study where six individuals were exposed to a short-term fat- and fiber-controlled dietary intervention, the gut virome was shown to be relatively stable in each individual [14]. The same study also showed that interpersonal variation in the gut virome was the largest source of variance, even among individuals following the same diet [14].
The large inter-individual variations in the virome are consistent with those seen in the bacteriome and appear largely due to environmental rather than genetic factors. It was recently shown in a cohort of monozygotic twins that co-twins did not share more virotypes than unrelated individuals and that bacteriome diversity predicts viral diversity [41].
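The notions of within-sample (alpha) diversity and between-sample dissimilarity used in these studies can be illustrated with a minimal sketch. It assumes that a virome profile is available as a mapping from viral contig or phage name to read count; the Shannon index and Bray-Curtis dissimilarity are used here as common choices, not necessarily the metrics applied in the cited studies, and all names and counts are made up.

```python
import math

def shannon_index(counts):
    """Shannon alpha-diversity (natural log) of one virome profile."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two profiles (0 = identical, 1 = disjoint)."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return 1 - 2 * shared / total

# Hypothetical profiles: one individual at two time points, and a second individual.
ind1_t0 = {"phiA": 50, "phiB": 50}
ind1_t1 = {"phiA": 50, "phiB": 50}
ind2_t0 = {"phiC": 100}

within = bray_curtis(ind1_t0, ind1_t1)   # same individual over time
between = bray_curtis(ind1_t0, ind2_t0)  # different individuals
```

In this toy example the within-individual dissimilarity is lower than the between-individual one, which is the pattern described above for real cohorts.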

Interaction of the human gut virome with the bacteriome in relation to health
In recent years, numerous associations have been established between the human intestinal bacteriome and a number of diseases, syndromes, and traits [42]. Support for these associations varies from anecdotal reports from individuals to results from large cohort studies. For example, in their large cohort study, Falony et al. found the core bacterial microbiome (i.e., the genera shared by 95% of samples) to be composed of 17 genera with a median core abundance of 72.20% [43]. Other studies have shown that a large percentage of the gut bacteriome is represented by members of the Firmicutes and Bacteroidetes, and that their relative levels change in individuals with conditions such as obesity, inflammatory bowel disease (IBD), and diabetes [44][45][46]. This suggests the existence of a "healthy" bacteriome that is disrupted in disease.
Fig. 2 Structural information as well as genome sizes have been exported from the ICTV Online Report [24]. The prevalence of each family in the human gut has been inferred from the following studies: Inoviridae [20,25], Circoviridae, Adenoviridae, Microviridae, Podoviridae, Myoviridae, Siphoviridae [26], Anelloviridae [25][26][27], CrAss-like [28,29]. dsDNA double-stranded DNA. ssDNA single-stranded DNA

In recent years there have also been attempts to characterize a "healthy gut phageome". In 2016, Manrique et al. used ultra-deep sequencing to study the presence of completely assembled genomes of phages in 64 healthy people around the world [47]. The authors proposed that the phageome could be split into three parts: (i) the core, which is composed of at least 23 bacteriophages, one of them crAssphage, found in > 50% of all individuals; (ii) the common, which is shared among 20-50% of individuals; and (iii) the low overlap/unique, which is found in a small number of individuals. The latter fraction represented the majority of the bacteriophages found in the whole dataset [47]. This study, amongst others, suggests that a core virome should not be determined as strictly as the core bacteriome has thus far been defined. Therefore, crAssphage, the abundance of which was not associated with any health-related variables, is likely to be a core element of the normal human virome [34].
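The three-part split of the phageome described above amounts to binning phages by their prevalence across individuals. The following is a minimal sketch of that idea, not the procedure of Manrique et al.: the input format, the function name, and the handling of the exact 20% and 50% boundaries are assumptions made for illustration, and the phage names other than crAssphage are invented.

```python
from collections import Counter

def classify_by_prevalence(samples):
    """Assign each phage to a tier based on the fraction of individuals carrying it.

    `samples` maps an individual ID to the set of phage IDs detected in that
    individual (hypothetical input format).
    """
    n = len(samples)
    counts = Counter(phage for phages in samples.values() for phage in set(phages))
    tiers = {}
    for phage, k in counts.items():
        prevalence = k / n
        if prevalence > 0.5:          # found in > 50% of individuals
            tiers[phage] = "core"
        elif prevalence >= 0.2:       # shared among 20-50% of individuals
            tiers[phage] = "common"
        else:                         # found in a small number of individuals
            tiers[phage] = "low overlap/unique"
    return tiers

# Toy data: six individuals with made-up detections.
toy = {
    "ind1": {"crAssphage", "phiA", "phiB"},
    "ind2": {"crAssphage", "phiA"},
    "ind3": {"crAssphage", "phiC"},
    "ind4": {"crAssphage"},
    "ind5": {"phiD"},
    "ind6": {"crAssphage"},
}
tiers = classify_by_prevalence(toy)
```

Even in this toy dataset most phages fall into the low overlap/unique tier, mirroring the observation that this fraction makes up the majority of detected bacteriophages.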
An attractive model to study bacteria-phage interactions is through the use of gnotobiotic mice, which are colonized with a limited collection of bacteria that are well characterized yet still complex [48]. Recently, Hsu et al. colonized gnotobiotic mice with a defined set of human gut commensal bacteria and subjected them to predation by cognate lytic phages [49]. This revealed that phage predation not only directly impacted susceptible bacteria, but also led to cascading effects on other bacterial species via interbacterial interactions [49]. Fecal metabolomics in these mice revealed that phage predation in the mouse gut microbiota can potentially impact the mammalian host by changing the levels of key metabolites involved in important functions such as gastric mobility and ileal contraction [49].

Bacteriophages and disease
The high inter-individual variability of the virome in healthy individuals presents a challenge for disease association studies, but even with this challenge, compelling evidence is emerging for bacteriophage involvement in several diseases (Table 1). For example, in a study comparing individuals with IBD to household controls, IBD patients had a significant expansion of the taxonomic richness of bacteriophages from the order Caudovirales [52]. Cornuault et al. found that prophages of Faecalibacterium prausnitzii, a bacterium usually depleted in individuals with IBD, are either more prevalent or more abundant in the fecal samples of IBD patients compared to healthy controls, suggesting that these phages might play a role in the disease pathophysiology [59]. This supports the importance of studying the virome concurrently with the bacteriome in order to obtain a holistic picture of the gut ecosystem changes in a disease like IBD. Nor is this relationship between IBD and virome limited to human studies. Duerkop et al. [60] reported that, in murine colitis, intestinal phage communities undergo compositional shifts similar to those observed by Norman et al. in human IBD patients [52]. Specifically, Duerkop et al. observed a decrease in phage community diversity and an expansion of subsets of phages in animals with colitis. Furthermore, Clostridiales phages were decreased during colitis, and the authors suggested that members of the Spounavirinae subfamily of phages could serve as informative markers for colitis [60].

Table 1 (row): Parkinson's disease (PD) | PD patients (n = 31) and control individuals (n = 28) | Identified shifts of the phage/bacteria ratio in lactic acid bacteria known to produce dopamine and regulate intestinal permeability, both major factors implicated in PD pathogenesis | Tetz et al. 2018 [58]
It is important to keep in mind that, although many diseases show associations with various bacteriophages, it is extremely hard to establish causality. Furthermore, in these association studies it is difficult to establish whether alterations in the microbiome and virome are a cause or a consequence of the disease. Koch's postulates are a set of criteria designed to establish a causative relationship between a microbe and a disease. In 2012, Mokili et al. proposed a metagenomic version of Koch's postulates [61]. In order to fulfill these metagenomic Koch's postulates, the following conditions must be met: i) the metagenomic traits in diseased subjects must be significantly different from those in healthy subjects; ii) the inoculation of samples from a diseased animal into a healthy control must lead to the induction of the disease state; and iii) the inoculation of the suspected purified traits into a healthy animal will induce disease if the traits form the etiology of the disease [61]. Many studies investigating the role of specific bacteriophages in human disease have been able to fulfill the first criterion and have found significant differences in viral contigs or specific phages between diseased and healthy individuals (Table 1). However, only a few of these studies are supported by animal experiments, and most of these experiments are in the form of fecal microbiota transplantation (FMT) rather than delivery of specific inoculated phages [62,63]. Furthermore, the question of causality becomes even more complex when, as is often the case, multiple phages are likely to be involved in the etiology of a disease (Table 1).
It is known that both the gut virome and gut microbiome can be pathologically altered in patients with recurrent Clostridium difficile infection [64], and FMT has rapidly become accepted as a viable and effective treatment [65]. Ott et al. described the greater efficacy of bacteria-free fecal filtrate transfer compared to FMT in reduction of symptoms in patients with C. difficile infection [66]. The filtrate recovered from normal stool contains a complex of bacteriophages, as shown by analysis of VLPs from the filtrate, which suggests that phages may mediate the beneficial effects of FMT [66], although this could also be the effect of various metabolites.
Interestingly, phages can also directly influence human immunity. Recent research has shown phages to modulate both human innate and adaptive immunity (reviewed in [67]). One way in which phages can directly influence host immunity was described by Barr et al. as the Bacteriophage Adherence to Mucus model (BAM) [3]. In BAM, phages adhering to mucus reduce bacterial colonization of these surfaces, thereby protecting them from infection and disease [3].
Since their discovery in the early twentieth century, lytic bacteriophages have been seen to have promising potential as antimicrobial agents, although this potential was broadly surpassed by the rapid development of antibiotics as our main antibacterial agents. Currently, the applications of lytic bacteriophages go far beyond their antimicrobial activity as they are now engineered as vehicles for drug delivery and vaccines [68,69] and broadly used in molecular biology and microbiology [70,71].
In recent years there have been some attempts to systematically study the effect of phages in trial settings. Yen et al. showed that prophylactic administration of a Vibrio cholerae-specific phage cocktail protects against cholera by reducing both colonization and cholera-like diarrhea in infant murine and rabbit models [72]. In contrast, Sarker et al. showed that oral coliphages, though safe for use in children suffering from acute bacterial diarrhea, failed to achieve intestinal amplification and improve diarrhea outcome [73]. This was possibly due to insufficient phage coverage and E. coli pathogen titers that were too low, meaning that higher oral phage doses were probably required to achieve the desired effect [73]. These studies demonstrate how bacteriophage therapy is still in its infancy despite its long use in the field of medical sciences [74][75][76] and emphasize the need for more systematic fundamental in vitro studies, translational animal studies, and large, properly controlled randomized trials.

Studying the human gut virome
The extensive study of the bacteriome that has been taking place over the past few years may partly be due to the presence of universal phylogenetic markers such as the 16S rRNA gene. In contrast to bacteria, viruses lack such a universal marker. Studying the virome therefore requires large-scale metagenomic sequencing (MGS) approaches (Fig. 3). However, there are numerous challenges to be overcome in the process of viral MGS data generation and analysis. Below we outline and discuss the common challenges in widely used methods of studying the virome, as well as their possible solutions. The challenges of virome studies and the approaches to tackle them are summarized in Table 2.

Sample collection and storage
The first challenge in gut-microbiome-related studies is the limited number of samples an individual can provide, particularly in the framework of biobanks and large-scale studies. Moreover, in low biomass samples such as viral communities from certain environmental ecosystems and human-related specimens, researchers need to be extremely careful of environmental contamination from kits and reagents [105].
Post-sampling, bacteria and bacteriophages remain in contact with each other and will continue having ecological interactions, which means that prolonged incubation of samples at room temperature can affect the ratio of microbes to the point that they are no longer representative of in situ conditions [78]. Overcoming this issue requires extracting viral genetic material immediately after collection (if possible) or rapidly freezing samples at −80 °C.

Nucleic acid extraction
Similar to gut microbiome studies, gut virome studies begin by isolating the genetic material from intestinal specimens (Fig. 3). Given the perceived predominance of DNA viruses in human stool [14,15], current virome studies mainly use DNA extraction from fecal samples [78][79][80]. However, the current conception of gut virome composition might underestimate the abundance of RNA viruses. For example, RNase I is commonly used in VLP isolation protocols to remove free capsid-unprotected RNA of non-viral origin [78,79]. However, RNase I has recently also been shown to affect the RNA fraction of the virome [84]. To get a true estimate of the RNA viruses in the sample, one needs to restrict the use of RNase I, although this might come at a cost of increased contamination (Table 2).
Fig. 3 The steps in metagenomic study of the virome. Nucleic acid extraction: the virome can be studied by extraction of nucleic acids from both fractions of the total microbial community which includes bacteria and viruses (left) and purified viral-like particles (VLPs; right), and different types of VLP-enriching techniques might be applied to obtain the latter fraction (see main text for details). Genomic library preparation: the extracted viral genetic material is subjected to sequencing after genomic library preparation. Both the choice of genomic library preparation technique and the sequencing coverage can affect the representation of specific members of the viral community in the sample (see discussion in the main text). Quality control: the raw sequencing reads are further trimmed of sequencing adapters, and low-quality and overrepresented reads are discarded. Virome annotation: there are two main ways of studying viral communities: read-mapping to closed reference databases or de novo assembly of viral genomes with optional, but advised, validation of contigs via reference databases

Table 2 Challenges of virome studies and approaches to tackle them

Nucleic acid extraction
• Existence of active and silent fractions of viromes
• Total nucleic acid isolation protocols (TNAI):
  + Allow characterization of microbiome along with virome potential = holistic picture of all components of the microbiome
  + High-throughput
  - Lead to inflation of false-positive hits from bacteria in the subsequent data analysis
• Viral-like particle (VLP) isolation protocols:
  + Ensure true positives on viruses due to physical removal of bacteria by filtration
  - Give a low-concentration output [79] that may complicate the genomic library preparation step
  - Usually require multiple time-consuming steps of VLP and nucleic acid precipitation [78,80]
• Combination of TNAI and VLP isolation protocol approaches [81]

Genomic library preparation
• Limited amount of viral genetic material available
• Use of more sensitive genomic library preparation kits
• MDA may lead to overrepresentation of circular ssDNA viruses [82] and underrepresentation of viruses with extreme GC content [83]
• Restricted use of MDA
• Studying RNA viruses requires additional effort due to the relative instability of RNA genetic material:
  - Use of reverse transcriptase to convert RNA to cDNA
  - Restricted usage of RNase in protocols handling both DNA and RNA viruses [84]
  - May require separate isolation protocol (arising from the previous point) and, therefore, increase of the starting material
  - Some of the WGA techniques that precede the genomic library preparation procedure might introduce biases into the representation of ssDNA viruses [77,82,85]
  - The majority of current genomic library preparation procedures cannot handle ssDNA genomes due to the use of dsDNA adapters
  - ssDNA viruses have been shown to have higher mutation rates than dsDNA viruses [86], thus increasing the microdiversity of the metagenome, which limits reference-based approach
• Use of ssDNA adaptors in adaptor-ligation reaction at the genomic library preparation step [77]
• Selection of an appropriate cut-off for coverage is complicated
• Studies report discoveries of a huge number of viruses at a depth of 1-15 × 10^6 reads per sample [60,78,79,80]

Quality control
• Removal of bacterial sequences is complicated by the viral signals from prophages (both cryptic and inducible) carried by bacterial genomes
• Use of tools for identification of prophages in bacterial genomes [87][88][89], though some are limited to known prophages. The combination of multiple methods has been shown to enrich the set of detected prophages [90] and therefore prevent their concurrent removal with bacterial sequences.

Data analysis
• Existing databases do not fully represent viral diversity [91]
• Use of de novo assembly approaches
• Rapid evolution and diversity of viral genomes limits reference-based approaches
• Use of reference databases that include both cultured viruses and computationally identified viral contigs [25,92]
• Use of a protein-based search
• Use of a profile hidden Markov model based on protein domains allows the identification of remote homologs [93]
• De novo assembly approach is sensitive to biases introduced during genomic library preparation and sequencing:
  - Low DNA input for genomic library preparation decreases the percentage of reads that map back to the corresponding assemblies [94,95]
  - Use of a DNA amplification step might affect the distribution of read coverage [94,96]
  - Shifts in GC content during genomic library preparation [97] affect the completeness of genomes and cause assembly fragmentation
• Adjustment of the assembly pipeline according to applied genomic library preparation procedure [96]: use of modes suitable for an uneven distribution of read coverage such as single-cell SPAdes [98,99] preceded by read de-duplication [96] or Velvet-SC [100]
• Use of genomic library preparation protocols without any amplification procedure (needs high DNA input, probably not applicable for viromics) [101,102]
• Reproducibility of assembly results when combining different assemblers is complicated by technical challenges [103,104] and the possibility of the appearance of chimera assemblies [104]

The main hurdle in studying the virome, however, is the parasitic nature of bacteriophages. Their ability to be incorporated into the host bacterial genome causes the nominal division of the virome into active (lytic phages) and silent (prophages) fractions (Table 2). Depending on the targeted fraction of the virome, DNA extraction protocols may differ substantially. For instance, the active virome is primarily studied through the extraction of DNA from VLPs obtained by filtration, various chemical precipitations [14,15,29,47], and/or (ultra)centrifugation [106,107].
In contrast to studying the active virome, the concurrent targeting of both the silent and active virome (so-called "virome potential") requires total nucleic acid isolation (TNAI) from all the bacteria and viruses in the sample [56][57][58]. While both approaches have their pros and cons (Table 2), a combination of both is desirable, albeit expensive, because this will give the complete picture of the microbiome communities.
In addition to the exclusion of RNA viruses during the isolation of genetic material in some common extraction protocols, ssDNA viruses might also be overlooked. Sequencing of ssDNA virus genomes is difficult because of the limited number of genomic library preparation kits that allow in situ representation of ssDNA viruses without amplification bias (Table 2) [77]. Thus, the current conception that the gut virome is predominantly composed of dsDNA viruses might be biased by the relative ease of processing dsDNA.

Genomic library preparation
At the step of preparation of genomic libraries, low viral biomass poses a new challenge since many existing genomic library preparation kits require inputs of up to micrograms of DNA, amounts that are rarely available for virome samples. Taking into account the perceived predominance of bacteriophages in human stool (see "Major hallmarks of the human gut virome" section), the typical input amount of DNA after the extraction step can be estimated as follows: the number of bacteriophages in 1 g of human feces is 10^9 [108][109][110] and the average genome size of a bacteriophage is 40 kbp [111] (Fig. 2), so the total amount of bacteriophage DNA in 1 g of human feces is 40 × 10^9 kbp, which weighs 43.6 ng. Thus, depending on the elution volume (usually 50-200 μl), any VLP isolation protocol for stool will result in a minuscule concentration of bacteriophage DNA: 0.22-0.87 ng/μl. This is also the range observed in the benchmarking of VLP extraction protocols, although with variations that can reach an order of magnitude in some cases [78][79][80]. Therefore, the application of more sensitive kits that enable the handling of nano- and picograms of DNA input [77] or whole-(meta)genome amplification (WGA) is needed (Table 2). Although WGA has been shown to be a powerful tool for studying the human gut virome [19,20], some WGA techniques, even non-PCR-based methods such as multiple displacement amplification (MDA), unevenly amplify linear genome fragments and might introduce biases into the representation of ssDNA circular viruses [82,85]. Therefore, when MDA is used, the downstream analysis of viral community composition might be limited to presence-absence statistics because relative abundances might be biased towards specific viruses. Another type of WGA, adaptase-linker amplification (A-LA), is preferable for studying differentially abundant viruses since it keeps them quantifiable and allows unbiased representation [77]. Moreover, A-LA allows the study of both ssDNA and dsDNA viruses, in contrast to other quantitative WGA methods such as alternative linker amplification (LA) and tagmentation (TAG), which are mostly focused on dsDNA viruses [77,85].
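The back-of-the-envelope estimate of bacteriophage DNA yield given above can be reproduced with a short calculation. This is a sketch under stated assumptions: the average molar mass of 660 g/mol per base pair of dsDNA is a commonly used approximation and is not taken from the text, and complete recovery of DNA from 1 g of feces is assumed. With this constant the result is roughly 44 ng, close to the 43.6 ng quoted above; the small difference presumably reflects a slightly different conversion factor.

```python
AVOGADRO = 6.022e23       # molecules per mole
BP_MOLAR_MASS = 660.0     # g/mol per base pair of dsDNA (assumed average value)

def phage_dna_ng_per_gram(particles_per_gram=1e9, genome_bp=40_000):
    """Total bacteriophage DNA (ng) in 1 g of feces, assuming dsDNA genomes."""
    total_bp = particles_per_gram * genome_bp
    grams = total_bp * BP_MOLAR_MASS / AVOGADRO
    return grams * 1e9    # convert g to ng

def concentration_ng_per_ul(total_ng, elution_ul):
    """DNA concentration after eluting the whole yield in `elution_ul` microliters."""
    return total_ng / elution_ul

total_ng = phage_dna_ng_per_gram()
low = concentration_ng_per_ul(total_ng, 200)   # largest typical elution volume
high = concentration_ng_per_ul(total_ng, 50)   # smallest typical elution volume
print(f"{total_ng:.1f} ng total; {low:.2f}-{high:.2f} ng/ul")
```

Changing `particles_per_gram` or `genome_bp` shows how strongly the expected input depends on the assumed viral load and genome size.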
At the sequencing step, the selection of a coverage cut-off poses an additional challenge (Table 2). In general, as a very complex and diverse community, the virome requires ultra-deep sequencing [47], even though such sequencing might also complicate downstream analysis [112]. Generally, an increase in coverage leads to an increase in the number of duplicated reads with sequencing errors. These duplicated reads might align to each other and create spurious contigs that prevent assembly of longer contigs [112,113].
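One simple way to gauge and reduce read duplication before assembly is exact de-duplication. The sketch below is illustrative only: the input format and function names are hypothetical, and it removes only reads that are identical to an earlier read or to its reverse complement. Duplicates that differ by a sequencing error, which are the ones highlighted above, would need error-tolerant clustering that is not shown here.

```python
def deduplicate_reads(reads):
    """Keep the first occurrence of each distinct read sequence.

    `reads` is an iterable of (read_id, sequence) tuples (hypothetical format).
    A read is dropped if it is identical to an earlier read or to the
    reverse complement of an earlier read.
    """
    complement = str.maketrans("ACGTN", "TGCAN")
    seen = set()
    kept = []
    for read_id, seq in reads:
        seq = seq.upper()
        revcomp = seq.translate(complement)[::-1]
        key = min(seq, revcomp)   # strand-independent representation
        if key not in seen:
            seen.add(key)
            kept.append((read_id, seq))
    return kept

def duplication_rate(reads):
    """Fraction of reads removed by exact de-duplication."""
    reads = list(reads)
    if not reads:
        return 0.0
    return 1 - len(deduplicate_reads(reads)) / len(reads)

# Toy reads: r2 duplicates r1, and r3 is the reverse complement of r1.
toy_reads = [("r1", "ACGTTG"), ("r2", "ACGTTG"), ("r3", "CAACGT"), ("r4", "GGGTTT")]
kept = deduplicate_reads(toy_reads)
rate = duplication_rate(toy_reads)
```

Tracking the duplication rate across sequencing depths gives a rough indication of when additional coverage mostly adds redundant reads.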

Quality control
After overcoming the barriers faced in isolation and sequencing of virome communities, new challenges need to be overcome in the data analysis. Initially, it is necessary to discard human-host and bacterial-host reads that may introduce biases into the virome community profiling. While there are now many tools that remove nearly all human-related reads, filtering of bacterial reads may be challenging due to the presence of prophages within bacterial genomes. As inducible and cryptic prophages are important players in the gut ecosystem [16,17], it is necessary to filter bacterial reads carefully since they may contain prophage genome sequences that should be taken into consideration during the virome analysis. There are now several tools that can identify prophage sequences in MGS data (Table 2).
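The careful filtering described above can be sketched as a prophage-aware decontamination step. The example assumes that reads have already been aligned to bacterial reference genomes and that prophage coordinates have been predicted by a separate tool; the data structures, coordinate convention, and function name are hypothetical and do not correspond to any specific software.

```python
def split_reads(alignments, prophage_regions, all_read_ids):
    """Partition reads into those kept for virome analysis and those discarded.

    alignments: list of (read_id, contig, start, end) hits against bacterial
        reference genomes (0-based, end-exclusive; hypothetical format).
    prophage_regions: dict mapping contig -> list of (start, end) predicted prophages.
    all_read_ids: every read ID in the sample, in the desired output order.
    """
    bacterial_only = set()
    prophage_hits = set()
    for read_id, contig, start, end in alignments:
        overlaps = any(start < p_end and p_start < end
                       for p_start, p_end in prophage_regions.get(contig, []))
        if overlaps:
            prophage_hits.add(read_id)
        else:
            bacterial_only.add(read_id)
    # A read with any prophage hit is retained even if it also maps elsewhere.
    discarded = bacterial_only - prophage_hits
    kept = [r for r in all_read_ids if r not in discarded]
    return kept, sorted(discarded)

# Toy example with one bacterial contig carrying one predicted prophage.
regions = {"bact1": [(1000, 2000)]}
hits = [
    ("r1", "bact1", 100, 250),     # bacterial backbone only
    ("r2", "bact1", 1500, 1650),   # inside the prophage
    ("r3", "bact1", 950, 1100),    # spans the prophage boundary
]
kept, discarded = split_reads(hits, regions, ["r1", "r2", "r3", "r4"])
```

Reads that do not map to any bacterial genome (here `r4`) pass through unchanged, so only reads confined to the bacterial backbone are removed.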

Data analysis
Sequencing reads that pass quality control are then subjected to virome profiling. Currently, there are two general strategies for virome profiling based on MGS data: (i) reference-based read mapping and (ii) de novo assembly-based profiling (Fig. 3). Both strategies face challenges in the characterization of the viral community (Table 2). The reference-based read mapping approach, which is the one broadly used in microbiome studies, is limited by a scarcity of annotated viral genomes [114]. However, the enormous viral diversity and viral genetic microdiversity also complicate de novo assembly of metagenomes [115,116] (Table 2). Rapid evolution, an innate feature of viruses that allows them to inhabit almost every ecological niche, leads to substantial intraspecies divergence [117]. Although the human gut virome has been shown to be stable over time, partly due to the temperate character of the majority of human gut viruses, some members of the human gut virome can evolve quickly. For example, it has been shown for lytic ssDNA bacteriophages from Microviridae inhabiting the human gut that a 2.5-year period is sufficient for a new viral species to evolve [26]. This may limit the use of reference-based approaches in studying the virome, although some studies have successfully used this method for virome annotation in combination with the de novo assembly-based method [55,118] (Table 2).
The de novo assembly of metagenomes, which was successfully used for the discovery of CrAssphage [28], does not rely on reference databases. Therefore, de novo assembly-based approaches give a more comprehensive estimation of the complexity of viral communities and viral dark matter (uncharacterized metagenomic sequences originating from viruses) (Fig. 3) [119]. However, the outcome of metagenome assembly is highly dependent on read coverage [113], since the default assembly workflow assumes an even coverage distribution for each genome [99]. Some biases introduced during sample processing might affect the coverage distribution and therefore hamper de novo assembly in terms of genome completeness and assembly fragmentation. The sources of such bias include low DNA input for genomic library preparation [94,95], use of A-LA [94,96], and shifted GC content associated with MDA [97]. In addition, it has been shown that the choice of sequencing technology has a minimal effect on the de novo assembly outcome [95], while the choice of assembly software crucially affects results [104] (Table 2).
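The even-coverage assumption can be made concrete with a simple evenness check. The sketch below is an illustration under stated assumptions rather than part of any assembler: it takes read alignments as hypothetical (start, end) intervals on a single genome, computes per-base depth, and summarizes unevenness as the coefficient of variation (CV) of depth, a metric chosen here for simplicity and not taken from the cited studies.

```python
from statistics import mean, pstdev


def per_base_depth(genome_length, read_intervals):
    """Per-base read depth from (start, end) intervals (0-based, end exclusive)."""
    diff = [0] * (genome_length + 1)
    for start, end in read_intervals:
        diff[start] += 1  # coverage begins at start
        diff[end] -= 1    # and stops just before end
    depth, running = [], 0
    for i in range(genome_length):
        running += diff[i]
        depth.append(running)
    return depth


def coverage_cv(depth):
    """Coefficient of variation of depth: 0 means perfectly even coverage."""
    mu = mean(depth)
    return pstdev(depth) / mu if mu > 0 else float("inf")


# Toy 10-bp genome: evenly tiled reads versus reads piled on the first half,
# as might happen after biased amplification.
even_cv = coverage_cv(per_base_depth(10, [(0, 5), (5, 10)]))
uneven_cv = coverage_cv(per_base_depth(10, [(0, 5), (0, 5), (0, 5), (5, 10)]))
print(f"even coverage CV:   {even_cv:.2f}")
print(f"uneven coverage CV: {uneven_cv:.2f}")
```

A high CV for a given genome would flag the kind of skewed coverage that, according to the text, can leave assemblies incomplete or fragmented.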
Regardless of the method chosen for virome annotation, more challenges come at the step of taxonomy assignment to viral sequences. Currently, only 5560 viral species have been described and deposited with the International Committee on Taxonomy of Viruses (ICTV) [31]. Despite the rapid growth of the ICTV database after it allowed the deposition of de novo assembled viral sequences that were not cultured or imaged [120] and the application of gene-sharing networks to viral sequences for taxonomy assignment [121], levels above genus are still unavailable for many known viruses. Nonetheless, there are reasons to be optimistic. The ICTV committee recently decided to expand the taxonomical classification of viruses to ranks above order [122], and the first-ever viral phylum [123] has already been reported. More higher-order ranks can be expected given the increasing pace and uniformity with which novel viral genomes are deposited [124].

Lessons from other ecosystems
Fortunately, the majority of the technical challenges described in Table 2 have already been addressed in studies of viral communities in other human organs (such as skin [125,126] and lungs [127]) and in environmental ecosystems (such as seawater [128,129] and soil [130]). Some of the solutions from environmental studies are now being applied to similar challenges in the human gut (Table 2). However, we still need a systematic approach to studying the gut virome as a complex community. Environmental studies have a long history of taking the entire complex community into account: from the sequencing of the first viral metagenome of an ocean sample in 2002 [131] to the 2019 global ocean survey that revealed almost 200,000 viral populations [132]. This is in striking contrast to human-oriented studies, which have often been limited to the identification of specific pathogens in order to combat them. Given this historical context, additional analytical approaches and hypotheses developed in cutting-edge viral ecogenomic studies of environmental samples might also be applicable to the human gut virome.
Many environmental studies have benefited from the use of multi-omics approaches [81,116,133]. For example, Emerson et al. showed the potential of bacteriophages to influence complex carbon degradation in the context of climate change [81]. This has been possible partially due to the advantages of metatranscriptomics and the concurrent reconstruction of bacterial and viral genomes from soil metagenomics [81]. Additionally, combining metaproteomic and metagenomic approaches has identified highly abundant viral capsid proteins from the ocean, and these proteins may represent the most abundant biological entity on Earth [133].
Alongside these multi-omics approaches, viral metagenomic assembly can be complemented by single-virus genomics (SVG), which involves sequencing the genome of each virus individually after each viral particle has been isolated and amplified. Therefore, unlike de novo assembly of metagenomes, de novo assembly of SVG genomes can address viral genetic microdiversity and thereby enable the reconstruction of more complete viral genomes [116]. SVG has identified highly abundant marine viral species that have, so far, not been found via metagenomic assembly [116]. These newly identified viral species possess proteins homologous to the aforementioned abundant capsid proteins, confirming their widespread presence in oceans [133]. Furthermore, another challenge of de novo assembly, the presence of low-coverage regions, might be overcome through the use of long-read sequencing (> 800 kbp), which was recently shown to recover some complete viral genomes from aquatic samples [134].
In addition to the advances in data generation from viral communities, approaches to overcoming the problem of dominance of unknown sequences in viral metagenomes have been suggested in several environmental studies. Brum et al. used full-length similarity clustering of the proteins predicted from viral genomic sequences to reveal the set of core viral genes shared by samples originating from seven oceans, the diversity patterns of marine viral populations, and the ecological drivers structuring these populations [135]. Taking into account the huge inter-individual variation of the human gut virome (see "Major hallmarks of the human gut virome" section), it might be useful to use a similar approach to identify the core viral genes in the human gut.
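As a rough illustration of how a core set of viral genes might be derived once proteins have been clustered, the sketch below counts in how many samples each protein cluster occurs and keeps those above a prevalence threshold. Everything here is hypothetical: the clustering step itself is not shown, the sample and cluster identifiers are invented, and the `min_fraction` threshold is an assumption rather than a value from Brum et al.

```python
from collections import Counter


def core_clusters(sample_to_clusters, min_fraction=1.0):
    """Protein clusters present in at least min_fraction of the samples."""
    n_samples = len(sample_to_clusters)
    if n_samples == 0:
        return set()
    counts = Counter()
    for clusters in sample_to_clusters.values():
        counts.update(set(clusters))  # count each cluster once per sample
    return {c for c, n in counts.items() if n / n_samples >= min_fraction}


# Hypothetical presence data: protein clusters detected in three gut viromes.
sample_to_clusters = {
    "individual_A": {"pc1", "pc2", "pc3"},
    "individual_B": {"pc1", "pc3"},
    "individual_C": {"pc1", "pc4"},
}
strict_core = core_clusters(sample_to_clusters)                   # in every sample
relaxed_core = core_clusters(sample_to_clusters, min_fraction=0.6)
print("strict core:", sorted(strict_core))
print("relaxed core:", sorted(relaxed_core))
```

Given the large inter-individual variation noted in the text, a relaxed threshold may be more informative for gut viromes than requiring presence in every individual, although the appropriate cut-off would need to be justified empirically.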
To understand the mechanisms behind the phage-host interaction in the context of the gut ecosystem, it might also be useful to use viral-encoded auxiliary metabolic genes (AMGs). The analysis of AMGs and their abundance in marine samples facilitated the identification of the role of bacteriophages in nitrogen and sulfur cycling by affecting the host metabolism [136]. Furthermore, the study of viral communities in the polar region of the Southern Ocean highlighted the value of AMG analysis in understanding how lytic and temperate phages survive during seasonal changes in their bacterial host abundance, which follows the availability of nutrient resources [137]. Another approach, applied by Zeigler Allen et al. in the study of the marine microbiome community, suggests using bacteriophage sequence signatures, together with measures of the virus/bacteria ratio and bacterial diversity, to evaluate the influence of viruses on the bacterial community instead of direct comparison of co-abundance profiles [138]. This method redefined the viral infection potential and confirmed the role of bacteriophages in shaping the entire marine community structure.
Similarly, in soil ecosystems, where bacteria dominate over archaea and eukaryotes as they do in marine ecosystems, it has been shown that phages play an important role in defining ecosystem composition and function [81,130,139]. Moreover, in ecosystems such as anaerobic digesters, more than 40% of the total variation of the prokaryotic community composition is explained by the presence of certain phages, and this is much higher than the explanatory potential of abiotic factors (14.5%) [140]. Studies in plants have also demonstrated that phages are a major factor influencing bacterial composition [141]. However, the applicability of these findings to the human gut, which is also a bacteria-dominated ecosystem, has yet to be explored.
It is important to bear in mind that ecological concepts from one ecosystem might have limited applicability to another. Even if two ecosystems have similar viral community structures, the underlying ecological relationships may differ. For example, a predominance of temperate viruses was reported in a polar aquatic region [137]. This predominance of temperate phages corresponds to that in the gut ecosystem. However, for the polar marine ecosystem, it was shown that temperate phages switch from lysogeny to lytic infection mode with the rise of bacterial abundance [137]. This is opposite to the Piggyback-the-Winner model observed in the human gut, where temperate phages dominate over lytic phages when the bacterial host is abundant [142,143]. This difference between the gut and a distinct marine ecosystem reflects exposure to different environmental factors. The polar aquatic region has a periodic nature owing to the change of seasons, while the gut ecosystem can be considered relatively stable (see "Major hallmarks of the human gut virome" section). Therefore, while human gut viromics might benefit from considering some cutting-edge approaches developed in environmental studies, caution should be exercised in extrapolating ecological concepts found in distinct ecosystems to situations pertaining to the human gut.

Concluding remarks
Given the fascinating and challenging nature of viruses, emerging evidence for the role of gut bacteriophages in health and disease, and ongoing paradigm shifts in our understanding of the role of certain viruses in other ecosystems, the further development of viromics is well warranted. Once we have overcome the current challenges of gut virome research, for example, through optimization of virome isolation protocols and expansion of the current databases of (un)cultivated viruses, future directions for development in the study of the human gut virome will be: (i) to establish a core gut virome and/or core set of viral genes through the use of large longitudinal cohort studies; (ii) to study the long-term evolution of bacteriome-virome interactions under the influence of external factors; and (iii) to establish the causality of the correlations with host-related phenotypes through the use of model systems, multi-omics approaches, and novel bioinformatic techniques, possibly including those inherited from environmental studies.