In this work, we present a method that automatically predicts the taxonomic labels (phylum to genus) of the hosts of viral contigs. Although host taxonomy prediction can be conducted at the species or even strain level, considering both polyvalent phages and the lack of known virus-host relationships, we focus on predicting the hosts' taxonomic ranks from phylum to genus in order to deliver more reliable results.
The key component of our method is the semi-supervised learning model GCN [26]. GCN can flexibly model the sequence-level relationships between viruses or prokaryotes using a knowledge graph and conduct convolution using node features and the topological structure. One big difference between a CNN and a GCN is that each node in a GCN can have a different convolutional filter/kernel depending on its connections with other nodes. The convolution is conducted on each node using its own feature and the combined features of its neighboring nodes. Thus, information can be passed between the labeled samples/nodes and the unlabeled samples/nodes. In biological data analyses, there exist many topological structures such as gene-sharing networks, disease-drug relationship graphs, and disease-gene relationship graphs. Utilizing these relationships in GCN has led to several successful applications [27–31].
In our problem of predicting hosts for viruses, we create a knowledge graph that integrates three types of information. First, although virus receptor-binding proteins play an important role in helping viruses attach to their target hosts, many other proteins are involved in the process of virus infection [32]. Thus, viruses sharing more genes tend to infect hosts in the same taxonomic group. The similarity from gene sharing can be represented by edges between viruses. Second, viruses and their hosts can possess local sequence similarities, a feature used by many available host prediction programs. These sequence similarities can be modeled as edges between viruses and hosts in the knowledge graph. Third, the nodes in the knowledge graph can be encoded as numerical vectors using a CNN trained for taxonomic classification of viruses. Then, graph convolution is conducted on each node and its neighbors based on the knowledge graph. The error minimization process in training helps the model fit the labeled samples and back-propagates the loss to the whole graph. After training, the learned convolutional filters are applied to predict test samples that are connected to the knowledge graph.
Construction of the knowledge graph
Figure 1 (IV) sketches the knowledge graph. Viral and host sequences are represented by circles and triangles, respectively. All host nodes have their taxonomic labels. For virus nodes, the colored ones are the training sequences with known hosts, and thus their labels are the taxonomic labels of their hosts. White nodes represent query virus genomes or contigs without host information (i.e., test data). The semi-supervised learning model finally assigns labels to the white nodes.
To encode the nodes, a pre-trained CNN is applied to capture motif-related patterns from the input sequences (Fig. 1.I). There are two types of edges in the knowledge graph: virus-virus and virus-host. An edge between viruses represents sequence similarity and the similarity of shared protein families. An edge between a virus and a host node represents local similarity between their genomes. By combining the node features and edges, we construct a knowledge graph and feed it to the GCN for training (Fig. 1.IV). In addition, in order to quantify the confidence of the prediction, we combine the expected calibrated error (ECE) and the mean square error (L2) in the training process. After training, the knowledge graph and the learned convolution parameters are used to predict the hosts of new viruses. We discuss the details of edge construction in the “Edge construction” section and node encoding in the “Node construction” section.
Edge construction
Virus-virus connection
In this section, we first introduce the method of constructing protein clusters, which are used to establish edges between viruses (Fig. 1.III). There are three steps to construct the protein clusters. First, we extract proteins from all the virus sequences and apply DIAMOND to measure protein similarity. For available reference genomes, the protein sequences are downloaded from NCBI RefSeq. For query/new viral contigs, we conduct gene finding and protein translation using Prodigal [33]. Then, we employ DIAMOND to conduct all-against-all pairwise alignment between the contigs’ translations and the reference protein sequences. DIAMOND outputs the alignment of each protein pair with an E-value below a given cutoff (the default cutoff is 1e-5). Second, based on the alignment results, we construct a protein similarity network, where the nodes are the proteins and the edges represent the alignments. The edge weight is the negative logarithm of the corresponding E-value. Finally, the protein clusters are identified by the Markov clustering algorithm (MCL).
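For illustration, the sketch below outlines these three steps under stated assumptions: a FASTA file containing all proteins (reference proteins plus Prodigal translations), a pre-built DIAMOND database, and the stand-alone `mcl` program; the file names and thresholds are placeholders rather than HostG’s exact implementation.

```python
import subprocess
import math

E_CUTOFF = 1e-5   # default DIAMOND E-value cutoff mentioned in the text

# 1) all-against-all DIAMOND alignment (database built beforehand with `diamond makedb`)
subprocess.run(
    ["diamond", "blastp", "--query", "all_proteins.faa", "--db", "all_proteins",
     "--evalue", str(E_CUTOFF), "--out", "all_vs_all.tsv",
     "--outfmt", "6", "qseqid", "sseqid", "evalue"],
    check=True,
)

# 2) protein similarity network: edge weight = -log10(E-value) of the alignment
with open("all_vs_all.tsv") as fin, open("protein_graph.abc", "w") as fout:
    for line in fin:
        query, subject, evalue = line.split()[:3]
        if query == subject:
            continue                                  # skip self-hits
        weight = -math.log10(float(evalue) + 1e-300)  # avoid log(0)
        fout.write(f"{query}\t{subject}\t{weight:.3f}\n")

# 3) Markov clustering (MCL) on the weighted network
subprocess.run(["mcl", "protein_graph.abc", "--abc", "-o", "protein_clusters.txt"],
               check=True)
```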
$$ P(y \ge c) = \sum_{i=c}^{\min(a,b)} \frac{\binom{a}{i}\binom{n-a}{b-i}}{\binom{n}{b}} $$
(1)
$$ E_{virus-virus} = \left\{\begin{array}{lc} 1,& if \ -\log\left(P(y \ge c) \times \binom{N}{2}\right) \ge \tau_{1}\\ 0,& otherwise \end{array}\right. $$
(2)
Following the idea in [34, 35], we calculate the expected number of sequence pairs sharing at least the observed number of common proteins. By making the simplification that all protein clusters have the same probability of being chosen, we can calculate the probability that any two sequences containing a and b protein clusters share at least c clusters by Eq. 1, where y is the number of common protein clusters and n is the total number of protein clusters. Then we calculate the expected number of sequence pairs with at least c common proteins out of \(\binom {N}{2}\) sequence pairs, where N is the total number of sequences. As shown in Eq. 2, this expected value is finally used to determine whether there is an edge between two sequences. The threshold τ1 is 1 by default. As c increases, P becomes small enough to return a positive Evirus−virus. Because the sizes of the protein clusters vary considerably, different clusters have different probabilities of being chosen/shared. Eq. 1 is therefore an inaccurate but practically useful approximation that allows the background probability to be computed efficiently.
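The following is a minimal sketch of Eqs. 1 and 2, assuming the counts a, b, c, the total number of protein clusters n, and the total number of sequences N have already been obtained from the clustering step; it is illustrative rather than HostG’s exact code.

```python
from math import comb, log

def p_at_least_c_shared(a: int, b: int, c: int, n: int) -> float:
    """Eq. 1: probability that two sequences with a and b protein clusters
    (out of n total clusters) share at least c clusters."""
    return sum(comb(a, i) * comb(n - a, b - i)
               for i in range(c, min(a, b) + 1)) / comb(n, b)

def virus_virus_edge(a: int, b: int, c: int, n: int, N: int, tau1: float = 1.0) -> int:
    """Eq. 2: edge exists if -log(P(y >= c) * C(N, 2)) >= tau1."""
    expected_pairs = p_at_least_c_shared(a, b, c, n) * comb(N, 2)
    return 1 if expected_pairs > 0 and -log(expected_pairs) >= tau1 else 0
```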
Virus-host connection
While Evirus−virus is used to evaluate whether two viruses share a significant number of proteins, Evirus−host is used to measure the sequence similarity between viruses and hosts. We employ BLASTN to compute the significance of sequence alignments between viruses and hosts. For P virus genomes and B prokaryote genomes, we create P×B virus-host pairs. Then, as shown in Eq. 3, only pairs whose BLASTN E-value is smaller than τ2 (default 1e-5) form virus-host connections. Note that if there are multiple alignments between a virus and a potential host, we only create one edge between them. In addition, because we have some known virus-host connections from the public dataset, we connect viruses with their known hosts regardless of their alignment E-values.
$$ E_{virus{-}host} = \left\{\begin{array}{lc} 1,& if \ virus{-}host\ interaction \ exists\\ 1,& if \ E{-}value < \tau_{2}\\ 0,& otherwise \end{array}\right. $$
(3)
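A minimal sketch of Eq. 3 is shown below, assuming standard BLASTN tabular output (-outfmt 6) and a two-column file of known virus-host interactions; both file names are hypothetical.

```python
TAU2 = 1e-5

edges = set()

# known virus-host interactions are always connected, regardless of E-value
with open("known_interactions.tsv") as f:        # columns: virus_id, host_id
    for line in f:
        virus, host = line.split()[:2]
        edges.add((virus, host))

# BLASTN tabular output (-outfmt 6): qseqid sseqid ... evalue bitscore
with open("virus_vs_host_blastn.tsv") as f:
    for line in f:
        cols = line.split()
        virus, host, evalue = cols[0], cols[1], float(cols[10])
        if evalue < TAU2:
            edges.add((virus, host))             # one edge per pair, even with multiple hits
```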
Node construction
Recent research shows that CNNs have the ability to learn motif-related features automatically [36, 37]. Following [30], we take advantage of a CNN to learn features common to viruses of the same taxonomic group. These features are used to encode each virus node. Figure 2 shows the training (A) and encoding (B) modes of the CNN. In the training mode, we use all the virus reference genomes with known genus labels to train the model for phage genus-level classification. In the encoding mode, the model outputs the feature vector of the first dense layer of the pre-trained CNN. These outputs represent the encoded features of the original genomes/contigs and serve as the node features in the knowledge graph.
CNN training
There are four steps in the training mode. Because the CNN only accepts inputs of the same length, we first split the input genome into 2 kbp segments. The segments have the same labels as the original genome. Second, we train a skip-gram model to convert each sequence into a numerical vector, because it can map proximate k-mers into similar vectors in a high-dimensional space and thus improve the learning ability [38]. Thus, each 2 kbp segment is converted into a matrix \(X \in \mathbb {R}^{(2000-k) \times d}\), where k is the length of the k-mers and d is the number of hidden units in the skip-gram model (default 100). A detailed description of the skip-gram model can be found in the supplementary file.
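A minimal sketch of this k-mer embedding step, assuming the gensim implementation of skip-gram (Word2Vec with sg=1), is shown below; the window size and the toy segments are illustrative, while d = 100 follows the default stated above.

```python
from gensim.models import Word2Vec

def to_kmers(seq: str, k: int = 3):
    """Split a nucleotide sequence into overlapping k-mers."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# in practice each "sentence" is a 2 kbp segment; short toy sequences shown here
segments = ["ATGCGTACGTTAGC", "GGCATAACGTTGCA"]
sentences = [to_kmers(seq) for seq in segments]

# skip-gram model (sg=1) with d = 100 hidden units
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

# a segment is embedded as a (number_of_kmers x 100) matrix X
X = [model.wv[kmer] for kmer in sentences[0]]
```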
$$ Z^{i,k}(X) = w^{k}_{{conv}} * X[i:i+d_{1}-1][1:100] + b_{k} $$
(4)
$$ H^{(0)} = \left[Maxpool(Z_{0}(X)),\cdots,Maxpool(Z_{N_{{conv}}}(X))\right] $$
(5)
Third, the embedded matrix X is fed to the convolutional layers. As shown in Fig. 2A, rather than stacking convolutional layers, we apply multiple convolutional layers with different filter sizes in parallel. With the benefit of this design, the CNN can capture sequence patterns of different lengths. For each convolutional layer, the feature value at location i in the kth feature map Zi,k(X) is calculated by Eq. 4, where \(w^{k}_{{conv}}\) is the kth filter/kernel in the convolutional layer, d1 is the filter size, and bk is the bias of the kth feature map. Each convolutional layer generates a high-dimensional tensor. Then, as shown in Eq. 5, we apply max pooling to retain the most important features from these tensors and concatenate them as H(0). Nconv is the total number of convolutional layers used in the structure. Detailed parameters are listed in the supplementary file.
$$ H^{(l+1)} = ReLU\left(H^{(l)},w^{(l)}\right),\ l \in \{0,1\} $$
(6)
$$ output\ of\ train\ mode = SoftMax\left(H^{(2)}, w^{(2)}\right) $$
(7)
Finally, H(0) is fed to dense layers to compress the information and make predictions. The dense layers are defined by Eq. 6, where H(l) is the feature map of the lth hidden layer. We apply the SoftMax function to generate the predictions (Eq. 7) and minimize the error between the labels and the predictions accordingly.
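The PyTorch sketch below illustrates the training-mode architecture (Eqs. 4–7): parallel convolutional layers with different filter sizes, max pooling and concatenation, two dense layers, and a SoftMax output. The filter sizes, filter counts, and hidden dimensions are placeholders; the actual values are given in the supplementary file.

```python
import torch
import torch.nn as nn

class SegmentCNN(nn.Module):
    def __init__(self, d=100, filter_sizes=(3, 7, 11, 15), n_filters=64,
                 hidden=512, n_classes=100):
        super().__init__()
        # parallel convolutional layers with different filter sizes (Eq. 4)
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_channels=d, out_channels=n_filters, kernel_size=fs)
             for fs in filter_sizes])
        self.fc1 = nn.Linear(n_filters * len(filter_sizes), hidden)  # Eq. 6, l = 0
        self.fc2 = nn.Linear(hidden, hidden)                         # Eq. 6, l = 1
        self.out = nn.Linear(hidden, n_classes)                      # Eq. 7

    def forward(self, x):                 # x: (batch, seq_len, d) embedded k-mers
        x = x.transpose(1, 2)             # -> (batch, d, seq_len) for Conv1d
        # global max pooling over each feature map, then concatenation (Eq. 5)
        h0 = torch.cat([conv(x).max(dim=2).values for conv in self.convs], dim=1)
        h1 = torch.relu(self.fc1(h0))     # first dense layer (reused later for encoding)
        h2 = torch.relu(self.fc2(h1))
        return torch.softmax(self.out(h2), dim=1)
```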
Virus nodes
In the encoding mode, we use the pre-trained CNN to encode viruses. We feed genomes/contigs as input to the pre-trained CNN. If the input genome is longer than 2 kbp, we follow the first step of the training mode and cut it into segments of 2 kbp. As shown in Eq. 8, we use the output vector of the first dense layer as the learned encoded features. Thus, the pre-trained CNN outputs an encoded vector for each segment. We then sum the vectors of all segments and divide the summed vector by the number of segments.
$$ output\ of\ encoding\ mode = ReLU\left(H^{(0)}, w^{(0)}\right) $$
(8)
Host nodes
In order to ensure consistency in node encoding, we use the weighted average of the feature vectors of neighboring viruses to encode host nodes. A virus-host connection with a smaller E-value indicates higher similarity and is thus assigned a larger weight.
$$ {Node}_{host} = \frac{\sum_{i}^{N_{virus}} {Node}_{virus}(i) \times (-\log(E_{value}(i)))}{\sum_{i}^{N_{virus}} (-\log(E_{value}(i)))} $$
(9)
As shown in Eq. 9, the feature vector of the host node is calculated by its neighboring virus nodes in the knowledge graph. Thus, host genomes from new taxa can still be encoded as feature vectors. This encoding method allows convenient extension of the knowledge graph to include new labels as discussed in the “Extension to new labels” section.
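A minimal sketch of Eq. 9, assuming the encoded virus vectors and the corresponding alignment E-values are already available:

```python
import numpy as np

def encode_host(virus_vectors, evalues):
    """virus_vectors: (N_virus, 512) array of neighboring virus node encodings;
    evalues: length-N_virus array of BLASTN E-values for the virus-host edges."""
    weights = -np.log(np.asarray(evalues) + 1e-300)   # smaller E-value -> larger weight
    return (weights[:, None] * np.asarray(virus_vectors)).sum(axis=0) / weights.sum()
```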
The GCN model
After constructing the knowledge graph, we train a GCN to decide the taxonomic group of the viruses’ hosts. As shown in Fig. 3, in the graph convolutional layer, each node uses the information traversed from its neighbors. Eq. 10 shows the basic operation of the graph convolutional layer.
$$ H^{(l+1)} = ReLU \left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}H^{(l)}\theta^{(l)}\right),\ l \in \{0,1\} $$
(10)
$$ Out = SoftMax\left(H^{(2)} \theta_{{dense}}\right) $$
(11)
\(A \in \mathbb {R}^{K \times K}\) is the adjacency matrix, where K is the number of nodes in the knowledge graph. \(\tilde {A} = A + I\), where \(I \in \mathbb {R}^{K \times K}\) is the identity matrix. \(\tilde {D}\) is the diagonal degree matrix calculated by \(\tilde{D}_{{ii}} = {\sum _{j}} \tilde {A}_{{ij}}\). H(l) is the hidden feature in the lth layer and \(H^{(0)} \in \mathbb {R}^{K \times 512}\) is the node feature matrix. θ(l) is the matrix of trainable filter parameters in the lth layer. Then, we feed the output of the graph convolutional layers to a dense layer and utilize the SoftMax function to give the final output (Eq. 11), where θdense is the weight matrix of the dense layer. During training, we calculate the SoftMax value for all nodes in the knowledge graph, but only the SoftMax values of labeled nodes are used to calculate the loss and update the parameters. Because all host nodes are labeled, they are also used to minimize the loss. After training, the SoftMax value of each unlabeled node is used to assign its taxonomic label accordingly.
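The PyTorch sketch below illustrates the two-layer GCN (Eqs. 10 and 11) with the symmetric normalization described above and a loss restricted to labeled nodes; the hidden dimension and number of classes are placeholders.

```python
import torch
import torch.nn as nn

def normalize_adj(A):
    """D^{-1/2} (A + I) D^{-1/2} with D_ii = sum_j (A + I)_ij."""
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

class HostGCN(nn.Module):
    def __init__(self, in_dim=512, hidden=128, n_classes=50):
        super().__init__()
        self.theta0 = nn.Linear(in_dim, hidden, bias=False)   # theta^(0)
        self.theta1 = nn.Linear(hidden, hidden, bias=False)   # theta^(1)
        self.dense = nn.Linear(hidden, n_classes)             # theta_dense

    def forward(self, A_hat, H0):
        H1 = torch.relu(A_hat @ self.theta0(H0))      # Eq. 10, l = 0
        H2 = torch.relu(A_hat @ self.theta1(H1))      # Eq. 10, l = 1
        return torch.softmax(self.dense(H2), dim=1)   # Eq. 11

# training computes the loss only on labeled nodes, e.g.
# loss = criterion(out[labeled_mask], labels[labeled_mask])
```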
Since we predict hosts at multiple taxonomic ranks, we train one model for each taxonomic level (from genus to phylum) separately. Specifically, we re-use the same knowledge graph for training, but the labels of the nodes differ according to the taxonomic level. Since there might exist inconsistencies in the predicted host taxonomy, HostG only outputs the label at the higher taxonomic rank when a conflict occurs, as sketched below.
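A minimal sketch of this conflict resolution, assuming a hypothetical `lineage` lookup that maps each taxon to its parent taxon:

```python
LEVELS = ["phylum", "class", "order", "family", "genus"]   # high to low rank

def resolve(predictions, lineage):
    """predictions: {level: taxon} from the per-level models;
    keep lower-rank labels only while they remain consistent with the higher rank."""
    result = {LEVELS[0]: predictions[LEVELS[0]]}
    for higher, lower in zip(LEVELS, LEVELS[1:]):
        if lineage.get(predictions[lower]) != result[higher]:
            break                        # conflict: stop at the higher taxonomic rank
        result[lower] = predictions[lower]
    return result
```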
Expected calibrated error (ECE)
Recent research shows that the SoftMax value cannot represent the real confidence of a prediction [39]. To improve the prediction reliability of our method, we add ECE [40] to the objective function and update the parameters together with the mean square error (L2), as shown in Eq. 12.
$$ \mathcal{L} = ECE + L2 $$
(12)
ECE aims to minimize the difference between the SoftMax value and the accuracy. By updating the parameters with ECE, a prediction with a higher SoftMax value has a higher probability of being correct, so that we can use the SoftMax value to represent the confidence of the prediction. We first define the ECE function. Suppose we split the SoftMax value (ranging from 0 to 1) into Nb bins, with each bin covering a region of size \(\frac {1}{N_{b}}\), such as [0, \(\frac {1}{N_{b}}\)), [\(\frac {1}{N_{b}}, \frac {2}{N_{b}}\)), etc., as shown in Fig. 4. In each training epoch, the model outputs a prediction for each sample with the corresponding SoftMax value. Then, we can calculate the accuracy and the average SoftMax value for each bin. Finally, as shown in Eq. 13, ECE is computed as the weighted sum of the differences between the accuracy and the average confidence (SoftMax value) in each bin, where T is the total number of samples, Ti is the number of samples in the ith bin, and Acci is the accuracy of the ith bin. The average confidence in each bin is computed by Eq. 14, where \(\hat {p}(x_{{ij}})\) is the SoftMax value of the jth sample in the ith bin.
$$ ECE = {\sum_{i}^{N_{b}}\frac{T_{i}}{T} \left | {Acc}_{i} - {conf}_{i} \right | } $$
(13)
$$ {conf}_{i} = \frac{{\sum_{j}^{T_{i}}} \hat{p}(x_{{ij}})} {T_{i}} $$
(14)
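The sketch below computes ECE per Eqs. 13 and 14 from the per-sample SoftMax values and correctness indicators; it illustrates the metric itself rather than the exact loss implementation used during training.

```python
import numpy as np

def expected_calibrated_error(softmax_conf, correct, n_bins=10):
    """softmax_conf: maximum SoftMax value per sample; correct: 1 if the prediction is right."""
    softmax_conf = np.asarray(softmax_conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    T = len(softmax_conf)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = i / n_bins, (i + 1) / n_bins
        mask = (softmax_conf >= lo) & (softmax_conf < hi)
        if i == n_bins - 1:
            mask |= softmax_conf == hi        # include samples with confidence exactly 1
        T_i = mask.sum()
        if T_i == 0:
            continue
        acc_i = correct[mask].mean()          # Acc_i
        conf_i = softmax_conf[mask].mean()    # Eq. 14
        ece += (T_i / T) * abs(acc_i - conf_i)  # Eq. 13
    return ece

# per Eq. 12, the total loss combines this term with the mean square error (L2)
```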
Then, we calculate the total loss of the current training epoch according to Eq. 12 and update the trainable parameters in the GCN. After training with the ECE loss, the difference between the accuracy and the average SoftMax value in each bin becomes smaller. Bins with higher SoftMax values achieve higher accuracy, and thus the SoftMax values can be used to represent the confidence of the predictions.
Extension to new labels
As the number of sequenced viruses and prokaryotes is still increasing rapidly every year, there might exist viruses infecting prokaryotes whose taxa are not included in current virus-host databases. Many existing tools can only predict hosts whose taxa are in the training data. For example, if the training sequences only contain hosts from taxa y1, y2, and y3, these tools can only learn to predict hosts of those three taxa. When an input virus infects hosts from y4, these tools are unable to give a correct prediction.
Our semi-supervised learning model allows HostG to extend to new host taxa by adding new nodes to the knowledge graph. The main idea of graph extension is to integrate new taxonomic labels by adding more host nodes to the knowledge graph. Because all the labeled nodes (including some virus nodes and all host nodes) are used to calculate the loss and update the parameters during training, these new labels are propagated through the topological structure. Therefore, HostG not only learns from the existing virus-host interactions, but also learns the similarity between viruses and new prokaryotes to predict new host taxa. An example is sketched in Fig. 1 (IV). When we need to extend the GCN to include new host taxa (orange nodes) that do not exist in the given dataset, we create nodes to represent these hosts. Edges connecting the new nodes can be constructed conveniently according to Evirus−virus and Evirus−host as described in the “Edge construction” section. Then, the GCN learns and traverses information within the knowledge graph, and the new taxonomic labels can also be propagated along the edges. In this way, HostG can predict the new taxa for query viruses. We demonstrate in the experiments that the model can still achieve reliable performance after adding new taxa. Thus, users can conveniently extend HostG to any taxa according to their needs.