 Methodology article
 Open access
TPpredLE: therapeutic peptide function prediction based on label embedding
BMC Biology volume 21, Article number: 238 (2023)
Abstract
Background
Therapeutic peptides play an essential role in human physiology, treatment paradigms and biopharmacy. Several computational methods have been developed to identify the functions of therapeutic peptides based on binary classification and multilabel classification. However, these methods fail to explicitly exploit the relationship information among different functions, preventing the further improvement of the prediction performance. Besides, with the development of peptide detection technology, peptide functions will be more comprehensively discovered. Therefore, it is necessary to explore computational methods for detecting therapeutic peptide functions with limited labeled data.
Results
In this study, a novel method called TPpredLE, based on the Transformer framework, was proposed for predicting multiple functions of therapeutic peptides. It explicitly extracts the function correlation information by using the label embedding methodology and exploits the specificity information through function-specific classifiers. In addition, we incorporated the multilabel classifier retraining approach (MCRT) into TPpredLE to detect new therapeutic functions with limited labeled data. Experimental results demonstrate that TPpredLE outperforms the other state-of-the-art methods, and that TPpredLE with MCRT is robust to limited labeled data.
Conclusions
In summary, TPpredLE is a function-specific classifier for accurate therapeutic peptide function prediction, demonstrating the importance of the relationship information for this task. MCRT is a simple but effective strategy to detect functions with limited labeled data.
Background
Therapeutic peptides play an essential role in human physiology, treatment paradigms, and biopharmacy [1,2,3]. Over the last few decades, peptide-based therapeutics have received a great deal of attention from researchers due to their advantages in drug discovery and design [4, 5]. During the COVID-19 epidemic, therapeutic peptides have shown their potential as agents against SARS-CoV-2 [6,7,8]. In addition to the antiviral function, therapeutic peptides also show other functions, such as antimicrobial, anticancer, and anti-inflammatory activity [9, 10]. Recognizing the functions of therapeutic peptides is therefore important.
Data-driven computational methods have been widely used in therapeutic peptide function prediction over the past decade. These methods can be categorized into two groups in terms of methodology: (i) binary classification methods and (ii) multilabel classification methods.
The binary classification methods usually utilize conventional machine learning predictors combined with different feature extraction methods. PEPredSuite [9] is an efficient approach based on random forest (RF) for therapeutic peptide function prediction that integrates distinct feature descriptors for different peptide functions. PPTPP [11] is also an RF-based method, in which the feature extraction method MRMD2.0 was adopted to produce and rank physicochemical property-related features. TPpredATMV [10] adopted multi-view learning, which assumes that different property features are derived from common latent subspaces, and utilized the high correlation among different features to predict the peptide functions. These methods construct a specific predictor for each therapeutic peptide function independently, ignoring the correlation among different peptide functions.
Multilabel classification methods have attracted increasing attention in recent years. MLBP [12] treated the prediction of bioactive peptides as a multilabel classification task and adopted a convolutional neural network (CNN) and bidirectional gated recurrent units (BiGRU) to predict multifunctional bioactive peptides. PrMFTP [13] introduced the attention mechanism [14] on top of MLBP. However, these predictors only consider the sequence information and fail to explicitly incorporate the relationship information among multifunctional peptides, such as the correlation information and the specificity information.
As discussed above, the existing methods suffer from two major disadvantages. (i) They fail to explicitly and accurately capture the relationship among different therapeutic peptide functions. For example, the methods based on binary classifiers only consider the specificity information of monofunctional therapeutic peptides and ignore the correlation information among different functions. (ii) For newly sequenced therapeutic peptides, the existing computational predictors cannot accurately detect their full range of functions. Therefore, it is desirable to recognize their unknown functions with limited labeled data.
In this study, we proposed a computational predictor called TPpredLE for multifunctional therapeutic peptide prediction. TPpredLE exploits the relationship among different functions, including the correlation information and the specificity information, based on label embedding and function-specific classifiers. Furthermore, we proposed a multilabel classifier retraining approach (MCRT) based on the classifier retraining approach (cRT) [15], which was incorporated into TPpredLE to detect new functions with limited labeled data. Experimental results demonstrate that TPpredLE achieves state-of-the-art performance.
Results
An overview of TPpredLE
In this study, we evaluate the prediction ability of TPpredLE on a benchmark dataset with 15 different therapeutic peptide functions, including AMP (antimicrobial peptide), TXP (toxic peptide), ABP (antibacterial peptide), AIP (anti-inflammatory peptide), AVP (antiviral peptide), ACP (anticancer peptide), AFP (antifungal peptide), DDV (drug delivery vehicle peptide), CPP (cell-penetrating peptide), CCC (cell–cell communication peptide), APP (antiparasitic peptide), AAP (antiangiogenic peptide), AHTP (antihypertensive peptide), PBP (polystyrene surface-binding peptide), and QSP (quorum sensing peptide). The benchmark dataset is divided into a training dataset, a validation dataset, and an independent test dataset.
The framework of TPpredLE is illustrated in Fig. 1. TPpredLE contains three modules: (i) the sequence embedding module, (ii) the label embedding module, and (iii) the classifier module. The sequence embedding module is mainly composed of the Transformer encoder, in which the residue-residue attention embeds the relationship information between any two residues along the sequence. The Transformer decoder plays an essential role in the label embedding module, in which the function-function attention learns the relationship information between different therapeutic peptide functions, and the function-residue attention integrates the residue embedding and the function embedding. A representation vector corresponding to each function is constructed after the two embedding processes. Finally, each function is predicted by the classifier module based on the representation vectors.
We use the multilabel metrics ACC_{example} (example-based accuracy) and F1_{label} (label-based F1-score) to evaluate the overall performance of TPpredLE [16]. In addition, we utilize binary classification metrics to evaluate the performance on each therapeutic peptide function prediction task in a one-versus-all form, including the area under the ROC curve (AUC) [17], Matthews correlation coefficient (MCC) [18], the F1 measure [19], and the K-category correlation coefficient (RkCC) [20].
Relationship information among therapeutic peptide functions can improve the performance
We conduct ablation experiments to investigate the importance of the relationship information among different therapeutic peptide functions, including the correlation information and the specificity information. The corresponding results are listed in Table 1, from which we can see that TPpredLE achieves the best performance because it takes advantage of both the correlation information and the specificity information, demonstrating the importance of relationship information among therapeutic peptide functions for therapeutic peptide prediction. Specifically, TPpredLE outperforms model E, in which the function-specific classifiers are removed and all functions share a single classifier, indicating that it is useful for the function-specific classifiers to learn a unique decision boundary for each function with its own feature distribution.
We further visualize Pearson’s correlation coefficient [21] between the functions in the training set, together with the average Pearson’s correlation coefficient computed over the function representations learned by the label embedding module in TPpredLE. The detailed mathematical formulas are described in Additional file 1: Supplementary Material S1 [21]. The results are shown in Fig. 2, from which we can see that related functions tend to show similar representations, indicating that the function representations are able to capture the characteristics of the therapeutic peptide functions.
Performance comparison among different predictors for therapeutic peptide function prediction
Most of the existing methods only predict some specific therapeutic peptide functions and treat this task as a binary classification problem. In contrast, TPpredLE is the only method that comprehensively predicts 15 different therapeutic peptide functions. The performance of TPpredLE is measured by binary classification metrics and compared with the state-of-the-art binary classification methods for therapeutic peptide prediction, including PEPredSuite [9], PPTPP [11], and TPpredATMV [10]. The results are listed in Table 2, from which we can see that TPpredLE achieves the best performance. The binary classification methods suffer from a high false-positive rate (see Additional file 3: Table S1); that is, they tend to predict negatives as positives. Unlike these methods, TPpredLE is trained simultaneously on 15 different therapeutic peptide functions and explicitly explores the correlation information among them to learn more discriminative features. As a result, TPpredLE is clearly better than the other existing predictors, especially for predicting multifunctional therapeutic peptides with imbalanced training data. The performance on the other functions in terms of the RkCC metric is available in Additional file 3: Table S2.
The compared methods are based on conventional machine learning and rely on hand-crafted features that integrate different properties. To remove the impact of the input features, we further compare TPpredLE with one-versus-all RFs trained on the same one-hot and PSSM-encoded sequences of the training set as TPpredLE. We trained an RF model for each function with a one-versus-all strategy, and constructed the input for the RFs in two ways: concatenating or averaging all the input residue vectors. The results are shown in Table 3, from which we can see that the one-versus-all RFs fail to effectively predict the therapeutic peptide functions, demonstrating the necessity of complex deep networks.
Performance comparison between TPpredLE and other multilabel classification methods
To further evaluate the performance of TPpredLE, we compare TPpredLE with MLBP [12] and PrMFTP [13], which are multilabel classification models for multifunctional peptide identification. To fairly and comprehensively evaluate the performance of TPpredLE and the other methods, we retrain the other methods and evaluate them on \({\mathbb{S}}_{benchmark}\) (cf. Equation 1). The results are shown in Fig. 3A, from which we can see that TPpredLE outperforms the other methods in all metrics. Figure 3B shows that MLBP and PrMFTP achieve lower performance on medium-shot functions, and MLBP even fails to predict the few-shot functions. In contrast, TPpredLE achieves stable performance in all groups. Therefore, TPpredLE is a useful tool for multilabel functional therapeutic peptide prediction.
TPpredLE can predict new therapeutic peptide functions with limited labeled training data
Previous therapeutic peptide prediction studies assume that all peptide sequences are comprehensively labeled. However, this assumption hardly holds in reality [22, 23]. With the development of therapeutic peptide function analysis methods, more and more potential functions of therapeutic peptides will be discovered in the future, which means that the currently known training data may have only a limited set of functions annotated. Therefore, it is essential to predict newly detected therapeutic peptide functions with limited labeled training data. Here, limited labeled data means that, for a given function, part of the positive samples are mislabeled as negative samples. For example, a sequence with both AMP and ACP functions that is only marked as AMP is called mislabeled. To simulate this real-world application, we construct a series of training and validation datasets by randomly removing labels, with the weak label ratio (WL ratio) [22] varying from 50 to 90% at intervals of 10%. The detailed construction steps are described in Additional file 2: Supplementary Material S2 [22].
The performance of TPpredLE* (TPpredLE with MCRT), TPpredLE, MLBP, and PrMFTP on these datasets with various WL ratios is shown in Fig. 4, from which we can see the following: (i) both TPpredLE* and TPpredLE consistently outperform MLBP and PrMFTP on all the datasets; (ii) TPpredLE* achieves the best performance by using MCRT, indicating that the square root resampling plays a key role in reducing the likelihood of selecting mislabeled samples. Therefore, the square root resampling strategy can improve the robustness of the model when mislabeled samples exist. These results demonstrate that TPpredLE is a useful method for analyzing newly detected therapeutic peptide functions with limited labeled data.
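The label-removal simulation can be sketched as follows. This is only an illustration: the authors' actual construction steps are given in Supplementary Material S2 and are not reproduced here. The sketch assumes that the WL ratio is the fraction of positive labels flipped to negative, chosen uniformly at random over the whole label matrix; the function name `drop_labels` and the toy matrix are hypothetical.

```python
import random
from typing import List


def drop_labels(label_matrix: List[List[int]], wl_ratio: float, seed: int = 0) -> List[List[int]]:
    """Return a copy of a 0/1 label matrix with a fraction of positives set to 0.

    Assumption: wl_ratio is the fraction of positive labels that are removed.
    """
    rng = random.Random(seed)
    # Collect the (sample, function) coordinates of every positive label.
    positives = [
        (i, j)
        for i, row in enumerate(label_matrix)
        for j, value in enumerate(row)
        if value == 1
    ]
    n_drop = int(round(wl_ratio * len(positives)))
    dropped = set(rng.sample(positives, n_drop))
    # Removed positives become (mislabeled) negatives; everything else is kept.
    return [
        [0 if (i, j) in dropped else value for j, value in enumerate(row)]
        for i, row in enumerate(label_matrix)
    ]


# Toy label matrix with 10 positive labels (rows: peptides, columns: functions).
toy_labels = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1], [1, 0, 0]]
weak_labels = drop_labels(toy_labels, wl_ratio=0.5)
```

With this reading, a WL ratio of 90% leaves only one tenth of the original positive labels in the training and validation sets.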
Visualization of the attentions
To investigate the role of the three types of attention in the model, we visualize the learned attention weights in the last layer of the Transformer encoder and decoder. We visualize the overall received attention weights for all functions and residues. The results are illustrated in Fig. 5. The weight distribution in Fig. 5A closely resembles the distribution in Fig. 3B, showing that functions with more samples tend to have better prediction performance and are therefore likely to receive more attention. Figure 5B shows the overall function-residue attention. We can see that different functions are likely to have distinct preferences for the residues in the prediction process.
Furthermore, we focus on a single peptide sequence to visualize the three types of attention. We take the peptide “GVAKFGKAAAHFGKGWIKEMLNS” as an example, which has the functions of AMP, TXP, and ABP. The weights of the three types of attention are shown in Fig. 6. From Fig. 6A and B, we can see that different residues and functions are likely to pay attention to different regions (residues) by using the sequential information. The function-function attention shown in Fig. 6C illustrates the prediction process for its functions of AMP, TXP, and ABP. When predicting AMP, TPpredLE pays more attention to ABP, ACP, APP, and so on. In other words, TPpredLE utilizes the information from other functions to predict the AMP function for this sequence. The other functions are predicted in the same way. Therefore, TPpredLE can leverage the relationship information among functions and residues to enhance its ability to predict multifunctional therapeutic peptides.
Discussion
The aforementioned results reveal limitations in the predictive capabilities of current methods for therapeutic peptide function prediction. On the one hand, the binary classification techniques focus on specific peptide functions while overlooking the relationship information among different peptide functions. These methods frequently yield a high false-positive rate, resulting in lower precision. On the other hand, the existing multilabel classification-based methods still fail to explicitly employ the relationship information, which leads to unsatisfactory accuracy, particularly when dealing with a limited number of training samples.
TPpredLE is an innovative approach designed for the prediction of multifunctional therapeutic peptides, which effectively incorporates the relationship information between different peptide functions. This method utilizes the encoder and decoder to learn the correlation information among residues and functions to improve the prediction ability, as shown in the ablation experiment. Furthermore, TPpredLE benefits from the integration of the attention mechanism, which allows straightforward visualization of the three types of attention weights. These three types of attention improve the performance of TPpredLE. Finally, we introduced the label missing problem in the therapeutic peptide function prediction field and proposed the MCRT algorithm to solve it. The study of limited labeled training data is a promising route to predicting functions more comprehensively. TPpredLE still has some limitations. For example, its reliance on deep neural networks demands a substantial volume of training samples to effectively learn patterns. In the future, we plan to incorporate pretrained models to improve the performance of therapeutic peptide prediction.
Conclusions
In this paper, we propose a novel method called TPpredLE for therapeutic peptide function prediction. Compared with the other existing computational methods, TPpredLE has the following advantages: (i) it accurately and comprehensively predicts the 15 different therapeutic peptide functions; (ii) it incorporates label embedding and function-specific classifiers to measure the correlation relationship and the specificity relationship among peptide functions, respectively; (iii) it is able to stably detect newly detected therapeutic peptide functions with limited labeled data by introducing the MCRT algorithm; and (iv) its web server only requires the peptide sequences in FASTA format as inputs.
Methods
Benchmark dataset
In this study, we constructed a comprehensive benchmark dataset with 15 different therapeutic peptide functions, including AMP, TXP, ABP, AIP, AVP, ACP, AFP, DDV, CPP, CCC, APP, AAP, AHTP, PBP, and QSP. They were derived from SATPdb [4], PEPredSuite [9], DRAMP 2.0 [24], the review by Basith et al. [25], and AntiCP 2.0 [26]. The details are listed in Additional file 3: Table S3 [4, 9, 24,25,26]. The benchmark dataset can be represented as:
where \({\mathbb{S}}_{\mathrm{AMP}},{\mathbb{S}}_{\mathrm{TXP}},\dots ,{\mathbb{S}}_{\mathrm{QSP}}\) are the subsets containing the specific therapeutic peptide functions. Sequences sharing similarity higher than 90% in each subset were removed [27,28,29,30] by using CD-HIT [31]. Finally, the benchmark dataset contains 10,237 unique sequences with one or more functions. The statistical information of the benchmark dataset is shown in Fig. 7. The detailed distribution of different multifunctions and their relationship is shown in Additional file 4: Fig. S1.
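Equation 1, which defines \({\mathbb{S}}_{benchmark}\), can be sketched as follows, assuming the benchmark is simply the union of the 15 function-specific subsets described above:

```latex
\[
{\mathbb{S}}_{benchmark} = {\mathbb{S}}_{\mathrm{AMP}} \cup {\mathbb{S}}_{\mathrm{TXP}} \cup \dots \cup {\mathbb{S}}_{\mathrm{QSP}} \tag{1}
\]
```

Because a peptide can carry several functions, the same sequence may belong to more than one subset while being counted once in the union.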
As illustrated in Fig. 7, the number of samples with different functions is clearly imbalanced, following a long-tail distribution [23]. In order to better examine the performance variations across functions with different numbers of samples, we divide the 15 functions into 3 groups according to their number of samples [15, 32, 33]: the many-shot group (more than 1000 samples), the medium-shot group (200 to 1000 samples), and the few-shot group (fewer than 200 samples). To train and evaluate models, we randomly split \({\mathbb{S}}_{benchmark}\) into a training dataset, a validation dataset, and an independent test dataset with a ratio of roughly 8:1:1. The homology similarity between the training dataset and both the independent test dataset and the validation dataset is less than 90% for each function.
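The grouping rule can be sketched as a small helper. The thresholds come from the text above; the function name `shot_group` and the example counts are hypothetical and are not the benchmark statistics.

```python
def shot_group(n_samples: int) -> str:
    """Assign a function to a group according to its number of samples."""
    if n_samples > 1000:
        return "many-shot"
    if n_samples >= 200:
        return "medium-shot"
    return "few-shot"


# Hypothetical counts, for illustration only (not the benchmark statistics).
example_counts = {"FUNC_A": 4200, "FUNC_B": 650, "FUNC_C": 90}
groups = {name: shot_group(n) for name, n in example_counts.items()}
```

Treating exactly 200 and exactly 1000 samples as medium-shot is one reading of the "200 to 1000" range; the text does not state how the boundaries are handled.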
Sequence embedding and label embedding
The embedding modules in TPpredLE learn the discriminative representations of sequences and the therapeutic peptide functions.
Firstly, the input sequences and all the functions are embedded as numerical vectors. One-hot encoding [34] and the position-specific scoring matrix (PSSM) [35] are adopted to encode the peptide sequences. One-hot encoding represents the amino acid at each position as a binary vector of dimension 20, capturing the composition information of the sequence. PSSM captures the evolutionary information of the sequence and also encodes each amino acid into a vector of dimension 20. We generate the PSSMs from multiple sequence alignments (MSAs) by using PSI-BLAST [35] (‘-num_iterations 3 -evalue 0.01’) to search against the NR database [36]. Finally, the feature vector of each sequence is obtained by concatenating the two features. The functions are represented with one-hot encoding, so each peptide function class is represented as a vector of dimension 15.
For a given sequence, the length of the input sequence is L, which is fixed as 50 in this study. If the length of the sequence is less than \(50\), we pad it with zeros at the end of the sequence, while if the length of the sequence exceeds \(50,\) two subsequences of length 25 from its N-terminal and C-terminal are extracted and concatenated [37]. We have also tested another truncation strategy, which only extracts the subsequence from the beginning of the sequence (N-terminal), as done in [5] and in most natural language processing (NLP) tasks [38]. The results listed in Additional file 3: Table S4 show that the two truncation strategies are comparable to each other. Since the majority of sequences in the benchmark dataset are shorter than 50 residues (see Additional file 4: Fig. S2), truncation only needs to be applied to a small number of sequences. Therefore, the choice of truncation strategy has minimal impact on this study, and we chose the N-terminal and C-terminal strategy. Moreover, as most of the sequences in our benchmark dataset have at least 10 amino acids after homology reduction, predictions for sequences shorter than 10 residues are likely to be biased. We therefore limited the minimum input length to 10 in our webserver.
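The length handling and the one-hot half of the encoding can be sketched as below. This is a minimal illustration rather than the authors' implementation: the residue ordering in `AMINO_ACIDS`, the function names, and the choice to map non-standard residues to all-zero rows are assumptions, and the PSSM half of the 40-dimensional feature is omitted because it requires a PSI-BLAST search.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def fix_length(seq: str, max_len: int = 50) -> str:
    """Keep short sequences as they are; for long ones, join the first and last halves."""
    if len(seq) <= max_len:
        return seq
    half = max_len // 2
    # N-terminal half followed by C-terminal half.
    return seq[:half] + seq[-half:]


def one_hot(seq: str, max_len: int = 50) -> List[List[int]]:
    """Encode a peptide as a max_len x 20 binary matrix, zero-padded at the end."""
    seq = fix_length(seq, max_len)
    rows = []
    for pos in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if pos < len(seq):
            idx = AMINO_ACIDS.find(seq[pos])
            if idx >= 0:  # assumption: unknown residues stay as zero rows
                row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot("GVAKFGKAAAHFGKGWIKEMLNS")
```

Concatenating each row with the corresponding 20-dimensional PSSM row would give the \(L\times 40\) input described in the text.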
An encoded sequence is represented as \({\mathbf{X}}^{s}={\left\{{x}_{i}^{s}\right\}}_{i=1}^{L}\in {\mathbb{R}}^{L\times 40}\), and the encoded function set is defined as \({\mathbf{X}}^{t}={\left\{{x}_{i}^{t}\right\}}_{i=1}^{C}\in {\mathbb{R}}^{C\times C}\), where \(C\) is the number of all therapeutic peptide functions. In this study, C is set as 15.
We adopt the Transformer [39] to learn the representation of sequences and functions. The self-attention mechanism [39] in the Transformer allows the model to focus on the prediction-related regions. The attentions in the Transformer can be divided into three types according to their different roles: (i) residue-residue attention, (ii) function-function attention, and (iii) function-residue attention, as shown in Fig. 8. The residue-residue attention has been used in other studies to learn the representation of protein sequences [40, 41]. The correlation relationship among different therapeutic peptide functions is ignored by the existing methods. Therefore, we explore the correlation relationship among therapeutic peptide functions based on the label embedding methodology [42,43,44,45] through the Transformer decoder. There are two attentions in the label embedding module: function-function attention and function-residue attention. The function-function attention allows each function to update its representation according to the information from the other functions, while the function-residue attention integrates the information between residues and functions. The mathematical description of all the attentions in the Transformer can be represented as [39]:
where \({d}_{model}\) represents the hidden dimension of the model. \(\mathbf{Q}\), K, and V are the query, key, and value matrices, respectively.
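Assuming the standard scaled dot-product formulation of [39], with the scaling factor expressed through \({d}_{model}\) as defined above, the attention operation takes the form below. The operator name \(\mathrm{Attention}\) is notation introduced here for illustration.

```latex
\[
\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V}) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{{d}_{model}}}\right)\mathbf{V}
\]
```

For residue-residue and function-function attention, \(\mathbf{Q}\), \(\mathbf{K}\), and \(\mathbf{V}\) all derive from the same input; for function-residue attention, the queries would come from the functions and the keys and values from the residues.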
The multi-head attention mechanism adopted in [39] allows the model to attend to information from different perspectives [37, 39, 41]:
where \(\mathbf{X}\) represents the input of the encoder or decoder. \({\mathbf{W}}_{i}^{Q}, {\mathbf{W}}_{i}^{K},{\mathbf{W}}_{i}^{V}\in {\mathbb{R}}^{{d}_{model}\times {d}_{model}}\) are the projection matrices of the query, key, and value, respectively, and \(h\) represents the number of attention heads. \({\mathbf{W}}^{O}\in {\mathbb{R}}^{h{d}_{model}\times {d}_{model}}\) transforms the concatenated vectors into the feature space with the dimension of \({d}_{model}\).
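With the symbols just defined, and assuming the usual multi-head construction of [39] applied to a single input \(\mathbf{X}\), the mechanism can be written as follows. The names \(\mathrm{MultiHead}\), \(\mathrm{Concat}\), and \({\mathrm{head}}_{i}\) are illustrative notation.

```latex
\[
\mathrm{MultiHead}(\mathbf{X}) = \mathrm{Concat}({\mathrm{head}}_{1},\dots,{\mathrm{head}}_{h})\,{\mathbf{W}}^{O},
\qquad
{\mathrm{head}}_{i} = \mathrm{Attention}\big(\mathbf{X}{\mathbf{W}}_{i}^{Q},\ \mathbf{X}{\mathbf{W}}_{i}^{K},\ \mathbf{X}{\mathbf{W}}_{i}^{V}\big)
\]
```

The dimensions are consistent with the definitions above: each head outputs vectors of size \({d}_{model}\), so concatenating \(h\) heads gives size \(h{d}_{model}\), which \({\mathbf{W}}^{O}\) maps back to \({d}_{model}\).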
The encoder takes \({\mathbf{X}}^{s}\) as input, and the decoder takes \({\mathbf{X}}^{t}\) as input. The function representation \(\mathbf{Z}={\left\{{z}_{i}\right\}}_{i=1}^{C}\in {\mathbb{R}}^{C\times {d}_{model}}\) is learned by Transformer [39]:
where \({f}_{enc}(\cdot )\) and \({f}_{dec}(\cdot )\) are linear projection layers converting the low-dimensional input vectors into the feature space with the high dimension of \({d}_{model}\). \(Transformer(\cdot )\) represents the complete Transformer neural network as shown in Fig. 2. Please refer to [39] for more details of the Transformer.
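One form consistent with these definitions is sketched below. It is an assumption rather than the published equation: in particular, adding the positional encoding \(\mathbf{PE}\) after the encoder-side projection is a guess based on \(\mathbf{PE}\) having dimension \({d}_{model}\).

```latex
\[
\mathbf{Z} = Transformer\big({f}_{enc}({\mathbf{X}}^{s}) + \mathbf{PE},\ {f}_{dec}({\mathbf{X}}^{t})\big)
\]
```

Here the first argument feeds the encoder and the second feeds the decoder, so \(\mathbf{Z}\) has one row per function.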
The positional encodings (\(\mathbf{P}\mathbf{E})\) are added into the input sequence embedding to preserve the residue order information [39]:
where \(pos\) indicates the position of the amino acid in the sequence (\(0\le pos\le L-1\)) and \(0\le i<{d}_{model}^{\mathrm{^{\prime}}}/2\). In this study, \({d}_{model}^{\mathrm{^{\prime}}}\) is equal to \({d}_{model}\).
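Assuming the sinusoidal positional encoding of [39], which matches the index ranges given above, the encoding can be written as:

```latex
\[
\mathbf{PE}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/{d}_{model}^{\prime}}}\right),
\qquad
\mathbf{PE}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/{d}_{model}^{\prime}}}\right)
\]
```

Even dimensions use the sine and odd dimensions use the cosine, so each position receives a distinct vector of size \({d}_{model}^{\prime}\).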
Functionspecific classifiers
For each sequence, the output of the embedding modules is a function representation matrix \(\mathbf{Z}\). To transform the high dimensional representation \(\mathbf{Z}\) into the output space, a common approach is to simply add a single linear layer:
where \({w}_{single}\in {\mathbb{R}}^{{d}_{model}}\) and \({b}_{single}\in {\mathbb{R}}\) are shared by all functions, and \(\widehat{y}\in {\mathbb{R}}^{C}\) contains the predicted probabilities for all therapeutic peptide functions.
However, this approach fails to capture the specificity of different therapeutic peptide functions (see Fig. 9A). Therefore, for each therapeutic peptide function, we design an independent classifier that learns its own decision boundary according to the distinct feature distribution of that function (see Fig. 9B). In addition, each classifier can be adjusted independently without interfering with the classifiers for the other functions. This allows us to train all classifiers in a multilabel classification manner while still adjusting each classifier in a binary classification manner, demonstrating the scalability of the design. The prediction process of TPpredLE based on function-specific classifiers can be represented as:
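A form of the shared linear layer consistent with these definitions is given below. The sigmoid \(\sigma\) is an assumption, chosen because the outputs are described as probabilities.

```latex
\[
{\widehat{y}}_{i} = \sigma\big({w}_{single}^{\top}{z}_{i} + {b}_{single}\big),\qquad i = 1,\dots,C
\]
```

The same weight vector and bias are applied to every function representation \({z}_{i}\), so all functions share one decision boundary in the representation space.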
where \({w}_{i}\in {\mathbb{R}}^{{d}_{model}}\) and \({b}_{i}\in {\mathbb{R}}\).
Finally, we obtain the predicted functions for each peptide with the threshold of 0.5.
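The function-specific prediction step can be sketched as below, assuming each classifier computes \({\widehat{y}}_{i}=\sigma ({w}_{i}^{\top }{z}_{i}+{b}_{i})\) with a sigmoid \(\sigma\); the sigmoid, the function names, the toy weights, and the use of "greater than or equal to" at the threshold are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predict_functions(
    Z: Sequence[Sequence[float]],   # one representation vector z_i per function
    W: Sequence[Sequence[float]],   # one weight vector w_i per function
    b: Sequence[float],             # one bias b_i per function
    names: Sequence[str],
    threshold: float = 0.5,
) -> Tuple[List[float], List[str]]:
    """Apply an independent linear classifier to each function representation."""
    probs = [
        sigmoid(sum(wk * zk for wk, zk in zip(w_i, z_i)) + b_i)
        for z_i, w_i, b_i in zip(Z, W, b)
    ]
    predicted = [name for name, p in zip(names, probs) if p >= threshold]
    return probs, predicted


# Toy example with two functions and 2-dimensional representations.
probs, predicted = predict_functions(
    Z=[[1.0, 0.0], [0.0, 1.0]],
    W=[[2.0, 0.0], [0.0, -2.0]],
    b=[0.0, 0.0],
    names=["AMP", "TXP"],
)
```

Because every function has its own \(({w}_{i},{b}_{i})\) pair, one classifier can be retrained or re-thresholded without changing the outputs of the others.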
Multilabel classifier retraining (MCRT)
In order to predict new therapeutic peptide functions with limited labeled data, we propose the multilabel classifier retraining (MCRT) strategy.
Classifier retraining (cRT) has been confirmed to be an effective approach for long-tailed multiclass classification [15]. It learns the representation using the original imbalanced data and then employs resampled, balanced training data to retrain the classifier with the representation module kept fixed. In this study, we extend the cRT approach to the multilabel classification task so as to enhance the ability of TPpredLE to detect new functions with limited labeled data.
Benefiting from the scalability of the function-specific classifiers, we treat the model as \(C\) binary classifiers and retrain each classifier separately. For each classifier, we resample the training dataset to obtain the corresponding balanced training dataset with \(N\) samples based on the bootstrap strategy [46]. The square root sampling strategy [47, 48] is used in this study. The sampling probability \({p}_{cj}\) is defined as [15]:
where \(c\in \{AMP, TXP,\dots ,QSP\}\) represents a specific function, \(j\in \left\{positive, negative\right\}, {n}_{cj}\) is the number of positive or negative training samples of a specific class, and \(N\) is the number of all training samples.
MCRT retrains each classifier with the resampled training dataset for the corresponding function. When retraining the classifier \(c\), we feed the corresponding sampled training dataset and freeze the embedding modules and the classifiers for the other functions. As a result, the predictions for the other functions are not affected.
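The resampling step for one function can be sketched as follows, assuming the cRT-style square root rule \({p}_{cj}=\sqrt{{n}_{cj}}/(\sqrt{{n}_{c,positive}}+\sqrt{{n}_{c,negative}})\) followed by uniform sampling with replacement inside the chosen class. The function names are hypothetical, and the sketch requires at least one positive and one negative sample.

```python
import math
import random
from typing import List, Sequence, Tuple


def sqrt_sampling_probs(n_pos: int, n_neg: int) -> Tuple[float, float]:
    """Class-level sampling probabilities under square root sampling."""
    sp, sn = math.sqrt(n_pos), math.sqrt(n_neg)
    return sp / (sp + sn), sn / (sp + sn)


def resample_for_function(labels: Sequence[int], n_draws: int, seed: int = 0) -> List[int]:
    """Draw sample indices for one function: pick a class, then a sample in it."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    p_pos, _ = sqrt_sampling_probs(len(pos), len(neg))
    drawn = []
    for _ in range(n_draws):
        pool = pos if rng.random() < p_pos else neg
        drawn.append(rng.choice(pool))  # bootstrap: sampling with replacement
    return drawn


# With 100 positives and 900 negatives, positives are drawn with probability
# sqrt(100) / (sqrt(100) + sqrt(900)) = 10 / 40 instead of 100 / 1000.
p_pos, p_neg = sqrt_sampling_probs(100, 900)
indices = resample_for_function([1] * 100 + [0] * 900, n_draws=1000)
```

Square root sampling therefore sits between instance-balanced sampling (proportional to class size) and fully class-balanced sampling (equal probability per class).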
The model implementation
In TPpredLE, each function is projected into a distinct output space due to the independence of the function-specific classifiers, which adversely affects the label embedding process. Therefore, we utilize two training steps to train TPpredLE. In the first training step, the single classifier is used to learn the embedding in the same output space so as to extract the correlation information among labels. In the second training step, we replace the single classifier with the function-specific classifiers and train the model with the label embedding module kept fixed, so that each classifier obtains a distinct decision boundary according to its specificity information. The detailed training process is shown in Algorithm 1, and the training process of TPpredLE based on MCRT is shown in Algorithm 2. \({E}_{seq}, {E}_{func}, {F}_{single},{F}_{specific}\) represent the learnable parameters in the sequence embedding module, the label embedding module, the single classifier, and the specific classifiers, respectively. Binary cross entropy loss is used to measure the gap between the ground truth labels and the prediction [49]:
where \({y}_{ij}\in {\mathbb{R}}\) is the ground truth label, and \({\widehat{y}}_{ij}\in {\mathbb{R}}\) is the prediction probability corresponding to function \(j\) for the sample \(i\). The AdamW [50] algorithm is used to optimize the trainable parameters. Each training step runs for 30 epochs. The hyperparameters are determined by grid search according to the minimum validation loss in each training setting. The detailed hyperparameters and their optimal values for TPpredLE are listed in Additional file 3: Table S5. In this work, each experiment is run 5 times with different random seeds, and the average results are reported to ensure reliability.
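With \({y}_{ij}\) and \({\widehat{y}}_{ij}\) as defined above, the binary cross entropy loss can be written in its standard form as follows. Averaging over both the \(N\) samples and the \(C\) functions is an assumption; the published normalization may differ by a constant factor.

```latex
\[
\mathcal{L} = -\frac{1}{NC}\sum_{i=1}^{N}\sum_{j=1}^{C}\Big[{y}_{ij}\log {\widehat{y}}_{ij} + (1-{y}_{ij})\log\big(1-{\widehat{y}}_{ij}\big)\Big]
\]
```

Each function contributes an independent binary term, which is what allows a single classifier to be retrained in isolation during MCRT.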
Evaluation metrics
For multilabel classification, the evaluation metrics are generally categorized into two groups [16]: example-based metrics and label-based metrics. Example-based metrics are averaged over all samples. Label-based metrics treat each function as equally important and average over all functions. Previous works [12, 13, 51] only report the example-based metrics and ignore the label-based metrics. As a result, the prediction ability for the functions with fewer samples, such as the functions in the few-shot group shown in Fig. 6, cannot be clearly illustrated. Therefore, we comprehensively evaluate our method by using the two types of metrics:
where \(AC{C}_{example}\) is used as the example-based metric following [12, 13, 52], \({L}_{i}\) is the ground truth label set, and \({\widehat{L}}_{i}\) is the predicted label set. When calculating label-based metrics, we split the multilabel classification task into multiple binary classification tasks and average them to obtain the final metrics. \(F{1}_{label}\) (macro-F1) is used as the label-based metric. We also utilize binary classification metrics to evaluate the binary prediction performance, including AUC [17], MCC [18], F1 [19], and RkCC [20].
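The two multilabel metrics can be sketched as below, assuming the common definitions \(AC{C}_{example}=\frac{1}{N}\sum_{i}|{L}_{i}\cap {\widehat{L}}_{i}|/|{L}_{i}\cup {\widehat{L}}_{i}|\) and \(F{1}_{label}=\frac{1}{C}\sum_{j}2T{P}_{j}/(2T{P}_{j}+F{P}_{j}+F{N}_{j})\). The function names and the conventions for empty sets are assumptions.

```python
from typing import List, Sequence, Set


def acc_example(true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]]) -> float:
    """Example-based accuracy: mean overlap ratio between true and predicted label sets."""
    scores = []
    for t, p in zip(true_sets, pred_sets):
        union = t | p
        # Assumption: two empty label sets count as a perfect match.
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def f1_label(true_sets: Sequence[Set[str]], pred_sets: Sequence[Set[str]], labels: List[str]) -> float:
    """Label-based (macro) F1: per-function F1 averaged with equal weight."""
    f1s = []
    for lab in labels:
        tp = sum(1 for t, p in zip(true_sets, pred_sets) if lab in t and lab in p)
        fp = sum(1 for t, p in zip(true_sets, pred_sets) if lab not in t and lab in p)
        fn = sum(1 for t, p in zip(true_sets, pred_sets) if lab in t and lab not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a function with no true or predicted positives scores 0.
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


true_sets = [{"AMP", "ABP"}, {"ACP"}]
pred_sets = [{"AMP"}, {"ACP", "AMP"}]
acc = acc_example(true_sets, pred_sets)
macro_f1 = f1_label(true_sets, pred_sets, ["AMP", "ABP", "ACP"])
```

In this toy case the missed ABP label lowers the macro-F1 by a full third of its range, which illustrates why label-based metrics expose failures on functions with few samples.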
Availability of data and materials
The TPpredLE webserver is accessible at http://bliulab.net/TPpredLE/ [53].
The data and source code used in this study are available at http://bliulab.net/TPpredLE/data/ [53] and https://github.com/HongWuL/TPpredLE [54], respectively. The source code reaches a bronze standard of reproducibility.
Abbreviations
RF: Random forest
AMP: Antimicrobial peptide
TXP: Toxic peptide
ABP: Antibacterial peptide
AIP: Anti-inflammatory peptide
AVP: Antiviral peptide
ACP: Anticancer peptide
AFP: Antifungal peptide
DDV: Drug delivery vehicle peptide
CPP: Cell-penetrating peptide
CCC: Cell-cell communication peptide
APP: Antiparasitic peptide
AAP: Antiangiogenic peptide
AHTP: Antihypertensive peptide
PBP: Polystyrene surface-binding peptide
QSP: Quorum sensing peptide
\(AC{C}_{example}\): Example-based accuracy
\(F{1}_{label}\): Label-based F1-score
AUC: The area under the ROC curve
MCC: Matthews correlation coefficient
RkCC: K-category correlation coefficient
cRT: Classifier retraining approach
MCRT: Multilabel classifier retraining approach
References
Fosgerau K, Hoffmann T. Peptide therapeutics: current status and future directions. Drug Discovery Today. 2015;20(1):122–8.
Lau JL, Dunn MK. Therapeutic peptides: historical perspectives, current development trends, and future directions. Bioorg Med Chem. 2018;26(10):2700–7.
Yan K, Lv H, Guo Y, Peng W, Liu B. sAMPpredGAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure. Bioinformatics. 2022;39(1):btac715.
Singh S, Chaudhary K, Dhanda SK, Bhalla S, Usmani SS, Gautam A, Tuknait A, Agrawal P, Mathur D, Raghava GP. SATPdb: a database of structurally annotated therapeutic peptides. 2016. https://doi.org/10.1093/nar/gkv1114.
Yan K, Guo Y, Liu B. PreTP2L: identification of therapeutic peptides and their types using two-layer ensemble learning framework. Bioinformatics. 2023;39(4):btad125.
Shah JN, Guo GQ, Krishnan A, Ramesh M, Katari NK, Shahbaaz M, Abdellattif MH, Singh SK, Dua K. Peptides-based therapeutics: emerging potential therapeutic agents for COVID-19. Therapie. 2022;77(3):319–28.
Heitmann JS, Bilich T, Tandler C, Nelde A, Maringer Y, Marconato M, Reusch J, Jäger S, Denk M, Richter M, et al. A COVID-19 peptide vaccine for the induction of SARS-CoV-2 T cell immunity. Nature. 2021;601(7894):617–22.
Abdelmageed MI, Abdelmoneim AH, Mustafa MI, Elfadol NM, Murshed NS, Shantier SW, Makhawi AM. Design of a multiepitope-based peptide vaccine against the E protein of human COVID-19: an immunoinformatics approach. Biomed Res Int. 2020;2020:2683286.
Wei L, Zhou C, Su R, Zou Q. PEPredSuite: improved and robust prediction of therapeutic peptides using adaptive feature representation learning. Bioinformatics. 2019;35(21):4272–80.
Yan K, Lv H, Guo Y, Chen Y, Wu H, Liu B. TPpredATMV: therapeutic peptides prediction by adaptive multiview tensor learning model. Bioinformatics. 2022;38(10):2712–8.
Zhang YP, Zou Q. PPTPP: a novel therapeutic peptide prediction method using physicochemical property encoding and adaptive feature representation learning. Bioinformatics. 2020;36(13):3982–7.
Tang W, Dai R, Yan W, Zhang W, Bin Y, Xia E, Xia J. Identifying multifunctional bioactive peptide functions using multilabel deep learning. Brief Bioinform. 2022;23(1):bbab414.
Yan W, Tang W, Wang L, Bin Y, Xia J. PrMFTP: multifunctional therapeutic peptides prediction based on multi-head self-attention mechanism and class weight optimization. PLoS Comput Biol. 2022;18(9):e1010511.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. In: Advances in neural information processing systems. 2017. p. 5998–6008.
Kang B, Xie S, Rohrbach M, Yan Z, Gordo A, Feng J, Kalantidis Y. Decoupling representation and classifier for long-tailed recognition. In: Proc Int Conf Learn Representations. 2020. https://doi.org/10.48550/arXiv.1910.09217.
Zhang ML, Zhou ZH. A review on multilabel learning algorithms. IEEE Trans Knowl Data Eng. 2014;26(8):1819–37.
Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997;30(7):1145–59.
Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):1–13.
Powers DM. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv preprint arXiv:2010.16061. 2020. https://doi.org/10.48550/arXiv.2010.16061.
Gorodkin J. Comparing two K-category assignments by a K-category correlation coefficient. Comput Biol Chem. 2004;28(5–6):367–74.
Lee Rodgers J, Nicewander WA. Thirteen ways to look at the correlation coefficient. Am Stat. 1988;42(1):59–66.
Sun YY, Zhang Y, Zhou ZH. Multilabel learning with weak label. In: Twenty-fourth AAAI conference on artificial intelligence. 2010.
Liu W, Wang H, Shen X, Tsang IW. The emerging trends of multilabel learning. IEEE Trans Pattern Anal Mach Intell. 2021;44(11):7955–74.
Kang X, Dong F, Shi C, Liu S, Sun J, Chen J, Li H, Xu H, Lao X, Zheng H. DRAMP 2.0, an updated data repository of antimicrobial peptides. 2019. https://doi.org/10.1038/s41597-019-0154-y.
Basith S, Manavalan B, Hwan Shin T, Lee G. Machine intelligence in peptide therapeutics: a next-generation tool for rapid disease screening. Med Res Rev. 2020;40(4):1276–314.
Agrawal P, Bhagat D, Mahalwal M, Sharma N, Raghava GP. AntiCP 2.0: an updated model for predicting anticancer peptides. Brief Bioinform. 2021;22(3):bbaa153.
Khosravian M, Kazemi Faramarzi F, Mohammad Beigi M, Behbahani M, Mohabatkar H. Predicting antibacterial peptides by the concept of Chou’s pseudo-amino acid composition and machine learning methods. Protein Pept Lett. 2013;20(2):180–6.
Burdukiewicz M, Sidorczuk K, Rafacz D, Pietluch F, Chilimoniuk J, Rodiger S, Gagat P. Proteomic screening for prediction and design of antimicrobial peptides with AmpGram. Int J Mol Sci. 2020;21(12):4310.
Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics. 2018;34(16):2740–7.
Kavousi K, Bagheri M, Behrouzi S, Vafadar S, Atanaki FF, Lotfabadi BT, Ariaeenejad S, Shockravi A, Moosavi-Movahedi AA. IAMPE: NMR-assisted computational prediction of antimicrobial peptides. J Chem Inf Model. 2020;60(10):4691–701.
Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–2.
Yang Y, Wang H, Katabi D. On Multi-Domain Long-Tailed Recognition, Generalization and Beyond. arXiv preprint arXiv:2203.09513. 2022. https://doi.org/10.48550/arXiv.2203.09513.
Liu Z, Miao Z, Zhan X, Wang J, Gong B, Yu SX. Large-scale long-tailed recognition in an open world. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. p. 2537–46.
Hancock JT, Khoshgoftaar TM. Survey on categorical data for neural networks. J Big Data. 2020;7(1):28.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, Rajput B, Robbertse B, Smith-White B, Ako-Adjei D. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2016;44(D1):D733–45.
Wang D, Zhang Z, Jiang Y, Mao Z, Wang D, Lin H, Xu D. DM3Loc: multilabel mRNA subcellular localization prediction and analysis based on multi-head self-attention mechanism. Nucleic Acids Res. 2021;49(8):e46.
Kowsari K, Jafari Meimandi K, Heidarysafa M, Mendu S, Barnes L, Brown D. Text classification algorithms: a survey. Information. 2019;10(4):150.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in neural information processing systems. 2017. p. 5998–6008.
Pang Y, Liu B. SelfATFold: protein fold recognition based on residue-based and motif-based self-attention networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2020.
He W, Wang Y, Cui L, Su R, Wei L. Learning embedding features based on multisense-scaled attention architecture to improve the predictive performance of anticancer peptides. Bioinformatics. 2021;37(24):4684–93.
Wang G, Li C, Wang W, Zhang Y, Shen D, Zhang X, Henao R, Carin L. Joint embedding of words and labels for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018. p. 2321–31.
Xiong Y, Feng Y, Wu H, Kamigaito H, Okumura M. Fusing label embedding into BERT: an efficient improvement for text classification. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. 2021. p. 1743–50.
Chen ZM, Wei XS, Wang P, Guo Y. Multilabel image recognition with graph convolutional networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019. p. 5177–86.
You R, Guo Z, Cui L, Long X, Bao Y, Wen S. Cross-modality attention with semantic graph embedding for multilabel classification. In: Proceedings of the AAAI conference on artificial intelligence. 2020. p. 12709–16.
Efron B. Bootstrap methods: another look at the jackknife. In: Breakthroughs in statistics. New York: Springer; 1992. p. 569–93.
Evensen G. Sampling strategies and square root analysis schemes for the EnKF. Ocean Dyn. 2004;54(6):539–60.
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. 2013.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.
Loshchilov I, Hutter F. Decoupled weight decay regularization. In Proc Int Conf Learn Representations. 2019. https://doi.org/10.48550/arXiv.1711.05101.
Wang Y, Zhai Y, Ding Y, Zou Q. SBSMPro: support biosequence machine for proteins. arXiv preprint arXiv:2308.10275. 2023. https://doi.org/10.48550/arXiv.2308.10275.
Lin W, Xu D. Imbalanced multilabel learning for identifying antimicrobial peptides and their functional types. Bioinformatics. 2016;32(24):3745–52.
Lv H, Yan K, Liu B. Webserver of TPpredLE. http://bliulab.net/TPpredLE. Accessed 9 Oct 2023.
Lv H, Yan K, Liu B. Source codes of TPpredLE. https://github.com/HongWuL/TPpredLE. Accessed 9 Oct 2023.
Acknowledgements
We are very indebted to the three anonymous reviewers, whose constructive comments were very helpful for strengthening the paper.
Funding
This work was supported by the National Natural Science Foundation of China (No. 62325202, 62271049, U22A2039, 62102030, and 62372267).
Author information
Contributions
H.W.L was involved in the implementation, programming, manuscript writing, and correcting. K.Y was involved in the manuscript writing and correcting. B.L conceived the project and was involved in the manuscript writing and correcting. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
Supplementary Material S1. The calculation of the Pearson’s correlation coefficient.
Additional file 2:
Supplementary Material S2. The construction of the limited labelled datasets.
Additional file 3:
Table S1. The precision scores of various methods for predicting eight therapeutic peptide functions on the independent dataset. Table S2. The performance of TPpredLE for predicting 15 therapeutic peptide functions on the independent dataset. Table S3. The statistical information of the 15 therapeutic peptide functions. Table S4. The performance comparison of two strategies for truncating the sequences with length exceeding 50. Table S5. The search space for hyperparameters and their optimal values used in TPpredLE.
Additional file 4:
Fig S1. The distribution of different multifunctions and their relationship. Fig S2. The length distribution of the benchmark dataset.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Lv, H., Yan, K. & Liu, B. TPpredLE: therapeutic peptide function prediction based on label embedding. BMC Biol 21, 238 (2023). https://doi.org/10.1186/s12915-023-01740-w
DOI: https://doi.org/10.1186/s12915-023-01740-w