The BERT-based model predicts degrons of new sequence patterns
To train and evaluate models, we collected known degrons from the ELM database [16] and three previous studies [1, 7, 20] (Fig. 1a). When the same degron was present on different isoforms of one gene, only the main isoform in UniProt [23] was retained. In total, 303 degrons, typically spanning 5–10 AAs, were obtained (Additional file 1: Fig. S1a, Additional file 2: Table S1).
Previous predictors predict degrons by integrating protein features like flanking phosphorylation sites, intrinsically disordered regions, MoRFs, solvent accessibility, and flanking ubiquitinated Ks [20, 24]. However, these models cannot be applied to proteins without PTM data, and annotating proteins with these features is time-consuming. Thus, we turned to using BERT-based deep learning models, which have been shown to successfully represent fundamental and advanced properties of proteins, including secondary structure, target binding sites, contact, and PTMs [25, 26]. We built a BERT-based model to predict degrons that consists of a pre-trained TAPE BERT-encoding model [27], two bidirectional long short-term memory layers, and two fully connected layers. The architecture took a protein sequence as input and outputted scores for all AAs on the protein (Additional file 1: Fig. S1b, see Methods for details).
To explore the feasibility of using the BERT-based model to predict degrons, we compared our model with Motif_RF and a general MoRF predictor, MoRFchibi [8], in classifying degrons from motif matches. In the training and test stages, our collected degrons were labeled as 1, while randomly selected motif matches from proteins without known degrons were labeled as 0. We averaged the predicted scores of the AAs in each degron or negative motif match to obtain its BERT-based model score. The BERT-based model achieved performance comparable to that of Motif_RF under fivefold cross-validation, and both methods significantly outperformed MoRFchibi (Additional file 1: Fig. S1c, see Methods for detail). This result indicated that the BERT-based model provides an alternative to feature-integrating predictors. The advantage of the BERT-based model is that it needs only the protein sequence as input and therefore has a broader scope of application.
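The segment-level score described above (the mean of the per-residue scores within a degron or motif match) can be illustrated with a minimal sketch. The function name and the 0-based, half-open coordinate convention are assumptions made for illustration and are not taken from the Degpred code.

```python
from statistics import mean

def segment_score(residue_scores, start, end):
    """Mean per-residue score over the 0-based, half-open interval [start, end).

    `residue_scores` holds one model score per amino acid of the protein.
    The coordinate convention is an assumption of this sketch.
    """
    if not 0 <= start < end <= len(residue_scores):
        raise ValueError("segment lies outside the protein")
    return mean(residue_scores[start:end])

# Toy per-residue scores for a six-residue protein; the segment covers residues 2-4.
toy_scores = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1]
print(segment_score(toy_scores, 2, 5))
```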
As degrons derived from motifs only represent a small proportion of all degrons bound by more than 600 E3s, we next used our model to score all AAs in sequences rather than only AAs in motif matches of a limited number of motifs. To provide more inputs for the training of the deep learning model, we augmented known degrons by sampling peptides from original proteins. We randomly sampled ten peptides of 128 AAs containing the degron from the original protein for each degron, and generated 3030 128AA-peptides containing known degrons in total (Fig. 1b). As transplantation of a degron confers instability on other proteins [10], we reasoned that degrons on 128AA-peptides can mediate the degradation of 128AA-peptides as well. Thus, we used these 128AA-peptides to train our model. In the training and test stages, AAs in known degrons were labeled as 1, while AAs in the other regions were labeled as 0 (see Methods for detail). We first trained a model (model 0) on 128AA-peptides from 240 randomly selected degrons and tested it on the other 63 degrons. As shown in Fig. 1c, model 0 attained an AUC of 0.8807. This result suggested that we can use the BERT-based model to predict degrons from protein sequences rather than only from motif matches.
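The augmentation step can be sketched as follows, assuming that each sampled window must contain the whole degron and that proteins shorter than 128 AAs are used at full length; neither detail is specified in this section, and the function name and coordinates are hypothetical.

```python
import random

def sample_windows(seq, d_start, d_end, width=128, n=10, seed=0):
    """Sample n windows of `width` residues that fully contain the degron.

    The degron occupies the 0-based, half-open interval [d_start, d_end).
    Returns (peptide, labels) pairs; labels are 1 inside the degron, else 0.
    Assumption: proteins shorter than `width` are used at full length.
    """
    rng = random.Random(seed)
    width = min(width, len(seq))
    lo = max(0, d_end - width)            # leftmost start that still covers the degron end
    hi = min(d_start, len(seq) - width)   # rightmost start that still covers the degron start
    if lo > hi:
        raise ValueError("degron does not fit into one window")
    samples = []
    for _ in range(n):
        w = rng.randint(lo, hi)           # inclusive on both ends
        labels = [1 if d_start <= w + i < d_end else 0 for i in range(width)]
        samples.append((seq[w:w + width], labels))
    return samples
```

With ten windows per degron, 303 degrons yield the 3030 peptides reported above.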
Next, we explored whether the BERT-based model can predict degrons bound by E3s not present in our dataset. If a model trained on known degrons bound by one set of E3s can predict those bound by other E3s, we can infer that our model can discover degrons of new classes. Ideally, degrons used for training and testing should be dissimilar in sequence. As it is hard to measure the similarity of degrons bound by different E3s, we grouped the 303 known degrons into five clusters using sequence alignment [28]. As shown in Fig. 1d, clusters 1, 2, 4, and 5 each possessed a dominant class accounting for about 50 percent, while cluster 3 lacked a dominant class and acted as a trash bin during clustering. Next, we built five models, trained each model on degrons from four clusters, and tested each model on degrons of the remaining cluster (Additional file 1: Fig. S1d,e,f). As shown in Fig. 1c, models 1, 2, and 5 performed well in predicting degrons dissimilar to the training degrons. Given the diversity of degrons in cluster 3, the performance of model 3 was also satisfactory. The dominant class in cluster 4 is the phosphorylation-dependent degron class SCF_TRCP1 [29] (Fig. 1d); as the training degrons of model 4 are mostly modification-independent, model 4 might ignore PTM-related information in the BERT-encoding matrix and therefore performed relatively poorly in predicting phospho-degrons. Overall, these results suggested that even though degrons in different clusters have little sequence homology, they share features beyond the primary sequence that can be captured by the BERT-based model.
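The leave-one-cluster-out design used for models 1–5 can be sketched as below. The mapping from degron identifiers to cluster labels is a hypothetical input; the clustering itself (performed in the paper with the tool cited as [28]) is not reproduced here.

```python
def leave_one_cluster_out(cluster_of):
    """Yield (held_out_cluster, train_ids, test_ids) for every cluster.

    `cluster_of` maps a degron identifier to its cluster label. Each split
    trains on all clusters but one and tests on the held-out cluster.
    """
    for held_out in sorted(set(cluster_of.values())):
        train = [d for d, c in cluster_of.items() if c != held_out]
        test = [d for d, c in cluster_of.items() if c == held_out]
        yield held_out, train, test

# Toy example with three clusters (identifiers are made up).
toy = {"d1": 1, "d2": 1, "d3": 2, "d4": 3}
for held_out, train, test in leave_one_cluster_out(toy):
    print(held_out, train, test)
```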
To evaluate the importance of the information in the BERT-encoding matrix for predicting degrons, we compared the BERT-based model with a new predictor that had a similar architecture and number of trainable parameters but took one-hot encoding as input (Additional file 1: Fig. S1g). We trained and tested the one-hot model using the same strategy as for the BERT-based model and found that the BERT-based model significantly outperformed the one-hot model in predicting degrons in the five clusters (Additional file 1: Fig. S1d,e,h). This result suggested that the rich information encoded in the TAPE BERT-encoding matrix helps our model discover novel degrons dissimilar to the training degrons.
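For reference, a one-hot encoding of a protein sequence can be written as in the sketch below. The 20-letter alphabet order and the all-zero row for non-standard residues are assumptions of this illustration, not details taken from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # assumed alphabet order
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Return an L x 20 matrix (list of lists) with a single 1 per residue.

    Non-standard residues (e.g. X) are encoded as all-zero rows, which is
    a choice made for this sketch.
    """
    rows = []
    for aa in sequence:
        row = [0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            row[AA_INDEX[aa]] = 1
        rows.append(row)
    return rows
```

Unlike the BERT-encoding matrix, each row here carries no information about the sequence context of the residue.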
In summary, these findings suggested that the BERT-based model can be used as an alternative to feature-integrating degron predictors and has a wider scope of application. In addition, our model can predict degrons of new sequence patterns with satisfactory performance; thus, it can be used to discover new degrons proteome-wide.
Degpred expands the degron landscape and assists in identifying degrons from motif matches
Models 1–5 trained on degrons with different sequence patterns represent different aspects of degron properties. Thus, we assembled models 1–5 to build Degpred to take full advantage of known degrons and provide more comprehensive predictions (Fig. 2a). Degpred averages outputs from five models to score all AAs of the input protein. Taking 0.3 as the cut-off, Degpred attained a false discovery rate (FDR) of 0.512 (Additional file 1: Fig. S2a) and predicted 46,621 degrons present in the human proteome (UniProt [23] human reviewed proteins) (Additional file 2: Table S1).
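The ensemble step can be sketched as follows. Averaging the per-residue outputs and the 0.3 cut-off come from the text; calling a degron as a maximal run of consecutive residues at or above the cut-off, and the optional minimum length, are assumptions of this sketch.

```python
def ensemble_scores(per_model_scores):
    """Average per-residue scores across models (one score list per model)."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]

def call_segments(scores, cutoff=0.3, min_len=1):
    """Return 0-based, half-open (start, end) runs with score >= cutoff.

    How predicted degrons are delimited is an assumption of this sketch.
    """
    segments, start = [], None
    # A sentinel below any cut-off closes a run that reaches the C-terminus.
    for i, s in enumerate(list(scores) + [float("-inf")]):
        if s >= cutoff and start is None:
            start = i
        elif s < cutoff and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    return segments

# Two toy models scoring a four-residue protein.
avg = ensemble_scores([[0.1, 0.5, 0.6, 0.1], [0.1, 0.3, 0.4, 0.1]])
print(call_segments(avg))
```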
To provide an overview of degrons predicted by Degpred, we first compared Degpred degrons with about 55,000 ELM motif matches in the human proteome and found that only 5522 Degpred degrons overlap with ELM motif matches (Fig. 2b). We further analyzed the averaged Degpred scores of degrons that match ELM motifs and of degrons that do not. As shown in Fig. 2b, more than 41% of the non-overlapping degrons possess Degpred scores higher than the median score of the overlapping degrons. Even though most training degrons were initially identified through ELM motifs, over 88% of Degpred degrons were beyond those discovered using motifs. These results suggested that Degpred expands the degron landscape. Next, we investigated the relationship between terminus-located Degpred degrons and the N-end and C-end destabilizing peptides in high-throughput GPS experiments [10, 17], which also constitute the training set of deepDegron [11]. Unexpectedly, we found that both Degpred degrons and known degrons tend to act as stabilizing peptides in GPS experiments (Additional file 1: Fig. S2b). This discordance might arise because destabilizing peptides in the GPS experiments are a mixture of multiple functional peptides not limited to degrons [7, 18, 19]. Further investigations are needed to explore the underlying mechanism of destabilizing peptides in the high-throughput experiments.
Another major disadvantage of motif matching is its high false-positive rate, which results from considering only local sequence patterns. We investigated whether Degpred can screen real degrons from motif matches by testing Degpred on the motif matches of the extensively studied E3 βTrCP. The degron of βTrCP requires a specific sequence pattern and di-phosphorylation to be recognized [29, 30] (Fig. 2c). The motif of βTrCP matches 1068 segments on 953 proteins in the human proteome, and 306 matches on 298 proteins overlap with Degpred degrons (Fig. 2d). Because real degrons bound by βTrCP possess two phosphorylation sites and are rich in Ub-sites located within 20 AAs [7], we compared the likelihood that motif matches with and without Degpred signal function as degrons by first surveying phosphorylation sites in the PhosphoSitePlus database [31] and Ub-sites in the dbPTM database [32]. As shown in Fig. 2e, a higher proportion of Degpred-screened matches than of the other matches were phosphorylated, for both single phosphorylation and di-phosphorylation. Moreover, we found that Ub-sites were significantly enriched within 20 AAs of Degpred-screened matches compared to the other matches (Fig. 2f). Next, we analyzed potential βTrCP substrates identified by proximity-dependent biotin labeling (BioID) [30] and affinity purification mass spectrometry (AP-MS) [29]. As shown in Fig. 2g, proteins with Degpred-screened matches were identified at higher rates in both experiments compared to proteins with the other matches. These results suggested that Degpred helps identify real degrons from motif matches.
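The survey of Ub-sites near motif matches can be illustrated with the sketch below. It treats "within 20 AAs" as the 20 residues on either side of the match and excludes the match itself, which is an assumption; the function name and 0-based coordinates are likewise hypothetical.

```python
def flanking_sites(segment, sites, flank=20):
    """Return modification sites lying within `flank` residues of a segment.

    `segment` is a 0-based, half-open (start, end) interval and `sites`
    holds 0-based residue positions. Positions inside the segment are
    excluded (an assumption of this sketch).
    """
    start, end = segment
    return [p for p in sites
            if start - flank <= p < start or end <= p < end + flank]

# Toy example: a six-residue match with five annotated Ub-sites on the protein.
print(flanking_sites((100, 106), [75, 85, 102, 110, 130]))
```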
Overall, our deep learning degron predictor Degpred identifies novel degrons with new sequence patterns and helps reduce the false-positive rate of motif matches.
Degpred degrons exhibit typical degron properties and are rich in ubiquitination sites nearby
To explore the properties of predicted degrons, we first analyzed the AA composition of Degpred degrons and known degrons. As shown in Fig. 3a, the AA composition of Degpred degrons resembles that of known degrons. Proline (P), glutamic acid (E), serine (S), and threonine (T), which were reported to be enriched in degradation signals [33], were all enriched in Degpred degrons; S, T, and tyrosine (Y), which can be phosphorylated, were enriched in Degpred degrons as well. Further analysis showed that not only phosphorylation sites but also N-linked glycosylation and methylation sites were enriched in Degpred degrons (Additional file 1: Fig. S2c). These results indicated that Degpred successfully learns the correct AA preference of known degrons, and suggested that some PTMs might act as degron regulators and cross-talk with ubiquitination.
Furthermore, we compared the properties of predicted degrons of Degpred, ELM motif matching and Motif_RF [20]. As Motif_RF utilized 11 features including intrinsically disordered regions, MoRFs, solvent accessibility, and flanking ubiquitinated Ks to predict possible degrons, we first compared these sequence properties of predicted degrons of three methods. As expected, Motif_RF predicted degrons scored higher in the predictions of intrinsically disordered regions, MoRFs, and solvent accessibility [8, 34, 35] than ELM motif matches and random peptides from the human proteome (Fig. 3b–d). Surprisingly, Degpred degrons also scored higher in these predictions (Fig. 3b–d), which indicates that Degpred captures correct sequence features of degrons. Next, we surveyed Ks and Ub-sites [32] around predicted degrons of three methods. As shown in Fig. 3e, Ks were enriched around Degpred degrons, which provides a suitable environment for E3s to ubiquitinate substrates after binding to degrons. In addition, we found that both Ub-sites and ubiquitinated Ks were enriched around Degpred degrons as well (Fig. 3f, g). In comparison, Ks, Ub-sites, and ubiquitinated Ks were randomly distributed around ELM and Motif_RF predicted degrons. These results indicated that Degpred degrons might mediate ubiquitination of flanking Ks.
In summary, Degpred degrons exhibit typical degron properties and might promote ubiquitination of nearby Ks, supporting the assumption that Degpred degrons constitute the binding sites of E3s.
Predicting binding E3s of degrons using calculated motifs
After predicting degrons, we set out to predict the regulatory E3s for Degpred degrons. The most straightforward method is to match degrons with E3 motifs as used in motif-based methods, but only a small number of experimentally identified E3 motifs were available. Here, we computationally generated E3 motifs using Degpred degrons on substrates in our collected E3-substrate interactions (ESIs) dataset (Fig. 4a, Additional file 1: Fig. S3a, Additional file 3: Table S2, see Methods for detail). We chose 55 E3s with at least ten substrates in the ESI dataset and calculated their motifs respectively. For each E3, we used GibbsCluster [28] to align Degpred degrons on its substrates and drop dissimilar outliers, which might be the binding sites of other E3s. Subsequently, we generated motifs from the aligned Degpred degrons for each E3 (Fig. 4a, see Methods for detail). As shown in Fig. 4b, the calculated motifs for βTrCP, SPOP, and FZR1 resemble their experimentally identified motifs [16]. In addition, we generated motifs for four HECT E3s (WWP1, WWP2, SMURF2, NEDD4L) which recognize proline-rich motifs through the WW domain [36, 37]. Four generated HECT E3 motifs were rich in proline (Additional file 1: Fig. S3b). These results indicated that our procedure to generate motifs is reliable.
To evaluate the ability of our generated motifs to predict ESIs, we defined a score to measure the binding possibility of an E3 and a substrate: we scored all Degpred degrons of the substrate with the E3 motif and selected the maximal motif matching score to represent the binding possibility. As shown in Fig. 4c, our collected ESIs possessed significantly higher scores than random pairs. In addition, the manually collected ESIs of Ubibrowser2.0 [21] that are not in our dataset also had higher scores. This finding indicated that our generated motifs could discover new ESIs. Furthermore, we compared generated motifs and ChenESI [22] on the manually collected ESIs of Ubibrowser2.0. We found that generated motifs and ChenESI predicted a similar number of substrates for SPOP and FZR1 (Additional file 1: Fig. S3d). Next, we compared generated motifs, Ubibrowser2.0, and ChenESI on ubiquitylome and proteome data measured after SPOP overexpression [38]. We found that SPOP substrates predicted by the generated motif showed increased ubiquitination levels and reduced protein levels after SPOP overexpression (Fig. 4d). In contrast, the substrates predicted by ChenESI and Ubibrowser2.0 showed no significant change. Thus, these results suggested that our generated motifs can be used to predict ESIs. More importantly, our generated motifs provide information on binding degrons, which is absent in Ubibrowser2.0 and ChenESI.
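The scoring scheme (score every Degpred degron of a substrate against the E3 motif and keep the maximum) can be illustrated with a simple position-specific scoring matrix. The log-odds form, uniform background, pseudocount, and sliding-window matching below are assumptions of this sketch; the scoring actually used is described in Methods.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(aligned, pseudocount=1.0):
    """Log2-odds matrix from equal-length aligned peptides (uniform background)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for i in range(len(aligned[0])):
        counts = Counter(peptide[i] for peptide in aligned)
        pssm.append({aa: math.log2((counts[aa] + pseudocount) / total / background)
                     for aa in AMINO_ACIDS})
    return pssm

def best_match(pssm, segment):
    """Best score of the motif slid along one degron (-inf if it is too short)."""
    w = len(pssm)
    return max((sum(col.get(aa, 0.0) for col, aa in zip(pssm, segment[i:i + w]))
                for i in range(len(segment) - w + 1)),
               default=float("-inf"))

def binding_score(pssm, degrons):
    """Maximal motif-matching score over all degrons of a substrate."""
    return max((best_match(pssm, d) for d in degrons), default=float("-inf"))

# Toy aligned peptides (made up) and two hypothetical substrates.
motif = build_pssm(["DSGIHS", "DSGLDS", "DSGYMS"])
print(binding_score(motif, ["AAAAAA", "DSGAAS"]) > binding_score(motif, ["AAAAAA"]))
```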
Finally, we set out to construct a protein degradation regulatory network using Degpred degrons and generated motifs. We calculated cut-offs for the motifs (Fig. 4a) and used the cut-offs to estimate whether an E3 will bind a predicted degron (see Methods for detail). To assess the ability of the 55 generated motifs to discover real ESIs, we predicted our collected ESIs using the 55 motifs. We found that 71% (39/55) of the motifs can predict at least 40% of collected substrates (Additional file 1: Fig. S3c, Additional file 4: Table S3). We selected these 39 motifs to construct a protein degradation regulatory network, which consists of 25,695 ESIs between 39 E3s and 8754 substrates (Additional file 1: Fig. S3e, Additional file 4: Table S3).
In summary, we generated E3 motifs using Degpred and our collected ESI dataset. These motifs expanded known E3 motifs in the ELM database and enabled us to predict new ESIs with binding site information.
E3-degron interactions affect half-lives of substrates
To evaluate the impact of Degpred degrons on the turnover of proteins, we analyzed the half-lives of proteins in non-dividing B cells, natural killer cells, monocytes, and hepatocytes [39]. As shown in Fig. 5a, proteins with dense degrons tend to have shorter half-lives, and this effect was more significant for proteins with at least five degrons per 1000 AAs. As degrons are more frequent in disordered regions and the disorder fraction is positively correlated with degradation rates [19], we analyzed proteins with disorder fractions of 0–10%, 10–30%, and 30–100% separately, and found that proteins with dense degrons have shorter half-lives in all three groups (Additional file 1: Fig. S4). This finding suggested that proteins with more degrons are under stricter regulation by the UPS and are thus degraded faster. To investigate whether different E3s tend to regulate substrates with different half-lives, we compared the half-lives of the predicted substrates of different E3s. As shown in Fig. 5b, predicted substrates of TRIM63, βTrCP, NEDD4L, and HUWE1 tend to have shorter half-lives, while predicted substrates of TRIM32, FBXL15, PJA1, and FBXL7 tend to have longer half-lives.
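Degron density, expressed as degrons per 1000 AAs, is a simple normalization by protein length; a minimal sketch follows. The function name and the example protein are hypothetical.

```python
def degron_density(n_degrons, protein_length):
    """Number of predicted degrons per 1000 amino acids."""
    if protein_length <= 0:
        raise ValueError("protein length must be positive")
    return 1000.0 * n_degrons / protein_length

# A hypothetical 400-AA protein with three predicted degrons.
density = degron_density(3, 400)
print(density, density >= 5)   # threshold of five degrons per 1000 AAs from the text
```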
Then, to further verify that predicted degrons promote protein degradation and mediate E3 binding, we conducted experiments on Chromobox protein homolog 6 (CBX6). CBX6 possesses three Degpred degrons, and segment 269-273 (DARSS) was predicted to be bound by SPOP; CBX6 contains no ELM SPOP motif match. As S is enriched in our generated SPOP motif and is reported to be important for binding to SPOP [13], we mutated DARSS to DARAA. Mutating only two AAs also minimizes the impact on protein folding and stability. We transfected wild-type and mutated CBX6 plasmids into HEK293T cells separately and cultured the cells for 36 h to compare the expression of the transgenes. As shown in Fig. 5c, wild-type CBX6 was expressed at a much lower level than the mutant, which indicated that mutated CBX6 is more stable in cells. Subsequently, we added cycloheximide to inhibit protein synthesis and found that mutated CBX6 was degraded more slowly than wild-type CBX6 (Fig. 5c). Next, to test whether DARSS on CBX6 interacts with SPOP, we transfected SPOP and wild-type or mutated CBX6 plasmids into HEK293T cells and conducted co-immunoprecipitation experiments. As shown in Fig. 5d, CBX6 and SPOP co-immunoprecipitated, and mutating CBX6 weakened the interaction with SPOP. These findings indicated that DARSS on CBX6 represents a binding degron of SPOP.
Together, these results demonstrated that E3-degron interactions are principally linked to the control of protein half-lives and that different E3s regulate substrates with different degradation rates, which implies that E3s might differ in degradation ability.
Degron-related mutations on short-lived proteins might drive cancer
Defects in degrons and E3s have been implicated in nearly all hallmarks of cancer [11, 20]. Previous studies found that highly mutated driver regions in cancer contain many known degrons [40], and degron-affecting mutations are positively selected in tumorigenesis [20]. By comparing Degpred degrons with these results, we found that Degpred degrons are enriched in the highly mutated driver regions (Additional file 1: Fig. S5a), including well-known degrons on TP53, MYC, CTNNB1, NFE2L2, and other newly predicted degrons (Additional file 6: Table S5). Besides, motif matches that overlapped with Degpred degrons are under more stringent selection in tumorigenesis than the other motif matches (Additional file 1: Fig. S5b). However, previous studies were limited by using a biased degron set and failed to link degrons to E3s. Here, we investigated alterations of the expanded degron landscape in human cancers and explored the importance of binding E3s in tumorigenesis. We analyzed mutations in 33 cancer types of The Cancer Genome Atlas (TCGA) [41, 42] and cancer driver mutations predicted by CATA-population, CATA-cancer, and Structural clustering [41].
By comparing the percentage of AAs with mutations in TCGA in degron-related regions (inside degrons and the flanking 10 AAs) and in other regions, we found that AAs in degron-related regions are more susceptible to mutations in cancer than AAs in other regions (Fig. 6a). In addition, we found that a higher percentage of recurrent mutations (≥ two tumor samples) than of mutations occurring only once fall in degron-related regions (Fig. 6b). These findings suggested that degron-related mutations are common in human cancer. Then, we investigated degron-related mutations in specific cancer types and proteins. As shown in Fig. 6c, pheochromocytoma and paraganglioma (PCPG) and skin cutaneous melanoma (SKCM) have more mutations in degrons, while brain lower-grade glioma (LGG) contains more mutations near degrons. We next identified hundreds of proteins whose mutations were enriched in degrons in specific cancer types (Fig. 6d, Additional file 6: Table S5). In addition to well-known degron-mutation-enriched proteins such as CTNNB1, NFE2L2, and EPAS1 [11, 20], we also identified several proteins rich in degron mutations that have not been reported before, such as RXRA in bladder urothelial carcinoma (BLCA), CRNKL1 in SKCM, VPS13D in head and neck squamous cell carcinoma (HNSC), and CIC in LGG. Overall, with the expanded degron landscape, we can explore degron-related mutations in cancer more comprehensively.
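The partition of residues into degron-related and other regions can be sketched as below, taking the degron-related region to be the degron plus 10 flanking AAs on each side as stated above. The 0-based, half-open coordinates and the function name are assumptions of this illustration.

```python
def classify_position(pos, degrons, flank=10):
    """Classify a mutated residue as 'inside', 'flanking', or 'other'.

    `degrons` is a list of 0-based, half-open (start, end) intervals on one
    protein; 'inside' and 'flanking' together form the degron-related region.
    """
    if any(start <= pos < end for start, end in degrons):
        return "inside"
    if any(start - flank <= pos < end + flank for start, end in degrons):
        return "flanking"
    return "other"

# Toy protein with one degron at residues 50-56.
for position in (52, 45, 66, 67):
    print(position, classify_position(position, [(50, 57)]))
```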
Degron-related mutations might interfere with protein degradation and result in abnormal accumulation of oncoproteins, thus ultimately driving tumorigenesis. We explored whether degron-related mutations tend to act as cancer drivers. Specifically, we focused on recurrent mutations (≥ two tumor samples), which are more pathologically significant and tend to occur in degron-related regions (Fig. 6b). Using three different predictors, we found that degron-related mutations are more likely to function as cancer drivers (Fig. 6e). As degron-enriched proteins tend to be short-lived proteins (Fig. 5a) that regulate metabolism, cell proliferation, and differentiation (Additional file 1: Fig. S5c, Additional file 7: Table S6, [43]), we reasoned that degron-related mutations on short-lived proteins might be more pathogenic. To test this hypothesis, we analyzed 1017 short-lived proteins identified by quantitative proteomics in U2OS, HCT116, HEK293T, and RPE1 cell lines [44]. The percentages of driver mutations are significantly higher in short-lived proteins than in the other proteins (Fig. 6f), which stressed that short-lived proteins are important in tumorigenesis. Surprisingly, we found that degron-related mutations on short-lived proteins are more likely to function as cancer drivers than other mutations. In contrast, there was no significant difference between these mutations on the other proteins. Further, we used another half-life dataset obtained in four non-dividing cell types [39] and compared proteins with the top 1000 shortest half-lives in at least one experiment with the other proteins. We found that degron-related mutations on short-lived proteins in the four non-dividing cell types also tend to drive cancer (Additional file 1: Fig. S5d). These results indicated that interfering with the degradation of short-lived proteins is more pathogenic in human cancer, which provides a new perspective for interpreting cancer driver mutations.
Then, we studied E3s in tumorigenesis by analyzing their predicted substrates and binding degrons. We found that approximately two mutations occur in one degron-related region, and the average numbers of mutations in degron-related regions bound by different E3s are comparable (Additional file 1: Fig. S5e). In addition, we found that mutations in degron-related regions bound by SPOP and RFWD2 are more likely to function as cancer drivers (Additional file 1: Fig. S5f), consistent with previous findings that SPOP and RFWD2 regulate the degradation of critical oncogenes [38, 45]. Finally, we analyzed the functions of short-lived substrates of each E3 and identified some well-known functions of these E3s (Fig. 6g), such as CHFR in chromatin remodeling and histone modifications [46], SPOP in histone H3K36 trimethylation and alternative splicing [47], BIRC3 in regulating the caspase and apoptosis pathways [48], and HUWE1 in chromatin modification [49]. Together, these results suggested that E3s regulate different pathways by controlling their substrates, and mutations on degrons bound by different E3s might exert different effects in tumorigenesis.
Finally, we highlighted 19,021 degron mutations that alter the charge, hydrophobicity, phosphorylation sites, MoRF regions, or predicted protein-binding residues [50] of degrons, and 1524 mutations of flanking lysines (Additional file 1: Fig. S5g, Additional file 6: Table S5). These mutations change the properties of degrons and might hinder their function, and thus constitute novel potential cancer drivers.
The web application
A freely available and fully functional website (http://degron.phasep.pro/) has been developed to provide access to the collected and predicted data. Users can search all human proteins on the website by gene name or UniProt ID. The detail page for each protein (Fig. 7) includes four sections: (1) basic information about the protein, haploinsufficiency, short half-life, oncogene, and tumor suppressor gene annotations, known degrons and E3s; (2) Degpred degrons and ELM motif matches of the protein; (3) an interactive and scalable interface [51] showing the regions of domains, intrinsic disorder score, and Degpred score along the sequence; and (4) a sequence viewer displaying AAs of regions of interest on the protein sequence. The introduction and summary of the website are provided on the “About” page; all data on the website can be freely downloaded from the “Download” page.