DisoFLAG: accurate prediction of protein intrinsic disorder and its functions using graph-based interaction protein language model

Intrinsically disordered proteins and regions (IDPs/IDRs) are functionally important proteins and regions that exist as highly dynamic conformations under natural physiological conditions. IDPs/IDRs exhibit a broad range of molecular functions, which involve binding interactions with partners and retaining native structural flexibility. The rapid growth in the number of proteins in sequence databases and the diversity of disordered functions challenge existing computational methods for predicting protein intrinsic disorder and disordered functions. A disordered region can interact with different partners to perform multiple functions, and these disordered functions exhibit different dependencies and correlations. In this study, we introduce DisoFLAG, a computational method that leverages a graph-based interaction protein language model (GiPLM) to jointly predict disorder and its multiple potential functions. GiPLM integrates protein semantic information from pre-trained protein language models into graph-based interaction units to strengthen the correlations among the semantic representations of multiple disordered functions. The DisoFLAG predictor takes amino acid sequences as its only input and predicts intrinsic disorder and six disordered functions: protein-binding, DNA-binding, RNA-binding, ion-binding, lipid-binding, and flexible linker. We evaluated the predictive performance of DisoFLAG following the Critical Assessment of protein Intrinsic Disorder (CAID) experiments, and the results demonstrated that DisoFLAG offers accurate and comprehensive predictions of disordered functions, extending the current coverage of computationally predicted disordered function categories. The standalone package and web server of DisoFLAG provide accurate prediction tools for intrinsic disorder and its associated functions.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12915-023-01803-y.

Description | Shape
Attention score between i-th and j-th residues | 1
Attention weight between i-th and j-th residues | 1
Contextual vectors calculated for i-th residue | 1×2048
The output hidden representations for i-th residue | 1×2048
Feature mapping layer, n = [1, 2, …, 6]
The weight variables of the fully connected layer | 2048×1024
The bias variables of the fully connected layer | 1×1024
The output functional semantic representations for i-th residue | 1×1024
Graph-based interaction unit
Weighted adjacency matrix of the edges | 6×6
X: functional semantic representation vectors of the nodes | 6×1024
GCN layer, n = [1, 2, …, 6]
The IG value between i-th and j-th functions
The semantic representations of i-th function | 6×1024
The weight variables of the GCN layer | 1024×128
The bias variables of the GCN layer
The aggregated semantic features for n-th function of i-th residue | 1×128
Max pooling
Kernel size: the shape of kernel in max pooling layer | 6×1
Stride: the shape of stride in max pooling layer | 6×1
The weight variables of the fully connected layer | 128×1
The bias variables of the fully connected layer | 1×1
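To make the shapes listed above concrete, the following is a minimal sketch of how one residue's six functional representations (6×1024) could pass through a graph convolution (1024→128), a 6×1 max pooling, and a 128→1 fully connected layer. It is not the authors' implementation. The ReLU activation, the row-normalised adjacency matrix, the single shared GCN weight matrix, the sigmoid output, and all random values are assumptions made only for illustration; the function names are hypothetical.

```python
import math
import random

N_FUNC, D_IN, D_OUT = 6, 1024, 128  # node count and GCN sizes from the listing above


def matmul(a, b):
    """Plain-Python product of an (m x k) and a (k x n) nested list."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def gcn_layer(adj, x, w, bias):
    """One graph-convolution step, ReLU(adj @ x @ w + bias); ReLU is assumed."""
    xw = matmul(x, w)      # 6 x 128
    agg = matmul(adj, xw)  # 6 x 128: each function mixes in its neighbours
    return [[max(0.0, v + b) for v, b in zip(row, bias)] for row in agg]


def max_pool_over_functions(h):
    """Max pooling with a 6 x 1 kernel: one maximum per feature column."""
    return [max(col) for col in zip(*h)]


def fc_sigmoid(v, w, bias):
    """128 -> 1 fully connected layer; the sigmoid output is an assumption."""
    z = sum(x * y for x, y in zip(v, w)) + bias
    return 1.0 / (1.0 + math.exp(-z))


rng = random.Random(0)
# Hypothetical inputs for ONE residue: random stand-ins, not trained values.
x = [[rng.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_FUNC)]
raw = [[rng.random() for _ in range(N_FUNC)] for _ in range(N_FUNC)]
adj = [[v / sum(row) for v in row] for row in raw]  # row-normalised (assumed)
w_gcn = [[rng.gauss(0, 0.02) for _ in range(D_OUT)] for _ in range(D_IN)]
b_gcn = [0.0] * D_OUT
w_out = [rng.gauss(0, 0.1) for _ in range(D_OUT)]

h = gcn_layer(adj, x, w_gcn, b_gcn)     # 6 x 128 aggregated features
pooled = max_pool_over_functions(h)     # 128 values
score = fc_sigmoid(pooled, w_out, 0.0)  # scalar in (0, 1)
print(len(h), len(h[0]), len(pooled))
```

In the real model the adjacency weights would come from the IG values between functions rather than from random numbers; the sketch only traces how the tensor shapes change from one stage to the next.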

Table S3.
The number of trainable variables and hyper-parameters of DisoFLAG.

Table S4.
The definition of evaluation metrics.

Table S5.
The performance ranking of DisoFLAG using different features.

Table S6.
The statistical significance of differences (p-value) in predictive performance by different methods on the DP93 test dataset. The p-values are calculated by resampling half of the test dataset 20 times and using the two-sided paired t-test for each pair of the disordered function predictors. The upper triangle compares AUC values, and the lower triangle compares Fmax values. P-values > 0.05 are highlighted in bold.
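The resampling procedure described in this caption can be sketched as follows. This is an illustrative reading, not the authors' script. It assumes that sampling is done at the protein level without replacement, that AUC is computed over the pooled residues of each half-sample, and that the two predictors are scored on the same 20 half-samples so the t-test is paired. The toy data and all function names are hypothetical. The p-value comes from a simple numerical integration of the Student-t density so that the block needs only the standard library; in practice `scipy.stats.ttest_rel` is the usual choice.

```python
import math
import random
from statistics import mean, stdev


def auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def two_sided_t_pvalue(t, df):
    """Two-sided Student-t p-value via Simpson integration (moderate |t| only)."""
    if t == 0:
        return 1.0
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    steps = 2 * max(500, math.ceil(abs(t) * 50))  # even count, step <= 0.01
    h = abs(t) / steps
    total = pdf(0.0) + pdf(abs(t))
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3)  # 1 - P(-|t| < T < |t|)


def paired_t_test(a, b):
    """Paired t-test on matched samples; returns (t statistic, two-sided p)."""
    d = [x - y for x, y in zip(a, b)]
    sd = stdev(d)
    if sd == 0.0:  # degenerate case: identical difference in every resample
        return (0.0, 1.0) if mean(d) == 0 else (math.copysign(math.inf, mean(d)), 0.0)
    t = mean(d) / (sd / math.sqrt(len(d)))
    return t, two_sided_t_pvalue(t, len(d) - 1)


def resampled_aucs(proteins, n_rounds=20, seed=0):
    """proteins: list of (labels, scores_a, scores_b); half are drawn per round."""
    rng = random.Random(seed)
    aucs_a, aucs_b = [], []
    for _ in range(n_rounds):
        subset = rng.sample(proteins, len(proteins) // 2)
        labels = [y for p in subset for y in p[0]]
        aucs_a.append(auc(labels, [s for p in subset for s in p[1]]))
        aucs_b.append(auc(labels, [s for p in subset for s in p[2]]))
    return aucs_a, aucs_b


# Hypothetical toy data: 40 proteins of 30 residues, two synthetic predictors.
rng = random.Random(1)
toy = []
for _ in range(40):
    labels = [rng.randint(0, 1) for _ in range(30)]
    labels[0], labels[1] = 0, 1  # guarantee both classes are present
    good = [y + rng.gauss(0, 0.8) for y in labels]
    weak = [y + rng.gauss(0, 2.0) for y in labels]
    toy.append((labels, good, weak))

aucs_a, aucs_b = resampled_aucs(toy)
t_stat, p_value = paired_t_test(aucs_a, aucs_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

The same loop applies to Fmax by swapping the metric function. For very large |t| the integration loses precision and the p-value is reported as effectively zero, which is sufficient for comparison against a 0.05 threshold.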

Table S7.
Performance comparisons of DisoFLAG and other predictors on the DP94 independent test dataset.

Table S8.
The statistical significance of differences (p-value) in predictive performance by different methods on the DP94 test dataset. The p-values are calculated by resampling half of the test dataset 20 times and using the two-sided paired t-test for each pair of the disordered function predictors. The upper triangle compares AUC values, and the lower triangle compares Fmax values. P-values > 0.05 are highlighted in bold.

Table S9.
Per-protein performance of different disordered function predictors on the DP93 test dataset. Metrics are averaged over the protein sequences.

Table S10.
Per-protein performance of different disordered function predictors on the DP94 test dataset. Metrics are averaged over the protein sequences.

Table S11.
Performance metrics for Disorder-Binding prediction on the CAID2 test dataset.
* Methods are sorted by AUC values. C, coverage of predictions.

Table S12.
Performance metrics for Disorder-Linker prediction on the CAID2 test dataset.
* Methods are sorted by AUC values. C, coverage of predictions.