- Methodology article
- Open access
- Published:
Scanning sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data
BMC Biology volume 22, Article number: 218 (2024)
Abstract
Background
RNA-sequencing technology provides an effective tool for understanding miRNA regulation in complex human diseases, including cancers. A large number of computational methods have been developed to make use of bulk and single-cell RNA-sequencing data to identify miRNA regulations at the resolution of multiple samples (i.e. group of cells or tissues). However, due to the heterogeneity of individual samples, there is a strong need to infer miRNA regulation specific to individual samples to uncover miRNA regulation at the single-sample resolution level.
Results
Here, we develop a framework, Scan, for scanning sample-specific miRNA regulation. Since a single network inference method or strategy cannot perform well for all types of new data, Scan incorporates 27 network inference methods and two strategies to infer tissue-specific or cell-specific miRNA regulation from bulk or single-cell RNA-sequencing data. Results on bulk and single-cell RNA-sequencing data demonstrate the effectiveness of Scan in inferring sample-specific miRNA regulation. Moreover, we have found that incorporating the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction. In addition, Scan can contribute to construct cell/tissue correlation networks and recover aggregate miRNA regulatory networks. Finally, the comparison results have shown that the performance of network inference methods is likely to be data-specific, and selecting optimal network inference methods is required for more accurate prediction of miRNA targets.
Conclusions
Scan provides a useful method to help infer sample-specific miRNA regulation for new data, benchmark new network inference methods and deepen the understanding of miRNA regulation at the resolution of individual samples.
Background
As a type of regulatory non-coding RNA (ncRNA), microRNA (miRNA, 19–25 nt in length) can modulate the expression levels of thousands of coding RNAs, thereby playing a key role in biological processes, signaling pathways and pathological processes of human cancers [1,2,3]. Due to the importance of miRNAs in biological system, they have the potential to be biomarkers of human cancers, e.g. breast cancer [4], leukemia [5] and prostate cancer [6] in clinical applications.
There are multiple types of biological networks, including gene regulatory network within a biological sample (cell or tissue). Among the different types of biological networks, gene regulatory network is regarded as an important feature of biological samples, and gives rise to the unique characteristics of each biological sample [7]. In the gene regulation field, miRNA regulation has attracted broad attentions on account of its potential clinical translation. Therefore, inferring and characterizing miRNA regulation is a central problem for revealing miRNA regulatory mechanisms in systems biology. Just like dynamic biological system in a biological sample (cell or tissue), miRNA regulation also tends to be dynamic. For example, by using bulk and single-cell RNA-sequencing data, existing studies [8,9,10,11,12] have found that the miRNA regulatory networks are condition-specific, tissue-specific and cell-specific. Moreover, it is widely known that biological samples are characterized by heterogeneity, and are shown to be unique. Thus, to understand the miRNA regulation specific to biological samples (cancer tissues or cells in particular), it is essential to investigate sample-specific miRNA regulation (i.e. the miRNA regulatory network for a specific sample).
At the single-sample level, previous methods including CSN [13], DEVC-net [14], EdgeBiomarker [15], SSN [16], LIONESS [17], c-CSN [18], LocCSN [19], SWEET [20], P-CSN [21] and Normi [22] have been presented to identify sample-specific gene regulation (one gene regulatory network for one sample), and the identified gene regulatory networks with these methods are undirected networks. To infer causal regulatory networks (represented as directed networks) specific to individual samples, an inference method [23] using drift–diffusion processes and NME [24] are also proposed. However, the above computational methods [13,14,15,16,17,18,19,20,21,22,23,24] are designed for exploring sample-specific transcriptional regulation rather than sample-specific miRNA regulation. To study miRNA regulation at the single-cell level, CSmiR [11] is introduced to infer cell-specific miRNA regulation from single-cell miRNA-mRNA co-sequencing data (where each single cell is considered as a sample). As stated in CSmiR, for single-cell miRNA-mRNA co-sequencing data with less than 100 single-cells, CSmiR needs to interpolate pseudo-cells by using a re-sampling technique. However, the original single-cell miRNA-mRNA co-sequencing data itself contains technical noise (e.g. high dropout rate), thus as a result, the enlarged data by adding pseudo-cells may aggravate the technical noise.
To model the dynamic regulatory processes of miRNAs at the single-sample level, in this work, we propose a Sample-specific miRNA regulation (Scan) framework to scan sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data. For exploring miRNA-mRNA regulatory relationships, a large number of network inference methods have been proposed, which can be categorized as Correlation, Distance, Information, Regression, Bayesian, Proportionality, and Causality [25]. Recently, two main strategies, statistical perturbation [16] and linear interpolation [17], have emerged that explicitly predict sample-specific networks. Due to the fact that a single network inference method or strategy does not always perform the best in all types of new data, Scan incorporates 27 network inference methods and two strategies to infer tissue-specific or cell-specific miRNA regulation from bulk or single-cell RNA-sequencing data. Given bulk or single-cell RNA-sequencing data with or without prior information of miRNA-mRNA interactions, 27 well-used network inference methods in Scan can be used to calculate the strength between miRNAs and mRNAs. Based on the strength between miRNAs and mRNAs, Scan adapts two strategies: statistical perturbation and linear interpolation to infer sample-specific miRNA regulatory networks.
By applying Scan to two RNA-sequencing datasets, a bulk dataset from breast cancer tissues, and a single-cell dataset from chronic myelogenous leukemia (CML) cells, we demonstrate the effectiveness of Scan in terms of accuracy and efficiency of miRNA target prediction. Moreover, the performance comparison between with and without using prior information indicates that the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction. We also reveal that Scan can help to construct sample correlation network and recover aggregate miRNA regulatory network. The comparison with CSmiR shows that with using the prior information of miRNA targets, Scan mostly has better or similar performance compared with CSmiR [11] in inferring cell-specific miRNA regulation from the chronic myelogenous leukemia dataset. Furthermore, without using the prior information of miRNA targets, the performance of Scan is mostly better than or comparable to the performance of CSmiR in inferring tissue-specific miRNA regulation from the breast cancer dataset. Altogether, Scan provides a useful way to identify sample-specific miRNA regulation in human cancers, and can help to elucidate miRNA regulation mechanisms at the resolution of individual samples.
Results
Sample-specific miRNA regulation (Scan)
To investigate miRNA regulation at the resolution of single samples, we develop an approach called Scan to infer sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data (Fig. 1). The bulk RNA-sequencing data of 690 breast cancer (BRCA) tissues is obtained from The Cancer Genome Atlas (TCGA) [26, 27] project, and the single-cell RNA-sequencing data of 19 half K562 cells (the first human chronic myeloid leukemia cell line) is from [28, 29] (Methods). The optional prior information of miRNA-mRNA interactions is from two well-known databases: TargetScan v8.0 [30] and ENCORI [31] (Methods). For each network inference method, the prior information between miRNAs and mRNAs inferred from the third-party miRNA-target databases is used as an initial structure of miRNA-mRNA regulatory network.
Given bulk or single-cell RNA-sequencing data (i.e. matched miRNA and mRNA expression data) with or without prior information of miRNA-mRNA interactions, Scan apply one of 27 network inference methods (Methods) spanning seven types (including Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality) to infer miRNA-mRNA relation matrix. For each sample (cell or tissue), Scan needs to infer two miRNA-mRNA relation matrices (one for all samples and the other for all samples except the sample of interest). Based on the two identified miRNA-mRNA relation matrices, Scan further adapts one strategy (statistical perturbation or linear interpolation) to infer miRNA regulatory network specific to the sample of interest (Methods).
Sample-specific miRNA regulation is closely associated with the selected strategy
Using the miRNA-mRNA co-sequencing data in K562 and BRCA, we apply Scan to infer sample-specific miRNA regulation across 19 half K562 cells and 690 BRCA tissues. For the K562 dataset, Scan uses 27 network inference methods to infer miRNA-mRNA relation matrices. We do not report the results of four methods (including Dcor [32], Hoeffding [33], MIC [34] and Ridge [35]) due to long runtime and one method (IDA [36]) because of the combination of long runtime and large memory usage for the BRCA dataset, and we have stopped their computations before the results are produced. Therefore, Scan uses 22 network inference methods to identify miRNA-mRNA relation matrices for the BRCA dataset. For each sample (cell or tissue), each network inference method predicts two miRNA-mRNA relation matrices (one for all samples and the other for all samples except the sample of interest). Based on the identified miRNA-mRNA relation matrices, Scan further applies two strategies Scan.interp (using linear interpolation strategy) and Scan.perturb (using statistical perturbation strategy) to construct sample-specific miRNA regulatory networks across 19 half K562 cells and 690 BRCA tissues.
To systematically compare the performance of different combinations of network inference methods and strategies used, we have analyzed a total of 1026 networks: 27 (network inference methods) times 2 (strategies) times 19 (cells) in the K562 dataset, and a total of 30,360 networks: 22 (network inference methods) times 2 (strategies) times 690 (tissues) in the BRCA dataset. Overall, we have found that using different network inference methods results in different number of cell- and tissue- specific miRNA-mRNA interactions (Additional file 1: Fig. S1a, S2a, S3a and S4a). Particularly, Dcor [32] has the largest number of cell-specific miRNA-mRNA interactions predicted, and MI [37] and MIC [34] also predict relatively higher number of cell-specific miRNA-mRNA interactions (Additional file 1: Fig. S1a and S2a). A possible reason is that in contrast to other network inference methods, Dcor, MI and MIC can measure both linear and nonlinear associations between miRNAs and mRNAs. Furthermore, when the prior information of miRNA targets is used, more cell- and tissue- specific miRNA regulatory networks are found to have their node degree obey a power law distribution and thus are regarded as scale-free networks (Additional file 1: Fig. S1b, S2b, S3b and S4b). This result may be explained that the prior information of miRNA targets follows a power law distribution. Interestingly, with or without using the prior information of miRNA targets, the node degree of the identified cell-specific miRNA regulatory networks by Dcor [32] does not follow a power law distribution. The possible reason is that the number of predicted targets for each miRNA by Dcor obeys a uniform distribution. Finally, in terms of the predicted miRNA-mRNA interactions, the similarity of the majority of cell or tissue pairs is less than 0.50 (Additional file 1: Fig. S1c, S2c, S3c and S4c), indicating the heterogeneity of miRNA regulation for each K562 cell or BRCA tissue.
When Scan.interp is used for inferring cell-specific miRNA regulation, using Lasso [35] for strength calculation gives the best final performance in terms of accuracy and using Bcor [38] leads to the highest efficiency (Additional file 1: Fig. 2a, Table S1). Using GenMiR + + [39] for strength calculation gives the best final results in terms of accuracy and whereas using Bcor [38] has the highest efficiency when Scan.interp is used for inferring tissue-specific miRNA regulation (Additional file 1: Fig. 2b, Table S1). Overall, when using Scan.interp for inferring cell- and tissue-specific miRNA regulation, Canberra [40] and Phis [41] has the largest overall rank scores, respectively. When Scan.perturb is used for inferring cell-specific miRNA regulation, using Lasso has the best final performance in terms of accuracy whereas using Bcor leads to the highest efficiency (Additional file 1: Fig. 2a, Table S2). Using Rhop [41] gives the best final results in terms of accuracy whereas using Bcor has the highest efficiency when Scan.perturb is used for inferring tissue-specific miRNA regulation (Additional file 1: Fig. 2b, Table S2). Overall, when using Scan.perturb for inferring cell- and tissue- specific miRNA regulation, using Chebyshev [42] and Rhop [41] displays the best overall performance, respectively. The fact that no consensus in terms of the best or better network inference methods found in both K562 and BRCA datasets is caused by two potential reasons. Firstly, the RNA-sequencing technology and biological samples used by the K562 and BRCA datasets are different. Compared with the BRCA dataset, the K562 dataset has higher dropout rate and lower gene detection rate. In the case of high dropout rate and low gene detection rate, the statistic of most of distance-based methods is still reliable to measure the distance between miRNAs and mRNAs. Secondly, the performance of different network inference methods is evaluated with both accuracy and efficiency. Compared with distance-based methods, the proportionality-based methods have better efficiency with two strategies (Scan.interp and Scan.perturb), thus have higher efficiency on the BRCA dataset. Moreover, in the BRCA dataset, the statistics of the proportionality-based methods have better accuracy in predicting miRNA targets.
In terms of accuracy and efficiency, Lasso and Bcor are consistent in the K562 dataset regardless of which strategy is used for inferring cell-specific miRNA regulation. In terms of efficiency, Bcor is consistent in the BRCA dataset regardless of which strategy is used for inferring tissue-specific miRNA regulation. Although the best network inference method (in terms of overall rank score) is different between Scan.interp and Scan.perturb, the network inference methods which lead to the best final result are all distance based network inference methods in the K562 dataset, and are all proportionality based network inference methods in the BRCA dataset. Furthermore, in terms of efficiency, the rank scores obtained by using 27 network inference methods between Scan.interp and Scan.perturb are consistent (p-value less than 2.22E-16 with Cohen’s kappa) in the K562 dataset, and the rank scores obtained by using 22 network inference methods between Scan.interp and Scan.perturb are also consistent (p-value less than 2.22E-16 with Cohen’s kappa) in the BRCA dataset. However, in terms of accuracy and overall, the rank scores obtained by using 27 network inference methods between Scan.interp and Scan.perturb are not consistent (p-values are 0.53 and 0.69 with Cohen's kappa, respectively) in the K562 dataset, and the rank scores obtained by using 22 network inference methods between Scan.interp and Scan.perturb are also not consistent (p-values are 0.58 and 0.75 with Cohen's kappa, respectively) in the BRCA dataset. This result suggests that the performance of the network inference methods is closely associated with the selected strategy.
Prior information can generally improve the accuracy of miRNA target prediction
To understand whether the prior information of miRNA targets can improve the accuracy of miRNA target prediction, we compare the average percentage of validated miRNA targets predicted by Scan with and without using the prior information of miRNA targets. We find that incorporating the prior information of miRNA targets generally results in a larger average percentage of validated miRNA targets in both K562 and BRCA datasets (Fig. 3). Moreover, the higher the confidence of the prior information is, generally the larger the average percentage of validated miRNA targets in both K562 and BRCA datasets is (Fig. 3). In particular, after incorporating the prior information of miRNA targets, five network inference methods (Jaccard, Dice, Chebyshev, Euclidean and Z-score) with Scan.perturb show different performances between K562 and BRCA datasets. This is because after incorporating the prior information of miRNA targets, most of sample-specific miRNA targets identified by these network inference methods with Scan.perturb are not in the validated miRNA-target databases due to incomplete knowledge of miRNA targets, and the percentage of validated miRNA targets specific to many K562 cells or BRCA tissues is 0. Consequently, for the five network inference methods with using prior information, the average percentage of validated miRNA targets across all of K562 cells or BRCA tissues displays small or no improvement in contrast to without using prior information. Overall, our result demonstrates that the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction.
Application of sample-specific miRNA regulation
With the K562 dataset, when Scan.interp and Scan.perturb are used together with Canberra and Chebyshev respectively, the best result in terms of overall rank score is achieved. For the BRCA dataset, when Scan.interp and Scan.perturb are used together with Phis and Rhop respectively, the best result in terms of overall rank score is obtained. As a result, we select the four optimal combinations (Scan.interp_Canberra, Scan.perturb_Chebyshev, Scan.interp_Phis, Scan.perturb_Rhop) without using prior information to infer cell-specific and tissue-specific miRNA regulatory networks respectively for the two datasets. The identified sample-specific miRNA regulatory networks can be further applied into downstream analysis. In this section, we are interested in two types of downstream analysis, including constructing correlation networks of K562 cells and BRCA tissues, and finding dynamic and conservative miRNA regulation across K562 cells and BRCA tissues.
Correlation networks of K562 cells and BRCA tissues
The correlation networks of K562 cells and BRCA tissues are useful for investigating the correlation or association among cells or tissues. We use the inferred cell-specific (or tissue-specific) miRNA regulatory networks to generate cell–cell (or tissue-tissue) similarity matrices. Based on the generated cell–cell and tissue-tissue similarity matrices, we further construct correlation networks of K562 cells and BRCA tissues respectively (Methods). As a result, we have constructed two correlation networks for the K562 cells and two correlation networks for the BRCA tissues (Additional file 2). Based on the identified cell-specific miRNA regulatory networks by Scan.interp_Canberra and Scan.perturb_Chebyshev, the numbers of cell–cell correlation pairs are 171 and 22 respectively for the K562 dataset, Based on the identified tissue-specific miRNA regulatory networks by Scan.interp_Phis and Scan.perturb_Rhop, the numbers of tissue-tissue correlation pairs are 51,769 and 8805 respectively for the BRCA dataset. Network topological analysis reveals that the node degree of the identified cell–cell correlation network (10 nodes and 22 edges) by Scan.perturb_Chebyshev obeys a power law distribution (p-value more than 0.99 with Kolmogorov–Smirnov test), and the node degrees of two identified tissue-tissue correlation networks by Scan.interp_Phis and Scan.perturb_Rhop also follow power law distributions (p-values more than 0.85 with Kolmogorov–Smirnov test). Additionally, the identified cell–cell correlation network (19 nodes and 171 edges) by Scan.interp_Canberra is a fully connected network, and it is also a small-world network with higher densities than its corresponding random networks (p-value less than 2.22E-16 with Student’s t-test). This result indicates that the identified correlation networks of cells or tissues tend to be scale-free or small-world networks, which are important characteristics of real-world biological networks [43, 44].
Dynamic and conservative miRNA regulation across K562 cells and BRCA tissues
Based on the identified cell-specific or tissue-specific miRNA regulatory networks, we further investigate the dynamic and conservative miRNA regulation across K562 cells and BRCA tissues (Methods). With both K562 and BRCA datasets, the numbers of dynamic and conservative miRNA-mRNA interactions identified tend to be varying (Additional file 3). With the K562 dataset, the node degrees of the identified dynamic and conservative miRNA regulatory networks by Scan.interp_Canberra do not follow power law distributions (p-values are 2.41E-05 and 3.77E-04 with Kolmogorov–Smirnov test, respectively), but the node degrees of the identified dynamic and conservative miRNA regulatory networks by Scan.perturb_Chebyshev obey power law distributions (p-values are 1.00 and 0.99 with Kolmogorov–Smirnov test, respectively). With the BRCA dataset, the node degrees of the identified dynamic miRNA regulatory networks by Scan.interp_Phis and the identified conservative miRNA regulatory networks by Scan.perturb_Rhop follow power law distributions (p-values are 0.21 and 1.00 with Kolmogorov–Smirnov test, respectively). However, with the BRCA dataset, the node degrees of the identified conservative miRNA regulatory networks by Scan.interp_Phis and the identified dynamic miRNA regulatory networks by Scan.perturb_Rhop do not obey power law distributions (p-values are 2.77E-33 and 9.47E-05 with Kolmogorov–Smirnov test, respectively).
In addition, the numbers of dynamic and conservative hub miRNAs by Scan.interp and Scan.perturb are also different (Additional file 4). We further conduct validation analysis to investigate whether the identified dynamic and conservative hub miRNAs are closely associated with CML and BRCA. With the K562 dataset, 4 out of 16 dynamic hub miRNAs and 1 out of 1 conservative hub miRNA identified by Scan.interp_Canberra are confirmed to be CML-related miRNAs with experimental evidence. With the BRCA dataset, 35 out of 44 dynamic hub miRNAs and 15 out of 15 conservative hub miRNAs identified by Scan.interp_Phis are confirmed to be BRCA-related miRNAs, and 39 out of 39 dynamic hub miRNAs identified by Scan.perturb_Rhop are validated to be BRCA-related miRNAs with experimental support. This validation result implies that dynamic and conservative hub miRNAs across K562 cells or BRCA tissues are closely associated with CML or BRCA.
Aggregate miRNA regulation exhibits to be various between single-sample and multi-sample resolution levels
At the multi-sample resolution level, we apply network inference methods to identify aggregate miRNA regulatory networks across K562 cells and BRCA tissues. Here, the way of identifying aggregate miRNA regulatory networks at the resolution of multiple samples is named as Aggregate. At the single-sample resolution level, we use the majority voting algorithm [45] to infer aggregate miRNA regulatory networks from the identified sample-specific miRNA regulatory networks by Scan.interp and Scan.perturb (Methods). By comparing the aggregate miRNA regulation between single-sample and multi-sample resolution levels, we have discovered that the similarity between the identified aggregate miRNA regulatory networks (Scan.interp vs Aggregate, Scan.perturb vs Aggregate) is mostly less than 50% and 70% in the K562 and BRCA datasets, respectively (Fig. 4a and c). This result displays that the aggregate miRNA regulation exhibits to be various between single-sample and multi-sample resolution levels. Additionally, Scan.interp and Scan.perturb are comparable to Aggregate in the percentage of validated aggregate miRNA-mRNA interactions (Fig. 4b and d). This finding indicates that integrating sample-specific miRNA regulatory networks may contribute to recover aggregate miRNA regulatory networks across K562 cells and BRCA tissues.
Performance comparison with CSmiR
To evaluate the effectiveness of Scan, we compare the performance of Scan (Scan.interp and Scan.perturb) with CSmiR [11] on the K562 and BRCA datasets. To our knowledge, CSmiR is the first and the only existing method to explore miRNA regulation at a single-cell resolution level, so we use it as the baseline for comparison. With the K562 dataset, without using the prior information of miRNA targets, Scan.interp with 8 network inference methods and Scan.perturb with 7 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5a). After integrating the prior information of miRNA targets from TargetScan, Scan.interp with 14 network inference methods and Scan.perturb with 18 network inference methods perform better than CSmiR in terms of accuracy, and after integrating the prior information of miRNA targets from ENCORI, Scan.interp with 21 network inference methods and Scan.perturb with 20 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5a). This result shows that in terms of accuracy, Scan mostly performs better than or comparable to CSmiR with the K562 dataset when using the prior information of miRNA targets. Furthermore, in terms of efficiency, both Scan.interp and Scan.perturb with all of the network inference methods perform better than CSmiR (Fig. 5b).
With the BRCA dataset, without using the prior information of miRNA targets, Scan.interp with 11 network inference methods and Scan.perturb with 12 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5c). After integrating the prior information of miRNA targets from TargetScan, Scan.interp with 7 network inference methods and Scan.perturb with 10 network inference methods perform better than CSmiR in terms of accuracy, and after integrating the prior information of miRNA targets from ENCORI, Scan.interp with 4 network inference methods and Scan.perturb with 7 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5c). This result shows that in terms of accuracy, Scan mostly performs better than or comparable to CSmiR with the BRCA dataset when without using the prior information of miRNA targets. In addition, in terms of efficiency, Scan.interp with 3 network inference methods and Scan.perturb with 9 network inference methods perform better than CSmiR (Fig. 5d).
The above performance comparison suggests that in the small-scale K562 dataset, Scan with prior information generally performs better than or comparable to CSmiR in terms of accuracy and efficiency. In the large-scale BRCA dataset, Scan without using prior information generally performs better than or comparable to CSmiR in terms of accuracy, but in terms of efficiency, only a few instances of Scan (Scan with a few network inference methods) perform better than CSmiR. The inconsistent result of performance comparison with the two datasets implies that the performance of Scan is more likely to be data-specific, and it is necessary to select optimal network inference methods for new data.
Discussion
It is well-established that miRNAs are important regulators of gene expression, and their dysregulations can lead to the occurrence and development of complex human diseases, including cancers. Yet, the research on miRNA regulation at the resolution of single samples (cells or tissues) is still limited. Here, we present the Scan framework, and further show the effectiveness of it in inferring sample-specific miRNA regulation. By applying Scan into bulk and single-cell RNA-sequencing data, we have discovered that adding the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction. By instantiating the Scan framework with 27 network inference methods, we have found that the performance of Scan instantiated with different network inference methods exhibits to be data-specific. In addition, the identified sample-specific miRNA regulatory network by Scan can be used for downstream analysis, e.g. constructing sample correlation network. The freely available framework Scan provides a useful method for exploring miRNA regulation at the resolution of single samples.
For constructing miRNA-mRNA relation matrix, Scan includes 27 network inference methods spanning seven types (Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality). It is noted that other types of computational methods (e.g. deep learning [46], probabilistic modeling [47]) can be also plugged into Scan for generating miRNA-mRNA relation matrix. In future, it is our plan to add more types of network inference methods. Additionally, the noise exist in bulk and single-cell RNA-sequencing data, as well as gene and cell dropouts, call for more robust methods to predict miRNA regulatory network. For network inference, ensemble methods based on the wisdom of crowds has demonstrated better performance than individual methods [48, 49]. Therefore, for robust miRNA-mRNA relation matrix, we will incorporate ensemble methods (i.e. integrating the prediction results from individual methods) into Scan.
The incorporation of domain knowledge can reduce the searching space and result in approximate optimal solutions [50]. As the domain knowledge, the prior information of miRNA targets provides an initial structure of miRNA regulatory network to constrain the searching space of network inference methods. In this work, we have demonstrated that combining prior information can improve the performance of miRNA target prediction, and higher confidence of prior information leads to better accuracy of miRNA targets. This result is consistent with previous studies [51, 52]. However, it is noted that this higher accuracy comes at the cost of reducing the number of miRNA-mRNA interactions predicted. It is expected that more comprehensive and high-confidence prior information will alleviate this issue.
We provide two strategies, Scan.interp (linear interpolation strategy) and Scan.perturb (statistical perturbation strategy) to infer miRNA regulation specific to single cells or tissues. Generally, using the same network inference method, Scan.interp can generate larger cell-specific or tissue-specific miRNA regulatory networks than those by Scan.perturb. This can be explained by the fact that Scan.interp infers cell-specific miRNA regulatory networks with differential values (e.g. differential correlation, distance or regression) between all samples and all samples except sample k, but Scan.perturb identifies differential miRNA regulatory networks between the networks of using all samples and the networks of using all samples except sample k (see details in Methods). Moreover, the runtime of Scan.interp and Scan.perturb with the K562 and BRCA datasets is similar. For new data, we suggest use both Scan.interp and Scan.perturb for comprehensively exploring miRNA regulation specific to individual samples.
To infer sample-specific miRNA regulation, we have applied Scan into bulk and single-cell RNA-sequencing data across tumor cells or tissues. Certainly, Scan is also applicable in bulk and single-cell RNA-sequencing data across healthy cells or tissues. In future, it will be a meaningful direction to reveal dynamic and conservative miRNA regulation between the matched tumor and healthy cells or tissues. Moreover, with the advancement of spatial RNA-sequencing technology [53], Scan is also a potential method to explore the heterogeneity of miRNA regulation between different spatial positions from spatial RNA-sequencing data.
When Scan is applied to large-scale transcriptomics data, the process of inferring miRNA-mRNA relation matrix using network inference methods will be computationally intensive. This is a common issue of existing computational methods, including Scan. To alleviate the issue, users can allocate more CPU cores to identify miRNA-mRNA relation matrix in parallel. Furthermore, users can also use the prior information of samples to divide samples into several subtypes, and then focus on investigating sample-specific miRNA regulation across cells or tissues of each subtype. Finally, selecting a fast network inference method (e.g. Pearson and Bcor) is also a feasible way to quickly infer sample-specific miRNA regulation from large-scale transcriptomics data.
To help users to select the most appropriate inference method for their own dataset, we have provided an extensive tutorial of Scan at https://github.com/zhangjunpeng411/Scan or https://doi.org/10.5281/zenodo.13346480 [54]. For a large-scale dataset (e.g. the number of samples is more than 100), we recommend users selecting the network inference methods with better efficiency (i.e. less runtime). In this tutorial, we select five representative network inference methods (Pearson, Euclidean, MI, Lasso, Phit) spanning five types (Correlation, Distance, Information, Regression and Proportionality) and two strategies (Scan.interp and Scan.perturb) to infer cell-specific miRNA regulation from small-scale K562 single-cell RNA-sequencing data. The prior information of miRNA targets has three cases: None (no prior information), TargetScan (prior information of TargetScan), ENCORI (prior information of ENCORI). This tutorial shows that the combination (Scan.perturb and Pearson) is the most appropriate inference method for inferring cell-specific miRNA regulation from K562 single-cell RNA-sequencing data.
In addition to studying miRNA regulation at the single-sample resolved level, Scan also can be used for other types of gene regulation specific to individual samples, e.g. transcriptional regulation, long non-coding RNA regulation, circular RNA regulation and PIWI-interacting RNA regulation.
Since miRNAs play important roles in tumor microenvironments, it is important for us to understand the regulation of miRNAs in tumor cells or tissues. In-depth investigation of miRNA regulation specific to individual tumor cells or tissues will help to figure out the heterogeneity of tumor microenvironments and discover novel subtypes of tumor cells or tissues.
Conclusions
Taken altogether, we provide a useful method called Scan to identify sample-specific miRNA regulation that helps understand the heterogeneity of individual samples. The presented results provide insights to uncover the heterogeneity of miRNA regulation across K562 cells and BRCA tissues. We believe that Scan is applicable to explore miRNA regulation mechanisms at the resolution of individual samples.
Methods
Bulk and single-cell RNA-sequencing data
From TCGA [26, 27], we have obtained the normalized expression data of 894 miRNAs and 19,068 mRNAs in 690 matching breast cancer tissues. Since epithelial-mesenchymal transition (EMT) is closely related to the development, progression and metastasis of breast cancer [55,56,57] and EMT signatures associates closely with breast cancer subtypes [58, 59], we obtain a list of 315 EMT signatures [60] (Additional file 5) and further divide the matched 690 breast cancer tissues into four EMT types (epithelial, intermediate epithelial, intermediate mesenchymal and mesenchymal) by using GSVA R package [61]. As a result, the numbers of breast cancer tissues belonging to epithelial, intermediate epithelial, intermediate mesenchymal and mesenchymal types are 491, 107, 46 and 46, respectively (Additional file 6). By using the limma-trend approach in limma R package [62], we have identified 163 miRNAs and 5801 mRNAs that are differentially expressed between epithelial and mesenchymal type (adjusted p-value < 0.01, fold change > 1.5) (Additional file 7). Here, the p-values are adjusted by the Benjamini–Hochberg (BH) method [63]. Therefore, the input bulk RNA-sequencing data used in our case study includes the expression data of 163 miRNAs and 5801 mRNAs in 690 breast cancer tissues.
The normalized K562 single-cell RNA-sequencing data (the accession number is GSE114071 in Gene Expression Omnibus (GEO) database) includes the expression data of 2822 miRNAs and 21,704 mRNAs in 19 half K562 cells (the first human chronic myeloid leukemia cell line) [28, 29]. In the single-cell RNA-sequencing data, we calculate the average expression values of the duplicate miRNAs or mRNAs as their ultimate expression values. As a feature selection step, we discard the miRNAs or mRNAs with constant expression values, and we are interested in the high expression miRNAs or mRNAs with mean values greater than the median of mean values of all non-constant expression miRNAs or mRNAs. Moreover, the reserved miRNA data and mRNA expression data are then log-transformed. As a result, the input single-cell RNA-sequencing data used in our case study includes the expression data of 212 miRNAs and 7680 mRNAs in 19 half K562 cells.
Prior information of miRNA targets
To improve the prediction of sample-specific miRNA regulation, the miRNA target information in TargetScan v8.0 [30] and ENCORI [31] (the pilot version is starBase) is used as prior information for Scan. From TargetScan, a list of 235,109 predicted miRNA-mRNA interactions has been obtained. To identify miRNA-mRNA interactions, ENCORI provides seven prediction tools (PITA [64], RNA22 [65], miRmap [66], DIANA-microT [67], miRanda [68], PicTar [69] and TargetScan [30]) for users. To obtain high-confidence miRNA-mRNA interactions, we only retain the miRNA-mRNA interactions predicted by at least five prediction algorithms. In total, a list of 55,343 high-confidence miRNA-mRNA interactions is obtained from ENCORI. The number of overlapped miRNA-mRNA interactions between TargetScan v8.0 and ENCORI is 48,266, indicating that most of high-confidence miRNA-mRNA interactions (87.21%) in ENCORI are included in TargetScan v8.0.
Network inference methods
We select 27 network inference methods (Table 1) for constructing miRNA-mRNA relation matrix, including Pearson [70], Spearman [71], Kendall [72], Distance correlation (Dcor) [32], Random Dependence Coefficient (RDC) [73], Hoeffding’s D statistics (Hoeffding) [33], Z-score [74], Biweight midcorrelation (Bcor) [38], Weighted rank correlation (Wcor) [75], Cosine [76], Euclidean [77], Manhattan [78], Canberra [40], Chebyshev [42], Dice [79], Jaccard [80], Mahalanobise [81], Mutual Information (MI) [37], Maximal Information Coefficient (MIC) [34], Lasso [35], Elastic [35], Ridge [35], GenMiR++ [39], \(\phi\)(Phit) [41], \(\phi_{{\text{s}}}\)(Phis) [41], \(\rho_{{\text{p}}}\)(Rhop) [41], and Intervention calculus when the Directed acyclic graph is Absent (IDA) [36]. These methods can be divided into seven types: Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality. Ten methods (Pearson, Spearman, Kendall, Dcor, RDC, Hoeffding, Z-score, Bcor, Wcor and Cosine) belong to the Correlation type, seven methods (Euclidean, Manhattan, Canberra, Chebyshev, Dice, Jaccard and Mahalanobise) belong to the Distance type, two methods (MI and MIC) are of the Information type, three methods (Lasso, Elastic and Ridge) belong to the Regression type, three methods (Phit, Phis and Rhop) belong to the Proportionality type, and GenMiR + + and IDA are of the Bayesian and Causality types, respectively.
Sample-specific network inference
Given bulk or single-cell RNA-sequencing data with m samples (i.e. matched miRNA and mRNA expression data), with or without prior information of miRNA-mRNA interactions, Scan constructs two miRNA-mRNA relation matrices (one for all samples and the other for all samples except the k-th sample of interest,\(k \in [1,m]\)) by using one of the network inference methods. The two constructed miRNA-mRNA relation matrices (of p miRNAs and q mRNAs) for all samples and all samples except the k-th sample (denoted as \(X^{(k)}\) and \(Y^{(k)}\) respectively) are as follows.
where \(X_{ij}^{(k)}\) and \(Y_{ij}^{(k)}\) represent the connection between mRNA i and miRNA j.
For the network inference methods which are of the Correlation, Information, Regression, Bayesian, Proportionality and Causality types, larger absolute values of \(X_{ij}^{(k)}\) and \(Y_{ij}^{(k)}\) indicate higher connections between mRNA i and miRNA j. In contrast, for Distance based network inference methods, larger absolute values of \(X_{ij}^{(k)}\) and \(Y_{ij}^{(k)}\) indicate weaker connections between mRNA i and miRNA j. To keep consistency between the constructed miRNA-mRNA relation matrices by different types of network inference methods, we transform the two miRNA-mRNA relation matrices (\(X^{(k)}\) and \(Y^{(k)}\)) generated by Distance based methods into \(X{\prime}^{(k)}\) and \(Y{\prime}^{(k)}\) as follows:
where eps refers to the precision of floating numbers (the value is 2.22E-16 in default).
After transformation, a larger absolute value in a relation matrix generated by any type of the network inference methods indicates a higher connection between a miRNA and a mRNA.
Linear interpolation strategy
Following the use of a network inference method of the Correlation, Information, Regression, Bayesian, Proportionality or Causality type to obtain the two constructed miRNA-mRNA relation matrices (\(X^{(k)}\) and \(Y^{(k)}\)), Scan applies a linear interpolation strategy [17] to estimate the miRNA-mRNA relation matrix \(Z^{(k)}\) specific to sample k as follows:
where m denotes the number of samples in bulk or single-cell RNA-sequencing data.
When a Distance based network inference method is used to obtain the two constructed miRNA-mRNA relation matrices (\(X{'}^{(k)}\) and \(Y{'}^{(k)}\)), the miRNA-mRNA relation matrix \(Z^{(k)}\) specific to sample k is estimated as:
\(Z_{ij}^{(k)}\) is further normalized as:
where \(\mu^{(k)}\) and \(\sigma^{(k)}\) denote the mean value and standard deviation of \(Z^{(k)}\), and the normalized miRNA-mRNA relation matrix \(Z^{^{\prime}(k)}\) is:
Each value of \(Z_{ij}^{^{\prime}(k)}\) corresponds to a significance p-value. The p-value is calculated as follows:
where \(|Z_{ij}^{^{\prime}(k)} |\) is the absolute value of \(Z_{ij}^{^{\prime}(k)}\), and the pnorm function is used to calculate the probability a random value from the standard normal distribution being less than \(|Z_{ij}^{^{\prime}(k)} |\).
Smaller value of \(p_{ij}^{(k)}\) indicates that miRNA j is more likely to interact with mRNA i in sample k. If the value of \(p_{ij}^{(k)}\) is less than a cutoff (e.g. 0.05), miRNA j is considered to interact with mRNA i in sample k.
Statistical perturbation strategy
For the Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality methods, each value of \(X^{(k)}\)(\(X{'}^{(k)}\)) and \(Y^{(k)}\)(\(Y{'}^{(k)}\)) corresponds to a significance p-value. The corresponding p-value matrices of \(X^{(k)}\)(\(X{'}^{(k)}\)) and \(Y^{(k)}\)(\(Y{'}^{(k)}\)) are denoted as \(p_{x}^{(k)}\) and \(p_{y}^{(k)}\), respectively.
Smaller value of \(x_{ij}^{(k)}\) and \(y_{ij}^{(k)}\) indicates that miRNA j is more likely to interact with mRNA i in all samples and all samples except k, respectively. Given a p-value cutoff (e.g. 0.05), we can obtain two zero–one matrices \(ZO_{x}^{(k)}\) and \(ZO_{y}^{(k)}\) in all samples and all samples except k. By using a statistical perturbation strategy [16], Scan assigns an edge to a miRNA-mRNA pair in sample k if the p-value of that pair is different when the sample is removed from all samples. The miRNA-mRNA zero–one matrix \(ZO^{(k)}\) for sample k is calculated as follows:
where xor is the XOR logical function. One value of \(ZO_{ij}^{(k)}\) indicates that miRNA j interacts with mRNA i in sample k, and zero value of \(ZO_{ij}^{(k)}\) represents that miRNA j doesn’t interact with mRNA i in sample k.
In this work, the cutoff of significant p-value for both linear interpolation and statistical perturbation strategies is set to be 0.05 in default. By applying linear interpolation or statistical perturbation strategy, Scan can identify m sample-specific miRNA regulatory networks across m samples. Each sample-specific miRNA regulatory network is a directed graph where the direction of an edge is from a miRNA to a mRNA.
Network topological analysis
The degree of a node in a sample-specific miRNA regulatory network is the number of connections with other nodes, and the degree distribution denotes the probability distribution of node degrees over the sample-specific miRNA regulatory network. If the node degree of a sample-specific miRNA regulatory network obeys a power law distribution, the network is considered as a scale-free network. In this work, we use the R package igraph [82] to calculate the degree distribution of the identified sample-specific miRNA regulatory networks. The Kolmogorov–Smirnov (KS) test [83] is used to determine whether the node degree of a sample-specific miRNA regulatory network obeys a power law distribution. If the p-value of the KS test is smaller than a cutoff (e.g. 0.05), the node degree of a sample-specific miRNA regulatory network does not obey a power law distribution, suggesting that the sample-specific miRNA regulatory network is not a scale-free network, and vice versa.
The network density is used to measure the tightness of a sample-specific miRNA regulatory network. Compared with the random networks, a network with a higher density tends to be a small-world network [84]. We also use the R package igraph [82] to calculate the density of the inferred sample-specific miRNA regulatory networks. We generate 100 random networks (the same number of nodes and edges with real network) for each sample-specific miRNA regulatory network to determine whether the network is small-world network. The Student’s t-test [85] is applied to judge whether the density of a sample-specific miRNA regulatory network is higher than that of its corresponding random networks at a significance level. For a sample-specific miRNA regulatory network, if the p-value of the Student’s t-test is smaller than a p-value cutoff (e.g. 0.05), it is regarded as a small-world network.
Sample correlation network construction
We use sample-specific miRNA regulatory networks to compute the similarities between samples. In the context of sample-specific miRNA regulatory networks, the similarity between samples a and b is calculated in the following.
where \(Net_{a}\) and \(Net_{b}\) represent the miRNA regulatory networks specific to samples a and b, respectively, \(|Net_{a} \cap Net_{b} |\) is the number of common miRNA-mRNA interactions between \(Net_{a}\) and \(Net_{b}\), and \(min(|Net_{a} |,|Net_{b} |)\) denotes the smaller number of miRNA-mRNA interactions between \(Net_{a}\) and \(Net_{b}\).
For m samples, the sample-sample similarity matrix SM (a symmetric matrix) is:
The sample-sample similarity matrix is further used for sample correlation network construction [86]. For each sample pair, a higher similarity value indicates that the pair of samples is more correlated with each other. We use a similarity cutoff (e.g. 0.50) to infer whether two samples are correlated or not. In other words, if the similarity value between samples a and b is larger than the cutoff, samples a and b are correlated with each other. After assembling the correlated sample pairs, we can construct a sample correlation network.
Dynamic and conservative analysis
The miRNA-mRNA interaction existing in only one cell or tissue are defined as a dynamic miRNA-mRNA interaction, whereas the miRNA-mRNA interaction existing in at least a half of cells or tissues are defined as a conservative miRNA-mRNA interaction. Given the identified sample-specific miRNA regulatory networks, we can obtain dynamic and conservative miRNA-mRNA interactions which form dynamic and conservative miRNA regulatory networks, respectively.
Hub miRNA identification
Hub miRNAs are defined as highly connected miRNAs in a dynamic and conservative miRNA regulatory network. In this work, we use the cumulative probability of Poisson distribution to evaluate whether a miRNA is a hub:
where \(\lambda = np\),\(p = \frac{t}{C}\), n is the number of genes (including miRNAs and mRNAs), t is the number of miRNA-mRNA interactions in a dynamic or conservative miRNA regulatory network, and \(C\) is the number of all possible miRNA-mRNA interaction pairs. Smaller p-value of a miRNA shows that the miRNA is more likely to be a hub. Here, the cutoff of the p-value is set to 0.05.
Experimentally validated miRNAs related to CML and BRCA
To understand whether the identified dynamic and conservative hub miRNAs are closely associated with CML and BRCA, we obtain the miRNAs related to CML and BRCA with experimental evidence from RNADisease v4.0 [87], which is a comprehensive RNA-disease resource and includes experimentally validated and computationally predicted RNA-disease associations. If a dynamic or conservative hub miRNA is overlapped with the experimentally validated CML-related or BRCA-related miRNAs in RNADisease v4.0, the dynamic or conservative hub miRNA is regarded as a CML-related or BRCA-related miRNA.
Comparison metrics
We use two metrics (accuracy and efficiency) to compare different instances of Scan using different network inference methods, and compare Scan with other methods for inferring sample-specific miRNA regulation. For accuracy, due to the lack of ground-truth of miRNA-mRNA interactions at the single-sample level, the ground truth of miRNA-mRNA interactions at the multi-sample level are acquired from miRTarBase v9.0 [88] and TarBase v8.0 [89] for validation. We consider that the validation using the experimentally validated miRNA-mRNA interactions at the multi-sample level as an approximate ground-truth (of the ground-truth at the single-sample level) are still useful in screening reliable miRNA targets for subsequent experimental validation at the single-sample level. If a method has a larger percentage of validated miRNA-mRNA interactions (the number of validated predictions divided by the total number of predictions), the method is considered having higher accuracy. For efficiency, we compare the runtime of different methods in the same bulk or single-cell RNA-sequencing data. If a method takes less runtime in the same bulk or single-cell RNA-sequencing data, the method is considered having better efficiency.
In terms of both accuracy and efficiency, we use an overall rank score [90] to evaluate the performance of each method. For the method i, the overall rank score orsi is calculated as:
where \(rs_{i1}\) and \(rs_{i2}\) denote the rank scores of method i in terms of accuracy and efficiency, respectively.
A method with higher accuracy or better efficiency will obtain a larger rank score. A method with a larger overall rank score is regarded as a better or practical method.
Aggregate network inference
Given bulk and single-cell RNA-sequencing data, we infer aggregate miRNA regulatory network by using two ways. The first way (called Aggregate) is to identify aggregate miRNA regulatory networks at the resolution of multiple samples (i.e. group of cells or tissues) by using network inference methods. The second way is to predict aggregate miRNA regulatory networks via integrating sample-specific miRNA regulatory networks identified by Scan.interp and Scan.perturb. Since the majority voting algorithm [45] is a simple but effective ensemble learning method, we use it to infer aggregate miRNA regulatory networks from the identified sample-specific miRNA regulatory networks. For the majority voting method, if a sample-specific miRNA-mRNA interaction is predicted in more samples, it will be assigned a higher score (i.e. higher rank). To have a fair comparison, the number of top ranked miRNA-mRNA interactions in the second way is the same as the number of miRNA-mRNA interactions in the first way. Particularly, if the number of unique miRNA-mRNA interactions in the second way is less than the number of miRNA-mRNA interactions in the first way, we set the number of top ranked miRNA-mRNA interactions as the number of unique miRNA-mRNA interactions.
Method execution
Each execution of Scan and CSmiR on bulk or single-cell RNA-sequencing data is performed in a separate task. Each task is allocated 32 CPU cores of Intel(R) Xeon(R) Platinum 8375C CPU at 2.90 GHz, and one R session is opened for each task. The network inference methods of Scan with runtime more than 10 days or with memory usage more than 256 GB are discarded.
Availability of data and materials
Scan is released under the GPL-3.0 License, and is available at https://github.com/zhangjunpeng411/Scan/ or https://doi.org/10.5281/zenodo.13346480 [54]. Gene expression profiles used for Scan were accessed from Gene Expression Omnibus (GEO) at The Cancer Genome Atlas (TCGA) at https://www.cancer.gov/tcga [26, 27] and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114071 [28, 29]. The lists of all the data used in this study are available in the Additional files.
Abbreviations
- Bcor:
-
Biweight midcorrelation
- BH:
-
Benjamini–Hochberg
- BRCA:
-
Breast cancer
- CML:
-
Chronic Myelogenous Leukemia
- Dcor:
-
Distance correlation
- EMT:
-
Epithelial-mesenchymal transition
- GEO:
-
Gene Expression Omnibus
- Hoeffding:
-
Hoeffding’s D statistics
- IDA:
-
Intervention calculus when the Directed acyclic graph is Absent
- KS:
-
Kolmogorov–Smirnov
- MI:
-
Mutual Information
- MIC:
-
Maximal Information Coefficient
- miRNA:
-
MicroRNA
- mRNA:
-
Messenger RNA
- ncRNA:
-
Non-coding RNA
- RDC:
-
Random Dependence Coefficient
- Scan:
-
Sample-specific miRNA regulation
- TCGA:
-
The Cancer Genome Atlas
- Wcor:
-
Weighted rank correlation
References
Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136:215–33.
Krol J, Loedige I, Filipowicz W. The widespread regulation of microRNA biogenesis, function and decay. Nat Rev Genet. 2010;11:597–610.
Peng Y, Croce CM. The role of microRNAs in human cancer. Signal Transduct Target Ther. 2016;1:15004.
Hamam R, Hamam D, Alsaleh KA, Kassem M, Zaher W, Alfayez M, et al. Circulating microRNAs in breast cancer: novel diagnostic and prognostic biomarkers. Cell Death Dis. 2017;8:e3045.
Yendamuri S, Calin GA. The role of microRNA in human leukemia: a review. Leukemia. 2009;23:1257–63.
Coradduzza D, Solinas T, Balzano F, Culeddu N, Rossi N, Cruciani S, et al. miRNAs as molecular biomarkers for prostate cancer. J Mol Diagn. 2022;24:1171–80.
De Marzio M, Glass K, Kuijjer ML. Single-sample network modeling on omics data. BMC Biol. 2023;21:296.
Le HS, Bar-Joseph Z. Integrating sequence, expression and interaction data to determine condition-specific miRNA regulation. Bioinformatics. 2013;29:i89-97.
Zhang J, Le TD, Liu L, Liu B, He J, Goodall GJ, et al. Inferring condition-specific miRNA activity from matched miRNA and mRNA expression data. Bioinformatics. 2014;30:3070–7.
Yoon S, Nguyen HCT, Jo W, Kim J, Chi S-M, Park J, et al. Biclustering analysis of transcriptome big data identifies condition-specific microRNA targets. Nucleic Acids Res. 2019;47:e53.
Zhang J, Liu L, Xu T, Zhang W, Zhao C, Li S, et al. Exploring cell-specific miRNA regulation with single-cell miRNA-mRNA co-sequencing data. BMC Bioinformatics. 2021;22:578.
Ludwig N, Leidinger P, Becker K, Backes C, Fehlmann T, Pallasch C, et al. Distribution of miRNA expression across human tissues. Nucleic Acids Res. 2016;44:3865–77.
Dai H, Li L, Zeng T, Chen L. Cell-specific network constructed by single-cell RNA sequencing data. Nucleic Acids Res. 2019;47:e62.
Yu X, Zeng T, Wang X, Li G, Chen L. Unravelling personalized dysfunctional gene network of complex diseases based on differential network model. J Transl Med. 2015;13:189.
Zhang W, Zeng T, Liu X, Chen L. Diagnosing phenotypes of single-sample individuals by edge biomarkers. J Mol Cell Biol. 2015;7:231–41.
Liu X, Wang Y, Ji H, Aihara K, Chen L. Personalized characterization of diseases using sample-specific networks. Nucleic Acids Res. 2016;44:e164.
Kuijjer ML, Tung MG, Yuan G, Quackenbush J, Glass K. Estimating sample-specific regulatory networks. iScience. 2019;14:226–40.
Li L, Dai H, Fang Z, Chen L. c-CSN: Single-cell RNA sequencing data analysis by conditional cell-specific network. Genomics Proteomics Bioinformatics. 2021;19:319–29.
Wang X, Choi D, Roeder K. Constructing local cell-specific networks from single-cell data. Proc Natl Acad Sci U S A. 2021;118:e2113178118.
Chen H-H, Hsueh C-W, Lee C-H, Hao T-Y, Tu T-Y, Chang L-Y, et al. SWEET: a single-sample network inference method for deciphering individual features in disease. Brief Bioinform. 2023;24:bbad032.
Wang Y, Xuan C, Wu H, Zhang B, Ding T, Gao J. P-CSN: single-cell RNA sequencing data analysis by partial cell-specific network. Brief Bioinform. 2023;24:bbad180.
Zeng Y, He Y, Zheng R, Li M. Inferring single-cell gene regulatory network by non-redundant mutual information. Brief Bioinform. 2023;24:bbad326.
Zhang SY, Stumpf MPH. Inferring cell-specific causal regulatory networks from drift and diffusion. In: The 2022 ICML Workshop on Computational Biology. Baltimore, Maryland, USA; 2022. https://icml-compbio.github.io/icml-website-2022/.
Li L, Xia R, Chen W, Zhao Q, Tao P, Chen L. Single-cell causal network inferred by cross-mapping entropy. Brief Bioinform. 2023;24:bbad281.
Le TD, Zhang J, Liu L, Liu H, Li J. miRLAB: An R based dry lab for exploring miRNA-mRNA regulatory relationships. PLoS ONE. 2015;10:e0145386.
Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45:1113–20.
TCGA. 2019. https://www.cancer.gov/tcga.
Wang N, Zheng J, Chen Z, Liu Y, Dura B, Kwak M, et al. Single-cell microRNA-mRNA co-sequencing reveals non-genetic heterogeneity and mechanisms of microRNA regulation. Nat Commun. 2019;10:95.
Wang N, Chen Z, Fan R, Lu J. Single-cell microRNA-mRNA co-sequencing reveals non-genetic heterogeneity and mechanisms of microRNA regulation. GEO. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114071.
Agarwal V, Bell GW, Nam JW, Bartel DP. Predicting effective microRNA target sites in mammalian mRNAs. Elife. 2015;4:e05005.
Li JH, Liu S, Zhou H, Qu LH, Yang JH. starBase v2.0: decoding miRNA-ceRNA, miRNA-ncRNA and protein-RNA interaction networks from large-scale CLIP-Seq data. Nucleic Acids Res. 2014;42:D92–97.
Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35:2769–94.
Hoeffding W. A non-parametric test of independence. Ann Math Stat. 1948;19:546–57.
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, et al. Detecting novel associations in large data sets. Science. 2011;334:1518–24.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
Maathuis MH, Kalisch M, Bühlmann P. Estimating high-dimensional intervention effects from observational data. Ann Stat. 2009;37:3133–64.
Kraskov A, Stögbauer H, Grassberger P. Estimating mutual information. Phys Rev E. 2004;69:066138.
Wilcox R. Introduction to robust estimation and hypothesis testing, 5th Ed. Cambridge: Academic Press; 2021.
Huang JC, Babak T, Corson TW, Chua G, Khan S, Gallie BL, et al. Using expression profiling data to identify human microRNA targets. Nat Methods. 2007;4:1045–9.
Lance GN, Williams WT. Computer programs for hierarchical polythetic classification (“similarity analyses”). Comput J. 1966;9:60–4.
Quinn TP, Richardson MF, Lovell D, Crowley TM. propr: An R-package for identifying proportionally abundant features using compositional data analysis. Sci Rep. 2017;7:16252.
Cantrell CD. Modern mathematical methods for physicists and engineers. Cambridge: Cambridge University Press; 2000.
Watts DJ, Strogatz SH. Collective dynamics of “small-world” networks. Nature. 1998;393:440–2.
Barabási A-L, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5:101–13.
Boyer RS, Moore JS. MJRTY-A fast majority vote algorithm. In: Boyer RS, editor. Automated Reasoning: Essays in Honor of Woody Bledsoe. Dordrecht: Springer, Netherlands; 1991. p. 105–17.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
Larrañaga P, Karshenas H, Bielza C, Santana R. A review on probabilistic graphical models in evolutionary computation. Journal of Heuristics. 2012;18:795–819.
Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9:796–804.
Le TD, Zhang J, Liu L, Li J. Ensemble methods for miRNA target prediction from expression data. PLoS ONE. 2015;10:e0131627.
Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput. 1997;1:67–82.
Liu B, Li J, Tsykin A, Liu L, Gaur AB, Goodall GJ. Exploring complex miRNA-mRNA interactions with Bayesian networks by splitting-averaging strategy. BMC Bioinformatics. 2009;10:408.
Zhang J, Le TD, Liu L, Liu B, He J, Goodall GJ, et al. Identifying direct miRNA-mRNA causal regulatory relationships in heterogeneous data. J Biomed Inform. 2014;52:438–47.
Marx V. Method of the Year: spatially resolved transcriptomics. Nat Methods. 2021;18:9–14.
Zhang J. Scan (v1.0.0). Zenodo. 2024. https://doi.org/10.5281/zenodo.13346480.
Ye X, Brabletz T, Kang Y, Longmore GD, Nieto MA, Stanger BZ, et al. Upholding a role for EMT in breast cancer metastasis. Nature. 2017;547:E1-3.
Neelakantan D, Zhou H, Oliphant MUJ, Zhang X, Simon LM, Henke DM, et al. EMT cells increase breast cancer metastasis via paracrine GLI activation in neighbouring tumour cells. Nat Commun. 2017;8:15773.
Kröger C, Afeyan A, Mraz J, Eaton EN, Reinhardt F, Khodor YL, et al. Acquisition of a hybrid E/M state is essential for tumorigenicity of basal breast cancer cells. Proc Natl Acad Sci U S A. 2019;116:7353–62.
Taube JH, Herschkowitz JI, Komurov K, Zhou AY, Gupta S, Yang J, et al. Core epithelial-to-mesenchymal transition interactome gene-expression signature is associated with claudin-low and metaplastic breast cancer subtypes. Proc Natl Acad Sci U S A. 2010;107:15449–54.
Turner KM, Yeo SK, Holm TM, Shaughnessy E, Guan J-L. Heterogeneity within molecular subtypes of breast cancer. Am J Physiol Cell Physiol. 2021;321:C343–54.
Tan TZ, Miow QH, Miki Y, Noda T, Mori S, Huang RYJ, et al. Epithelial-mesenchymal transition spectrum quantification and its efficacy in deciphering survival and drug responses of cancer patients. EMBO Mol Med. 2014;6:1279–93.
Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNA-seq data. BMC Bioinformatics. 2013;14:7.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol). 1995;57:289–300.
Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E. The role of site accessibility in microRNA target recognition. Nat Genet. 2007;39:1278–84.
Miranda KC, Huynh T, Tay Y, Ang Y-S, Tam W-L, Thomson AM, et al. A pattern-based method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell. 2006;126:1203–17.
Vejnar CE, Zdobnov EM. MiRmap: comprehensive prediction of microRNA target repression strength. Nucleic Acids Res. 2012;40:11673–83.
Paraskevopoulou MD, Georgakilas G, Kostoulas N, Vlachos IS, Vergoulis T, Reczko M, et al. DIANA-microT web server v5.0: service integration into miRNA functional analysis workflows. Nucleic Acids Res. 2013;41 Web Server issue:W169-173.
Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. MicroRNA targets in Drosophila. Genome Biol. 2003;5:R1.
Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, et al. Combinatorial microRNA target predictions. Nat Genet. 2005;37:495–500.
Pearson K. Notes on the history of correlation. Biometrika. 1920;13:25–45.
Spearman C. “General intelligence”, objectively determined and measured. Am J Psychol. 1904;15:201–92.
Kendall MG. A new measure of rank correlation. Biometrika. 1938;30:81–93.
Lopez-Paz D, Hennig P, Schölkopf B. The randomized dependence coefficient. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1. Red Hook, NY, USA: Curran Associates Inc; 2013. p. 1–9.
Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, et al. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE. 2010;5:e9202–e9202.
Zar J. Biostatistical analysis. Old Bridge: Prentice-Hall/Pearson; 2010.
Deza E, Deza MM. Dictionary of distances. Amsterdam: Elsevier; 2006.
Deza MM, Deza E. Encyclopedia of distances. In: Deza E, Deza MM, editors. Encyclopedia of Distances. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009. p. 1–583.
Craw S. Manhattan distance. In: Sammut C, Webb GI, editors. Encyclopedia of Machine Learning. Boston, MA: Springer, US; 2010. p. 639–639.
Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26:297–302.
Duda R, Hart P, G.Stork D. Pattern classification. Hoboken: Wiley Interscience; 2001.
Mahalanobis PC. On the generalized distance in statistics. Proceed National Institute Sci (Calcutta). 1936;2:49–55.
Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal Complex Syst. 2006;1695:1–9.
Massey FJ. The Kolmogorov-Smirnov test for goodness of fit. J Am Stat Assoc. 1951;46:68–78.
Zaidi F. Small world networks and clustered small world networks with random connectivity. Soc Netw Anal Min. 2013;3:51–63.
Student. The probable error of a mean. Biometrika. 1908;6:1–25.
Zhang J, Liu L, Zhang W, Li X, Zhao C, Li S, et al. miRspongeR 2.0: an enhanced R package for exploring miRNA sponge regulation. Bioinform Adv. 2022;2:vbac063.
Chen J, Lin J, Hu Y, Ye M, Yao L, Wu L, et al. RNADisease v4.0: an updated resource of RNA-associated diseases, providing RNA-disease analysis, enrichment and prediction. Nucleic Acids Res. 2023;51:D1397-1404.
Huang H, Lin Y-C, Cui S, Huang Y, Tang Y, Xu J, et al. miRTarBase update 2022: an informative resource for experimentally validated miRNA-target interactions. Nucleic Acids Res. 2022;50:D222–30.
Karagkouni D, Paraskevopoulou MD, Chatzopoulos S, Vlachos IS, Tastsoglou S, Kanellos I, et al. DIANA-TarBase v8: a decade-long collection of experimentally supported miRNA-gene interactions. Nucleic Acids Res. 2018;46:D239–45.
Zhang J, Liu L, Xu T, Zhang W, Zhao C, Li S, et al. miRSM: an R package to infer and analyse miRNA sponge modules in heterogeneous data. RNA Biol. 2021;18:2308–20.
Acknowledgements
The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga, and GEO repository: https://www.ncbi.nlm.nih.gov/geo/.
Funding
JZ was supported by the National Natural Science Foundation of China (Grant Number: 61963001), the Yunnan Xingdian Talents Support Plan—Young Talents Program, and the Doctoral Scientific Research Foundation of Dali University. CZ was supported by the Applied Basic Research Foundation of Science and Technology of Yunnan Province (Grant Number: 202101BA070001-221). TDL was supported by the ARC DECRA Grant (Grant Number: DE200100200).
Author information
Authors and Affiliations
Contributions
J.Z. and T.D.L. conceived the idea of this work. L.L. and J.L. refined the idea. J.Z. designed and performed the experiments. X.W., C.Z. and Y.L. participated in the design of the study and performed the statistical analysis. J.Z., L.L., T.D.L. and J.L. drafted the manuscript. All authors revised the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12915_2024_2020_MOESM1_ESM.docx
Additional file 1: Figures S1-S4 and Tables S1-S2. Fig. S1 - Cell-specific miRNA regulation using linear interpolation strategy (Scan.interp) in the K562 dataset. Fig. S2 - Cell-specific miRNA regulation using statistical perturbation strategy (Scan.perturb) in the K562 dataset. Fig. S3 - Tissue-specific miRNA regulation using linear interpolation strategy (Scan.interp) in the BRCA dataset. Fig. S4 - Tissue-specific miRNA regulation using statistical perturbation strategy (Scan.perturb) in the BRCA dataset. Table S1 - Runtime of 27 network inference methods with the linear interpolation strategy (Scan.interp) in the K562 and BRCA datasets. Table S2 - Runtime of 27 network inference methods with the statistical perturbation strategy (Scan.perturb) in the K562 and BRCA datasets.
12915_2024_2020_MOESM6_ESM.xlsx
Additional file 6. Breast cancer tissues belonging to epithelial, intermediate epithelial, intermediate mesenchymal and mesenchymal types.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhang, J., Liu, L., Wei, X. et al. Scanning sample-specific miRNA regulation from bulk and single-cell RNA-sequencing data. BMC Biol 22, 218 (2024). https://doi.org/10.1186/s12915-024-02020-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12915-024-02020-x