 Methodology article
 Open access
 Published:
Scanning samplespecific miRNA regulation from bulk and singlecell RNAsequencing data
BMC Biology volume 22, Article number: 218 (2024)
Abstract
Background
RNAsequencing technology provides an effective tool for understanding miRNA regulation in complex human diseases, including cancers. A large number of computational methods have been developed to make use of bulk and singlecell RNAsequencing data to identify miRNA regulations at the resolution of multiple samples (i.e. group of cells or tissues). However, due to the heterogeneity of individual samples, there is a strong need to infer miRNA regulation specific to individual samples to uncover miRNA regulation at the singlesample resolution level.
Results
Here, we develop a framework, Scan, for scanning samplespecific miRNA regulation. Since a single network inference method or strategy cannot perform well for all types of new data, Scan incorporates 27 network inference methods and two strategies to infer tissuespecific or cellspecific miRNA regulation from bulk or singlecell RNAsequencing data. Results on bulk and singlecell RNAsequencing data demonstrate the effectiveness of Scan in inferring samplespecific miRNA regulation. Moreover, we have found that incorporating the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction. In addition, Scan can contribute to construct cell/tissue correlation networks and recover aggregate miRNA regulatory networks. Finally, the comparison results have shown that the performance of network inference methods is likely to be dataspecific, and selecting optimal network inference methods is required for more accurate prediction of miRNA targets.
Conclusions
Scan provides a useful method to help infer samplespecific miRNA regulation for new data, benchmark new network inference methods and deepen the understanding of miRNA regulation at the resolution of individual samples.
Background
As a type of regulatory noncoding RNA (ncRNA), microRNA (miRNA, 19–25 nt in length) can modulate the expression levels of thousands of coding RNAs, thereby playing a key role in biological processes, signaling pathways and pathological processes of human cancers [1,2,3]. Due to the importance of miRNAs in biological system, they have the potential to be biomarkers of human cancers, e.g. breast cancer [4], leukemia [5] and prostate cancer [6] in clinical applications.
There are multiple types of biological networks, including gene regulatory network within a biological sample (cell or tissue). Among the different types of biological networks, gene regulatory network is regarded as an important feature of biological samples, and gives rise to the unique characteristics of each biological sample [7]. In the gene regulation field, miRNA regulation has attracted broad attentions on account of its potential clinical translation. Therefore, inferring and characterizing miRNA regulation is a central problem for revealing miRNA regulatory mechanisms in systems biology. Just like dynamic biological system in a biological sample (cell or tissue), miRNA regulation also tends to be dynamic. For example, by using bulk and singlecell RNAsequencing data, existing studies [8,9,10,11,12] have found that the miRNA regulatory networks are conditionspecific, tissuespecific and cellspecific. Moreover, it is widely known that biological samples are characterized by heterogeneity, and are shown to be unique. Thus, to understand the miRNA regulation specific to biological samples (cancer tissues or cells in particular), it is essential to investigate samplespecific miRNA regulation (i.e. the miRNA regulatory network for a specific sample).
At the singlesample level, previous methods including CSN [13], DEVCnet [14], EdgeBiomarker [15], SSN [16], LIONESS [17], cCSN [18], LocCSN [19], SWEET [20], PCSN [21] and Normi [22] have been presented to identify samplespecific gene regulation (one gene regulatory network for one sample), and the identified gene regulatory networks with these methods are undirected networks. To infer causal regulatory networks (represented as directed networks) specific to individual samples, an inference method [23] using drift–diffusion processes and NME [24] are also proposed. However, the above computational methods [13,14,15,16,17,18,19,20,21,22,23,24] are designed for exploring samplespecific transcriptional regulation rather than samplespecific miRNA regulation. To study miRNA regulation at the singlecell level, CSmiR [11] is introduced to infer cellspecific miRNA regulation from singlecell miRNAmRNA cosequencing data (where each single cell is considered as a sample). As stated in CSmiR, for singlecell miRNAmRNA cosequencing data with less than 100 singlecells, CSmiR needs to interpolate pseudocells by using a resampling technique. However, the original singlecell miRNAmRNA cosequencing data itself contains technical noise (e.g. high dropout rate), thus as a result, the enlarged data by adding pseudocells may aggravate the technical noise.
To model the dynamic regulatory processes of miRNAs at the singlesample level, in this work, we propose a Samplespecific miRNA regulation (Scan) framework to scan samplespecific miRNA regulation from bulk and singlecell RNAsequencing data. For exploring miRNAmRNA regulatory relationships, a large number of network inference methods have been proposed, which can be categorized as Correlation, Distance, Information, Regression, Bayesian, Proportionality, and Causality [25]. Recently, two main strategies, statistical perturbation [16] and linear interpolation [17], have emerged that explicitly predict samplespecific networks. Due to the fact that a single network inference method or strategy does not always perform the best in all types of new data, Scan incorporates 27 network inference methods and two strategies to infer tissuespecific or cellspecific miRNA regulation from bulk or singlecell RNAsequencing data. Given bulk or singlecell RNAsequencing data with or without prior information of miRNAmRNA interactions, 27 wellused network inference methods in Scan can be used to calculate the strength between miRNAs and mRNAs. Based on the strength between miRNAs and mRNAs, Scan adapts two strategies: statistical perturbation and linear interpolation to infer samplespecific miRNA regulatory networks.
By applying Scan to two RNAsequencing datasets, a bulk dataset from breast cancer tissues, and a singlecell dataset from chronic myelogenous leukemia (CML) cells, we demonstrate the effectiveness of Scan in terms of accuracy and efficiency of miRNA target prediction. Moreover, the performance comparison between with and without using prior information indicates that the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction. We also reveal that Scan can help to construct sample correlation network and recover aggregate miRNA regulatory network. The comparison with CSmiR shows that with using the prior information of miRNA targets, Scan mostly has better or similar performance compared with CSmiR [11] in inferring cellspecific miRNA regulation from the chronic myelogenous leukemia dataset. Furthermore, without using the prior information of miRNA targets, the performance of Scan is mostly better than or comparable to the performance of CSmiR in inferring tissuespecific miRNA regulation from the breast cancer dataset. Altogether, Scan provides a useful way to identify samplespecific miRNA regulation in human cancers, and can help to elucidate miRNA regulation mechanisms at the resolution of individual samples.
Results
Samplespecific miRNA regulation (Scan)
To investigate miRNA regulation at the resolution of single samples, we develop an approach called Scan to infer samplespecific miRNA regulation from bulk and singlecell RNAsequencing data (Fig. 1). The bulk RNAsequencing data of 690 breast cancer (BRCA) tissues is obtained from The Cancer Genome Atlas (TCGA) [26, 27] project, and the singlecell RNAsequencing data of 19 half K562 cells (the first human chronic myeloid leukemia cell line) is from [28, 29] (Methods). The optional prior information of miRNAmRNA interactions is from two wellknown databases: TargetScan v8.0 [30] and ENCORI [31] (Methods). For each network inference method, the prior information between miRNAs and mRNAs inferred from the thirdparty miRNAtarget databases is used as an initial structure of miRNAmRNA regulatory network.
Given bulk or singlecell RNAsequencing data (i.e. matched miRNA and mRNA expression data) with or without prior information of miRNAmRNA interactions, Scan apply one of 27 network inference methods (Methods) spanning seven types (including Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality) to infer miRNAmRNA relation matrix. For each sample (cell or tissue), Scan needs to infer two miRNAmRNA relation matrices (one for all samples and the other for all samples except the sample of interest). Based on the two identified miRNAmRNA relation matrices, Scan further adapts one strategy (statistical perturbation or linear interpolation) to infer miRNA regulatory network specific to the sample of interest (Methods).
Samplespecific miRNA regulation is closely associated with the selected strategy
Using the miRNAmRNA cosequencing data in K562 and BRCA, we apply Scan to infer samplespecific miRNA regulation across 19 half K562 cells and 690 BRCA tissues. For the K562 dataset, Scan uses 27 network inference methods to infer miRNAmRNA relation matrices. We do not report the results of four methods (including Dcor [32], Hoeffding [33], MIC [34] and Ridge [35]) due to long runtime and one method (IDA [36]) because of the combination of long runtime and large memory usage for the BRCA dataset, and we have stopped their computations before the results are produced. Therefore, Scan uses 22 network inference methods to identify miRNAmRNA relation matrices for the BRCA dataset. For each sample (cell or tissue), each network inference method predicts two miRNAmRNA relation matrices (one for all samples and the other for all samples except the sample of interest). Based on the identified miRNAmRNA relation matrices, Scan further applies two strategies Scan.interp (using linear interpolation strategy) and Scan.perturb (using statistical perturbation strategy) to construct samplespecific miRNA regulatory networks across 19 half K562 cells and 690 BRCA tissues.
To systematically compare the performance of different combinations of network inference methods and strategies used, we have analyzed a total of 1026 networks: 27 (network inference methods) times 2 (strategies) times 19 (cells) in the K562 dataset, and a total of 30,360 networks: 22 (network inference methods) times 2 (strategies) times 690 (tissues) in the BRCA dataset. Overall, we have found that using different network inference methods results in different number of cell and tissue specific miRNAmRNA interactions (Additional file 1: Fig. S1a, S2a, S3a and S4a). Particularly, Dcor [32] has the largest number of cellspecific miRNAmRNA interactions predicted, and MI [37] and MIC [34] also predict relatively higher number of cellspecific miRNAmRNA interactions (Additional file 1: Fig. S1a and S2a). A possible reason is that in contrast to other network inference methods, Dcor, MI and MIC can measure both linear and nonlinear associations between miRNAs and mRNAs. Furthermore, when the prior information of miRNA targets is used, more cell and tissue specific miRNA regulatory networks are found to have their node degree obey a power law distribution and thus are regarded as scalefree networks (Additional file 1: Fig. S1b, S2b, S3b and S4b). This result may be explained that the prior information of miRNA targets follows a power law distribution. Interestingly, with or without using the prior information of miRNA targets, the node degree of the identified cellspecific miRNA regulatory networks by Dcor [32] does not follow a power law distribution. The possible reason is that the number of predicted targets for each miRNA by Dcor obeys a uniform distribution. Finally, in terms of the predicted miRNAmRNA interactions, the similarity of the majority of cell or tissue pairs is less than 0.50 (Additional file 1: Fig. S1c, S2c, S3c and S4c), indicating the heterogeneity of miRNA regulation for each K562 cell or BRCA tissue.
When Scan.interp is used for inferring cellspecific miRNA regulation, using Lasso [35] for strength calculation gives the best final performance in terms of accuracy and using Bcor [38] leads to the highest efficiency (Additional file 1: Fig. 2a, Table S1). Using GenMiR + + [39] for strength calculation gives the best final results in terms of accuracy and whereas using Bcor [38] has the highest efficiency when Scan.interp is used for inferring tissuespecific miRNA regulation (Additional file 1: Fig. 2b, Table S1). Overall, when using Scan.interp for inferring cell and tissuespecific miRNA regulation, Canberra [40] and Phis [41] has the largest overall rank scores, respectively. When Scan.perturb is used for inferring cellspecific miRNA regulation, using Lasso has the best final performance in terms of accuracy whereas using Bcor leads to the highest efficiency (Additional file 1: Fig. 2a, Table S2). Using Rhop [41] gives the best final results in terms of accuracy whereas using Bcor has the highest efficiency when Scan.perturb is used for inferring tissuespecific miRNA regulation (Additional file 1: Fig. 2b, Table S2). Overall, when using Scan.perturb for inferring cell and tissue specific miRNA regulation, using Chebyshev [42] and Rhop [41] displays the best overall performance, respectively. The fact that no consensus in terms of the best or better network inference methods found in both K562 and BRCA datasets is caused by two potential reasons. Firstly, the RNAsequencing technology and biological samples used by the K562 and BRCA datasets are different. Compared with the BRCA dataset, the K562 dataset has higher dropout rate and lower gene detection rate. In the case of high dropout rate and low gene detection rate, the statistic of most of distancebased methods is still reliable to measure the distance between miRNAs and mRNAs. Secondly, the performance of different network inference methods is evaluated with both accuracy and efficiency. Compared with distancebased methods, the proportionalitybased methods have better efficiency with two strategies (Scan.interp and Scan.perturb), thus have higher efficiency on the BRCA dataset. Moreover, in the BRCA dataset, the statistics of the proportionalitybased methods have better accuracy in predicting miRNA targets.
In terms of accuracy and efficiency, Lasso and Bcor are consistent in the K562 dataset regardless of which strategy is used for inferring cellspecific miRNA regulation. In terms of efficiency, Bcor is consistent in the BRCA dataset regardless of which strategy is used for inferring tissuespecific miRNA regulation. Although the best network inference method (in terms of overall rank score) is different between Scan.interp and Scan.perturb, the network inference methods which lead to the best final result are all distance based network inference methods in the K562 dataset, and are all proportionality based network inference methods in the BRCA dataset. Furthermore, in terms of efficiency, the rank scores obtained by using 27 network inference methods between Scan.interp and Scan.perturb are consistent (pvalue less than 2.22E16 with Cohen’s kappa) in the K562 dataset, and the rank scores obtained by using 22 network inference methods between Scan.interp and Scan.perturb are also consistent (pvalue less than 2.22E16 with Cohen’s kappa) in the BRCA dataset. However, in terms of accuracy and overall, the rank scores obtained by using 27 network inference methods between Scan.interp and Scan.perturb are not consistent (pvalues are 0.53 and 0.69 with Cohen's kappa, respectively) in the K562 dataset, and the rank scores obtained by using 22 network inference methods between Scan.interp and Scan.perturb are also not consistent (pvalues are 0.58 and 0.75 with Cohen's kappa, respectively) in the BRCA dataset. This result suggests that the performance of the network inference methods is closely associated with the selected strategy.
Prior information can generally improve the accuracy of miRNA target prediction
To understand whether the prior information of miRNA targets can improve the accuracy of miRNA target prediction, we compare the average percentage of validated miRNA targets predicted by Scan with and without using the prior information of miRNA targets. We find that incorporating the prior information of miRNA targets generally results in a larger average percentage of validated miRNA targets in both K562 and BRCA datasets (Fig. 3). Moreover, the higher the confidence of the prior information is, generally the larger the average percentage of validated miRNA targets in both K562 and BRCA datasets is (Fig. 3). In particular, after incorporating the prior information of miRNA targets, five network inference methods (Jaccard, Dice, Chebyshev, Euclidean and Zscore) with Scan.perturb show different performances between K562 and BRCA datasets. This is because after incorporating the prior information of miRNA targets, most of samplespecific miRNA targets identified by these network inference methods with Scan.perturb are not in the validated miRNAtarget databases due to incomplete knowledge of miRNA targets, and the percentage of validated miRNA targets specific to many K562 cells or BRCA tissues is 0. Consequently, for the five network inference methods with using prior information, the average percentage of validated miRNA targets across all of K562 cells or BRCA tissues displays small or no improvement in contrast to without using prior information. Overall, our result demonstrates that the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction.
Application of samplespecific miRNA regulation
With the K562 dataset, when Scan.interp and Scan.perturb are used together with Canberra and Chebyshev respectively, the best result in terms of overall rank score is achieved. For the BRCA dataset, when Scan.interp and Scan.perturb are used together with Phis and Rhop respectively, the best result in terms of overall rank score is obtained. As a result, we select the four optimal combinations (Scan.interp_Canberra, Scan.perturb_Chebyshev, Scan.interp_Phis, Scan.perturb_Rhop) without using prior information to infer cellspecific and tissuespecific miRNA regulatory networks respectively for the two datasets. The identified samplespecific miRNA regulatory networks can be further applied into downstream analysis. In this section, we are interested in two types of downstream analysis, including constructing correlation networks of K562 cells and BRCA tissues, and finding dynamic and conservative miRNA regulation across K562 cells and BRCA tissues.
Correlation networks of K562 cells and BRCA tissues
The correlation networks of K562 cells and BRCA tissues are useful for investigating the correlation or association among cells or tissues. We use the inferred cellspecific (or tissuespecific) miRNA regulatory networks to generate cell–cell (or tissuetissue) similarity matrices. Based on the generated cell–cell and tissuetissue similarity matrices, we further construct correlation networks of K562 cells and BRCA tissues respectively (Methods). As a result, we have constructed two correlation networks for the K562 cells and two correlation networks for the BRCA tissues (Additional file 2). Based on the identified cellspecific miRNA regulatory networks by Scan.interp_Canberra and Scan.perturb_Chebyshev, the numbers of cell–cell correlation pairs are 171 and 22 respectively for the K562 dataset, Based on the identified tissuespecific miRNA regulatory networks by Scan.interp_Phis and Scan.perturb_Rhop, the numbers of tissuetissue correlation pairs are 51,769 and 8805 respectively for the BRCA dataset. Network topological analysis reveals that the node degree of the identified cell–cell correlation network (10 nodes and 22 edges) by Scan.perturb_Chebyshev obeys a power law distribution (pvalue more than 0.99 with Kolmogorov–Smirnov test), and the node degrees of two identified tissuetissue correlation networks by Scan.interp_Phis and Scan.perturb_Rhop also follow power law distributions (pvalues more than 0.85 with Kolmogorov–Smirnov test). Additionally, the identified cell–cell correlation network (19 nodes and 171 edges) by Scan.interp_Canberra is a fully connected network, and it is also a smallworld network with higher densities than its corresponding random networks (pvalue less than 2.22E16 with Student’s ttest). This result indicates that the identified correlation networks of cells or tissues tend to be scalefree or smallworld networks, which are important characteristics of realworld biological networks [43, 44].
Dynamic and conservative miRNA regulation across K562 cells and BRCA tissues
Based on the identified cellspecific or tissuespecific miRNA regulatory networks, we further investigate the dynamic and conservative miRNA regulation across K562 cells and BRCA tissues (Methods). With both K562 and BRCA datasets, the numbers of dynamic and conservative miRNAmRNA interactions identified tend to be varying (Additional file 3). With the K562 dataset, the node degrees of the identified dynamic and conservative miRNA regulatory networks by Scan.interp_Canberra do not follow power law distributions (pvalues are 2.41E05 and 3.77E04 with Kolmogorov–Smirnov test, respectively), but the node degrees of the identified dynamic and conservative miRNA regulatory networks by Scan.perturb_Chebyshev obey power law distributions (pvalues are 1.00 and 0.99 with Kolmogorov–Smirnov test, respectively). With the BRCA dataset, the node degrees of the identified dynamic miRNA regulatory networks by Scan.interp_Phis and the identified conservative miRNA regulatory networks by Scan.perturb_Rhop follow power law distributions (pvalues are 0.21 and 1.00 with Kolmogorov–Smirnov test, respectively). However, with the BRCA dataset, the node degrees of the identified conservative miRNA regulatory networks by Scan.interp_Phis and the identified dynamic miRNA regulatory networks by Scan.perturb_Rhop do not obey power law distributions (pvalues are 2.77E33 and 9.47E05 with Kolmogorov–Smirnov test, respectively).
In addition, the numbers of dynamic and conservative hub miRNAs by Scan.interp and Scan.perturb are also different (Additional file 4). We further conduct validation analysis to investigate whether the identified dynamic and conservative hub miRNAs are closely associated with CML and BRCA. With the K562 dataset, 4 out of 16 dynamic hub miRNAs and 1 out of 1 conservative hub miRNA identified by Scan.interp_Canberra are confirmed to be CMLrelated miRNAs with experimental evidence. With the BRCA dataset, 35 out of 44 dynamic hub miRNAs and 15 out of 15 conservative hub miRNAs identified by Scan.interp_Phis are confirmed to be BRCArelated miRNAs, and 39 out of 39 dynamic hub miRNAs identified by Scan.perturb_Rhop are validated to be BRCArelated miRNAs with experimental support. This validation result implies that dynamic and conservative hub miRNAs across K562 cells or BRCA tissues are closely associated with CML or BRCA.
Aggregate miRNA regulation exhibits to be various between singlesample and multisample resolution levels
At the multisample resolution level, we apply network inference methods to identify aggregate miRNA regulatory networks across K562 cells and BRCA tissues. Here, the way of identifying aggregate miRNA regulatory networks at the resolution of multiple samples is named as Aggregate. At the singlesample resolution level, we use the majority voting algorithm [45] to infer aggregate miRNA regulatory networks from the identified samplespecific miRNA regulatory networks by Scan.interp and Scan.perturb (Methods). By comparing the aggregate miRNA regulation between singlesample and multisample resolution levels, we have discovered that the similarity between the identified aggregate miRNA regulatory networks (Scan.interp vs Aggregate, Scan.perturb vs Aggregate) is mostly less than 50% and 70% in the K562 and BRCA datasets, respectively (Fig. 4a and c). This result displays that the aggregate miRNA regulation exhibits to be various between singlesample and multisample resolution levels. Additionally, Scan.interp and Scan.perturb are comparable to Aggregate in the percentage of validated aggregate miRNAmRNA interactions (Fig. 4b and d). This finding indicates that integrating samplespecific miRNA regulatory networks may contribute to recover aggregate miRNA regulatory networks across K562 cells and BRCA tissues.
Performance comparison with CSmiR
To evaluate the effectiveness of Scan, we compare the performance of Scan (Scan.interp and Scan.perturb) with CSmiR [11] on the K562 and BRCA datasets. To our knowledge, CSmiR is the first and the only existing method to explore miRNA regulation at a singlecell resolution level, so we use it as the baseline for comparison. With the K562 dataset, without using the prior information of miRNA targets, Scan.interp with 8 network inference methods and Scan.perturb with 7 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5a). After integrating the prior information of miRNA targets from TargetScan, Scan.interp with 14 network inference methods and Scan.perturb with 18 network inference methods perform better than CSmiR in terms of accuracy, and after integrating the prior information of miRNA targets from ENCORI, Scan.interp with 21 network inference methods and Scan.perturb with 20 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5a). This result shows that in terms of accuracy, Scan mostly performs better than or comparable to CSmiR with the K562 dataset when using the prior information of miRNA targets. Furthermore, in terms of efficiency, both Scan.interp and Scan.perturb with all of the network inference methods perform better than CSmiR (Fig. 5b).
With the BRCA dataset, without using the prior information of miRNA targets, Scan.interp with 11 network inference methods and Scan.perturb with 12 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5c). After integrating the prior information of miRNA targets from TargetScan, Scan.interp with 7 network inference methods and Scan.perturb with 10 network inference methods perform better than CSmiR in terms of accuracy, and after integrating the prior information of miRNA targets from ENCORI, Scan.interp with 4 network inference methods and Scan.perturb with 7 network inference methods perform better than CSmiR in terms of accuracy (Fig. 5c). This result shows that in terms of accuracy, Scan mostly performs better than or comparable to CSmiR with the BRCA dataset when without using the prior information of miRNA targets. In addition, in terms of efficiency, Scan.interp with 3 network inference methods and Scan.perturb with 9 network inference methods perform better than CSmiR (Fig. 5d).
The above performance comparison suggests that in the smallscale K562 dataset, Scan with prior information generally performs better than or comparable to CSmiR in terms of accuracy and efficiency. In the largescale BRCA dataset, Scan without using prior information generally performs better than or comparable to CSmiR in terms of accuracy, but in terms of efficiency, only a few instances of Scan (Scan with a few network inference methods) perform better than CSmiR. The inconsistent result of performance comparison with the two datasets implies that the performance of Scan is more likely to be dataspecific, and it is necessary to select optimal network inference methods for new data.
Discussion
It is wellestablished that miRNAs are important regulators of gene expression, and their dysregulations can lead to the occurrence and development of complex human diseases, including cancers. Yet, the research on miRNA regulation at the resolution of single samples (cells or tissues) is still limited. Here, we present the Scan framework, and further show the effectiveness of it in inferring samplespecific miRNA regulation. By applying Scan into bulk and singlecell RNAsequencing data, we have discovered that adding the prior information of miRNA targets can generally improve the accuracy of miRNA target prediction. By instantiating the Scan framework with 27 network inference methods, we have found that the performance of Scan instantiated with different network inference methods exhibits to be dataspecific. In addition, the identified samplespecific miRNA regulatory network by Scan can be used for downstream analysis, e.g. constructing sample correlation network. The freely available framework Scan provides a useful method for exploring miRNA regulation at the resolution of single samples.
For constructing miRNAmRNA relation matrix, Scan includes 27 network inference methods spanning seven types (Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality). It is noted that other types of computational methods (e.g. deep learning [46], probabilistic modeling [47]) can be also plugged into Scan for generating miRNAmRNA relation matrix. In future, it is our plan to add more types of network inference methods. Additionally, the noise exist in bulk and singlecell RNAsequencing data, as well as gene and cell dropouts, call for more robust methods to predict miRNA regulatory network. For network inference, ensemble methods based on the wisdom of crowds has demonstrated better performance than individual methods [48, 49]. Therefore, for robust miRNAmRNA relation matrix, we will incorporate ensemble methods (i.e. integrating the prediction results from individual methods) into Scan.
The incorporation of domain knowledge can reduce the searching space and result in approximate optimal solutions [50]. As the domain knowledge, the prior information of miRNA targets provides an initial structure of miRNA regulatory network to constrain the searching space of network inference methods. In this work, we have demonstrated that combining prior information can improve the performance of miRNA target prediction, and higher confidence of prior information leads to better accuracy of miRNA targets. This result is consistent with previous studies [51, 52]. However, it is noted that this higher accuracy comes at the cost of reducing the number of miRNAmRNA interactions predicted. It is expected that more comprehensive and highconfidence prior information will alleviate this issue.
We provide two strategies, Scan.interp (linear interpolation strategy) and Scan.perturb (statistical perturbation strategy) to infer miRNA regulation specific to single cells or tissues. Generally, using the same network inference method, Scan.interp can generate larger cellspecific or tissuespecific miRNA regulatory networks than those by Scan.perturb. This can be explained by the fact that Scan.interp infers cellspecific miRNA regulatory networks with differential values (e.g. differential correlation, distance or regression) between all samples and all samples except sample k, but Scan.perturb identifies differential miRNA regulatory networks between the networks of using all samples and the networks of using all samples except sample k (see details in Methods). Moreover, the runtime of Scan.interp and Scan.perturb with the K562 and BRCA datasets is similar. For new data, we suggest use both Scan.interp and Scan.perturb for comprehensively exploring miRNA regulation specific to individual samples.
To infer samplespecific miRNA regulation, we have applied Scan into bulk and singlecell RNAsequencing data across tumor cells or tissues. Certainly, Scan is also applicable in bulk and singlecell RNAsequencing data across healthy cells or tissues. In future, it will be a meaningful direction to reveal dynamic and conservative miRNA regulation between the matched tumor and healthy cells or tissues. Moreover, with the advancement of spatial RNAsequencing technology [53], Scan is also a potential method to explore the heterogeneity of miRNA regulation between different spatial positions from spatial RNAsequencing data.
When Scan is applied to largescale transcriptomics data, the process of inferring miRNAmRNA relation matrix using network inference methods will be computationally intensive. This is a common issue of existing computational methods, including Scan. To alleviate the issue, users can allocate more CPU cores to identify miRNAmRNA relation matrix in parallel. Furthermore, users can also use the prior information of samples to divide samples into several subtypes, and then focus on investigating samplespecific miRNA regulation across cells or tissues of each subtype. Finally, selecting a fast network inference method (e.g. Pearson and Bcor) is also a feasible way to quickly infer samplespecific miRNA regulation from largescale transcriptomics data.
To help users to select the most appropriate inference method for their own dataset, we have provided an extensive tutorial of Scan at https://github.com/zhangjunpeng411/Scan or https://doi.org/10.5281/zenodo.13346480 [54]. For a largescale dataset (e.g. the number of samples is more than 100), we recommend users selecting the network inference methods with better efficiency (i.e. less runtime). In this tutorial, we select five representative network inference methods (Pearson, Euclidean, MI, Lasso, Phit) spanning five types (Correlation, Distance, Information, Regression and Proportionality) and two strategies (Scan.interp and Scan.perturb) to infer cellspecific miRNA regulation from smallscale K562 singlecell RNAsequencing data. The prior information of miRNA targets has three cases: None (no prior information), TargetScan (prior information of TargetScan), ENCORI (prior information of ENCORI). This tutorial shows that the combination (Scan.perturb and Pearson) is the most appropriate inference method for inferring cellspecific miRNA regulation from K562 singlecell RNAsequencing data.
In addition to studying miRNA regulation at the singlesample resolved level, Scan also can be used for other types of gene regulation specific to individual samples, e.g. transcriptional regulation, long noncoding RNA regulation, circular RNA regulation and PIWIinteracting RNA regulation.
Since miRNAs play important roles in tumor microenvironments, it is important for us to understand the regulation of miRNAs in tumor cells or tissues. Indepth investigation of miRNA regulation specific to individual tumor cells or tissues will help to figure out the heterogeneity of tumor microenvironments and discover novel subtypes of tumor cells or tissues.
Conclusions
Taken altogether, we provide a useful method called Scan to identify samplespecific miRNA regulation that helps understand the heterogeneity of individual samples. The presented results provide insights to uncover the heterogeneity of miRNA regulation across K562 cells and BRCA tissues. We believe that Scan is applicable to explore miRNA regulation mechanisms at the resolution of individual samples.
Methods
Bulk and singlecell RNAsequencing data
From TCGA [26, 27], we have obtained the normalized expression data of 894 miRNAs and 19,068 mRNAs in 690 matching breast cancer tissues. Since epithelialmesenchymal transition (EMT) is closely related to the development, progression and metastasis of breast cancer [55,56,57] and EMT signatures associates closely with breast cancer subtypes [58, 59], we obtain a list of 315 EMT signatures [60] (Additional file 5) and further divide the matched 690 breast cancer tissues into four EMT types (epithelial, intermediate epithelial, intermediate mesenchymal and mesenchymal) by using GSVA R package [61]. As a result, the numbers of breast cancer tissues belonging to epithelial, intermediate epithelial, intermediate mesenchymal and mesenchymal types are 491, 107, 46 and 46, respectively (Additional file 6). By using the limmatrend approach in limma R package [62], we have identified 163 miRNAs and 5801 mRNAs that are differentially expressed between epithelial and mesenchymal type (adjusted pvalue < 0.01, fold change > 1.5) (Additional file 7). Here, the pvalues are adjusted by the Benjamini–Hochberg (BH) method [63]. Therefore, the input bulk RNAsequencing data used in our case study includes the expression data of 163 miRNAs and 5801 mRNAs in 690 breast cancer tissues.
The normalized K562 singlecell RNAsequencing data (the accession number is GSE114071 in Gene Expression Omnibus (GEO) database) includes the expression data of 2822 miRNAs and 21,704 mRNAs in 19 half K562 cells (the first human chronic myeloid leukemia cell line) [28, 29]. In the singlecell RNAsequencing data, we calculate the average expression values of the duplicate miRNAs or mRNAs as their ultimate expression values. As a feature selection step, we discard the miRNAs or mRNAs with constant expression values, and we are interested in the high expression miRNAs or mRNAs with mean values greater than the median of mean values of all nonconstant expression miRNAs or mRNAs. Moreover, the reserved miRNA data and mRNA expression data are then logtransformed. As a result, the input singlecell RNAsequencing data used in our case study includes the expression data of 212 miRNAs and 7680 mRNAs in 19 half K562 cells.
Prior information of miRNA targets
To improve the prediction of samplespecific miRNA regulation, the miRNA target information in TargetScan v8.0 [30] and ENCORI [31] (the pilot version is starBase) is used as prior information for Scan. From TargetScan, a list of 235,109 predicted miRNAmRNA interactions has been obtained. To identify miRNAmRNA interactions, ENCORI provides seven prediction tools (PITA [64], RNA22 [65], miRmap [66], DIANAmicroT [67], miRanda [68], PicTar [69] and TargetScan [30]) for users. To obtain highconfidence miRNAmRNA interactions, we only retain the miRNAmRNA interactions predicted by at least five prediction algorithms. In total, a list of 55,343 highconfidence miRNAmRNA interactions is obtained from ENCORI. The number of overlapped miRNAmRNA interactions between TargetScan v8.0 and ENCORI is 48,266, indicating that most of highconfidence miRNAmRNA interactions (87.21%) in ENCORI are included in TargetScan v8.0.
Network inference methods
We select 27 network inference methods (Table 1) for constructing miRNAmRNA relation matrix, including Pearson [70], Spearman [71], Kendall [72], Distance correlation (Dcor) [32], Random Dependence Coefficient (RDC) [73], Hoeffding’s D statistics (Hoeffding) [33], Zscore [74], Biweight midcorrelation (Bcor) [38], Weighted rank correlation (Wcor) [75], Cosine [76], Euclidean [77], Manhattan [78], Canberra [40], Chebyshev [42], Dice [79], Jaccard [80], Mahalanobise [81], Mutual Information (MI) [37], Maximal Information Coefficient (MIC) [34], Lasso [35], Elastic [35], Ridge [35], GenMiR++ [39], \(\phi\)(Phit) [41], \(\phi_{{\text{s}}}\)(Phis) [41], \(\rho_{{\text{p}}}\)(Rhop) [41], and Intervention calculus when the Directed acyclic graph is Absent (IDA) [36]. These methods can be divided into seven types: Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality. Ten methods (Pearson, Spearman, Kendall, Dcor, RDC, Hoeffding, Zscore, Bcor, Wcor and Cosine) belong to the Correlation type, seven methods (Euclidean, Manhattan, Canberra, Chebyshev, Dice, Jaccard and Mahalanobise) belong to the Distance type, two methods (MI and MIC) are of the Information type, three methods (Lasso, Elastic and Ridge) belong to the Regression type, three methods (Phit, Phis and Rhop) belong to the Proportionality type, and GenMiR + + and IDA are of the Bayesian and Causality types, respectively.
Samplespecific network inference
Given bulk or singlecell RNAsequencing data with m samples (i.e. matched miRNA and mRNA expression data), with or without prior information of miRNAmRNA interactions, Scan constructs two miRNAmRNA relation matrices (one for all samples and the other for all samples except the kth sample of interest,\(k \in [1,m]\)) by using one of the network inference methods. The two constructed miRNAmRNA relation matrices (of p miRNAs and q mRNAs) for all samples and all samples except the kth sample (denoted as \(X^{(k)}\) and \(Y^{(k)}\) respectively) are as follows.
where \(X_{ij}^{(k)}\) and \(Y_{ij}^{(k)}\) represent the connection between mRNA i and miRNA j.
For the network inference methods which are of the Correlation, Information, Regression, Bayesian, Proportionality and Causality types, larger absolute values of \(X_{ij}^{(k)}\) and \(Y_{ij}^{(k)}\) indicate higher connections between mRNA i and miRNA j. In contrast, for Distance based network inference methods, larger absolute values of \(X_{ij}^{(k)}\) and \(Y_{ij}^{(k)}\) indicate weaker connections between mRNA i and miRNA j. To keep consistency between the constructed miRNAmRNA relation matrices by different types of network inference methods, we transform the two miRNAmRNA relation matrices (\(X^{(k)}\) and \(Y^{(k)}\)) generated by Distance based methods into \(X{\prime}^{(k)}\) and \(Y{\prime}^{(k)}\) as follows:
where eps refers to the precision of floating numbers (the value is 2.22E16 in default).
After transformation, a larger absolute value in a relation matrix generated by any type of the network inference methods indicates a higher connection between a miRNA and a mRNA.
Linear interpolation strategy
Following the use of a network inference method of the Correlation, Information, Regression, Bayesian, Proportionality or Causality type to obtain the two constructed miRNAmRNA relation matrices (\(X^{(k)}\) and \(Y^{(k)}\)), Scan applies a linear interpolation strategy [17] to estimate the miRNAmRNA relation matrix \(Z^{(k)}\) specific to sample k as follows:
where m denotes the number of samples in bulk or singlecell RNAsequencing data.
When a Distance based network inference method is used to obtain the two constructed miRNAmRNA relation matrices (\(X{'}^{(k)}\) and \(Y{'}^{(k)}\)), the miRNAmRNA relation matrix \(Z^{(k)}\) specific to sample k is estimated as:
\(Z_{ij}^{(k)}\) is further normalized as:
where \(\mu^{(k)}\) and \(\sigma^{(k)}\) denote the mean value and standard deviation of \(Z^{(k)}\), and the normalized miRNAmRNA relation matrix \(Z^{^{\prime}(k)}\) is:
Each value of \(Z_{ij}^{^{\prime}(k)}\) corresponds to a significance pvalue. The pvalue is calculated as follows:
where \(Z_{ij}^{^{\prime}(k)} \) is the absolute value of \(Z_{ij}^{^{\prime}(k)}\), and the pnorm function is used to calculate the probability a random value from the standard normal distribution being less than \(Z_{ij}^{^{\prime}(k)} \).
Smaller value of \(p_{ij}^{(k)}\) indicates that miRNA j is more likely to interact with mRNA i in sample k. If the value of \(p_{ij}^{(k)}\) is less than a cutoff (e.g. 0.05), miRNA j is considered to interact with mRNA i in sample k.
Statistical perturbation strategy
For the Correlation, Distance, Information, Regression, Bayesian, Proportionality and Causality methods, each value of \(X^{(k)}\)(\(X{'}^{(k)}\)) and \(Y^{(k)}\)(\(Y{'}^{(k)}\)) corresponds to a significance pvalue. The corresponding pvalue matrices of \(X^{(k)}\)(\(X{'}^{(k)}\)) and \(Y^{(k)}\)(\(Y{'}^{(k)}\)) are denoted as \(p_{x}^{(k)}\) and \(p_{y}^{(k)}\), respectively.
Smaller value of \(x_{ij}^{(k)}\) and \(y_{ij}^{(k)}\) indicates that miRNA j is more likely to interact with mRNA i in all samples and all samples except k, respectively. Given a pvalue cutoff (e.g. 0.05), we can obtain two zero–one matrices \(ZO_{x}^{(k)}\) and \(ZO_{y}^{(k)}\) in all samples and all samples except k. By using a statistical perturbation strategy [16], Scan assigns an edge to a miRNAmRNA pair in sample k if the pvalue of that pair is different when the sample is removed from all samples. The miRNAmRNA zero–one matrix \(ZO^{(k)}\) for sample k is calculated as follows:
where xor is the XOR logical function. One value of \(ZO_{ij}^{(k)}\) indicates that miRNA j interacts with mRNA i in sample k, and zero value of \(ZO_{ij}^{(k)}\) represents that miRNA j doesn’t interact with mRNA i in sample k.
In this work, the cutoff of significant pvalue for both linear interpolation and statistical perturbation strategies is set to be 0.05 in default. By applying linear interpolation or statistical perturbation strategy, Scan can identify m samplespecific miRNA regulatory networks across m samples. Each samplespecific miRNA regulatory network is a directed graph where the direction of an edge is from a miRNA to a mRNA.
Network topological analysis
The degree of a node in a samplespecific miRNA regulatory network is the number of connections with other nodes, and the degree distribution denotes the probability distribution of node degrees over the samplespecific miRNA regulatory network. If the node degree of a samplespecific miRNA regulatory network obeys a power law distribution, the network is considered as a scalefree network. In this work, we use the R package igraph [82] to calculate the degree distribution of the identified samplespecific miRNA regulatory networks. The Kolmogorov–Smirnov (KS) test [83] is used to determine whether the node degree of a samplespecific miRNA regulatory network obeys a power law distribution. If the pvalue of the KS test is smaller than a cutoff (e.g. 0.05), the node degree of a samplespecific miRNA regulatory network does not obey a power law distribution, suggesting that the samplespecific miRNA regulatory network is not a scalefree network, and vice versa.
The network density is used to measure the tightness of a samplespecific miRNA regulatory network. Compared with the random networks, a network with a higher density tends to be a smallworld network [84]. We also use the R package igraph [82] to calculate the density of the inferred samplespecific miRNA regulatory networks. We generate 100 random networks (the same number of nodes and edges with real network) for each samplespecific miRNA regulatory network to determine whether the network is smallworld network. The Student’s ttest [85] is applied to judge whether the density of a samplespecific miRNA regulatory network is higher than that of its corresponding random networks at a significance level. For a samplespecific miRNA regulatory network, if the pvalue of the Student’s ttest is smaller than a pvalue cutoff (e.g. 0.05), it is regarded as a smallworld network.
Sample correlation network construction
We use samplespecific miRNA regulatory networks to compute the similarities between samples. In the context of samplespecific miRNA regulatory networks, the similarity between samples a and b is calculated in the following.
where \(Net_{a}\) and \(Net_{b}\) represent the miRNA regulatory networks specific to samples a and b, respectively, \(Net_{a} \cap Net_{b} \) is the number of common miRNAmRNA interactions between \(Net_{a}\) and \(Net_{b}\), and \(min(Net_{a} ,Net_{b} )\) denotes the smaller number of miRNAmRNA interactions between \(Net_{a}\) and \(Net_{b}\).
For m samples, the samplesample similarity matrix SM (a symmetric matrix) is:
The samplesample similarity matrix is further used for sample correlation network construction [86]. For each sample pair, a higher similarity value indicates that the pair of samples is more correlated with each other. We use a similarity cutoff (e.g. 0.50) to infer whether two samples are correlated or not. In other words, if the similarity value between samples a and b is larger than the cutoff, samples a and b are correlated with each other. After assembling the correlated sample pairs, we can construct a sample correlation network.
Dynamic and conservative analysis
The miRNAmRNA interaction existing in only one cell or tissue are defined as a dynamic miRNAmRNA interaction, whereas the miRNAmRNA interaction existing in at least a half of cells or tissues are defined as a conservative miRNAmRNA interaction. Given the identified samplespecific miRNA regulatory networks, we can obtain dynamic and conservative miRNAmRNA interactions which form dynamic and conservative miRNA regulatory networks, respectively.
Hub miRNA identification
Hub miRNAs are defined as highly connected miRNAs in a dynamic and conservative miRNA regulatory network. In this work, we use the cumulative probability of Poisson distribution to evaluate whether a miRNA is a hub:
where \(\lambda = np\),\(p = \frac{t}{C}\), n is the number of genes (including miRNAs and mRNAs), t is the number of miRNAmRNA interactions in a dynamic or conservative miRNA regulatory network, and \(C\) is the number of all possible miRNAmRNA interaction pairs. Smaller pvalue of a miRNA shows that the miRNA is more likely to be a hub. Here, the cutoff of the pvalue is set to 0.05.
Experimentally validated miRNAs related to CML and BRCA
To understand whether the identified dynamic and conservative hub miRNAs are closely associated with CML and BRCA, we obtain the miRNAs related to CML and BRCA with experimental evidence from RNADisease v4.0 [87], which is a comprehensive RNAdisease resource and includes experimentally validated and computationally predicted RNAdisease associations. If a dynamic or conservative hub miRNA is overlapped with the experimentally validated CMLrelated or BRCArelated miRNAs in RNADisease v4.0, the dynamic or conservative hub miRNA is regarded as a CMLrelated or BRCArelated miRNA.
Comparison metrics
We use two metrics (accuracy and efficiency) to compare different instances of Scan using different network inference methods, and compare Scan with other methods for inferring samplespecific miRNA regulation. For accuracy, due to the lack of groundtruth of miRNAmRNA interactions at the singlesample level, the ground truth of miRNAmRNA interactions at the multisample level are acquired from miRTarBase v9.0 [88] and TarBase v8.0 [89] for validation. We consider that the validation using the experimentally validated miRNAmRNA interactions at the multisample level as an approximate groundtruth (of the groundtruth at the singlesample level) are still useful in screening reliable miRNA targets for subsequent experimental validation at the singlesample level. If a method has a larger percentage of validated miRNAmRNA interactions (the number of validated predictions divided by the total number of predictions), the method is considered having higher accuracy. For efficiency, we compare the runtime of different methods in the same bulk or singlecell RNAsequencing data. If a method takes less runtime in the same bulk or singlecell RNAsequencing data, the method is considered having better efficiency.
In terms of both accuracy and efficiency, we use an overall rank score [90] to evaluate the performance of each method. For the method i, the overall rank score ors_{i} is calculated as:
where \(rs_{i1}\) and \(rs_{i2}\) denote the rank scores of method i in terms of accuracy and efficiency, respectively.
A method with higher accuracy or better efficiency will obtain a larger rank score. A method with a larger overall rank score is regarded as a better or practical method.
Aggregate network inference
Given bulk and singlecell RNAsequencing data, we infer aggregate miRNA regulatory network by using two ways. The first way (called Aggregate) is to identify aggregate miRNA regulatory networks at the resolution of multiple samples (i.e. group of cells or tissues) by using network inference methods. The second way is to predict aggregate miRNA regulatory networks via integrating samplespecific miRNA regulatory networks identified by Scan.interp and Scan.perturb. Since the majority voting algorithm [45] is a simple but effective ensemble learning method, we use it to infer aggregate miRNA regulatory networks from the identified samplespecific miRNA regulatory networks. For the majority voting method, if a samplespecific miRNAmRNA interaction is predicted in more samples, it will be assigned a higher score (i.e. higher rank). To have a fair comparison, the number of top ranked miRNAmRNA interactions in the second way is the same as the number of miRNAmRNA interactions in the first way. Particularly, if the number of unique miRNAmRNA interactions in the second way is less than the number of miRNAmRNA interactions in the first way, we set the number of top ranked miRNAmRNA interactions as the number of unique miRNAmRNA interactions.
Method execution
Each execution of Scan and CSmiR on bulk or singlecell RNAsequencing data is performed in a separate task. Each task is allocated 32 CPU cores of Intel(R) Xeon(R) Platinum 8375C CPU at 2.90 GHz, and one R session is opened for each task. The network inference methods of Scan with runtime more than 10 days or with memory usage more than 256 GB are discarded.
Availability of data and materials
Scan is released under the GPL3.0 License, and is available at https://github.com/zhangjunpeng411/Scan/ or https://doi.org/10.5281/zenodo.13346480 [54]. Gene expression profiles used for Scan were accessed from Gene Expression Omnibus (GEO) at The Cancer Genome Atlas (TCGA) at https://www.cancer.gov/tcga [26, 27] and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114071 [28, 29]. The lists of all the data used in this study are available in the Additional files.
Abbreviations
 Bcor:

Biweight midcorrelation
 BH:

Benjamini–Hochberg
 BRCA:

Breast cancer
 CML:

Chronic Myelogenous Leukemia
 Dcor:

Distance correlation
 EMT:

Epithelialmesenchymal transition
 GEO:

Gene Expression Omnibus
 Hoeffding:

Hoeffding’s D statistics
 IDA:

Intervention calculus when the Directed acyclic graph is Absent
 KS:

Kolmogorov–Smirnov
 MI:

Mutual Information
 MIC:

Maximal Information Coefficient
 miRNA:

MicroRNA
 mRNA:

Messenger RNA
 ncRNA:

Noncoding RNA
 RDC:

Random Dependence Coefficient
 Scan:

Samplespecific miRNA regulation
 TCGA:

The Cancer Genome Atlas
 Wcor:

Weighted rank correlation
References
Bartel DP. MicroRNAs: target recognition and regulatory functions. Cell. 2009;136:215–33.
Krol J, Loedige I, Filipowicz W. The widespread regulation of microRNA biogenesis, function and decay. Nat Rev Genet. 2010;11:597–610.
Peng Y, Croce CM. The role of microRNAs in human cancer. Signal Transduct Target Ther. 2016;1:15004.
Hamam R, Hamam D, Alsaleh KA, Kassem M, Zaher W, Alfayez M, et al. Circulating microRNAs in breast cancer: novel diagnostic and prognostic biomarkers. Cell Death Dis. 2017;8:e3045.
Yendamuri S, Calin GA. The role of microRNA in human leukemia: a review. Leukemia. 2009;23:1257–63.
Coradduzza D, Solinas T, Balzano F, Culeddu N, Rossi N, Cruciani S, et al. miRNAs as molecular biomarkers for prostate cancer. J Mol Diagn. 2022;24:1171–80.
De Marzio M, Glass K, Kuijjer ML. Singlesample network modeling on omics data. BMC Biol. 2023;21:296.
Le HS, BarJoseph Z. Integrating sequence, expression and interaction data to determine conditionspecific miRNA regulation. Bioinformatics. 2013;29:i8997.
Zhang J, Le TD, Liu L, Liu B, He J, Goodall GJ, et al. Inferring conditionspecific miRNA activity from matched miRNA and mRNA expression data. Bioinformatics. 2014;30:3070–7.
Yoon S, Nguyen HCT, Jo W, Kim J, Chi SM, Park J, et al. Biclustering analysis of transcriptome big data identifies conditionspecific microRNA targets. Nucleic Acids Res. 2019;47:e53.
Zhang J, Liu L, Xu T, Zhang W, Zhao C, Li S, et al. Exploring cellspecific miRNA regulation with singlecell miRNAmRNA cosequencing data. BMC Bioinformatics. 2021;22:578.
Ludwig N, Leidinger P, Becker K, Backes C, Fehlmann T, Pallasch C, et al. Distribution of miRNA expression across human tissues. Nucleic Acids Res. 2016;44:3865–77.
Dai H, Li L, Zeng T, Chen L. Cellspecific network constructed by singlecell RNA sequencing data. Nucleic Acids Res. 2019;47:e62.
Yu X, Zeng T, Wang X, Li G, Chen L. Unravelling personalized dysfunctional gene network of complex diseases based on differential network model. J Transl Med. 2015;13:189.
Zhang W, Zeng T, Liu X, Chen L. Diagnosing phenotypes of singlesample individuals by edge biomarkers. J Mol Cell Biol. 2015;7:231–41.
Liu X, Wang Y, Ji H, Aihara K, Chen L. Personalized characterization of diseases using samplespecific networks. Nucleic Acids Res. 2016;44:e164.
Kuijjer ML, Tung MG, Yuan G, Quackenbush J, Glass K. Estimating samplespecific regulatory networks. iScience. 2019;14:226–40.
Li L, Dai H, Fang Z, Chen L. cCSN: Singlecell RNA sequencing data analysis by conditional cellspecific network. Genomics Proteomics Bioinformatics. 2021;19:319–29.
Wang X, Choi D, Roeder K. Constructing local cellspecific networks from singlecell data. Proc Natl Acad Sci U S A. 2021;118:e2113178118.
Chen HH, Hsueh CW, Lee CH, Hao TY, Tu TY, Chang LY, et al. SWEET: a singlesample network inference method for deciphering individual features in disease. Brief Bioinform. 2023;24:bbad032.
Wang Y, Xuan C, Wu H, Zhang B, Ding T, Gao J. PCSN: singlecell RNA sequencing data analysis by partial cellspecific network. Brief Bioinform. 2023;24:bbad180.
Zeng Y, He Y, Zheng R, Li M. Inferring singlecell gene regulatory network by nonredundant mutual information. Brief Bioinform. 2023;24:bbad326.
Zhang SY, Stumpf MPH. Inferring cellspecific causal regulatory networks from drift and diffusion. In: The 2022 ICML Workshop on Computational Biology. Baltimore, Maryland, USA; 2022. https://icmlcompbio.github.io/icmlwebsite2022/.
Li L, Xia R, Chen W, Zhao Q, Tao P, Chen L. Singlecell causal network inferred by crossmapping entropy. Brief Bioinform. 2023;24:bbad281.
Le TD, Zhang J, Liu L, Liu H, Li J. miRLAB: An R based dry lab for exploring miRNAmRNA regulatory relationships. PLoS ONE. 2015;10:e0145386.
Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, et al. The cancer genome atlas pancancer analysis project. Nat Genet. 2013;45:1113–20.
TCGA. 2019. https://www.cancer.gov/tcga.
Wang N, Zheng J, Chen Z, Liu Y, Dura B, Kwak M, et al. Singlecell microRNAmRNA cosequencing reveals nongenetic heterogeneity and mechanisms of microRNA regulation. Nat Commun. 2019;10:95.
Wang N, Chen Z, Fan R, Lu J. Singlecell microRNAmRNA cosequencing reveals nongenetic heterogeneity and mechanisms of microRNA regulation. GEO. 2018. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE114071.
Agarwal V, Bell GW, Nam JW, Bartel DP. Predicting effective microRNA target sites in mammalian mRNAs. Elife. 2015;4:e05005.
Li JH, Liu S, Zhou H, Qu LH, Yang JH. starBase v2.0: decoding miRNAceRNA, miRNAncRNA and proteinRNA interaction networks from largescale CLIPSeq data. Nucleic Acids Res. 2014;42:D92–97.
Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35:2769–94.
Hoeffding W. A nonparametric test of independence. Ann Math Stat. 1948;19:546–57.
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, et al. Detecting novel associations in large data sets. Science. 2011;334:1518–24.
Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
Maathuis MH, Kalisch M, Bühlmann P. Estimating highdimensional intervention effects from observational data. Ann Stat. 2009;37:3133–64.
Kraskov A, Stögbauer H, Grassberger P. Estimating mutual information. Phys Rev E. 2004;69:066138.
Wilcox R. Introduction to robust estimation and hypothesis testing, 5th Ed. Cambridge: Academic Press; 2021.
Huang JC, Babak T, Corson TW, Chua G, Khan S, Gallie BL, et al. Using expression profiling data to identify human microRNA targets. Nat Methods. 2007;4:1045–9.
Lance GN, Williams WT. Computer programs for hierarchical polythetic classification (“similarity analyses”). Comput J. 1966;9:60–4.
Quinn TP, Richardson MF, Lovell D, Crowley TM. propr: An Rpackage for identifying proportionally abundant features using compositional data analysis. Sci Rep. 2017;7:16252.
Cantrell CD. Modern mathematical methods for physicists and engineers. Cambridge: Cambridge University Press; 2000.
Watts DJ, Strogatz SH. Collective dynamics of “smallworld” networks. Nature. 1998;393:440–2.
Barabási AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet. 2004;5:101–13.
Boyer RS, Moore JS. MJRTYA fast majority vote algorithm. In: Boyer RS, editor. Automated Reasoning: Essays in Honor of Woody Bledsoe. Dordrecht: Springer, Netherlands; 1991. p. 105–17.
LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521:436–44.
Larrañaga P, Karshenas H, Bielza C, Santana R. A review on probabilistic graphical models in evolutionary computation. Journal of Heuristics. 2012;18:795–819.
Marbach D, Costello JC, Küffner R, Vega NM, Prill RJ, Camacho DM, et al. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9:796–804.
Le TD, Zhang J, Liu L, Li J. Ensemble methods for miRNA target prediction from expression data. PLoS ONE. 2015;10:e0131627.
Wolpert DH, Macready WG. No free lunch theorems for optimization. IEEE Trans Evol Comput. 1997;1:67–82.
Liu B, Li J, Tsykin A, Liu L, Gaur AB, Goodall GJ. Exploring complex miRNAmRNA interactions with Bayesian networks by splittingaveraging strategy. BMC Bioinformatics. 2009;10:408.
Zhang J, Le TD, Liu L, Liu B, He J, Goodall GJ, et al. Identifying direct miRNAmRNA causal regulatory relationships in heterogeneous data. J Biomed Inform. 2014;52:438–47.
Marx V. Method of the Year: spatially resolved transcriptomics. Nat Methods. 2021;18:9–14.
Zhang J. Scan (v1.0.0). Zenodo. 2024. https://doi.org/10.5281/zenodo.13346480.
Ye X, Brabletz T, Kang Y, Longmore GD, Nieto MA, Stanger BZ, et al. Upholding a role for EMT in breast cancer metastasis. Nature. 2017;547:E13.
Neelakantan D, Zhou H, Oliphant MUJ, Zhang X, Simon LM, Henke DM, et al. EMT cells increase breast cancer metastasis via paracrine GLI activation in neighbouring tumour cells. Nat Commun. 2017;8:15773.
Kröger C, Afeyan A, Mraz J, Eaton EN, Reinhardt F, Khodor YL, et al. Acquisition of a hybrid E/M state is essential for tumorigenicity of basal breast cancer cells. Proc Natl Acad Sci U S A. 2019;116:7353–62.
Taube JH, Herschkowitz JI, Komurov K, Zhou AY, Gupta S, Yang J, et al. Core epithelialtomesenchymal transition interactome geneexpression signature is associated with claudinlow and metaplastic breast cancer subtypes. Proc Natl Acad Sci U S A. 2010;107:15449–54.
Turner KM, Yeo SK, Holm TM, Shaughnessy E, Guan JL. Heterogeneity within molecular subtypes of breast cancer. Am J Physiol Cell Physiol. 2021;321:C343–54.
Tan TZ, Miow QH, Miki Y, Noda T, Mori S, Huang RYJ, et al. Epithelialmesenchymal transition spectrum quantification and its efficacy in deciphering survival and drug responses of cancer patients. EMBO Mol Med. 2014;6:1279–93.
Hänzelmann S, Castelo R, Guinney J. GSVA: gene set variation analysis for microarray and RNAseq data. BMC Bioinformatics. 2013;14:7.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, et al. limma powers differential expression analyses for RNAsequencing and microarray studies. Nucleic Acids Res. 2015;43:e47.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc: Ser B (Methodol). 1995;57:289–300.
Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E. The role of site accessibility in microRNA target recognition. Nat Genet. 2007;39:1278–84.
Miranda KC, Huynh T, Tay Y, Ang YS, Tam WL, Thomson AM, et al. A patternbased method for the identification of MicroRNA binding sites and their corresponding heteroduplexes. Cell. 2006;126:1203–17.
Vejnar CE, Zdobnov EM. MiRmap: comprehensive prediction of microRNA target repression strength. Nucleic Acids Res. 2012;40:11673–83.
Paraskevopoulou MD, Georgakilas G, Kostoulas N, Vlachos IS, Vergoulis T, Reczko M, et al. DIANAmicroT web server v5.0: service integration into miRNA functional analysis workflows. Nucleic Acids Res. 2013;41 Web Server issue:W169173.
Enright AJ, John B, Gaul U, Tuschl T, Sander C, Marks DS. MicroRNA targets in Drosophila. Genome Biol. 2003;5:R1.
Krek A, Grün D, Poy MN, Wolf R, Rosenberg L, Epstein EJ, et al. Combinatorial microRNA target predictions. Nat Genet. 2005;37:495–500.
Pearson K. Notes on the history of correlation. Biometrika. 1920;13:25–45.
Spearman C. “General intelligence”, objectively determined and measured. Am J Psychol. 1904;15:201–92.
Kendall MG. A new measure of rank correlation. Biometrika. 1938;30:81–93.
LopezPaz D, Hennig P, Schölkopf B. The randomized dependence coefficient. In: Proceedings of the 26th International Conference on Neural Information Processing Systems  Volume 1. Red Hook, NY, USA: Curran Associates Inc; 2013. p. 1–9.
Prill RJ, Marbach D, SaezRodriguez J, Sorger PK, Alexopoulos LG, Xue X, et al. Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS ONE. 2010;5:e9202–e9202.
Zar J. Biostatistical analysis. Old Bridge: PrenticeHall/Pearson; 2010.
Deza E, Deza MM. Dictionary of distances. Amsterdam: Elsevier; 2006.
Deza MM, Deza E. Encyclopedia of distances. In: Deza E, Deza MM, editors. Encyclopedia of Distances. Berlin, Heidelberg: Springer Berlin Heidelberg; 2009. p. 1–583.
Craw S. Manhattan distance. In: Sammut C, Webb GI, editors. Encyclopedia of Machine Learning. Boston, MA: Springer, US; 2010. p. 639–639.
Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945;26:297–302.
Duda R, Hart P, G.Stork D. Pattern classification. Hoboken: Wiley Interscience; 2001.
Mahalanobis PC. On the generalized distance in statistics. Proceed National Institute Sci (Calcutta). 1936;2:49–55.
Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal Complex Syst. 2006;1695:1–9.
Massey FJ. The KolmogorovSmirnov test for goodness of fit. J Am Stat Assoc. 1951;46:68–78.
Zaidi F. Small world networks and clustered small world networks with random connectivity. Soc Netw Anal Min. 2013;3:51–63.
Student. The probable error of a mean. Biometrika. 1908;6:1–25.
Zhang J, Liu L, Zhang W, Li X, Zhao C, Li S, et al. miRspongeR 2.0: an enhanced R package for exploring miRNA sponge regulation. Bioinform Adv. 2022;2:vbac063.
Chen J, Lin J, Hu Y, Ye M, Yao L, Wu L, et al. RNADisease v4.0: an updated resource of RNAassociated diseases, providing RNAdisease analysis, enrichment and prediction. Nucleic Acids Res. 2023;51:D13971404.
Huang H, Lin YC, Cui S, Huang Y, Tang Y, Xu J, et al. miRTarBase update 2022: an informative resource for experimentally validated miRNAtarget interactions. Nucleic Acids Res. 2022;50:D222–30.
Karagkouni D, Paraskevopoulou MD, Chatzopoulos S, Vlachos IS, Tastsoglou S, Kanellos I, et al. DIANATarBase v8: a decadelong collection of experimentally supported miRNAgene interactions. Nucleic Acids Res. 2018;46:D239–45.
Zhang J, Liu L, Xu T, Zhang W, Zhao C, Li S, et al. miRSM: an R package to infer and analyse miRNA sponge modules in heterogeneous data. RNA Biol. 2021;18:2308–20.
Acknowledgements
The results shown here are in part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga, and GEO repository: https://www.ncbi.nlm.nih.gov/geo/.
Funding
JZ was supported by the National Natural Science Foundation of China (Grant Number: 61963001), the Yunnan Xingdian Talents Support Plan—Young Talents Program, and the Doctoral Scientific Research Foundation of Dali University. CZ was supported by the Applied Basic Research Foundation of Science and Technology of Yunnan Province (Grant Number: 202101BA070001221). TDL was supported by the ARC DECRA Grant (Grant Number: DE200100200).
Author information
Authors and Affiliations
Contributions
J.Z. and T.D.L. conceived the idea of this work. L.L. and J.L. refined the idea. J.Z. designed and performed the experiments. X.W., C.Z. and Y.L. participated in the design of the study and performed the statistical analysis. J.Z., L.L., T.D.L. and J.L. drafted the manuscript. All authors revised the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
12915_2024_2020_MOESM1_ESM.docx
Additional file 1: Figures S1S4 and Tables S1S2. Fig. S1  Cellspecific miRNA regulation using linear interpolation strategy (Scan.interp) in the K562 dataset. Fig. S2  Cellspecific miRNA regulation using statistical perturbation strategy (Scan.perturb) in the K562 dataset. Fig. S3  Tissuespecific miRNA regulation using linear interpolation strategy (Scan.interp) in the BRCA dataset. Fig. S4  Tissuespecific miRNA regulation using statistical perturbation strategy (Scan.perturb) in the BRCA dataset. Table S1  Runtime of 27 network inference methods with the linear interpolation strategy (Scan.interp) in the K562 and BRCA datasets. Table S2  Runtime of 27 network inference methods with the statistical perturbation strategy (Scan.perturb) in the K562 and BRCA datasets.
12915_2024_2020_MOESM6_ESM.xlsx
Additional file 6. Breast cancer tissues belonging to epithelial, intermediate epithelial, intermediate mesenchymal and mesenchymal types.
Rights and permissions
Open Access This article is licensed under a Creative Commons AttributionNonCommercialNoDerivatives 4.0 International License, which permits any noncommercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/byncnd/4.0/.
About this article
Cite this article
Zhang, J., Liu, L., Wei, X. et al. Scanning samplespecific miRNA regulation from bulk and singlecell RNAsequencing data. BMC Biol 22, 218 (2024). https://doi.org/10.1186/s1291502402020x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1291502402020x