3′-tag-based scRNA-seq data could be applied to poly(A) site identification
Currently available scRNA-seq protocols are developed from two main strategies, tag-based and full-length. The tag-based scRNA-seq methods with a designed unique molecular identifier (UMI) are either 3′-tag- or 5′-tag-based. It should be mentioned that 3-tag-based scRNA-seq methods are based on the strategy using oligo-dT priming to enrich the 3′-ends of transcripts, which are similar to several widely used high-throughput APA profiling methods. Based on the read coverage of 3′-tag-based scRNA-seq methods (Fig. 1a), we set out to explore whether this kind of scRNA-seq data could be applied to APA analysis.
To address this question, we collected five recently published 3′-tag-based scRNA-seq datasets, including two replicated CEL-Seq2 (cell expression by linear amplification and sequencing) datasets (A/B) for mouse embryonic stem cells (mESC), two replicated SCRB-seq (single-cell RNA barcoding and sequencing) datasets (A/B) for mESC and a Microwell-seq dataset for HEK293 cell line [27, 28]. In order to accurately evaluate the reliability of scRNA-seq data in poly(A) site identification, we first need to ensure that 3′-tag scRNA-seq methods indeed preferentially capture 3′-end of transcripts. We extracted the coordinates of 3′-ends from all aligned reads in scRNA-seq data and compared them with poly(A) sites annotated in two widely used poly(A) databases, including PolyA_DB 3 and GENCODE, respectively (Additional file 1: Fig. S1A-B) [29, 30]. The results suggested 3′-ends of scRNA-seq reads are enriched adjacent to annotated poly(A) sites, although different protocols exhibited distinct cumulative distributions (Additional file 1: Fig. S1A-B). To further evaluate the validity of these scRNA-seq data in poly(A) site identification, we pooled all aligned reads together and extracted reads containing poly(A) sequence (poly(A) reads) as poly(A) reads (see the “Methods” section). Notably, the poly(A) reads coverage decreased sharply around the annotated poly(A) sites to create peaks that could be used to infer the coordinates of poly(A) sites (Fig. 1b, Additional file 1: Fig. S1C-I).
According to these observations in 3′-tag-based scRNA-seq data, we referred to previous studies [20, 31] and developed a computational method that aim to de novo identify poly(A) sites using 3′-tag-based scRNA-seq data, regardless of any prior poly(A) sites annotation (Additional file 1: Fig. S2A). Firstly, we trimmed consecutive poly(A) sequences and tagged scRNA-seq reads into poly(A) and non-poly(A) reads. Then, we could obtain the genomic coordinates of 3′-ends of those tagged poly(A) reads and count the number of aggregated 3′-ends on each position from the aligned reads. The summits of clusters could be regarded as potential poly(A) sites. As the poly(A) reads may originate from internal poly(A) regions, we excluded those poly(A) sites adjacent to consecutive poly(A) sequences that were suspected to generate from internal priming. By further filtering those adjacent to annotated poly(A) sites, additional sites were regarded as novel poly(A) sites (Additional file 1: Fig. S2A).
Taking advantage of the collected 5 scRNA-seq datasets [27, 28], we set out to identify poly(A) sites for each dataset. To determine how well the poly(A) sites were identified using scRNA-seq data match annotated poly(A) sites, we calculate the distances from poly(A) sites identified using SAPAS to the closet annotated poly(A) sites for each scRNA-seq dataset. The results showed that the identified poly(A) sites exhibit a sharp peak around annotated poly(A) sites within 10 nt, suggesting that SAPAS could accurately identify the exact positions of annotated poly(A) sites using scRNA-seq data (Fig. 1c–e, Additional file 1: Fig. S2B-C).
In order to further evaluate the performance of poly(A) identification using SAPAS on scRNA-seq data, we conducted motif enrichment analysis on the novel poly(A) sites identified in each scRNA-seq dataset. The poly(A) signals are required for pre-mRNA cleavage and polyadenylation and usually found at approximately 15–30 nt upstream of the poly(A) sites. The canonical poly(A) signal is AAUAAA, which is predominant with greater than 50% frequency [3, 32]. The results of motif enrichment showed that the canonical poly(A) signal (AAUAAA) is top significantly enriched for each scRNA-seq dataset (Fig. 1f). Furthermore, the position-dependent frequency of the canonical poly(A) signal also illustrated that the novel poly(A) sites have the canonical poly(A) signal at the expected position, ~ 21 nucleotides upstream of poly(A) sites (Fig. 1g). These observations demonstrated the authenticity of poly(A) sites identified by SAPAS, indicating SAPAS could accurately identify the exact position of poly(A) sites. Additionally, two examples of novel identified poly(A) sites in mESC were shown in Additional file 1: Fig. S2D-E.
Moreover, novel intronic poly(A) site could also be identified using SAPAS. For example, a novel poly(A) sites (chr2:169686455-169686456:-) located in the first intron of coiled-coil domain containing 173 (CCDC173) was identified in the HEK293 scRNA-seq data, indicating that a truncated coding sequence was used in HEK293 cells (Fig. 1h). In addition, this intronic poly(A) site was also supported by bulk RNA-seq reads, but it was not reported in PolyA_DB 3 and GENCODE before [29, 30]. Interestingly, the intronic poly(A) site was used in a cell type-specific manner that it was mainly expressed in HEK293 cells that originally derived from human embryonic kidney cells, but other cells prefer to use the distal poly(A) sites, such as human GM12878 lymphoblastoid cells and HepG2 liver cancer cells (Fig. 1h). Together, these results demonstrated that 3′-tag-based scRNA-seq data could be used to identify poly(A) sites, allowing further exploration of APA in different cell types.
Quantification of APA using 3′-tag-based scRNA-seq data
The pooled aligned reads of 3′-tag-based scRNA-seq data were clustered using the parametric clustering algorithm implemented in paraclu to identify the peak regions [33]. Combining defined poly(A) sites, we could assign peak regions to poly(A) sites for further quantification of APA. Once the genomic intervals of all poly(A) sites’ peak regions were identified, the transcript-level expression of distinct poly(A) isoforms could be estimated by counting reads aligned to each poly(A) site’s peak region for each single cell. Furthermore, to quantify the relative usage for each poly(A) site, we calculated the relative expression level of a specific poly(A) site isoform with respect to the total expression level of all poly(A) isoforms of the gene. Through this way, we could profile the poly(A) site usage at the single-cell level (Fig. 2a).
To assess the reproducibility and reliability of quantification of APA using SAPAS, we calculated the pairwise Pearson correlations of gene expression level and poly(A) isoform expression level across all single cells for each scRNA-seq dataset, respectively (Fig. 2b, c, Additional file 1: Fig. S3A-D, Additional file 1: Fig. S4A-D). The pairwise Pearson correlations of gene expression level were highly correlated that could reach about 0.8 and the correlations dropped slightly in isoform expression levels (Fig. 2d), suggesting that the estimation of poly(A) isoform expression level is reproducible as gene level, despite the fact that poly(A) isoforms only contained a fraction of the reads. One major challenge of scRNA-seq data analysis is the presence of dropout events where one gene is observed at a moderate or even high expression level in one cell but undetected in another cell, which is due to low amounts of RNA sequenced for each single cell [34]. To assess the effect of dropout events on quantification of APA, we calculated the Pearson correlations of gene expression level and poly(A) isoform expression level between each individual cell and artificial bulk sample constructed by simply summing the single-cell read counts, respectively (Fig. 2e). The marginal difference of Pearson correlations of poly(A) isoform expression level and gene expression level demonstrated that the dropout events do not introduce too many additional biases which could dramatically affect the accuracy of quantification of APA (Fig. 2e).
To further illustrate the reliability of SAPAS, we also compared the APA profiles estimated from scRNA-seq data to those computed using bulk 3′-end sequencing data in HEK293 cells [21]. Despite the fact that the scRNA-seq data and bulk 3′-seq data were generated in different labs and using obviously different sequencing technologies, the estimated poly(A) isoform expression level correlated well between scRNA-seq data and bulk 3′-seq data (Pearson correlation R = 0.67, P < 2.2 × 10−16) (Additional file 1: Fig. S5A, B). In addition, to further demonstrate the performance of SAPAS on quantifying APA at the single-cell level, we applied SAPAS and other existing bioinformatics methods developed to analyze APA dynamics using conventional RNA-seq data, such as Dapars [13] and QAPA [22], to conduct benchmarking analysis on a CEL-seq2 dataset of peripheral blood mononuclear cells (PBMCs) [35]. The results of t-distributed stochastic neighbor embedding (t-SNE) showed that the single-cell APA profiles estimated by SAPAS could clearly separate single cells into different cell-type clusters, including B cells, T cells, monocytes, and megakaryocytes (Fig. 3a). Besides, several cell subtypes could also be revealed by single-cell APA profiles, such as CD4+ T cell and cytotoxic T cell (Fig. 3a). However, the single-cell APA profiles estimated using Dapars and QAPA could only cluster one or two cell types and failed to distinguish others (Fig. 3b, c). We also conducted silhouette analysis to quantitatively evaluate the clustering results. The silhouette width is widely used to quantitatively assess the quality of the clustering results. The observations showed that the silhouette widths of SAPAS are higher than Dapars and QAPA, indicating SAPAS outperform Dapars and QAPA on quantifying APA at the single-cell level using 3′-tag-based scRNA-seq data (Fig. 3d–f).
SAPAS enables identification of novel poly(A) sites in GABAergic interneurons
We next employed SAPAS to study the genome-wide landscape of APA in phenotype-characterized GABAergic interneurons, using recently published 3′-tag-based scRNA-seq data generated from six non-overlapping GABAergic subpopulations with anatomical, physiological, and molecular evidence [36]. These GABAergic neurons include 64 CCK-positive basket cells (CCKC) [37], 132 chandelier cells (CHC) that innervate the axon initial segment of pyramidal neurons [25], 63 interneuron-selective cells (ISC) [38], 136 long-range-projecting GABAergic neurons (LPC) [39], 62 Martinotti cells (MNC) [40], and 127 fast-spiking parvalbumin-positive interneurons (PVBC) [41]. These individual neurons were manually sorted from micro-dissected motor and somatosensory cortexes of 6-week-old mice [36].
To systematically explore APA in these genetically labeled and phenotypically characterized GABAergic neurons, we first applied SAPAS to identify novel poly(A) sites for each GABAergic neuron type (Additional file 1: Fig. S6). As a result, several novel poly(A) sites were identified in each subtype (1,356 in CCKC, 1,016 in CHC, 620 in ISC, 961 in LPC, 674 in MNC, and 905 in PVBC, Additional file 1: Fig. S7A). Combining the novel poly(A) sites of each GABAergic neuron type, we altogether identified 3777 novel poly(A) sites in these GABAergic neurons. Among these combined novel poly(A) sites, 121 poly(A) sites were discovered in all 6 GABAergic neuron types (Additional file 1: Fig. S7A). Further motif analysis showed that the canonical poly(A) signal (AAUAAA) is top significantly enriched for each GABAergic neuron type (Additional file 1: Fig. S7B) and located at the expected position, ~ 21 nucleotides upstream of poly(A) sites, which is similar to the annotated poly(A) sites (Additional file 1: Fig. S7C). These observations demonstrated the reliability of these poly(A) sites in GABAergic neurons identified by SAPAS. As an example, across all GABAergic neuron types, a previously unannotated poly(A) site was identified in the 3′ UTR of the gene Ran, coding for a small GTP-binding protein that plays fundamental roles in regulating the translocation into and out of the cell nucleus [42] (Additional file 1: Fig. S7D). In addition, the canonical poly(A) motif (AAUAAA) was found ~ 20 nucleotides upstream of the poly(A) site (Additional file 1: Fig. S7D). Furthermore, to explore the potential underlying biological function of these novel poly(A) sites, we performed Gene Ontology (GO) enrichment analysis to assess whether these genes with novel poly(A) sites belong to specific GO terms. The enrichment results revealed that the novel poly(A) sites identified in different GABAergic neuron types are enriched for genes with synaptic communication-associated GO terms, such as presynaptic membrane, postsynaptic membrane, and axon part (Additional file 1: Fig. S8).
APA profiles could be used to classify different GABAergic neuron types
Given that APA is known to be involved in numerous biological processes including development, cell differentiation, cell proliferation, and cell reprogramming [11,12,13,14,15,16,17], we next sought to investigate whether APA profiles could be used to determine GABAergic neuron identity. To address this question, we employed SAPAS to compute poly(A) site usage for all six different GABAergic neuron types. Taking the clustering result using gene expression level as reference, the t-SNE plots demonstrated that both poly(A) isoform expression and poly(A) site usage could also be used to separate different GABAergic neuron types clearly [43] (Additional file 1: Fig. S9).
Furthermore, in order to detect the cell type-specific APA events of GABAergic neuron types, we implemented a supervised machine learning-based method in SAPAS (Fig. 4a). The basic rationale was that single cell of the same cell type should exhibit a similar APA pattern for each gene. For each gene, we first calculate the similarity between all pairs of single cells based on the poly(A) site usage of this gene. In contrast to gene expression level, the poly(A) sites usage of one gene are not scalar, but vector. Thus, we used the Hellinger distance [44] commonly used to measure the similarity of two probability distributions to measure the similarity of cell pairs and construct the cell-to-cell similarity network. Next, we made use of the cross-validation strategy to randomly select some single cells as test set and others are training set. Then, we could predict cell types of the held-out test set by using a neighbor-voting algorithm based on the predefined cell-to-cell similarity network. Thus, the performance (area under the receiver operating characteristic (AUROC)) of separating one cell type from others using poly(A) site usage was calculated as cell-type specificity of APA events, and gene set enrichment analysis (GSEA) could be applied based on the AUROC ranks to probe potential functional associations (Fig. 4a).
Next, we have calculated the cell-type specificity of APA events in the GABAergic neuron dataset, and 4269 genes with multiple poly(A) sites were tested (Fig. 4b, Additional file 1: Fig. S10). In addition, we also performed differential usage analysis using one-versus-rest scheme for each GABAergic neuron type to detect significantly differential used poly(A) sites by constructing artificial bulk data from single-cell data. The observations showed that the genes with significantly differential used poly(A) sites rank top on AUROC in each neuron type (Additional file 1: Fig. S10). By setting a threshold on AUROC (AUROC > 0.8), we could define a set of genes with cell type-specific APA events for each neuron type. Several genes with cell type-specific APA events were shown in Fig. 4c, d. For example, the gene coding for Calmodulin 1 (Calm1), a key integrator of calcium signaling that is involved in guiding axon projections to create connections with other neurons or tissues [45, 46], predominantly uses the distal poly(A) site (chr12:100209791:+) in PVBCs compared with other GABAergic neuron types.
Furthermore, we performed GSEA to assess whether genes with cell type-specific APA events belong to specific biological functions or pathways. The results showed that those genes with higher cell-type specificity were significantly enriched in synaptic connectivity and input-output signaling-related function categories, including synaptic vesicle release machinery, cell adhesion molecules, ion channels, and intracellular receptor signaling (Fig. 4e). In addition, GSEA using cellular component gene sets suggested that these genes encode proteins that localize along or close to the cell or synaptic membrane, such as presynapse, postsynapse, and vesicle membrane (Additional file 1: Fig. S11). Collectively, these observations suggested that the anatomical and physiological differences across different GABAergic neuron types may in part be mediated by APA of genes involved in synaptic communication.
Given that the GABAergic neuronal identity is reported to be encoded in functionally congruent gene expression [36], we next sought to investigate whether the cell-type specificity of APA events is primarily due to gene expression specificity. To address this question, we applied the EWCE method to calculate the cell-type expression specificity for each gene in each GABAergic neuron type (Additional file 1: Fig. S12), which is a metric that represents the proportion of expression of one gene found specifically in one cell type compared to all cell types [47, 48]. We plotted the gene expression specificity of gene sets with cell type-specific APA events for each GABAergic neuron type (Additional file 1: Fig. S13). The results showed that the large majority (> 80%) of genes with cell type-specific APA do not display high cell-type expression specificity, suggesting that the APA profiles difference could be another source of heterogeneity across different GABAergic neuron types, partially independent of gene expression level.
Cell type-specific APA events alter 3′ UTR length and CDS of genes
In addition to 3′ UTR length difference, the coding sequence (CDS) could also be affected by APA located in upstream introns and internal exons in about 40% of mammalian genes [49, 50]. To explore the biological functions of cell type-specific APA events, we assessed potential 3′ UTR or CDS changes due to cell type-specific APA events for each GABAergic neuron type. Most cell type-specific APA events resulted in 3′ UTR length difference only, without impact on CDS. For example, Syt7 coding for Synaptotagmin-7, brain-specific calcium-dependent proteins which have been shown to regulate synaptic exocytosis and neurotransmitter release [51, 52], predominantly used the distal poly(A) sites in ISC compared to other types of GABAergic neurons (Fig. 5a). Conversely, Itsn1, a multidomain scaffolding and adaptor protein involved in the synaptic vesicle, predominantly expressed the short 3′ UTR APA isoforms in ISC, whereas other GABAergic neurons expressed long 3′ UTR APA isoforms (Fig. 5b). In addition, cell type-specific APA events with impact on the CDS were also found in different GABAergic neuron types. For instance, Rufy3, a neuronally enriched protein which has been implicated in regulating the generation of neuronal polarity formation and axon growth [53, 54], mostly expressed the full-length isoform in ISC and MNC, whereas an intronic poly(A) site of Rufy3 was predominantly used in other types of GABAergic neurons, resulting in the APA isoform with a truncated CDS (Fig. 5c). In further, we found that the truncated protein of Rufy3 generated by intronic polyadenylation lacks the FYVE-related domain compared to the full-length protein (Fig. 5d). Notably, the FYVE-related domain is an evolutionarily conserved domain which could bind with high specificity to phosphatidylinositol 3-phosphate (PI(3)P) to localize proteins to endosomes [55, 56]. In addition, a recent study demonstrated that Rufy3 is essential for caspase-mediated axon degeneration [57]. Collectively, the result suggested that the intracellular localization and biological function of Rufy3 protein may be altered by APA in different GABAergic neuron types. In addition, to further illustrate the cell-type specificity of these APA events, we collected an independent public scRNA-seq dataset of mouse neocortex that containing GABAergic interneurons [58]. This dataset is generated by Smart-seq2 method which is a full-length scRNA-seq method. We could extract those single cells with cell types corresponded to the GABAergic neuron type in our study (Additional file 1: Fig. S14A, B). Then, we exploited QAPA, which are suitable for full-length RNA-seq data, to compute the single-cell APA profiles. The observations showed that the APA patterns of Syt7, Itsn1, and Rufy3 among different neuron types are in agreement with previous results, demonstrating the authenticity of cell type-specific APA events identified by SAPAS in our study (Additional file 1: Fig. S14C-H).
To further explore the potential functional consequences of altered 3′ UTR or CDS of APA, we sought to intersect GWAS signals with cell type-specific APA events with the goal of systematically linking particular diseases or traits to APA. Given a large faction of APA events are quite well conserved between human and mouse [29], we applied the linkage disequilibrium (LD) score regression (LDSC) method to quantify the enrichment of heritability for human traits and diseases within altered 3′ UTR or CDS of cell type-specific APA events [59]. To do so, we first lifted over human SNPs to orthologous coordinates in mouse genome and then calculated the enrichment of heritability for 15 brain-related diseases and quantitative traits, including Alzheimer’s disease, amyotrophic lateral sclerosis (ALS), multiple sclerosis (MS), epilepsy, Tourette syndrome, autism, depression, bipolar disorder, attention deficit hyperactivity disorder (ADHD), schizophrenia, educational attainment, intelligence, insomnia, neuroticism, and risky behavior (Fig. 5e). As a result, we observed statistically significant enrichments for heritability for psychiatric disorders and brain traits, such as schizophrenia, ADHD, bipolar disorder, and educational attainment (Fig. 5e). In previous studies, converging evidence suggests that dysfunction of GABAergic interneurons is critical for basic neural circuit function whose dysfunction is linked to the pathophysiology of several psychiatric diseases [60, 61]. Taken together, these results suggested that cell type-specific APA events were involved in defining the physiological properties of GABAergic neurons.
Bimodality of APA could reveal subpopulations of chandelier cells
Quantification of APA at the single-cell level provides us an opportunity to study the cell-to-cell variability of APA. To categorize the distributions of poly(A) site usage at the single-cell level, we implemented a method to classify each gene into one of five modalities, including (1) distal, where the distal poly(A) sites were predominantly used in the majority of cells; (2) proximal, where the proximal poly(A) sites were predominantly used in most cells; (3) bimodal, where two subpopulations of cells with either distal poly(A) sites or proximal poly(A) sites were used; (4) middle, where the moderate usage of distal poly(A) sites can be observed; and (5) multimodal, where distal poly(A) site usages were random from 0 to 1 (Fig. 6a). We binned poly(A) site usage into [0~1/3, 1/3~2/3, 2/3~1] and set binned distributions for different modalities as reference (Fig. 6a). Through computing the distance measured by Jensen-Shannon divergence between each gene’s binned distribution of distal poly(A) site usage and reference binned distributions, the modality of reference binned distribution with the smallest distance was assigned. Thus, we designated genes into different modalities for each GABAergic neuron type (Fig. 6a). For instance, the gene Tardbp with critical roles in splicing in neurons exhibits the middle modality of distal poly(A) site usage in CCKC. In all six GABAergic neuron types, genes within distal modality account for more than 50% of all genes analyzed (Fig. 6b, c), which lines up with previous studies that distal poly(A) sites are favored in nervous systems, resulting in isoforms with longer 3′ UTRs [62, 63]. Besides, genes that exhibit proximal modality account for ~ 25% of all genes, whereas bimodality and multimodality account for ~ 10% and ~ 15%, respectively (Fig. 6b, c).
As genes within bimodality exhibited higher variability of distal poly(A) site usage compared with other modalities (Additional file 1: Fig. S15), we surmised that they could be used to identify subpopulations from a specific GABAergic neuron type. As an example, we observed that the gene Pak3 (p21-activating kinase 3) exhibited bimodality of distal poly(A) site usage in CHCs (Fig. 6d, e). Pak3 is a serine/threonine kinase preferentially expressed in neurons that functions as a downstream effector of the Rho family of GTPase and plays a critical role in regulating neurite growth, as well as synapse formation and plasticity [64,65,66]. CHCs could be separated into two groups (Group1 and Group2) by distal poly(A) site usage of Pak3 (Fig. 6f). In addition, we identified genes that top correlated and anticorrelated with distal poly(A) site usage of Pak3, which also clustered the CHCs into two groups (Fig. 6f). We observed that the distal poly(A) site was predominantly used in Group2 CHCs compared to Group1 CHCs (Fig. 6f, g). Notably, we found that all of the Group1 CHCs were obtained from deeper layers (L5/6), whereas Group2 CHCs were from upper layers (L2/3) (Fig. 6h), which was consistent with previous studies showing CHCs in distinct layers are different subgroups that are recruited by distinct cortical inputs and regulate different populations of pyramidal neurons [25]. Besides, several additional examples with the potential to subtype GABAergic neurons were also found, such as Ythdf3, Dicer1, Efr3a, and Cbx5 (Additional file 1: Fig. S16). Collectively, these results demonstrated that the modalities estimated from quantification of APA at the single-cell level provide us an additional layer of information to reveal cell subpopulations.