Using Iso-Seq to update Harpegnathos gene annotation
We previously generated a single-cell RNA-seq atlas of the Harpegnathos brain during the worker–gamergate transition and discovered extensive changes in cell type composition in glia and neurons [22]. While inspecting these sequencing data [23], we noticed that in many cases, even when using the latest NCBI annotation (NCBI Release 102; hereafter referred to as HSAL50), the single-cell RNA-seq reads mapped outside gene model boundaries, typically donwstream of the annotated TTS, resulting in decreased coverage and information loss. As an example we show the case of the Ref1 gene (Fig. 1A, red box), resulting in decreased coverage and information loss. Motivated by examples such as this and by a desire to obtain a more comprehensive catalog of splicing isoforms, we sought to improve the Harpegnathos gene annotations using PacBio Iso-Seq long RNA reads.
To maximize library complexity, we sequenced two separate polyA+ Iso-Seq libraries: one from a pool of Harpegnathos brains from different castes, and one from a mixture of ovary and fat body tissues. After processing the raw PacBio subreads (Additional File 1: Fig. S1A and Fig. 1B), we obtained 34,867 and 33,520 full-length “polished” reads with median length of 2034 bp and 2137 bp for the brain sample and the fat body/ovary sample, respectively (Fig. 1C). After aligning the polished Iso-Seq reads to the Harpegnathos genome, we compared gene coverage with that of previously obtained short-read RNA-seq in matching tissues [9, 24]. More than half of the genes detectable (RPKMs > 0.5) in our collection of deep short-read RNA-seq were also covered by Iso-Seq reads (Fig. 1D). As expected, genes detected by Iso-Seq tended to be more highly expressed (Additional File 1: Fig. S1B). From the mapped reads, we collapsed redundant transcript models, generated predicted isoforms for each of the Iso-Seq libraries, and used these isoforms to refine the existing HSAL50 annotation. We further improved these models by manually adding 89 genes and reviewing all merged genes (Additional File 1: Fig. S1C and Additional File 2: Table S1–3) and designated this upgraded annotation as HSAL51.
Overall, HSAL51 contained 13,957 gene models that corresponded unambiguously to HSAL50 genes, 392 new genes either predicted from Iso-Seq signal or added manually (Additional File 2: Table S1–2), and 57 gene models that merged two or more HSAL50 gene models into a single one in HSAL51 (Additional File 1: Fig. S1D and Additional File 2: Table S3). An example of a merged gene was the combination of two adjacent HSAL50 genes, each of which contained one of the two known Drosophila melanogaster MOCS1 protein domains; the resulting merged gene model for Mocs1 in HSAL51 encodes a protein with a domain structure identical to its ortholog in Drosophila (Fig. 1E). Returning to the example of Ref1 (Fig. 1A), Iso-Seq reads indicated the existence of at least two isoforms with TTS downstream of the one annotated in HSAL50, which captured the single-cell sequencing signal missed with the old annotation (Fig. 1F, arrows). For additional verification of the TTSs predicted in our new annotation, we devised a custom RNA-seq protocol that compares short reads obtained with an anchored oligo-dT primer with random hexamers to remove signal from internal A-stretches and identifies the position of the polyA tail on mature mRNAs (“dT-seq,” Additional File 1: Fig. S1E–F, see methods). In the case of Ref1, dT-seq signal analyses confirmed the existence of the new termination sites (Fig. 1F).
Thus, using long-read sequencing, we updated the Harpegnathos gene annotation and recovered gene models that were split incorrectly in the HSAL50 assembly or that did not have a correctly annotated 3′ UTRs.
Comprehensive annotation of transcriptional isoforms with Iso-Seq
Since its development in 2013 [25], Iso-Seq has been performed on a genome-wide scale in a range of plants and animal species to improve the annotation of transcriptional isoforms [11, 26,27,28]. The ability of Iso-Seq to sequence RNA molecules in their entirety confers an advantage in detecting splicing patterns compared to the typical short-read annotation strategy of relying on reads that cover a limited span across splice junctions. Indeed, HSAL51 contained a greater number of annotated transcripts with distinct splicing patterns (i.e., beyond simple extension of 5′ or 3′ UTRs) (Fig. 2A). In addition, gene models in HSAL51 exhibited more instances of all seven types of alternative splicing [29]: skipped exon (SE), mutually exclusive exons (ME), alternative 5′ splice site (A5), alternative 3′ splice site (A3), retained introns (RI), alternative first exon (AF), and alternative last exon (AL) (Fig. 2B). Examples of genes with newly annotated alternative splicing events are presented in Additional File 3: Fig. S2A–C.
Next, we identified genes whose relative transcript expression varies between tissues (also called differential transcript usage, or DTU) [30] using previously published bulk RNA-seq data from 6 Harpegnathos tissues: non-visual brain, ovary, fat body, retina, optic lobe, and antenna [9, 31]. Most genes (80%) exhibiting DTU in HSAL50 between at least two tissues were also detected in HSAL51, but the number of genes displaying DTU increased by 681 in the new annotation (Fig. 2C). For example, a newly annotated isoform (isoform 14) of the myeloid leukemia factor (Mlf) gene accounted for 80% of transcripts produced in the brain (Fig. 2D). This isoform contained exon 6 but not exon 5 of Mlf (Fig. 2E, F) and was not identified with short reads alone, highlighting the power of Iso-Seq in untangling complicated exon structures, especially in cases of mutually exclusive exons. Once annotated, short reads spanning the alternative splice junctions could be properly assigned to this isoform resulting in the identification of brain-specific DTU.
In addition to genes with DTU between tissues, we identified several genes with caste-specific transcript usage in the brain using previously published RNA-seq from Harpegnathos [24]. One notable gene with caste-biased isoforms between workers and gamergates was insulin-like-peptide 2 (Ilp2; LOC105188195), a gene similar to canonical insulin whose absolute transcript levels are higher in the brains or heads of reproductive individuals compared to those of non-reproductives in many ant species [32]. In general, insulin signaling has been identified as a key pathway regulating caste identity in social insects [33]. In addition to the higher gene-level Ilp2 expression in brains of Harpegnathos gamergates compared to workers (Additional File 3: Fig. S2D), isoforms 3 and 33 were used at different levels between the castes (Fig. 2G), with the upstream first exon being used more frequently in gamergates compared to workers (Fig. 2H). While these alternative splicing events appear to only affect the 5′ UTR, they might still have important consequences on insulin signaling, for example by regulating translation of the resulting mRNA, as observed in mammals [34, 35]. Upon reanalysis of published data [32, 36,37,38], two other ant species, the carpenter ant Camponotus planatus (Formicinae) and the giant ant Dinoponera quadriceps (Ponerinae), also displayed caste-biased selection of the first exon in brains (Additional File 3: Fig. S2E–F). In both species, as in Harpegnathos, the reproductive caste was more likely than the non-reproductive caste to use the upstream first exon, suggesting that alternative splicing of Ilp2 mRNA might be an evolutionary conserved mechanism for the caste-specific regulation of the insulin pathway.
Together, these analyses demonstrate that an Iso-Seq-enriched genomic annotation captures a greater complexity in the transcriptome which, in turn, can provide a comprehensive view of alternative splicing events between different biological samples—in this case tissues and castes.
Extended 5′ and 3′ gene boundaries increase sensitivity of bulk RNA-seq
In addition to a more comprehensive view of transcriptional isoforms originating from alternative splicing, the long reads of Iso-Seq are expected to contain more complete UTRs in 3′ and, to some extent, 5′, resulting in more accurate annotation of TTSs and TSSs, respectively, extending the mappable regions of the gene models. Consistent with these expectations, for 45% of all gene models, the exons annotated in HSAL51 covered a larger (median extension, + 20%) sequence than in HSAL50, whereas only 5% resulted in smaller gene models and even those were minimally impacted (median contraction, − 2%) (Fig. 3A, B). This remarkable growth in annotated exonic space was largely due to the extension of 3′ UTR using newly annotated TTSs (5338 transcripts among 4269 genes) and, to a smaller degree, to the extension of 5′ UTRs using newly annotated TSSs (2878 transcripts) (Fig. 3C). Transcripts were typically extended by more base pairs at the TTS compared to the TSS (Fig. 3D), with median extension length of 251 nt and 31 nt, respectively.
To confirm that these 3′ UTR extensions originated from the annotation of bona fide TTSs, we analyzed the position of polyA tails, as determined by dT-seq (see Additional File 1: Fig. S1E), relative to the HSAL50 and HSAL51 gene models. As expected, the dT-seq signal accumulated at TTSs in both annotations, and it was stronger at the TTSs of HSAL51 gene models as compared to HSAL50 (Additional File 4: Fig. S3A), including in a comparison of gene models with extended TTSs in HSAL51 (Fig. 3E, F), demonstrating the accuracy of the newly annotated downstream TTSs.
To evaluate the effect of the new annotation on RNA-seq mapping rates, we calculated the mapping rate of aligned reads using a newly generated dataset of Harpegnathos worker brains, which were not used in the construction of the HSAL50 (or HSAL51) annotations. Significantly more reads mapped to annotated exons in HSAL51 (Fig. 3G). This improvement in RNA-seq mapping rates had tangible benefits on the biological interpretation of sequencing datasets, such as, for example, the identification of additional differentially expressed genes in pairwise comparisons. Reanalyzing the worker vs. gamergate transcriptomes [20, 24] with the new annotation identified the egh gene as significantly upregulated in workers (Fig. 3H). The Drosophila homolog of this gene is involved in the sex-peptide response and is implicated in the regulation of mating and egg-laying [39], suggesting a biological explanation for its caste-specific expression in Harpegnathos. Iso-Seq reads indicated the existence of a longer gene model for egh (Fig. 3I), which increased the number of reads mapping to this gene (Additional File 4: Fig. S3B) and resulted in its confident identification as caste-specific.
These results show that the addition of Iso-Seq information extends the annotated gene models, allowing for the extraction of more information from RNA-seq experiments and resulting in higher sensitivity for differentially expressed genes.
Iso-Seq-based annotation improves single-cell analyses
Given the 3′ bias of the most widely used techniques for droplet-based single-cell sequencing, we reasoned that these analyses would be improved by the more accurate 3′ UTR annotations found in HSAL51. Indeed, the mapping rate of 10x Genomics single-cell RNA-seq reads from our previous worker–gamergates comparison [22, 23] increased in average by 44% (Fig. 4A). This resulted in substantially increased counts for a large majority of annotated genes, including several with important functions in the brain (Fig. 4B, below diagonal). In all 11 samples analyzed, increased mapping rates resulted in improvements for the total number of cells identified, as well as the average unique molecular identifiers (UMIs) and genes detected per cell (Fig. 4C). Overall, the total number of cells passing quality thresholds increased by 18% from 20,729 using the HSAL50 annotation to 24,560 using HSAL51 (Fig. 4D, Additional File 5: Fig. S4A), showing the importance of accurate 3′ UTR annotations to extract the maximum amount of information from single-cell RNA-seq data.
Markers for three major neuronal classes based on neurotransmitter usage, VAChT (cholinergic), VGlut (glutaminergic), and Gad1 (GABAergic), were among the genes that benefitted from the increased mapping rates (Fig. 4B), resulting in an increased number of cells with observed expression of these genes (Fig. 4E, pie charts), and thus easier identification of clusters with specific neurotransmitter expression, as visualized on UMAP projection of the HSAL51 data (Fig. 4E). These improvements were not confined to genes associated with neurotransmitter usage; using HSAL51, we recovered in total 288 previously undetected marker genes with restricted, cell type-specific expression (Additional File 5: Fig. S4B), likely missed in HSAL50 due to reads that were not assigned to the incomplete old gene models (Additional File 5: Fig. S4C).
In addition, we recovered 12 new markers for mushroom body neurons (Additional File 5: Fig. S4D), which are key to learning and memory in insects [40,41,42]. Some of these markers were biased for mushroom body cells in HSAL50 but did not pass statistical thresholds due to overall low expression. Of these 12 newly identified mushroom body markers in Harpegnathos, 10 were previously described as mushroom body-specific genes in Drosophila [43, 44] or honeybees [45] (Additional File 5: Fig. S4D). In particular, two Harpegnathos genes with homology to known mushroom body markers GluR1B and twin of eyeless (toy) [44, 46,47,48] were barely detectable in HSAL50 but clearly mapped to mushroom body clusters in HSAL51 (Fig. 4F and Additional File 5: Fig. S4D).
We previously showed that neuroprotective ensheathing glia cells are expanded during the worker–gamergate transition and lost at a faster rate in workers than in gamergates during aging [22]. Despite the in-depth investigation of this cell type in our previous study, the updated HSAL51 annotation allowed us to discover a new marker gene, CG9259 (Fig. 4G), that was previously missed because the near-entirety of the single-cell RNA-seq reads fell within an extended 3′ UTR not annotated in HSAL50 (Additional File 5: Fig. S4E). In Drosophila, CG9259 encodes an ecdysteroid kinase-like protein [49], suggesting that Harpegnathos ensheathing glia that express this gene might play an important role in the caste-specific regulation of the key developmental hormone ecdysone.
Finally, the increased single-cell transcriptome depth afforded by HSAL51 allowed us to identify differential gene expression between workers and gamergates within specific cell types, with more genes overall classified as caste-specific between single-cell clusters in the HSAL51 analysis (Additional File 5: Fig. S4F). As with the newly detected marker genes, many of the 250 newly called differential genes had increased counts in HSAL51 compared to HSAL50, suggesting their new detection resulted from the increased mapping rates of the 3′-biased single-cell RNA-seq reads (Additional File 5: Fig. S4G). For example, we identified CycG as a gene preferentially expressed in specific subtypes of gamergate cells as compared to their counterparts in workers (Fig. 4H). This observation is in agreement with previous studies reporting upregulation of CycG in the reproductive caste of various ant species [50], including Harpegnathos gamergates (Additional File 5: Fig. S4H). CycG regulates the insulin pathway [51, 52], which, as mentioned above, is an important player in caste determination in social insects [32, 33]. The identification of the cell types that are the strongest drivers of CycG caste-biased expression will help inform future studies of this gene.
Thus, single-cell RNA-seq analyses were greatly improved by the increased accuracy of 3′ UTR annotations in HSAL51, resulting in 18% more single cells identified computationally, a clustering of transcriptional types more reflective of biological function, recovery of additional cell-type markers, and higher sensitivity for differentially expressed genes.
Long noncoding RNAs in single-cell sequencing analysis revealed by Iso-Seq
Protein-coding genes are often the focus of transcriptomic studies, but many genes are transcribed into noncoding RNAs with important regulatory roles [53,54,55]. Similar to the case of protein-coding transcripts, several gene models for various types of noncoding RNAs, and in particular long noncoding RNAs (lncRNAs), were also extended in HSAL51 compared to HSAL50 (Fig. 5A, Additional File 6: Fig. S5A), although not to the same extent, possibly due to their overall lower expression level.
Single-cell analyses using the updated gene models in HSAL51 revealed 130 lncRNAs with more UMIs compared to HSAL50 (Fig. 5B), suggesting that the new annotation might provide additional insight on the patterns of lncRNA expression in the Harpegnathos brain. We recovered a set of lncRNAs with neuronal-specific expression profiles, some of which could not be detected using HSAL50 gene models (Additional File 6: Fig. S5B). One example was LOC112589360, which had over 20 times more mapping reads in HSAL51 compared to HSAL50 (Fig. 5B), with a corresponding increase in its calculated expression levels in neurons (Fig. 5C), as well as the fraction of neurons where this lncRNA could be detected, from 0.6% in HSAL50 to 10.1% in HSAL51 (Additional File 6: Fig. S5B).
In addition to these neuronal lncRNAs, we identified several cell type-specific lncRNAs in ensheathing glia. LOC112588339.LOC112588340 was merged from two adjacent lncRNAs annotated in HSAL50, with Iso-Seq reads clearly supporting the HSAL51 gene model (Additional File 6: Fig. S5C). The new merged gene model had negative coding potential as assessed by CPC and PhyloCSF [56, 57] and was one of the strongest markers of ensheathing glia (Additional File 6: Fig. S5D). Another updated lncRNA, LOC112590028, was specific to ensheathing glia, but missed by previous analyses due to low mapping rates in HSAL50 (Fig. 5D, E). The protein-coding gene adjacent to this lncRNA, windpipe (wdp), was also preferentially expressed in ensheathing glia (Fig. 5E and Additional File 6: Fig. S5E), suggesting potential co-regulation of the coding and noncoding transcript in cis as previously reported for other lncRNAs-mRNA pairs [58]. Drosophila wdp encodes a transmembrane protein with known functions in the wing disc [59] and the trachea [60], and it has also been implicated in synaptic target recognition [61] and learning [62]. While the sequence of the lncRNA LOC112590028 itself is not conserved, Drosophila has a lncRNA, CR44758 in the same position as LOC112590028, between wdp and Gp150 (Fig. 5F), suggesting that synteny of this locus, and potentially its molecular regulation, have been maintained over 350 million years of divergent evolution (Fig. 5F). In fact, wdp is also expressed specifically in Drosophila ensheathing glia (Fig. 5G) [44], further supporting a conserved regulation of this locus across distantly related insect species.
Thus, similar to protein-coding genes, lncRNA annotations were also improved by the addition of Iso-Seq data and this resulted in increased visibility of these regulatory transcripts in single-cell analyses.