Identification of transcription start sites and their clusters
To identify TSSs in the migratory locust, we mapped the oligo-capping sequencing reads from 14 libraries obtained from nine different tissues and organs, including the ovary, testis, wing, thoracic muscle, pronotum, labipalp, brain, fat body, and antenna (Additional file 1: Table S1). All of the oligo-capping libraries were sequenced using an Illumina NovaSeq 6000 System (150-bp paired-end reads). The sequencing of the oligo-capping libraries yielded 1893 million sequencing reads (284 Gb) in total, providing an unprecedented dataset for investigating the 5′ transcriptional start sites of mRNA transcripts in locusts. Only the read pairs that contained both the 5′ oligo-capping and 3′ oligo-capping adapters were mapped to the locust reference genome, with a mean mapping rate of 74.91%. The sequencing read pairs that were properly mapped to the reference genome were used for further analyses. The individual OTSSs were clustered along the genome into TSS clusters. The nucleotide composition of the OTSSs confirmed the absence of systematic G nucleotide addition bias, which is usually observed in the cap analysis gene expression technique (Additional file 1: Fig. S1). The number of OTSSs identified in each library ranged from 290,320 to 1,555,558, with a mean of 615,362 (Additional file 1: Fig. S2). The OTSS number (5,230,229) identified from the combined data of all the libraries was 3.36–18.02 times the number of TSCs identified in any single library (Additional file 1: Fig. S3), suggesting the necessity of investigating more tissues and organs to obtain a more comprehensive TSS landscape in locusts. The OTSSs located within the 1000 bp upstream of the translation start codon formed a broad distribution (Additional file 1: Fig. S4). As expected, the majority of OTSSs (66.94%) were mapped to the intergenic regions (Fig. 1a), indicating that widespread transcription is initiated from noncoding regions in the locust genome.
We identified TSCs, which are high-density regions of TSSs, by clustering individual OTSSs along the genome into TSS clusters. To avoid false TSCs, the TSCs with sequencing reads showing fewer than 3 read tags of TSCs per million (TPM) were not used in further analyses because they probably come from truncated transcripts and cryptic transcripts due to the inherent nature of the basic transcriptional machinery [22]. The number of TSCs identified in each library ranged from 22,858 to 47,615, with a mean of 36,229. The combined data from all libraries yielded 72,280 TSCs, which were used as the initial set of TSCs. In the initial set of TSCs, we observed a large proportion (56.39%) of 1-bp-wide TSCs in the intergenic regions (Additional file 1: Fig. S5).
To discover motifs within TSCs, we generated a consensus sequence logo of 25 bp surrounding the dominant TSSs in each TSC. The composition of [− 1, + 1] initiator dinucleotides revealed a severe overrepresentation of pyrimidine–purine (PyPu) dinucleotides in the non-1-bp-wide TSCs located in the 5′-UTRs, coding DNA sequences (CDSs), and intronic and intergenic regions but not in those located in the 3′-UTRs (Additional file 1: Fig. S6). However, PyPu dinucleotide initiators were absent in all of the 1-bp-wide TSCs except for the 1-bp-wide TSCs located in 5′-UTRs. These results imply that a considerable portion of the 1-bp-wide TSCs were derived from false-positive signals due to the experimental limitation of the oligo-capping method. Thus, we applied additional correction steps to remove putative false TSCs. A large proportion of the identified TSCs showed significant enrichment of the TGAG motif and its 1-bp-substitution variants upstream of the dominant TSS sites of TSCs (Additional file 1: Fig. S7 and Tables S2-S7). These results suggested that these 1-bp-wide TSCs were likely false TSCs derived from mis-hybridization of the 5′ oligo-capping adapters and internal RNA molecules (Additional file 1: Fig. S8). Therefore, the TSCs showing significant enrichment (observed number greater than 20 and a q-value of less than 1e−10) of the TGAG motif and its 1-bp-substitution variants were filtered for further analysis. The insert fragments generated in sequencing library construction showed a unimodal distribution with a peak at approximately 300–400 bp. Therefore, the internal false signals were generated from possible truncated mRNAs, and the insert fragment size in the 3′ ends did not obey the insert fragment limits along the mRNA transcripts (Additional file 1: Fig. S9). As expected, the 3′ end distribution (median = 352.3 bp; first quartile to third quartile, 229.1–557.3 bp) of the mean insert fragments inferred by determining the start sites of paired R2 (reverse) reads was consistent with the insert fragment distribution observed in sequencing library construction. The inferred 3′ end distribution of the insert fragments was unimodal and asymmetric, with a long tail to the right (Additional file 1: Fig. S10). Therefore, the TSCs showing deviation of the distance from the 3′ ends of the insert fragment to the start sites of the mRNAs were considered to come from possible truncated mRNAs and were therefore removed from further analysis above the threshold of the 90% quantile (Additional file 1: Fig. S11). It is worth noting that a total of 8247 (2252 in 1-bp-wide TSCs and 5995 in non-1-bp-wide TSCs) intergenic TSCs could be assigned to the mRNAs of protein-coding genes by paired R2 read linking. This result suggested that the start sites of mRNAs in the official gene set do not represent authentic TSS sites because of the technological inability to achieve sufficient 5′-UTR coverage using standard Illumina RNA-seq. Overall, we identified 38,136 TSCs in the final set after removing the false TSCs derived from adapter mis-hybridization and internal truncated sites. The width of most of the identified TSCs (median = 38 bp and 90% quantile = 133 bp in non-1-bp-wide TSCs) in the final set was less than 150 bp, which is consistent with that of Drosophila TSCs (median = 36 bp and 90% quantile = 152 bp in non-1-bp-wide TSCs) identified via the RAMPAGE method [23]. The performance assessment for TSC identification was judged on the basis of the distribution of the identified TSCs over gene bodies. Compared with that of the TSCs in the initial set, the higher enrichment of TSCs in the 5′ ends of protein-coding genes in the final set indicated that the application of the correction steps greatly improved the ability to distinguish between authentic TSCs and false TSCs (Fig. 1b). The composition of [− 1, + 1] initiator dinucleotides showed that the PyPu dinucleotide initiators are preferentially used as TSSs in the different genomic regions (Fig. 1c). Thus, these 38,136 TSCs in the final set are reliable TSCs in locusts.
Compared with distance-based promoter identification, promoter identification involving the paired-read-based assignment rule allows the more accurate assignment of core promoters to existing gene models and provides direct evidence for characterizing the transcription start site landscape of protein-coding genes. For each TSC, if an insert fragment with its 5′ end in the TSCs and its 3′ end in an annotated exon of a protein-coding gene was identified, the TSC was functionally linked to the gene. The 38,136 reliable TSCs in the final set were linked to annotated protein-coding genes based on gene structure information using the paired-read-based assignment rule. We thus assigned 48.0% (18,305 in 38,136) of the reliable TSCs to annotated protein-coding genes, and the remaining 52.0% represented potential initiation sites of unannotated nonprotein-coding genes. The comparison of biological replicates of the tissue and organ data confirmed the ability to quantify promoter expression using the oligo-capping method with excellent quantification reproducibility (Additional file 1: Fig. S12, Ps < 2.2e−16, Pearson’s R = 0.99; Pearson correlation coefficients were obtained using the TPM values of all of the detected TSCs). To identify the genic TSCs for which promoter activities are provided by transposable elements (TEs), we searched for TE-containing TSCs that drive the expression of annotated protein-coding genes. We identified a considerable proportion (14.64%, 2779 of 18,305) of reliable TSCs derived from 36 multiple families of TEs that drive the expression of 2190 annotated protein-coding genes. This observation was more prevalent among TSCs located in intergenic regions. Although all three major classes of locust TEs were present (non-LTR, LTR and DNA), the percentage of genic TSCs whose expression was driven by TEs was not directly proportional to the composition of TEs in the locust genome (Fig. 1d). For example, members of the RTE/BovB subfamily (constituting 244 Mb [4.05%] of the locust genome), which is the most prevalent TE subfamily in the locust genome [20], contributed to only 0.07% (2 in 2679) of the genic TSCs.
Characteristics of locust core promoters
We obtained 22,820 genic TSCs in Drosophila melanogaster (Additional file 1: Fig. S13) and used them to compare core promoter characteristics between locusts and fruit flies [7, 23]. Similar initiator (PyPu dinucleotide; notably, because no mutational analysis was performed, the identified PyPu dinucleotide was considered the overrepresented motif) elements were detected in the genic TSCs of locusts and fruit flies (Additional file 1: Fig. S14A). However, the analysis of the nucleotide composition flanking the PyPu dinucleotide of core promoters in locusts revealed a preference for 2-bp downstream T/A usage, which was not observed in fruit flies. By examining the AT contents of the 2-kb flanking regions of core promoters, we found two striking distinct patterns in the nucleotide composition of locusts and fruit flies. Unlike in fruit flies, enrichment of GC nucleotides in the 500-bp flanking regions of the core promoters of locusts was observed, emphasizing the preferential location of the GC-rich regions of locust core promoters (Fig. 2a). Although there is evidence of DNA methylation in gene bodies and repeat regions in locusts, the status of promoter methylation has not been explored because of the unavailability of promoter data [20, 24]. The relative depletion of CpG dinucleotides is negatively correlated with DNA methylation. Thus, we performed a normalized CpG content analysis to determine whether the increases in GC content were associated with the existence of DNA methylation. The fruit fly exhibited a unimodal (centered on approximately 0.93) normalized CpG content (CpG o/e, CpG observed/expected ratio) distribution (a signal of devoid of DNA methylation), and the CpG o/e values consistently remained at approximately 0.93 in the 2-kb flanking region of PyPu dinucleotide (Additional file 1: Fig. S15). However, the locust exhibited a broad distribution of CpG o/e values (Additional file 1: Fig. S16), and the 2-kb flanking regions of the locust core promoters exhibited signatures of gradual CpG restoration (Fig. 2b) as the distance to the dominant OTSS decreased [25]. Thus, the CpG occurrence peaks (approaching the right side of the bimodal CpG o/e value distribution) at the center of the locust core promoters are suggestive of the absence of DNA methylation in the core promoters of locusts despite the enrichment of GC nucleotides in the 500-bp flanking regions of core promoters.
To identify the well-accepted Drosophila core promoter elements, the consensus sequences (Additional file 1: Table S8) of the TATA-box, initiator (Inr), polypyrimidine initiator (TCT), motif ten element (MTE), and downstream core promoter element (DPE) were obtained from a recent review [26]. The consensus sequences were used in pattern matching of the putative Drosophila core promoter elements while allowing one mismatch. There was no obvious difference in the percentage of the Drosophila core promoter elements between the locust and fruit fly core promoters (Additional file 1: Fig. S14B). We extracted the random genomic sequences of which numbers are equal to the numbers of the genic TSCs identified in locusts (N = 18,305) and fruit flies (N = 22,820), respectively. The percentages of PyPu dinucleotide in the locust (85.38% in the genic TSCs and 6.87% in the random sequences) and fruit fly (86.49% in the genic TSCs and 23.75% in the random sequences) core promoters are statistically different (Ps < 2.2e−16, chi-squared tests) from those in the random sequences. The statistical differences (Ps < 2.2e−16, chi-squared tests) remain unchanged for the TATA-box elements in both locusts (18.19% in the genic TSCs and 10.45% in the random sequences) and fruit flies (17.61% in the genic TSCs and 10.47% in the random sequences).
Like in fruit flies, we observed an increase in AT contents in the 20 to 40 bp regions upstream of the PyPu dinucleotide in locusts; these are typical regions in which TATA-box elements are located (Additional file 1: Fig. S17). We performed a de novo motif discovery analysis to identify the potential enriched motifs in the 20 to 40 bp regions upstream of the PyPu dinucleotide in both species. Two different TATA-box motif variants were present in the 20 to 40 bp regions upstream of the PyPu dinucleotide in both species. In fruit flies, the TATA-box motif variant identified by our de novo motif discovery analysis was identical to the TATA-box motif (matrix profile POL012.1) deposited in the JASPAR database [27]. Although the four core nucleotides (TATA) were present in both species, the distinct hallmark of the TATA-box motif variant in locusts was a G/C preference in the 3 bp upstream region of the TATA core nucleotides (Fig. 2c), consistent with the increases in GC nucleotides in locust core promoters. The GC contents in the 3 bp region upstream of the four core nucleotides in locusts and in fruit flies are 92.3% and 53.5%, respectively.
Imprecise transcription initiation and symmetrical pattern of the SNP density of locust core promoters
Transcription can be initiated at precise genomic regions or dispersed genomic regions, a distinction referred to as promoter shape. Distinct promoter classes are defined based on the shape of the TSS distribution: sharp core promoters or broad core promoters [3, 28]. The sharp and broad structures of core promoters are largely conserved across species and are likely to be associated with different functional motifs, emphasizing distinct functional roles between different shapes [9, 29]. The Promoter Shape Score (PSS), a metric for describing promoter shape, was determined to characterize the shape of the locust core promoters. We classified the core promoters based on PSS values and examined the association between the promoter shape and TATA-box signal. On the basis of a previous study [30], the core promoters were divided into three categories: sharp (PSS > − 10) core promoters, intermediate (PSS ≤ − 10 and PSS > − 20) core promoters, and broad (PSS ≤ − 20) core promoters. Both locusts and fruit flies showed a broad PSS value distribution, suggesting that the transcription can be initiated from precise genomic regions to dispersed genomic regions in insects. We observed an obvious tendency for TATA-box signals to appear in upstream regions of PyPu dinucleotides in sharp core promoters in both locusts and fruit flies. However, the sharp core promoters of locusts showed stronger TATA-box signals than those of fruit flies (Fig. 2d), despite the lower overall TA content in the core promoters of locusts. The broad core promoters driving the transcription of ubiquitously expressed genes were TATA-less promoters in both species [31, 32]. To explore the roles of promoter shape in reflecting TSC expression specificity, we used τ to measure the expression specificity. τ varies from 0 to 1, where 0 indicates ubiquitous expression and 1 indicates restricted expression. Contrary to protein-coding genes with a bimodal distribution of τ scores [33], the τ scores in locusts were skewed towards classifying many core promoters as showing restricted expression (Additional file 1: Fig. S18) [34]. We performed Gene Ontology (GO) enrichment analysis to classify the core promoters with ubiquitous and restricted expression according to the functional annotation of linked protein-coding genes. The sets of ubiquitously expressed core promoters were predominantly enriched for GO categories associated with the general/basic functions such as ncRNA processing, translation, and regulation of RNA metabolic processes. Conversely, the core promoters with restricted expression patterns were enriched for specific biological processes related to synaptic transmission, neuron fate specification, regulating TF activities and signaling pathways, and response to light intensity (Additional file 1: Fig. S19). Like in fruit flies, the genic TSCs with ubiquitous expression patterns tended to form broad core promoters in locusts (mean PSS = − 33.62 in locusts and mean PSS = − 30.91 in fruit flies, Fig. 2e). The genic TSCs with restricted expression patterns tended to form sharp core promoters in fruit flies, whereas the genic TSCs with restricted expression patterns exhibited a broader distribution in terms of promoter width in locusts (mean PSS = − 15.87 in locusts and mean PSS = − 6.48 in fruit flies, P < 2.2e−16, Wilcoxon rank-sum test). Therefore, both sharp and broad core promoters in locusts can drive the transcription of protein-coding genes with restricted expression.
In both restricted and ubiquitously expressed TSCs, the locusts showed a greater proportion of PSS values towards the left tail (Fig. 2e) than fruit flies. Therefore, we asked whether the imprecision of transcription initiation in locusts is higher than that in fruit flies. Because imprecise transcription initiation in a genic TSC results in an increased number of OTSSs (OTSS diversity), we compared the number of OTSSs of genic TSCs between locusts and fruit flies. We found that the mean number of OTSSs of genic TSCs in locusts was significantly higher than that in fruit flies (P < 0.05, Wilcoxon rank-sum test). Furthermore, we assessed the OTSS diversity of genic TSCs using the Shannon index (H), which is a diversity index that takes into account not only the OTSS number but also the evenness of the relative usage of different OTSSs [35]. In general, the H values of locusts were significantly higher than those of fruit flies (Fig. 2f, mean H values = 2.84 in locusts and mean H values = 2.35 in fruit flies, P < 2.2e−16, Wilcoxon rank-sum test), suggesting increased OTSS diversity of genic TSCs in locusts. To exclude the potential influences of the different TSS profiling methods applied and unequal sequencing depths in the locust and fruit fly data, we performed down-sampling analyses to examine the robustness of the above results. The P value of the Wilcoxon rank-sum test remained significant in the down-sampled data (Fig. 2g, mean H values = 1.80 in locusts and 1.58 in fruit flies, P < 2.2e−16, Wilcoxon rank-sum test). To describe the mean relationship between TSC expression and OTSS diversity using a partitioning method, we grouped the genic TSCs into 10 bins based on their expression quantile ranges. The binscatter plot shows that the TSC expression in both locusts and fruit flies is significantly positively (Additional file 1: Fig. S20, Pearson’s R = 0.75 in locusts and 0.51 in fruit flies; Ps < 2.2e-16) correlated with OTSS diversity, suggesting that increases in TSC expression are generally achieved by the activation of transcription initiation from expanding OTSSs (increasing OTSS diversity within individual genic TSCs) in insects. When the genic TSCs were grouped into three categories based on PSS values, we found that the H values of the nonsharp core promoters in locusts were significantly higher than those in fruit flies (Fig. 2h, P < 2.2e−16 in the broad core promoters and P < 2.2e−16 in the intermediate core promoters, Wilcoxon rank-sum tests). However, a similar observation was not made for the sharp core promoters. Overall, the increased OTSS diversity of genic TSCs indicates a lower precision of transcription initiation of the broad and intermediate core promoters in locusts compared with fruit flies.
Genomic regions flanking TSSs are enriched in functionally important regulatory elements, which show depletion of single-nucleotide polymorphisms (SNPs) due to evolutionary conservation [36]. To determine the sequence variability of the genomic regions flanking TSSs, we extracted 1-kb fragments centered on TSSs and computed the position-specific density of SNPs using the resequencing data of locusts and fruit flies [37, 38]. We found two distinctive patterns of SNP density in the vicinity of dominant OTSSs in the two insect species (Fig. 2i); symmetrical and asymmetrical patterns of SNP density were observed in locusts and in fruit flies, respectively. The steep decline in the SNP density at approximately 250 bp upstream of dominant OTSSs suggests immediate constraints imposed by the presence of TFBSs or regulatory elements. However, the gradual decrease in SNP density from 1 kb upstream to the center of dominant OTSSs in locusts indicated fewer constraints on the distance from TFBSs or regulatory elements to TSSs in locusts than in fruit flies.
Alternative usage of core promoters of protein-coding genes in locusts
Based on the number of assigned core promoters, the protein-coding genes in locusts are divided into two categories: single-core-promoter genes and multicore-promoter genes. The multicore-promoter genes with two or more core promoters included 38.90% of the assigned protein-coding genes in locusts. The proportion of multicore-promoter genes in locusts was similar to that in fruit flies (38.71%). Compared with the fly genome, in which 37.78% of the genome consists of intergenic regions, the locust genome is greatly expanded, and a total of 73.67% of the locust genome consists of intergenic regions. The average length of the intergenic sequences in locusts was much longer than that in the fruit fly genome (217.63 kb in locusts and 4.42 kb in fruit flies). Therefore, despite the remarkable discrepancy in the sizes of intergenic regions between the species, the proportions of multicore-promoter genes are similar between locusts and fruit flies.
For multicore-promoter genes, the distributions of OTSSs between core promoters differed considerably in different tissues of locusts. The alternative usage of core promoters is also referred to as core-promoter shifting, which is used to quantify OTSS distribution dynamics among core promoters. To determine the prevalence of core-promoter shifting, the degree of shift (Ds value) was determined by calculating changes in the OTSS distribution between core promoters for each multicore-promoter gene. The distribution of Ds values approximately followed a normal distribution (Additional file 1: Fig. S21), implying the dynamic usage of core promoters. We found that 31.09% of multicore-promoter genes have undergone a significant shift in core-promoter usage in at least one tissue or organ (Ds values < − 1 or Ds values > 1, P < 0.05 and FDR < 0.1, chi-squared tests, Additional file 1: Fig. S22). This suggests pervasive variability of the 5′-UTR sequences of protein-coding genes in locusts, with implications for translation start site selection in a tissue-specific manner.
Adjacent and distant core promoters in locusts and flies
The density distribution of the distances from the annotated start codons to the farthest upstream genic core promoters showed a distinctive bimodal log distribution with a valley at 3 kb in locusts (Fig. 3a). However, only a unimodal distribution with a peak at approximately 100 bp was found in fruit flies. Thus, the genome size expansion resulted in a looseness of upstream regulatory elements in locusts. This observation held when all of the core promoters of each protein-coding gene were included (Additional file 1: Fig. S23). Therefore, the core promoters in these two species were classified into the adjacent and distant core promoters using a threshold of 3 kb. Compared with fruit flies (15.48%), locusts exhibited more than triple the number of distant core promoters (45.02%, P < 2.2e−16, chi-squared test). Thus, a considerable portion of protein-coding genes contain introns located between the coding and 5′-untranslated first exons, considering that the mean length of mRNA leaders is much shorter than 3 kb.
For the protein-coding genes with annotated mRNA leaders, the intron number in mRNA leaders was less than 3 in both locusts (99.98%) and fruit flies (99.16%), indicating the possibility of strong selection constraints against presenting many introns in mRNA leaders. In mRNA leaders, the intron lengths in locusts (median length = 14,302 bp, 25% quantile = 1957 bp, and 75% quantile = 34,602 bp for the mRNA leaders that have one intron [MLOI]; median length = 17,627 bp, 25% quantile = 8170 bp, and 75% quantile = 46,448 bp for the mRNA leaders that have two introns [MLTI]) were significantly longer (Ps < 2.2e−16, Wilcoxon rank-sum tests) than those in fruit flies (median length = 929 bp, 25% quantile = 173 bp, and 75% quantile = 3105 bp for MLOI; median length = 3068 bp, 25% quantile = 598 bp, and 75% quantile = 9430 bp for MLTI; Additional file 1: Fig. S24). Furthermore, in both locusts (P = 1.303e−06, Wilcoxon rank-sum test) and fruit flies (P < 2.2e−16, Wilcoxon rank-sum test), the mean intron length in MLTI was significantly longer than that in MLOI. The median length of the mRNA leaders in locusts was 133 bp (25% quantile = 70 bp and 75% quantile = 230 bp), which was similar (P = 0.9652, Wilcoxon rank-sum test) to that in fruit flies (median length = 117 bp, 25% quantile = 67 bp and 75% quantile = 263 bp). Significant increases (Ps < 2.2e−16, Wilcoxon rank-sum tests) in the length of mRNA leaders were accompanied by increases in intron numbers in both locusts and fruit flies (Fig. 3b). Furthermore, the mRNA leader lengths in locusts were significantly shorter (Ps < 2.2e−16, Wilcoxon rank-sum tests) than those in fruit flies in both MLOI and MLTI. However, no significant length difference in mRNA leaders between locusts and fruit flies was observed in the mRNA leaders without introns (MLIPs). With respect to the exon-level comparison of mRNA leaders, it was observed only in fruit flies that significant increases in exon length (Ps < 2.2e−16, Wilcoxon rank-sum tests) were accompanied by increases in intron numbers (Fig. 3c). The exon lengths of both MLOI and MLTI in locusts were significantly shorter (Ps < 2.2e−16, Wilcoxon rank-sum tests) than those in fruit flies. In addition, a comparison of the standard deviation showed that the exon lengths (mean = 118.5 bp) of MLTI presented less variance than those of either MLIP or MLOI in locusts (Fig. 3d).
Both the single-core promoters (34.23% in locusts and 4.70% in fruit flies are distant core promoters, P < 2.2e−16, chi-squared test) and multicore promoters (68.24% in locusts and 37.90% in fruit flies are distant core promoters, P < 2.2e−16, chi-squared test) genes of locusts exhibited a significantly greater number of distant core promoters than those of fruit flies. Furthermore, the multicore-promoter genes presented a significantly greater number of distant core promoters than the single-core-promoter genes did in both locusts and fruit flies (Ps < 2.2e−16, chi-squared tests). One distinctive difference between the two types of core promoters in the two species was their transcriptional abundance. As expected, the majority of the adjacent core promoters in fruit flies showed higher transcriptional abundance than the distant core promoters. The transcriptional activities of distant core promoters were significantly weaker than those of adjacent core promoters in fruit flies. However, the scatter plot of TSC expression quantiles indicated that a substantial portion of the distant core promoters in locusts showed high transcriptional activity (Fig. 4a). Furthermore, we observed a weak but significant positive correlation (Pearson’s R = 0.12 and P = 5.192e−15) between the distances from the annotated start codons to the upstream core promoters and TSC expression (TPM) levels in distant core promoters in locusts. However, no similar positive correlation was observed for either the adjacent core promoters in locusts or the adjacent/distant core promoters in fruit flies (Fig. 4b). Broad core promoters were detected in the adjacent core promoters of both locusts and fruit flies (Fig. 4c). In fact, a depletion of distant broad core promoters was observed in the broad core promoters of fruit flies. However, broad core promoters were found in a substantial proportion of distant core promoters in locusts.
To determine whether the overrepresentation of distant broad core promoters in locusts reflects TSC expression specificity, we used τ to measure expression specificity. No significant differences in the TSC expression specificity of distant core promoters were detected between locusts and fruit flies (Ps > 0.05, Wilcoxon rank-sum test), suggesting that both ubiquitous and restricted expression patterns are present in the distant core promoters of locusts and fruit flies. Considering all of these results together, the detection of broad core promoters with restricted expression patterns in locusts reflects only distinct fundamental aspects of gene regulation in locusts and fruit flies, suggesting the lower precision of transcription initiation of locust genic TSCs with restricted expression patterns.
Considering TSC expression in relation to the shape dynamics of core promoters, we performed a sliding window analysis using PSS values and TSC transcriptional abundances to determine the relationship between the core promoter shape and TSC expression. By plotting the PSS values and the expression quantiles of each sliding window, we found that the increases in TSC expression were generally accompanied by decreases in PSS values (Fig. 4d). Enhanced TSC transcriptional abundance gradually results in a broader range of core promoters in both locusts and fruit flies. Therefore, increases in TSC expression are generally achieved by the activation of transcription initiation from expanding OTSSs in insects rather than increased TSC expression of sharp promoters within a few OTSSs, demonstrating a positive correlation between gene expression and PSS values. Compared with distant core promoters, adjacent core promoters showed higher expression activity of genic TSCs in locusts and fruit flies. In addition, similar to fruit flies, the spatial distribution of OTSS signals varied considerably among adjacent core promoters in locusts, spanning a range of promoter shapes from sharp to broad. However, the line termini of the distant core promoters of locusts approached towards those of adjacent core promoters. Thus, unlike fruit flies, the distant core promoters in locusts showed higher expression activities of genic TSCs and became broader with respect to promoter width.
Distant core promoter emergence in the context of genome size evolution in insects
To examine the presence of distant core promoters in the context of genome size, we further generated the oligo-capping data from seven arthropod species (six insect species and one chelicerate species), the genome sizes of which are much smaller than the locust genome (Additional file 1: Table S9). The involved insect species, whose genomes represent a wide range of sizes, included Tribolium castaneum (red flour beetle, Coleoptera, ~ 0.17 Gb), Bombus terrestris (buff-tailed bumblebee, Hymenoptera, ~ 0.25 Gb), Helicoverpa armigera (cotton bollworm, Lepidoptera, ~ 0.34 Gb), Laodelphax striatellus (small brown planthopper, Hemiptera, ~ 0.54 Gb), Acyrthosiphon pisum (pea aphid, Hemiptera, ~ 0.54 Gb), and Aedes aegypti (yellow fever mosquito, Diptera, ~ 1.28 Gb). We also included a chelicerate species, Tetranychus urticae (two-spotted spider mite, Trombidiformes, ~ 0.09 Gb), as it has the smallest genome size among those of the sequenced arthropod species. The same TSC identification and false TSC removal approaches that were used in the locust data were applied in these arthropod data (Additional file 1: Table S10). The presence of the PyPu dinucleotide initiators guarantees the authenticity of the identified TSCs (Additional file 1: Fig. S25). The genic core promoters, which were linked to the annotated protein-coding genes using paired-read-based assignment rule, showed strong enrichment in the 5′ ends of the gene body (Additional file 1: Fig. S26). Except for the yellow fever mosquito, a unimodal log distribution of the distances from the annotated start codons to the farthest/all genic core promoters upstream was observed in all of the arthropod species, reinforcing our results in the comparison of locusts and fruit flies (Fig. 5a and Additional file 1: Fig. S27). Compared with the migratory locust, the yellow fever mosquito (~ 1.28 Gb) also exhibited a bimodal log distribution with a minor peak shifting towards shorter distances, indicating a lower proportion of distant core promoters in this species. Furthermore, the significant positive correlation between the genome size and the number ratio of distant core promoters to adjacent core promoters suggests that the genome size expansion results in the emergence of distant core promoters to initiate distant transcription (Fig. 5b). Because TEs are the dominant contributors to overall genome size variability in metazoan species [39], we next investigated the contribution of TE insertion into the upstream regions from the start codon to its corresponding core promoter. In the dominant portion of adjacent core promoters, the TE sequences were not detected in the upstream regions from the start codon to the adjacent core promoter, suggesting a strong resistance of TE insertion in adjacent transcription initiation (Fig. 5c, bottom panel). However, in the upstream regions from the start codon to the distant core promoter, the TE sequences could be detected in a large portion of distant core promoters. The decrease in genome size was accompanied by decreases in the average TE coverage in the upstream regions from the start codon to the distant core promoter (Fig. 5c, top panel).
The genomic organization of sequence-specific TFBSs represents a profound recognition source for transcriptional initiation. We evaluated the extent of TF-mediated regulatory elements by analyzing the TFBS occurrence and used the dominant TSSs as the reference point to determine the spatial distribution of TFBSs. TFBSs are enriched from − 125 to 0 bp upstream of the dominant TSS, indicating a positioning bias of TFBSs relative to the TSS (Additional file 1: Fig. S28). To explore the different preferences of sequence context in which the TFBS motif occurs, we compared genome-wide TFBS abundances between adjacent and distant core promoters via the number of TFBSs per core promoter per TF (Fig. 5d). A greater number of TFs showing significantly higher TFBS abundance is observed in the distant core promoters. However, only a few TFs showing significantly higher TFBS abundance are observed in the adjacent core promoters. This suggests that the overall architecture of TF-mediated regulation in distant transcriptional initiation is more variable than that in adjacent transcriptional initiation. To estimate the extent of sequence divergence of TFBSs between the adjacent and distant core promoters, we used a normalized Shannon entropy (SE) as a measure of the degree of conservation of TFBSs for each TF. The higher the SE value is, the higher the sequence divergence of TFBSs. In the migratory locust, the distant core promoters show significantly higher values (P < 0.05, Wilcoxon rank-sum test) of SE than adjacent core promoters, suggesting a high variation in TFBSs in distant core promoters (Fig. 5e). However, in the eight other species, the mean values of SE are lower in distant core promoters than in adjacent core promoters. A significantly lower value of SE was observed in five of the eight species involved.