We first analyzed gene expression datasets from preimplantation human [50, 51] and mouse [50, 52] embryos and identified genes that have higher or lower abundance in the blastocyst compared to the zygote, 2-cell, or 8-cell embryo stage. To ensure robustness of the findings, we used two independent datasets for each of human and mouse, respectively. Each dataset was generated using a different quantification methodology (microarray and deep sequencing, respectively). We thresholded and analyzed each of the four datasets separately and found the results to be reproducible (Additional file 1: Supplemental Figure S1). Out of an average of 12,015 genes in each dataset, we found 2709 statistically significantly up-regulated and 5286 down-regulated genes in the blastocyst with respect to earlier developmental time points (false discovery rate, FDR ≤ 5%; Additional file 2: Supplemental Table S1). Among the genes that are more abundant in the blastocyst are mitochondrial membrane transports (e.g., TOMM6 and TIMM13), glutathione metabolism genes (e.g., GPX4, GSTP1, and GSTO1), ribosomal proteins (e.g., RPL4 and RPL6), and metabolic genes (e.g., HK1, IDH3B, and TKT). On the other hand, notable genes among those with lower abundance in the blastocyst include NCOA1, AK5, GRK5, ITGA9, and CLOCK. Examining the associated pathways, we found that ribosome, glycolysis, citric acid cycle, and oxidative phosphorylation are enriched among the genes that are more abundant in the blastocyst (Additional file 2: Supplemental Table S1). On the other hand, signaling pathways (including MAPK, cAMP, JAK-STAT, and Wnt) are enriched among the genes that are more abundant in the zygote (Additional file 2: Supplemental Table S1). These results are in agreement with previous studies [26, 53] and provide a robust dataset for further mining.
Biases in length and repetitive-element content among expressed genes change monotonically with the preimplantation developmental stage
In zebrafish, the genes that are expressed during the transition from the zygote to a highly proliferative population of cells exhibit length biases [26]. We hypothesized that a similar bias may characterize human and mouse genes as well [21, 22].
We computed the distributions of the exonic and intronic lengths in nucleotides (nts) for the genes that are differentially abundant between the blastocyst and earlier embryonic stages, i.e., the zygote, 2-cell, or 8-cell embryo depending on the study (see the “Materials and methods” section; Additional file 3: Supplemental Table S2), and juxtaposed them to the respective length distributions of all expressed genes in each dataset (background). We found that genes with higher abundance in the blastocyst compared to earlier embryonic time points have significantly shorter exons and introns (P value < 10−4; Kolmogorov-Smirnov test). On the other hand, genes with lower abundance in the blastocyst have significantly longer exons and introns (P value < 10−4; Kolmogorov-Smirnov test). These observations hold true for both human (Fig. 2a, b) and mouse (Fig. 2c, d) embryos.
In addition to being shorter, the genes with higher abundance in the blastocyst compared to respective earlier embryo stages had more of their genomic span occupied by exons (P value < 10−4; Kolmogorov-Smirnov test). Notably, the opposite holds true for genes whose abundance is lower in the blastocyst compared to the respective earlier embryo stages (right panels; Fig. 2a–d).
We note that these observations remain unchanged even when we form the background distribution by considering all human or mouse protein-coding genes (Additional file 1: Supplemental Figure S1B-C).
The differences in the exonic and intronic lengths of those two groups of genes prompted us to also examine their nucleotide composition for other possible biases. In particular, we investigated whether the introns and exons of the genes that are up-regulated or down-regulated in the blastocyst are enriched or depleted in any families of repetitive elements. We used Monte Carlo simulations (see the “Materials and methods” section), distinguishing between sense and antisense instances of repetitive elements with respect to the orientation of the genes at hand. For this analysis, we calculated “repetitive-element content per unit length” in order to account for the fact that different genes have different lengths (see the “Materials and methods” section for more details). In Fig. 3, we show heatmaps of the Z-scores that capture the calculated enrichments and depletions with respect to a random-generated background distribution: in all instances, the corresponding FDR value is ≤ 5%. Additional file 4: Supplemental Table S3 lists the various Z-scores and associated FDR values.
Figure 3 makes it strikingly evident that the genes that have higher abundance in the blastocyst compared to respective earlier embryonic stages are also denser in repetitive elements than would have been expected by chance. On the other hand, the genes that have lower abundance in the blastocyst are depleted in repetitive elements. This observation holds true for both exons and introns in human (Fig. 3a, b) and mouse (Fig. 3c, d), and for both orientations of the repeats with regard to the genes' sequences. Of note, introns are enriched or depleted in more categories of repetitive elements than exons.
The repetitive elements whose sequences are over- or under-represented in the examined sequences include DNA transposons, Long Terminal Repeats (LTR), short interspersed nuclear elements (SINE), and the L1 category of long interspersed nuclear elements (LINE). SINE elements are most enriched among the genes whose abundance is higher in the blastocyst, in both humans (Alu, MIR) and mice (B elements, MIR) and with Z-scores as high as + 10.1 (FDR < 10−13). On the other hand, SINE and other repeat categories are depleted among the genes whose abundance is lower in the blastocyst, with Z-scores as low as − 13.0 (FDR < 10−19).
The L1 category represents an exception in the above observations. This is best exemplified by the mouse dataset described in Fig. 3d. As can be seen, the introns of the genes with higher abundance in the blastocyst are depleted in both sense and antisense L1 elements (average Z-score of -6.4; FDR < 5%) whereas the introns of the genes with low abundance in the blastocyst are enriched in antisense L1 elements (average Z-score = + 5.2; FDR < 5%).
One important characteristic of the developmental stages studied here is zygotic genome activation (ZGA) [54]. It is conceivable that the observed differences in transcript composition, transcript length, and repetitive-element biases might be associated with transcripts transcribed de novo after ZGA. To examine this possibility, we focused on the human and mouse datasets of Xie et al. [50]. Specifically, and for different time points for mouse and human embryos, we identified the genes that are up-regulated as the zygotic genome is activated (Additional file 2: Supplemental Table S1) [54]. We found that both the exons and the introns of the corresponding sets of genes are shorter than the background gene population (P value < 10−4; Kolmogorov-Smirnov test) and are enriched in the same repetitive-element families shown in Fig. 3 (Additional file 1: Supplemental Figure S2A-B; Additional file 4: Supplemental Table S3). Moreover, we found that the ZGA-related genes overlap significantly with the genes that have higher abundance in the blastocyst (P value < 10−4; hypergeometric test)—see Additional file 1: Supplemental Figure S2C. This indicates that ZGA follows the same architectural patterns but is only part of the transition from the zygote to the blastocyst.
Collectively, the above results suggest that the genes that are expressed during the preimplantation embryogenesis, including ZGA, exhibit specific patterns in terms of gene architecture and sequence content.
Examples of protein-coding genes having conspicuous overlaps with repetitive elements
The human hexokinase 1 gene, HK1, is located on chromosome 10 where it spans ~ 132 kilobases (Kb). Its exonic length is ~ 4.5 Kb, representing 3% of the gene’s total span. Those of the Alu sequences that are sense to this gene are located solely in its introns and span a grand total of ~ 17 Kb. An additional 16 Kb of Alu sequences are antisense to this gene’s span. In other words, almost one fourth of HK1’s genomic span contains Alu sequences, either in sense or in antisense orientation. Similar observations can be made for the mouse orthologue Hk1: its overlap with B elements, in either sense or antisense orientation, is ~ 19%. The density of MIR elements is also consistent in this gene between the two organisms. Approximately 4% of the human orthologue and 7% of the mouse orthologue correspond to MIR sequences in either sense or antisense. Similar observations can be made for TKT, RPL14, and KRT8 as well, all of which are differentially abundant and part of the enriched pathways that include metabolism and the ribosome (Additional file 2: Supplemental Table S1).
These examples point to the considerable size of the overlap of repetitive elements on genes and hint to potentially consequential associations at the intersection of genomic architecture, evolution, and developmental stage. We examine these matters and their ramification further in the “Discussion” section.
The up-regulated and down-regulated genes contain unique pyknon signatures while the pyknons they have in common correspond to SINE/Alu elements
To obtain a more detailed perspective on the extent of sequence similarities between the two groups of genes with opposite expression behavior, we examined their pyknon composition from a qualitative perspective (Additional file 3: Supplemental Table S2). Pyknons are present in virtually all mRNAs [40], overlap with repetitive elements [47], and have been shown to be functionally active in several contexts [43, 48, 49]. As they are short in length, they can be used to conduct more granular analyses than would have been possible using the repetitive elements of RepeatMasker.
We identified 7828 distinct pyknons that overlap the exons of genes that are up-regulated in human blastocyst and 81,032 ones in genes that are down-regulated; 2782 pyknons are shared by the two gene sets (Fig. 4a and Additional file 2: Supplemental Table S1). We found that there are 4353 and 64,696 pyknons uniquely present in the exon spans of up- and down-regulated genes, respectively (Fig. 4a). More than 90% of the genes in each of the up- or down-regulated gene sets contain at least one pyknon. The exons of the down-regulated genes have a higher density in pyknons (Fig. 4b). On the other hand, there is no appreciable difference in the pyknon density of the introns of the up-regulated and down-regulated genes (data not shown).
As pyknons are by definition repeated motifs on the human genome, we examined how they related to the repetitive-element families and distribution biases of Fig. 3a. Specifically, for the pyknons that are unique to the down-regulated and to the up-regulated genes, respectively, we collected all their genomic instances and intersected them with the known repetitive elements. We then calculated the frequency by which they appeared within each type of repetitive element. For the pyknons that are unique to either the up-regulated or the down-regulated genes, or only in the non-DE genes, most of them (~ 45%) can be found within LINE/L1 elements of the genome whereas a smaller proportion (~ 20%) overlapped with SINE/Alu elements (Fig. 4b). Interestingly, the pyknons that were common in all three gene groups mostly overlapped with SINE/Alu (67% of the pyknons). ERV elements were also significantly enriched (Fig. 4b).
Collectively, this analysis positions pyknons as sequence markers for how the gene will change expression during preimplantation development. These findings also suggest that not all members of a family of repetitive elements are equal in this regard: evidently, the pyknons can effectively partition known families of repeats into subsets each of which is associated with the down-regulated genes, up-regulated genes, and unchanged genes, respectively. The findings further support logical connections—presumably ones that capture regulatory events in nature—between repetitive elements and mRNAs that are expressed in early embryogenesis.
The architecture of early-expressed genes mirrors that of genes comprising the stem cell signature
Above, we showed that early-expressed genes have specific architectural characteristics. The transition from the zygote to the blastocyst can be viewed as the onset of a proliferative phenotype and, for a portion of the cells of the blastocyst, the establishment of a stem cell identity [55]. Considering the latter, we hypothesized that the genes whose abundance is higher in the blastocyst as compared to subsequent embryonic stages may be part of the known stem cell expression signatures.
To test this hypothesis, we downloaded and analyzed the genes involved in the PluriNet protein-protein interaction network [56]. This network is shared among pluripotent stem cells and was generated based on a multitude of stem cell samples. We note that we examined the PluriNet genes with reference to all human protein-coding genes, independent of the genes’ levels of expression at the blastocyst stage.
We found that the genes forming the PluriNet network have shorter lengths (Fig. 5a). Specifically, both exons (top panel of Fig. 5a) and introns (bottom panel of Fig. 5a) are statistically significantly shorter than the background population of human protein-coding genes (P value < 10−4; Kolmogorov-Smirnov test). When we examined the exons of these genes, we did not find any repetitive-element family to be enriched or depleted. However, when we examined the introns of these genes, we found them to be enriched in sequences from DNA repeats, Helitron and Alu elements (FDR < 5%; Fig. 5b; Additional file 4: Supplemental Table S3).
Next, we identified the mouse orthologues of these genes and examined their overlap with mouse repetitive elements. Exons were again found to lack any notable attributes. However, introns exhibit significant biases (Fig. 5c). Specifically, the intronic regions of the mouse orthologues of the PluriNet signature are significantly denser in B1 and B2 SINE elements in both sense and antisense orientations (FDR < 5%; Fig. 5c; Additional file 4: Supplemental Table S3). We note that L1 elements show an inverse behavior and are depleted in the introns of these genes (FDR < 5%; Fig. 5c; Additional file 4: Supplemental Table S3).
These results parallel the above observations from early development and support the view that the genes that form the signature of a stem cell phenotype have specific structure and content in both human and mouse.
Gene expression trajectories of differentiation and organogenesis involve longer genes that are less dense in repetitive elements
Having observed that the state of pluripotency is characterized by shorter genes whose exons are enriched in repetitive sequences, we examined whether repetitive elements differ in cells of different lineages or in differentiating cells, and whether lineage-specific genes share common characteristics.
We first analyzed the blastocyst lineage signatures that were defined by the Petropoulos et al. study [51]. We examined the lineage-specific genes that the study reported for trophoectoderm (TE), primitive endoderm (PE), and epiblast (EPI), respectively. We found that the genes in PE and TE had significantly longer introns (P value < 10−2; Kolmogorov-Smirnov test) but not exons (Fig. 6a). The introns of the PE-specific genes exhibited a stronger length bias than the exons. The weaker P values in the case of TE could be explained by the relatively low number of genes. Moreover, the PE-specific genes were depleted in Alu elements (Additional file 4: Supplemental Table S3). While limited, these results provide independent support of our findings on gene length and lineage-specific genes.
We then analyzed genes from human embryo at the stage of organogenesis [57], specifically, genes whose expression significantly changes by the 9th week as compared to the 4th week. We found that the genes whose abundance increases during organogenesis have both long exons and long introns (P value < 10−4; Kolmogorov-Smirnov test; orange curves on Fig. 6a). On the other hand, the exons and introns of genes whose abundance decreases during organogenesis are on average shorter (P value < 10−4; Kolmogorov-Smirnov test; blue curves on Fig. 6b). When we examined the repetitive-element density of these genes, we observed significant trends. Genes with decreasing abundance during organogenesis are enriched in repetitive sequences; on the other hand, genes with increasing abundance during organogenesis are depleted in repetitive sequences (FDR ≤ 5%; heatmap of Fig. 6b; Additional file 4: Supplemental Table S3). Specifically for Alu elements, the introns of the genes whose abundance increases during organogenesis were significantly sparser in both sense and antisense instances of Alu sequences. Inversely, the introns of the genes whose abundance decreases during organogenesis are significantly denser in Alu sequences. However, the MIR elements, and to a lesser extent the LINE/L1 elements, show the opposite trend (Fig. 6b; Additional file 4: Supplemental Table S3). MIR and Alu elements are highlighted with a red rectangle on Fig. 6b.
We further looked into differentiation, we studied the cases of H1 and H9 human embryonic stem cells forming differentiated embryoid bodies in culture [58] and identified those genes whose abundance changes between the differentiated embryoid bodies and undifferentiated stem cells (Additional files 2 and 3: Supplemental Tables S1 and S2). In both H1 and H9 cells, we found significant biases in the lengths of genes that change in abundance during differentiation (Fig. 6c): the exons of genes whose abundance decreases (resp., increases) with differentiation are significantly shorter (resp., longer) than the background population of genes (P value < 10−4; Kolmogorov-Smirnov test). Similar observations can be made for the introns of the H1 cells (Fig. 6c). For H9 cells, it is only the introns of up-regulated genes that were statistically significantly different (P value < 10−4; Kolmogorov-Smirnov test; Fig. 6c). We note that the stronger statistical differences are found in the introns of the differentially expressed genes. In terms of repetitive-element content, the introns of H1 and H9 genes whose abundance increases with differentiation had strong and statistically significant depletion in Alu element density (Z score < − 10; FDR < 5%; Additional file 4: Supplemental Table S3).
To examine the differentiation process in more detail, we integrated the data from Xie et al. [50] with the ones from Cardoso-Moreira et al. [59]. The latter dataset includes expression values from seven different tissues from multiple developmental time points in both human and mouse. These datasets were obtained from different laboratories using distinct platforms (microarrays and RNA-sequencing). Consequently, the expression values of genes are not directly comparable without proper normalization. For instance, normalizing to housekeeping genes will produce erroneous results because the expression of ribosomal and metabolic genes, like mouse Gapdh, changes during development (see Additional files 2 and 3: Supplemental Tables S1 and S2). To overcome this limitation, we rank-normalized the datasets and considered as differentially abundant those genes whose ranking differed significantly between the compared datasets (see the “Materials and methods” section). Among these differentially ranked genes, the vast majority were common in all seven tissues, in both human and mouse (Additional File 1: Supplemental Figure S3 a-d; Additional File 2: Supplemental Table S1). However, there were tissue-specific gene changes, like the unique upregulation of prothrombin in the liver or the upregulation of Gene Ontology terms related with cardiac muscle development in the heart (Additional File 2: Supplemental Table S1).
We examined the architecture of those genes whose expression differed between the blastocyst and all seven tissues. The genes that were more abundant in the developing tissues as compared to the blastocyst had longer exons and introns in both human and mouse (P-value < 10-2; Kolmogorov-Smirnov test; Fig. 6d, e). On the contrary, the genes with lower abundance in the seven tissues had shorter introns and exons in human (P-value < 10-4; Kolmogorov-Smirnov test; Fig. 6d, e).
We also analyzed the repetitive-element content of the differentially expressed genes. In humans, we found a global bias in the density of repeats: the genes with higher abundance in all developing tissues were significantly sparser in repetitive elements whereas those with lower abundance were significantly richer in repeats (Fig. 6d; Additional File 4: Supplemental Table S3). In mouse, the differentially abundant genes exhibited less of a repetitive element bias compared to human (Fig. 6e; Additional File 4: Supplemental Table S3).
Collectively, the results of the previous sections and those shown in Fig. 6 suggest that differentiation follows a trajectory that is essentially the opposite to the one followed when establishing a proliferative/pluripotent phenotype (Fig. 3). At the same time, the case of Alu and MIR elements (Fig. 6b) indicates that the process of differentiation, as captured in Fig. 6a–c, is more complex than merely the inverse of establishing the pluripotency state (Figs. 2 and 3).
In differentiated tissues, tissue-specific genes are longer and repeat-depleted whereas ubiquitously expressed genes are shorter and repeat-enriched
Our results so far refer to differentiating cells during embryogenesis and do not necessarily describe the attributes of differentiated cells. Thus, we investigated whether length and repeat-element biases exist in differentiated tissues such as those found in the Genotype-Tissue Expression (GTEx) repository [60]. Specifically, we investigated the possibility of such biases in genes that are specific to each tissue.
We formed tissue-specific gene signatures using our previously developed machine learning approach for extracting models from “binary” expression profiles [61]. In these profiles, each gene is labeled as “expressed” or “not expressed” in a dataset based on whether its abundance exceeds a stringent threshold (Additional file 5: Supplemental Table S4). We demonstrated previously that this methodology can distinguish among 32 different cancer types (from different tissues) [61]. Additionally, the methodology allows us to identify the transcripts with the most discriminatory power [61]. We applied this scheme to the GTEx cohort and found 1505 tissue-specific genes that can discriminate among the 27 normal tissues (Fig. 7a) and also 1340 widely expressed genes, i.e., genes found expressed across all tissues (see the “Materials and methods” section; Additional file 1: Supplemental Figure S4; Additional files 1 and 5: Supplemental Tables S1 and S4).
We compared the tissue-specific and the widely expressed genes from the standpoint of length and sequence biases. The widely expressed genes are enriched in the housekeeping pathways that we found to be abundant in the blastocyst, including the ribosome, oxidative phosphorylation, the citric acid cycle, and spliceosome (Additional file 2: Supplemental Table S1). On the other hand, the genes that comprise the tissue-specific signatures are significantly enriched in homeobox-containing genes and signaling and developmental processes (Additional files 2 and 4: Supplemental Tables S1 and S3; Additional file 1: Supplemental Figure S5). Intriguingly, nine of the 10 most important genes are transcription factors (TBX15, FOXF1, TWIST1, and six HOX genes), whereas the tenth is the kinase-encoding gene SKAP2 (Additional file 5: Supplemental Table S4).
The length characteristics of the widely expressed and tissue-specific groups of genes exhibit opposite trends. The widely expressed genes have significantly shorter exons and introns. The tissue-specific genes have significantly longer introns (P value < 10−4; Kolmogorov-Smirnov test). See also Fig. 7b, c and Additional file 4: Supplemental Table S3.
The repetitive-element content of these two groups also exhibits opposite trends. The introns of the widely expressed genes are strikingly enriched in repetitive elements, particularly Alu’s, in both sense and antisense orientations (Fig. 7c). The LINE/L1 category was again a noteworthy exception: the introns are significantly depleted in L1 elements. On the other hand, the tissue-specific gene sets are depleted in repetitive elements but in comparatively fewer categories. We note that, again, SINE/Alu elements exhibit the greatest depletions (Fig. 7c; Additional file 4: Supplemental Table S3).
Collectively, the dichotomy we observe between widely expressed and tissue-specific genes regarding their length biases and repetitive-element content mirrors what we observed in previous results (Figs. 2, 3, and 5): the genes with higher expression in a pluripotent/proliferative state are shorter, repetitive-element rich and represent pathways that are often considered as housekeeping. In contrast, gene sets that establish tissue identity have longer introns, on average; are repetitive-element sparse; and include signaling and transcription factor processes.