Embryo collection and RNA extraction
Fly lines (w1118) were obtained from the Bloomington Stock Center and reared at room temperature on a standard fly medium with a 12-h light-dark cycle. The fly medium we used is composed of 6.2 g agar powder (ACROS N. 400400050), 58.8 g Farigel wheat (Westhove N. FMZH1), 58.8 g yeast (Springaline BA10), 100 mL grape juice; 4.9 mL propionic acid (Sigma N. P1386), 26.5 mL of methyl 4-hydroxybenzoate (VWR N. ALFAA14289.0) solution (400 g/L) in 95% ethanol, and 1 L water. One hundred to 150 flies were transferred to cages, which were sealed to a grape agar plate (1:1 mixture of 6% agar and grape juice). We used 4 separate cages to collect the embryos. The adults were kept overnight on this plate before being transferred to a new plate supplemented with yeast paste. Synchronization of eggs on this plate lasted for 2 h before being swapped with a new plate supplemented with yeast paste. We let the adults lay eggs for 30 min before removing the plate and letting the eggs incubate for the desired time.
Eggs were harvested using the following protocol. First a 1:1 bleach (Reactolab 99412) 1× PBS mix was poured on the plate and incubated for 2 min. During this incubation, we used a brush to lightly scrape the surface to mobilize the embryos. We then poured the PBS-bleach mixture through a sieve, washed the plate with 1× PBS, and poured the wash on the same sieve. We washed the sieve several times with 1× PBS until the smell of bleach disappeared. Single embryos were then manually transferred to Eppendorf containing 50 μL beads and 350 μL TRIzol (Life Technologies AM9738). The tubes were homogenized in a Precellys 24 Tissue Homogenizer at 6000 rpm for 30 s. Samples were then transferred to liquid nitrogen for flash freezing and stored at − 80 °C. For RNA extraction, tubes were thawed on ice, supplemented with 350 μL of 100% ethanol before homogenizing again with the same parameters. We then used the Direct-zol™ RNA Miniprep R2056 Kit, with the following modifications: we did not perform DNAse I treatment, we added another 2 min centrifugation into an empty column after the RNA Wash step, finally elution was performed by adding 8 μL of RNAse-free water to the column, incubation at room temperature for 2 min, and then centrifugation for 2 min. RNA was transferred to a low-binding 96-well plate and stored at − 80 °C.
Bulk RNA barcoding and sequencing
The BRB-seq is a technique for multiplexed RNA-seq [25] which is able to provide high-quality 3′ transcriptomic data at a low cost (e.g., 10-fold lower than Illumina Truseq Stranded mRNA-seq). The data (fastq files) generated from BRB-seq are multiplexed and asymmetrical paired reads. Read R1 contains a 6-bp sample barcode, while read R2 contains the fragment sequence to align to the reference genome.
Library preparation
RNA quantity was assessed using picogreen (Invitrogen P11496). Samples were then grouped according to their concentration in 96-well plates and diluted to a final concentration determined by the lowest sample concentration on the plate. RNA was then used for gene expression profiling using BRB-seq. In short, the BRB-seq protocol starts with oligo-dT barcoding, without TSO for the first-strand synthesis (reverse transcription), performed on each sample separately. Then, all samples are pooled together, after which the second strand is synthesized using DNA PolII Nick translation. The sequencing library is then prepared using cDNA augmented by an in-house-produced Tn5 transposase preloaded with the same adapters (Tn5-B/B), and further enriched by limited-cycle PCR with Illumina compatible adapters. Libraries are then size-selected (200–1000 bp), profiled using High Sensitivity NGS Fragment Analysis Kit (Advanced Analytical, #DNF-474), and measured using Qubit dsDNA HS Assay Kit (Invitrogen, #Q32851). In total, we generated four libraries. For details of the library information, please check Additional file 2: Table S13.
Sequencing
Libraries were mixed in equimolar quantities and were then sequenced on an Illumina Hi-Seq 2500 with pair-end protocol (read R2 with 101 bp) at the Lausanne Genomic Technologies Facility.
RNA-seq analysis
Generating expression matrix
The fastq files were first demultiplexed by using the “Demultiplex” tool from BRB-seqTools suite (available at https://github.com/DeplanckeLab/BRB-seqTools). Then, we trimmed the polyA sequences of the demultiplexed files by using the “Trim” tool. Next, the STAR aligner [47] was used to map the trimmed reads to the reference genome of fly Drosophila melanogaster (BDGP6, Ensembl release 91 [48]). Finally, the read count of each gene was obtained with HTSeq [49].
Filtering samples and genes
Low-quality samples need to be filtered out, since they might bias the results of downstream analyses. In order to assess sample quality, we calculated the number of uniquely mapped reads and of expressed genes for each sample [50]. We removed samples with < 0.3 million uniquely mapped reads or with < 4500 expressed genes (Additional file 1: Figure S1). We confirmed that these filtered samples are indeed outliers in a multidimensional scaling analysis (MDS) (Additional file 1: Figure S16). Since lowly expressed genes have a larger technical error, to minimize the technical noise, we need to remove lowly expressed genes as well. We first calculated counts per million (cpm) with the edgeR package [51] for each gene. Then, we removed genes with mean cpm across samples ≤ 1, as suggested by Lun et al. [50]. Finally, for the remaining genes, we re-transformed their cpm values to the original count values for the downstream normalization analysis. After filtering, we obtained an expression count matrix with 239 samples (Additional file 1: Figure S2) and 8004 protein-coding genes.
Normalization and batch effect correction
Because BRB-seq retains only the 3′ end of the transcript, we performed sample normalization by using quantile normalization with log transformation in the voom package [52], but without transcript length normalization. To remove potential batch effects across the four libraries, we applied the ComBat function in the sva package [53] to the normalized and log2 transformed expression data. For genes with expression values less than 0 after Combat, or with original expression values equal to 0, we change its values to 0 after Combat correction as suggested by Kolodziejczyk et al. [54].
Multidimensional scaling analysis
A number of factors could be invoked to explain the two groups observed in our multidimensional scaling analysis (MDS) (Fig. 1b), but they should also explain that only one group is structured according to the developmental time. The obvious hypothesis that they correspond to male and female embryos does not explain that structure and is also not supported by examining X/autosome gene expression ratios (Additional file 1: Figure S17). An alternative hypothesis is that samples in the small cluster are unfertilized eggs. If an egg is not fertilized, after completion of meiosis, the development will be arrested [55], but they are visually indistinguishable. This hypothesis is confirmed by at least two lines of evidence, in addition to the lack of developmental time structure. First, for expression correlation, all samples in the small cluster are highly correlated with unfertilized egg, while the correlations in the other samples gradually decrease with development (Additional file 1: Figure S3A). Second, all the samples from the small cluster are enriched with meiosis-related genes (Additional file 1: Figure S3B). Thus, we excluded the small cluster for downstream analyses, i.e., we used 150 embryos with an average of 18 individuals per developmental stage.
Metrics of expression variability
Expression variability is generally measured by the coefficient of variation (CV) [56]. However, a gene’s CV is strongly dependent on its RNA abundance (Additional file 1: Figure S4). While this is an inherent property of time-interval counting processes (such as a Poisson process), it makes the comparison of variability between different conditions difficult [54, 57]. Distance to median (DM, the distance between the squared CV of a gene and its running median) has been proposed as a variability metric that is independent of expression level [28, 54, 57]. However, the DM is still strongly negatively correlated with the mean expression level in our data (Additional file 1: Figure S5). To avoid this dependency, we developed another variability measure, the adjusted standard deviation (adjusted SD), by calculating the ratio between observed SD and expected SD. Following the same approach as Barroso et al. [58], we performed polynomial regressions to predict the expected SD from mean expression. Since the adjusted SD metric works much better than DM in terms of accounting for the confounding effects of mean expression (Additional file 1: Figure S6), we adopted it as a measure of expression variability in our study. As observed in yeasts [27, 28], we found that essential genes and hubs (proteins in the center of protein-protein interaction network) have lower expression variability than other genes (Additional file 1: Figure S7), indicating selection to reduce it. This observation provides a control that we are indeed measuring biologically relevant expression variability.
Detailed calculation of expression variability is as follows:
Adjusted SD
For each gene, we computed the standard deviation (SD) in each stage and over all stages. Then, we fitted a polynomial model to predict the global (across all stages) SD from the global mean expression. We increased the degrees of the model until there was no more significant improvement (tested with ANOVA, p < 0.05 as a significant improvement). Then, based on this best-fitting model, for each gene, we computed its predicted global SD based on its global mean expression. Finally, the adjusted SD of a gene in one stage is this gene’s SD in its corresponding stage divided by its predicted global SD. This method is derived from Barroso et al. [58], but computing adjusted SD rather than adjusted variance.
Distance to median: the distance between the squared coefficient of variation of a gene and its running median
For each gene, we computed its squared CV in each stage and over all stages. Then, we ordered genes based on their global (across all stages) mean expression. Next, we defined a series of sliding windows of 50 genes with 25 genes overlap, starting from the lowest global mean expression. Finally, the distance to median of a gene in one stage is the stage-specific log10 squared CV minus the median of global log10 squared CV in this gene’s corresponding window. R code was modified from the DM function of the scran package [50].
Bootstrap analysis
For each stage, we randomly sampled the same number of samples. Then, we computed the adjusted SD based on these random samples. We repeated the first two steps 500 times. Each time, we only kept the median of the adjusted SD for each stage. Thus, in each stage, we obtained 500 medians. Finally, we performed a Wilcoxon test to test the significance of the difference between the bootstrapped values of different stages.
ChIP-seq data analysis
Histone modification signal datasets
The signal data files of four euchromatin histone modification marks (H3K4me3, H3K9ac, and H3K27ac) at six developmental stages (0–4 h, 4–8 h, 8–12 h, 12–16 h, 16–20 h, 20–24 h) were downloaded from modENCODE [39] (NCBI GEO: GSE16013) (March 2018). The signal is smoothed, background-subtracted tag density. The signal was precomputed along the genome in 35-bp windows.
Histone modification signal for promoter
For each gene, we calculated the mean signal of its proximal promoter (2 kb upstream to 2 kb downstream for transcription start site (TSS)) by using the bedtools “map” command [59]. The TSS and transcription end site (TES) information was retrieved from Ensembl release 91 [48]. For a gene with several TSS and TES, we use its mean coordinates.
Histone modification signal Z score transformation
For each modification mark in each stage, the signal value was transformed into a Z score by subtracting the mean signal across intergenic regions and dividing by the standard deviation signal of the intergenic regions. The intergenic region was defined by removing all proximal promoter regions and gene body regions (TSS to TES) with the bedtools “subtract” command [59]. Since the three histone modification marks are largely enriched at promoter over intergenic regions in Drosophila [39], this allows to normalize between libraries. Then, for each gene, we re-calculated the mean signal (Z score) of its proximal promoter (2 kb upstream to 2 kb downstream for TSS) by using the bedtools “map” command [59].
Identification of stage specifically expressed genes
Following the same approach as previously [10], we first defined 8 stage-specific expressed artificial expression profile (Additional file 1: Figure S18A). Then, for each gene, we performed Pearson’s correlation between its real expression and this artificial expression. Finally, for each artificial expression profile, we kept genes with top 10% correlation coefficient as the corresponding stage specifically expressed genes (Additional file 1: Figure S18B).
Identification of hourglass expression variability genes
Similar to the stage specifically expressed gene identification approach, we correlated each gene’s variability profile with the median across all genes. Then, we kept genes with the top 10% correlation coefficient as the ones following the global hourglass variability profile.
Gene Ontology enrichment analysis
We performed Gene Ontology (GO) enrichment analysis for hourglass expression variability genes by using the topGO [60] R package with the “elim” algorithm.
Partial correlation
The R package “ppcor” [61] was used to compute Spearman’s partial correlation coefficient between histone modification signal and expression variability after controlling for the effect of promoter shape.
Transcriptome index analysis
A “transcriptome index” [62, 63] is a weighted mean of a feature over all genes, where the weights are the expression levels of the genes at each condition (e.g., developmental stage). The transcriptome index of phastCons was calculated as:
\( {\mathrm{TPI}}_s=\frac{\sum \limits_{i=l}^n{\mathrm{phastCons}}_i\times {e}_{is}}{\sum \limits_i^n{e}_{is}} \)
where s is the developmental stage, phastConsi is the promoter sequence conservation score of gene i, n is the total number of genes, and eis is the expression level (log-transformed) of gene i in developmental stage s.
Confidence interval analysis for transcriptome index
Firstly, we randomly sampled gene IDs from each original data set 10,000 times with replacement. Then, we computed transcriptome indexes for the 10,000 samples. Finally, the 95% confidence interval is defined as the range from quantile 2.5% to quantile 97.5% of the 10,000 transcriptome indexes.
Meiosis-related genes and transcription factors
The meiosis-related genes and transcription factors were downloaded from AmiGO [64] (May 2018).
Individual unfertilized eggs RNA-seq data
The normalized and log-transformed expression matrix of individual unfertilized eggs was downloaded from NCBI GEO: GSE68062 [65] (May 2018).
Promoter shape index
The promoter shape index was downloaded from [66]. (June 2019). Lower value means broader promoter.
Essential gene annotation and protein connectivity datasets
The gene essentiality and protein connectivity datasets were downloaded from OGEE v2 [67] (March 2018).
PhastCons score
The pre-computed sequence conservation score phastCons [32] of fly genome was downloaded from http://hgdownload.soe.ucsc.edu/goldenPath/dm3/phastCons15way/ (February 2018). Higher value means higher conservation.
Experimentally validated core promoter regions
Experimentally validated transcription start sites (TSSs) were downloaded from the Eukaryotic Promoter Database (EPD) [31] (May 2018). For a gene with several TSSs, we selected the most representative one (the TSS that has been validated by the largest number of samples). The core promoter region was defined as 49 bp upstream TSS to 10 bp downstream of the TSS [31]. We used EPD-defined TSSs here because they are more accurate for defining core promoters, whose function is expected to be related to sequence conservation. Whereas for histone modification signal for promoter, we used Ensembl-defined TSSs to be consistent with the source of transcription end site (TES) information, and precision was less important in defining broader proximal promoters.