- Research article
- Open Access
Classifying human promoters by occupancy patterns identifies recurring sequence elements, combinatorial binding, and spatial interactions
© Yang and Vingron. 2018
- Received: 14 March 2018
- Accepted: 4 October 2018
- Published: 15 November 2018
Characterizing recurring sequence patterns in human promoters has been a challenging undertaking even nowadays where a near-complete overview of promoters exists. However, with the more recent availability of genomic location (ChIP-seq) data, one can approach that question through the identification of characteristic patterns of transcription factor occupancy and histone modifications.
Based on the ENCODE annotation and integration of sequence motifs as well as three-dimensional chromatin data, we have undertaken a re-analysis of occupancy and sequence patterns in human promoters. We identify clear groups of CAAT-box and E-box sequence motif containing promoters, as well as a group of promoters whose interaction with an enhancer appears to be mediated by CCCTC-binding factor (CTCF) binding on the promoter. We also extend our analysis to inactive promoters, showing that only a surprisingly small number of inactive promoters is repressed by the polycomb complex. We also identify combinatorial patterns of transcription factor interactions indicated by the ChIP-seq signals.
Our analysis defines subgroups of promoters characterized by stereotypic patterns of transcription factor occupancy, and combinations of specific sequence patterns which are required for their binding. This grouping provides new hypotheses concerning the assembly and dynamics of transcription factor complexes at their respective promoter groups, as well as questions on the evolutionary origin of these groups.
- Transcription factor combinatorics
- Promoter-enhancer interaction
The sequence patterns which characterize mammalian promoters have been the object of intense study for many years (see, e.g.,  for review). A number of sequence patterns, most prominently the TATA-Box, have been associated to promoters. At the same time, the wish to recognize promoters from sequence alone has prompted the search for further characteristic patterns. In the fruit fly, such patterns could indeed be delineated (see  and ). For mammalian promoters, a number of sequence patterns were identified as enriched based on careful sequence analysis (see, e.g., Bucher  and FitzGerald et al. [5, 6]). From amongst those patterns, the TATA-box used to be considered a hallmark of mammalian transcription. However, the more promoters became known the fewer of them actually contained a TATA-box. At the same time, CpG islands have been identified as an important feature of a human promoter (see  for review), even dividing the human promoters into the ones that possess a CpG island vs. the ones that lack it ([8, 9]).
The ENCODE project  has produced ample data on human promoter occupancy using the chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-seq) technology. It became clear that active promoters show particular histone modification patterns like an increased level of Histone 3 Lysine 4 trimethylation (H3K4me3) and of H3K27ac. Several papers exploited the data-set to improve on the definition of transcription factor binding sequences [11–13]. ENCODE data also comprises an extensive set of genomic location data for transcription factors and other chromatin-associated proteins, which do not bind in a sequence-specific manner. The group of Zhiping Weng and their collaborators already provided an in-depth analysis of the binding sites and their co-occurences in  and the associated factorbook-website . However, their focus was on the binding sites in general and not on the promoters and their activity status. With a focus on the NF-Y transcription factor, Dolfini et al. studied the associated regulatory module . Giannopoulou et al.  and Gerstein et al.  compiled a regulatory network and chromatin-bound protein complexes. Dan et al. studied TF colocalization. The question remains what can be learned from an integrated analysis of histone modifications and transcription factor occupancy patterns with respect to subgroups of promoters and their characteristic sequence patterns.
In this paper, we focus on the human promoter regions as defined in the RefSeq  database and integrate transcription factor binding information, histone modifications, and chromatin-associated proteins for the purpose of delineating subgroups of promoters. We collect promoter activity information in the form of CAGE tags  as reported by the FANTOM consortium. It is the RefSeq annotation that provides the static information that a chromosomal position can in principle act as a transcription start site, and the CAGE information that provides the dynamic information on promoter activity. As a control, we repeat the analysis with promoters identified from CAGE peaks only and independent of RefSeq annotation. From the ENCODE data, we process the ChIP-seq experiments for the histone marks (HMs) and transcription factors (TFs) for the two cell-lines with the largest available number of experiments (GM12878, K562). We will also verify our generated hypotheses in HeLa-cell data. For each cell-line, our input data constitutes a large matrix where a promoter defines a column, and each row corresponds to one ENCODE ChIP-seq experiment. For each column, the number of CAGE tags indicates transcriptional activity from this TSS and allows us to distinguish active and inactive promoters. Using the computational method of biclustering , we determine groups of promoters and associated groups of ChIP-seq experiments, where the promoters are occupied by just these TFs or HMs and, vice versa, the TFs/HMs are characteristic of these promoters. This information is visualized in a heatmap depiction of the matrix with columns and rows arranged in such a way that the groups become visible as blocks in the matrix. Note that biclustering a matrix differs from combining two hierarchical clusterings on either dimension in that biclustering selects combinations of groups of rows and groups of columns, together forming a homogeneous bicluster .
Analysis of the heatmap that results from biclustering uncovers several classes of promoters that are distinguished by the combinations of bound TFs. We will back up this information by motif analysis and, where meaningful, with chromatin 3D information. We analyze both active promoters and inactive promoters. Clustering information will be further supported by statistical tests for enrichment of occupancy and of motif occurrence. Taken together, our results demonstrate distinct subgroups of promoters defined by TF occupancy with the existence of the sequence patterns a prerequisite for the binding of the respective proteins.
Sorting heatmaps for active and inactive promoters using biclustering
For each of the two cell-lines, GM12878 and K562, we study active and inactive promoters separately, dividing them based on the number of CAGE tags mapping to the promoters as proposed by FANTOM (, for details on this and the following steps see the “Methods” section). GM12878 yielded 6030 active genes and 6854 inactive genes, while K562 has 4172 active genes and 9588 inactive genes. For each promoter, we collected from the ENCODE data  the ChIP-seq read-counts for these promoter regions in the respective cell-lines, see the “Methods” section for processing of read-counts.
Besides using an algorithm that is robust in the sense that it does not report results that might be due to particular parameter settings (see the “Methods” section), two further precautions were taken against possible over-interpretation of computational results. First, an alternative biclustering algorithm based on a very different algorithmic paradigm was employed and found to yield highly similar results (see the “Methods” section and Additional file 1). Second, based on the reported clusters of promoters, we further test the association of individual rows with respect to the column clusters using a t test obtaining a measure how well a row fits into a bicluster. The resulting significance, measured as negative logarithm of the t test p value, is plotted towards the right adjacent to the matrix and grouped by the clusters to which membership is tested. Thus, each bar aligns to its respective row in the matrix, extending to the right and providing evidence in how far this row belongs to the respective cluster. As an example, consider the TF SP1 in Fig. 1 which associates also with promoters in cluster II, although the main affiliation of this TF is in cluster I. This way of associating a probability to the association of a row with a cluster further qualifies the information from the biclustering in an effort to prevent over-interpretation.
Similar biclustering results were obtained for promoters identified from CAGE peaks rather than RefSeq promoters. The corresponding heatmaps are shown in Additional file 1: Figures S4 and display the same division into a structured submatrix and an unstructured part. Note that the number of inactive TSSs in this CAGE-based definition is much larger than in the RefSeq-based definition, because whenever a CAGE cluster was observed in some other cell line, its absence in K562 or GM12878 is interpreted as an inactive TSS.
Classes of active promoters
Based on occupancy patterns depicted by blocks in the heatmap, we have identified five groups of active promoters in GM12878 cell-line. Additional file 1: Figure S5 shows a bar plot of number of promoters in each of the cluster in both of the cell lines. We proceed to introduce these clusters based on the heatmap.
In the heatmap for the active promoters (Fig. 1), the activity-related histone marks H3K4me3, H3K27ac, and H3K9ac are clearly visible across the top rows, together with the rows for PolII and TAF. TATA-binding-protein (TBP) was sorted into another cluster, although visually TBP could be easily joined to these activity related marks. The t test supports this view in that testing TBP for cluster Ia versus the rest shows high significance. In GM12878, we named the first three clusters Ia, Ib, and Ic to indicate that visually they might easily be merged into one cluster. This cluster I further comprises a large group of transcription factors including, e.g., CREB, MYC, ETS1, and others. We observed a similar pattern for K562, where TBP is included in this activity-related cluster I, too.
Cluster II promoters (hereafter called “NF-Y-cluster”) are characterized by binding of NFYA and/or NFYB. The NF-Y transcription factors bind to the CAAT-box, a classical promoter motif . It is clearly visible and supported by the t test, which the NF-Y-cluster includes FOS, as observed earlier by Struhl . The t test also links SP1 to the NF-Y-cluster, although SP1 is visible both on cluster Ib and NF-Y-cluster promoters. In GM12878, SP1 has been assigned to cluster Ib by the algorithm, while in K562 data SP1 is assigned to the NF-Y cluster. Although cooperative binding of NF-Y and FOS, as well as NF-Y and SP1, have been reported in other studies , our analysis indicates a three-way synergy. For additional aspects including the corresponding sequence motifs see section “TF combinatorics in USF and NF-Y clusters”. Promoters in the NF-Y cluster show the standard activity-related marks but are notably devoid of signal from the other studied transcription factors.
In K562 cells, the algorithm groups together NF-Y, SP1, and FOS, too, and the t test supports this clustering (Additional file 1: Figure S2B). Additional file 1: Figures S7 and S8 shows the promoter occupancy with these three factors and demonstrates that many SP1 bound promoters indeed are also occupied by NF-Y and FOS. At the same time, SP1 binds to many more promoters, in line with observing SP1 occupancy also on cluster Ib promoters in the ChIP-seq biclustering matrix.
The USF family (USF1, USF2) transcription factors dominate cluster III (“USF cluster”). The USF binding motif is CACGTG , which is an instance of the classical E-box motif , known, e.g., from the regulation of clock genes . Like many of the cluster I transcription factors, USF is also a bHLH factor. Interestingly, while there are reports of cooperative binding of USF and NF-Y in particular promoters [29, 30], we observe an anti-correlation in that promoters bind either the one or the other. This is further supported by the motif analysis which shows that the binding motifs of USF (E-box) and NF-Y (CAAT-box) tend not to co-occur in the same promoter (see section “Sequence motifs and enrichment analysis” and Additional file 1: Figures S9 and S10). Typical basic helix-loop-helix transcription factors like MAX and BHLHE40 also bind to the USF promoters, which hints at different dimers binding to the E-box. Although in K562 the biclustering algorithm did not assign USF2, the transcription factors USF1, USF2 and ATF3 are clustered together in GM12878, with MYC joined to that cluster in K562. MAX and BHLHE40 are also associated to this cluster by the t test.
Cluster IV (in K562 Cluster V, “CTCF cluster”) is dominated by CTCF/cohesin comprising CTCF itself together with cohesin components RAD21 and SMC3. Again, the group of promoters displaying this binding signal is largely devoid of binding of either NF-Y or USF. While CTCF is not considered a transcription factor, these promoters are at the same time occupied by Pol II, TBP, and TAF1. K562 also shows clustering of CTCF and SMC3, together delineating a particular albeit smaller group of promoters. The structural implications of this observation will be discussed further below.
The lower right corner shows another small cluster (cluster V in GM12878, cluster IV in K562) that is mainly characterized by ELK1 (binding site consensus CCGGAAGT) and SRF binding. These two proteins are known to interact and bind DNA jointly . The t test analysis indicates that from the other clusters GABPA, SIX5, and ETS1 show similar binding behavior as ELK1 and SRF. ELK1 and GABPA are both members of the ETS transcription factor family .
Similar results for promoters identified from CAGE peaks only are shown in Additional file 1: Figures S11A (GM12878) and S11C (K562). Clusters are numbered as they are output by the algorithm. Clusters I, II, and III in S11A correspond to Ia-Ic from before comprising largely the same TFs and histone modifications. The NF-Y cluster from before corresponds to CAGE-based cluster IV. The RefSeq-based CTCF cluster from before becomes cluster VI in S11A. Only the USF cluster does not have a direct counterpart in the CAGE-based clustering, whereas the RefSeq-based SRF-elk cluster (V) gets combined with two additional TFs and is denoted V in S11A. For the active K562 TSSs in S23C, the correspondence between the biclusterings is comparable.
The two cell-lines under study, K562 and GM12878, were chosen based on the availability of many TF ChIP-seq experiments. While reproducing the entire analysis in another cell-line is therefore difficult, we did test for the existence of the proposed NFY, USF, and CTCF clusters in HeLa cells. ENCODE data for HeLa comprise ChIP-seq experiments for NFYA, NFYB, USF2, and CTCF, such that we could test our hypothesis there. Additional file 1: Figure S12 shows the promoter coverage patterns by these factors in HeLa cells after applying k-means (k=4) clustering to the data. This analysis confirms the existence of the NFY, USF, and CTCF clusters. Increasing the cluster number beyond 4 did not lead to new patterns. We also validated these clusters based on the CAGE-based TSS definition by showing their coverage pattern in GM12878 (Additional file 1: Figure S13) and K562 (Additional file 1: Figure S14). In both cases, the groups are again visible in the k-means sorting of the coverage patterns. Even the USF cluster on CAGE-TSSs becomes visible here, although it was not found in the CAGE-based biclustering.
Classes of inactive promoters
In search for patterns associated with silent promoters we also applied the same data analysis strategy to those promoters which showed no CAGE tags (see the “Methods” section). Like for the active promoters, in the complete biclustering matrix for promoters inactive in GM12878 (Additional file 1: Figure S1), we home in on the rows and columns that are assigned to clusters (Fig. 2). The analogous matrix for K562 can be found in the (Additional file 1: Figures S1 and S3). The t test showing the membership of rows in a particular cluster is again printed to the right.
The first mechanism of repression that one would look for is by polycomb combined with H3K27me3. This can be found as part of cluster IV (cluster III in K562), where EZH2 and H3K27me3 mark around half of the promoters that are grouped into this cluster of around 400 promoters. These signals are hardly visible in any promoter outside this cluster.
Like among the active promoters, here too there is a cluster (cluster II) of promoters that is characterized by CTCF/cohesin. We inspected the RNA-seq data (Fig. 3b for GM12878 and Additional file 1: Figure S6B) and it shows that these promoters are clearly inactive. ZNF143 displays a very similar pattern, in line with the observation that it is a sequence specific chromatin-looping factor . This is also supported by the K562 data. These observations will be discussed below in conjunction with the active CTCF-bound promoters.
There is a large cluster I that is characterized by H3K4me2, H3K4me3, and H2AZ. Yet, the other activity-related mark H3K27ac is weak in these promoters. The transcription-elongation-related mark H3K36me3 is mostly absent, while some of the promoters show H3K79me2. Given that the promoters for this particular analysis were selected to have almost no CAGE tags, these are contradictory signals. We are not looking at bivalent promoters because H3K27me3 is absent. The RNA-seq data shows extremely low signals for this cluster I Fig. 3b for GM12878 and Additional file 1: Figure S6B). However, the presence of H3K79me2 is an indicator of transcription, suggesting that we may be looking at unused alternative promoters. Indeed, for many of these promoters, we found an active alternative promoter further upstream (Additional file 1: Figure S15A). A similar explanation applies to cluster III (cluster IV in K562). Many promoters in this cluster display a combination of H3K36me3 and H3K79me2, again strongly indicating that in the cell type under study these genomic regions are part of a transcribed unit (Additional file 1: Figure S15B). Thus, in this cell type, transcription is not actually starting from these promoters but rather they are run over by transcription that had started elsewhere.
The clustering based on CAGE-defined TSSs lead to similar results. Additional file 1: Figures S11B and S11D show the two heatmaps for GM12878 and K562, respectively. The CTCF cluster (number II in Additional file 1: Figures S11B and S11D) and the polycomb-associated cluster (number IV in Additional file 1: Figure S11B and V in Additional file 1: Figure S11D) are clearly identifiable. There also exist the analogs to former cluster I (some active marks) and the H3K36me3 cluster (number V and III, respectively). The plots appear less succinct than for the RefSeq-based promoters due to the very large number of inactive CAGE-based TSSs.
Differences in promoter occupancy patterns between cell-lines
Other covariates and functional categories
An interesting question concerns possible covariates associated to the individual clusters as well as possible functional characteristics. We have already discussed that genes in cluster I are generally more highly expressed than those from the other clusters. Figure 3a, b (Additional file 1: Figures S6A and B for K562 cell line) shows the levels of CAGE tags and RNA-seq expression levels across the clusters. It is apparent that the active promoters are targeting active genes and that cluster I is more active than the other clusters. The inactive promoters were chosen based on the lack of CAGE tags and accordingly also are very lowly or not at all expressed. We also find a strong enrichment of house-keeping genes  in this cluster (χ2 test p value = 1.26e −05, Additional file 1: Figure S17).
Another important covariate that could be linked to expression and the clustering of promoters is the CpG content of a promoter. Indeed, the vast majority of the active promoters is CpG rich, as shown in Fig. 3c(Additional file 1: Figure S6C for K562 cell line). Among the inactive promoters, clusters vary with respect to the CpG content of their promoters, with only the polycomb-repressed promoters comprising mostly high-CpG promoters. This is in line with the general assumption that the polycomb complex PCG2 gets recruited to CpG islands .
The DNase accessibility of a promoter is another indicator of its activity. Figure 3d (Additional file 1: Figures S6D for K562 cell line) shows the distribution of accessibility among the cluster promoters. Among the active clusters, cluster I together with the NF-Y cluster display a higher accessibility than the remaining clusters. The DNase accessibility signal for the inactive cluster is generally low, with the somewhat ambiguous inactive cluster I slightly higher than the others, although still much lower than any of the active clusters.
To further understand which histone modifications of transcription factors contribute the most to the gene expression value, we use a linear model to regress CAGE reads on the normalized ChIP-seq signals . Overall, the Pearson correlation coefficient of all ChIP-seq signals and expression is r_all=0.878 and 0.879 in both of the two cell lines, respectively. Following the feature selection procedure used in , we further test all the combinations of 4 ChIP-seq experiments (487635 combinations) and record their Pearson correlation between the model predictions and the CAGE tags. We define the “good sets” as those yielding a correlation better than 95% of the maximal achievable 0.878 (resp. 0.879) correlation and then apply a hypergeometric test to determine those variables that are highly represented among the good sets. We found that TAF1, POlII, and TBP are the top three epigenetic marks that contribute the most to the gene expression (Additional file 1: Figure S18). Among the histone modifications, the most significant ones are H3K27ac, H3K9ac, and H3K79me2.
In search for possible functional implications of the clusters, we systematically tested the target genes of the promoters in the clusters for Gene Ontology enrichment using DAVID . The five most significant categories with a p value better than 10−5 are summarized in Additional file 1: Table S3, while Additional file 2: (full_GO.xlsx) reproduces the entire DAVID output. For cluster I, this analysis yielded as the most significant hit the category “translational elongation,” which is in line with the prominence of house-keeping genes in this cluster. For the NF-Y and USF clusters, several categories relate to transcriptional control. The NF-Y clusters for the two cell lines largely overlap in terms of the content and accordingly the GO enrichment results are similar. In contrast hereto, the USF clusters contain different genes in the two cell-lines and the enrichment results are accordingly dissimilar. From the presented statistics no clear conclusion as to functional implications of the promoter clusters can be made.
Sequence motifs associated to clusters
A key question in our analysis of promoter clusters concerns the difference between TF occupancy and existence of TF binding motifs. We therefore investigated whether promoter clusters are supported by sequence motifs and whether there are further sequence motifs beyond the known ones for the ChIP-seq TFs. To this end, we conducted motif-search in promoter clusters using all motifs from the Jaspar database (Jaspar 2014) using the MAST tool from the MEME suite (, see the “Methods” section for details). This analysis recovered the binding motifs of the ChIP-seq transcription factors but no further motifs beyond that (data not shown). We conclude that the observed clustering is not a product of binding of further transcription factors, in as far as the binding motifs are contained in Jaspar.
We further asked whether in the ChIP-seq experiments the TFs bind all promoters containing that TF’s binding site, or whether they only bind a subset of available binding sites. To this end, we computed the motif hits for the binding sites in all the promoters and depict the resulting motif match scores as rows of a matrix whose columns are ordered like in the biclustering matrix (Additional file 1: Figures S9 and S10). Some motifs occur in many promoters and across all clusters, like, e.g., the SP1 motif. This is in stark contrast to the clearly defined subset of SP1-bound promoters according to the ChIP-seq heatmap (Fig. 1 and Additional file 1: Figure S2). On the other hand, the motif hits for the NF-Y binding CAAT-box and the USF binding motif, the E-box, are remarkably concentrated in their respective cluster of TF-bound promoters. This suggests that the existence of the respective motif in the promoter is in itself sufficient for TF binding and possible activation of that gene.
TF combinatorics in USF and NF-Y clusters
USF1 and USF2 most of the time bind together with BHLHE40 and MAX (Fig. 5k-means groups 1, 2, and 3, Additional file 1: Figure S19 k-means groups 2–4). MYC displays a strong ChIP-seq signal only when it binds together with MAX, apparently in the well-known MYC-MAX complex (Fig. 5k-means group 5, Additional file 1: Figure S19 k-means group 6). In GM12878, we observe a group of BHLHE40-specific promoters (Fig. 5k-means group 4) and K562 has a group of MAX-specific promoter (Additional file 1: Figure S19 k-means group 1). This appears to be the only promoter set that is different between the two cell lines. Thus, the similarity of TF occupancy pattern in different cell lines suggests a conservation of binding combinatorics across cell types and is consistent with observed importance of bases flanking the core E-box  for specific binding.
CTCF and spatial promoter-enhancer interaction
The existence of spatial interactions suggests that the distal regions of these interactions might overlap with enhancers targeting the respective promoters. To test this, we downloaded from ENCODE the ChromHMM-generated enhancer annotation (GSM936082)  to compare to the distal ChIA-PET regions. For the purpose of counting overlapping regions, we extend the distal region by 500 bp at either side and check for overlap with a ChromHMM strong or weak enhancer. The respective figures are integrated into the bars in Fig. 7: 91.11% of promoters in the active CTCF cluster loop to at least one region predicted to be an enhancer, with 65.04% looping to what is predicted as a strong enhancer. Promoters from the inactive CTCF cluster loop to a strong predicted enhancer in 32.13% of cases.
There has for a long time been a keen interest in determining principles of human promoter architecture. Since sequence motifs have not sufficed to achieve this, we turned to analyzing transcription factor occupancy in promoters as determined in the ENCODE project, in conjunction with sequence. We proceed in a top-down manner starting from the entirety of promoters with their ENCODE ChIP-seq signals and apply biclustering to delineate subgroups of promoters that share particular histone modifications and/or transcription factors. Our study fills a gap between sequence-based promoter studies (e.g., ) and the global occupancy analysis, like, e.g., in . Other authors have studied the global regulatory network  or focused on NF-Y and its co-factors . Our goal was the delineation of subsets of stereotypical human promoter architectures.
Among the active promoters, we identified the NF-Y binding promoters, containing the CAAT-box motif, as one distinct promoter cluster. Like other authors before us , we observe that NF-Y binding frequently goes along with binding of FOS and Sp1. While Sp1 binding sites occur ubiquitously, we have shown that this combination is highly characteristic of a subgroup of promoters.
Another subgroup is characterized by the basic helix-loop-helix factor USF binding to its binding site, the E-box. The NF-Y cluster and the USF cluster appear almost mutually exclusive, reinforcing the view that these two groups constitute two characteristic promoter architectures.
In the cell types studied, there is also a small cluster characterized by binding of the ETS TF. At the same time, there is a large group of promoters that apparently lack any characteristic TFs or TF-combinations, but clearly show the activating histone modifications. While the groups discussed are the result of a clearly visible clustering pattern, there are many more promoters for which—at least with the available transcription factor ChIP-seq experiments, no cluster structure is visible. We conclude that certain characteristic promoter types exist, but only a part of all promoters falls into any of these subgroups.
A further promoter cluster is characterized by binding of CTCF in the promoter region. For this group, we presented evidence suggesting that CTCF and cohesin are responsible for the interaction between the bound promoter and one or more enhancers. This putative arrangement differs from the one where, within a TAD (topologically associated domain [48, 49]), promoters and enhancers can cluster in space without direct involvement of CTCF/cohesin. It remains to be seen whether the two arrangements have different functional consequences.
On the side of the inactive promoters, we were surprised to see that the number of promoters repressed by polycomb is fairly small, thus raising the question whether there are other “mechanisms of repression,” or whether repression is just the absence of activation. We also had expected to see a stronger role for the repressive mark H3K9me3. However, the signal for this mark is low across all experiments (see Additional file 1: Figure S1) and it is unclear whether its absence actually constitutes a biological signal or is due to some technical issues.
Other authors [6, 16] investigated the question whether genes with promoters bound by NF-Y belong to particular functional classes of genes as given, e.g., in the Gene Ontology database. We also tested our derived promoter clusters for such target categories but the results remain rather generic and unconvincing. Although each study, including our own, finds some enriched GO term, those functions are either not consistent within a cluster or they are so generic that they are hardly informative. Thus, we suspect that TF combinatorics might actually lack a systematic link to functional categories.
In summary, basing this analysis on a combination of TF occupancy and motif analysis, we have defined stereotypic patterns delineating a novel grouping of promoters. This grouping opens up interesting new questions concerning the transcription factor complexes at the respective promoter groups, as well as questions on the evolutionary origin of these groups.
Data-sets and preprocessing
Potential transcription start sites
We downloaded coordinates of potential transcription start sites (TSSs) listed in the RefSeq annotation (hg19, release from December 10, 2014, from UCSC database). Cases where two or more TSSs lie within a 1 Kbp window are ignored altogether. This leaves 22164 annotated TSSs.
Cell type-specific activity status of TSS
TSS activity information is obtained from the CAGE (Cap Analysis of Gene Expression) analysis in the FANTOM 5 project . CAGE measures RNA expression level by capturing and sequencing the initial 20 nucleotides from the 5 ′ end of an RNA. We downloaded CAGE data for the GM12878 and K562 cell lines . CAGE peaks are called following the method of FANTOM consortium (decomposition peak identification (DPI) ). TSSs are divided into active and inactive ones. We define a TSS which has robust peaks in its [−500,+500] window around the TSS as active, whereas an inactive TSS is defined as not even having a FANTOM-defined permissive peak in that window. This process yields 6030 active and 14390 inactive TSSs in GM12878 and 4173 active and 15494 inactive TSSs in K562 cell line. We further discard inactive TSSs when located in an intergenic region. In the end, we are left with 6854 inactive genes in GM12878 and 9588 inactive promoters in K562.
We also download RNA-seq from ENCODE/Caltech (GSM958730,GSM958731). We apply cufflinks  with default parameters using RefSeq annotation described before on all the replicates. FPKM values of each TSS are used.
ENCODE ChIP-seq data
We downloaded ChIP-seq data and DNase-seq data (GSM736496/GSM736629) for the GM12878 and K562 cell lines from the ENCODE project . This data-set comprises ChIP-seq data for 11 histone modifications and 49 TFs or chromatin-associated proteins. A detailed list is given in Additional file 1: Table S1 and S2. After merging the replicate bam files for each ChIP-seq and DNase-seq experiment, we use MACS  to call peaks with default parameter and the “–call-subpeaks” option. Then, for each TSS the height of the peak closest to the TSS from within a [−1000,+1000] bp window around it is recorded. This value we normalize within each ChIP-seq experiment by linearly mapping the peak heights to the interval [0,1]. We reduce the influence of outliers by holding out the top 0.05% of values before the linear transformation and subsequently setting these values to 1. For depicting the ChIP signals in the promoter regions as coverage patterns, we use deeptools  and normalize the ChIP-seq signal against its input using the “bamCompare” function under default parameters.
To validate our observation in GM12878/K562 cell line, we download four ChIP-seq (NFYA, NFYB, CTCF, and USF2) together with their Input in HeLa from ENCODE (see Additional file 1: Table S1 and S2).
CTCF and ChIA-PET data
We downloaded the genomic coordinates of interacting genomic regions as determined by ChIA-PET for CTCF in GM12878  (GSM1872886). For each promoter in the CTCF cluster in active and inactive biclustering results, we looked for CTCF loops which have one side mapped to these promoter regions, and record the genomic loci for the other side of the loop (we call it the “interacting region”). If more than one loop is found interacting with a promoter region, the loop with the highest H3K27ac signal at the remote CTCF binding region is chosen. Since the remote CTCF binding regions differ in size, we choose the site which has the highest CTCF ChIP-seq read coverage as the center of the region and extend ±1 Kbp for depiction of the coverage patterns.
Transcription factor binding motifs
As a resource for TF binding motifs, we used the Jaspar database . To calculate the motif match score, we use the program MAST  from the MEME suite , using its default parameters and reporting all hits. If there is more than one hit in a promoter region, we take the hit with the smallest p value for further analysis.
Gene ontology enrichment analysis
We use DAVID functional annotation tool (version 6.7)  to do gene ontology enrichment analysis. For enrichment analysis on each active/inactive clusters, we use all active/inactive genes in the cell line as the background, for both of the cell lines. We present here enriched GOTerm in three categories: biological process (BP), cellular component (CC), and molecular function (MF). In Additional file 2, we listed all the GO terms that with adjusted p value cutoff =0.05 and fold enrichment cutoff =2 for all clusters and both cell lines.
Sorting the heatmaps by biclustering
Our method for clustering the rows and columns of the CHIP-seq data matrix into a visually understandable heatmap is based on a robust version of the s4vd biclustering algorithm , which is available as an R package. s4vd solves the biclustering problem with a procedure that is based on the singular value decomposition. The result of biclustering a matrix is a coupled set of column and row clusters, where these cluster pairs are visible as sub-rectangles of the matrix with rows and columns permuted correspondingly. We use the default parameter settings of s4vd.
Due to its use of a randomized selection step, s4vd produces somewhat different results in different runs of the program. We exploit this with the goal of obtaining robust biclustering results by extracting the common cluster assignments from many runs. To this end, we run s4vd many times and determine the solution with the largest number of column clusters (a tie gets broken randomly). We call this the target clustering. Then, for each of the other solutions obtained from the other runs of s4vd, we determine an optimal assignment of its clusters to the target clustering. This is done by solving a linear assignment problem on the confusion matrix of the two cluster systems. The linear assignment problem is solved using lp.solve from R . Once all the clusters from all solutions are assigned to the target clustering, we determine the frequency at which a column gets mapped into a target cluster. Finally, we only keep columns for which there exists a target cluster to which it gets mapped with a frequency of more than 0.5, i.e., more than half the solutions would clusters this column consistently into a similar looking target cluster (by virtue of the optimal assignment).
In a biclustering result, there exists a connection between a column and a row cluster. Thus, once the column clusters are fixed by the above procedure, one can also determine how often a particular row gets associated to a row cluster. Again, we keep those rows, which are associated to one cluster in more than half of the solutions, and assign this row accordingly. This also leads to the t test p values reported next to our heatmaps: For one row, the test compares the mean of the values among the entries in the associated column cluster with the mean of the values outside the column cluster. This visualization is a further precaution against over-interpretation of the biclustering results. We uploaded our original data matrix and source code on the github .
To further ensure that the computational results which we interpret are not due to the specifics of our algorithm, we applied a second biclustering algorithm which is based on a very different computational principle. Rather than relying on the SVD, we first use k-means clustering on the TSSs (columns) of the ChIP-seq matrix. Additional file 1: Figure S21 shows how the value of k was selected. To then determine the association of a row with a cluster, we test each row with a t test. Like in the visualization procedure described above, the t test measures in how far a particular column cluster divides the row of active promotes into two different regimes of high and low values, respectively. For a given column cluster, the rows with a t test p value better than 0.001 form the row cluster associated to that column cluster. Both biclustering procedures are depicted graphically in Additional file 1: Figure S22. Additional file 1: Figure S23 shows the confusion matrices between the biclustering resulting from the two algorithms. It is apparent that the clusters that we have interpreted are stably reproduced also by the k-means t test-based algorithm.
Promoter definition based on CAGE tags
To not rely solely on the RefSeq promoter annotation, we also alternatively use the same CAGE tag data as above to define promoter location.
Here, we describe how promoters were derived from FANTOM 5 CAGE data. The FANTOM consortium has provided a CAGE-tag based TSS annotation which we downloaded from their website  Here, we also choose TSSs under the robust cutoff. There are 217572 TSSs in this annotation. Note that this annotation comprises CAGE tags from all the cell lines studied in FANTOM 5. In particular, it includes all possible isoforms and a gene might have several possible TSSs in a very short region, see example Figure in Additional file 1: Figure S24.
In this example, we have 6 TSS for Nat10 within an interval of 1000 bases. Although the annotation might be very accurate (i.e., to base pair resolution), it is not helpful to analyze all these TSS individually, because the resolution of the ChIP-seq data with which we annotate the promoters is much lower. Thus, we group CAGE TSS into potential promoter regions before we apply our biclustering method. We first join together TSSs with a distance shorter than 200 bp into a TSS cluster. We then exclude TSS clusters which extend over a length of more than 1000 bps, or where there are CAGE TSSs on both plus and minus strand in the same cluster, or when there are TSSs from more than one annotated gene in a cluster, see illustration in Additional file 1: Figure S25. After this step, we have 43675 TSS clusters with 121129 TSSs.
In order to assign an activity status to such a potential promoter, we extend 500 bps for each TSS cluster region and search for robust or permissive CAGE peaks in the K562/GM12878 cell lines. If there is one or more robust peaks of a cell line in one TSS region, then the TSS cluster is deemed active in this cell line. If there is not even a permissive peak in this TSS cluster region, then the TSS cluster is deemed inactive. Note that this step is strand specific.
After this step, we have 6684 active TSS clusters comprising 22340 TSSs and 31969 inactive TSS clusters comprising 82957 TSSs in the GM12878 cell line. For K562, we have 4375 active TSS clusters (13951 TSSs) and 35649 inactive TSS clusters (94654 TSSs). To save the expense of computing, we exclude inactive TSSs without any ChIP-seq signals in the ±1 Kbp TSS windows from the analysis. In addition, since two very close TSSs usually have the same ChIP-seq signals in their TSS windows, we keep only one row in the input matrix if two or more TSSs in a TSS cluster have exactly the same values of all ChIP-seqs.
The input matrix of active/inactive TSS of GM12878 for biclustering contains 19316/30634 TSSs, the input matrix of active/inactive TSS of K562 for biclustering contains 12861/41013 TSSs.
The authors would like to thank Sebastiaan Meijsing, Marcel Schulz, Xintian You for useful suggestions and proof reading of the manuscript.
This work was supported by the Bundesministerium für Bildung und Forschung “Deutsches Epigenom Programm” (DEEP, Förderkennzeichen 01KU1216C).
Availability of data and materials
All data generated and analysed during this study are included in this published article and its supplementary information files. In this paper, we use RefSeq annotaion (hg19, release from December 10, 2014, from UCSC database) and CAGE-tag-based annotation  to do the analysis. We downloaded data from ENCODE  includes ChIP-seq (see Additional file 1: Table S1 and S2 for detail GSM number.), RNA-seq (GSM958730,GSM958731) and DNase-seq (GSM736496,GSM736629). We download CAGE data from FANTOM . We downloaded ChIA-PET for CTCF in GM12878  (GSM1872886).
MV supervised the project. XY performed the analysis. MV and XY wrote the manuscript. Both authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Werner T. Models for prediction and recognition of eukaryotic promoters. Mamm Genome. 1999; 10(2):168–75.View ArticleGoogle Scholar
- Kadonaga JT. Perspectives on the RNA polymerase II core promoter. Wiley Interdiscip Rev Dev Biol. 2012; 1(1):40–51.View ArticleGoogle Scholar
- Ohler U. Identification of core promoter modules in Drosophila and their application in accurate transcription start site prediction. Nucleic Acids Res. 2006; 34(20):5943–950.View ArticleGoogle Scholar
- Bucher P. Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol. 1990; 212(4):563–78.View ArticleGoogle Scholar
- FitzGerald PC, Shlyakhtenko A, Mir AA, Vinson C. Clustering of DNA sequences in human promoters. Genome Res. 2004; 14(8):1562–1574.View ArticleGoogle Scholar
- Ma S, Snyder M, Dinesh-Kumar SP. Discovery of novel human gene regulatory modules from gene co-expression and promoter motif analysis. Sci Rep. 2017; 7:5557.View ArticleGoogle Scholar
- Antequera F. Structure, function and evolution of CpG island promoters. Cell Mol Life Sci. 2003; 60(8):1647–58.View ArticleGoogle Scholar
- Davuluri RV, Grosse I, Zhang MQ. Computational identification of promoters and first exons in the human genome. Nat Genet. 2001; 29(4):412–7.View ArticleGoogle Scholar
- Saxonov S, Berg P, Brutlag DL. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci. 2006; 103(5):1412–7.View ArticleGoogle Scholar
- Consortium EP, et al. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74.View ArticleGoogle Scholar
- Lu R, Mucaki EJ, Rogan PK. Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Nucleic Acids Res. 2016; 45(5):27–7.View ArticleGoogle Scholar
- Kulakovskiy IV, Vorontsov IE, Yevshin IS, Soboleva AV, Kasianov AS, Ashoor H, Ba-alawi W, Bajic VB, Medvedeva YA, Kolpakov FA, et al. Hocomoco: expansion and enhancement of the collection of transcription factor binding sites models. Nucleic Acids Res. 2016; 44(D1):116–25.View ArticleGoogle Scholar
- Kheradpour P, Kellis M. Systematic discovery and characterization of regulatory motifs in encode tf binding experiments. Nucleic Acids Res. 2013; 42(5):2976–987.View ArticleGoogle Scholar
- Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012; 22(9):1798–812.View ArticleGoogle Scholar
- FactorBook. https://www.factorbook.org.
- Dolfini D, Zambelli F, Pedrazzoli M, Mantovani R, Pavesi G. A high definition look at the nf-y regulome reveals genome-wide associations with selected transcription factors. Nucleic Acids Res. 2016; 44(10):4684–702.View ArticleGoogle Scholar
- Giannopoulou EG, Elemento O. Inferring chromatin-bound protein complexes from genome-wide binding assays. Genome Res. 2013; 23(8):1295–306.View ArticleGoogle Scholar
- Gerstein MB, Kundaje A, Hariharan M, Landt SG, Yan KK, Cheng C, Mu XJ, Khurana E, Rozowsky J, Alexander R, et al. Architecture of the human regulatory network derived from encode data. Nature. 2012; 489(7414):91–100.View ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (refseq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007; 35(suppl 1):61–5.View ArticleGoogle Scholar
- Consortium TF, et al. A promoter-level mammalian expression atlas. Nature. 2014; 507(7493):462–70.View ArticleGoogle Scholar
- Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinform. 2004; 1(1):24–45.View ArticleGoogle Scholar
- Tanay A, Sharan R, Shamir R. Discovering statistically significant biclusters in gene expression data. Bioinformatics. 2002; 18(suppl_1):136–44.View ArticleGoogle Scholar
- Mantovani R. The molecular biology of the CCAAT-binding factor NF-Y. Gene. 1999; 239(1):15–27.View ArticleGoogle Scholar
- Fleming JD, Pavesi G, Benatti P, Imbriano C, Mantovani R, Struhl K. NF-Y coassociates with FOS at promoters, enhancers, repetitive elements, and inactive chromatin regions, and is stereo-positioned with growth-controlling transcription factors. Genome Res. 2013; 23(8):1195–209.View ArticleGoogle Scholar
- Roder K, Wolf S, Larkin K, Schweizer M. Interaction between the two ubiquitously expressed transcription factors NF-Y and Sp1. Gene. 1999; 234(1):61–9.View ArticleGoogle Scholar
- Sandelin A, Alkema W, Engström P, Wasserman WW, Lenhard B. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 2004; 32(suppl 1):91–4.View ArticleGoogle Scholar
- Mädge B. E-box. In: Encyclopedia of Cancer. Berlin: Springer: 2008. p. 947–50.Google Scholar
- Hardin PE. Transcription regulation within the circadian clock: the E-box and beyond. J Biol Rhythm. 2004; 19(5):348–60.View ArticleGoogle Scholar
- Ito Y, Zhang Y, Dangaria S, Luan X, Diekwisch TG. NF-Y and USF1 transcription factor binding to CCAAT-box and E-box elements activates the CP27 promoter. Gene. 2011; 473(2):92–9.View ArticleGoogle Scholar
- Zhu J, Giannola DM, Zhang Y, Rivera AJ, Emerson SG. NF-Y cooperates with USF1/2 to induce the hematopoietic expression of HOXB4. Blood. 2003; 102(7):2420–427.View ArticleGoogle Scholar
- Marais R, Wynne J, Treisman R. The SRF accessory protein Elk-1 contains a growth factor-regulated transcriptional activation domain. Cell. 1993; 73(2):381–93.View ArticleGoogle Scholar
- Sharrocks AD. The ETS-domain transcription factor family. Nat Rev Mol Cell Biol. 2001; 2(11):827–37.View ArticleGoogle Scholar
- Bailey SD, Zhang X, Desai K, Aid M, Corradin O, Cowper-Sal R, Akhtar-Zaidi B, Scacheri PC, Haibe-Kains B, Lupien M, et al. ZNF143 provides sequence specificity to secure chromatin interactions at gene promoters. Nat Commun. 2015; 2:6186.View ArticleGoogle Scholar
- Eisenberg E, Levanon EY. Human housekeeping genes, revisited. Trends Genet. 2013; 29(10):569–74.View ArticleGoogle Scholar
- Riising EM, Comet I, Leblanc B, Wu X, Johansen JV, Helin K. Gene silencing triggers polycomb repressive complex 2 recruitment to CpG islands genome wide. Mol Cell. 2014; 55(3):347–60.View ArticleGoogle Scholar
- Karlić R, Chung HR, Lasserre J, Vlahoviček K, Vingron M. Histone modification levels are predictive for gene expression. Proc Natl Acad Sci. 2010; 107(7):2926–931.View ArticleGoogle Scholar
- Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4(1):44–57.View ArticleGoogle Scholar
- Bailey TL, Gribskov M. Combining evidence using p-values: application to sequence homology searches. Bioinformatics. 1998; 14(1):48–54.View ArticleGoogle Scholar
- Chaudhary J, Skinner MK. Basic helix-loop-helix proteins can act at the E-box within the serum response element of the c-fos promoter to influence hormone-induced promoter activation in sertoli cells. Mol Endocrinol. 1999; 13(5):774–86.View ArticleGoogle Scholar
- Gordân R, Shen N, Dror I, Zhou T, Horton J, Rohs R, Bulyk ML. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHlH transcription factors through DNA shape. Cell Rep. 2013; 3(4):1093–104.View ArticleGoogle Scholar
- Dynan WS, Sazer S, Tjian R, Schimke RT. Transcription factor Sp1 recognizes a DNA sequence in the mouse dihydrofolate reductase promoter. Nature. 1986; 319(6050):246–8.View ArticleGoogle Scholar
- Rauscher FJ, Sambucetti LC, Curran T, Distel RJ, Spiegelman BM. Common DNA binding site for Fos protein complexes and transcription factor AP-1. Cell. 1988; 52(3):471–80.View ArticleGoogle Scholar
- Vockley CM, D’Ippolito AM, McDowell IC, Majoros WH, Safi A, Song L, Crawford GE, Reddy TE. Direct GR binding sites potentiate clusters of TF binding across the human genome. Cell. 2016; 166(5):1269–81.View ArticleGoogle Scholar
- Tang Z, Luo OJ, Li X, Zheng M, Zhu JJ, Szalaj P, Trzaskoma P, Magalska A, Wlodarczyk J, Ruszczycki B, et al. CTCF-mediated human 3D genome architecture reveals chromatin topology for transcription. Cell. 2015; 163(7):1611–27.View ArticleGoogle Scholar
- Ernst J, Kellis M. Chromhmm: automating chromatin-state discovery and characterization. Nat Methods. 2012; 9(3):215–6.View ArticleGoogle Scholar
- Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007; 39(3):311–8.View ArticleGoogle Scholar
- Heidari N, Phanstiel DH, He C, Grubert F, Jahanbani F, Kasowski M, Zhang MQ, Snyder MP. Genome-wide map of regulatory interactions in the human genome. Genome Res. 2014; 24(12):1905–17.View ArticleGoogle Scholar
- Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, Hu M, Liu JS, Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012; 485(7398):376.View ArticleGoogle Scholar
- Nora EP, Lajoie BR, Schulz EG, Giorgetti L, Okamoto I, Servant N, Piolot T, van Berkum NL, Meisig J, Sedat J, et al. Spatial partitioning of the regulatory landscape of the x-inactivation centre. Nature. 2012; 485(7398):381.View ArticleGoogle Scholar
- Web link for downloading CAGE data. http://fantom.gsc.riken.jp/5/datafiles/latest/basic/human.cell_line.hCAGE/. Accessed 9 Dec 2014.
- Decomposition-based peak identification. https://github.com/hkawaji/dpi1. Accessed 28 Jun 2016.
- Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of rna-seq experiments with tophat and cufflinks. Nat Protoc. 2012; 7(3):562–78.View ArticleGoogle Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008; 9(9):137.View ArticleGoogle Scholar
- Ramírez F, Dündar F, Diehl S, Grüning BA, Manke T. deepTools: a flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 2014; 42(W1):187–91.View ArticleGoogle Scholar
- Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 2009; 37(suppl_2):W202–W208.View ArticleGoogle Scholar
- Sill M, Kaiser S, Benner A, Kopp-Schneider A. Robust biclustering by sparse singular value decomposition incorporating stability selection. Bioinformatics. 2011; 27(15):2089–097.View ArticleGoogle Scholar
- Berkelaar M, et al. lpsolve: Interface to lp solve v. 5.5 to solve linear/integer programs. Comprehensive R Archive Network. 2014;5(4).Google Scholar
- Orignal data and source code. https://github.molgen.mpg.de/connie/biclustering. Accessed 2 Nov 2017.