Classifying human promoters by occupancy patterns identifies recurring sequence elements, combinatorial binding, and spatial interactions

Background: Characterizing recurring sequence patterns in human promoters remains challenging, even now that a near-complete catalogue of promoters exists. With the more recent availability of genomic location (ChIP-seq) data, however, one can approach this question by identifying characteristic patterns of transcription factor occupancy and histone modifications.

Results: Based on the ENCODE annotation, and integrating sequence motifs as well as three-dimensional chromatin data, we have undertaken a re-analysis of occupancy and sequence patterns in human promoters. We identify clear groups of promoters containing CAAT-box and E-box sequence motifs, as well as a group of promoters whose interaction with an enhancer appears to be mediated by CCCTC-binding factor (CTCF) binding at the promoter. We also extend our analysis to inactive promoters, showing that only a surprisingly small number of inactive promoters are repressed by the polycomb complex. Finally, we identify combinatorial patterns of transcription factor interactions indicated by the ChIP-seq signals.

Conclusion: Our analysis defines subgroups of promoters characterized by stereotypic patterns of transcription factor occupancy and by combinations of specific sequence patterns that are required for their binding. This grouping provides new hypotheses concerning the assembly and dynamics of transcription factor complexes at their respective promoter groups, and raises questions about the evolutionary origin of these groups.

Electronic supplementary material: The online version of this article (10.1186/s12915-018-0585-5) contains supplementary material, which is available to authorized users.

[Figure S8 image; axis label: Normalized ChIP-seq signal]
Figure S8: NFY co-binding pattern in the K562 cell line. Density plots and coverage patterns for ChIP-seq signals of NFYA/B, FOS, and SP1 in a ±1 kb window around TSSs, for active TSSs (upper boxes) and inactive TSSs (lower boxes) in the K562 cell line. TSSs are ordered by the peak height of NFYA and then of NFYB, FOS, and SP1. TSSs without a peak in any of these four ChIP-seq data sets are not shown.
[Figure S9 image; cluster labels: Cluster I, Ia, Ib, Ic, II (NFY), II (CTCF), III (USF), IV (CTCF), IV (Polycomb), V, V (ELK); panel label: B]
Figure S9: TF motif hits in promoters in the GM12878 cell line. Promoters were annotated with TF motif hits for all TFs whose ChIP-seq data are included in our analysis and for which a motif is available in JASPAR 2014. Heatmap color corresponds to the motif hit score from MAST (see Methods in the main text). TSSs are ordered in columns as in Figures 1 and 2 of the main text, with the cluster assignment from the GM12878 cell line annotated on top. The graphs on the right indicate the average motif hit score per cluster. Part A shows active promoters, part B inactive promoters.

[Heatmap image; cluster labels: Cluster Ia, Ib, II (NFY), III (USF), V (CTCF); row labels (TFs): BHLHE40, CEBPB, CREB1, CTCF, E2F4, EGR1, ELK1, ETS1, FOS, GABPA, JUND, MAFK, MAX, MEF2A, MYC, NFYA, NFYB, NR2C2, NRF1, RFX5, SP1, SPI1, SRF, STAT5A, TBP, USF1, USF2; panels: A, B]
[Figure S10 image; cluster label: Cluster IV (ELK)]
Figure S10: TF motif hits in promoters in the K562 cell line. Promoters were annotated with TF motif hits for all TFs whose ChIP-seq data are included in our analysis and for which a motif is available in JASPAR 2014. Heatmap color corresponds to the motif hit score from MAST (see Methods in the main text). TSSs are ordered in columns as in Supplementary Figures 2 and 3, with the cluster assignment from the K562 cell line annotated on top. The graphs on the right indicate the average motif hit score per cluster. Part A shows active promoters, part B inactive promoters.

[Figure panels with cluster labels I to VII; panel title: Alternative TSS in inactive Cluster I]
Figure S15: Examples of inactive TSSs embedded in an active gene. A: Genome browser screenshot of the TSS of gene Fbx22. H3K4me3, PolII and H3K36me3 tracks are shown. The Fbx22 TSS is covered by H3K36me3 because it lies in the gene body of USP3-AS1, which is transcribed. B: Genome browser screenshot of the TSS of gene Rpl32. The downstream alternative promoter of Rpl32 is marked by H3K79me2, apparently because the upstream promoter of that gene is active.

The x-axis gives the number of clusters chosen for each run. The green y-axis shows the sum of squared errors and the red y-axis shows the coefficient. To select the number of clusters for the final comparison, we choose the cluster number after which the sum of squared errors has decreased significantly and before the coefficient drops significantly. Accordingly, we choose 12, 12, 10, and 12 clusters for active/inactive TSSs in the GM12878/K562 cell lines, respectively.

Figure S22: Overview of biclustering methods. The upper row of the sketch shows the SVD-based biclustering procedure, while the bottom row depicts the alternative, k-means-based method. The SVD-based method applies s4vd repeatedly and obtains a biclustering with a probability of cluster membership. The k-means approach first applies the k-means clustering algorithm to the matrix columns and subsequently uses the t-test to select rows that belong to column clusters. (HM: histone modification; CM: chromatin modifier; TF: transcription factor.)

Figure S23: Comparison of the biclustering method and the k-means method. TSS clusters from the k-means method are assigned to biclusters using a linear assignment algorithm. Heatmaps of the contingency tables between the two clustering methods are shown in the left column, for active/inactive TSSs in the GM12878/K562 cell lines, respectively; color indicates the value in the contingency table. For each cluster of the biclustering method, the right column shows the ChIP-seq data sets that overlap with the t-test results of the k-means clustering (described in the Methods).
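
The cluster matching described for Figure S23 can be illustrated with a small sketch. It is not the authors' code: the function names, the toy labels, and the choice of SciPy's `linear_sum_assignment` as the solver are assumptions made for illustration. The sketch builds the contingency table between two clusterings of the same TSSs and then picks the one-to-one pairing of clusters that maximizes the total overlap.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def contingency_table(labels_a, labels_b):
    """Count TSSs for every (cluster in A, cluster in B) pair."""
    cats_a = sorted(set(labels_a))
    cats_b = sorted(set(labels_b))
    index_a = {c: i for i, c in enumerate(cats_a)}
    index_b = {c: j for j, c in enumerate(cats_b)}
    table = np.zeros((len(cats_a), len(cats_b)), dtype=int)
    for a, b in zip(labels_a, labels_b):
        table[index_a[a], index_b[b]] += 1
    return cats_a, cats_b, table


def match_clusters(labels_a, labels_b):
    """Pair clusters of A with clusters of B so that total overlap is maximal."""
    cats_a, cats_b, table = contingency_table(labels_a, labels_b)
    # The solver minimizes cost, so the counts are negated to maximize overlap.
    rows, cols = linear_sum_assignment(-table)
    mapping = {cats_a[i]: cats_b[j] for i, j in zip(rows, cols)}
    return mapping, table


# Hypothetical toy labels for seven TSSs (not data from the paper).
bicluster_labels = ["I", "I", "II", "II", "III", "III", "III"]
kmeans_labels = [2, 2, 0, 0, 1, 1, 0]
mapping, table = match_clusters(bicluster_labels, kmeans_labels)
```

Reordering the columns of `table` according to `mapping` would place matched clusters on the diagonal, which is the layout one would expect for a contingency heatmap of this kind.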
[Figure S25 image; panel labels: Step 1: cluster, where d = distance between two neighbouring CAGE TSSs and d <= 200; Step 2: TSSs from more than one annotated gene in a TSS cluster, marked X]
Figure S25: Overview of the definition of promoters based on CAGE tags. From FANTOM5 we import all CAGE tag annotations for a large number of cell lines and cluster nearby peaks in the vicinity of genes into potential promoters. We treat the union of all these promoters across cell lines as the set of potential promoters to study.
Step 1 shows how we cluster CAGE-tag-annotated promoters to obtain less redundant CAGE-based promoters.
Step 2 shows how we filter the clusters from Step 1 to obtain more reliable CAGE-based promoters.
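
The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: it assumes that d is measured in base pairs, that clustering is done per chromosome and strand, and that Step 2 discards any cluster containing TSSs from more than one annotated gene (our reading of the X marks in the figure). The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def cluster_tss(tss_records, max_gap=200):
    """Step 1: merge neighbouring CAGE TSSs whose distance d is <= max_gap.

    tss_records: iterable of (chrom, strand, position, gene) tuples.
    Returns a list of (chrom, strand, [(position, gene), ...]) clusters.
    """
    groups = defaultdict(list)
    for chrom, strand, pos, gene in tss_records:
        groups[(chrom, strand)].append((pos, gene))

    clusters = []
    for (chrom, strand), items in sorted(groups.items()):
        items.sort()
        current = [items[0]]
        for pos, gene in items[1:]:
            if pos - current[-1][0] <= max_gap:
                current.append((pos, gene))  # close enough: same cluster
            else:
                clusters.append((chrom, strand, current))
                current = [(pos, gene)]      # gap too large: start a new cluster
        clusters.append((chrom, strand, current))
    return clusters


def drop_multi_gene_clusters(clusters):
    """Step 2 (assumed reading): keep only clusters whose TSSs share one gene."""
    return [c for c in clusters if len({gene for _, gene in c[2]}) == 1]


# Hypothetical toy input: positions and gene names are made up.
records = [
    ("chr1", "+", 100, "A"),
    ("chr1", "+", 250, "A"),
    ("chr1", "+", 500, "A"),
    ("chr1", "+", 1000, "B"),
    ("chr1", "+", 1150, "C"),
]
clusters = cluster_tss(records)
kept = drop_multi_gene_clusters(clusters)
```

In this toy input the TSSs at 100 and 250 merge, the TSS at 500 stays alone, and the cluster spanning genes B and C is removed by the filter.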