Clustered ChIP-Seq-defined transcription factor binding sites and histone modifications map distinct classes of regulatory elements
© Rye et al; licensee BioMed Central Ltd. 2011
Received: 11 November 2011
Accepted: 24 November 2011
Published: 24 November 2011
Transcription factor binding to DNA requires both an appropriate binding element and suitably open chromatin, which together help to define regulatory elements within the genome. Current methods of identifying regulatory elements, such as promoters or enhancers, typically rely on sequence conservation, existing gene annotations or specific marks, such as histone modifications and p300 binding methods, each of which has its own biases.
Herein we show that an approach based on clustering of transcription factor peaks from high-throughput sequencing coupled with chromatin immunoprecipitation (ChIP-Seq) can be used to evaluate markers for regulatory elements. We used 67 data sets for 54 unique transcription factors distributed over two cell lines to create regulatory element clusters. By integrating the clusters from our approach with histone modifications and data for open chromatin, we identified general methylation of lysine 4 on histone H3 (H3K4me) as the most specific marker for transcription factor clusters. Clusters mapping to annotated genes showed distinct patterns in cluster composition related to gene expression and histone modifications. Clusters mapping to intergenic regions fall into two groups: those directly involved in transcription, including miRNAs and long noncoding RNAs, and those facilitating transcription by long-range interactions. The latter clusters were specifically enriched with H3K4me1, but less with acetylation of lysine 27 on histone H3 or p300 binding.
By integrating genomewide data of transcription factor binding and chromatin structure and using our data-driven approach, we pinpointed the chromatin marks that best explain transcription factor association with different regulatory elements. Our results also indicate that a modest selection of transcription factors may be sufficient to map most regulatory elements in the human genome.
Keywords: transcription factor, ChIP-Seq, histone modification, chromatin
Transcription factors are DNA-binding proteins that regulate gene expression by binding to promoter regions proximal to gene transcription start sites (TSSs) or to more distal enhancer regions that regulate expression through long-range interactions [1–3]. Transcription factor binding varies between cell types, and one major factor contributing to this cell type-specific binding is chromatin structure. Chromatin consists of DNA wrapped around nucleosomes, and chains of nucleosomes linked by DNA are organised structurally into different domains of accessible (open) and inaccessible (closed) chromatin [4–7]. Chromatin accessibility is regulated by DNA methylation and posttranslational modifications in the N-terminal tails of the nucleosomal histone proteins. Although there are no known combinations of modifications that delineate accessible and closed chromatin, histone acetylation and mono-, di- and trimethylation of lysine 4 on histone H3 (H3K4me1, H3K4me2 and H3K4me3, respectively) are generally associated with accessible chromatin, whereas H3K9me3 and H3K27me3 are associated with closed chromatin. Several other modifications coexist with these marks over different domains, but these modifications are generally less characterised.
Although the interplay between chromatin environments and transcription factor binding is not straightforward, accessible chromatin generally facilitates association of transcription factors to DNA. Some transcription factors, however, can modify the chromatin landscape around their binding site, which may recruit new transcription factors and chromatin-modifying factors to the region [1, 4]. Changes brought on by such events are the foundation for cell differentiation, whereby chromatin domains and transcription factor binding can be used as markers for cell type-specific regulation. Recent advances in high-throughput sequencing coupled with chromatin immunoprecipitation (ChIP-Seq) [8, 9] have enabled genomewide mapping of such domains. Though several studies have used ChIP-Seq to analyse large sets of transcription factors in different organisms [10, 11] or the interplay between sets of histone modifications [7, 12–20], few studies have investigated the relationship between large sets of transcription factors and histone modifications [18, 21, 22]. One reason for this is that such a data set would require considerable resources to produce. However, the joint efforts of researchers in the laboratories participating in the ENCODE project are now making such studies possible in a few selected cell lines.
The goal of this study was twofold. First, we wanted to investigate whether genomic regions enriched with bound transcription factors can be used to improve the identification of regulatory elements in the human genome. Specifically, we investigated whether such enriched regions concurred with existing genome annotations and data for histone modifications. Furthermore, we used the enriched regions to identify chromatin markers that best correlated with the binding of transcription factors and to evaluate previously used markers for regulatory regions. Second, we wanted to investigate whether the combination of transcription factors associated with the enriched regions differed depending on the type of regulatory element to which the enriched region mapped. Specifically, we wondered whether the transcription factor composition differed between enhancers and TSS proximal promoter elements.
Our analysis was based on ChIP-Seq reads for transcription factors from two cell lines: K562 and Gm12878. A total of 39 factors were mapped in K562 and 28 in Gm12878, and 13 of these were mapped in both cell lines. We used clusters of colocalised transcription factor-binding events as identifiers for regulatory elements involving transcription factors and verified that these clusters generally overlapped with regions of active chromatin. We then used two different strategies to identify four groups of transcription factor clusters with potentially different regulatory roles. First, we examined clusters mapping to previously annotated genes and promoters and separated these into (1) clusters mapping to annotated promoter regions (promoter clusters) and (2) clusters mapping to annotated genes but outside the promoter region (gene clusters). Second, we performed an alternative cluster separation independent of annotations, where clusters mapping to histone modifications closely associated with active transcription (H3K4me3, H3K36me3 and RNA polymerase II (Pol II) binding: transcript clusters) were separated from clusters with potentially distal regulatory function with respect to transcription (enhancer clusters). This definition represents an additional separation of the regulatory elements normally referred to as 'enhancers' into elements that produce transcripts and those that do not. The clusters associated with transcripts also correlated with actual transcription levels from high-throughput RNA sequencing (RNA-Seq) data. When comparing clusters in these groups, we observed that the clusters differed in their composition of transcription factors and association with specific histone modifications. In particular, we found that the identified enhancer clusters correlated well with the histone modification H3K4me1, a marker previously used for enhancers, but less well with acetylation of lysine 27 on histone H3 (H3K27ac) or binding of the histone acetyltransferase p300, two other commonly used markers for enhancers. We also investigated whether our selection of transcription factors gave good coverage of all cell type-specific regulatory elements in the human genome and found that a relatively modest selection of factors was sufficient to cover 90% of the annotated promoters for transcribed genes.
ChIP-Seq peaks for different transcription factors cluster along the genome
Histone H3 lysine 4 methylation is the chromatin mark best associated with transcription factor binding
Few factors were associated with repressive modifications. An important exception is NRSF (also known as REST), which, in contrast to the other factors we analysed, mainly associated with H3K27me3 (Additional file 1, Figure S1). NRSF is a transcriptional repressor that acts as a scaffold for recruiting several chromatin-modifying complexes involved in dimethylation of H3K9 (H3K9me2) and demethylation of H3K4. By binding long noncoding RNAs, NRSF can also colocalise with polycomb-repressive complex 2 (PRC2), which may explain the observed association between NRSF and H3K27me3. Similarly, H3K27me3 was partly associated with CTCF/Rad21, which is consistent with CTCF's interacting with PRC2.
Transcription factor clusters map to promoters of transcribed genes and upstream and/or downstream of promoters of both transcribed and silent genes
Composition of gene clusters correlates with H3K4me and differs from promoter clusters of transcribed genes
Correlation coefficients for composition differences in transcription factor clusters mapping to annotated genes and promoters
Cluster groups compared: prom-h vs gene-any-h; prom-z vs gene-any-z; prom-h vs prom-z; gene-any-h vs gene-any-z; gene-any-h vs prom-z; gene-pos-h vs gene-neg-h; gene-pos-z vs gene-neg-z; gene-pos-h vs prom-h; gene-pos-z vs prom-z; gene-pos-h vs gene-pos-z; gene-neg-h vs gene-neg-z
Consistent with the different regulatory functions of promoter and gene clusters in transcribed genes, these two cluster types showed the largest difference in composition (r 0.46 and 0.04 for K562 and Gm12878, respectively) (see Figure 7A). We also observed a large compositional difference between gene clusters associated with and not associated with H3K4me (r 0.40 and 0.56, respectively) and between gene clusters associated with H3K4me and promoter clusters (r 0.52 and 0.04, respectively) in transcribed genes. Thus transcribed genes had three types of clusters with markedly different composition: promoter clusters close to TSSs, gene clusters more distal to TSSs associated with H3K4me, and gene clusters far into the gene body not associated with H3K4me.
When comparing clusters between transcribed and silent genes, we observed that the composition of gene clusters did not change with respect to transcription (r 0.99 and 0.99, respectively) and that this was true both for gene clusters associated with H3K4me (r 0.93 and 0.97, respectively) (Figure 7B) and for those not associated with H3K4me (r 0.99 and 0.88, respectively). In contrast, and consistent with promoter clusters' having active and local roles in transcription regulation, promoter clusters changed in composition between transcribed and silent genes (r 0.70 and 0.37, respectively). Thus there is a change in composition between transcribed and silent genes only for promoter clusters, whereas the composition in gene clusters remains similar. This indicates that long-range regulatory interactions are present in both transcribed and silent genes. Further supporting this conclusion, CTCF, which facilitates long-range interactions, was among the most abundant transcription factors in all regions except for promoters of transcribed genes (Additional file 1, Table S4).
For silent genes, the compositional differences between the three types of clusters showed some discrepancies between the two cell lines. Especially for K562, the composition of gene clusters associated with H3K4me mostly resembled the composition of promoter clusters (r 0.89), whereas the closest resemblance was observed with gene clusters not associated with H3K4me in Gm12878 (r 0.81) (see Discussion).
The enrichment of individual transcription factors in promoters and genes did not always reflect the composition of the different cluster groups. Most factors were present in several groups, but the relative enrichment of each factor in each group was sometimes very different (Figure 7 and Additional file 1, Table S4). For example, in transcribed genes, a higher number of promoter clusters contained the factor c-Jun (831) compared to gene clusters (583), but the percentage of clusters containing c-Jun was higher in gene clusters (24%) than in promoter clusters (16%). Thus c-Jun may have a more important regulatory role in gene clusters, even though more ChIP-Seq peaks for this factor mapped to promoter clusters than to gene clusters. Generally, the 11,041 c-Jun peaks mapped better to TSS distal markers, such as H3K4me1 (Additional file 1, Figure S1a), than to promoters, which indicates the involvement of c-Jun in long-range regulatory interactions.
We also noted some H3K4me and Pol II enrichment at promoters of silent genes. However, this enrichment was not transformed into transcriptional output, as evidenced by the zero RNA-Seq expression and low enrichment of H3K36me3 for these genes (Figure 4 and Additional file 1, Figure S3). The increased enrichment of these transcript-related features may explain the higher concentration of transcription factor clusters around TSSs for silent genes in K562 compared to Gm12878 (Figure 5A).
H3K4me3, H3K36me3 and Pol II identify clusters overlapping with transcripts
In addition to annotated promoters and genes, we expected a proportion of the transcription factor clusters to associate with transcripts and enhancers outside annotated regions. To separate clusters located at promoters or in the body of transcripts (transcript clusters, including possibly unannotated transcripts) from clusters not associated with promoters or transcript bodies (enhancer clusters), we used associations with histone modifications H3K36me3 and H3K4me3, together with enrichment of Pol II, as described by Mikkelsen et al. Because of the special regulatory function of CTCF/Rad21, these two factors were left out of the analysis at this stage. Clusters overlapping with either H3K36me3 or Pol II were classified as transcript clusters. In addition, clusters overlapping with H3K4me3 were classified as transcript clusters if the region of H3K4me3 enrichment overlapped with either H3K36me3 or Pol II. The last criterion separated isolated regions of H3K4me3 from regions of H3K4me3 that involved the other two transcription markers. We used independent data for Pol III to identify additional transcript clusters not transcribed by Pol II.
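The classification rule described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names, the half-open `(start, end)` interval representation on a single chromosome, and the way Pol III domains are folded into the rule (any overlap marks a transcript cluster) are all assumptions made for the example.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def any_overlap(region, domains):
    """True if `region` intersects at least one interval in `domains`."""
    return any(overlaps(region, d) for d in domains)


def classify_cluster(cluster, h3k36me3, pol2, h3k4me3, pol3=()):
    """Label one cluster interval as 'transcript' or 'enhancer'.

    All arguments are intervals on the same chromosome (assumed layout).
    """
    # Direct overlap with H3K36me3 or Pol II marks a transcript cluster.
    if any_overlap(cluster, h3k36me3) or any_overlap(cluster, pol2):
        return "transcript"
    # Assumption: independent Pol III domains are treated the same way.
    if any_overlap(cluster, pol3):
        return "transcript"
    # H3K4me3 counts only if that H3K4me3 domain itself touches
    # H3K36me3 or Pol II; isolated H3K4me3 does not.
    for domain in h3k4me3:
        if overlaps(cluster, domain) and (
            any_overlap(domain, h3k36me3) or any_overlap(domain, pol2)
        ):
            return "transcript"
    return "enhancer"


# Toy domains (coordinates are invented for illustration).
h3k36me3 = [(5_000, 9_000)]
pol2 = [(20_000, 23_000)]
h3k4me3 = [(8_500, 10_500), (40_000, 41_000)]

label_linked = classify_cluster((9_500, 10_200), h3k36me3, pol2, h3k4me3)
label_isolated = classify_cluster((40_200, 40_800), h3k36me3, pol2, h3k4me3)
label_pol2 = classify_cluster((21_000, 21_500), h3k36me3, pol2, h3k4me3)
```

In this toy example the first cluster is a transcript cluster only because its H3K4me3 domain reaches into an H3K36me3 domain, whereas the second cluster sits on an isolated H3K4me3 domain and is therefore labelled as an enhancer cluster.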
Three points must be mentioned with respect to the classification of transcript and enhancer clusters. First, we have used the term 'enhancer clusters' to describe clusters which do not contain the histone modifications and polymerase signatures characteristic of transcription and have indicated that these are more likely to be involved in long-range interactions. However, a recent study showed that a subset of enhancers involved in long-range interactions also produce short noncoding transcripts. Since such enhancers also recruit Pol II and show enrichment of H3K36me3, these regulatory elements are classified among the transcript clusters according to the definition given above. Second, a subset of enhancer clusters may represent elements that are not involved in direct gene regulation [35, 36]. Third, when we compared this data-driven classification with our previous annotation-based analysis, we observed some transcript clusters in promoters and gene bodies of silent genes, especially in K562. The set of silent genes showed a small enrichment of Pol II and H3K4me3 (but not H3K36me3) around their TSSs (Figure 4 and Additional file 1, Figure S3), but this enrichment did not result in detectable transcription. We still chose to classify clusters in these silent gene regions as transcript clusters, as these signatures were most likely a result of stalled transcription and not related to long-range interactions. So, though there may be different and possibly overlapping functions between the two classes of regulatory elements, we continue to use the notion of transcript clusters as mainly transcript-producing and enhancer clusters as mainly involved in long-range interactions throughout the rest of the text.
We primarily used Pol II-related transcription to separate transcript clusters from enhancer clusters. Transcripts can be produced by polymerases other than Pol II, and clusters associating with these polymerases should be identified by our model as long as they also associate with H3K36me3 and H3K4me3. However, we did not always observe this association when analysing independent data for Pol III. The overlap between Pol III and H3K36me3 was only 39% and 26%, compared to 72% and 71% for Pol II and H3K36me3, in K562 and Gm12878, respectively. We therefore included the independent Pol III data in our model. On the one hand, we do not know whether other polymerases behave similarly to Pol II or Pol III with respect to H3K36me3 and H3K4me3, so we cannot exclude the possibility that some clusters classified as enhancer clusters may be associated with transcription by other polymerases, such as Pol I. The effect of Pol I transcription may, on the other hand, be most pronounced in genomic repeat regions, which are often excluded when mapping ChIP-Seq data to the genome.
Overlap between independent miRNA and lincRNA transcripts and transcript and enhancer clusters outside annotated genes and promoters
Columns: Number of transcripts; Overlap with transcript clusters; Overlap with enhancer clusters
H3K4me1 is a better marker for enhancer clusters than p300 or H3K27ac
Another marker commonly used for enhancer identification is the transcription factor and histone acetyltransferase p300 [16, 20, 21, 40, 43, 44]. However, we did not observe a good correspondence between this factor and enhancer clusters in Gm12878. Only 8% of the enhancer-related clusters were covered by this factor (data for p300 in K562 were not available). In fact, we identified ten factors in Gm12878 with better coverage of the enhancer clusters than p300, the best ones being BATF (51%), IRF4 (45%) and PU1 (41%). The latter factors also showed a preference for H3K4me1 (Additional file 1, Figure S1B), whereas p300 also mapped well to H3K4me3 and transcript clusters.
Although we saw poor correspondence between p300 and regulatory elements in Gm12878, p300 could be a better marker in other cell lines. Only approximately 1,500 p300 peaks were identified by the ChIP-Seq analysis, which is one reason for the low coverage in Gm12878. In addition, the identified peaks did not seem to map specifically to enhancers. We did not observe any transcription factors with specificity only towards the enhancer-related clusters, and all factors with a high overlap with these clusters also overlapped well with transcript clusters (Additional file 1, Figure S5).
H3K27ac has also recently been used as an identifier for enhancer regions [36, 45], but it showed a weaker overlap with enhancer clusters compared to H3K4me1 (Figure 9). In addition, all enhancer clusters containing H3K27ac also contained H3K4me1. Still, H3K27ac has been shown to be a useful marker for separating active from weak enhancers [34, 36] and might provide valuable information on subclasses of enhancers and enhancer activity in addition to the H3K4me1 mark. The notion that H3K27ac marks specifically active regulatory elements is reinforced by the observation that H3K27ac is present at nearly all active genes (Figure 9). We also observed that H3K9ac shows a mapping pattern similar to that of H3K27ac, especially in K562 (not shown). The percentage overlap of H3K27ac was also similar to that of H3K4me3 (Figure 9); however, these two modifications mark somewhat different enhancer clusters. OCRs also showed good overlap with enhancer clusters, but an OCR in itself was a less specific marker for enhancer clusters because of the large number of OCRs not mapping to any transcription factors (see Discussion). Thus the overall conclusion is that, in our enhancer clusters bound by transcription factors, H3K4me1 was a superior individual marker compared to H3K27ac, p300 and OCRs.
Mapped transcription factors cover a high percentage of promoters in highly transcribed genes
Observed cluster differences are not due to noisy ChIP-Seq peaks
ChIP-Seq data are potentially noisy [48–50], so it was important to separate noisy peaks from true binding events in our study. Given the clustering properties of transcription factors, we regarded overlapping peaks as more confident than singleton peaks, which led us to focus on peak clusters rather than individual peaks when defining regulatory transcription factor elements. We validated this approach (that is, using peaks in clusters) by sorting peaks for each of the 67 transcription factor data sets into 20 bins according to increasing ChIP-Seq tag intensity so that the most intense peaks were associated with the highest bin number. The largest number of singleton peaks was found in bin 3, whereas the largest numbers of peaks in clusters of size 2 and of size 3 were both found in bin 11. This indicates that a higher fraction of the singleton peaks are potentially noisy and that no gain in peak confidence is realised by increasing the cluster size limit from 2 to 3. We thus decided to use a cluster size limit of 2 in our study. To further verify that peaks in gene clusters were not due to noise, we investigated the confidence of peaks in gene clusters versus promoter clusters. Though we observed a higher average cluster size in promoter clusters than in gene clusters (Additional file 1, Table S4), only minor differences in the average bin value were observed, meaning that ChIP-Seq peaks in gene clusters are not more likely to be noise than peaks in promoter clusters. The reason for the decreased average size of gene clusters could be that fewer factors in these clusters were mapped by ChIP-Seq or that these clusters generally contain fewer transcription factors. We also investigated whether the difference in transcription factor composition among the cluster groups shown in Table 1 changed when we increased the cluster size limit to 3 and 4. Though we generally observed an increase in the correlation values with increasing cluster size limit, the relative differences between the groups stayed the same (data not shown).
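The intensity binning used in this validation can be illustrated with a short Python sketch. The text does not state how bin boundaries were chosen, so the sketch assumes rank-based bins of equal size within one data set; the function name `assign_bins` and the toy intensities are hypothetical.

```python
def assign_bins(intensities, n_bins=20):
    """Return a 1-based bin number per peak; a higher bin means a more intense peak.

    Assumption: peaks are ranked by intensity within one data set and the
    ranks are cut into `n_bins` groups of (nearly) equal size.
    """
    n = len(intensities)
    order = sorted(range(n), key=lambda i: intensities[i])
    bins = [0] * n
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // n + 1
    return bins


# Toy data set: 40 peaks with intensities 40, 39, ..., 1.
toy_intensities = list(range(40, 0, -1))
toy_bins = assign_bins(toy_intensities)
```

With bin numbers assigned per data set, the bin distribution of singleton peaks can be compared with that of peaks found in clusters of size 2 or 3, as described above.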
Not all singleton peaks are noise. In fact, some of the singleton peaks are among the high-confidence peaks within their data sets. However, by focusing on peak-overlaps, we could concentrate our analysis on regulatory regions containing several peaks, even if each peak in the region was less confident when evaluated on its own. We also considered the effect of losing a few singleton peaks and candidate regulatory regions preferable to including false-positive singleton peaks, which would lead to many additional false regulatory elements.
Clusters and chromatin signatures show discrepancies between the two cell lines
When mapping transcription factor clusters to annotated genes, our results agreed with the common notion that transcription factors in promoter regions drive the recruitment of the transcription initiation complex leading to gene transcription. Clusters were highly enriched within the typical promoter region (-2,000 bp to +200 bp) of the TSSs in transcribed genes and depleted in silent genes. We also identified a significant number of clusters within genes, which, compared to promoter clusters, showed little change in abundance and composition between transcribed and silent genes. Several studies have reported transcription factor binding outside the typical promoter region [51, 52], but the extent and biological significance of such binding events have generally been studied less often. Some of the regulatory roles of transcription factors in these TSS distal clusters could be to modulate the chromatin environment around genes, to work as enhancer elements directed towards their own gene or distant genes, or to regulate individual transcription of noncoding RNA within the gene body. The correlation in composition between some of these clusters and H3K4me may indicate a role in chromatin modulation. The total number of gene clusters was higher in transcribed genes than in silent genes (Figure 5B). Studies of three-dimensional chromosome organisation inside the nucleus have revealed that accessible and closed chromatin tend to compartmentalise, with the consequence that transcribed genes associated with accessible chromatin have a higher probability of being spatially close to other transcribed genes than to silent genes associated with closed chromatin. A high degree of dynamic intra- and interchromosomal interaction is often observed in the accessible compartment, which may explain the higher frequency of gene clusters in transcribed genes than in silent genes. Our observation that CTCF was one of the factors most enriched in gene clusters (Figure 7 and Additional file 1, Table S4) may also point in this direction, since CTCF is involved in higher-order organisation and modulation of chromatin domains [3, 54, 55].
The discrepancies observed when we compared cluster composition in the two cell lines may have several explanations. The difference in similarity between promoter clusters in transcribed and silent genes for the two cell lines (r 0.70 in K562 vs 0.37 in Gm12878) may partly be explained by the somewhat higher enrichment of Pol II at silent promoters in K562 (compare Figure 4 with Additional file 1, Figure S3). When we removed Pol II-associated promoter clusters from silent genes in K562, the correlation dropped from 0.70 to 0.42, which is more similar to Gm12878. It is thus likely that promoter clusters at silent genes in K562 are more enriched in factors directly involved in transcription, leading to the increased composition similarity with promoter clusters of transcribed genes. Additional observations indicate that transcription factors mapped in K562 are generally more promoter-specific and directly related to transcription than the factors mapped in Gm12878. First, the number of gene clusters relative to promoter clusters is higher in Gm12878 than in K562 (Figure 5 and Additional file 1, Table S4). Second, when we investigated the transcription factor composition differences between enhancer clusters and clusters mapping to annotated promoters in transcribed genes, we observed a larger difference for K562 (r 0.65) than for Gm12878 (r 0.91). Third, the higher influence of singletons for Gm12878 in the coverage analysis also indicates that Gm12878 promoters are less mapped by transcription factors than promoters in K562. The latter difference could also be reinforced by the smaller number of factors mapped in Gm12878 (28) compared to K562 (39).
More transcript clusters mapped to silent genes in K562 (625) compared to Gm12878 (56)
Further examination of the 625 transcript clusters mapping to silent genes in K562 revealed that 100 clusters contained H3K36me3, which may indicate transcriptional events not captured by RNA-Seq. Another 188 clusters contained the silent histone modification H3K27me3 in addition to the active marks, which relates these clusters to bivalent domains [37, 56]. In promoters of bivalent genes the transcriptional machinery, including Pol II, is recruited, but transcription elongation is stalled and no transcriptional output is produced. We could not find the exact cause for the association of the other 337 transcript clusters to silent genes. Another source of the cell type-specific discrepancies is the selection of transcription factors mapped by ChIP-Seq in each cell line. Of the 39 factors mapped in K562 and the 28 mapped in Gm12878, only 13 factors were common to both cell lines. The selected transcription factors can thus be biased towards specific types of clusters, leading to inconsistent results when the composition profiles are compared. We cannot rule out effects of possible cell type-specific regulation mechanisms, which have been observed in recent studies [43, 57].
Many regions enriched with H3K4me and open chromatin regions are not mapped by any transcription factors
In this study, we investigated transcription factor clusters and their relation to gene annotations and chromatin environments throughout the human genome. Our results provide new insight into the relationship between transcription factor binding in regulatory regions and histone modification domains. Specifically, we found that transcription factor clusters mapping to genes outside the core promoter showed similar composition in both transcribed and silent genes, but that the composition differed depending on the presence of H3K4me. H3K4me was also identified as the preferred mark for transcription factor binding in general and should thus be used as a priority marker for identifying transcription factor-binding events in as yet uncharacterised cell types. We also confirmed that the histone modifications H3K36me3 and H3K4me3, together with Pol II, could be used to separate clusters involved in transcription from clusters more related to long-range interactions, although the functions of these two classes of clusters may overlap somewhat. For the latter clusters, H3K4me1 was identified as the preferred marker compared to H3K27ac, p300 and OCRs. The further integration of high-throughput sequencing data of new histone modifications, transcription factors and other regulatory features in different cell types will most certainly increase our knowledge of the complex relationships between transcription factors and histone modifications.
In this study, we used data from the ENCODE project contained in the UCSC Genome Browser Database and downloaded ChIP-Seq data for the two cell lines K562 and Gm12878. The data consisted of ChIP-Seq reads for 54 transcription factors (26 unique in K562, 15 unique in Gm12878 and 13 mapped in both cell lines, giving 67 data sets in total) and 9 histone modifications mapped in both cell lines. In addition, we downloaded accessible chromatin tracks analysed by DNase hypersensitivity and FAIRE-Seq (OCR), expression data analysed by RNA-Seq for the same cell lines and gene annotations from UCSC refGene. An overview of all data downloaded from the ENCODE database in the UCSC Genome Browser is given in Additional file 1, Table S7. In addition, we downloaded transcript data for miRNA and lincRNA (lincRNA pipeline based on the paper by Guttman et al.) from other sources to evaluate cluster classifications outside annotated genes and promoters. The liftOver tool was used to convert both of these data sets from the hg19 to the hg18 version of the human genome assembly, using the Galaxy web-based platform (http://main.g2.bx.psu.edu/) for miRNA and the UCSC Genome Browser Database for lincRNA.
Identification of clusters
Confident ChIP-Seq peaks for each transcription factor were identified by using an in-house method based on output from the programs MACS and SISSRs. To make sure that peaks belonging to the same regulatory element were identified in the same cluster, we extended each peak to 2,000 bp to emulate the common standard used for promoter regulatory elements (2,000 bp upstream and 200 bp downstream from TSSs). Peaks overlapping within the 2,000 bp extension were identified as belonging to the same cluster.
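The clustering step can be sketched as follows. This is a minimal illustration, not the implementation used in the study. It assumes that each peak is extended symmetrically around its midpoint (the text does not state how the 2,000 bp window is positioned), that peaks are given as hypothetical `(chrom, start, end, factor)` tuples, and that a cluster must contain at least two peaks, following the cluster size limit of 2 discussed above.

```python
from collections import defaultdict

EXTENDED_WIDTH = 2000  # bp, as stated in the text


def extend_peak(start, end, width=EXTENDED_WIDTH):
    """Return a `width`-bp window centred on the peak midpoint (assumed scheme)."""
    mid = (start + end) // 2
    new_start = max(0, mid - width // 2)
    return new_start, new_start + width


def cluster_peaks(peaks, min_size=2):
    """Group peaks whose extended windows overlap on the same chromosome.

    `peaks` is an iterable of (chrom, start, end, factor) tuples.
    Returns a list of clusters; each cluster is a list of the input tuples.
    """
    by_chrom = defaultdict(list)
    for peak in peaks:
        chrom, start, end, _factor = peak
        by_chrom[chrom].append((extend_peak(start, end), peak))

    clusters = []
    for chrom in sorted(by_chrom):
        current, current_end = [], None
        # Sweep the windows from left to right, merging overlapping ones.
        for (w_start, w_end), peak in sorted(by_chrom[chrom], key=lambda x: x[0]):
            if current and w_start < current_end:
                current.append(peak)
                current_end = max(current_end, w_end)
            else:
                if len(current) >= min_size:
                    clusters.append(current)
                current, current_end = [peak], w_end
        if len(current) >= min_size:
            clusters.append(current)
    return clusters


# Toy peaks (coordinates are invented for illustration).
example_peaks = [
    ("chr1", 10_000, 10_200, "CTCF"),
    ("chr1", 10_900, 11_100, "c-Jun"),
    ("chr1", 50_000, 50_300, "NRSF"),  # no neighbour within reach: singleton
    ("chr2", 10_000, 10_200, "CTCF"),  # other chromosome: singleton
]
example_clusters = cluster_peaks(example_peaks)
```

Here the two neighbouring chr1 peaks form one cluster, while the two singleton peaks are discarded.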
Identification of random clusters
Within each chromosome, all peak starts were randomly shuffled. Because the full chromosome cannot be mapped by ChIP-Seq, the length of the chromosome was adjusted by the mappability factor in MACS (0.88 for 25-bp tag lengths). Each peak was then extended to 2,000 bp, and overlaps were identified by using the same procedure as that used for the true peaks.
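A hedged sketch of this randomisation is given below. It assumes that "shuffling" means drawing a new, uniformly distributed start position for every peak within the mappable fraction of the chromosome while keeping peak widths and factor labels; the function name, the tuple layout and the chromosome length in the example are hypothetical.

```python
import random

MAPPABILITY = 0.88  # MACS factor for 25-bp tags, as stated in the text


def shuffle_peak_starts(peaks, chrom_length, rng, mappability=MAPPABILITY):
    """Place each peak at a uniformly random start within the mappable length.

    `peaks` is a list of (start, end, factor) tuples on one chromosome.
    Peak widths and factor labels are preserved (assumed scheme).
    """
    effective_length = int(chrom_length * mappability)
    shuffled = []
    for start, end, factor in peaks:
        width = end - start
        new_start = rng.randrange(0, max(1, effective_length - width))
        shuffled.append((new_start, new_start + width, factor))
    return shuffled


rng = random.Random(0)  # fixed seed so the sketch is reproducible
observed_peaks = [(10_000, 10_200, "CTCF"), (10_900, 11_100, "c-Jun")]
randomised_peaks = shuffle_peak_starts(observed_peaks, chrom_length=1_000_000, rng=rng)
```

The randomised peaks can then be extended and clustered with the same procedure as the observed peaks, giving a background estimate of how many clusters arise by chance.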
Identification of domains of histone modifications
ChIP-Seq data for histone modifications and Pol II were analysed by using the program SICER version 1.03. The gap size parameters were set to 200 for H3K4me3 and to 600 for other histone modifications, as recommended. For Pol II, we used a larger gap size of 1,000 to capture longer domains of Pol II binding rather than local Pol II peaks. ChIP-Seq data sets for the same modification were combined, resulting in a single track for each modification in each cell line. To account for the difference in domain size of histone modifications, we split each domain into regions with a maximum length of 5,000 bp and used these regions throughout most of the study. The full-length regions were used only during the identification of transcript clusters by H3K36me3, H3K4me3 and Pol II.
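The splitting of long domains into regions of at most 5,000 bp might look like the following sketch. The text does not say how the cut points were chosen, so a simple left-to-right tiling is assumed here; `split_domain` is a hypothetical helper.

```python
MAX_REGION = 5000  # bp, as stated in the text


def split_domain(start, end, max_len=MAX_REGION):
    """Cut the half-open interval [start, end) into pieces no longer than `max_len`."""
    return [(s, min(s + max_len, end)) for s in range(start, end, max_len)]


# A 12,500-bp domain becomes two 5,000-bp regions and one 2,500-bp remainder.
domain_pieces = split_domain(100_000, 112_500)
```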
Gene expression and promoters
Gene expression was measured by the average RNA-Seq intensity based on tags mapped to exons for each gene. Nonredundant genes were selected by grouping all genes with overlapping subsets of exons transcribed from the same strand and then selecting only the gene with the highest expression from each group. From the downloaded list of 30,399 genes, 18,682 nonredundant genes were identified by using this approach. A set of transcribed genes was chosen as the top third of all nonredundant genes when sorted by expression level, whereas a set of silent genes was chosen as all genes with zero expression. This resulted in 6,228 transcribed genes (both cell lines) and 5,591 and 4,124 silent genes in K562 and Gm12878, respectively. Promoters were defined as extending 2,000 bp upstream and 200 bp downstream of TSSs.
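The expression split and the promoter definition can be sketched in Python as follows. The function names and input layout are hypothetical, and rounding the top third up is an assumption (it is consistent with 6,228 of 18,682 genes). The sketch does not cover the grouping of redundant genes.

```python
import math


def split_by_expression(expression):
    """Return (transcribed, silent) gene lists from a {gene: expression} dict.

    transcribed: top third of genes ranked by expression (rounded up, an assumption).
    silent: all genes with zero expression.
    """
    ranked = sorted(expression, key=expression.get, reverse=True)
    n_top = math.ceil(len(ranked) / 3)
    transcribed = ranked[:n_top]
    silent = [gene for gene in ranked if expression[gene] == 0]
    return transcribed, silent


def promoter(tss, strand, upstream=2000, downstream=200):
    """Return (start, end) of the promoter around a TSS, respecting strand."""
    if strand == "+":
        return (max(0, tss - upstream), tss + downstream)
    # On the minus strand, "upstream" lies at higher genomic coordinates.
    return (max(0, tss - downstream), tss + upstream)
```

Handling the strand explicitly matters because a promoter on the minus strand extends 2,000 bp towards higher coordinates, not lower ones.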
Correlation of transcription factor composition between two types of clusters was calculated as follows. For each cluster type, the percentage of clusters occupied by each factor was calculated, giving one vector of per-factor occupancy percentages per cluster type. Pearson's correlation coefficient (r) between the two vectors was then calculated by using the corrcoef function in NumPy (http://numpy.scipy.org/).
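A minimal sketch of this calculation is given below. It assumes that each cluster is represented as a set of factor names; the function names are hypothetical, and only `numpy.corrcoef` is taken from the text.

```python
import numpy as np


def occupancy_vector(clusters, factors):
    """Percentage of clusters containing each factor, in the order of `factors`.

    clusters: list of sets of factor names (hypothetical representation).
    """
    n = len(clusters)
    return np.array([100.0 * sum(f in c for c in clusters) / n for f in factors])


def composition_correlation(clusters_a, clusters_b, factors):
    """Pearson's r between the occupancy vectors of two cluster types."""
    vec_a = occupancy_vector(clusters_a, factors)
    vec_b = occupancy_vector(clusters_b, factors)
    # corrcoef returns a 2 x 2 matrix; the off-diagonal entry is r.
    return np.corrcoef(vec_a, vec_b)[0, 1]
```

Both vectors must list the factors in the same order, which is why the factor list is passed explicitly and not derived from each cluster set separately.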
OCR: open chromatin region
TSS: transcription start site
ChIP-Seq: chromatin immunoprecipitation coupled with high-throughput sequencing.
This work was supported by the Functional Genomics Program (FUGE) of the Norwegian Research Council.
- Farnham PJ: Insights from genomic profiling of transcription factors. Nat Rev Genet. 2009, 10: 605-616. 10.1038/nrg2636.
- Kagey MH, Newman JJ, Bilodeau S, Zhan Y, Orlando DA, van Berkum NL, Ebmeier CC, Goossens J, Rahl PB, Levine SS, Taatjes DJ, Dekker J, Young RA: Mediator and cohesin connect gene expression and chromatin architecture. Nature. 2010, 467: 430-435. 10.1038/nature09380.
- Phillips JE, Corces VG: CTCF: master weaver of the genome. Cell. 2009, 137: 1194-1211. 10.1016/j.cell.2009.06.001.
- Kouzarides T: Chromatin modifications and their function. Cell. 2007, 128: 693-705. 10.1016/j.cell.2007.02.005.
- Li B, Carey M, Workman JL: The role of chromatin during transcription. Cell. 2007, 128: 707-719. 10.1016/j.cell.2007.01.015.
- Bernstein BE, Meissner A, Lander ES: The mammalian epigenome. Cell. 2007, 128: 669-681. 10.1016/j.cell.2007.01.033.
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129: 823-837. 10.1016/j.cell.2007.05.009.
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007, 316: 1497-1502. 10.1126/science.1141319.
- Park PJ: ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009, 10: 669-680.
- Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH: Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008, 133: 1106-1117. 10.1016/j.cell.2008.04.043.
- Niu W, Lu ZJ, Zhong M, Sarov M, Murray JI, Brdlik CM, Janette J, Chen C, Alves P, Preston E, Slightham C, Jiang L, Hyman AA, Kim SK, Waterston RH, Gerstein M, Snyder M, Reinke V: Diverse transcription factor binding features revealed by genome-wide ChIP-seq in C. elegans. Genome Res. 2011, 21: 245-254. 10.1101/gr.114587.110.
- Hon G, Wang W, Ren B: Discovery and annotation of functional chromatin signatures in the human genome. PLoS Comput Biol. 2009, 5: e1000566. 10.1371/journal.pcbi.1000566.
- Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, Zhao K: Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet. 2008, 40: 897-903. 10.1038/ng.154.
- Ernst J, Kellis M: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol. 2010, 28: 817-825. 10.1038/nbt.1662.
- Hon G, Ren B, Wang W: ChromaSig: a probabilistic approach to finding common chromatin signatures in the human genome. PLoS Comput Biol. 2008, 4: e1000201. 10.1371/journal.pcbi.1000201.
- Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, Wang W, Weng Z, Green RD, Crawford GE, Ren B: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet. 2007, 39: 311-318. 10.1038/ng1966.
- Yu H, Zhu S, Zhou B, Xue H, Han JD: Inferring causal relationships among different histone modifications and gene expression. Genome Res. 2008, 18: 1314-1324. 10.1101/gr.073080.107.
- modENCODE Consortium, Roy S, Ernst J, Kharchenko PV, Kheradpour P, Negre N, Eaton ML, Landolin JM, Bristow CA, Ma L, Lin MF, Washietl S, Arshinoff BI, Ay F, Meyer PE, Robine N, Washington NL, Di Stefano L, Berezikov E, Brown CD, Candeias R, Carlson JW, Carr A, Jungreis I, Marbach D, Sealfon R, Tolstorukov MY, Will S, Alekseyenko AA, Artieri C, Booth BW, et al: Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010, 330: 1787-1797.
- Kharchenko PV, Alekseyenko AA, Schwartz YB, Minoda A, Riddle NC, Ernst J, Sabo PJ, Larschan E, Gorchakov AA, Gu T, Linder-Basso D, Plachetka A, Shanower G, Tolstorukov MY, Luquette LJ, Xi R, Jung YL, Park RW, Bishop EP, Canfield TK, Sandstrom R, Thurman RE, MacAlpine DM, Stamatoyannopoulos JA, Kellis M, Elgin SC, Kuroda MI, Pirrotta V, Karpen GH, Park PJ: Comprehensive analysis of the chromatin landscape in Drosophila melanogaster. Nature. 2011, 471: 480-485. 10.1038/nature09725.
- Won KJ, Chepelev I, Ren B, Wang W: Prediction of regulatory elements in mammalian genomes using chromatin signatures. BMC Bioinformatics. 2008, 9: 547. 10.1186/1471-2105-9-547.
- Won KJ, Ren B, Wang W: Genome-wide prediction of transcription factor binding sites using an integrated model. Genome Biol. 2010, 11: R7. 10.1186/gb-2010-11-1-r7.
- Kim J, Chu J, Shen X, Wang J, Orkin SH: An extended transcriptional network for pluripotency of embryonic stem cells. Cell. 2008, 132: 1049-1061. 10.1016/j.cell.2008.02.039.
- ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigó R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, Kuehn MS, Taylor CM, Neph S, Koch CM, Asthana S, Malhotra A, Adzhubei I, Greenbaum JA, Andrews RM, Flicek P, Boyle PJ, Cao H, Carter NP, Clelland GK, Davis S, Day N, Dhami P, Dillon SC, Dorschner MO, Fiegler H, et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447: 799-816. 10.1038/nature05874.
- Bulger M, Groudine M: Functional and mechanistic diversity of distal transcription enhancers. Cell. 2011, 144: 327-339. 10.1016/j.cell.2011.01.024.
- Blanchette M, Bataille AR, Chen X, Poitras C, Laganière J, Lefèbvre C, Deblois G, Giguère V, Ferretti V, Bergeron D, Coulombe B, Robert F: Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression. Genome Res. 2006, 16: 656-668. 10.1101/gr.4866006.
- Gupta M, Liu JS: De novo cis-regulatory module elicitation for eukaryotic genomes. Proc Natl Acad Sci USA. 2005, 102: 7079-7084. 10.1073/pnas.0408743102.
- Zhou Q, Wong WH: CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc Natl Acad Sci USA. 2004, 101: 12114-12119. 10.1073/pnas.0402858101.
- Ooi L, Wood IC: Chromatin crosstalk in development and disease: lessons from REST. Nat Rev Genet. 2007, 8: 544-554. 10.1038/nrg2100.
- Tsai MC, Manor O, Wan Y, Mosammaparast N, Wang JK, Lan F, Shi Y, Segal E, Chang HY: Long noncoding RNA as modular scaffold of histone modification complexes. Science. 2010, 329: 689-693. 10.1126/science.1192002.
- Li T, Hu JF, Qiu X, Ling J, Chen H, Wang S, Hou A, Vu TH, Hoffman AR: CTCF regulates allelic expression of Igf2 by orchestrating a promoter-polycomb repressive complex 2 intrachromosomal loop. Mol Cell Biol. 2008, 28: 6473-6482. 10.1128/MCB.00204-08.
- Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007, 448: 553-560. 10.1038/nature06008.
- Koch CM, Andrews RM, Flicek P, Dillon SC, Karaöz U, Clelland GK, Wilcox S, Beare DM, Fowler JC, Couttet P, James KD, Lefebvre GC, Bruce AW, Dovey OM, Ellis PD, Dhami P, Langford CF, Weng Z, Birney E, Carter NP, Vetrie D, Dunham I: The landscape of histone modifications across 1% of the human genome in five human cell lines. Genome Res. 2007, 17: 691-707. 10.1101/gr.5704207.
- Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S, Markenscoff-Papadimitriou E, Kuhl D, Bito H, Worley PF, Kreiman G, Greenberg ME: Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010, 465: 182-187. 10.1038/nature09033.
- Zentner GE, Tesar PJ, Scacheri PC: Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome Res. 2011, 21: 1273-1283. 10.1101/gr.122382.111.
- MacQuarrie KL, Fong AP, Morse RH, Tapscott SJ: Genome-wide transcription factor binding: beyond direct target regulation. Trends Genet. 2011, 27: 141-148. 10.1016/j.tig.2011.01.001.
- Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA, Boyer LA, Young RA, Jaenisch R: Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc Natl Acad Sci USA. 2010, 107: 21931-21936. 10.1073/pnas.1016071107.
- Kanhere A, Viiri K, Araújo CC, Rasaiyaah J, Bouwman RD, Whyte WA, Pereira CF, Brookes E, Walker K, Bell GW, Pombo A, Fisher AG, Young RA, Jenner RG: Short RNAs are transcribed from repressed polycomb target genes and interact with polycomb repressive complex-2. Mol Cell. 2010, 38: 675-688. 10.1016/j.molcel.2010.03.019.
- Marson A, Levine SS, Cole MF, Frampton GM, Brambrink T, Johnstone S, Guenther MG, Johnston WK, Wernig M, Newman J, Calabrese JM, Dennis LM, Volkert TL, Gupta S, Love J, Hannett N, Sharp PA, Bartel DP, Jaenisch R, Young RA: Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell. 2008, 134: 521-533. 10.1016/j.cell.2008.07.020.
- Guttman M, Amit I, Garber M, French C, Lin MF, Feldser D, Huarte M, Zuk O, Carey BW, Cassady JP, Cabili MN, Jaenisch R, Mikkelsen TS, Jacks T, Hacohen N, Bernstein BE, Kellis M, Regev A, Rinn JL, Lander ES: Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature. 2009, 458: 223-227. 10.1038/nature07672.
- Visel A, Blow MJ, Li Z, Zhang T, Akiyama JA, Holt A, Plajzer-Frick I, Shoukry M, Wright C, Chen F, Afzal V, Ren B, Rubin EM, Pennacchio LA: ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature. 2009, 457: 854-858. 10.1038/nature07730.
- Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, Plajzer-Frick I, Akiyama J, De Val S, Afzal V, Black BL, Couronne O, Eisen MB, Visel A, Rubin EM: In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006, 444: 499-502. 10.1038/nature05295.
- Nobrega MA, Ovcharenko I, Afzal V, Rubin EM: Scanning human gene deserts for long-range enhancers. Science. 2003, 302: 413. 10.1126/science.1088328.
- Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW, Ching KA, Antosiewicz-Bourget JE, Liu H, Zhang X, Green RD, Lobanenkov VV, Stewart R, Thomson JA, Crawford GE, Kellis M, Ren B: Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature. 2009, 459: 108-112. 10.1038/nature07829.
- Gotea V, Visel A, Westlund JM, Nobrega MA, Pennacchio LA, Ovcharenko I: Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res. 2010, 20: 565-577. 10.1101/gr.104471.109.
- Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J: A unique chromatin signature uncovers early developmental enhancers in humans. Nature. 2011, 470: 279-283. 10.1038/nature09692.
- Ravasi T, Suzuki H, Cannistraci CV, Katayama S, Bajic VB, Tan K, Akalin A, Schmeier S, Kanamori-Katayama M, Bertin N, Carninci P, Daub CO, Forrest AR, Gough J, Grimmond S, Han JH, Hashimoto T, Hide W, Hofmann O, Kamburov A, Kaur M, Kawaji H, Kubosaki A, Lassmann T, van Nimwegen E, MacPherson CR, Ogawa C, Radovanovic A, Schwartz A, Teasdale RD, et al: An atlas of combinatorial transcriptional regulation in mouse and man. Cell. 2010, 140: 744-752. 10.1016/j.cell.2010.01.044.
- Vaquerizas JM, Kummerfeld SK, Teichmann SA, Luscombe NM: A census of human transcription factors: function, expression and evolution. Nat Rev Genet. 2009, 10: 252-263. 10.1038/nrg2538.
- Rye MB, Sætrom P, Drabløs F: A manually curated ChIP-seq benchmark demonstrates room for improvement in current peak-finder programs. Nucleic Acids Res. 2011, 39: e25. 10.1093/nar/gkq1187.
- Kharchenko PV, Tolstorukov MY, Park PJ: Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008, 26: 1351-1359. 10.1038/nbt.1508.
- Xu H, Handoko L, Wei XL, Ye CP, Sheng JP, Wei CL, Lin F, Sung WK: A signal-noise model for significance analysis of ChIP-seq with negative control. Bioinformatics. 2010, 26: 1199-1204. 10.1093/bioinformatics/btq128.
- Wallerman O, Motallebipour M, Enroth S, Patra K, Bysani MS, Komorowski J, Wadelius C: Molecular interactions between HNF4a, FOXA2 and GABP identified at regulatory DNA elements through ChIP-sequencing. Nucleic Acids Res. 2009, 37: 7498-7508. 10.1093/nar/gkp823.
- Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA: The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol. 2003, 20: 1377-1419. 10.1093/molbev/msg140.
- Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, Sandstrom R, Bernstein B, Bender MA, Groudine M, Gnirke A, Stamatoyannopoulos J, Mirny LA, Lander ES, Dekker J: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009, 326: 289-293. 10.1126/science.1181369.
- Cuddapah S, Jothi R, Schones DE, Roh TY, Cui K, Zhao K: Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains. Genome Res. 2009, 19: 24-32.
- Kim TH, Abdullaev ZK, Smith AD, Ching KA, Loukinov DI, Green RD, Zhang MQ, Lobanenkov VV, Ren B: Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell. 2007, 128: 1231-1245. 10.1016/j.cell.2006.12.048.
- Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, Jaenisch R, Wagschal A, Feil R, Schreiber SL, Lander ES: A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell. 2006, 125: 315-326. 10.1016/j.cell.2006.02.041.
- Pekowska A, Benoukraf T, Ferrier P, Spicuglia S: A unique H3K4me2 profile marks tissue-specific gene regulation. Genome Res. 2010, 20: 1493-1502. 10.1101/gr.109389.110.
- Boyle AP, Davis S, Shulha HP, Meltzer P, Margulies EH, Weng Z, Furey TS, Crawford GE: High-resolution mapping and characterization of open chromatin across the genome. Cell. 2008, 132: 311-322. 10.1016/j.cell.2007.12.014.
- Gaulton KJ, Nammo T, Pasquali L, Simon JM, Giresi PG, Fogarty MP, Panhuis TM, Mieczkowski P, Secchi A, Bosco D, Berney T, Montanya E, Mohlke KL, Lieb JD, Ferrer J: A map of open chromatin in human pancreatic islets. Nat Genet. 2010, 42: 255-259. 10.1038/ng.530.
- Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ, University of California Santa Cruz: The UCSC Genome Browser Database. Nucleic Acids Res. 2003, 31: 51-54. 10.1093/nar/gkg129.
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS: Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008, 9: R137. 10.1186/gb-2008-9-9-r137.
- Jothi R, Cuddapah S, Barski A, Cui K, Zhao K: Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 2008, 36: 5221-5231. 10.1093/nar/gkn488.
- Zang CZ, Schones DE, Zeng C, Cui KR, Zhao KJ, Peng WQ: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009, 25: 1952-1958. 10.1093/bioinformatics/btp340.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.