Enhanced 5-methylcytosine detection in single-molecule, real-time sequencing via Tet1 oxidation
© Clark et al; licensee BioMed Central Ltd. 2013
Received: 24 July 2012
Accepted: 22 January 2013
Published: 22 January 2013
DNA methylation serves as an important epigenetic mark in both eukaryotic and prokaryotic organisms. In eukaryotes, the most common epigenetic mark is 5-methylcytosine, whereas prokaryotes can have 6-methyladenine, 4-methylcytosine, or 5-methylcytosine. Single-molecule, real-time sequencing is capable of directly detecting all three types of modified bases. However, the kinetic signature of 5-methylcytosine is subtle, which presents a challenge for detection. We investigated whether conversion of 5-methylcytosine to 5-carboxylcytosine using the enzyme Tet1 would enhance the kinetic signature, thereby improving detection.
We characterized the kinetic signatures of various cytosine modifications, demonstrating that 5-carboxylcytosine has a larger impact on the local polymerase rate than 5-methylcytosine. Using Tet1-mediated conversion, we show improved detection of 5-methylcytosine using in vitro methylated templates and apply the method to the characterization of 5-methylcytosine sites in the genomes of Escherichia coli MG1655 and Bacillus halodurans C-125.
We have developed a method that enhances the direct detection of 5-methylcytosine during single-molecule, real-time sequencing. Using Tet1 to convert 5-methylcytosine to 5-carboxylcytosine improves the detection rate of this important epigenetic marker, thereby complementing the set of readily detectable microbial base modifications and enhancing the ability to interrogate eukaryotic epigenetic markers.
The DNA of most organisms comprises more than the four canonical bases (A, C, G and T). In mammals, for example, 5-methylcytosine (5mC) constitutes about 1% of all DNA bases and is found primarily in CpG dinucleotides. Methylation plays a critical role in the regulation of gene expression, genomic imprinting and the suppression of transposable elements. Often referred to as the sixth base, 5-hydroxymethylcytosine (5hmC) is also found in many metazoan genomes. 5hmC is produced from 5mC by the Ten-eleven translocation (Tet) family of proteins [3, 4]. Recently, it was discovered that Tet proteins can also convert 5mC to 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC) [6, 7]. In humans, there are three different Tet proteins (Tet1, Tet2, Tet3) that are all capable of this conversion [6, 7]. It is currently thought that DNA demethylation may occur through this process of 5mC oxidation followed by base excision repair [6, 8], and possibly decarboxylation.
Many bacterial and archaeal genomes also contain modified DNA bases. The three most common forms of methylation are 6-methyladenine (6mA), 4-methylcytosine (4mC) and 5mC. The primary function of methylation is DNA self-recognition via restriction-modification systems that protect the organism against invading DNA. However, there are methyltransferases (MTases), such as dam, that are not part of restriction-modification systems and are important in chromosome stability, mismatch repair and replication. There is some evidence that the presence of methylation can also impact gene expression. Thus, detection and identification of methylated bases in both prokaryotes and eukaryotes is essential to the complete understanding of genome function.
The most common techniques for large-scale detection of DNA methylation rely on bisulfite treatment of the DNA prior to sequencing. Sodium bisulfite chemically deaminates cytosine residues to uracil, which is subsequently read as thymine. Methylated cytosines are converted with much lower efficiency and thus remain cytosines. The presence of 5mC is inferred by comparing bisulfite-treated DNA sequences to an untreated reference. In standard bisulfite sequencing, 5mC cannot be distinguished from 5hmC. Conversion of 5mC to 5caC through the activity of Tet1, or of 5hmC to 5fC through chemical conversion, followed by bisulfite sequencing has recently been exploited for the genome-wide sequencing of 5mC and 5hmC.
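The bisulfite logic described above can be illustrated with a small sketch. It is an idealized model, assuming complete conversion of unmethylated cytosine and complete protection of methylated cytosine, which real experiments do not achieve; the function names and the toy sequence are hypothetical.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate idealized bisulfite treatment of one DNA strand.

    Unmethylated C is read out as T; C at a methylated position stays C.
    Assumption: 100% conversion and 100% protection (not true in practice).
    """
    protected = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(sequence)
    )


def call_methylated(reference, treated_read):
    """Infer methylated positions: a reference C that is still C after treatment."""
    return [
        i
        for i, (ref, obs) in enumerate(zip(reference, treated_read))
        if ref == "C" and obs == "C"
    ]


# Toy example (invented sequence): only the cytosine at index 2 is methylated.
reference = "ACCGGTCA"
treated = bisulfite_convert(reference, methylated_positions=[2])
print(treated)                              # ATCGGTTA
print(call_methylated(reference, treated))  # [2]
```

Because the call depends only on whether a cytosine survives treatment, this model also shows why 5mC and 5hmC are indistinguishable in standard bisulfite sequencing: both would simply appear as protected positions.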
We have previously described a technique for the direct detection of modified DNA using single-molecule, real-time (SMRT®) sequencing [16, 17]. SMRT sequencing involves the monitoring of a DNA polymerase as it makes a copy of a DNA molecule [18, 19]. When the DNA polymerase encounters a modified base on the template strand, its rate of progression changes in a characteristic way relative to an unmodified template with the same sequence context [16, 17]. The speed of the polymerase is monitored by determining the length of time between the fluorescent pulses that indicate nucleotide incorporation. The time between pulses is called the interpulse duration (IPD). The change in IPD between a modified and control template varies in magnitude and position depending on the nature of the base modification and the local sequence context. We refer to these reproducible changes in IPD as the kinetic signature for that modification.
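As a rough illustration of how a kinetic signature can be expressed numerically, the sketch below computes a per-position IPD ratio as the mean IPD on the modified template divided by the mean IPD on the control template. The use of a simple mean, the data layout and the toy values are assumptions made for illustration; they are not taken from the analysis software used in this study.

```python
from statistics import mean


def ipd_ratios(modified_ipds, control_ipds):
    """Return per-position IPD ratios (modified mean / control mean).

    Each argument maps a template position to the list of interpulse
    durations (in seconds) observed at that position across reads.
    Positions without data in either sample are skipped.
    """
    ratios = {}
    for pos, mod_values in modified_ipds.items():
        ctrl_values = control_ipds.get(pos)
        if not mod_values or not ctrl_values:
            continue
        ratios[pos] = mean(mod_values) / mean(ctrl_values)
    return ratios


# Toy data (invented): the polymerase is slowed at positions 0 and 2.
modified = {0: [0.5, 0.5], 1: [0.5, 0.5], 2: [0.75, 1.25]}
control = {0: [0.25, 0.25], 1: [0.5, 0.5], 2: [0.25, 0.25]}
ratios = ipd_ratios(modified, control)
print(ratios)  # {0: 2.0, 1: 1.0, 2: 4.0}
```

A ratio near 1.0 indicates no kinetic difference from the control, while values above 1.0 indicate that the polymerase slowed at that position.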
Although many base modifications, such as 6mA, 4mC, 5hmC and 8-oxo-guanine, are readily detectable in SMRT sequencing [16, 17, 20, 21], the kinetic signature of 5mC is more subtle, requiring high sequencing fold coverage to make out the small effect on polymerase speed. The methyl group is small and, unlike in 6mA and 4mC, it is oriented towards the major groove and is not involved in base pairing; in fact, a methyl group at this position has to be readily accepted by DNA polymerases because it is present on thymine, the other canonical pyrimidine base. We hypothesized that conversion of 5mC into a larger group may increase the magnitude of the kinetic signature during SMRT sequencing, thus enhancing the ability to detect 5mC. The Tet family of proteins carries out conversion of 5mC to several other modified forms of cytosine including 5hmC, 5fC and 5caC [6, 7]. This strategy has been shown to be effective in the recently developed Tet-assisted bisulfite sequencing of 5hmC.
Here, we demonstrate that mouse Tet1 (mTet1) can be used to enhance direct detection of 5mC during SMRT sequencing. Using synthetic templates made from oligonucleotides containing 5mC, 5hmC, 5fC or 5caC modifications, we tested the kinetic signatures of each modification. We discovered that each of the moieties into which 5mC can be converted via Tet increased the magnitude of the kinetic signature, with 5caC having the largest effect. Next, we observed that oxidation of 5mC to 5caC on either synthetic templates or in vitro methylated DNA enhanced our ability to detect positions of 5mC. We then used our improved 5mC detection method for the genome-wide characterization of MTase activities in two different bacterial strains.
As the size of the chemical structure of the modification increases, the magnitude of the kinetic signature also increases. The IPD ratio peaks range from approximately two-fold for 5mC and approximately three-fold to higher than five-fold for 5fC and 5caC (Figure 1). For each modification type, an extended signature consisting of multiple IPD ratio peaks was observed, with the most prominent signals at positions 0, +2 and +6 relative to the polymerase movement, with 0 being the position of the modification in the template. In most instances investigated here, the +2 peak was the most pronounced. As previously observed [16, 17], the kinetic signatures for a given modification varied slightly depending on the surrounding sequence context. These differences in the pattern and magnitudes of the kinetic signatures for each of the four different modifications are a parameter that can be used to discriminate between different modifications on the same DNA template, although they are not used in the current implementation of the software. To further explore the effects of local sequence context on the kinetic signatures of 5mC and 5caC, we used a synthetic SMRTbell template that contained a modified base in a 5'-CG-3' sequence context, surrounded by two random bases on each side. Additional file 1 shows a heat map of IPD ratios for the 256 possible sequence contexts at each position from -3 to +6 relative to the modified position in the template. As observed previously [17, 20], the magnitude and position of the kinetic signals for both 5mC and 5caC are dependent upon the surrounding sequence context. The conversion of 5mC to 5caC enhances the magnitude of the IPD ratio at each position where ratios above 1.0 are observed for 5mC, that is, positions 0, +2, and +6, and brings out an additional detectable signal at the -2 position for some sequence contexts. 
Tet conversion enhances the kinetic signals relatively evenly across all sequence contexts, which is apparent from the good preservation of the overall sequence context profiles. We are currently investigating possible additional correlations that could exist between different base positions in a given context. This could aid in the development of more refined identification algorithms.
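The figure of 256 sequence contexts follows from the template design: two random bases on each side of the fixed 5'-CG-3' give 4^4 = 256 combinations. A minimal sketch that enumerates them is shown below; the function name and the `flank` parameter are hypothetical and serve only to make the arithmetic explicit.

```python
from itertools import product


def cg_contexts(flank=2):
    """Enumerate all sequence contexts with `flank` random bases on each
    side of a central CG dinucleotide (the modified C is at index `flank`)."""
    contexts = []
    for bases in product("ACGT", repeat=2 * flank):
        left = "".join(bases[:flank])
        right = "".join(bases[flank:])
        contexts.append(left + "CG" + right)
    return contexts


contexts = cg_contexts()
print(len(contexts))              # 256
print(contexts[0], contexts[-1])  # AACGAA TTCGTT
```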
Because 5caC has the largest kinetic signature, conversion of 5mC to 5caC should significantly improve the ability to detect 5mC in SMRT sequencing. The Tet family of proteins has been shown to convert 5mC to 5caC in mammalian genomes [6, 7]. The conversion efficiency can exceed 97% for sequencing purposes, and the conversion does not exhibit significant sequence context bias. We tested the ability of Tet1-mediated oxidation of 5mC to 5caC to enhance direct detection on in vitro methylated DNA templates, described in detail in Methods. Briefly, we first generated an approximately 6-kb plasmid by inserting a lambda DNA fragment into the pCRBlunt vector and subjected it to whole genome amplification (WGA) to erase all modifications. We then generated an approximately 500 bp randomly sheared shotgun SMRTbell template library from the WGA material, followed by in vitro methylation using the HpaII MTase, which modifies the internal cytosine in 5'-CCGG-3' sequence contexts. Considering both the forward and reverse DNA strands, the plasmid sequence contains 70 instances of the 5'-CCGG-3' sequence motif. Methylated positions within the SMRTbell templates were converted to 5caC by treatment with the Tet1 enzyme. In vitro methylated (5mC), Tet1 converted (5caC) and WGA control (no modification) libraries were then subjected to SMRT sequencing.
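Counting motif instances on both strands, as done above for 5'-CCGG-3', can be sketched as follows. The sequence is a toy example, not the plasmid used here, and the sketch treats the sequence as linear (an assumption; a circular plasmid would also need matches spanning the origin).

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_motif(seq, motif):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def count_both_strands(seq, motif):
    """Count motif instances on the forward strand plus the reverse strand."""
    forward = len(find_motif(seq, motif))
    reverse = len(find_motif(reverse_complement(seq), motif))
    return forward + reverse


toy = "AACCGGTTCCGGA"  # invented sequence with two CCGG sites
print(count_both_strands(toy, "CCGG"))  # 4
```

Because 5'-CCGG-3' is its own reverse complement, every site contributes one instance on each strand, so the two-strand count is twice the number of sites.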
Most bacterial and archaeal genomes contain DNA MTases. Many of these MTases are paired with restriction endonucleases as part of a restriction-modification system that protects the organism from foreign DNA. These MTases typically methylate a specific sequence context, which blocks the activity of the restriction enzyme that recognizes the same site. The three most common types of methylation found in bacteria and archaea are 6mA, 4mC and 5mC. To test the ability of the mTet1-enhanced signal to detect 5mC in genomic DNA, we selected two bacterial strains that are known to express a 5mC MTase.
Escherichia coli K12 MG1655 is a well-studied, common laboratory strain that is known to express three different MTases. EcoKdam is a 6mA MTase that modifies the adenosine in a 5'-GATC-3' sequence context. EcoKI is a type I MTase that modifies the sequence context 5'-GCAC(N6)GTT-3' and its reverse complement 5'-AAC(N6)GTGC-3'. The 5mC MTase is EcoKdcm, which modifies the internal cytosine in a 5'-CCWGG-3' sequence context, where W is either an A or a T. We made SMRTbell templates from randomly sheared E. coli K12 MG1655 genomic DNA, a portion of which was sequenced in its native form and another portion of which was subjected to the mTet1 treatment. Both samples were sequenced to approximately 150× coverage per DNA strand.
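Locating candidate methylation sites for a degenerate motif such as 5'-CCWGG-3' amounts to expanding the IUPAC code into a pattern and scanning the sequence. The sketch below does this with a regular expression; the partial IUPAC table, the function names and the toy sequence are illustrative assumptions, and only the forward strand is scanned.

```python
import re

# Partial IUPAC nucleotide code table (enough for the motifs discussed here).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Compile an IUPAC motif; the lookahead allows overlapping matches."""
    pattern = "".join(IUPAC[base] for base in motif)
    return re.compile("(?=(" + pattern + "))")


def modified_positions(seq, motif, offset):
    """Return 0-based positions of the modified base on the forward strand.

    `offset` is the index of the modified base within the motif, for
    example 1 for the internal cytosine of CCWGG or the adenine of GATC.
    """
    return [m.start() + offset for m in motif_to_regex(motif).finditer(seq)]


seq = "TTCCAGGATCCTGGGATC"  # invented sequence
print(modified_positions(seq, "CCWGG", 1))  # [3, 10]
print(modified_positions(seq, "GATC", 1))   # [7, 15]
```

To cover both strands, the same scan can be repeated on the reverse complement of the sequence.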
Table 1: Detection of 5mC in native versus mTet1-enhanced SMRT sequencing for the bacterial genomes (columns: number in genome; number detected (%); number unassigned (%)a; table body not recovered).
In SMRT sequencing, modified bases in the DNA template are identified by the transient slowing of the DNA polymerase at and around the site of the modification. We previously demonstrated the detection of 5mC and 5hmC through such kinetic analysis. Here, we extend the spectrum of detectable base modifications to the full complement of currently known modified forms of cytosine. Both 5fC and 5caC showed an increased interference with polymerase movement compared with 5mC, resulting in stronger kinetic signals in SMRT sequencing. In addition to the increased size of the modification, the higher polarity of the formyl and carboxyl groups could also contribute to the increased signal levels.
In this work, we describe improving the direct SMRT sequencing of 5mC via mTet1-mediated oxidation to 5caC, thereby reducing the relatively high sequencing coverage required to detect the subtle signals imparted by 5mC with high confidence. mTet1 efficiently converted 5mC to 5caC in synthetic oligonucleotides, in vitro methylated plasmids, bacterial genomic DNA and mammalian genomic DNA, facilitating identification of microbial 5mC MTase specificities and thus complementing the other two common, readily detectable bacterial methylation marks, 6mA and 4mC, described previously [16, 17]. The protocol is rapid and specific to 5mC, allowing all three base modifications to be simultaneously detected in a single sequencing experiment. We anticipate that, for the sequencing of bacterial and archaeal genomes, such comprehensive characterization of the methylome, in addition to de novo assembly of the genome [25, 26], will improve our understanding of important microbiological phenomena, such as adaptation, pathogenicity and resistance evolution. It has been demonstrated through bulk biochemical and genetic studies that the dynamics of methylation in bacteria play critical roles in basic cellular functions as well as directly affecting virulence [11, 12, 27].
The kinetic signatures of 4mC and 5caC are sufficiently different to allow for discrimination of the two types of cytosine modifications in bacteria. When sequencing through 4mC, the polymerase slows down only when incorporating the cognate nucleotide opposite the modification, with no significant secondary IPD ratio peaks. By contrast, the primary IPD ratio peak for 5caC is located two bases after the modification (+2 position). The combination of observing the sequence identity and the specific kinetic signature makes it possible not only to discover the presence of a base modification but also to determine the chemical identity of the modification. We are working on algorithmically harnessing the information contained in the kinetic signatures to expand the power of direct detection of modified bases unique to SMRT sequencing. Algorithms that incorporate IPD data from multiple positions across the entire footprint of the polymerase may further enhance the ability to detect and discriminate between modification types. This multi-site analysis and a further understanding of the sequence context dependence of the 5caC kinetic signature should improve detection of 5caC, potentially reducing the sequencing coverage needed to detect converted 5mC positions even further.
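The distinction drawn above (a single peak at position 0 for 4mC versus a primary peak at +2 for 5caC) suggests a simple rule-based discrimination, sketched below. This is not the algorithm used or under development by the authors; the function name, the labels and the `min_ratio` cutoff are hypothetical values chosen only to illustrate the idea.

```python
def classify_cytosine_modification(ipd_ratio_by_offset, min_ratio=2.0):
    """Toy classifier based on where the IPD ratio peaks.

    `ipd_ratio_by_offset` maps an offset relative to the cytosine
    (0 = the modified position) to the observed IPD ratio.
    `min_ratio` is an arbitrary, illustrative cutoff for an elevated ratio.
    """
    elevated = {o: r for o, r in ipd_ratio_by_offset.items() if r >= min_ratio}
    if not elevated:
        return "unmodified or undetected"
    primary = max(elevated, key=elevated.get)
    if primary == 0 and len(elevated) == 1:
        return "4mC-like"   # single peak at the modified position
    if primary == 2:
        return "5caC-like"  # strongest peak two bases after the modification
    return "ambiguous"


print(classify_cytosine_modification({0: 3.5, 2: 1.1, 6: 1.0}))  # 4mC-like
print(classify_cytosine_modification({0: 2.5, 2: 5.0, 6: 2.2}))  # 5caC-like
```

A practical implementation would need to account for sequence context dependence and coverage, which this sketch ignores.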
In higher eukaryotes, the epigenome is much more complex, as at least four different forms of cytosine can occur and dynamically interconvert at epigenetically regulated genomic positions. Emerging evidence suggests that the Tet proteins and the modified cytosines they generate are crucial for a growing list of biological processes, including zygotic epigenetic reprogramming, pluripotent stem cell differentiation, hematopoiesis and development of leukemia. Thus, methods for comprehensive genome-wide mapping of all cytosine modifications will be critical for epigenomic studies. Several methods have been described recently for discriminating between 5mC and 5hmC using bisulfite sequencing in combination with chemical or enzymatic conversion [14, 15]. Because the kinetic signatures of 5mC, 5hmC, 5fC and 5caC differ for a given sequence context in SMRT sequencing, there is the potential for direct identification of the various modifications on native DNA samples. We are working to expand the bioinformatics analysis algorithms towards discrimination of different epigenetic marks, taking into account the different signatures as a function of sequence context, as well as partial modification and mixtures of modification types. Strategies already exist for enhancing the kinetic signatures of two cytosine modifications, allowing for direct detection of 5mC and 5hmC in a single sample using SMRT sequencing. 5hmC positions can first be glucosylated, followed by Tet1-mediated oxidation of 5mC to 5caC. Glucosylated 5hmC will be protected from conversion, and the two forms can be discriminated based on their differing kinetic signatures. We expect that these and further advances in the direct detection of modified bases during routine genome sequencing will become an important tool to further our understanding of genome and epigenome function.
Custom oligonucleotides containing modified bases were synthesized on-site or purchased from Trilink BioTechnologies (San Diego, CA, USA) and Integrated DNA Technologies (Coralville, IA, USA). All oligonucleotides contained 5' phosphate groups. The plasmid (pCRBlunt) was obtained from Life Technologies (Carlsbad, CA, USA). A list of the sequences can be found in Additional file 6.
Bacterial strains and/or genomic DNA from bacterial strains were purchased from the American Type Culture Collection (Manassas, VA, USA). The following strains were used in this study: E. coli K12 MG1655, and B. halodurans C-125 (JCM 9153).
Synthetic SMRTbell templates were made as previously described by ligating several synthetic oligonucleotides. For plasmid and genomic DNA samples, an aliquot of approximately 25 ng of DNA was subjected to WGA using the REPLI-g Midi Kit (Qiagen, Valencia, CA, USA). WGA and native DNA were sheared to an average size of approximately 500 bp via adaptive focused acoustics (Covaris, Woburn, MA, USA). SMRTbell template sequencing libraries were prepared as previously described [16, 29]. SMRTbell libraries made from whole-genome-amplified pCRBlunt-6K plasmid were in vitro methylated using the HpaII MTase (recognition sequence: 5'-C5mCGG-3'; New England BioLabs, Ipswich, MA, USA) as per the manufacturer's instructions. Complete methylation was assessed by modifying lambda DNA in parallel and subjecting it to methylation-sensitive restriction using the HpaII restriction enzyme (New England BioLabs).
The 5mC modifications in SMRTbell template libraries were converted to 5caC using the 5mC mTet1 Oxidation Kit from Wisegene (Chicago, IL, USA) as per the manufacturer's instructions. Approximately 500 ng of SMRTbell templates were treated with the Tet1 enzyme at 37°C for 60 minutes followed by proteinase K treatment at 50°C for 60 minutes. Converted SMRTbell templates were purified using Micro Bio-Spin 30 Columns (BioRad, Hercules, CA, USA) with additional purification and concentration using MinElute PCR Purification Columns (Qiagen).
SMRTbell templates were subjected to standard SMRT sequencing, as described [18, 19]. Reads were processed and mapped to the respective reference sequences using the BLASR mapper and Pacific Biosciences' SMRT Analysis pipeline using the standard mapping protocol. IPDs were measured as previously described and processed as described for all pulses aligned to each position in the reference sequence.
For the bacterial methylome analysis, we used Pacific Biosciences' SMRTPortal analysis platform v. 1.3.1, which uses an in silico kinetic reference and a t-test based detection of modified base positions. The following GenBank reference sequences were used: U00096.2 for E. coli K-12 MG1655 and BA000004.3 for B. halodurans C-125. MTase target sequence motifs were identified by selecting the top 1,000 kinetic hits and subjecting a ±20 base window around the detected base to MEME-ChIP, and were compared to the predictions in REBASE. To estimate the enhancement of detection of methylated 5mC positions (Table 1), we first selected an orthogonal off-target motif of similar sequence content and calculated the kinetic score representing the 99th percentile of all genomic positions of that motif (5'-GGWCC-3' for E. coli (score threshold = 35.6); 5'-CCGG-3' for B. halodurans (score threshold = 30.4)). We then used this 1% false positive detection threshold to determine the number of genomic positions of the on-target methylation sites detected as methylated (Figures 3c and 4c; Table 1). IPD ratio plots were visualized using Circos.
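The thresholding step described above can be sketched as follows: take the 99th percentile of kinetic scores at an off-target motif as a 1% false positive cutoff, then count on-target sites scoring above it. The nearest-rank percentile definition, the function names and the toy scores are assumptions for illustration; the source does not state which percentile method or comparison (strict or non-strict) was used.

```python
import math


def percentile_threshold(scores, percentile=99.0):
    """Nearest-rank percentile of a non-empty list of kinetic scores."""
    if not scores:
        raise ValueError("scores must not be empty")
    ordered = sorted(scores)
    rank = math.ceil(percentile * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def detection_rate(on_target_scores, threshold):
    """Return (number detected, fraction detected) for scores above threshold."""
    detected = sum(1 for score in on_target_scores if score > threshold)
    return detected, detected / len(on_target_scores)


# Toy scores (invented): 100 off-target positions scoring 1..100.
off_target = list(range(1, 101))
threshold = percentile_threshold(off_target)
on_target = [20, 99, 120, 150, 300]
detected, rate = detection_rate(on_target, threshold)
print(threshold, detected, rate)  # 99 3 0.6
```

With this cutoff, only 1 of the 100 off-target positions exceeds the threshold, which corresponds to the 1% false positive rate described in the text.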
The following additional data are available with the online version of the paper. Additional file 1 is a figure that demonstrates the sequence context dependence of the kinetic signatures for 5mC and 5caC. Additional file 2 is a figure that shows IPD ratio data for synthetic SMRTbell templates before and after conversion of 5mC to 5caC. Additional file 3 is a figure with IPD ratio distributions for all methylated sequence motifs in E. coli and B. halodurans. Additional files 4 and 5 are tables that contain detection rate information for all methylated motifs in E. coli and B. halodurans, respectively. Additional file 6 is a table of oligonucleotide sequences used in this study.
WGA: whole genome amplification.
We thank S. Kamtekar, K. Spittle, J. Londry, and P. Marks for helpful discussions, assistance with sample preparation and data analysis, and Min Zhou for generously providing mTet1 conversion kits.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.