CS-E5875 High-Throughput Bioinformatics: DNA methylation analysis

Harri Lähdesmäki

Department of Computer Science, Aalto University

November 13, 2020

Contents

- DNA methylation
- Bisulfite sequencing (BS-seq) protocol
- Alignment and quantification of BS-seq data
- Statistical analysis of BS-seq data

DNA methylation

- Epigenetic changes are reversible modifications on DNA, or "on top of DNA", which do not change the DNA sequence itself
- DNA methylation is an epigenetic modification in which a methyl group is added to the 5 position of a cytosine in DNA
  - The methyl group is added enzymatically by DNA methyltransferases (DNMTs)
  - By far the most extensively studied epigenetic modification on DNA

Figure from http://www.ks.uiuc.edu/Research/methylation/

DNA methylation

- In mammalian genomes, DNA methylation primarily occurs in the context of CpG dinucleotides
- Non-CpG methylation is found e.g. in stem cells and brain
- CpGs occur at a lower frequency than expected
  - Human genome GC content is 42%
  - CpGs are expected to occur 4.41% of the time (0.21 × 0.21, if C and G are equally frequent and neighbouring bases are independent)
  - The observed frequency of CpG dinucleotides is 1%
  - Methylated CpGs are prone to spontaneous deamination to thymines

Figure from (Schübeler, 2009)
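The expected-versus-observed CpG argument can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names are hypothetical, and the calculation assumes that C and G each make up half of the GC content and that neighbouring bases are independent.

```python
def expected_cpg_frequency(gc_content: float) -> float:
    """Expected CpG dinucleotide frequency under base independence.

    Assumption: C and G are equally frequent, so P(C) = P(G) = GC / 2.
    """
    p_c = gc_content / 2.0
    p_g = gc_content / 2.0
    return p_c * p_g


def observed_cpg_frequency(sequence: str) -> float:
    """Fraction of overlapping dinucleotides in `sequence` that are CG."""
    seq = sequence.upper()
    n_dinucleotides = len(seq) - 1
    if n_dinucleotides <= 0:
        return 0.0
    n_cpg = sum(1 for i in range(n_dinucleotides) if seq[i:i + 2] == "CG")
    return n_cpg / n_dinucleotides


# Numbers from the slide: 42% GC content, about 1% observed CpG frequency.
expected = expected_cpg_frequency(0.42)   # 0.21 * 0.21, i.e. about 0.0441
observed_genome_wide = 0.01
obs_exp_ratio = observed_genome_wide / expected  # about 0.23

print(f"expected CpG frequency: {expected:.4f}")
print(f"observed/expected ratio: {obs_exp_ratio:.2f}")
```

With the slide's figures, CpGs appear at roughly a quarter of the expected rate, which is consistent with the depletion caused by deamination of methylated cytosines.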

DNA methylation

- Two general classes of enzymatic methylation activities
  - De novo methylation
  - Maintenance methylation

Figure from http://2014.igem.org/Team:Heidelberg/Project/PCR_2.0

DNA methylation in gene regulation and various traits

- CpG islands (C+G-dense regions ≳500 bp long) are present in the 5' regulatory regions of many genes
- Hypermethylation (= overmethylation) of CpG islands near gene promoters contributes to transcriptional silencing by
  - Affecting the binding of transcription factors (DNA-binding proteins that regulate gene transcription)
  - Binding proteins with methyl-CpG-binding domains (MBDs), and recruiting e.g. histone deacetylases and other chromatin remodellers
- DNA methylation differences are associated with many diseases
- DNA methylation is also known to be associated with e.g. the age of an individual and smoking

DNA methylation

Figure from (Spruijt & Vermeulen, 2014)

DNA demethylation

- Until recently, it was believed that methylated DNA could become unmethylated only by dilution during cell differentiation/DNA replication
- Recently, TET family proteins were shown to be dioxygenases that convert 5mC to 5hmC, 5fC and 5caC, which can be further converted back to unmethylated C
- TETs thus contribute to active demethylation, but 5hmC, 5fC and 5caC can also have multiple functions

DNA demethylation

Figure 1 | Mechanisms of TET-mediated demethylation (Nature Reviews Molecular Cell Biology, volume 14, June 2013, p. 342).

- a | Known and putative pathways of DNA demethylation that involve oxidized methylcytosine intermediates. DNMTs methylate cytosine to 5mC; TET proteins sequentially oxidize 5mC to 5hmC, 5fC and 5caC; 5fC and 5caC can be removed by TDG and replaced by cytosine via BER. Other proposed routes (an unknown decarboxylase acting on 5caC, DNMT enzymes, deamination by AID or APOBEC followed by excision by TDG or SMUG1) are marked with question marks in the figure.
- b | Biosynthesis of base J (β-glucosylhydroxymethyluracil) from thymine via 5hmU, by J-binding proteins (JBPs) and an unknown β-glucosyltransferase, in Trypanosoma brucei and other kinetoplastid protozoa.
- c | Mechanism by which 5hmC could facilitate replication-dependent DNA demethylation. Hemimethylated CpG sites produced by replication are normally restored by UHRF1 and DNMT1 (maintenance methylation); 5hmC may impair maintenance methylation, so that the CpG progressively loses methylation over successive replication cycles.

BER := base excision repair
TDG := thymine DNA glycosylase
AID := activation-induced deaminase
APOBEC := apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like

The slide also reproduces the first pages of two papers:

- Tahiliani et al., "Conversion of 5-Methylcytosine to 5-Hydroxymethylcytosine in Mammalian DNA by MLL Partner TET1", Science 324, 930 (15 May 2009)
- Ito et al., "Tet Proteins Can Convert 5-Methylcytosine to 5-Formylcytosine and 5-Carboxylcytosine", Science 333, 1300 (2 September 2011)
Bisulfite sequencing (BS-seq) protocol

- Bisulfite treatment of genomic DNA converts unmethylated cytosines to uracils, which are read as thymines during sequencing
- Methylated (and hydroxymethylated) cytosines are resistant to the conversion and are read as cytosines
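The conversion rule can be illustrated with a small in-silico simulation. This is only a sketch under simplifying assumptions: conversion is taken to be 100% efficient, only one strand is considered, and sequencing errors are ignored. The function names and the toy sequence are hypothetical and do not come from any BS-seq tool.

```python
from typing import Set


def bisulfite_convert(sequence: str, protected_positions: Set[int]) -> str:
    """Simulate how a bisulfite-treated strand is read by the sequencer.

    Unmethylated C is converted to U and read as T. Cytosines at
    `protected_positions` (0-based; methylated or hydroxymethylated)
    resist conversion and are still read as C.
    """
    read = []
    for i, base in enumerate(sequence.upper()):
        if base == "C" and i not in protected_positions:
            read.append("T")
        else:
            read.append(base)
    return "".join(read)


def methylation_level(n_c_reads: int, n_t_reads: int) -> float:
    """Fraction of reads showing C at a reference cytosine."""
    total = n_c_reads + n_t_reads
    if total == 0:
        raise ValueError("no reads cover this cytosine")
    return n_c_reads / total


# Toy example: only the cytosine at position 1 is methylated.
print(bisulfite_convert("ACGTCCGA", {1}))  # prints ACGTTTGA
print(methylation_level(3, 1))             # prints 0.75
```

At a reference cytosine, a C in the read then indicates a protected (methylated or hydroxymethylated) base and a T indicates an unmethylated one, which is why the proportion of C reads serves as a simple estimate of the methylation level.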

3. W. Kim, S. Kook, D. J. Kim, C. Teodorof, W. K. Song,J. Biol. Chem. 279, 8333 (2004).

4. V. Giambra et al., Mol. Cell. Biol. 28, 6123 (2008).5. F. E. Garrett et al., Mol. Cell. Biol. 25, 1511

(2005).6. W. A. Dunnick et al., J. Exp. Med. 206, 2613

(2009).7. M. Cogné et al., Cell 77, 737 (1994).8. J. P. Manis et al., J. Exp. Med. 188, 1421 (1998).9. A. G. Bébin et al., J. Immunol. 184, 3710 (2010).

10. E. Pinaud et al., Immunity 15, 187 (2001).11. C. Vincent-Fabert et al., Blood 116, 1895 (2010).12. R. Wuerffel et al., Immunity 27, 711 (2007).13. Z. Ju et al., J. Biol. Chem. 282, 35169 (2007).14. H. Duan, H. Xiang, L. Ma, L. M. Boxer, Oncogene 27,

6720 (2008).15. M. Gostissa et al., Nature 462, 803 (2009).16. C. Chauveau, M. Cogné, Nat. Genet. 14, 15 (1996).

17. C. Chauveau, E. Pinaud, M. Cogne, Eur. J. Immunol. 28,3048 (1998).

18. M. A. Sepulveda, F. E. Garrett, A. Price-Whelan,B. K. Birshtein, Mol. Immunol. 42, 605 (2005).

19. E. Pinaud, C. Aupetit, C. Chauveau, M. Cogné,Eur. J. Immunol. 27, 2981 (1997).

20. A. A. Khamlichi et al., Blood 103, 3828 (2004).21. R. Shinkura et al., Nat. Immunol. 4, 435 (2003).22. A. Yamane et al., Nat. Immunol. 12, 62 (2011).23. M. Liu et al., Nature 451, 841 (2008).24. J. Stavnezer, J. E. Guikema, C. E. Schrader, Annu. Rev.

Immunol. 26, 261 (2008).25. S. Duchez et al., Proc. Natl. Acad. Sci. U.S.A. 107, 3064

(2010).26. T. K. Kim et al., Nature 465, 182 (2010).

Acknowledgments: We thank T. Honjo for providingAID−/− mice and F. Lechouane for sorted B cells DNA samples.

We are indebted to the cell sorting facility of LimogesUniversity for excellent technical assistance in cell sorting. Thiswork was supported by grants from Association pour laRecherche sur le Cancer, Ligue Nationale contre le Cancer,Cancéropôle Grand Sud-Ouest, Institut National du Cancer,and Région Limousin. The data presented in this paper aretabulated here and in the supplementary materials.

Supplementary Materialswww.sciencemag.org/cgi/content/full/science.1218692/DC1Materials and MethodsFigs. S1 to S4Tables S1 and S2References (27–30)

4 January 2012; accepted 27 March 2012Published online 26 April 2012;10.1126/science.1218692

Quantitative Sequencing of5-Methylcytosine and5-Hydroxymethylcytosine atSingle-Base ResolutionMichael J. Booth,1* Miguel R. Branco,2,3* Gabriella Ficz,2 David Oxley,4 Felix Krueger,5

Wolf Reik,2,3† Shankar Balasubramanian1,6,7†

5-Methylcytosine can be converted to 5-hydroxymethylcytosine (5hmC) in mammalian DNA by theten-eleven translocation (TET) enzymes. We introduce oxidative bisulfite sequencing (oxBS-Seq),the first method for quantitative mapping of 5hmC in genomic DNA at single-nucleotide resolution.Selective chemical oxidation of 5hmC to 5-formylcytosine (5fC) enables bisulfite conversion of5fC to uracil. We demonstrate the utility of oxBS-Seq to map and quantify 5hmC at CpG islands(CGIs) in mouse embryonic stem (ES) cells and identify 800 5hmC-containing CGIs that haveon average 3.3% hydroxymethylation. High levels of 5hmC were found in CGIs associated withtranscriptional regulators and in long interspersed nuclear elements, suggesting that theseregions might undergo epigenetic reprogramming in ES cells. Our results open new questionson 5hmC dynamics and sequence-specific targeting by TETs.

5-Methylcytosine (5mC) is an epigenetic DNAmark that plays important roles in genesilencing and genome stability and is found

enriched at CpG dinucleotides (1). In metazoa,5mC can be oxidized to 5-hydroxymethylcytosine(5hmC) by the ten-eleven translocation (TET) en-zyme family (2, 3). 5hmCmay be an intermediatein active DNA demethylation but could also con-stitute an epigenetic mark per se (4). Levels of5hmC in genomic DNA can be quantified withanalytical methods (2, 5, 6) and mapped throughthe enrichment of 5hmC-containing DNA frag-

ments that are then sequenced (7–13). Such ap-proaches have relatively poor resolution and giveonly relative quantitative information. Single-nucleotide sequencing of 5mC has been per-formed by using bisulfite sequencing (BS-Seq),but this method cannot discriminate 5mC from5hmC (14, 15). Single-molecule real-time se-quencing (SMRT) can detect derivatized 5hmCin genomic DNA (16). However, enrichment of5hmC-containing DNA fragments is required,which causes loss of quantitative information(16). Furthermore, SMRT has a relatively highrate of sequencing errors (17), and the peak call-ing of modifications is imprecise (16). Proteinand solid-state nanopores can resolve 5mC from5hmC and have the potential to sequence unam-plified DNA (18, 19).

We observed the decarbonylation and deami-nation of 5-formylcytosine (5fC) to uracil (U)under bisulfite conditions that would leave 5mCunchanged (Fig. 1A and supplementary text).Thus, 5hmC sequencing would be possible if5hmC could be selectively oxidized to 5fC andthen converted to U in a two-step procedure (Fig.

1B). Whereas BS-Seq leads to both 5mC and5hmC being detected as Cs, this “oxidativebisulfite” sequencing (oxBS-Seq) approach wouldyield Cs only at 5mC sites and therefore allowus to determine the amount of 5hmC at a partic-ular nucleotide position by subtraction of thisreadout from a BS-Seq one (Fig. 1C).

Specific oxidation of 5hmC to 5fC (table S1) was achieved with potassium perruthenate (KRuO4). In our reactivity studies on a synthetic 15-nucleotide oligomer single-stranded DNA (ssDNA) containing 5hmC, we established conditions under which KRuO4 reacted specifically with the primary alcohol of 5hmC (Fig. 2A). Fifteen-nucleotide oligomer ssDNA that contained C or 5mC did not show any base-specific reactions with KRuO4 (fig. S1, A and B). For 5hmC in DNA, we only observed the aldehyde (5fC) and not the carboxylic acid (20), even with a moderate excess of oxidant. The KRuO4 oxidation can oxidize 5hmC in samples presented as double-stranded DNA (dsDNA), with an initial denaturing step before addition of the oxidant; this results in a quantitative conversion of 5hmC to 5fC (Fig. 2B).

To test the efficiency and selectivity of the oxidative bisulfite method, three synthetic dsDNAs containing either C, 5mC, or 5hmC were each oxidized with KRuO4 and then subjected to a conventional bisulfite conversion protocol. Sanger sequencing revealed that 5mC residues did not convert to U, whereas both C and 5hmC residues did convert to U (fig. S2). Because Sanger sequencing is not quantitative, to gain a more accurate measure of the efficiency of transforming 5hmC to U, Illumina (San Diego, California) sequencing was carried out on the synthetic DNA containing 5hmC (122-nucleotide oligomer) after oxidative bisulfite treatment. An overall 5hmC-to-U conversion level of 94.5% was observed (Fig. 2C and fig. S14). The oxidative bisulfite protocol was also applied to a synthetic dsDNA that contained multiple 5hmC residues (135-nucleotide oligomer) in a range of different contexts that showed a similarly high conversion efficiency (94.7%) of 5hmC to U (Fig. 2C and fig. S14). Last, the KRuO4 oxidation was carried out on genomic DNA and showed through mass spectrometry a quantitative conversion of 5hmC to

1Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK. 2Epigenetics Programme, Babraham Institute, Cambridge CB22 3AT, UK. 3Centre for Trophoblast Research, University of Cambridge, Cambridge CB2 3EG, UK. 4Proteomics Research Group, Babraham Institute, Cambridge CB22 3AT, UK. 5Bioinformatics Group, Babraham Institute, Cambridge CB22 3AT, UK. 6School of Clinical Medicine, University of Cambridge, Cambridge CB2 0SP, UK. 7Cancer Research UK, Cambridge Research Institute, Li Ka Shing Centre, Cambridge CB2 0RE, UK.

*These authors contributed equally to this work. †To whom correspondence should be addressed. E-mail: [email protected] (W.R.); [email protected] (S.B.)

18 MAY 2012 VOL 336 SCIENCE www.sciencemag.org 934

REPORTS


Quantitative Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine at Single-Base Resolution

Michael J. Booth,1* Miguel R. Branco,2,3* Gabriella Ficz,2 David Oxley,4 Felix Krueger,5 Wolf Reik,2,3† Shankar Balasubramanian1,6,7†


Bisulphite sequencing (BS-seq)

Oxidative bisulphite sequencing (oxBS-seq)

(He et al., 2011; Ito et al., 2011; Tahiliani et al., 2009). These oxidized methylcytosines (oxi-mC) have been proposed to play a role in active DNA demethylation through 5mC oxidation and DNA repair, and in chromatin regulation (Pastor et al., 2013). 5mC and all the oxi-mC species are of great interest due to the alleged role of DNA methylation in diseases, such as different cancers (Baylin, 2005), Alzheimer (De Jager et al., 2014), asthma (Rastogi et al., 2013), autism (Nardone et al., 2014) and type 2 diabetes (Dayeh et al., 2014). However, studies of primary human clinical samples are complicated by many factors; for instance, greater biological variation compared with more controlled molecular biology studies, possible confounding factors and case-control matching.

Bisulphite sequencing (BS-seq) has become the gold standard technique for profiling methylation at single nucleotide resolution (Lister et al., 2009, 2013; Rein et al., 1998). In BS-seq, genomic DNA is treated with sodium bisulphite, which will rapidly deaminate unmodified cytosine (and 5fC and 5caC) to uracil, while deamination of 5mC and 5hmC is much slower (Frommer et al., 1992). Next, after PCR amplification, uracil and cytosine are read as thymine and cytosine, respectively. Importantly, 5fC and 5caC will have the same read-out as unmodified cytosine and, similarly, 5hmC and 5mC share the same read-out in BS-seq (Huang et al., 2010). This observation drove the development of various modified bisulphite sequencing protocols (reviewed in Plongthongkum et al., 2014). For instance, oxidative bisulphite sequencing (oxBS-seq) (Booth et al., 2012) and Tet-assisted bisulphite sequencing (TAB-seq) (Yu et al., 2012) were developed for distinguishing 5hmC from 5mC. Both methods, oxBS-seq and TAB-seq, are based on oxidation; 5hmC is oxidised into 5fC by KRuO4 in oxBS-seq, whereas in TAB-seq 5mC is oxidised into 5caC by recombinant mouse Tet1. To gain information on 5fC, 5fC chemical modification-assisted bisulphite sequencing (fCAB-seq) (Lu et al., 2013) and reduced bisulphite sequencing (redBS-seq) (Booth et al., 2014) have been proposed. Chemical modification-assisted bisulphite sequencing (CAB-seq) together with BS-seq allows the quantification of 5caC by protecting 5caC from deamination by sodium bisulphite with 1-ethyl-3-[3-dimethylaminopropyl]-carbodiimide hydrochloride (Lu et al., 2013). CpG methyltransferase (M.SssI) assisted bisulphite sequencing (MAB-seq) when combined with BS-seq distinguishes 5fC/5caC from C (Wu et al., 2014). A summary of the read-outs of the described bisulphite sequencing approaches is listed in Figure 1A.
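The read-outs described in this paragraph can be written down as lookup tables, which makes it easy to see which cytosine states a given pair of experiments can separate. The sketch below covers only BS-seq and oxBS-seq and assumes ideal conversion; the table names and helper functions are hypothetical, and the oxBS-seq entry for 5caC is an assumption (taken to be the same as in BS-seq, since the oxidation step targets 5hmC).

```python
# Base read after conversion and PCR, assuming ideal chemistry.
BS_READOUT = {"C": "T", "5mC": "C", "5hmC": "C", "5fC": "T", "5caC": "T"}
# In oxBS-seq, 5hmC is first oxidised to 5fC and therefore reads as T.
# The 5caC entry is an assumption, not stated in the text above.
OXBS_READOUT = {"C": "T", "5mC": "C", "5hmC": "T", "5fC": "T", "5caC": "T"}


def readout(states, table):
    """Sequence read for a list of cytosine states under one protocol."""
    return "".join(table[s] for s in states)


def distinguishable(a: str, b: str) -> bool:
    """True if states a and b differ in at least one of the two read-outs."""
    return (BS_READOUT[a], OXBS_READOUT[a]) != (BS_READOUT[b], OXBS_READOUT[b])


sites = ["C", "5mC", "5hmC"]
print(readout(sites, BS_READOUT))    # BS-seq read-out of the three sites
print(readout(sites, OXBS_READOUT))  # oxBS-seq read-out of the same sites
print(distinguishable("5mC", "5hmC"), distinguishable("C", "5fC"))
```

Under these tables 5mC and 5hmC are separated by the BS/oxBS pair, while C, 5fC and 5caC remain indistinguishable, which is why the additional protocols (fCAB-seq, redBS-seq, CAB-seq, MAB-seq) are needed for those species.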

In order to estimate proportions of multiple methylation modifications, one has to deconvolute and integrate data from multiple bisulphite based measurements (Fig. 1A) which often have biases due to imperfect experimental steps (Plongthongkum et al., 2014). Many computational methods have been developed for analysing the standard bisulphite sequencing data (here we will describe only the most relevant methodologies, for a more comprehensive list of different methods see Äijö et al., 2016). Methods based on beta-binomial models have been proposed allowing modeling of sampling and biological variation. For instance, MOABS uses a hierarchical beta-binomial model with an empirical Bayesian approach (Sun et al., 2014). To assess differential methylation, MOABS uses a credible methylation difference metric for summarizing statistical and biological significance (Sun et al., 2014). Another method, RADMeth, takes into account covariates under the beta-binomial model using a generalised linear model approach with the logit link function (Dolzhenko and Smith, 2014). RADMeth detects differential methylation by using the log-likelihood ratio test and the evidence for differential methylation across neighbouring cytosines is shared using the Stouffer-Liptak weighted Z test. Recently, the MACAU method was proposed, which combines a binomial mixed model with a sampling-based inference algorithm to model various genetic relatedness/population structures (Lea et al., 2015). MACAU uses Wald test statistics on the posterior samples to call whether a covariate has an effect on methylation (Lea et al., 2015).
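To make the beta-binomial idea concrete, the following sketch evaluates the beta-binomial log-probability of observing k methylated reads out of n, and sums it over replicates. It is a generic textbook formulation and not the implementation of MOABS, RADMeth or MACAU; the mean/concentration parameterisation and all function names are assumptions made for illustration.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n: int, k: int) -> float:
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_logpmf(k: int, n: int, alpha: float, beta: float) -> float:
    """log P(k methylated reads out of n) when the methylation level ~ Beta(alpha, beta)."""
    return log_choose(n, k) + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def replicate_loglik(counts, mu: float, concentration: float) -> float:
    """Log-likelihood of replicate (methylated, total) counts.

    mu is the mean methylation level and concentration = alpha + beta;
    a smaller concentration allows more biological variation between replicates.
    """
    alpha, beta = mu * concentration, (1.0 - mu) * concentration
    return sum(beta_binomial_logpmf(k, n, alpha, beta) for k, n in counts)


# Hypothetical replicate counts at one cytosine.
counts = [(10, 20), (12, 25), (8, 18)]
print(replicate_loglik(counts, mu=0.5, concentration=10.0))
```

The binomial part captures sampling variation from finite read depth, while the beta prior on the methylation level captures variation between biological replicates.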

Fig. 1. (A) The conversion chart of C, 5mC, 5hmC, 5fC and 5caC in BS-seq, oxBS-seq, TAB-seq, CAB-seq, fCAB-seq, redBS-seq and MAB-seq experiments. (B) The experimental steps of BS- and oxBS-seq experiments are represented in terms of experimental parameters. Green and red arrows depict successful and unsuccessful steps, respectively. (C) The proposed hierarchical model for modeling methylation modification proportions for BS-seq and oxBS-seq data and parts of the original Lux model represented in the plate notation. The grey and white circles are used to represent observed variables and latent variables, respectively. The grey squares represent fixed hyperparameters. The components, which model the experimental parameters and control cytosines are the same as in the Lux model (Äijö et al., 2016).

i2 T. Äijö et al.

potassium perruthenate (KRuO4)

5fC (Fig. 2D), with no detectable degradation of C (fig. S1C). Thus, the oxidative bisulfite protocol specifically converts 5hmC to U in DNA, leaving C and 5mC unchanged, enabling quantitative, single-nucleotide-resolution sequencing on widely available platforms.

We then used oxBS-Seq to quantitatively map 5hmC at high resolution in the genomic DNA of mouse embryonic stem (ES) cells. We chose to combine oxidative bisulfite with reduced representation bisulfite sequencing (RRBS) (21), which allows deep, selective sequencing of a fraction of the genome that is highly enriched for CpG islands (CGIs). We generated RRBS and oxidative RRBS (oxRRBS) data sets, achieving an average sequencing depth of ~120 reads per CpG, which when pooled yielded an average of ~3300 methylation calls per CGI (fig. S3). After applying depth and breadth cutoffs (supplementary materials, materials and methods), 55% (12,660) of all CGIs (22) were covered in our data sets.

To identify 5hmC-containing CGIs, we tested for differences between the RRBS and oxRRBS data sets using stringent criteria, yielding a false discovery rate of 3.7% (supplementary materials, materials and methods). We identified 800 5hmC-containing CGIs, which had an average of 3.3% (range of 0.2 to 18.5%) CpG hydroxymethylation (Fig. 3, A and B). We also identified 4577 5mC-containing CGIs averaging 8.1% CpG methylation (Fig. 3B). We carried out sequencing on an independent biological duplicate sample of the same ES cell line but at a different passage

[Figure 1. Panel A: normalised concentration of d5fC and dU plotted against time (minutes, 0 to 400). Panel B: reaction scheme 5hmC to 5fC (oxidation), then 5fC to U (1: NaHSO3, 2: NaOH). Panel C: input DNA is processed either by BS and amplification, or by oxidation followed by BS and amplification, and the resulting sequences are compared. Read-outs from panel C:]

Base    Sequence    BS Sequence    oxBS Sequence
C       C           T              T
5mC     C           C              C
5hmC    C           C              T

Fig. 1. A method for single-base resolution sequencing of 5hmC. (A) Reaction of 2ʹ-deoxy-5-formylcytidine (d5fC) with NaHSO3 (bisulfite) quenched by NaOH at different time points and then analyzed with high-performance liquid chromatography (HPLC). Data are mean ± SD of three replicates. (B) Oxidative bisulfite reaction scheme: oxidation of 5hmC to 5fC followed by bisulfite treatment and NaOH to convert 5fC to U. The R group is DNA. (C) Diagram and table outlining the BS-Seq and oxBS-Seq techniques.

[Figure 2. Panel titles: (A) Single Stranded DNA Oxidation; (B) Double Stranded DNA Oxidation; (C) Single 5hmCpG and Multiple 5hmCpGs; (D) Genomic DNA Oxidation. Panels A, B and D plot 5hmC and 5fC levels (normalised concentration or normalised peak areas) for input and oxidised DNA. Panel C plots % C-T conversion for 5mC and 5hmC, with bar labels 94.5 and 2.1 (single 5hmCpG) and 94.7 and 2.1 (multiple 5hmCpGs).]

Fig. 2. Quantification of 5hmC oxidation. (A) Levels of 5hmC and 5fC (normalized to T) in a 15-nucleotide oligomer ssDNA oligonucleotide before and after KRuO4 oxidation, measured with mass spectrometry. (B) Levels of 5hmC and 5fC (normalized to 5mC) in a 135-nucleotide oligomer dsDNA fragment before and after KRuO4 oxidation. (C) C-to-T conversion levels as determined by means of Illumina sequencing of two dsDNA fragments containing either a single 5hmCpG (122-nucleotide oligomer) or multiple 5hmCpGs (135-nucleotide oligomer) after oxidative bisulfite treatment. 5mC was also present in these strands. (D) Levels of 5hmC and 5fC (normalized to 5mC in primer sequence) in ES cell DNA measured before and after oxidation. Data are mean ± SD.
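A measured conversion level below 100% means that a naive BS minus oxBS difference slightly underestimates 5hmC. The sketch below shows a first-order correction under a deliberately simple model that is an assumption of this example and not the method of the paper: BS-seq is treated as ideal, 5mC is never converted, and a fraction `efficiency` of 5hmC reads as T in oxBS-seq. The function name and the input levels are hypothetical; only the default value 0.945 is taken from the reported single-5hmCpG conversion level.

```python
def correct_for_oxidation_efficiency(bs_level: float, oxbs_level: float,
                                     efficiency: float = 0.945):
    """Return (5mC, 5hmC) estimates corrected for incomplete 5hmC conversion.

    Assumed model (illustrative only):
        bs_level   = m + h                     # BS-seq: 5mC and 5hmC read as C
        oxbs_level = m + h * (1 - efficiency)  # unconverted 5hmC still reads as C
    so that bs_level - oxbs_level = h * efficiency.
    """
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    hmc = (bs_level - oxbs_level) / efficiency
    mc = bs_level - hmc
    return mc, hmc


# Hypothetical per-site C fractions from the two libraries.
mc, hmc = correct_for_oxidation_efficiency(0.60, 0.45)
print(round(mc, 4), round(hmc, 4))
```

Hierarchical models that treat the experimental parameters as unknowns estimated from control cytosines, such as the Lux-type model referred to in Fig. 1 of the Äijö et al. excerpt, handle these imperfections jointly rather than with a fixed plug-in value.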


BS-seq, oxBS-seq, etc.


3. W. Kim, S. Kook, D. J. Kim, C. Teodorof, W. K. Song,J. Biol. Chem. 279, 8333 (2004).

4. V. Giambra et al., Mol. Cell. Biol. 28, 6123 (2008).5. F. E. Garrett et al., Mol. Cell. Biol. 25, 1511

(2005).6. W. A. Dunnick et al., J. Exp. Med. 206, 2613

(2009).7. M. Cogné et al., Cell 77, 737 (1994).8. J. P. Manis et al., J. Exp. Med. 188, 1421 (1998).9. A. G. Bébin et al., J. Immunol. 184, 3710 (2010).

10. E. Pinaud et al., Immunity 15, 187 (2001).11. C. Vincent-Fabert et al., Blood 116, 1895 (2010).12. R. Wuerffel et al., Immunity 27, 711 (2007).13. Z. Ju et al., J. Biol. Chem. 282, 35169 (2007).14. H. Duan, H. Xiang, L. Ma, L. M. Boxer, Oncogene 27,

6720 (2008).15. M. Gostissa et al., Nature 462, 803 (2009).16. C. Chauveau, M. Cogné, Nat. Genet. 14, 15 (1996).

17. C. Chauveau, E. Pinaud, M. Cogne, Eur. J. Immunol. 28,3048 (1998).

18. M. A. Sepulveda, F. E. Garrett, A. Price-Whelan,B. K. Birshtein, Mol. Immunol. 42, 605 (2005).

19. E. Pinaud, C. Aupetit, C. Chauveau, M. Cogné,Eur. J. Immunol. 27, 2981 (1997).

20. A. A. Khamlichi et al., Blood 103, 3828 (2004).21. R. Shinkura et al., Nat. Immunol. 4, 435 (2003).22. A. Yamane et al., Nat. Immunol. 12, 62 (2011).23. M. Liu et al., Nature 451, 841 (2008).24. J. Stavnezer, J. E. Guikema, C. E. Schrader, Annu. Rev.

Immunol. 26, 261 (2008).25. S. Duchez et al., Proc. Natl. Acad. Sci. U.S.A. 107, 3064

(2010).26. T. K. Kim et al., Nature 465, 182 (2010).

Acknowledgments: We thank T. Honjo for providingAID−/− mice and F. Lechouane for sorted B cells DNA samples.

We are indebted to the cell sorting facility of LimogesUniversity for excellent technical assistance in cell sorting. Thiswork was supported by grants from Association pour laRecherche sur le Cancer, Ligue Nationale contre le Cancer,Cancéropôle Grand Sud-Ouest, Institut National du Cancer,and Région Limousin. The data presented in this paper aretabulated here and in the supplementary materials.

Supplementary Materialswww.sciencemag.org/cgi/content/full/science.1218692/DC1Materials and MethodsFigs. S1 to S4Tables S1 and S2References (27–30)

4 January 2012; accepted 27 March 2012Published online 26 April 2012;10.1126/science.1218692

Quantitative Sequencing of5-Methylcytosine and5-Hydroxymethylcytosine atSingle-Base ResolutionMichael J. Booth,1* Miguel R. Branco,2,3* Gabriella Ficz,2 David Oxley,4 Felix Krueger,5

Wolf Reik,2,3† Shankar Balasubramanian1,6,7†

5-Methylcytosine can be converted to 5-hydroxymethylcytosine (5hmC) in mammalian DNA by theten-eleven translocation (TET) enzymes. We introduce oxidative bisulfite sequencing (oxBS-Seq),the first method for quantitative mapping of 5hmC in genomic DNA at single-nucleotide resolution.Selective chemical oxidation of 5hmC to 5-formylcytosine (5fC) enables bisulfite conversion of5fC to uracil. We demonstrate the utility of oxBS-Seq to map and quantify 5hmC at CpG islands(CGIs) in mouse embryonic stem (ES) cells and identify 800 5hmC-containing CGIs that haveon average 3.3% hydroxymethylation. High levels of 5hmC were found in CGIs associated withtranscriptional regulators and in long interspersed nuclear elements, suggesting that theseregions might undergo epigenetic reprogramming in ES cells. Our results open new questionson 5hmC dynamics and sequence-specific targeting by TETs.

5-Methylcytosine (5mC) is an epigenetic DNAmark that plays important roles in genesilencing and genome stability and is found

enriched at CpG dinucleotides (1). In metazoa,5mC can be oxidized to 5-hydroxymethylcytosine(5hmC) by the ten-eleven translocation (TET) en-zyme family (2, 3). 5hmCmay be an intermediatein active DNA demethylation but could also con-stitute an epigenetic mark per se (4). Levels of5hmC in genomic DNA can be quantified withanalytical methods (2, 5, 6) and mapped throughthe enrichment of 5hmC-containing DNA frag-

ments that are then sequenced (7–13). Such ap-proaches have relatively poor resolution and giveonly relative quantitative information. Single-nucleotide sequencing of 5mC has been per-formed by using bisulfite sequencing (BS-Seq),but this method cannot discriminate 5mC from5hmC (14, 15). Single-molecule real-time se-quencing (SMRT) can detect derivatized 5hmCin genomic DNA (16). However, enrichment of5hmC-containing DNA fragments is required,which causes loss of quantitative information(16). Furthermore, SMRT has a relatively highrate of sequencing errors (17), and the peak call-ing of modifications is imprecise (16). Proteinand solid-state nanopores can resolve 5mC from5hmC and have the potential to sequence unam-plified DNA (18, 19).

We observed the decarbonylation and deamination of 5-formylcytosine (5fC) to uracil (U) under bisulfite conditions that would leave 5mC unchanged (Fig. 1A and supplementary text). Thus, 5hmC sequencing would be possible if 5hmC could be selectively oxidized to 5fC and then converted to U in a two-step procedure (Fig. 1B). Whereas BS-Seq leads to both 5mC and 5hmC being detected as Cs, this “oxidative bisulfite” sequencing (oxBS-Seq) approach would yield Cs only at 5mC sites and therefore allow us to determine the amount of 5hmC at a particular nucleotide position by subtraction of this readout from a BS-Seq one (Fig. 1C).
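The subtraction described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `estimate_5hmc` and the read counts are hypothetical, conversion is treated as perfect, and sampling noise is ignored, so the difference can come out negative at low coverage.

```python
def estimate_5hmc(bs_c, bs_total, oxbs_c, oxbs_total):
    """Estimate 5mC and 5hmC levels at one cytosine by subtraction.

    bs_c / bs_total     : fraction of C readouts in BS-seq   (5mC + 5hmC)
    oxbs_c / oxbs_total : fraction of C readouts in oxBS-seq (5mC only)
    """
    if bs_total <= 0 or oxbs_total <= 0:
        raise ValueError("both experiments need at least one read")
    bs_level = bs_c / bs_total
    oxbs_level = oxbs_c / oxbs_total
    # 5hmC is what BS-seq reports as C but oxBS-seq does not
    return oxbs_level, bs_level - oxbs_level

# Hypothetical counts: 60 of 120 BS reads and 54 of 120 oxBS reads show C
mc, hmc = estimate_5hmc(bs_c=60, bs_total=120, oxbs_c=54, oxbs_total=120)
print(f"5mC = {mc:.3f}, 5hmC = {hmc:.3f}")
```

In practice the two proportions are noisy estimates, which is one reason model-based approaches are used instead of plain subtraction.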

Specific oxidation of 5hmC to 5fC (table S1) was achieved with potassium perruthenate (KRuO4). In our reactivity studies on a synthetic 15-nucleotide oligomer single-stranded DNA (ssDNA) containing 5hmC, we established conditions under which KRuO4 reacted specifically with the primary alcohol of 5hmC (Fig. 2A). Fifteen-nucleotide oligomer ssDNA that contained C or 5mC did not show any base-specific reactions with KRuO4 (fig. S1, A and B). For 5hmC in DNA, we only observed the aldehyde (5fC) and not the carboxylic acid (20), even with a moderate excess of oxidant. The KRuO4 oxidation can oxidize 5hmC in samples presented as double-stranded DNA (dsDNA), with an initial denaturing step before addition of the oxidant; this results in a quantitative conversion of 5hmC to 5fC (Fig. 2B).

To test the efficiency and selectivity of the oxidative bisulfite method, three synthetic dsDNAs containing either C, 5mC, or 5hmC were each oxidized with KRuO4 and then subjected to a conventional bisulfite conversion protocol. Sanger sequencing revealed that 5mC residues did not convert to U, whereas both C and 5hmC residues did convert to U (fig. S2). Because Sanger sequencing is not quantitative, to gain a more accurate measure of the efficiency of transforming 5hmC to U, Illumina (San Diego, California) sequencing was carried out on the synthetic DNA containing 5hmC (122-nucleotide oligomer) after oxidative bisulfite treatment. An overall 5hmC-to-U conversion level of 94.5% was observed (Fig. 2C and fig. S14). The oxidative bisulfite protocol was also applied to a synthetic dsDNA that contained multiple 5hmC residues (135-nucleotide oligomer) in a range of different contexts that showed a similarly high conversion efficiency (94.7%) of 5hmC to U (Fig. 2C and fig. S14). Last, the KRuO4 oxidation was carried out on genomic DNA and showed through mass spectrometry a quantitative conversion of 5hmC to

1 Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK. 2 Epigenetics Programme, Babraham Institute, Cambridge CB22 3AT, UK. 3 Centre for Trophoblast Research, University of Cambridge, Cambridge CB2 3EG, UK. 4 Proteomics Research Group, Babraham Institute, Cambridge CB22 3AT, UK. 5 Bioinformatics Group, Babraham Institute, Cambridge CB22 3AT, UK. 6 School of Clinical Medicine, University of Cambridge, Cambridge CB2 0SP, UK. 7 Cancer Research UK, Cambridge Research Institute, Li Ka Shing Centre, Cambridge CB2 0RE, UK.

*These authors contributed equally to this work.
†To whom correspondence should be addressed. E-mail: [email protected] (W.R.); [email protected] (S.B.)

18 MAY 2012 VOL 336 SCIENCE www.sciencemag.org 934

REPORTS

Bisulphite sequencing (BS-seq)

Oxidative bisulphite sequencing (oxBS-seq)

(He et al., 2011; Ito et al., 2011; Tahiliani et al., 2009). These oxidized methylcytosines (oxi-mC) have been proposed to play a role in active DNA demethylation through 5mC oxidation and DNA repair, and in chromatin regulation (Pastor et al., 2013). 5mC and all the oxi-mC species are of great interest due to the alleged role of DNA methylation in diseases, such as different cancers (Baylin, 2005), Alzheimer (De Jager et al., 2014), asthma (Rastogi et al., 2013), autism (Nardone et al., 2014) and type 2 diabetes (Dayeh et al., 2014). However, studies of primary human clinical samples are complicated by many factors; for instance, greater biological variation compared with more controlled molecular biology studies, possible confounding factors and case-control matching.

Bisulphite sequencing (BS-seq) has become the gold standard technique for profiling methylation at single nucleotide resolution (Lister et al., 2009, 2013; Rein et al., 1998). In BS-seq, genomic DNA is treated with sodium bisulphite, which will rapidly deaminate unmodified cytosine (and 5fC and 5caC) to uracil, while deamination of 5mC and 5hmC are much slower (Frommer et al., 1992). Next, after PCR amplification, uracil and cytosine are read as thymine and cytosine, respectively. Importantly, 5fC and 5caC will have the same read-out as unmodified cytosine and, similarly, 5hmC and 5mC share the same read-out in BS-seq (Huang et al., 2010). This observation drove the development of various modified bisulphite sequencing protocols (reviewed in Plongthongkum et al., 2014). For instance, oxidative bisulphite sequencing (oxBS-seq) (Booth et al., 2012) and Tet-assisted bisulphite sequencing (TAB-seq) (Yu et al., 2012) were developed for distinguishing 5hmC from 5mC. Both methods, oxBS-seq and TAB-seq, are based on oxidation; 5hmC is oxidised into 5fC by KRuO4 in oxBS-seq, whereas in TAB-seq 5mC is oxidised into 5caC by recombinant mouse Tet1. To gain information on 5fC, 5fC chemical modification-assisted bisulphite sequencing (fCAB-seq) (Lu et al., 2013) and reduced bisulphite sequencing (redBS-seq) (Booth et al., 2014) have been proposed. Chemical modification-assisted bisulphite sequencing (CAB-seq) together with BS-seq allows the quantification of 5caC by protecting 5caC from deamination by sodium bisulphite with 1-ethyl-3-[3-dimethylaminopropyl]-carbodiimide hydrochloride (Lu et al., 2013). CpG methyltransferase (M.SssI) assisted bisulphite sequencing (MAB-seq) when combined with BS-seq distinguishes 5fC/5caC from C (Wu et al., 2014). A summary of the read-outs of the described bisulphite sequencing approaches is listed in Figure 1A.

In order to estimate proportions of multiple methylation modifications, one has to deconvolute and integrate data from multiple bisulphite based measurements (Fig. 1A) which often have biases due to imperfect experimental steps (Plongthongkum et al., 2014).

Many computational methods have been developed for analysing the standard bisulphite sequencing data (here we will describe only the most relevant methodologies, for a more comprehensive list of different methods see Äijö et al., 2016). Methods based on beta-binomial models have been proposed allowing modeling of sampling and biological variation. For instance, MOABS uses a hierarchical beta-binomial model with an empirical Bayesian approach (Sun et al., 2014). To assess differential methylation, MOABS uses credible methylation difference metric for summarizing statistical and biological significance (Sun et al., 2014). Another method, RADMeth, takes into account covariates under the beta-binomial model using a generalised linear model approach with the logit link function (Dolzhenko and Smith, 2014). RADMeth detects differential methylation by using the log-likelihood ratio test and the evidence for differential methylation across neighbouring cytosines is shared using the Stouffer-Liptak weighted Z test. Recently, the MACAU method was proposed, which combines a binomial mixed model with a sampling-based inference algorithm to model various genetic relatedness/population structures (Lea et al., 2015). MACAU uses Wald test statistics on the posterior samples to call whether a covariate has an effect on methylation (Lea et al., 2015).

Fig. 1. (A) The conversion chart of C, 5mC, 5hmC, 5fC and 5caC in BS-seq, oxBS-seq, TAB-seq, CAB-seq, fCAB-seq, redBS-seq and MAB-seq experiments. (B) The experimental steps of BS- and oxBS-seq experiments are represented in terms of experimental parameters. Green and red arrows depict successful and unsuccessful steps, respectively. (C) The proposed hierarchical model for modeling methylation modification proportions for BS-seq and oxBS-seq data and parts of the original Lux model represented in the plate notation. The grey and white circles are used to represent observed variables and latent variables, respectively. The grey squares represent fixed hyperparameters. The components, which model the experimental parameters and control cytosines are the same as in the Lux model (Äijö et al., 2016)

i2 T. Äijö et al.

potassium perruthenate (KRuO4)

5fC (Fig. 2D), with no detectable degradation of C (fig. S1C). Thus, the oxidative bisulfite protocol specifically converts 5hmC to U in DNA, leaving C and 5mC unchanged, enabling quantitative, single-nucleotide-resolution sequencing on widely available platforms.

We then used oxBS-Seq to quantitatively map 5hmC at high resolution in the genomic DNA of mouse embryonic stem (ES) cells. We chose to combine oxidative bisulfite with reduced representation bisulfite sequencing (RRBS) (21), which allows deep, selective sequencing of a fraction of the genome that is highly enriched for CpG islands (CGIs). We generated RRBS and oxidative RRBS (oxRRBS) data sets, achieving an average sequencing depth of ~120 reads per CpG, which when pooled yielded an average of ~3300 methylation calls per CGI (fig. S3). After applying depth and breadth cutoffs (supplementary materials, materials and methods), 55% (12,660) of all CGIs (22) were covered in our data sets.

To identify 5hmC-containing CGIs, we tested for differences between the RRBS and oxRRBS data sets using stringent criteria, yielding a false discovery rate of 3.7% (supplementary materials, materials and methods). We identified 800 5hmC-containing CGIs, which had an average of 3.3% (range of 0.2 to 18.5%) CpG hydroxymethylation (Fig. 3, A and B). We also identified 4577 5mC-containing CGIs averaging 8.1% CpG methylation (Fig. 3B). We carried out sequencing on an independent biological duplicate sample of the same ES cell line but at a different passage

[Fig. 1 panels. (A) Normalised concentration of d5fC and dU versus time in minutes. (B) Reaction scheme: 5hmC, oxidation, 5fC, then 1) NaHSO3 and 2) NaOH, giving U. (C) Workflow: input DNA is sequenced after BS and amplification and, in parallel, after oxidation followed by BS and amplification; the sequences are then compared.]

Readout table from panel (C):

Base    Sequence   BS Sequence   oxBS Sequence
C       C          T             T
5mC     C          C             C
5hmC    C          C             T

Fig. 1. A method for single-base resolution sequencing of 5hmC. (A) Reaction of 2′-deoxy-5-formylcytidine (d5fC) with NaHSO3 (bisulfite) quenched by NaOH at different time points and then analyzed with high-performance liquid chromatography (HPLC). Data are mean ± SD of three replicates. (B) Oxidative bisulfite reaction scheme: oxidation of 5hmC to 5fC followed by bisulfite treatment and NaOH to convert 5fC to U. The R group is DNA. (C) Diagram and table outlining the BS-Seq and oxBS-Seq techniques.

[Fig. 2 panels. (A) Single-stranded DNA oxidation and (B) double-stranded DNA oxidation: normalised concentrations of 5hmC and 5fC in input and oxidised samples. (C) Percentage C-to-T conversion for 5mC and 5hmC in the single-5hmCpG and multiple-5hmCpG fragments (bar labels 94.5 and 2.1 for the single-5hmCpG fragment, 94.7 and 2.1 for the multiple-5hmCpG fragment). (D) Genomic DNA oxidation: normalised peak areas of 5hmC and 5fC in input and oxidised samples.]

Fig. 2. Quantification of 5hmC oxidation. (A) Levels of 5hmC and 5fC (normalized to T) in a 15-nucleotide oligomer ssDNA oligonucleotide before and after KRuO4 oxidation, measured with mass spectrometry. (B) Levels of 5hmC and 5fC (normalized to 5mC) in a 135-nucleotide oligomer dsDNA fragment before and after KRuO4 oxidation. (C) C-to-T conversion levels as determined by means of Illumina sequencing of two dsDNA fragments containing either a single 5hmCpG (122-nucleotide oligomer) or multiple 5hmCpGs (135-nucleotide oligomer) after oxidative bisulfite treatment. 5mC was also present in these strands. (D) Levels of 5hmC and 5fC (normalized to 5mC in primer sequence) in ES cell DNA measured before and after oxidation. Data are mean ± SD.


BS-seq, oxBS-seq, etc.

Figure from (Booth et al, 2012)


Bisulfite sequencing (BS-seq) protocol

I Bisulfite treatment of genomic DNA converts unmethylated cytosines to uracils, which are read as thymine during sequencing

I Methylated (and hydroxymethylated) cytosines are resistant to the conversion and are read as cytosine

146 | VOL.9 NO.2 | FEBRUARY 2012 | NATURE METHODS

REVIEW

amplifying bisulfite-treated DNA by PCR yields products in which unmethylated cytosines appear as thymines. By comparing the modified DNA with the original sequence, the methylation state of the original DNA can therefore be inferred. Bisulfite treatment of 5-hydroxymethylcytosine (5hmC) yields a similar intermediate to 5mC, meaning that BS-seq can be used to detect whether a position is (hydroxy-) methylated but not to determine the exact type of modification [21,25] (Fig. 1). This limitation does not apply to antibody-based techniques, which can be used to specifically enrich 5hmC [26–28].

Capillary electrophoresis–based bisulfite sequencing was considered the gold standard for methylation analysis because of its clear readout and single-base resolution [22], but it could only be applied to relatively small regions. New sequencing technologies mean that BS-seq is now a viable option for the sequencing of entire mammalian methylomes [6–8,29–32] (Supplementary Table 1).

For researchers primarily interested in CpG island methylation, the cost of bisulfite sequencing can be reduced by enriching CpG-dense regions by digesting genomic DNA with a methylation-insensitive restriction enzyme containing a C-G as part of its recognition site and selecting short fragments [6,30,33]. Even though the selected fragments are used to interrogate only a few percent of the genome, these data are informative for the majority of CpG islands. This approach, termed reduced representation BS-seq (RRBS), has been extensively described and compared to other techniques [23,33–35], and several genome-wide methylation maps based on RRBS have been reported [6,30].

In this Review we provide an overview of the computational analysis of bisulfite sequencing data. We highlight points to consider when designing a BS-seq experiment and point out pitfalls that can occur during the initial analysis. We also discuss different alignment strategies and their implementation by current bioinformatic tools. In particular, we present the main differences between the analysis of base space (Illumina) and color space (SOLiD, Applied Biosystems) BS-seq data.

Challenges of BS-seq data mapping

As the methylation state of bisulfite-treated DNA must be inferred by comparison to an unmodified reference sequence, a correct alignment is of critical importance. This is challenging because the aligned sequences do not exactly match the reference, and the complexity of the libraries is reduced. Also, as cytosine methylation is not symmetrical, the two strands of DNA in the reference genome must be considered separately. A single site can have a different methylation state in different cells. Thus, when sequencing cell mixtures or tissue fractions, the percentage of methylation at each site needs to be determined [36].

When performing an alignment one must discriminate between different types of bisulfite-treated DNA libraries (for a schematic drawing, see ref. 16). In the first, termed directional libraries, adapters are attached to the DNA fragments such that only the original top or bottom strands will be sequenced [7,30]. Alternatively, all four DNA strands that arise through bisulfite treatment and subsequent PCR amplification can be sequenced with the same frequency in nondirectional libraries [32,37,38]. BS-seq mapping may therefore require up to four different strand alignments to be analyzed for each sequence. Because of the complexity of BS-seq alignments, standard sequence alignment software cannot be used. However, several different tools for BS-seq analysis have been developed.

Base-space BS-seq data alignments

Methylation-‘aware’ alignment tools consider both cytosine and thymine as potential matches to a genomic cytosine. This strategy provides the highest possible mapping efficiency (high sensitivity) because it makes optimal use of the information present in the reads. However, a drawback of this technique is that methylated sequences will be aligned with greater efficiency because they carry more information than their unmethylated counterparts, leading this type of aligner to overestimate methylation levels.

Alternatively, in unbiased approaches usually any residual cytosines in the BS-seq read and all cytosines in the reference genome are converted into thymines before the alignment is performed [7,30]. This means that the read sequence to be aligned is unaffected by its methylation state. It also means that there will be an exact match between the converted read and converted genome sequence so that standard sequence alignment tools can be used to perform the mapping [39,40]. This approach, however, comes at the cost of slightly reduced mapping efficiencies (Fig. 2a).

BS-seq in color space

In contrast to the intuitive base-space sequence generated by Illumina sequencers, SOLiD sequencing (Applied Biosystems) encodes its reads in color space such that each color resembles the transition from one base to the next [41]. Single-nucleotide polymorphisms (SNPs) can be called with high confidence because they will result in two adjacent color changes, whereas technical errors are indicated by a single color change (Supplementary Fig. 1a,b). Owing to the way color-space encoding works, residual cytosines are correctly converted into thymines in the bisulfite reads in silico before the mapping only if the reads are completely error-free. A single measurement error in the read would lead to incorrect conversions throughout the rest of the read (Supplementary Fig. 1c). As a consequence, the in silico cytosine to thymine conversion, which guarantees unbiased alignments, should not be performed on color-space datasets.

Current tools to align color space BS-seq data to a reference genome either use methylation-aware alignments (SOCS-B [42]), which can be computationally intensive for complex genomes,

[Figure 1 schematic: the top and bottom strands, carrying mC marks, undergo bisulfite conversion and then PCR amplification, giving the OT, CTOT, OB and CTOB strands.]

Figure 1 | Effect of bisulfite treatment of DNA. Bisulfite conversion of genomic DNA and subsequent PCR amplification gives rise to two PCR products and up to four potentially different DNA fragments for any given locus. (Hydroxy)methylated cytosine residues are resistant to bisulfite conversion and can be used as a readout of the DNA methylation state. mC, 5-methylcytosine; hmC, 5-hydroxymethylcytosine; OT, original top strand; CTOT, strand complementary to the original top strand; OB, original bottom strand; and CTOB, strand complementary to the original bottom strand.

Figure from (Krueger et al, 2012)


Reduced representation BS-seq (RRBS-seq)

I BS-seq provides an accurate map of methylation state at single nucleotide resolution

I Whole genome analysis is expensive because only about 1% of the human genome contains CpGs

→ Experimental techniques to enrich for the areas of the genome that have a high CpG content

I Reduced representation BS-seq (RRBS-seq) uses restriction enzymes prior to bisulfite sequencing
  I MspI digests genomic DNA in a methylation-insensitive manner
  I MspI targets 5’CCGG3’ sequences and cleaves the phosphodiester bonds upstream of the CpG dinucleotide
  → Each fragment will have a CpG at each end

I RRBS-seq will cover the majority of promoters and GC rich regions
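The MspI digestion step can be illustrated with a small in-silico sketch. It is a simplification made for illustration only: it works on one strand, ignores sticky ends and end repair, and the helper names (`mspi_fragments`, `size_select`) and the size window used in the example are hypothetical rather than taken from any RRBS protocol.

```python
import re

def mspi_fragments(seq):
    """Cut a sequence at every MspI site (C^CGG, cut after the first C)."""
    seq = seq.upper()
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

def size_select(fragments, min_len, max_len):
    """Keep fragments whose length falls inside the chosen window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]

toy = "AACCGGTTTTCCGGAA"
frags = mspi_fragments(toy)
print(frags)                      # every internal fragment starts with CGG
print(size_select(frags, 6, 10))  # toy size window for this short example
```

On this single-strand view, each fragment produced by a cut starts with the CpG of the cleaved site; the CpG at the other end lies on the complementary strand.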


Reduced representation BS-seq (RRBS-seq)

Figure from (Liang et al, 2014)


Contents

I DNA methylation

I Bisulfite sequencing (BS-seq) protocol

I Alignment and quantification of BS-seq data

I Statistical analysis of BS-seq data


Aligning BS-seq reads

I Bisulfite treatment introduces mutations into genomic DNA in a methylation dependent manner

I Alignment of BS-seq reads is more challenging
  I Standard alignment methods cannot be used directly

I Bismark tool uses the following approach to map BS-seq reads
  I Reads from a BS-seq experiment are converted into a C-to-T version and a G-to-A version
  I The same conversion for the genome
  I Bowtie alignment in the genome that has reduced complexity
  I A unique best alignment is determined from four parallel alignment processes (see next page)
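The conversion idea behind the steps above can be sketched as follows. This is not Bismark's implementation: exact string search stands in for Bowtie, only the C-to-T read against the C-to-T genome (one of the four alignment processes) is handled, and the function names and toy sequences are made up for illustration.

```python
def c_to_t(seq):
    """In-silico bisulfite conversion: every C becomes T."""
    return seq.upper().replace("C", "T")

def align_and_call(read, genome):
    """Map one read to the top strand and call methylation at genomic Cs.

    Returns (start, calls), where calls is a list of (position, methylated),
    or None if the converted read has no unique exact match.
    """
    conv_read, conv_genome = c_to_t(read), c_to_t(genome)
    start = conv_genome.find(conv_read)
    if start == -1 or conv_genome.find(conv_read, start + 1) != -1:
        return None  # unmapped or ambiguous
    calls = []
    for offset, base in enumerate(read.upper()):
        if genome[start + offset].upper() == "C":
            # a C left in the read was protected (methylated),
            # a T at a genomic C was converted (unmethylated)
            calls.append((start + offset, base == "C"))
    return start, calls

genome = "ATCGTTACGGA"
read = "TCGTTATG"   # genome[1:9] with the first C methylated, the second not
result = align_and_call(read, genome)
print(result)
```

Because the alignment is done on converted sequences, the methylation state of the read does not influence where it maps; the original read is consulted only afterwards to make the calls.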


Bismark tool

Figure from (Krueger & Andrews, 2011)


Quantifying BS-seq data

I Bismark outputs, among others, one line per read containing useful information
  I Mapping position, alignment strand, the bisulfite read sequence, its equivalent genomic sequence and a methylation call string

I Bismark automatically extracts the methylation information at individual cytosine positions
  I For different sequence contexts (CpG, CHG, CHH; where H can be either A, T or C)
  I Strand-specific or strands merged

I That is, for each cytosine Bismark outputs
  I n_i: the number of reads covering the cytosine in sample i
  I m_i: the number of methylated readouts (i.e., “C”) for the cytosine in sample i

I One way to quantify methylation proportion is

$$p_i = \frac{m_i}{n_i} = \frac{\text{the number of C reads overlapping the cytosine}}{\text{the number of C or T reads overlapping the cytosine}}$$
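As a rough sketch of the quantification step, the counts m_i and n_i can be accumulated from per-read methylation call strings. The example relies on Bismark's convention that `Z` marks a methylated and `z` an unmethylated cytosine in CpG context; the `(start, call_string)` input format, the function name and the toy reads are assumptions for illustration, and strand handling and indels are ignored.

```python
from collections import defaultdict

def cpg_counts(reads):
    """Accumulate (m, n) per CpG position from (start, call_string) pairs."""
    n = defaultdict(int)  # reads covering the cytosine
    m = defaultdict(int)  # methylated readouts at the cytosine
    for start, calls in reads:
        for offset, symbol in enumerate(calls):
            if symbol in "Zz":          # CpG context only
                pos = start + offset
                n[pos] += 1
                if symbol == "Z":
                    m[pos] += 1
    return {pos: (m[pos], n[pos]) for pos in sorted(n)}

# Three hypothetical reads covering CpGs at positions 102 and 107
reads = [(100, "..Z....z."), (102, "Z....z..."), (101, ".z....Z..")]
counts = cpg_counts(reads)
levels = {pos: m_i / n_i for pos, (m_i, n_i) in counts.items()}
print(counts)
print(levels)
```

With low coverage the ratio m_i / n_i is a noisy estimate, which motivates the statistical models discussed next.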


Contents

I DNA methylation

I Bisulfite sequencing (BS-seq) protocol

I Alignment and quantification of BS-seq data

I Statistical analysis of BS-seq data


Beta-binomial model

I At the end, one is typically interested in testing a hypothesis, e.g. is there a statistically significant difference in methylation levels between group A and group B

I Some early methods applied e.g. the t-test on the estimated methylation fractions p_i (or their logit transformations)

I We will look at RadMeth tool (Dolzhenko and Smith, 2014)

I RadMeth uses the beta-binomial regression model, where beta-binomial is a compound distribution obtained from the binomial by assuming that its probability of success parameter follows a beta distribution


Beta-binomial model

I i = 1, . . . , s, where s is the number of samples

I For each cytosine in the genome we have the following model
  I n_i: the number of reads covering the cytosine in sample i
  I m_i: the number of reads that contain “C” readout (i.e. methylated) at the cytosine in sample i (0 ≤ m_i ≤ n_i)
  I If we knew the underlying methylation level p_i, then: M_i ∼ Binom(p_i, n_i)
  I p_i: the unknown methylation level of the cytosine in sample i
  I Instead of assuming a fixed (unknown) methylation level, assume p_i has a compounding distribution p_i ∼ Beta(α, β), α > 0, β > 0
  I The probability of observing methylation level M_i = m_i for a coverage n_i follows the so-called beta-binomial model

$$P(M_i = m_i \mid n_i, \alpha, \beta) = \int_0^1 \mathrm{Binom}(m_i \mid p_i, n_i)\,\mathrm{Beta}(p_i \mid \alpha, \beta)\,dp_i = \binom{n_i}{m_i}\frac{B(m_i + \alpha,\; n_i - m_i + \beta)}{B(\alpha, \beta)},$$

where B is the beta function
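The closed form above is easy to evaluate numerically. The sketch below uses only the Python standard library and works in log space through `math.lgamma`; the function names are illustrative, and the parameter values (coverage 20, α = 8, β = 2) are chosen just as an example.

```python
from math import comb, exp, lgamma, log

def log_beta(a, b):
    """log B(a, b) computed from log-gamma functions."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def betabinom_pmf(m, n, alpha, beta):
    """P(M = m | n, alpha, beta) for the beta-binomial model."""
    if not 0 <= m <= n:
        return 0.0
    return exp(log(comb(n, m))
               + log_beta(m + alpha, n - m + beta)
               - log_beta(alpha, beta))

n = 20
pmf = [betabinom_pmf(m, n, 8.0, 2.0) for m in range(n + 1)]
total = sum(pmf)                               # should be 1 up to rounding
mean = sum(m * p for m, p in enumerate(pmf))   # should equal n*alpha/(alpha+beta)
print(round(total, 10), round(mean, 10))
```

A quick sanity check is the special case α = β = 1, where p_i is uniform on [0, 1] and every count from 0 to n_i is equally likely.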



Beta-binomial model

I An illustration of binomial / beta / beta-binomial densities

[Figure: six panels titled "binomial: p=0.8", "p=0.8", "beta: a=80, b=20", "beta-binomial: a=80, b=20", "beta: a=8, b=2" and "beta-binomial: a=8, b=2". The binomial and beta-binomial panels are plotted over counts 0 to 20, the others over the interval 0 to 1.]

Binomial and beta-binomial densities


Beta-binomial model

I Mean and variance of the beta-binomial model are

$$\mu = \frac{n_i \alpha}{\alpha + \beta} \quad \text{and} \quad \sigma^2 = \frac{n_i \alpha \beta (\alpha + \beta + n_i)}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$

I Reparameterization
  I $\pi = \frac{\alpha}{\alpha + \beta}$ is the average methylation level of a set of replicate samples
  I $\gamma = \frac{1}{\alpha + \beta + 1}$ is the common dispersion parameter

allows us to write the same model as

$$M_i \sim \mathrm{BetaBinomial}(n_i, \pi, \gamma)$$

where the mean and the variance are now defined as
  I E(M_i) = n_i π
  I Var(M_i) = n_i π(1 − π)(1 + (n_i − 1)γ)

I Recall that the variance of the binomial distribution is n_i π(1 − π), which is smaller than Var(M_i) for n_i ≥ 2
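The two parameterizations can be checked against each other numerically. The sketch below is only a consistency check with illustrative values (n_i = 20, α = 8, β = 2); the function names are made up for this example.

```python
def moments_alpha_beta(n, a, b):
    """Mean and variance of the beta-binomial in the (alpha, beta) form."""
    mean = n * a / (a + b)
    var = n * a * b * (a + b + n) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

def reparameterize(a, b):
    """Map (alpha, beta) to mean level pi and dispersion gamma."""
    return a / (a + b), 1.0 / (a + b + 1)

def moments_pi_gamma(n, pi, gamma):
    """Mean and variance in the (pi, gamma) form."""
    return n * pi, n * pi * (1 - pi) * (1 + (n - 1) * gamma)

n, a, b = 20, 8.0, 2.0
pi, gamma = reparameterize(a, b)
mean1, var1 = moments_alpha_beta(n, a, b)
mean2, var2 = moments_pi_gamma(n, pi, gamma)
binom_var = n * pi * (1 - pi)   # variance without overdispersion
print(mean1, var1, mean2, var2, binom_var)
```

The factor 1 + (n_i − 1)γ is the overdispersion relative to the binomial: it equals 1 when γ = 0 and grows with coverage whenever γ > 0.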


Generalized beta-binomial model

I In most of the real world applications, methylation levels can be confounded by one or more factors (e.g. age and smoking)

I The generalized linear model (GLM) generalizes the ordinary linear regression to allow for response variables that have likelihood models other than a normal distribution


Generalized beta-binomial model

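I For each sample i (and for each cytosine), the mean methylation level π_i depends on covariates $x_i = (x_{i1}, x_{i2}, \ldots, x_{it})^T$

$$g(\pi_i) = \sum_{j=1}^{t} x_{ij}\eta_j = x_i^T \eta$$

where η is a t × 1 parameter vector and

$$g(\pi) = \mathrm{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)$$

$$\pi_i = \mathrm{logit}^{-1}(x_i^T \eta) = \mathrm{logistic}(x_i^T \eta) = \frac{\exp(x_i^T \eta)}{\exp(x_i^T \eta) + 1}$$

I $\mathrm{logit}(\cdot) : \,]0, 1[\, \to \mathbb{R}$, thus $\mathrm{logit}^{-1}(\cdot) : \mathbb{R} \to \,]0, 1[$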
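A minimal sketch of the link function, assuming a made-up covariate vector (intercept, group indicator, scaled age) and made-up coefficients η; none of these values come from the lecture or from RadMeth.

```python
from math import exp, log

def logit(p):
    """Map a proportion in ]0, 1[ to the real line."""
    return log(p / (1 - p))

def inv_logit(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1 / (1 + exp(-z))
    e = exp(z)
    return e / (1 + e)

def mean_methylation(x, eta):
    """pi_i = logit^{-1}(x_i^T eta) for one sample."""
    return inv_logit(sum(x_j * eta_j for x_j, eta_j in zip(x, eta)))

eta = [0.0, 1.0, -0.2]   # hypothetical coefficients
x_case = [1, 1, 0.5]     # intercept, group = 1, scaled age = 0.5
x_ctrl = [1, 0, 0.5]     # same sample in group 0
pi_case = mean_methylation(x_case, eta)
pi_ctrl = mean_methylation(x_ctrl, eta)
print(pi_case, pi_ctrl)
```

Whatever values the linear predictor takes, the resulting π_i stays strictly between 0 and 1, which is the reason for using the logit link here.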


Model fitting and inference

I The beta-binomial regression is fit separately for each CpG site

I The parameters η and γ are estimated using maximum likelihood

I Iteratively reweighted least squares algorithm using a Newton-Raphson method

I Test the differential methylation w.r.t. a test factor η_j:
  I Learn the full model and the reduced model without the test factor
  I Compare the models using log-likelihood ratio test

$$D = -2 \ln\left(\frac{\text{likelihood of the reduced model}}{\text{likelihood of the full model}}\right)$$

  I p-value from chi-square test with d_full − d_reduced degrees of freedom, where d_full denotes the number of free parameters in the full model
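The test itself is a short computation once the two maximized log-likelihoods are available. The sketch below covers only the common case of a single test factor (one degree of freedom), where the chi-square tail probability can be written with the complementary error function; the log-likelihood values are hypothetical and the model fitting is not shown.

```python
from math import erfc, sqrt

def lrt_statistic(loglik_reduced, loglik_full):
    """D = -2 ln(L_reduced / L_full), computed from log-likelihoods."""
    return -2.0 * (loglik_reduced - loglik_full)

def chi2_sf_1df(d):
    """P(X > d) for a chi-square variable with 1 degree of freedom."""
    if d <= 0:
        return 1.0
    return erfc(sqrt(d / 2.0))

# Hypothetical maximized log-likelihoods for one CpG site
loglik_full = -100.0
loglik_reduced = -103.0
D = lrt_statistic(loglik_reduced, loglik_full)
p_value = chi2_sf_1df(D)
print(D, p_value)
```

For more than one degree of freedom a general chi-square survival function is needed, for example from a statistics library.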


RadMeth application

I Neuron and non-neuron RRBS-seq samples from mouse frontal cortex: x_i1 ∈ {0, 1}
  I 6 samples: s = 6

I Two additional factors: age (x_i2 ∈ R+), sex (x_i3 ∈ {0, 1})
  I 72 000 differentially methylated (DM) regions between neuron and non-neuron samples that contain at least 10 CpGs

I DM regions with minimum methylation difference above 0.55
  I 1708 lowly methylated (active) regions in neurons
  I These regions are associated with (located close to) 1089 genes
  I GO enrichment analysis by DAVID found a strong association of these genes with various aspects of neuronal development and function


RadMeth application

Dolzhenko and Smith BMC Bioinformatics 2014, 15:215, Page 6 of 8, http://www.biomedcentral.com/1471-2105/15/215

[Figure 2 panels: methylation tracks around Eno2 and Lrrc23 (5 kb scale bar) for neuron and non-neuron samples (female 12mo, female 6wk, male 7wk) with DMR and HMR annotations; a frequency histogram of log odds ratio, labelled "neuron low" and "non-neuron low"; a frequency histogram of minimum methylation difference.]

Figure 2 DM regions between neuron and non-neuron samples. (Top left) Methylation profile of the neuron specific enolase (Eno2) – a marker of neuron cells – across frontal cortex samples. (Right) Histogram of log-odds-ratios of DM regions containing at least 10 CpGs. (Bottom left) Histogram of minimum methylation differences of DM regions containing at least 10 CpGs.

file 1). Although predominantly glial, non-neuron samples consisted of multiple cell types. Hence the majority of DM regions, especially the ones corresponding to modest methylation changes, are likely to indicate difference between individual cell types and neurons. To obtain DM regions with consistent methylation changes between neurons and non-neurons in the majority of molecules comprising the samples, we selected DM regions with minimum methylation difference above 0.55. The 1,708 of these regions were lowly methylated in neurons and were associated with 1,089 genes. The GO term enrichment analysis, performed using DAVID [30], revealed a strong association of these genes with various aspects of neuronal development and function (see Additional file 2).

Large-scale dataset

The second dataset [31] consisted of 152 MethylC-seq libraries. The methylome samples obtained from these libraries with MethPipe [14] had mean coverage 11.2 (s.d. 2.7); 54 of these samples came from inflorescence (flower cluster) and the remaining 98 from the leaf of Arabidopsis thaliana. RADMeth identified 13,576 DM regions between the two groups of samples (see Additional file 1). Out of these, 5,049 DM regions containing at least 10 CpG sites were retained for downstream analysis.

It is well known that methylation in Arabidopsis plays an important role in silencing of transposable elements (e.g. [32]), which are usually heavily methylated. Interestingly, most of the DM regions we found overlapped transposons (1.781 observed over expected ratio; see also Figure 3). The methylation differences between inflorescence and leaf samples were modest: above 0.1 for 1,271 DM regions and above 0.2 for just 129 regions, indicating relative loss of methylation within transposons in a relatively small fraction of sequenced molecules. Promoter and gene bound DM regions were underrepresented, with 0.19 and 0.28 observed over expected ratios respectively.

Figure from (Dolzhenko and Smith, 2014)


References

I Michael J. Booth et al., Quantitative Sequencing of 5-Methylcytosine and 5-Hydroxymethylcytosine at Single-Base Resolution, Science, 336(6063):934-937, 2012

I Jeremy J Day & J David Sweatt, DNA methylation and memory formation, Nature Neuroscience, 13:1319-1323, 2010

I Egor Dolzhenko and Andrew D Smith, Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments, BMC Bioinformatics, 15:215, 2014

I Eckhardt F et al., DNA methylation profiling of human chromosomes 6, 20 and 22, Nature Genetics, 38(12):1378-85, 2006

I Felix Krueger and Simon R. Andrews, Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications, Bioinformatics, 27(11):1571-1572, 2011

I Felix Krueger et al., DNA methylome analysis using short bisulfite sequencing data, Nature Methods, 9:145-151, 2012

I Jialong Liang et al., Single-Cell Sequencing Technologies: Current and Future, Journal of Genetics and Genomics, 41(10):513-528, 2014

I Alexander Meissner et al., Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis, Nucleic Acids Res., 33(18):5868-77, 2005

I Christoph Plass et al., Mutations in regulators of the epigenome and their connections to global chromatin patterns in cancer, Nature Reviews Genetics, 14:765-780, 2013

I Dirk Schubeler, Epigenomics: Methylation matters, Nature, 462:296-297, 2009

I Cornelia G Spruijt & Michiel Vermeulen, DNA methylation: old dog, new tricks?, Nature Structural & Molecular Biology, 21:949-954, 2014