

A survey of the approaches for identifying differential methylation using bisulfite sequencing data

Adib Shafi, Cristina Mitrea, Tin Nguyen and Sorin Draghici

Corresponding author: Sorin Draghici, Department of Computer Science and Department of Obstetrics and Gynecology, Wayne State University, 14th Floor, Suite 14200, 5057 Woodward, Detroit, MI 48202. Tel.: +1(313)577-2162; Fax: +1(313) 577-6868; E-mail: sorin@wayne.edu

Abstract

DNA methylation is an important epigenetic mechanism that plays a crucial role in cellular regulatory systems. Recent advancements in sequencing technologies now enable us to generate high-throughput methylation data and to measure methylation up to single-base resolution. This wealth of data does not come without challenges, and one of the key challenges in DNA methylation studies is to identify the significant differences in the methylation levels of the base pairs across distinct biological conditions. Several computational methods have been developed to identify differential methylation using bisulfite sequencing data; however, there is no clear consensus among existing approaches. A comprehensive survey of these approaches would be of great benefit to potential users and researchers to get a complete picture of the available resources. In this article, we present a detailed survey of 22 such approaches, focusing on their underlying statistical models, primary features, key advantages and major limitations. Importantly, the intrinsic drawbacks of the approaches pointed out in this survey could potentially be addressed by future research.

Key words: DNA methylation; epigenetic modification; differentially methylated cytosines (DMCs); differentially methylated regions (DMRs); bisulfite sequencing

Adib Shafi is a PhD candidate in the Department of Computer Science at Wayne State University, USA. His research interests include biological pathway analysis, finding mechanism using multi-omics data and variant analysis.

Cristina Mitrea is a PhD candidate in the Department of Computer Science at Wayne State University, USA. Her work is focused on research in data mining techniques applied to bioinformatics and computational biology. Other interests include network discovery and meta-analysis applied to pathway analysis.

Tin Nguyen received his PhD from the Computer Science Department at Wayne State University. His research interests include computational and statistical methods for analyzing high-throughput data. His current foci are meta-analysis and multi-omics data integration.

Sorin Draghici currently holds the Robert J. Sokol, MD Endowed Chair in Systems Biology, as well as appointments as Full Professor with the Department of Computer Science and the Department of Obstetrics and Gynecology, Wayne State University. He is also the head of the Intelligent Systems and Bioinformatics Laboratory in the Department of Computer Science. His work is focused on research in artificial intelligence, machine learning and data mining techniques applied to bioinformatics and computational biology. He has published 2 best-selling books on data analysis of high-throughput genomics data, 8 book chapters and over 190 peer-reviewed journal and conference papers.

Submitted: 1 September 2016; Received (in revised form): 14 January 2017

© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com

Briefings in Bioinformatics, 2017, 1–17

doi: 10.1093/bib/bbx013

Paper

Introduction

Epigenetics is the field of study that provides information on how, where and when genes are switched on and off inside a living cell. DNA methylation is an intensively studied and well understood epigenetic mechanism that plays a vital role in many processes [1]. Due to its role in regulating gene expression, DNA methylation is an important part of cellular processes such as cell development and differentiation. Furthermore, patterns of hypermethylation have been identified in human cancers, which can provide novel insights into the development and progression of such complex diseases [2]. Specifically, in cancer, one of the causes of silenced tumor suppressor genes is hypermethylation.

The most studied form of DNA methylation, known as 5-methylcytosine (5-mc), involves the addition of a methyl group to the 5-carbon of the cytosine (C) base of a DNA strand. Although approximately only 5% of the cytosine bases in the human genome are methylated, cytosine (C) followed by a guanine (G), which is known as a CpG site, is methylated 70–80% of the time [3, 4]. Methylation can also occur in non-CpG context, such as CHG and CHH sites (where H = C, T or A), especially in plants and stem cells [5, 6]. Recent studies have also shown that the Ten-Eleven translocation (TET) proteins are involved in oxidizing 5-mc into 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC). However, the abundance level of these methylation variants (5-hmC, 5-fC, 5-caC) is low compared with that of 5-mc [7]. Therefore, our survey focuses on 5-mc methylation in CpG context, considering most of the methods have been developed for analyzing this type of epigenetic modification. When a CpG site is methylated in the promoter regions, it typically represses the transcriptional activity of that region by restricting the binding of specific transcription factors (TFs). Alternatively, when a CpG site is unmethylated in promoter regions, it allows for the binding of those TFs [8–10]. Given its regulatory role in cellular activities, identifying changes in DNA methylation across multiple biological conditions is of great interest.

The availability of the reference genome and the advanced sequencing technologies have led to methods that provide high-resolution methylation profiles on a genome scale. Based on the resolution at which the methylation levels are measured, current sequencing-based technologies can be divided in two categories: (i) enrichment-based approaches and (ii) bisulfite sequencing-based approaches [11, 12]. The former allows us to measure the methylation levels at 100–200 base resolution, while the latter allows us to measure the methylation levels at single-base resolution. One of the challenges in measuring genome-level methylation is the amount of biological material needed, which has only recently reached levels feasible for clinical samples [13]. Other challenges are related to processing data from new technologies and integrating them with different types of data in a meaningful way to provide biological insights (e.g. methylation and gene expression). In this review, we focus on bisulfite sequencing-based approaches.

Within the past few years, many tools have been developed for differential methylation (DM) analysis using bisulfite sequencing data (Figure 1), but only a few attempts have been made to provide a review of these approaches. Robinson et al. [14] provide a mini review of the approaches that identify DM, briefly discussing their methodologies and current challenges. This review not only includes the approaches that use bisulfite sequencing data but also the approaches that use DNA methylation arrays (Illumina’s 27k or 450k) and enrichment assays (MeDIP-seq). Yet, the number of approaches based on bisulfite sequencing data and the number of features considered for each approach are low. Klein et al. [15] evaluate nine approaches that can possibly be used for DM analysis. However, the methods are limited to the scope of analyzing DM in predefined regions using only reduced representation bisulfite sequencing (RRBS) data. Among the nine approaches, only four were originally designed for analyzing DM. The other five are general approaches that can be applied to RNA-Seq and gene expression data. Yu and Sun [16] evaluate only five approaches developed for the purpose of identifying differentially methylated regions (DMRs). Sun et al. [17] briefly summarize the commonly used platforms for methylation profiling, data preprocessing techniques and statistical approaches for DM analysis. This review provides a well-organized conceptual overview of approaches that identify DM using bisulfite sequencing data. However, this survey only includes seven such approaches. In summary, all previous attempts at reviewing the approaches that identify DM using bisulfite sequencing data are limited in at least one of the following aspects: (i) the total number of approaches covered in the survey (fewer than 10 methods reviewed), (ii) the applicability (e.g. only methods dealing with RRBS data) or (iii) a small number of biological features considered. To address these issues, a comprehensive survey of the approaches that identify DM using bisulfite sequencing data is greatly needed.

In this article, we review 22 different approaches for DM analysis, including approaches for identifying differentially methylated cytosines (DMCs), DMRs (both predefined and de novo regions) and methylation patterns using bisulfite sequencing data (whole genome bisulfite sequencing [WGBS] and RRBS). We classify these approaches into seven different categories based on the primary concepts and key techniques used to identify DM. In addition, we provide a short overview of several general hypothesis-based tests, which can also be applied for DM analysis. In the following sections, first, we will provide a brief overview of bisulfite sequencing technology and the workflow of analyzing bisulfite sequencing data. Next, we will provide a systematic review of the approaches, highlighting their pros and cons and discussing their key characteristics.

Bisulfite sequencing

The gold standard for measuring cytosine methylation is bisulfite sequencing, which has the advantage of measuring methylation at single-base resolution. In this technique, DNA is treated with sodium bisulfite, which deaminates unmethylated cytosines (C) to uracils (U), leaving the methylated cytosines unchanged. Uracils are read as thymines (T) during the sequencing step. The methylation level at each CpG site is estimated as the ratio C/(C+T), where C and T are the numbers of reads that show a cytosine or a thymine at that site. Thus, this process allows sequence-specific discrimination between methylated and unmethylated CpG sites [18].
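As a minimal illustration of the C/(C+T) ratio described above, the following Python sketch computes per-site methylation levels from read counts. The function name, the site labels and the counts are hypothetical and are not taken from any of the surveyed tools.

```python
def methylation_level(c_reads: int, t_reads: int) -> float:
    """Return C / (C + T) for one CpG site.

    c_reads: reads showing C (cytosine protected from conversion, i.e. methylated)
    t_reads: reads showing T (cytosine converted by bisulfite, i.e. unmethylated)
    """
    total = c_reads + t_reads
    if total == 0:
        # A site without coverage has no defined methylation level
        raise ValueError("site has no covering reads")
    return c_reads / total


# Hypothetical counts for three CpG sites: (reads showing C, reads showing T)
sites = {"chr1:1000": (18, 2), "chr1:1050": (5, 5), "chr1:1100": (0, 12)}
levels = {pos: methylation_level(c, t) for pos, (c, t) in sites.items()}
```

Note that the ratio alone discards the total coverage (20, 10 and 12 reads here), which is why several of the approaches discussed later model the counts directly.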

Several technologies have been developed for measuring DNA methylation based on bisulfite sequencing conversion. The most comprehensive protocol among them is WGBS, which provides genome-wide DNA profiling. However, the application of this protocol on the whole genome is expensive when it comes to studying organisms with large genomes. More cost-effective protocols, such as RRBS and enhanced RRBS, have allowed for methylation analysis with reduced sequencing requirements through a more targeted approach for CpG-rich genomic regions that meet specific length requirements [19]. These techniques, therefore, are more affordable for studies with multiple replicates.

Figure 1: Timeline of the approaches that identify DM using bisulfite sequencing data.

The overall workflow for bisulfite sequencing data analysis is displayed in Figure 2. The overall pipeline consists of six major elements: (i) the input, including methylation data (in FASTA/FASTQ format) and the reference genome, (ii) data processing and quality control, (iii) alignment of short reads to the reference genome, (iv) post-alignment analysis, (v) DM analysis and (vi) the output, including DMCs, DMRs and methylation patterns. The details of each element will be described in the following sections.

Pre-analysis

Data preprocessing

Bisulfite sequencing data consist of short read sequences in the FASTA/FASTQ file format. Data processing starts with performing quality control operations on the raw sequencing reads, including quality trimming and adapter trimming. Quality trimming reduces methylation call errors by trimming the bases that have poor quality scores, whereas adapter trimming removes the known adapters from short reads to increase mapping efficiency. Existing tools for quality control include FASTX-Toolkit [20], PRINSEQ [21], SolexaQA [22], Cutadapt [23], Trimmomatic [24] and Trim Galore [25]. Both the input and output of these tools are files in the FASTA/FASTQ format.
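The two trimming operations can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the algorithm of any tool listed above: it assumes Phred+33 quality encoding, trims only from the 3' end, and removes an adapter only when it occurs as an exact substring. The read, the quality string and the adapter sequence are hypothetical.

```python
def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33):
    """Drop trailing bases whose Phred score (assumed Phred+33) is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def strip_adapter(seq: str, qual: str, adapter: str):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


# 'I' encodes Phred 40; '#' and '!' encode Phred 2 and 0
trimmed_seq, trimmed_qual = trim_3prime("ACGTTGCA", "IIIIII#!")

# Hypothetical adapter sequence appended to a 4-base insert
clean_seq, clean_qual = strip_adapter("ACGTAGATCGGAAG", "I" * 14, "AGATCGGAAG")
```

Production trimmers additionally tolerate mismatches and partial adapter matches at the read end, which this sketch does not attempt.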

Read mapping

After quality control, bisulfite sequencing reads can be aligned to the reference genome to estimate the methylation levels. Simply aligning these reads by using standard aligners results in poor mapping efficiency because the bisulfite treatment introduces additional discrepancies between the sequencing reads and the reference genome by converting the unmethylated cytosines to thymines. Therefore, new strategies were proposed for bisulfite sequencing read alignment. Existing bisulfite sequencing alignment approaches can be divided in two categories: three-letter aligners and wildcard aligners. Three-letter aligners, such as Bismark [26], BS Seeker [27], MethylCoder [28], BRAT [29] and GNUMAP-bs [30], convert all Cs into Ts in the forward strand and all Gs into As in the reverse strand of the reference genome. Equivalently converted reads are then aligned to these pre-converted forms of the reference genomes using standard genome aligners such as Bowtie [31] and Bowtie2 [32]. In contrast, wildcard aligners, such as BSMAP [33], RRBSMAP [34], GSNAP [35] and RMAP [36], replace the Cs of the reference genome with the wildcard letter Y that matches both Cs and Ts in the sequencing reads. The alignment results are usually stored in SAM/BAM file format.

Figure 2: The workflow of analyzing DNA methylation using bisulfite sequencing data.
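The three-letter strategy can be illustrated with a short Python sketch. It is a toy example under simplifying assumptions (a single ungapped read that already sits at the right offset, forward strand only), and the helper names and sequences are hypothetical; real aligners delegate the search itself to tools such as Bowtie.

```python
def c_to_t(seq: str) -> str:
    """In silico conversion used for the forward strand: every C becomes T."""
    return seq.upper().replace("C", "T")


def g_to_a(seq: str) -> str:
    """In silico conversion used for the reverse strand: every G becomes A."""
    return seq.upper().replace("G", "A")


def call_methylation(read: str, ref_window: str):
    """Compare an aligned read with the ORIGINAL reference, base by base.

    At reference cytosines, a C in the read means the base was protected
    (methylated) and a T means it was converted (unmethylated).
    Returns a list of (offset, is_methylated) pairs.
    """
    calls = []
    for i, (r, g) in enumerate(zip(read, ref_window)):
        if g == "C" and r in "CT":
            calls.append((i, r == "C"))
    return calls


reference = "ACGTCGACCG"
read = "ACGTTGATCG"  # hypothetical bisulfite-treated read from the forward strand

# After conversion both sequences use the same three-letter alphabet,
# so the bisulfite-induced C/T differences no longer count as mismatches
matches = c_to_t(read) == c_to_t(reference)
calls = call_methylation(read, reference)
```

The design point is that conversion is applied only for the alignment step; the methylation calls are made afterwards against the unconverted sequences.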

Post-alignment analysis

After mapping the reads, an optional post-alignment step can be performed to extract meaningful biological information from the alignment results before DM analysis. Several post-alignment analysis tools have been developed, including BiQ Analyzer [37], QUMA [38], BRAT [29], MethyQA [39], BSPAT [40] and MethGo [41]. Most of these tools provide summary statistics, quality assessment and visualization of the methylation data. Some of these tools include extra features such as read mapping (e.g. BSPAT and BRAT), identifying DNA methylation co-occurrence pattern (e.g. BSPAT), single nucleotide polymorphisms and copy number variation calling (e.g. MethGo) and detecting allele-specific methylation patterns (e.g. BSPAT).

DM analysis

After obtaining the methylation information of the CpG sites, typically the next downstream analysis is to perform DM analysis, which is usually done in the form of identifying DMCs or DMRs. Identification of DMCs involves comparing the methylation level at each CpG site across the phenotypes (two or more) and applying statistical tests for hypothesis testing. Identification of DMRs is usually a two-step process: (i) the identification of DMCs and (ii) grouping the neighboring DMCs as contiguous DMRs by certain distance criteria. However, some approaches can directly identify DMRs. DMCs/DMRs occasionally can be linked to transcriptional repression of the associated genes; therefore, they provide crucial biological insights that may lead to the development of potential drug candidates [1].
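Step (ii) of the two-step process can be sketched as a single pass over sorted DMC positions. The thresholds (`max_gap`, `min_sites`) and the positions below are hypothetical; each surveyed tool defines its own distance and size criteria.

```python
def merge_dmcs(positions, max_gap=100, min_sites=3):
    """Group DMC positions on one chromosome into candidate DMRs.

    Consecutive DMCs stay in the same region while they are at most
    max_gap bases apart; regions with fewer than min_sites DMCs are dropped.
    Returns a list of (start, end, number_of_dmcs) tuples.
    """
    regions, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current region and start a new one
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Hypothetical DMC coordinates on a single chromosome
dmcs = [100, 150, 180, 900, 1500, 1540, 1600, 1650]
dmrs = merge_dmcs(dmcs)
```

Here the isolated DMC at position 900 does not form a region on its own, which is the behaviour a minimum-size criterion is meant to enforce.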

To identify putative DMCs/DMRs from bisulfite sequencing data, some characteristics need to be considered. One such characteristic is the ‘spatial correlation’ between the methylation levels of the neighboring CpG sites, which plays an important role in getting an accurate estimation of the methylation levels [3, 42]. Incorporating spatial correlation in DM analysis can reduce the required sequencing depth and makes it possible to estimate the methylation status of the missing CpG sites [43]. ‘Sequencing depth’ is another important characteristic that is directly related to the certainty of the methylation scores of CpG sites. Considering sequencing depth while identifying DMRs is crucial because it accounts for the sampling variability that occurs during sequencing. Another such characteristic is ‘biological variation’ among replicates, which is crucial in identifying the regions that consistently differ between groups of samples [44, 45]. Ignoring biological variation while detecting DMRs might lead to a high number of false positives in the results [14, 43, 46]. This is due to the fact that the methylation levels of the CpG sites are heterogeneous not only when the cell types are different but also when the cells are of the same type [47–50].

Classical hypothesis testing methods, such as Fisher’s exact test (FET), chi-square (χ²) test, regression approaches, t-test, moderated t-test, Goeman’s global test and analysis of variance (ANOVA), can be used to identify DM using bisulfite sequencing data [3, 46, 51, 52, 53]. These approaches can be divided into two categories based on the data type they use: count-based hypothesis tests and ratio-based hypothesis tests.

Count-based hypothesis tests

The inputs of these hypothesis testing methods are count values, which can be either the number of reads or the number of CpG sites in a predefined genomic region. FET is a classical statistical test used to determine whether there are nonrandom associations between two categorical variables. In the context of methylation analysis, we can use the data to build a contingency table where the two rows represent the two methylation states and the two columns represent a pair of samples. When applying FET for two groups of samples, the counts for a methylation status within each group are aggregated into a single number [54]. Chi-square test is another classical method to test the relationship between two categorical variables (methylated versus unmethylated). In contrast with FET, it allows for testing across multiple samples. As pointed out by Sun et al. [17] and Hurlbert et al. [55], there are several issues related to the aggregation of read counts into a single number while applying tests of independence (FET and χ² test). First, the read counts are not independent; they represent different sets of interdependent or correlated observations. Thus, aggregating the counts violates the fundamental assumption underlying the test for independence. Second, due to uneven coverage of each individual site, the results are biased toward the samples with higher coverage. Third, by aggregating (summing) the counts, some of the biological variation (e.g. sample size, intra-group variance) is not taken into account by the hypothesis testing. Therefore, using FET and χ² test to compare two groups of samples could lead to a high number of false positives [14, 43, 46].
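To make the contingency-table setup concrete, the sketch below implements a two-sided FET for a single CpG site using only the Python standard library. Rows are methylation states and columns are the two samples, as described above; the function name and the read counts are hypothetical, and the two-sided P-value uses the common convention of summing all tables that are no more probable than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table

                   sample 1   sample 2
    methylated        a          b
    unmethylated      c          d
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x methylated reads in sample 1
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


# Hypothetical site: 18 of 20 reads methylated in sample 1, 4 of 20 in sample 2
p_value = fisher_exact_two_sided(18, 4, 2, 16)
```

When replicates are pooled into the two columns, the test sees only the summed counts, which is exactly the aggregation issue raised by Sun et al. [17] and Hurlbert et al. [55].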

Regression approaches (e.g. Poisson, quasi-Poisson, negative binomial regression) are primarily used for detecting differentially expressed genes using RNA-Seq data, but they can also be applied in the context of DM analysis [15]. For example, the read counts can be modeled using a Poisson distribution, and a modified Wald test can be used to detect DM as the difference between two Poisson means [56, 57].

Ratio-based hypothesis tests

These hypothesis tests use methylation percentage (methylation ratio) instead of count values. For a particular CpG site, methylation percentage is calculated by taking the ratio between the methylated read counts and the total read counts of that site. To compare the methylation difference level between two groups (phenotypes) of samples, classical tests such as t-test [58, 59], moderated t-test (limma) [60] or Goeman’s global test [61] can be used. While t-test is a classical approach to compare the means, limma and Goeman’s test are empirical Bayesian approaches that were primarily designed to detect differentially expressed genes using microarray data. When analyzing methylation levels across multiple groups of samples, ANOVA [62] can be used instead of multiple pair-wise comparisons. Compared with count-based hypothesis tests, the ratio-based tests take into account the biological variation across multiple replicates. However, because they only take into account the ratio of the reads (methylated reads versus all reads), they ignore the sequencing depth within the CpG sites.
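The loss of depth information can be demonstrated with a small Python sketch. It computes a t statistic on per-replicate methylation ratios; Welch's unequal-variance form is used here purely as an illustrative choice, and all counts are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def ratios(counts):
    """counts: list of (methylated_reads, total_reads), one pair per replicate."""
    return [m / n for m, n in counts]


def welch_t(x, y):
    """Welch's t statistic for two groups of per-replicate methylation ratios."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    return (mean(x) - mean(y)) / sqrt(vx + vy)


# Hypothetical CpG site with three replicates per group at low coverage
shallow_a = [(4, 5), (9, 10), (7, 10)]
shallow_b = [(1, 5), (3, 10), (2, 10)]

# Same ratios at 20-fold higher coverage
deep_a = [(m * 20, n * 20) for m, n in shallow_a]
deep_b = [(m * 20, n * 20) for m, n in shallow_b]

t_shallow = welch_t(ratios(shallow_a), ratios(shallow_b))
t_deep = welch_t(ratios(deep_a), ratios(deep_b))
```

Both statistics are identical because the test only ever sees the ratios, even though the deep data set supports each ratio with 20 times as many reads.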

Although classical hypothesis testing methods are somewhat useful, straightforward and easy to use, they are not efficient in more sophisticated methylation analysis, such as identifying de novo regions, considering spatial correlation among the methylation levels of the CpG sites and estimating methylation levels of missing CpG sites. Over the past few years, several approaches have been developed to address these challenges, which are discussed and summarized in the following subsections.

Logistic regression-based approaches

Approaches in this category model the read counts of the CpG sites by using logistic regression to identify DM. One of the popular approaches in this category is ‘methylKit’ [54], which uses logistic regression to model the methylation proportion at a given base or region when biological replicates are available. In the absence of biological replicates, methylKit uses FET to identify DM. P-values are corrected using the false discovery rate (FDR) approach or the sliding linear model approach [63]. MethylKit is commonly used to identify DMCs from predefined regions (RRBS data). However, it can also be used to identify DMRs from WGBS data based on user-defined tiling windows. A major contribution of methylKit is that it can take into account the sequencing coverage. It can incorporate additional covariates into the model and work with CHG or CHH methylation. It also provides functionalities such as sample-wise methylation summary, sample clustering, annotation and visualization of DM.
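The multiple-testing correction step can be sketched as follows, assuming that the FDR approach refers to the Benjamini-Hochberg procedure; that assumption, the function name and the example P-values are ours, and the sketch is not methylKit code.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values are monotone in the original P-values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-CpG P-values from a site-level test
q_values = bh_adjust([0.01, 0.04, 0.03, 0.005])
```

Sites whose adjusted value falls below the chosen FDR level would then be reported as DMCs.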

Another method, named ‘eDMR’ [64], was proposed as an extension of methylKit. eDMR models the distances between the neighboring CpG sites using a bimodal normal distribution and estimates DMR boundaries using a weighted cost function. After estimating the regional boundaries, DMRs are filtered based on the mean methylation difference, the number of DMCs and the number of CpG sites. The significance of the DMRs is calculated by combining the P-values of the DMCs using the Stouffer-Liptak method [65]. The P-values for DMRs are then corrected for multiple comparisons using the FDR method. eDMR provides a list of DMRs and their annotation as output.
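For orientation, the basic Stouffer Z-score combination can be written with the Python standard library as shown below. This sketch assumes independent, one-sided P-values and optional weights; it does not include any correction for spatial autocorrelation between neighboring CpG sites and should not be read as eDMR's implementation. The example P-values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer(p_values, weights=None):
    """Combine one-sided P-values with the (optionally weighted) Stouffer method.

    Each P-value must lie strictly between 0 and 1, otherwise the inverse
    normal CDF is undefined and NormalDist.inv_cdf raises an error.
    """
    std = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    # Convert P-values to Z-scores, combine, and convert back
    z = [std.inv_cdf(1.0 - p) for p in p_values]
    combined = sum(w * zi for w, zi in zip(weights, z)) / sqrt(sum(w * w for w in weights))
    return 1.0 - std.cdf(combined)


# Hypothetical P-values of three DMCs inside one candidate region
region_p = stouffer([0.04, 0.03, 0.05])
```

Several moderately small site-level P-values combine into a smaller region-level P-value, which is the intended effect of pooling evidence across a region.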

Approaches in this category take sequencing coverage into account. They can incorporate additional covariates into the model as well. However, they do not consider the biological variation among the replicates. Although eDMR estimates the significance of the identified regions based on spatial autocorrelation, it does not consider the spatial correlation among the CpG sites when estimating the methylation levels.

Smoothing-based approaches

Approaches in this category assume that methylation levels of the CpG sites vary smoothly across the genome. They perform ‘smoothing’ across the samples or predefined regions, which is a technique to estimate the methylation levels of the CpG sites by borrowing information from their neighbors. Group differences across different conditions are computed based on the estimated methylation values of the CpG sites. Finally, different statistical tests are used to identify the differentially methylated sites or regions.
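The idea of borrowing information from neighbors can be illustrated with a simple coverage-weighted kernel average. This is a stand-in chosen for brevity, not the smoothing procedure of any tool in this category; the tricube kernel, the bandwidth and the counts are all assumptions made for the example.

```python
def smooth_methylation(positions, meth, cov, bandwidth=100):
    """Kernel-weighted estimate of the methylation level at each CpG site.

    positions: genomic coordinates of the CpG sites
    meth, cov: methylated and total read counts per site
    Each neighbour within `bandwidth` bases contributes with weight
    tricube(distance / bandwidth) * coverage, so well-covered and nearby
    sites dominate the estimate.
    """
    smoothed = []
    for p in positions:
        num = den = 0.0
        for q, m, n in zip(positions, meth, cov):
            d = abs(q - p) / bandwidth
            if d >= 1.0 or n == 0:
                continue  # outside the window, or no data at this neighbour
            w = (1 - d ** 3) ** 3 * n
            num += w * (m / n)
            den += w
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed


# Hypothetical region; the site at position 120 has no coverage at all
estimates = smooth_methylation([100, 120, 150, 170], [9, 0, 8, 18], [10, 0, 10, 20])
```

The uncovered site still receives an estimate from its neighbors, which is how smoothing can fill in missing CpG sites; the same mechanism also blurs sharp, single-site changes.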

One of the most commonly used smoothing-based approaches is ‘BSmooth’ [43], which relies on smoothing across the genome within each sample. It looks for group differences via CpG-wise t-tests to identify DMRs between two groups. The BSmooth algorithm begins with aligning the sequencing reads to the reference genome. Two alternative pipelines are available for the users to align the reads. The first pipeline, which supports gapped alignment and the alignment of the paired-end bisulfite-treated reads, is based on in silico bisulfite conversion that uses the ‘Bowtie-2’ aligner to align the reads [32]. The second pipeline is based on a newly developed aligner named ‘Merman’, which supports the alignment of the colorspace bisulfite reads. After aligning the reads, sample-specific quality assessment metrics are compiled. Local likelihood smoothing is applied within a smoothing window across the samples to estimate the methylation levels of the CpG sites. A signal-to-noise statistic similar to t-test is used to identify the DMCs. Finally, DMRs are defined by merging the consecutive DMCs based on some defined criteria, such as a cutoff value of the t-statistic, maximum distance between the CpG sites and minimum number of CpG sites.

BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. It reduces the required sequencing coverage by applying the local likelihood smoothing approach across the samples. It can also identify de novo regions from WGBS data sets. On the other hand, BSmooth lacks suitable error measurement criteria within the identified DMRs. As a result, there is no way to check whether the identified CpG sites inside the predicted DMRs are true DMCs or selected erroneously. BSmooth predicts methylation values of the CpG sites based on the last observed slope. Hence, for the genomic regions that are not covered by the reads, previously observed methylation level will continue, resulting in a biased estimation of the methylation level (i.e. extrapolated methylation values of 0 and 1) [66]. BSmooth is not applicable to those data sets that do not have biological replicates. In addition, BSmooth is limited to comparisons between two groups of conditions.

Another approach in this category, ‘BiSeq’, performs the smoothing of methylation data across defined candidate regions instead of across the samples (like BSmooth) [66]. The pipeline begins with defining CpG clusters within the genome based on a minimum number of ‘frequently covered CpG sites’ (CpG sites that are covered by the majority of samples) and a proximity distance threshold defined by the user. A smoothing function is modeled for each defined cluster. While modeling the smoothing function, the coverage information for each CpG site is taken into account to make sure that the CpG site with high coverage has a greater impact on the estimated methylation level than the CpG site with low coverage. Group effects of the CpG sites are modeled using beta regression with probit link function. DMCs are identified using Wald test procedure. Next, a hierarchical testing procedure is applied to identify significant clusters containing at least one DMC. While testing the target regions, weighted FDR is applied to take into account the size of individual clusters [67]. A location-wise FDR approach is applied to trim the CpG sites that are not differentially methylated within the selected significant clusters.

One of the major contributions of the BiSeq approach is that it provides region-wise error control measurement to test the target regions. This approach is also capable of adding additional covariates to the regression model. In contrast, one of the limitations of the BiSeq approach is that it is only suitable for analyzing experiments that have predefined regions, such as RRBS data sets.

In general, smoothing-based approaches have the advantage of considering the spatial correlation between the methylation levels of the CpG sites. By performing smoothing, the required sequencing coverage and the variance of the methylation levels can be reduced [43]. Furthermore, they can estimate the methylation levels of missing CpG sites. On the other hand, smoothing-based approaches cannot detect the low CpG density regions where methylation has sharp changes, such as transcription factor binding sites (TFBS). TFBS are usually small (i.e. <50 bp), which might consist of a single CpG that is differentially methylated [68]. Thus, biological events involving a single CpG site might not be detected by the smoothing approaches. In addition, these approaches are not appropriate for biological systems whose true methylation levels of the CpG sites are not spatially correlated.

Beta-binomial-based approaches

Approaches in this category characterize the methylation read counts as a beta-binomial distribution. In the absence of any biological or technical variation, the methylated read count of a particular CpG site follows a binomial distribution, because sequencing reads over a CpG site can be either methylated or unmethylated. Whenever biological and technical variation are present in the data, methylation proportions of the CpG sites are assumed to follow a beta distribution. Therefore, in the presence of biological replicates, an appropriate statistical model for methylation analysis is the beta-binomial model, as it can take into account both sampling and biological variability.
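As a minimal sketch of this model, the beta-binomial probability of observing k methylated reads out of n can be written with a mean and a dispersion parameter. The parameterization below (dispersion = 1 / (a + b + 1) for Beta(a, b)) is one common convention and is an assumption of this example; the function names are hypothetical.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, mean, dispersion):
    """P(k methylated reads out of n) under a beta-binomial model.

    mean       : group mean methylation proportion, in (0, 1)
    dispersion : over-dispersion in (0, 1); values near 0 approach
                 the plain binomial (no biological variability)
    """
    # Convert (mean, dispersion) to the Beta shape parameters a and b.
    scale = 1.0 / dispersion - 1.0
    a, b = mean * scale, (1.0 - mean) * scale
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))
```

Under this parameterization the variance of the read count is n * mean * (1 - mean) * (1 + (n - 1) * dispersion), so the dispersion term is what inflates the variance beyond the binomial sampling variance.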

Over the past few years, several beta-binomial-based approaches have been developed to identify DM, such as DSS [69], MOABS [70], RADMeth [71], methylSig [72], DSS-single [73], MACAU [74], DSS-general [75] and GetisDMR [76]. These approaches differ from each other in the way they estimate regression parameters, calculate P-values, estimate DMR boundaries, etc.

‘DSS’ is one of the approaches in this category that relies on a beta-binomial hierarchical model to identify DM using bisulfite sequencing data. In this model, the prior distribution is constructed from the whole genome, which is either methylated or unmethylated. True methylation proportions of the CpG sites among the replicates are then modeled using the beta distribution, parameterized by group mean and a dispersion parameter. The biological variability is captured by the beta distribution, whereas the sampling variability is captured by the binomial distribution. Variation across the methylation proportion of the CpG sites relative to the group mean is captured by the dispersion parameter, which is estimated by an empirical Bayes approach. When the sample size is small, a shrinkage approach is used to estimate the dispersion parameter to improve the overall performance. Differentially methylated CpG sites are determined by using P-values from the Wald test, which is performed by comparing the mean methylation levels between two groups. Lastly, candidate DMRs are defined by applying user-specified thresholds on DMR characteristics, among which are P-value, minimum length and minimum number of CpG sites.

The key contribution of the DSS approach is the shrinkage procedure that improves the dispersion parameter estimation. For this reason, this approach is particularly useful when the sample size is small. By applying the Wald test procedure, this approach takes into consideration the biological variation and sequencing coverage.
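A Wald test comparing two group means can be sketched as follows. This is a generic illustration rather than the DSS implementation: it assumes the group means and their variance estimates have already been obtained from a fitted model (the shrinkage step is not shown), and the function name `wald_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_test(mean1, var1, mean2, var2):
    """Two-sided Wald test for a difference in mean methylation.

    mean1, mean2 : estimated group mean methylation levels at one CpG site
    var1, var2   : estimated variances of those two estimates
                   (assumed given; they must not both be zero)
    Returns the test statistic and its two-sided normal P-value.
    """
    statistic = (mean1 - mean2) / sqrt(var1 + var2)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(statistic)))
    return statistic, p_value
```

Because the variances sit in the denominator, a site with low coverage or high dispersion needs a larger mean difference to reach the same P-value.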

A more recent method named ‘DSS-single’ is an improved version of the DSS approach, which can take into account the spatial correlation among the CpG sites across the genome. In addition, DSS-single considers the within-group variation without biological replicates by using the neighboring CpG sites as ‘pseudo-replicates’. Similar to DSS, DSS-single captures the technical variability using the binomial distribution and the biological variability using the beta distribution. The beta distribution is parameterized with the group mean and dispersion parameter. DSS-single estimates the group mean using a smoothing function and the dispersion parameter using an empirical Bayes procedure. Hypothesis testing is performed using the Wald test to identify the DMCs. Later, user-defined thresholds are applied to define the DMR boundaries and select candidate DMRs.

An even more recent variation of the DSS approach, named ‘DSS-general’, identifies differentially methylated loci (DML) from bisulfite sequencing data under a general experimental design. DSS-general identifies DML by modeling the methylation count data for each locus using beta-binomial regression with the ‘arcsine’ link function. The ‘arcsine’ link function is applied to perform a data transformation that decreases the dependency of the data variance on the mean and prepares it for the next step. Due to this data transformation, the regression coefficient and the variance matrix can be estimated by applying the generalized least square method, as opposed to the beta-binomial generalized linear model or logistic regression, which are limited when values are separable (e.g. values for unmethylated sites are close to 0, values for methylated sites are close to 1). Finally, the Wald test is used to perform hypothesis testing.

The key advantage of the DSS-general approach is that it is applicable to bisulfite sequencing data with multiple groups or covariates. In addition, it uses the ‘arcsine’ link function, which is more efficient than the other widely used ‘logit’ and ‘probit’ functions because it estimates the regression parameters in one iteration.
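The variance-stabilizing effect of an arcsine transformation can be checked with a short delta-method calculation. The sketch below assumes the form g(p) = arcsin(2p - 1) and plain binomial sampling (no over-dispersion); both are assumptions of this example, and the function names are hypothetical.

```python
from math import asin, sqrt


def arcsine_link(p):
    """Assumed arcsine link: maps a proportion in [0, 1] to [-pi/2, pi/2]."""
    return asin(2.0 * p - 1.0)


def delta_method_variance(p, n):
    """Approximate variance of arcsine_link(k / n) for k ~ Binomial(n, p).

    Uses Var[g(X)] ~= g'(p)^2 * Var[X], valid for 0 < p < 1.
    """
    derivative = 1.0 / sqrt(p * (1.0 - p))  # d/dp of asin(2p - 1)
    binomial_variance = p * (1.0 - p) / n   # variance of the raw proportion
    return derivative ** 2 * binomial_variance
```

Under these assumptions the approximate variance reduces to 1 / n for any p, which is the sense in which the transformed data no longer depend on the mean.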

‘MOABS’ is another approach that relies on the beta-binomial assumption to identify DM. Similar to DSS, the prior distribution is constructed from the whole genome, resulting in a bimodal distribution. The posterior distribution follows a beta distribution, which is estimated using an empirical Bayes approach. When biological replicates are available, the posterior distribution is generated using the maximum likelihood approach. The significance of the DM between two samples is represented by a single metric named ‘credible methylation difference’, which incorporates both the biological and statistical significance of the DM. MOABS can also work with CHG or CHH methylation.

‘RADMeth’ is another analysis pipeline that relies on the beta-binomial assumption. RADMeth uses a beta-binomial regression approach with the ‘logit’ link function to model the methylation levels of the CpG sites across the samples. Regression parameters are estimated using a standard maximum likelihood approach. In the beta-binomial regression model, RADMeth incorporates the experimental factors using a model matrix. The DM of a particular site is determined by comparing two fitted regression models (i.e. reduced model without factors and full model with factors) using the log-likelihood ratio. Subsequently, P-values of the neighboring CpG sites are combined using the weighted Z-test (i.e. Stouffer-Liptak test [77]) to obtain the DMRs. The key contribution of this approach is the ability to analyze WGBS data in multiple factor experiments.
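The weighted Z-test for combining P-values of neighboring sites can be sketched as below. This is the textbook form, which treats the P-values as independent; how RADMeth chooses weights or handles correlation between neighbors is not covered here, and the function name `stouffer_liptak` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_liptak(p_values, weights=None):
    """Combine one-sided P-values with the weighted Z-test.

    p_values : P-values strictly between 0 and 1
    weights  : optional non-negative weights (default: equal weights)
    Assumes the P-values are independent.
    """
    norm = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    # Convert each P-value to a z-score (small P -> large positive z).
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    combined = sum(w * z for w, z in zip(weights, z_scores))
    combined /= sqrt(sum(w * w for w in weights))
    return 1.0 - norm.cdf(combined)
```

Two neighboring sites that are each only marginally significant yield a smaller combined P-value, which is how moderate but consistent evidence across a region is rewarded.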

‘MethylSig’ is another analysis pipeline that uses the beta-binomial model across the samples to identify either DMCs or DMRs. The pipeline begins with taking the number of Cs and Ts as input. The approach uses the beta-binomial model to estimate the methylation levels at each CpG site or region, which involves the two following steps: (i) estimate the dispersion parameter for each CpG site or region, which accounts for biological variation among the samples within a group and (ii) calculate the group methylation level at each CpG site or region using the estimated dispersion parameters. In each step, local information can be incorporated from nearby CpG sites or regions to increase statistical power. The significance level of the methylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when the sample size is small. MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance.

‘MACAU’ is based on a binomial mixed model (BMM) that takes into account the population structures from a data set. This model is a generalized beta-binomial model consisting of an extra term to model the population structure. In the absence of that extra term, this model can be reduced to a beta-binomial model. In this approach, the prior distribution is constructed from a BMM, whereas the posterior distribution is constructed from a log-normal distribution. Model parameters are estimated by using a Markov chain Monte Carlo (MCMC) algorithm-based approach. Hypothesis testing is performed by using the Wald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that it can add a predictor variable of interest in the model to check the association with any genetic background. In addition to considering biological variability among the replicates and the sampling variability among the sequencing reads, this method also takes into consideration the population variability. Furthermore, it can be applied to both WGBS and RRBS data sets.

‘GetisDMR’, a recent beta-binomial-based approach, identifies variable-size DMRs directly from WGBS data by using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatial clusters (hotspots). By incorporating this statistic into DM analysis, GetisDMR accounts for spatial correlation among the methylation levels of the CpG sites, along with the biological and sampling variability. When biological replicates are available, beta-binomial regression with a logistic link function is used to model the methylation level of each CpG site. Model parameters are estimated by using the maximum likelihood function. Hypothesis testing is performed by using the likelihood ratio test. In the absence of biological replicates, methylation levels are modeled by using the binomial distribution and hypothesis testing is performed by using FET. P-values from the hypothesis testing are further used to calculate z-scores. Finally, a local Getis-Ord statistic is used, based on the z-scores, to identify DMRs using the information from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e. z-scores) to compute a score of the nonrandom association between a data point and its neighbors, where a positive score shows a positive association and a negative score shows a negative association. This statistic is then used to identify data regions with points that exhibit nonrandom associations (i.e. DMRs).
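A one-dimensional version of the local Getis-Ord statistic (Gi*) applied to per-site z-scores can be sketched as follows. The binary neighborhood defined by `half_window` (a number of neighboring sites on each side) is an assumption made for this illustration; the weighting scheme used by GetisDMR itself is not described above.

```python
from math import sqrt


def getis_ord_gi_star(values, half_window):
    """Local Getis-Ord Gi* score for each position in a 1-D sequence.

    values      : per-CpG z-scores ordered along the genome
    half_window : sites on each side counted as neighbors (binary weights)
    Requires non-constant values and a window smaller than the sequence.
    """
    n = len(values)
    mean = sum(values) / n
    spread = sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        w_sum = hi - lo  # binary weights: sum(w) equals sum(w^2)
        local_sum = sum(values[lo:hi])
        denom = spread * sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom)
    return scores
```

Runs of sites with large positive scores mark clusters of consistently high z-scores, and their extent is determined by the data rather than by a fixed region length.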

One of the primary strengths of GetisDMR is that it can detect DMRs with variable length instead of depending on user-specified threshold parameters. It can take into account the spatial correlation between the neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model. Furthermore, it can work with multiple groups with or without biological replicates. One drawback of this approach is that it cannot work with enriched regions, such as RRBS data.

Beta-binomial-based approaches are useful because they take into account both sampling variability among the read counts and biological variability among the replicates. Furthermore, these approaches are able to identify DM at single-base resolution from low CpG-density regions (e.g. TFBS). On the other hand, most of the beta-binomial-based approaches (except DSS-single, MACAU and GetisDMR) do not take into account the spatial correlation between the methylation levels of the CpG sites.

Hidden Markov model-based approaches

Approaches in this category use a hidden Markov model (HMM) to identify differentially methylated patterns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites as methylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methylation values. Transition probabilities among the methylation states represent the distance distribution among the DMCs, whereas emission probabilities represent the likelihood of DM for the CpG sites. High transition probabilities and low transition probabilities are used to model the neighboring CpG sites that have high similarities and low similarities within their methylation levels, respectively. Parameters are usually estimated by using established learning algorithms, whereas potential DMRs are identified using different statistical approaches.
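To make the roles of transition and emission probabilities concrete, the sketch below decodes the most likely sequence of the three methylation states with the Viterbi algorithm. It is a generic HMM illustration, not the decoding procedure of any surveyed tool; the state labels and the helper `sticky_transitions` are hypothetical, and all probabilities are assumed to be supplied by the caller.

```python
from math import log

STATES = ("hyper", "hypo", "none")


def sticky_transitions(stay):
    """Log transition matrix favoring the same state at adjacent CpG sites."""
    move = (1.0 - stay) / (len(STATES) - 1)
    return {a: {b: log(stay if a == b else move) for b in STATES}
            for a in STATES}


def viterbi(log_emissions, log_transition, log_initial):
    """Most likely state path.

    log_emissions : one dict per CpG site mapping state -> log-likelihood
    """
    best = [{s: log_initial[s] + log_emissions[0][s] for s in STATES}]
    back = []
    for obs in log_emissions[1:]:
        scores, pointers = {}, {}
        for cur in STATES:
            # Best predecessor state for arriving in `cur` at this site.
            prev = max(STATES,
                       key=lambda p: best[-1][p] + log_transition[p][cur])
            scores[cur] = best[-1][prev] + log_transition[prev][cur] + obs[cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With a high probability of staying in the same state, a single ambiguous site inside a run of hypermethylated sites is still labeled hypermethylated, which is how neighboring sites lend support to each other.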

One of the approaches in this category, named ‘ComMet’ [64], included in the Bisulfighter methylation analysis suite [78, 79], combines all the samples within a group into one sample and identifies the DMRs by comparing a pair of two samples. This method captures the probability distribution of distances between the neighboring DMCs and adjusts the DMC chaining criteria automatically for each data set. Transition probabilities are estimated using an expectation maximization algorithm, whereas emission probabilities are estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated by incorporating an unsupervised learning algorithm. DMRs are identified by using a dynamic programming algorithm.

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. It takes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. On the other hand, one of the limitations of this approach is that it does not take into account the biological variation across replicates, which might lead to a higher number of false positives in the results [14, 43, 46].

Another approach in this category is ‘HMM-Fisher’ [80], which estimates the methylation status of the CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fisher models both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites using transition probability. HMM-Fisher estimates the transition probabilities using a Dirichlet distribution, whereas emission probabilities are computed using a truncated normal distribution. After estimating the methylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identified using FET. Identified DMCs are further grouped into DMRs if the distance between the CpG sites is <100 bases. Non-consecutive CpG sites are reported as DMCs in the output.
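The final grouping step, in which DMCs closer than 100 bases are chained into DMRs and the remaining sites are reported individually, can be sketched as below. The function name `group_dmcs` and its return format are assumptions of this example, and the sketch works on a single chromosome with DMC positions already called.

```python
def group_dmcs(positions, max_gap=100):
    """Chain DMC positions into DMRs when neighbors are < max_gap apart.

    positions : genomic coordinates of called DMCs on one chromosome
    Returns (dmrs, isolated): DMRs as (start, end) tuples, and the
    positions of DMCs that could not be chained with any neighbor.
    """
    runs = []
    for pos in sorted(positions):
        if runs and pos - runs[-1][-1] < max_gap:
            runs[-1].append(pos)   # close enough: extend the current run
        else:
            runs.append([pos])     # too far: start a new run
    dmrs = [(run[0], run[-1]) for run in runs if len(run) > 1]
    isolated = [run[0] for run in runs if len(run) == 1]
    return dmrs, isolated
```

Because runs grow for as long as consecutive DMCs stay within the gap, the resulting DMRs have variable length rather than a fixed window size.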

One of the major contributions of HMM-Fisher is that it can identify DMRs of variable size instead of depending on user-defined boundary thresholds. It takes the biological variation among the replicates into account and can provide both DMCs and DMRs as output. It can also be used to identify sample-wise methylation patterns.

‘HMM-DM’ [81] is another approach that uses HMM to identify DM. HMM-DM directly estimates the DM states of the CpG sites for each sample across the groups. In this approach, the transition probability of each CpG site only depends on the methylation state of the immediate previous CpG site. Like HMM-Fisher and ComMet, the transition probabilities are estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable-size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs with variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from their neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] for the purpose of identifying DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites of the regions. The probability that a sample is methylated at a specific location is calculated by taking the ratio of the methylation level of that sample and the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in minor samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make their approach less sensitive to such outliers. Finally, a region is differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined by using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
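The unweighted entropy calculation described above can be sketched in a few lines. This shows only the original Shannon entropy step; the Tukey biweight correction and the cutoff model used by QDMR are not implemented, and the function name and the choice of base-2 logarithm are assumptions of this example.

```python
from math import log2


def methylation_entropy(levels):
    """Shannon entropy of one region's methylation across samples.

    levels : average methylation level of the region in each sample
             (non-negative, with a positive total)
    Lower entropy means the methylation is concentrated in fewer
    samples, i.e. a larger difference across samples.
    """
    total = sum(levels)
    # Share of the total methylation contributed by each sample.
    probs = [m / total for m in levels]
    return -sum(p * log2(p) for p in probs if p > 0)
```

For four samples, uniform methylation gives the maximum entropy of 2 bits, whereas a region methylated in only one sample gives an entropy of 0.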

An improved approach in this category, ‘CpG_MPs’ [51], has been proposed by the same research group, which can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites, based on their normalized methylation levels, into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs by using a combinatorial algorithm with Shannon entropy. At first, the identified methylated and unmethylated regions are mapped to the reference genome, and then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds to identify DMRs; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measurement among the identified regions.

A more recent approach, ‘SMART’ [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites by estimating the methylation level of the sites based on Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation levels of that group are significantly higher (lower) than the average methylation levels of all samples, as determined by a one-sample t-test.

The major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests, such as FET, t-test and ANOVA, to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts with taking aligned read counts (e.g. output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated by using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET or ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. The major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins with taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
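Two steps of this workflow, the multiple-testing correction and the merging of nearby significant windows, can be sketched as follows. The text does not say which FDR procedure is used, so the Benjamini-Hochberg adjustment below is an assumption; the function names are hypothetical, and the re-testing of merged regions is not shown.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def merge_windows(windows, max_gap):
    """Merge (start, end) windows whose gap is smaller than max_gap."""
    merged = []
    for start, end in sorted(windows):
        if merged and start - merged[-1][1] < max_gap:
            # Overlapping or close windows are fused into one region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

In a pipeline of this kind, windows whose adjusted P-value falls below the chosen threshold would be passed to `merge_windows`, and the merged regions would then be tested again.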

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG density regions where methylation has sharp changes, such as TFBS, which can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples, with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or when it does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using the beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features, such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: MethylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, MethylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates, and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.

Spatial correlation is another factor to consider, which provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.


Table 1. Summary of the important characteristics of the 22 surveyed approaches

Columns: (1) method and reference; (2) concept used; (3) protocol; (4) primary purpose; (5) biological variation; (6) spatial distribution; (7) additional covariates; (8) error correction; (9) sequencing coverage; (10) identify de novo region; (11) total citations; (12) citations per year.

Method and reference | Concept used | Protocol | Primary purpose
methylKit [54] | Logistic regression | Both | Identify DMCs and annotate
eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs
BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates
BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction
DSS [69] | Beta-binomial | Both | Identify DMLs for small samples
MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates
RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs
methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs
DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates
MACAU [74] | Beta-binomial | Both | Identify DM using population structure
DSS-general [75] | Beta-binomial | RRBS | Identify DMLs
GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly
ComMet [78] | HMM | Both | Identify DMRs
HMM-Fisher [80] | HMM | Both | Identify DM patterns
HMM-DM [81] | HMM | Both | Identify DMRs
QDMR [83] | Shannon entropy | RRBS | Identify DMRs
CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns
SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks
COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands
DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs
swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates
metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples

For columns 5–10, the entries indicate whether the method considers or does not consider the characteristic. For the 9th column, a dedicated mark means that the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column (identify de novo regions), the entries indicate whether the method can or cannot identify de novo regions. For columns 5–10, a further mark means that the characteristic is not applicable. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.


eDMR uses autocorrelation of the methylation data; HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use an HMM; CpG_MPs uses a hotspot extension algorithm; and SMART uses Euclidean distance based on methylation similarity to take into account the spatial correlation of the CpG sites.
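To illustrate how borrowing information from neighboring CpG sites works, the following Python sketch replaces the raw ratio of each site with a coverage-weighted average over nearby sites. This is a deliberately simplified stand-in and not the smoother implemented in BSmooth or BiSeq; the function name, the 100-base window and the example counts are assumptions made for this sketch.

```python
def smooth_methylation(positions, meth_counts, total_counts, window=100):
    """Coverage-weighted average methylation within +/- `window` bases.

    Returns one smoothed level per site, or None when no reads fall
    inside the window of that site.
    """
    smoothed = []
    for center in positions:
        m_sum = 0
        n_sum = 0
        for pos, m, n in zip(positions, meth_counts, total_counts):
            if abs(pos - center) <= window:
                m_sum += m
                n_sum += n
        smoothed.append(m_sum / n_sum if n_sum else None)
    return smoothed


# Hypothetical sites on one chromosome: the second site has a single read,
# so its raw ratio (0/1) is unreliable; the last site has no close neighbors.
positions = [100, 150, 180, 5000]
meth = [9, 0, 8, 1]
total = [10, 1, 10, 10]

raw = [m / n for m, n in zip(meth, total)]
print("raw:     ", raw)
print("smoothed:", smooth_methylation(positions, meth, total))
```

The poorly covered site inherits a level close to that of its well-covered neighbors, whereas the isolated site is left unchanged. This also shows why such a procedure is of little use when CpG sites are sparse: there is nothing nearby to borrow from.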

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information is not merely dependent on the hypothesis tests but dependent on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds, such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher’s exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey, based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs/DMRs list (table), DMCs/DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
eDMR [54, 64] | R package | Standalone | Artistic/GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table), DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016
BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table), DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs/DMRs list (table), DMR methylation locus (graph) | 4 June 2012 | 17 October 2016
MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs/DMRs list (table) | 27 March 2014 | 1 May 2014 (a)
methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs/DMRs list (table), CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs/DMRs list (table), DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs/DMRs list (table), DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CC ANS | DMRs list (table) | 12 December 2014 | 29 September 2015
HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table), DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016
HMM-DM [81] | R scripts | Standalone | None | DMRs list (table), DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016
QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table), DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table), DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table), DMR methylation level (graph) | 6 January 2013 | 15 June 2014
metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic/non-profit organizations on non-commercial projects.
GNU, general public license; LGPL, lesser general public license; CC ANS, Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License; PSFL, Python Software Foundation License; CLI, command line interface.


additional filters to remove low-coverage CpG sites before estimating methylation.
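The bias introduced by pooling read counts before a count-based test can be made concrete with a short Python sketch. The code below implements a two-sided Fisher's exact test from the hypergeometric distribution using only the standard library; the helper names, the minimum-coverage threshold and the read counts are hypothetical and serve only as an illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def pool(samples):
    """Sum (methylated, total) read counts over the replicates of a group."""
    return sum(m for m, _ in samples), sum(n for _, n in samples)


def filter_low_coverage(samples, min_cov=5):
    """Drop replicates whose coverage at this site is below min_cov."""
    return [(m, n) for m, n in samples if n >= min_cov]


# Hypothetical (methylated, total) reads at one CpG site, two replicates each.
group_a = filter_low_coverage([(90, 100), (1, 10)])
group_b = filter_low_coverage([(50, 100), (5, 10)])

m_a, n_a = pool(group_a)
m_b, n_b = pool(group_b)

# The pooled level of group A (91/110) is dominated by the high-coverage
# replicate, although the two replicates have ratios of 0.9 and 0.1.
print("pooled level A:", m_a / n_a)
print("P-value:", fisher_exact_two_sided(m_a, n_a - m_a, m_b, n_b - m_b))
```

The test itself uses the coverage (the row totals of the table), but the disagreement between the two replicates of group A is invisible to it once the counts are pooled.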

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.
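Merging DMCs into regions with empirical thresholds can be sketched in a few lines of Python. The function below is a generic illustration and does not reproduce the rules of any specific tool; the parameter names and their default values (maximum gap between neighboring DMCs, minimum number of DMCs per region) are assumptions chosen for the example.

```python
def merge_dmcs(dmc_positions, max_gap=100, min_dmcs=3):
    """Group DMC positions on one chromosome into candidate DMRs.

    Neighboring DMCs at most `max_gap` bases apart are merged; a group is
    reported as (start, end, number_of_dmcs) if it has at least `min_dmcs`.
    """
    regions = []
    current = []
    for pos in sorted(dmc_positions):
        if current and pos - current[-1] > max_gap:
            # The gap is too large: close the current group.
            if len(current) >= min_dmcs:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_dmcs:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Hypothetical DMC coordinates.
dmcs = [100, 150, 220, 1000, 1050, 5000, 5040, 5090, 5100]
print(merge_dmcs(dmcs))
```

Changing `max_gap` or `min_dmcs` changes which regions are reported, which is why the tuning of such thresholds matters for the robustness of the results.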

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
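The text does not prescribe a specific correction procedure. As one common example, the Python sketch below applies the Benjamini-Hochberg step-up adjustment to a list of per-site P-values; the function name and the input values are hypothetical, and other corrections could be substituted.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical P-values for four CpG sites.
p_values = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(p_values))
```

The same function could be applied to per-region P-values, or separately within each identified region, depending on the level at which errors are to be controlled.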

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.
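To show why a beta-binomial model captures both coverage and biological variability, the following Python sketch evaluates the beta-binomial probability of observing m methylated reads out of n. The parameterization by a mean methylation level `mu` and a dispersion `phi` is an assumption made for this illustration; individual tools may parameterize and estimate the model differently.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(m, n, mu, phi):
    """P(m methylated reads out of n) for mean level mu and dispersion phi.

    Requires 0 < mu < 1 and 0 < phi < 1. The coverage n enters through the
    binomial part; phi adds variability among replicates on top of it.
    """
    alpha = mu * (1 - phi) / phi
    beta = (1 - mu) * (1 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
    return exp(log_choose
               + log_beta(m + alpha, n - m + beta)
               - log_beta(alpha, beta))


# Hypothetical site: coverage 20, mean methylation 0.7, dispersion 0.1.
n, mu, phi = 20, 0.7, 0.1
pmf = [beta_binomial_pmf(m, n, mu, phi) for m in range(n + 1)]
mean_reads = sum(m * p for m, p in enumerate(pmf))
var_reads = sum((m - mean_reads) ** 2 * p for m, p in enumerate(pmf))

print("mean:", mean_reads)
print("variance:", var_reads, "binomial variance:", n * mu * (1 - mu))
```

With this parameterization the variance exceeds the plain binomial variance by the factor 1 + (n - 1) * phi, which is the extra spread attributed to differences among replicates.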

When the sample size of the data set is small, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of the individual methods using the receiver operating characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
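The ROC-based assessment described above can be sketched as follows. This Python example computes the area under the ROC curve from regional P-values and known truth labels through the rank-based (Mann-Whitney) identity; it is a generic illustration with hypothetical data and is not the evaluation code used in the cited reviews.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney identity.

    `scores` are higher for regions believed to be differentially
    methylated; `labels` are truthy for regions that truly are.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # Count pairs in which a true region outranks a false one (ties count 0.5).
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical regional P-values with simulated ground truth.
p_values = [0.001, 0.02, 0.03, 0.50, 0.04]
truth = [1, 0, 1, 0, 0]

# Smaller P-values should rank higher, so the negated P-value is the score.
scores = [-p for p in p_values]
print("AUC:", roc_auc(scores, truth))
```

Computing this quantity for each method on the same labeled regions gives one number per method that can be compared directly, which requires exactly the kind of data set with known results that a benchmark would have to provide.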

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest for progress in this area. Our main objective in this survey is to provide the community a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users to choose the best DM analysis method based on their requirements. It will help the researchers to design experiments to generate data that are better suited for the community. In addition, this survey will guide the developers to develop new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Müller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.
40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.
41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.
42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.
44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.
45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.
46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.
47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.
48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.
49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.
50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.
51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.
52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.
53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.
54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.
55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.
57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.
58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.
59. Pearson ES, Hartley HO. Biometrika Tables for Statisticians (vol. 2). Biometrika Trust, page 385, 1976.
60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.
61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.
62. Gelman A. Analysis of variance - why it is more important than ever. Ann Stat 2005;33(1):1–53.
63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.
64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.
65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.
66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.
67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.
68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.
69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.
70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.
71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.
72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.
73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.
74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.
75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.
76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.
77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.
78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.
79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.
80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.
81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.
82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.
83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.
84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.
85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.
86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PLoS One 2015;10(7):e0132866.
87. Jühling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.
88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.
89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.



especially in plants and stem cells [5, 6]. Recent studies have also shown that the Ten-Eleven translocation (TET) proteins are involved in oxidizing 5-mC into 5-hydroxymethylcytosine (5-hmC), 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC). However, the abundance level of these methylation variants (5-hmC, 5-fC, 5-caC) is low compared with that of 5-mC [7]. Therefore, our survey focuses on 5-mC methylation in the CpG context, considering that most of the methods have been developed for analyzing this type of epigenetic modification. When a CpG site is methylated in a promoter region, it typically represses the transcriptional activity of that region by restricting the binding of specific transcription factors (TFs). Alternatively, when a CpG site is unmethylated in a promoter region, it allows for the binding of those TFs [8–10]. Given its regulatory role in cellular activities, identifying changes in DNA methylation across multiple biological conditions is of great interest.

The availability of the reference genome and the advanced sequencing technologies have led to methods that provide high-resolution methylation profiles on a genome scale. Based on the resolution at which the methylation levels are measured, current sequencing-based technologies can be divided into two categories: (i) enrichment-based approaches and (ii) bisulfite sequencing-based approaches [11, 12]. The former allows us to measure the methylation levels at 100–200 base resolution, while the latter allows us to measure the methylation levels at single-base resolution. One of the challenges in measuring genome-level methylation is the amount of biological material needed, which has only recently reached levels feasible for clinical samples [13]. Other challenges are related to processing data from new technologies and integrating them with different types of data in a meaningful way to provide biological insights (e.g. methylation and gene expression). In this review, we focus on bisulfite sequencing-based approaches.

Within the past few years, many tools have been developed for differential methylation (DM) analysis using bisulfite sequencing data (Figure 1), but only a few attempts have been made to review these approaches. Robinson et al. [14] provide a mini review of the approaches that identify DM, briefly discussing their methodologies and current challenges. This review includes not only the approaches that use bisulfite sequencing data but also the approaches that use DNA methylation arrays (Illumina’s 27k or 450k) and enrichment assays (MeDIP-seq). Yet the number of approaches based on bisulfite sequencing data and the number of features considered for each approach are low. Klein et al. [15] evaluate nine approaches that can possibly be used for DM analysis. However, the methods are limited to analyzing DM in predefined regions using only reduced representation bisulfite sequencing (RRBS) data. Among the nine approaches, only four were originally designed for analyzing DM. The other five are general approaches that can be applied to RNA-Seq and gene expression data. Yu and Sun [16] evaluate only five approaches developed for the purpose of identifying differentially methylated regions (DMRs). Sun et al. [17] briefly summarize the commonly used platforms for methylation profiling, data preprocessing techniques and statistical approaches for DM analysis. This review provides a well-organized conceptual overview of approaches that identify DM using bisulfite sequencing data. However, it includes only seven such approaches. In summary, all previous attempts at reviewing the approaches that identify DM using bisulfite sequencing data are limited in at least one of the following aspects: (i) the total number of approaches covered (fewer than 10 methods reviewed), (ii) the applicability (e.g. only methods dealing with RRBS data) or (iii) the small number of biological features considered. To address these issues, a comprehensive survey of the approaches that identify DM using bisulfite sequencing data is greatly needed.

In this article, we review 22 different approaches for DM analysis, including approaches for identifying differentially methylated cytosines (DMCs), DMRs (both predefined and de novo regions) and methylation patterns using bisulfite sequencing data (whole genome bisulfite sequencing [WGBS] and RRBS). We classify these approaches into seven different categories based on the primary concepts and key techniques used to identify DM. In addition, we provide a short overview of several general hypothesis-based tests, which can also be applied for DM analysis. In the following sections, we first provide a brief overview of bisulfite sequencing technology and the workflow of analyzing bisulfite sequencing data. Next, we provide a systematic review of the approaches, highlighting their pros and cons and discussing their key characteristics.

Bisulfite sequencing

The gold standard for measuring cytosine methylation is bisulfite sequencing, which has the advantage of measuring methylation at single-base resolution. In this technique, DNA is treated with sodium bisulfite, which deaminates unmethylated cytosines (C) to uracils (U), leaving the methylated cytosines unchanged. Uracils are read as thymines (T) during the sequencing step. The methylation level at each CpG site is estimated simply as the ratio C/(C+T), where C and T are the numbers of reads reporting a cytosine or a thymine at that site. Thus, this process allows sequence-specific discrimination between methylated and unmethylated CpG sites [18].
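The conversion and the C/(C+T) estimate can be illustrated with a minimal Python sketch. The function names and the toy sequence are hypothetical and chosen only for illustration; the sketch ignores strand handling, sequencing errors and incomplete conversion.

```python
def bisulfite_convert(seq, methylated_positions):
    """Simulate bisulfite treatment plus sequencing of one strand.

    Unmethylated C is read as T; a methylated C (its index is listed in
    methylated_positions) is protected and stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def methylation_level(c_count, t_count):
    """Estimate the methylation level of one site as C / (C + T)."""
    total = c_count + t_count
    if total == 0:
        return None  # the site is not covered by any read
    return c_count / total


# Toy example: only the cytosine at index 1 is methylated.
reference = "ACGTCGAC"
read = bisulfite_convert(reference, methylated_positions={1})
level = methylation_level(c_count=15, t_count=5)
```

Here a site covered by 15 reads showing C and 5 reads showing T receives an estimated methylation level of 0.75.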

Several technologies have been developed for measuring DNA methylation based on bisulfite conversion. The most comprehensive protocol among them is WGBS, which provides genome-wide DNA methylation profiling. However, applying this protocol to the whole genome is expensive when it comes to studying organisms with large genomes. More cost-effective protocols such as RRBS and enhanced RRBS have allowed for methylation analysis with reduced sequencing requirements through a more targeted approach for CpG-rich genomic regions that meet specific length requirements [19]. These techniques are therefore more affordable for studies with multiple replicates.

Figure 1. Timeline of the approaches that identify DM using bisulfite sequencing data.

The overall workflow for bisulfite sequencing data analysis is displayed in Figure 2. The pipeline consists of six major elements: (i) the input, including methylation data (in FASTA/FASTQ format) and the reference genome; (ii) data processing and quality control; (iii) alignment of short reads to the reference genome; (iv) post-alignment analysis; (v) DM analysis; and (vi) the output, including DMCs, DMRs and methylation patterns. The details of each element are described in the following sections.

Pre-analysis

Data preprocessing

Bisulfite sequencing data consist of short read sequences in the FASTA/FASTQ file format. Data processing starts with quality control operations on the raw sequencing reads, including quality trimming and adapter trimming. Quality trimming reduces methylation call errors by trimming the bases that have poor quality scores, whereas adapter trimming removes the known adapters from short reads to increase mapping efficiency. Existing tools for quality control include FASTX-Toolkit [20], PRINSEQ [21], SolexaQA [22], Cutadapt [23], Trimmomatic [24] and Trim Galore [25]. Both the input and output of these tools are files in the FASTA/FASTQ format.
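As a rough illustration of the two trimming operations, the following Python sketch trims low-quality bases from the 3' end of a read and cuts the read at an exact adapter match. It assumes Phred+33 quality encoding; the function names, the threshold of 20 and the exact-match adapter search are simplifying assumptions and do not reproduce the algorithms of the tools listed above.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_3prime(seq, qual, min_quality=20):
    """Drop trailing bases whose quality score is below min_quality."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return seq[:end], qual[:end]


def trim_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter.

    Simplification: production tools also handle mismatches and adapters
    that are only partially present at the end of the read.
    """
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


# 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
clipped_seq, clipped_qual = trim_adapter("ACGTAGATCGGAAG", "I" * 14, "AGATCGGAAG")
```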

Read mapping

After quality control, bisulfite sequencing reads can be aligned to the reference genome to estimate the methylation levels. Simply aligning these reads with standard aligners results in poor mapping efficiency because the bisulfite treatment introduces additional discrepancies between the sequencing reads and the reference genome by converting the unmethylated cytosines to thymines. Therefore, new strategies were proposed for bisulfite sequencing read alignment. Existing bisulfite sequencing alignment approaches can be divided into two categories: three-letter aligners and wildcard aligners. Three-letter aligners such as Bismark [26], BS Seeker [27], MethylCoder [28], BRAT [29] and GNUMAP-bs [30] convert all Cs into Ts in the forward strand and all Gs into As in the reverse strand of the reference genome. Equivalently converted reads are then aligned to these pre-converted forms of the reference genome using standard genome aligners such as Bowtie [31] and Bowtie2 [32]. In contrast, wildcard aligners such as BSMAP [33], RRBSMAP [34], GSNAP [35] and RMAP [36] replace the Cs of the reference genome with the wildcard letter Y, which matches both Cs and Ts in the sequencing reads. The alignment results are usually stored in the SAM/BAM file format.

Figure 2. The workflow of analyzing DNA methylation using bisulfite sequencing data.
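The two alignment strategies described in the read mapping paragraph can be contrasted with a small Python sketch. It is a toy comparison of equal-length strings, not an aligner; the function names are hypothetical, and real tools perform the conversion on indexed genomes and handle both strands.

```python
def convert_forward(seq):
    """Three-letter conversion for the forward strand: every C becomes T."""
    return seq.upper().replace("C", "T")


def convert_reverse(seq):
    """Three-letter conversion for the reverse strand: every G becomes A."""
    return seq.upper().replace("G", "A")


def wildcard_match(reference, read):
    """Wildcard comparison: a reference C matches C or T in the read."""
    if len(reference) != len(read):
        return False
    return all(r == q or (r == "C" and q == "T") for r, q in zip(reference, read))


reference = "ACGTCG"
read = "ATGTCG"  # first C was unmethylated (read as T), second C was methylated

# Three-letter strategy: reference and read agree once both are converted.
three_letter_hit = convert_forward(reference) == convert_forward(read)

# Wildcard strategy: the unconverted read matches the reference directly.
wildcard_hit = wildcard_match(reference, read)
```

The three-letter strategy discards the methylation information during alignment and recovers it afterwards from the original read, whereas the wildcard strategy keeps the read unchanged.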

Post-alignment analysis

After mapping the reads, an optional post-alignment step can be performed to extract meaningful biological information from the alignment results before DM analysis. Several post-alignment analysis tools have been developed, including BiQ Analyzer [37], QUMA [38], BRAT [29], MethyQA [39], BSPAT [40] and MethGo [41]. Most of these tools provide summary statistics, quality assessment and visualization of the methylation data. Some of these tools include extra features such as read mapping (e.g. BSPAT and BRAT), identifying DNA methylation co-occurrence patterns (e.g. BSPAT), single nucleotide polymorphism and copy number variation calling (e.g. MethGo) and detecting allele-specific methylation patterns (e.g. BSPAT).

DM analysis

After obtaining the methylation information of the CpG sites, the next downstream step is typically DM analysis, which is usually done in the form of identifying DMCs or DMRs. Identification of DMCs involves comparing the methylation level at each CpG site across the phenotypes (two or more) and applying statistical tests for hypothesis testing. Identification of DMRs is usually a two-step process: (i) the identification of DMCs and (ii) grouping the neighboring DMCs into contiguous DMRs by certain distance criteria. However, some approaches can directly identify DMRs. DMCs/DMRs can occasionally be linked to transcriptional repression of the associated genes; therefore, they provide crucial biological insights that may lead to the development of potential drug candidates [1].
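The second step of the two-step process, grouping neighboring DMCs into DMRs, can be sketched as follows. The function name and the `max_gap` and `min_sites` parameters are illustrative assumptions; each method surveyed here uses its own grouping criteria.

```python
def merge_dmcs(positions, max_gap=100, min_sites=3):
    """Group DMC positions on one chromosome into candidate DMRs.

    Consecutive DMCs stay in the same region while the distance to the
    previous DMC is at most max_gap; regions with fewer than min_sites
    DMCs are discarded. Returns (start, end, number_of_DMCs) tuples.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


dmrs = merge_dmcs([100, 150, 190, 1000, 1040, 5000], max_gap=100, min_sites=2)
```

With these toy positions, the isolated DMC at 5000 is not reported as a region because it does not reach `min_sites`.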

To identify putative DMCs/DMRs from bisulfite sequencing data, some characteristics need to be considered. One such characteristic is the ‘spatial correlation’ between the methylation levels of neighboring CpG sites, which plays an important role in getting an accurate estimate of the methylation levels [3, 42]. Incorporating spatial correlation in DM analysis can reduce the required sequencing depth and makes it possible to estimate the methylation status of missing CpG sites [43]. ‘Sequencing depth’ is another important characteristic that is directly related to the certainty of the methylation scores of CpG sites. Considering sequencing depth while identifying DMRs is crucial because it accounts for the sampling variability that occurs during sequencing. Another such characteristic is ‘biological variation’ among replicates, which is crucial in identifying the regions that consistently differ between groups of samples [44, 45]. Ignoring biological variation while detecting DMRs might lead to a high number of false positives in the results [14, 43, 46]. This is because the methylation levels of the CpG sites are heterogeneous not only when the cell types are different but also when the cells are of the same type [47–50].
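The link between sequencing depth and the certainty of a methylation estimate can be made concrete with a binomial confidence interval. The sketch below uses the Wilson score interval, which is a choice made here for illustration and is not taken from any of the surveyed methods; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(methylated, total, confidence=0.95):
    """Wilson score interval for a methylation proportion at one site."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = methylated / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Same observed ratio (0.8) at two different depths.
shallow = wilson_interval(4, 5)
deep = wilson_interval(40, 50)
```

Both sites have an observed methylation level of 0.8, but the interval at a depth of 5 reads is much wider than the interval at a depth of 50 reads.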

Classical hypothesis testing methods such as Fisher’s exact test (FET), chi-square (χ²) test, regression approaches, t-test, moderated t-test, Goeman’s global test and analysis of variance (ANOVA) can be used to identify DM using bisulfite sequencing data [3, 46, 51, 52, 53]. These approaches can be divided into two categories based on the data type they use: count-based hypothesis tests and ratio-based hypothesis tests.

Count-based hypothesis tests

The inputs to these hypothesis testing methods are count values, which can be either the number of reads or the number of CpG sites in a predefined genomic region. FET is a classical statistical test used to determine whether there are nonrandom associations between two categorical variables. In the context of methylation analysis, we can use the data to build a contingency table where the two rows represent the two methylation states and the two columns represent a pair of samples. When applying FET to two groups of samples, the counts for a methylation status within each group are aggregated into a single number [54]. The chi-square test is another classical method to test the relationship between two categorical variables (methylated versus unmethylated). In contrast with FET, it allows for testing across multiple samples. As pointed out by Sun et al. [17] and Hurlbert et al. [55], there are several issues related to the aggregation of read counts into a single number while applying tests of independence (FET and χ² test). First, the read counts are not independent; they represent different sets of interdependent or correlated observations. Thus, aggregating the counts violates the fundamental assumption underlying the test for independence. Second, because of uneven coverage of individual sites, the results are biased toward the samples with higher coverage. Third, by aggregating (summing) the counts, some of the biological variation (e.g. sample size, intra-group variance) is not taken into account by the hypothesis test. Therefore, using FET and the χ² test to compare two groups of samples could lead to a high number of false positives [14, 43, 46].
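A minimal, dependency-free sketch of a two-sided FET on the 2×2 table described above is given below. The function name is hypothetical, and the two-sided P-value is computed by summing the probabilities of all tables that are no more likely than the observed one, which is one common convention.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are methylation states (methylated, unmethylated) and columns
    are the two samples, so a and c are the read counts of sample 1.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small tolerance so that equally likely tables are included.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Sample 1: 3 methylated and 1 unmethylated read; sample 2: 1 and 3.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
```

When replicates are pooled, `a`, `b`, `c` and `d` are sums over samples, which is exactly the aggregation step criticized in the paragraph above: the test no longer sees how the counts were distributed across replicates.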

Regression approaches (e.g. Poisson, quasi-Poisson, negative binomial regression) are primarily used for detecting differentially expressed genes using RNA-Seq data, but they can also be applied in the context of DM analysis [15]. For example, the read counts can be modeled using a Poisson distribution, and a modified Wald test can be used to detect DM as the difference between two Poisson means [56, 57].

Ratio-based hypothesis tests

These hypothesis tests use the methylation percentage (methylation ratio) instead of count values. For a particular CpG site, the methylation percentage is calculated as the ratio between the methylated read count and the total read count at that site. To compare the methylation levels between two groups (phenotypes) of samples, classical tests such as the t-test [58, 59], moderated t-test (limma) [60] or Goeman’s global test [61] can be used. While the t-test is a classical approach to compare means, limma and Goeman’s test are empirical Bayesian approaches that were primarily designed to detect differentially expressed genes using microarray data. When analyzing methylation levels across multiple groups of samples, ANOVA [62] can be used instead of multiple pair-wise comparisons. Compared with count-based hypothesis tests, the ratio-based tests take into account the biological variation across multiple replicates. However, because they only use the ratio of the reads (methylated reads versus all reads), they ignore the sequencing depth at the CpG sites.
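The following sketch shows a ratio-based comparison at a single CpG site and makes the loss of depth information explicit. It computes a Welch t-statistic from per-sample methylation ratios; the function names and the toy counts are assumptions for illustration, and the P-value computation (which requires the t distribution) is omitted.

```python
from math import sqrt
from statistics import mean, variance


def methylation_ratios(samples):
    """Convert (methylated, total) read counts into methylation ratios."""
    return [methylated / total for methylated, total in samples]


def welch_t_statistic(x, y):
    """Welch's t-statistic for two groups of methylation ratios."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


# (methylated reads, total reads) for three replicates per group at one site.
control = methylation_ratios([(8, 10), (45, 50), (18, 20)])
case = methylation_ratios([(2, 10), (15, 50), (4, 20)])
t_stat = welch_t_statistic(control, case)

# Depth is lost: 2 of 4 reads and 50 of 100 reads give the same ratio.
same_ratio = methylation_ratios([(2, 4)]) == methylation_ratios([(50, 100)])
```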

Although classical hypothesis testing methods are straightforward and easy to use, they are not well suited to more sophisticated methylation analysis such as identifying de novo regions, considering spatial correlation among the methylation levels of the CpG sites and estimating methylation levels of missing CpG sites. Over the past few years, several approaches have been developed to address these challenges, which are discussed and summarized in the following subsections.

Logistic regression-based approaches

Approaches in this category model the read counts of the CpG sites using logistic regression to identify DM. One of the popular approaches in this category is ‘methylKit’ [54], which uses logistic regression to model the methylation proportion at a given base or region when biological replicates are available. In the absence of biological replicates, methylKit uses FET to identify DM. P-values are corrected using the false discovery rate (FDR) approach or the sliding linear model approach [63]. MethylKit is commonly used to identify DMCs from predefined regions (RRBS data). However, it can also be used to identify DMRs from WGBS data based on user-defined tiling windows. A major contribution of methylKit is that it takes the sequencing coverage into account. It can incorporate additional covariates into the model and work with CHG or CHH methylation. It also provides functionalities such as sample-wise methylation summary, sample clustering, annotation and visualization of DM.

Another method, named ‘eDMR’ [64], was proposed as an extension of methylKit. eDMR models the distances between the neighboring CpG sites using a bimodal normal distribution and estimates DMR boundaries using a weighted cost function. After estimating the regional boundaries, DMRs are filtered based on the mean methylation difference, the number of DMCs and the number of CpG sites. The significance of the DMRs is calculated by combining the P-values of the DMCs using the Stouffer-Liptak method [65]. The P-values for DMRs are then corrected for multiple comparisons using the FDR method. eDMR provides a list of DMRs and their annotation as output.
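The basic Stouffer-Liptak combination of site-level P-values can be sketched in a few lines. This version assumes independent one-sided P-values strictly between 0 and 1 and optional non-negative weights; the correction for spatial autocorrelation between neighboring sites is not included, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_liptak(p_values, weights=None):
    """Combine P-values with the (weighted) Stouffer-Liptak Z-method.

    Assumption: the P-values are independent; correlated neighboring
    CpG sites would need an additional correction.
    """
    standard_normal = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    z_scores = [standard_normal.inv_cdf(1 - p) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1 - standard_normal.cdf(combined_z)


# Three moderately significant DMCs inside one candidate region.
region_p = stouffer_liptak([0.04, 0.03, 0.05])
```

Several moderately significant sites reinforce each other, so the combined regional P-value is smaller than any of the individual P-values in this example.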

Approaches in this category take sequencing coverage into account. They can incorporate additional covariates into the model as well. However, they do not consider the biological variation among the replicates. Although eDMR estimates the significance of the identified regions based on spatial autocorrelation, it does not consider the spatial correlation among the CpG sites when estimating the methylation levels.

Smoothing-based approaches

Approaches in this category assume that methylation levels of the CpG sites vary smoothly across the genome. They perform ‘smoothing’ across the samples or predefined regions, which is a technique to estimate the methylation levels of the CpG sites by borrowing information from their neighbors. Group differences across different conditions are computed based on the estimated methylation values of the CpG sites. Finally, different statistical tests are used to identify the differentially methylated sites or regions.

One of the most commonly used smoothing-based approaches is ‘BSmooth’ [43], which relies on smoothing across the genome within each sample. It looks for group differences via CpG-wise t-tests to identify DMRs between two groups. The BSmooth algorithm begins with aligning the sequencing reads to the reference genome. Two alternative pipelines are available for aligning the reads. The first pipeline, which supports gapped alignment and the alignment of paired-end bisulfite-treated reads, is based on in silico bisulfite conversion and uses the ‘Bowtie2’ aligner to align the reads [32]. The second pipeline is based on a newly developed aligner named ‘Merman’, which supports the alignment of colorspace bisulfite reads. After aligning the reads, sample-specific quality assessment metrics are compiled. Local likelihood smoothing is applied within a smoothing window across the samples to estimate the methylation levels of the CpG sites. A signal-to-noise statistic similar to the t-test is used to identify the DMCs. Finally, DMRs are defined by merging consecutive DMCs based on defined criteria such as a cutoff value of the t-statistic, maximum distance between the CpG sites and minimum number of CpG sites.

BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. It reduces the required sequencing coverage by applying the local likelihood smoothing approach across the samples. It can also identify de novo regions from WGBS data sets. On the other hand, BSmooth lacks suitable error measurement criteria within the identified DMRs. As a result, there is no way to check whether the CpG sites inside the predicted DMRs are true DMCs or were selected erroneously. BSmooth predicts methylation values of the CpG sites based on the last observed slope. Hence, for genomic regions that are not covered by reads, the last observed trend is simply continued, resulting in a biased estimate of the methylation level (i.e. extrapolated methylation values of 0 and 1) [66]. BSmooth is not applicable to data sets that do not have biological replicates. In addition, BSmooth is limited to comparisons between two groups of conditions.

Another approach in this category, ‘BiSeq’, performs the smoothing of methylation data across defined candidate regions instead of across the samples (as BSmooth does) [66]. The pipeline begins with defining CpG clusters within the genome based on a minimum number of ‘frequently covered CpG sites’ (CpG sites that are covered by the majority of samples) and a proximity distance threshold defined by the user. A smoothing function is modeled for each defined cluster. While modeling the smoothing function, the coverage information for each CpG site is taken into account to make sure that a CpG site with high coverage has a greater impact on the estimated methylation level than a CpG site with low coverage. Group effects of the CpG sites are modeled using beta regression with a probit link function. DMCs are identified using the Wald test procedure. Next, a hierarchical testing procedure is applied to identify significant clusters containing at least one DMC. While testing the target regions, a weighted FDR is applied to take into account the size of individual clusters [67]. A location-wise FDR approach is applied to trim the CpG sites that are not differentially methylated within the selected significant clusters.
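The idea of letting high-coverage sites dominate the smoothed estimate can be illustrated with a coverage-weighted kernel average. This is a deliberately simplified stand-in (a local constant estimator with a Gaussian kernel), not the local likelihood procedure used by BiSeq or BSmooth; the function name and the bandwidth value are assumptions.

```python
from math import exp


def smooth_methylation(positions, methylated, coverage, bandwidth=80.0):
    """Coverage-weighted kernel smoothing of methylation levels.

    Each site contributes its raw ratio m/n with weight kernel * n, so
    that well-covered sites influence the estimate more than poorly
    covered ones.
    """
    smoothed = []
    for target in positions:
        numerator = 0.0
        denominator = 0.0
        for pos, m, n in zip(positions, methylated, coverage):
            kernel = exp(-0.5 * ((pos - target) / bandwidth) ** 2)
            numerator += kernel * m      # kernel * n * (m / n)
            denominator += kernel * n
        smoothed.append(numerator / denominator if denominator > 0 else None)
    return smoothed


# The middle site has a single methylated read (raw level 1.0) but sits
# between two well-covered sites with levels of about 0.2.
estimates = smooth_methylation(
    positions=[100, 120, 140],
    methylated=[20, 1, 22],
    coverage=[100, 1, 100],
)
```

The raw level of 1.0 at the poorly covered middle site is pulled toward the level supported by its well-covered neighbors.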

One of the major contributions of the BiSeq approach is that it provides region-wise error control to test the target regions. This approach is also capable of adding additional covariates to the regression model. In contrast, one of the limitations of the BiSeq approach is that it is only suitable for analyzing experiments that have predefined regions, such as RRBS data sets.

In general, smoothing-based approaches have the advantage of considering the spatial correlation between the methylation levels of the CpG sites. By performing smoothing, the required sequencing coverage and the variance of the methylation levels can be reduced [43]. Furthermore, they can estimate the methylation levels of missing CpG sites. On the other hand, smoothing-based approaches cannot detect low CpG density regions where methylation changes sharply, such as transcription factor binding sites (TFBS). TFBS are usually small (i.e. <50 bp) and might consist of a single CpG that is differentially methylated [68]. Thus, biological events involving a single CpG site might not be detected by the smoothing approaches. In addition, these approaches are not appropriate for biological systems in which the true methylation levels of the CpG sites are not spatially correlated.

Beta-binomial-based approaches

Approaches in this category characterize the methylation read counts using a beta-binomial distribution. In the absence of any biological or technical variation, the methylated read count at a particular CpG site follows a binomial distribution because each sequencing read over a CpG site is either methylated or unmethylated. Whenever biological and technical variation are present in the data, the methylation proportions of the CpG sites are assumed to follow a beta distribution. Therefore, in the presence of biological replicates, an appropriate statistical model for methylation analysis is the beta-binomial model, as it can take into account both sampling and biological variability.
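The beta-binomial model can be written down compactly. The sketch below assumes a mean/dispersion parameterization (mean methylation level `mu` and dispersion `phi` with 0 < phi < 1, mapped to the usual beta parameters); this parameterization and the function names are choices made for illustration and may differ from those used by individual tools.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, mu, phi):
    """P(k methylated reads out of n) under a beta-binomial model.

    Assumed parameterization: alpha = mu * (1 - phi) / phi and
    beta = (1 - mu) * (1 - phi) / phi, so that phi = 1 / (alpha + beta + 1).
    """
    alpha = mu * (1 - phi) / phi
    beta = (1 - mu) * (1 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def beta_binomial_variance(n, mu, phi):
    """Variance of the methylated read count; phi -> 0 gives the binomial."""
    return n * mu * (1 - mu) * (1 + (n - 1) * phi)


n, mu, phi = 20, 0.3, 0.1
pmf = [beta_binomial_pmf(k, n, mu, phi) for k in range(n + 1)]
mean_count = sum(k * p for k, p in enumerate(pmf))
var_count = sum((k - mean_count) ** 2 * p for k, p in enumerate(pmf))
```

The factor `1 + (n - 1) * phi` is the overdispersion relative to a plain binomial: the binomial part describes sampling variability given the true methylation level, and the beta part describes how that level varies among replicates.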

Over the past few years, several beta-binomial-based approaches have been developed to identify DM, such as DSS [69], MOABS [70], RADMeth [71], methylSig [72], DSS-single [73], MACAU [74], DSS-general [75] and GetisDMR [76]. These approaches differ from each other in the way they estimate regression parameters, calculate P-values, estimate DMR boundaries, etc.

‘DSS’ is one of the approaches in this category that relies on a beta-binomial hierarchical model to identify DM using bisulfite sequencing data. In this model, the prior distribution is constructed from the whole genome, which is either methylated or unmethylated. True methylation proportions of the CpG sites among the replicates are then modeled using the beta distribution parameterized by a group mean and a dispersion parameter. The biological variability is captured by the beta distribution, whereas the sampling variability is captured by the binomial distribution. Variation of the methylation proportions of the CpG sites relative to the group mean is captured by the dispersion parameter, which is estimated by an empirical Bayes approach. When the sample size is small, a shrinkage approach is used to estimate the dispersion parameter to improve the overall performance. Differentially methylated CpG sites are determined using P-values from the Wald test, which compares the mean methylation levels between two groups. Lastly, candidate DMRs are defined by applying user-specified thresholds on DMR characteristics, including P-value, minimum length and minimum number of CpG sites.

The key contribution of the DSS approach is the shrinkage procedure that improves the dispersion parameter estimation. For this reason, this approach is particularly useful when the sample size is small. By applying the Wald test procedure, this approach takes into consideration the biological variation and sequencing coverage.

A more recent method named ‘DSS-single’ is an improved version of the DSS approach that can take into account the spatial correlation among the CpG sites across the genome. In addition, DSS-single considers the within-group variation without biological replicates by using the neighboring CpG sites as ‘pseudo-replicates’. Similar to DSS, DSS-single captures the technical variability using the binomial distribution and the biological variability using the beta distribution. The beta distribution is parameterized with the group mean and dispersion parameter. DSS-single estimates the group mean using a smoothing function and the dispersion parameter using an empirical Bayes procedure. Hypothesis testing is performed using the Wald test to identify the DMCs. Later, user-defined thresholds are applied to define the DMR boundaries and select candidate DMRs.

An even more recent variation of the DSS approach, named ‘DSS-general’, identifies differentially methylated loci (DML) from bisulfite sequencing data under a general experimental design. DSS-general identifies DML by modeling the methylation count data for each locus using beta-binomial regression with the ‘arcsine’ link function. The ‘arcsine’ link function is applied to perform a data transformation that decreases the dependency of the data variance on the mean and prepares it for the next step. Because of this data transformation, the regression coefficients and the variance matrix can be estimated by applying the generalized least squares method, as opposed to the beta-binomial generalized linear model or logistic regression, which are limited when values are separable (e.g. values for unmethylated sites are close to 0, values for methylated sites are close to 1). Finally, the Wald test is used to perform hypothesis testing.
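Why an arcsine transformation weakens the dependence of the variance on the mean can be seen from a short delta-method argument. The derivation below is a sketch that assumes the link has the form g(μ) = arcsin(2μ − 1) and uses a mean/dispersion parameterization of the beta-binomial (mean μ, dispersion φ, depth n, methylated count X); these symbols are introduced here for illustration only.

```latex
\begin{align*}
  g(\mu) &= \arcsin(2\mu - 1), \qquad
  g'(\mu) = \frac{2}{\sqrt{1 - (2\mu - 1)^2}} = \frac{1}{\sqrt{\mu(1-\mu)}}, \\[4pt]
  \operatorname{Var}\!\left(\frac{X}{n}\right)
    &= \frac{\mu(1-\mu)\,\bigl(1 + (n-1)\phi\bigr)}{n}, \\[4pt]
  \operatorname{Var}\!\left(g\!\left(\frac{X}{n}\right)\right)
    &\approx g'(\mu)^2 \,\operatorname{Var}\!\left(\frac{X}{n}\right)
     = \frac{1 + (n-1)\phi}{n}.
\end{align*}
```

To first order, the variance of the transformed proportion depends only on the depth and the dispersion, not on the methylation level itself, which is the property that makes a least-squares fit on the transformed scale reasonable.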

The key advantage of the DSS-general approach is that it is applicable to bisulfite sequencing data with multiple groups or covariates. In addition, it uses the ‘arcsine’ link function, which is more computationally efficient than the widely used ‘logit’ and ‘probit’ functions because it allows the regression parameters to be estimated in one iteration.

‘MOABS’ is another approach that relies on the beta-binomial assumption to identify DM. Similar to DSS, the prior distribution is constructed from the whole genome, resulting in a bimodal distribution. The posterior distribution follows a beta distribution, which is estimated using an empirical Bayes approach. When biological replicates are available, the posterior distribution is generated using the maximum likelihood approach. The significance of the DM between two samples is represented by a single metric named ‘credible methylation difference’, which incorporates both the biological and statistical significance of the DM. MOABS can also work with CHG or CHH methylation.

‘RADMeth’ is another analysis pipeline that relies on the beta-binomial assumption. RADMeth uses a beta-binomial regression approach with the ‘logit’ link function to model the methylation levels of the CpG sites across the samples. Regression parameters are estimated using a standard maximum likelihood approach. In the beta-binomial regression model, RADMeth incorporates the experimental factors using a model matrix. The DM of a particular site is determined by comparing two fitted regression models (i.e. a reduced model without factors and a full model with factors) using the log-likelihood ratio. Subsequently, P-values of the neighboring CpG sites are combined using the weighted Z-test (i.e. Stouffer-Liptak test [77]) to obtain the DMRs. The key contribution of this approach is the ability to analyze WGBS data in multiple-factor experiments.

‘MethylSig’ is another analysis pipeline that uses a beta-binomial model across the samples to identify either DMCs or DMRs. The pipeline takes the number of Cs and Ts as input. The approach uses the beta-binomial model to estimate the methylation levels at each CpG site or region, which involves the following two steps: (i) estimate the dispersion parameter for each CpG site or region, which accounts for biological variation among the samples within a group, and (ii) calculate the group methylation level at each CpG site or region using the estimated dispersion parameters. In each step, local information can be incorporated from nearby CpG sites or regions to increase statistical power. The significance level of the methylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when the sample size is small. MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance.

‘MACAU’ is based on a binomial mixed model (BMM) that takes into account the population structure of a data set. This model is a generalized beta-binomial model with an extra term to model the population structure. In the absence of that extra term, this model reduces to a beta-binomial model. In this approach, the prior distribution is constructed from a BMM, whereas the posterior distribution is constructed from a log-normal distribution. Model parameters are estimated using a Markov chain Monte Carlo (MCMC) algorithm-based approach. Hypothesis testing is performed using the Wald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that a predictor variable of interest can be added to the model to check the association with any genetic background. In addition to considering biological variability among the replicates and the sampling variability among the sequencing reads, this method also takes the population variability into consideration. Furthermore, it can be applied to both WGBS and RRBS data sets.

‘GetisDMR’, a recent beta-binomial-based approach, identifies variable-size DMRs directly from WGBS data using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatial clusters (hotspots). By incorporating this statistic into DM analysis, GetisDMR accounts for spatial correlation among the methylation levels of the CpG sites along with the biological and sampling variability. When biological replicates are available, beta-binomial regression with a logistic link function is used to model the methylation level of each CpG site. Model parameters are estimated by maximum likelihood. Hypothesis testing is performed using the likelihood ratio test. In the absence of biological replicates, methylation levels are modeled using the binomial distribution and hypothesis testing is performed using FET. P-values from the hypothesis testing are then used to calculate z-scores. Finally, a local Getis-Ord statistic is computed from the z-scores to identify DMRs using the information from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e. z-scores) to compute a score of the nonrandom association between a data point and its neighbors, where a positive score shows a positive association and a negative score shows a negative association. This statistic is then used to identify data regions with points that exhibit nonrandom associations (i.e. DMRs).
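The hotspot idea can be sketched with the textbook form of the local Getis-Ord Gi* statistic applied to per-site z-scores. The binary distance-based weights, the `radius` parameter and the function name are assumptions made for this illustration; the statistic used inside GetisDMR may be defined differently.

```python
from math import sqrt


def getis_ord_gi_star(values, positions, radius):
    """Local Getis-Ord Gi* score for every site.

    Binary weights are assumed: w_ij = 1 when site j lies within
    `radius` bases of site i (including i itself), otherwise 0.
    """
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for center in positions:
        weights = [1.0 if abs(pos - center) <= radius else 0.0 for pos in positions]
        weight_sum = sum(weights)
        weight_sq_sum = sum(w * w for w in weights)
        numerator = sum(w * v for w, v in zip(weights, values)) - mean * weight_sum
        denominator = std * sqrt((n * weight_sq_sum - weight_sum ** 2) / (n - 1))
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


# Site-level z-scores: three neighboring sites carry a strong signal.
z_scores = [0.1, -0.2, 0.0, 3.0, 3.2, 2.9, 0.1, -0.1, 0.2, 0.0]
positions = [0, 500, 1000, 1500, 1550, 1600, 2100, 2600, 3100, 3600]
gi_star = getis_ord_gi_star(z_scores, positions, radius=100)
```

Sites inside the cluster of large z-scores receive a clearly positive score, whereas isolated sites with z-scores near zero receive scores below zero because they fall under the global mean.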

One of the primary strengths of GetisDMR is that it can detect DMRs of variable length instead of depending on user-specified threshold parameters. It can take into account the spatial correlation between the neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model. Furthermore, it can work with multiple groups, with or without biological replicates. One drawback of this approach is that it cannot work with data from enriched regions, such as RRBS data.

Beta-binomial-based approaches are useful because they take into account both sampling variability among the read counts and biological variability among the replicates. Furthermore, these approaches are able to identify DM at single-base resolution from low CpG-density regions (e.g. TFBS). On the other hand, most of the beta-binomial-based approaches (except DSS-single, MACAU and GetisDMR) do not take into account the spatial correlation between the methylation levels of the CpG sites.

Hidden Markov model-based approaches

Approaches in this category use a hidden Markov model (HMM) to identify differentially methylated patterns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites as methylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methylation values. Transition probabilities among the methylation states represent the distance distribution among the DMCs, whereas emission probabilities represent the likelihood of DM for the CpG sites. High and low transition probabilities are used to model neighboring CpG sites whose methylation levels have high and low similarity, respectively. Parameters are usually estimated using established learning algorithms, whereas potential DMRs are identified using different statistical approaches.
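How an HMM turns noisy site-level differences into contiguous state calls can be illustrated with a small Viterbi decoder over the three states named above. All numbers in this sketch (Gaussian emissions centered at -0.4, 0 and 0.4, a self-transition probability of 0.9, a uniform initial distribution) are arbitrary assumptions for illustration and are not taken from ComMet, HMM-Fisher or any other tool.

```python
from math import log
from statistics import NormalDist

STATES = ("hypo", "none", "hyper")

# Assumed emission model for the observed methylation difference at a site.
EMISSION = {
    "hypo": NormalDist(-0.4, 0.15),
    "none": NormalDist(0.0, 0.15),
    "hyper": NormalDist(0.4, 0.15),
}


def viterbi(differences, stay=0.9):
    """Most likely state path for a sequence of methylation differences."""
    switch = (1 - stay) / (len(STATES) - 1)

    def log_transition(previous, current):
        return log(stay if previous == current else switch)

    score = {
        s: log(1 / len(STATES)) + log(EMISSION[s].pdf(differences[0]))
        for s in STATES
    }
    backpointers = []
    for diff in differences[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, s))
            pointers[s] = best_prev
            new_score[s] = (
                score[best_prev] + log_transition(best_prev, s)
                + log(EMISSION[s].pdf(diff))
            )
        backpointers.append(pointers)
        score = new_score

    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi([0.02, -0.05, 0.35, 0.10, 0.45, 0.40, 0.0, -0.03])
```

The fourth site (difference 0.10) looks unchanged on its own, but the high self-transition probability keeps it inside the surrounding hypermethylated segment, which is how the transition probabilities encode similarity between neighboring CpG sites.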

One of the approaches in this category, named ‘ComMet’ [64] and included in the Bisulfighter methylation analysis suite [78, 79], combines all the samples within a group into one sample and identifies the DMRs by comparing the resulting pair of samples. This method captures the probability distribution of distances between the neighboring DMCs and adjusts the DMC chaining criteria automatically for each data set. Transition probabilities are estimated using an expectation maximization algorithm, whereas emission probabilities are estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated using an unsupervised learning algorithm. DMRs are identified using a dynamic programming algorithm.

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. It takes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. On the other hand, one of the limitations of this approach is that it does not take into account the biological variation across replicates, which might lead to a higher number of false positives in the results [14, 43, 46].

Another approach in this category is ‘HMM-Fisher’ [80], which estimates the methylation status of the CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fisher models both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites using transition probabilities. HMM-Fisher estimates the transition probabilities using a Dirichlet distribution, whereas emission probabilities are computed using a truncated normal distribution. After estimating the methylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identified using FET. Identified DMCs are further grouped into DMRs if the distance between the CpG sites is <100 bases. Non-consecutive CpG sites are reported as DMCs in the output.
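The grouping rule in the last two sentences can be sketched as follows. This is a minimal sketch and not HMM-Fisher's code: the function name `group_dmcs` and the parameter name `max_gap` are hypothetical, and the default of 100 bases is taken from the description above.

```python
def group_dmcs(positions, max_gap=100):
    """Chain DMC positions on one chromosome into DMRs.

    Neighboring DMCs closer than max_gap bases are chained into one region;
    DMCs with no such neighbor are returned separately as isolated DMCs.
    Returns (dmrs, isolated), where each DMR is (start, end, n_sites).
    """
    dmrs, isolated, current = [], [], []

    def flush(chain):
        if len(chain) > 1:
            dmrs.append((chain[0], chain[-1], len(chain)))
        elif chain:
            isolated.append(chain[0])

    for pos in sorted(positions):
        if current and pos - current[-1] < max_gap:
            current.append(pos)
        else:
            flush(current)
            current = [pos]
    flush(current)
    return dmrs, isolated


dmrs, isolated = group_dmcs([100, 150, 190, 1000, 1500, 1560])
```

In this example the first three sites form one region, the last two form another, and the site at position 1000 stays an isolated DMC.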

One of the major contributions of HMM-Fisher is that it can identify DMRs of variable size instead of depending on user-defined boundary thresholds. It takes the biological variation among the replicates into account and can provide both DMCs and DMRs as output. It can also be used to identify sample-wise methylation patterns.

‘HMM-DM’ [81] is another approach that uses an HMM to identify DM. HMM-DM directly estimates the DM states of the CpG sites for each sample across the groups. In this approach, the transition probability of each CpG site depends only on the methylation state of the immediately preceding CpG site. Like HMM-Fisher and ComMet, the transition probabilities are estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable-size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs of variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] to identify DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites in those regions. The probability that a sample is methylated at a specific location is calculated as the ratio of the methylation level of that sample to the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents a higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in minor samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make the approach less sensitive to such outliers. Finally, a region is called differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
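The unweighted entropy described above can be written down in a few lines. The sketch below covers only the original Shannon entropy step; the Tukey biweight weighting and the probability model for the cutoff are not reproduced. The function name `methylation_entropy` and the handling of an all-zero region are assumptions made for this example.

```python
import math


def methylation_entropy(levels):
    """Shannon entropy (in bits) of one region across samples.

    levels : mean methylation level of the region in each sample (0..1).
    Each sample's probability is its level divided by the total level over
    all samples; lower entropy means a larger difference between samples.
    """
    total = sum(levels)
    if total == 0:
        # Assumption: a region unmethylated in every sample shows no
        # difference, so it is given the maximum entropy log2(n).
        return math.log2(len(levels))
    probabilities = [level / total for level in levels]
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


uniform = methylation_entropy([0.5, 0.5, 0.5, 0.5])    # same level everywhere
specific = methylation_entropy([0.9, 0.0, 0.0, 0.0])   # one sample methylated
```

A region with identical levels in four samples reaches the maximum of 2 bits, whereas a region methylated in a single sample has entropy 0.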

An improved approach in this category, ‘CpG_MPs’ [51], has been proposed by the same research group; it can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites, based on their normalized methylation levels, into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs by using a combinatorial algorithm with Shannon entropy. At first, the identified methylated and unmethylated regions are mapped to the reference genome, and then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measurement among the identified regions.

A more recent approach, ‘SMART’ [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites by estimating the methylation level of the sites based on Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation levels of that group are significantly higher (lower) than the average methylation levels of all samples, as determined by a one-sample t-test.

A major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and the spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests, such as FET, t-test and ANOVA, to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts by taking aligned read counts (e.g. the output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in the Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET or ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. A major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.
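For the pairwise case, the per-window P-value can be illustrated with a two-sided Fisher's exact test on a 2×2 table of methylated and unmethylated counts. The sketch below is a standalone implementation written for this example, not DMAP's code; the function name and the argument layout are assumptions.

```python
from math import comb


def fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b):
    """Two-sided Fisher's exact test P-value for one window.

    The 2x2 table holds methylated / unmethylated counts for samples A and B.
    The P-value sums the hypergeometric probabilities of all tables (with the
    same margins) that are no more likely than the observed one.
    """
    row_a = meth_a + unmeth_a
    row_b = meth_b + unmeth_b
    col_meth = meth_a + meth_b
    total = row_a + row_b
    denominator = comb(total, col_meth)

    def probability(x):
        # Probability that sample A contributes x of the methylated counts.
        return comb(row_a, x) * comb(row_b, col_meth - x) / denominator

    lowest = max(0, col_meth - row_b)
    highest = min(row_a, col_meth)
    observed = probability(meth_a)
    p_value = sum(probability(x) for x in range(lowest, highest + 1)
                  if probability(x) <= observed * (1 + 1e-9))
    return min(1.0, p_value)


p = fisher_exact_two_sided(8, 2, 1, 5)
```

A window would then be kept as a candidate when `p` falls below the user-defined threshold, optionally after correction for multiple comparisons.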

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins by taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
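The correction and merging steps can be sketched as follows. The text only says "the FDR approach"; the Benjamini-Hochberg procedure used here is an assumption, as are the function names and the `alpha` and `max_gap` parameters. The re-testing of merged regions that swDMR performs is not reproduced.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def merge_significant_windows(windows, adjusted, alpha=0.05, max_gap=100):
    """Keep windows with adjusted P < alpha and merge those closer than max_gap.

    windows : list of (start, end) coordinates on one chromosome.
    """
    kept = sorted(w for w, q in zip(windows, adjusted) if q < alpha)
    merged = []
    for start, end in kept:
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


windows = [(0, 200), (100, 300), (1000, 1200), (5000, 5200)]
q_values = benjamini_hochberg([0.001, 0.002, 0.5, 0.003])
candidates = merge_significant_windows(windows, q_values)
```

Here the two overlapping significant windows collapse into one candidate region, the non-significant window is dropped and the distant window stays separate.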

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG density regions where methylation has sharp changes, such as TFBS that can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples, with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.
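The recursive search for a window with a strong mean difference can be illustrated with a simplified sketch in the spirit of circular binary segmentation. It is not metilene's algorithm: the scoring function (absolute sum of differences divided by the square root of the window length), the `min_mean_diff` stopping rule used in place of the two-dimensional Kolmogorov–Smirnov test, and all names are assumptions made for this example.

```python
def best_window(diffs, min_cpgs):
    """Return (score, start, end) of the best-scoring window [start, end)."""
    n = len(diffs)
    prefix = [0.0]
    for d in diffs:
        prefix.append(prefix[-1] + d)
    best = None
    for i in range(n):
        for j in range(i + min_cpgs, n + 1):
            score = abs(prefix[j] - prefix[i]) / (j - i) ** 0.5
            if best is None or score > best[0]:
                best = (score, i, j)
    return best


def segment(diffs, min_cpgs=3, min_mean_diff=0.2, offset=0):
    """Recursively split per-CpG mean differences into candidate regions.

    diffs : per-CpG difference of group mean methylation levels.
    Returns a list of (start_index, end_index, mean_difference).
    """
    if len(diffs) < min_cpgs:
        return []  # too few CpGs left: stop
    _, i, j = best_window(diffs, min_cpgs)
    mean = sum(diffs[i:j]) / (j - i)
    if abs(mean) < min_mean_diff:
        return []  # no sufficiently strong signal in this segment: stop
    if i == 0 and j == len(diffs):
        return [(offset, offset + j, mean)]  # cannot be refined further
    # Circular split: recurse on the window and on both flanks.
    return (segment(diffs[:i], min_cpgs, min_mean_diff, offset)
            + segment(diffs[i:j], min_cpgs, min_mean_diff, offset + i)
            + segment(diffs[j:], min_cpgs, min_mean_diff, offset + j))


diffs = [0.0, 0.01, -0.02, 0.5, 0.6, 0.55, 0.5, 0.0, 0.02, -0.01]
regions = segment(diffs)
```

The example also shows why the minimum segment size matters: with a larger `min_cpgs` the four-site block could no longer be isolated from its flanks.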

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using a beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflows of the approaches, including the information about genome segmentation, difference quantification and DMR calling, are described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: methylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, methylSig, MACAU, GetisDMR and ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP and metilene; (iii) both count and ratio data: DMAP and swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because accounting for it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Among the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.

Spatial correlation is another factor to consider, as it provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.

Table 1. Summary of the important characteristics of the 22 surveyed approaches

Columns: (1) Method and reference; (2) Concept used; (3) Protocol; (4) Primary purpose; (5) Biological variation; (6) Spatial distribution; (7) Additional covariates; (8) Error correction; (9) Sequencing coverage; (10) Identify de novo region; (11) Total citations; (12) Citations/year. The rows below give columns 1–4; the marks in columns 5–10 and the figures in columns 11–12 are not legible in this copy.

1. methylKit [54]; Logistic regression; Both; Identify DMCs and annotate
2. eDMR [64]; Logistic regression; Both; Identify DMCs and DMRs
3. BSmooth [43]; Smoothing; WGBS; Identify DMRs with replicates
4. BiSeq [66]; Smoothing; RRBS; Identify DMRs with FDR correction
6. DSS [69]; Beta-binomial; Both; Identify DMLs for small samples
5. MOABS [70]; Beta-binomial; Both; Identify DMCs with replicates
7. RADMeth [71]; Beta-binomial; WGBS; Identify DMLs and DMRs
8. methylSig [72]; Beta-binomial; Both; Identify DMCs and DMRs
9. DSS-single [73]; Beta-binomial; Both; Identify DMRs without replicates
10. MACAU [74]; Beta-binomial; Both; Identify DM using population structure
11. DSS-general [75]; Beta-binomial; RRBS; Identify DMLs
12. GetisDMR [76]; Beta-binomial; WGBS; Identify DMRs directly
13. ComMet [78]; HMM; Both; Identify DMRs
14. HMM-Fisher [80]; HMM; Both; Identify DM patterns
15. HMM-DM [81]; HMM; Both; Identify DMRs
16. QDMR [83]; Shannon entropy; RRBS; Identify DMRs
17. CpG_MPs [51]; Shannon entropy; WGBS; Identify DM patterns
18. SMART [84]; Shannon entropy; WGBS; Identify cell type-specific methylation marks
19. COHCAP [46]; Mixed statistics; RRBS; Identify DMCs and consistent CpG islands
20. DMAP [85]; Mixed statistics; Both; Identify DMRs and DMFs
21. swDMR [86]; Mixed statistics; WGBS; Identify DMRs without replicates
22. metilene [87]; Binary segmentation; Both; Identify DMRs in large groups of samples

Notes: In columns 5–10, the marks indicate whether a method considers the characteristic, does not consider it, or whether the characteristic is not applicable. In the 9th column, a further mark indicates that the method considers sequencing coverage when count-based hypothesis tests are performed. In the 10th column (identify de novo regions), the marks indicate whether the method can or cannot identify de novo regions. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.

eDMR uses autocorrelation of the methylation data; HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use an HMM; CpG_MPs uses a hotspot extension algorithm; and SMART uses Euclidean distance based on methylation similarity to take into account the spatial correlation of the CpG sites.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information does not depend merely on the hypothesis tests but on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the counts of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher’s exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey, based on the data type used when modeling the methylation levels of the CpG sites.

Table 2. Comparison of the available implementations of the 22 surveyed approaches

Columns: Method (tool) and tool reference; Platform; Availability; License; Output; Published date; Updated date.

1. methylKit [54]; Bioconductor R package; Standalone; Artistic v2; DMCs/DMRs list (table), DMCs/DMRs per chromosome (graph); 9 November 2011; 22 October 2016
2. eDMR [54, 64]; R package; Standalone; Artistic GPL; DMRs list (table); 4 January 2013; 4 April 2014
3. BSmooth (bsseq) [43]; Bioconductor R package; Standalone; Artistic v2; DMRs list (table), DMR locus methylation level (graph); 20 July 2012; 14 October 2016
4. BiSeq [88]; Bioconductor R package; Standalone; LGPL v3; DMRs list (table), DMR mean methylation (graph); 2 April 2013; 17 October 2016
6. DSS [69, 73, 75, 89]; Bioconductor R package; Standalone; GNU GPL; DMCs/DMRs list (table), DMR methylation locus (graph); 04 June 2012; 17 October 2016
5. MOABS [70]; C++ package and Perl script; Standalone; GNU GPL v3; DMRs list (table); 12 June 2013; 30 May 2015
7. RADMeth [71]; C++ package; Standalone; GNU GPL v3; DMCs/DMRs list (table); 27 March 2014; 1 May 2014 (a)
8. methylSig [72]; R package; Standalone; GNU GPL v3; DMCs/DMRs list (table), CpG sites methylation rate (graph); 17 June 2014; 10 June 2016
9. DSS-single (DSS) [69, 73, 75, 89]; Bioconductor R package; Standalone; GNU GPL; DMCs/DMRs list (table), DMR methylation locus (graph); 16 April 2015; 17 October 2016
10. MACAU [74]; C++ package and R script; Standalone; GNU GPL; DMCs list (table); 5 June 2015; 9 December 2015
11. DSS-general (DSS) [69, 73, 75, 89]; Bioconductor R package; Standalone; GNU GPL; DMCs/DMRs list (table), DMR methylation locus (graph); 29 April 2015; 17 October 2016
12. GetisDMR [76]; C++ package and R scripts; Standalone; GNU GPL; DMRs list (table); 28 April 2016; 28 September 2016
13. ComMet (Bisulfighter) [78]; C++ package and Python; Standalone; CCANS; DMRs list (table); 12 December 2014; 29 September 2015
14. HMM-Fisher [80]; R scripts; Standalone; None; DMRs list (table), DMR locus methylation level (graph); 25 April 2014; 29 February 2016
15. HMM-DM [81]; R scripts; Standalone; None; DMRs list (table), DMR locus methylation level (graph); 27 March 2014; 24 March 2016
16. QDMR [83]; Java package; Standalone, web, CLI; Custom (b); DMRs list (table), DMR in UCSC Genome Browser (graph); 10 May 2010; 17 October 2012
17. CpG_MPs [51]; Java package and Perl script; Standalone, web, CLI; None; DMRs list (table); 20 June 2011; 1 September 2015
18. SMART (SMART-BS-Seq) [84]; Python package; Standalone; PSFL; DMRs list (table); 17 May 2015; 17 May 2015
19. COHCAP [46]; Bioconductor R package; Standalone; GNU GPL v3; DMCs and DM CpG islands list (table), DM CpG islands methylation average (graph); 9 January 2014; 17 October 2016
20. DMAP (meth_progs_dist) [85]; C package; Standalone; None; DMRs list (table); 14 May 2013; 28 August 2016
21. swDMR [86]; Perl and R scripts; Standalone; GNU GPL v3; DMRs list (table), DMR methylation level (graph); 6 January 2013; 15 June 2014
22. metilene [87]; C package; Standalone; GNU GPL v2; DMRs list (table); 8 May 2015; 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic non-profit organizations on non-commercial projects.
GNU, general public license; LGPL, lesser general public license; CCANS, Creative Commons Attribution-NonCommercial-ShareAlike 3.0 unported license; PSFL, Python Software Foundation license; CLI, command line interface.

additional filters to remove low-coverage CpG sites before estimating methylation.
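The bias introduced by pooling, and the kind of coverage filter mentioned above, can be illustrated with a small numerical sketch. The function names, the data layout (one `(methylated_reads, total_reads)` pair per replicate) and the `min_coverage` default are assumptions made for this example; every replicate is assumed to have at least one read.

```python
def pooled_level(counts):
    """Methylation level after pooling the read counts of all replicates."""
    return sum(m for m, _ in counts) / sum(n for _, n in counts)


def mean_of_ratios(counts):
    """Average of the per-replicate methylation ratios (equal weights)."""
    return sum(m / n for m, n in counts) / len(counts)


def filter_low_coverage(sites, min_coverage=10):
    """Keep only sites where every replicate has at least min_coverage reads."""
    return {pos: counts for pos, counts in sites.items()
            if all(n >= min_coverage for _, n in counts)}


# One CpG site, two replicates: 9/10 reads methylated and 10/100 reads methylated.
site = [(9, 10), (10, 100)]
pooled = pooled_level(site)        # dominated by the deeply sequenced replicate
averaged = mean_of_ratios(site)    # each replicate counts equally

sites = {101: [(9, 10), (10, 100)], 202: [(1, 3), (40, 80)]}
kept = filter_low_coverage(sites)
```

Pooling gives 19/110 (about 0.17), whereas the average of the two ratios is 0.5: the replicate with ten times the coverage dominates the pooled estimate.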

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow more flexibility for the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When the sample size is small, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of individual methods using the receiver operating characteristic (ROC) curve based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Owing to the complex nature of the methylation data and the lack of both a gold standard for performance evaluation and a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows the great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers to design experiments that generate data better suited for the community. In addition, this survey will guide developers in building new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD, Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.

2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.

3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.

4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.

5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.

6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.

7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.

8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.

9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.

10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Rev Genet 2012;13(7):484–92.

11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.

12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.

13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.

14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.

15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.

16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.

17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.

18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.

19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.

20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.

21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.

22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.

23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.

24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.

25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore.

26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.

27. Chen PY, Cokus SJ, Pellegrini M. BS Seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.

28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.

29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.

30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.

31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.

32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.

33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.

34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.

35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.

36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.

37. Bock C, Reither S, Mikeska T, et al. BiQ Analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.

38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.

39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.

40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.

41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.

42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.

44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.

45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.

46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.

47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.

48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.

49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.

50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.

51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.

52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.

53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.

54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.

55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.

56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.

57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.

58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.

59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.

60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.

61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.

62. Gelman A. Analysis of variance - why it is more important than ever. Ann Stat 2005;33(1):1–53.

63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.

64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.

65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.

66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.

67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.

68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.

69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.

70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.

71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.

72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.

73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.

74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.

75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.

76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.

77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.

78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.

79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.

80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.

81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.

82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.

83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.

84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.

86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.

87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.

88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.

89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.


allowed for methylation analysis with reduced sequencing requirements through a more targeted approach for CpG-rich genomic regions that meet specific length requirements [19]. These techniques, therefore, are more affordable for studies with multiple replicates.

The overall workflow for bisulfite sequencing data analysis is displayed in Figure 2. The pipeline consists of six major elements: (i) the input, including methylation data (in FASTA/FASTQ format) and the reference genome; (ii) data processing and quality control; (iii) alignment of short reads to the reference genome; (iv) post-alignment analysis; (v) DM analysis; and (vi) the output, including DMCs, DMRs and methylation patterns. The details of each element are described in the following sections.

Pre-analysis

Data preprocessing

Bisulfite sequencing data consist of short read sequences in the FASTA/FASTQ file format. Data processing starts with performing quality control operations on the raw sequencing reads, including quality trimming and adapter trimming. Quality trimming reduces methylation call errors by trimming the bases that have poor quality scores, whereas adapter trimming removes the known adapters from short reads to increase mapping efficiency. Existing tools for quality control include FASTX-Toolkit [20], PRINSEQ [21], SolexaQA [22], Cutadapt [23], Trimmomatic [24] and Trim Galore [25]. Both the input and output of these tools are files in the FASTA/FASTQ format.

Read mapping

After quality control, bisulfite sequencing reads can be aligned to the reference genome to estimate the methylation levels. Simply aligning these reads by using standard aligners results in poor mapping efficiency because the bisulfite treatment introduces additional discrepancies between the sequencing reads and the reference genome by converting the unmethylated cytosines to thymines. Therefore, new strategies were proposed for bisulfite sequencing read alignment. Existing bisulfite sequencing alignment approaches can be divided into two categories: three-letter aligners and wildcard aligners. Three-letter aligners, such as Bismark [26], BS Seeker [27], MethylCoder [28], BRAT [29] and GNUMAP-bs [30], convert all Cs into Ts in the forward strand and all Gs into As in the reverse strand of the reference genome. Equivalently converted reads are then aligned to these pre-converted forms of the reference genome using standard genome aligners such as Bowtie [31] and Bowtie2 [32]. In contrast, wildcard aligners, such as BSMAP [33], RRBSMAP [34], GSNAP [35] and RMAP [36], replace the Cs of the reference genome with the wildcard letter Y that matches both Cs and Ts in the sequencing reads. The alignment results are usually stored in SAM/BAM file format.

Figure 2. The workflow of analyzing DNA methylation using bisulfite sequencing data.
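The in silico conversion used by three-letter aligners can be sketched in a few lines. This is only a toy illustration of the idea, assuming a forward-strand read that matches the reference exactly after conversion; the sequences and function names are hypothetical, and real aligners delegate the matching step to tools such as Bowtie.

```python
def convert_forward(seq):
    """C-to-T conversion applied to the forward strand."""
    return seq.upper().replace("C", "T")


def convert_reverse(seq):
    """G-to-A conversion applied to the reverse strand."""
    return seq.upper().replace("G", "A")


# Hypothetical reference fragment and bisulfite-treated forward-strand read.
reference = "ACGTCCGA"
read = "ATGTTCGA"  # Cs at indices 1 and 4 were converted; the C at index 5 was not.

# After conversion, read and reference agree, so a standard aligner can match them.
matches = convert_forward(read) == convert_forward(reference)

# Methylation is then called from the unconverted sequences:
# a C retained in the read at a reference C is taken as methylated.
calls = {i: read[i] == "C" for i, base in enumerate(reference) if base == "C"}
print(matches, calls)
```

A wildcard aligner would instead leave the read untouched and let each reference C match either C or T in the read.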

Post-alignment analysis

After mapping the reads, an optional post-alignment step can be performed to extract meaningful biological information from the alignment results before DM analysis. Several post-alignment analysis tools have been developed, including BiQ Analyzer [37], QUMA [38], BRAT [29], MethyQA [39], BSPAT [40] and MethGo [41]. Most of these tools provide summary statistics, quality assessment and visualization of the methylation data. Some of these tools include extra features such as read mapping (e.g. BSPAT and BRAT), identifying DNA methylation co-occurrence patterns (e.g. BSPAT), single nucleotide polymorphism and copy number variation calling (e.g. MethGo) and detecting allele-specific methylation patterns (e.g. BSPAT).

DM analysis

After obtaining the methylation information of the CpG sites, the next downstream step is typically DM analysis, which is usually done in the form of identifying DMCs or DMRs. Identification of DMCs involves comparing the methylation level at each CpG site across the phenotypes (two or more) and applying statistical tests for hypothesis testing. Identification of DMRs is usually a two-step process: (i) the identification of DMCs and (ii) grouping the neighboring DMCs as contiguous DMRs by certain distance criteria. However, some approaches can directly identify DMRs. DMCs/DMRs occasionally can be linked to transcriptional repression of the associated genes; therefore, they provide crucial biological insights that may lead to the development of potential drug candidates [1].
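The second step of the two-step DMR process, grouping neighboring DMCs by a distance criterion, might look like the sketch below. The thresholds `max_gap` and `min_sites`, the function name and the example coordinates are assumptions for illustration; each tool defines its own merging criteria.

```python
def merge_dmcs(positions, max_gap=100, min_sites=2):
    """Group DMC positions on one chromosome into (start, end, n_sites) regions."""
    regions = []
    current = []
    for pos in sorted(positions):
        # Close the current group when the next DMC is too far away.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the last group.
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Hypothetical DMC coordinates (bp) on a single chromosome.
dmc_positions = [1050, 1000, 1120, 5000, 5300, 5340]
dmrs = merge_dmcs(dmc_positions)
print(dmrs)
```

With these inputs, the isolated DMC at position 5000 is dropped because it has no neighbor within 100 bp.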

To identify putative DMCs/DMRs from bisulfite sequencing data, some characteristics need to be considered. One such characteristic is the ‘spatial correlation’ between the methylation levels of the neighboring CpG sites, which plays an important role in getting an accurate estimation of the methylation levels [3, 42]. Incorporating spatial correlation in DM analysis can reduce the required sequencing depth and can estimate the methylation status of the missing CpG sites [43]. ‘Sequencing depth’ is another important characteristic that is directly related to the certainty of the methylation scores of CpG sites. Considering sequencing depth while identifying DMRs is crucial because it can take into account the sampling variability that occurs during sequencing. Another such characteristic is ‘biological variation’ among replicates, which is crucial in identifying the regions that consistently differ between groups of samples [44, 45]. Ignoring biological variation while detecting DMRs might lead to a high number of false positives in the results [14, 43, 46]. This is because the methylation levels of the CpG sites are heterogeneous not only when the cell types are different but also when the cells are of the same type [47–50].

Classical hypothesis testing methods, such as Fisher’s exact test (FET), chi-square (χ²) test, regression approaches, t-test, moderated t-test, Goeman’s global test and analysis of variance (ANOVA), can be used to identify DM using bisulfite sequencing data [3, 46, 51, 52, 53]. These approaches can be divided into two categories based on the data type they use: count-based hypothesis tests and ratio-based hypothesis tests.

Count-based hypothesis tests

The inputs of these hypothesis testing methods are count values, which can be either the number of reads or the number of CpG sites in a predefined genomic region. FET is a classical statistical test used to determine whether there are nonrandom associations between two categorical variables. In the context of methylation analysis, we can use the data to build a contingency table where the two rows represent the two methylation states and the two columns represent a pair of samples. When applying FET to two groups of samples, the counts for a methylation status within each group are aggregated into a single number [54]. The chi-square test is another classical method to test the relationship between two categorical variables (methylated versus unmethylated). In contrast with FET, it allows for testing across multiple samples. As pointed out by Sun et al. [17] and Hurlbert [55], there are several issues related to the aggregation of read counts into a single number while applying tests of independence (FET and χ² test). First, the read counts are not independent; they represent different sets of interdependent or correlated observations. Thus, aggregating the counts violates the fundamental assumption underlying the test for independence. Second, due to uneven coverage of each individual site, the results are biased toward the samples with higher coverage. Third, by aggregating (summing) the counts, some of the biological variation (e.g. sample size, intra-group variance) is not taken into account by the hypothesis testing. Therefore, using FET and the χ² test to compare two groups of samples could lead to a high number of false positives [14, 43, 46].
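To make the contingency-table setup concrete, the sketch below computes a two-sided FET P-value for a single CpG site in a pair of samples, with rows as methylation states and columns as samples. It is a minimal standard-library implementation written for illustration; the function name and read counts are hypothetical, and the two-sided P-value is defined here as the sum of the probabilities of all tables no more likely than the observed one.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: methylated / unmethylated reads. Columns: sample 1 / sample 2.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x methylated reads in sample 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables with the same margins that are no more likely
    # than the observed table (small tolerance for floating-point ties).
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Hypothetical site: 18 of 20 reads methylated in sample 1, 6 of 20 in sample 2.
p_value = fisher_exact_two_sided(18, 6, 2, 14)
print(p_value)
```

When replicates are summed into a single table before calling such a function, the within-group variation is lost, which is the aggregation problem described above.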

Regression approaches (e.g. Poisson, quasi-Poisson, negative binomial regression) are primarily used for detecting differentially expressed genes using RNA-Seq data, but they can also be applied in the context of DM analysis [15]. For example, the read counts can be modeled using a Poisson distribution, and a modified Wald test can be used to detect DM as the difference between two Poisson means [56, 57].

Ratio-based hypothesis tests

These hypothesis tests use the methylation percentage (methylation ratio) instead of count values. For a particular CpG site, the methylation percentage is calculated by taking the ratio between the methylated read counts and the total read counts of that site. To compare the methylation levels between two groups (phenotypes) of samples, classical tests such as the t-test [58, 59], moderated t-test (limma) [60] or Goeman’s global test [61] can be used. While the t-test is a classical approach to compare the means, limma and Goeman’s test are empirical Bayesian approaches that were primarily designed to detect differentially expressed genes using microarray data. When analyzing methylation levels across multiple groups of samples, ANOVA [62] can be used instead of multiple pair-wise comparisons. Compared with count-based hypothesis tests, the ratio-based tests take into account the biological variation across multiple replicates. However, because they only take into account the ratio of the reads (methylated reads versus all reads), they ignore the sequencing depth within the CpG sites.
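The following sketch illustrates both points for a single CpG site: the methylation ratio is computed per replicate and compared between groups with a t-statistic, yet a site covered by 2 reads and one covered by 100 reads yield the same ratio. The unequal-variance (Welch) form of the statistic, the function names and the read counts are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean, variance


def methylation_ratio(methylated, total):
    """Fraction of reads at a CpG site that are methylated."""
    return methylated / total


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of per-replicate ratios."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# Hypothetical (methylated, total) read counts for three replicates per group.
group_a_counts = [(9, 10), (16, 20), (27, 30)]
group_b_counts = [(2, 10), (6, 20), (3, 30)]

ratios_a = [methylation_ratio(m, n) for m, n in group_a_counts]
ratios_b = [methylation_ratio(m, n) for m, n in group_b_counts]
t_stat = welch_t(ratios_a, ratios_b)  # approximately 10 for these counts

# Depth is invisible to the ratio: 1/2 and 50/100 are treated identically.
same_ratio = methylation_ratio(1, 2) == methylation_ratio(50, 100)
print(t_stat, same_ratio)
```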

Although classical hypothesis testing methods are straightforward and easy to use, they are not well suited to more sophisticated methylation analyses, such as identifying de novo regions, considering spatial correlation among the methylation levels of the CpG sites and estimating methylation levels of missing CpG sites. Over the past few years, several approaches have been developed to address these challenges, which are discussed and summarized in the following subsections.

Logistic regression-based approaches

Approaches in this category model the read counts of the CpG sites by using logistic regression to identify DM. One of the popular approaches in this category is ‘methylKit’ [54], which uses logistic regression to model the methylation proportion at a given base or region when biological replicates are available. In the absence of biological replicates, methylKit uses FET to identify DM. P-values are corrected using the false discovery rate (FDR) approach or the sliding linear model approach [63]. methylKit is commonly used to identify DMCs from predefined regions (RRBS data). However, it can also be used to identify DMRs from WGBS data based on user-defined tiling windows. A major contribution of methylKit is that it takes the sequencing coverage into account. It can incorporate additional covariates into the model and work with CHG or CHH methylation. It also provides functionalities such as sample-wise methylation summary, sample clustering, annotation and visualization of DM.

Another method, named ‘eDMR’ [64], was proposed as an extension of methylKit. eDMR models the distances between the neighboring CpG sites using a bimodal normal distribution and estimates DMR boundaries using a weighted cost function. After estimating the regional boundaries, DMRs are filtered based on the mean methylation difference, the number of DMCs and the number of CpG sites. The significance of the DMRs is calculated by combining the P-values of the DMCs using the Stouffer-Liptak method [65]. The P-values for DMRs are then corrected for multiple comparisons using the FDR method. eDMR provides a list of DMRs and their annotation as output.
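The idea of combining site-level P-values into one regional P-value can be sketched with the basic weighted Stouffer-Liptak Z-method. This simplified version assumes independent one-sided P-values strictly between 0 and 1 and does not include the correction for spatial autocorrelation used in practice [65]; the function name and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_liptak(p_values, weights=None):
    """Combine one-sided P-values with the (weighted) Stouffer-Liptak method.

    Assumes independent P-values, each strictly between 0 and 1.
    """
    normal = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    # Convert each P-value to a Z-score, then take the weighted sum.
    z_scores = [normal.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores))
    combined_z /= sqrt(sum(w * w for w in weights))
    return 1.0 - normal.cdf(combined_z)


# Hypothetical P-values of three DMCs inside one candidate region.
region_p = stouffer_liptak([0.01, 0.02, 0.03])
print(region_p)
```

Several moderately small site-level P-values combine into a smaller regional P-value, which is why the regional values still need a multiple-testing correction afterward.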

Approaches in this category take sequencing coverage into account. They can incorporate additional covariates into the model as well. However, they do not consider the biological variation among the replicates. Although eDMR estimates the significance of the identified regions based on spatial autocorrelation, it does not consider the spatial correlation among the CpG sites when estimating the methylation levels.

Smoothing-based approaches

Approaches in this category assume that methylation levels of the CpG sites vary smoothly across the genome. They perform ‘smoothing’ across the samples or predefined regions, which is a technique to estimate the methylation levels of the CpG sites by borrowing information from their neighbors. Group differences across different conditions are computed based on the estimated methylation values of the CpG sites. Finally, different statistical tests are used to identify the differentially methylated sites or regions.
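The notion of borrowing information from neighboring CpG sites can be illustrated with a simple kernel-weighted estimate. This sketch is a deliberately simplified stand-in, not the local likelihood smoothing used by the tools discussed in this section; the triangular kernel, the `bandwidth` parameter and the example counts are assumptions.

```python
def smooth_methylation(positions, methylated, coverage, bandwidth=200):
    """Kernel-weighted methylation estimate at each CpG site.

    Each estimate pools methylated and total read counts from sites closer
    than `bandwidth` bp, so nearby high-coverage sites contribute the most.
    """
    smoothed = []
    for center in positions:
        num = den = 0.0
        for pos, m, n in zip(positions, methylated, coverage):
            dist = abs(pos - center)
            if dist < bandwidth:
                kernel = 1.0 - dist / bandwidth  # triangular kernel weight
                num += kernel * m
                den += kernel * n
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed


# Hypothetical sites: the second site has a single read, so its raw ratio (1.0)
# is unreliable, but its well-covered neighbors stabilize the estimate.
positions = [100, 150, 200, 1000]
methylated = [8, 1, 18, 0]
coverage = [10, 1, 20, 5]
estimates = smooth_methylation(positions, methylated, coverage)
print(estimates)
```

The last site has no neighbors within the bandwidth, so its estimate is just its own raw ratio, which shows how little a smoother can add in sparsely covered regions.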

One of the most commonly used smoothing-based approaches is ‘BSmooth’ [43], which relies on smoothing across the genome within each sample. It looks for group differences via CpG-wise t-tests to identify DMRs between two groups. The BSmooth algorithm begins with aligning the sequencing reads to the reference genome. Two alternative pipelines are available for the users to align the reads. The first pipeline, which supports gapped alignment and the alignment of paired-end bisulfite-treated reads, is based on in silico bisulfite conversion and uses the ‘Bowtie 2’ aligner to align the reads [32]. The second pipeline is based on a newly developed aligner named ‘Merman’, which supports the alignment of colorspace bisulfite reads. After aligning the reads, sample-specific quality assessment metrics are compiled. Local likelihood smoothing is applied within a smoothing window across the samples to estimate the methylation levels of the CpG sites. A signal-to-noise statistic similar to the t-test is used to identify the DMCs. Finally, DMRs are defined by merging the consecutive DMCs based on some defined criteria, such as a cutoff value of the t-statistic, maximum distance between the CpG sites and minimum number of CpG sites.

BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. It reduces the required sequencing coverage by applying the local likelihood smoothing approach across the samples. It can also identify de novo regions from WGBS data sets. On the other hand, BSmooth lacks suitable error measurement criteria within the identified DMRs. As a result, there is no way to check whether the identified CpG sites inside the predicted DMRs are true DMCs or were selected erroneously. BSmooth predicts methylation values of the CpG sites based on the last observed slope. Hence, for the genomic regions that are not covered by the reads, the previously observed methylation trend will continue, resulting in a biased estimation of the methylation level (i.e. extrapolated methylation values of 0 and 1) [66]. BSmooth is not applicable to data sets that do not have biological replicates. In addition, BSmooth is limited to comparisons between two groups of conditions.

Another approach in this category, ‘BiSeq’, performs the smoothing of methylation data across defined candidate regions instead of across the samples (like BSmooth) [66]. The pipeline begins with defining CpG clusters within the genome based on a minimum number of ‘frequently covered CpG sites’ (CpG sites that are covered by the majority of samples) and a proximity distance threshold defined by the user. A smoothing function is modeled for each defined cluster. While modeling the smoothing function, the coverage information for each CpG site is taken into account to make sure that the CpG site with high coverage has a greater impact on the estimated methylation level than the CpG site with low coverage. Group effects of the CpG sites are modeled using beta regression with probit link function. DMCs are identified using the Wald test procedure. Next, a hierarchical testing procedure is applied to identify significant clusters containing at least one DMC. While testing the target regions, weighted FDR is applied to take into account the size of individual clusters [67]. A location-wise FDR approach is applied to trim the CpG sites that are not differentially methylated within the selected significant clusters.

One of the major contributions of the BiSeq approach is that it provides region-wise error control measurement to test the target regions. This approach is also capable of adding additional covariates to the regression model. In contrast, one of the limitations of the BiSeq approach is that it is only suitable for analyzing experiments that have predefined regions, such as RRBS data sets.

In general, smoothing-based approaches have the advantage of considering the spatial correlation between the methylation levels of the CpG sites. By performing smoothing, the required sequencing coverage and the variance of the methylation levels can be reduced [43]. Furthermore, they can estimate the methylation levels of missing CpG sites. On the other hand, smoothing-based approaches cannot detect the low CpG density regions where methylation has sharp changes, such as transcription factor binding sites (TFBS). TFBS are usually small (i.e. <50 bp) and might consist of a single CpG that is differentially methylated [68]. Thus, biological events involving a single CpG site might not be detected by the smoothing approaches. In addition, these approaches are not appropriate for biological systems whose true methylation levels of the CpG sites are not spatially correlated.

Beta-binomial-based approaches

Approaches in this category model the methylation read counts with a beta-binomial distribution. In the absence of any biological or technical variation, the methylated read count of a particular CpG site follows a binomial distribution because sequencing reads over a CpG site can be either methylated or unmethylated. Whenever biological and technical variation are present in the data, methylation proportions of the CpG sites are assumed to follow a beta distribution. Therefore, in the presence of biological replicates, an appropriate statistical model for methylation analysis is the beta-binomial model, as it can take into account both sampling and biological variability.
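
For reference, the hierarchical model described above can be written compactly. The notation is our own assumption and is not shared by all of the surveyed methods: X_ij and N_ij denote the methylated and total read counts at CpG site i in replicate j, p_ij is the true methylation proportion, and the beta distribution is re-expressed through a mean μ_i and a dispersion φ_i.

```latex
\begin{align}
X_{ij} \mid p_{ij} &\sim \mathrm{Binomial}(N_{ij},\, p_{ij}), \\
p_{ij} &\sim \mathrm{Beta}(\alpha_i,\, \beta_i), \qquad
\mu_i = \frac{\alpha_i}{\alpha_i + \beta_i}, \quad
\phi_i = \frac{1}{\alpha_i + \beta_i + 1}, \\
\mathrm{E}[X_{ij}] &= N_{ij}\,\mu_i, \qquad
\mathrm{Var}[X_{ij}] = N_{ij}\,\mu_i (1-\mu_i)\,\bigl[1 + (N_{ij}-1)\,\phi_i\bigr].
\end{align}
```

Setting φ_i = 0 recovers the binomial variance, so the factor in square brackets is the overdispersion attributable to biological variability.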

Over the past few years, several beta-binomial-based approaches have been developed to identify DM, such as DSS [69], MOABS [70], RADMeth [71], methylSig [72], DSS-single [73], MACAU [74], DSS-general [75] and GetisDMR [76]. These approaches differ from each other in the way they estimate regression parameters, calculate P-values, estimate DMR boundaries, etc.

‘DSS’ is one of the approaches in this category that relies on a beta-binomial hierarchical model to identify DM using bisulfite sequencing data. In this model, the prior distribution is constructed from the whole genome, which is either methylated or unmethylated. True methylation proportions of the CpG sites among the replicates are then modeled using the beta distribution parameterized by group mean and a dispersion parameter. The biological variability is captured by the beta distribution, whereas the sampling variability is captured by the binomial distribution. Variation across the methylation proportion of the CpG sites relative to the group mean is captured by the dispersion parameter, which is estimated by an empirical Bayes approach. When the sample size is small, a shrinkage approach is used to estimate the dispersion parameter to improve the overall performance. Differentially methylated CpG sites are determined by using P-values from the Wald test, which is performed by comparing the mean methylation levels between two groups. Lastly, candidate DMRs are defined by applying user-specified thresholds on DMR characteristics, among which are P-value, minimum length and minimum number of CpG sites.

The key contribution of the DSS approach is the shrinkage procedure that improves the dispersion parameter estimation. For this reason, this approach is particularly useful when the sample size is small. By applying the Wald test procedure, this approach takes into consideration the biological variation and sequencing coverage.
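
A simplified sketch of a Wald test under a beta-binomial model may help to show how coverage and dispersion enter the statistic. It is not the DSS implementation: the dispersions `phi1` and `phi2` are assumed to be given (DSS estimates them with shrinkage), the group mean is taken as the pooled proportion, and all function names are hypothetical.

```python
import math


def group_mean_and_variance(meth, cov, phi):
    """Pooled methylation proportion and its variance for one group.

    meth, cov: methylated and total read counts per replicate.
    phi:       dispersion, assumed known here (an illustrative shortcut).
    """
    total = sum(cov)
    mu = sum(meth) / total
    # Beta-binomial variance of each count: N*mu*(1-mu)*(1+(N-1)*phi).
    var_counts = sum(n * mu * (1 - mu) * (1 + (n - 1) * phi) for n in cov)
    return mu, var_counts / total ** 2


def wald_test(meth1, cov1, meth2, cov2, phi1, phi2):
    """Return (z statistic, two-sided P-value) for mu1 - mu2 = 0."""
    mu1, var1 = group_mean_and_variance(meth1, cov1, phi1)
    mu2, var2 = group_mean_and_variance(meth2, cov2, phi2)
    se = math.sqrt(var1 + var2)
    if se == 0.0:
        # Both groups fully methylated or fully unmethylated: no evidence.
        return 0.0, 1.0
    z = (mu1 - mu2) / se
    # Two-sided normal tail probability: erfc(|z| / sqrt(2)).
    return z, math.erfc(abs(z) / math.sqrt(2))
```

Higher coverage shrinks the binomial part of the variance, whereas a larger dispersion inflates it, so the same difference in means becomes less significant when replicates disagree.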

A more recent method named ‘DSS-single’ is an improved version of the DSS approach, which can take into account the spatial correlation among the CpG sites across the genome. In addition, DSS-single considers the within-group variation without biological replicates by using the neighboring CpG sites as ‘pseudo-replicates’. Similar to DSS, DSS-single captures the technical variability using binomial distribution and the biological variability using beta distribution. The beta distribution is parameterized with the group mean and dispersion parameter. DSS-single estimates the group mean using a smoothing function and the dispersion parameter using an empirical Bayes procedure. Hypothesis testing is performed using the Wald test to identify the DMCs. Later, user-defined thresholds are applied to define the DMR boundaries and select candidate DMRs.

An even more recent variation of the DSS approach, named ‘DSS-general’, identifies differentially methylated loci (DML) from bisulfite sequencing data under general experiment design. DSS-general identifies DML by modeling the methylation count data for each locus using the beta-binomial regression with the ‘arcsine’ link function. The ‘arcsine’ link function is applied to perform a data transformation that decreases the dependency of the data variance on the mean and prepares it for the next step. Due to this data transformation, the regression coefficient and the variance matrix can be estimated by applying the generalized least square method, as opposed to the beta-binomial generalized linear model or logistic regression, which are limited when values are separable (e.g. values for unmethylated sites are close to 0, values for methylated sites are close to 1). Finally, the Wald test is used to perform hypothesis testing.
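
The variance-stabilizing effect of an arcsine-type transformation can be checked numerically. The sketch below assumes the link has the form g(μ) = arcsin(2μ − 1), which is our assumption about the exact parameterization, and uses the delta method on a plain binomial proportion; the function names are hypothetical.

```python
import math


def arcsine_link(mu):
    """Assumed form of the link: g(mu) = arcsin(2*mu - 1), for 0 <= mu <= 1."""
    return math.asin(2.0 * mu - 1.0)


def arcsine_link_derivative(mu):
    """g'(mu) = 1 / sqrt(mu * (1 - mu)), valid for 0 < mu < 1."""
    return 1.0 / math.sqrt(mu * (1.0 - mu))


def delta_method_variance(mu, n):
    """Approximate variance of g(X/n) for X ~ Binomial(n, mu).

    Delta method: Var[g(p_hat)] ~ g'(mu)^2 * mu * (1 - mu) / n,
    which simplifies to 1 / n regardless of mu.
    """
    proportion_variance = mu * (1.0 - mu) / n
    return arcsine_link_derivative(mu) ** 2 * proportion_variance
```

Because the approximate variance on the transformed scale no longer depends on the mean, a least-squares fit becomes reasonable even for loci whose methylation levels are close to 0 or 1.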

The key advantage of the DSS-general approach is that it is applicable to bisulfite sequencing data with multiple groups or covariates. In addition, it uses the ‘arcsine’ link function, which is more efficient than the other widely used ‘logit’ and ‘probit’ functions because it estimates the regression parameters in one iteration.

‘MOABS’ is another approach that relies on the beta-binomial assumption to identify DM. Similar to DSS, the prior distribution is constructed from the whole genome, resulting in a bimodal distribution. The posterior distribution follows a beta distribution, which is estimated using an empirical Bayes approach. When biological replicates are available, the posterior distribution is generated using the maximum likelihood approach. The significance of the DM between two samples is represented by a single metric named ‘credible methylation difference’, which incorporates both the biological and statistical significance of the DM. MOABS can also work with CHG or CHH methylation.

‘RADMeth’ is another analysis pipeline that relies on the beta-binomial assumption. RADMeth uses a beta-binomial regression approach with the ‘logit’ link function to model the methylation levels of the CpG sites across the samples. Regression parameters are estimated using a standard maximum likelihood approach. In the beta-binomial regression model, RADMeth incorporates the experimental factors using a model matrix. The DM of a particular site is determined by comparing two fitted regression models (i.e. reduced model without factors and full model with factors) using the log-likelihood ratio. Subsequently, P-values of the neighboring CpG sites are combined using the weighted Z-test (i.e. Stouffer-Liptak test [77]) to obtain the DMRs. The key contribution of this approach is the ability to analyze WGBS data in multiple factor experiments.
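
The weighted Z-test used to combine neighboring P-values can be sketched as follows. This is the textbook form of the test, which assumes independent one-sided P-values; it does not reproduce any correction for correlation between neighboring sites, and the function name and the clipping constant are hypothetical.

```python
import math
from statistics import NormalDist


def stouffer_liptak(pvalues, weights):
    """Combine one-sided P-values with the weighted Z-test.

    Each P-value is converted to a z-score, the z-scores are combined as
    sum(w * z) / sqrt(sum(w^2)), and the result is converted back to a
    P-value. Independence of the inputs is assumed.
    """
    if not pvalues or len(pvalues) != len(weights):
        raise ValueError("pvalues and weights must be non-empty and equal in length")
    std_normal = NormalDist()
    eps = 1e-12  # keeps inv_cdf away from 0 and 1 (illustrative choice)
    z_scores = [std_normal.inv_cdf(1.0 - min(max(p, eps), 1.0 - eps))
                for p in pvalues]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = math.sqrt(sum(w * w for w in weights))
    return 1.0 - std_normal.cdf(numerator / denominator)
```

Several moderately small P-values at adjacent sites thus yield a combined P-value that is smaller than any of them individually, which is what allows a region to be called even when no single site is striking.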

‘MethylSig’ is another analysis pipeline that uses the beta-binomial model across the samples to identify either DMCs or DMRs. The pipeline begins with taking the number of Cs and Ts as input. The approach uses the beta-binomial model to estimate the methylation levels at each CpG site or region, which involves the two following steps: (i) estimate the dispersion parameter for each CpG site or region, which accounts for biological variation among the samples within a group and (ii) calculate the group methylation level at each CpG site or region using the estimated dispersion parameters. In each step, local information can be incorporated from nearby CpG sites or regions to increase statistical power. The significance level of the methylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when the sample size is small. MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance.

‘MACAU’ is based on a binomial mixed model (BMM) that takes into account the population structures from a data set. This model is a generalized beta-binomial model consisting of an extra term to model the population structure. In the absence of that extra term, this model can be reduced to a beta-binomial model. In this approach, the prior distribution is constructed from a BMM, whereas the posterior distribution is constructed from a log-normal distribution. Model parameters are estimated by using a Markov chain Monte Carlo (MCMC) algorithm-based approach. Hypothesis testing is performed by using the Wald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that it can add a predictor variable of interest in the model to check the association with any genetic background. In addition to considering biological variability among the replicates and the sampling variability among the sequencing reads, this method also takes into consideration the population variability. Furthermore, it can be applied to both WGBS and RRBS data sets.

‘GetisDMR’, a recent beta-binomial-based approach, identifies variable-size DMRs directly from WGBS data by using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatial clusters (hotspots). By incorporating this statistic into DM analysis, GetisDMR accounts for spatial correlation among the methylation levels of the CpG sites along with the biological and sampling variability. When biological replicates are available, beta-binomial regression with logistic link function is used to model the methylation level of each CpG site. Model parameters are estimated by using the maximum likelihood function. Hypothesis testing is performed by using the likelihood ratio test. In the absence of biological replicates, methylation levels are modeled by using binomial distribution and hypothesis testing is performed by using FET. P-values from the hypothesis testing are further used to calculate z-scores. Finally, a local Getis-Ord statistic is used based on the z-scores to identify DMRs using the information from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e. z-scores) to compute a score of the nonrandom association between a data point and its neighbors, where a positive score shows a positive association and a negative score shows a negative association. This statistic is then used to identify data regions with points that exhibit nonrandom associations (i.e. DMRs).
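
To make the local statistic concrete, the sketch below computes the standard Getis-Ord Gi* score along a one-dimensional sequence of z-scores. It assumes binary weights over a fixed number of neighboring CpG sites on each side; the weighting scheme used by GetisDMR may differ, and the function name and the `half_window` parameter are hypothetical.

```python
import math


def local_getis_ord(values, half_window=1):
    """Gi* score for each position, using binary weights in a fixed window.

    values:      per-CpG z-scores ordered along the genome.
    half_window: number of neighbors on each side that receive weight 1
                 (the position itself is included, as in Gi*).
    """
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    mean = sum(values) / n
    s = math.sqrt(max(sum(v * v for v in values) / n - mean * mean, 0.0))
    if s == 0.0:
        return [0.0] * n  # no variation, hence no local clustering
    scores = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        k = hi - lo  # with binary weights, sum(w) == sum(w^2) == k
        numerator = sum(values[lo:hi]) - mean * k
        denominator = s * math.sqrt((n * k - k * k) / (n - 1))
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores
```

Positions inside a run of large z-scores receive large positive scores, whereas isolated large values are damped by their neighbors, which is how the statistic favors contiguous regions over single sites.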

One of the primary strengths of GetisDMR is that it can detect DMRs with variable length instead of depending on user-specified threshold parameters. It can take into account the spatial correlation between the neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model. Furthermore, it can work with multiple groups with or without biological replicates. One drawback of this approach is that it cannot work with enriched regions, such as RRBS data.

Beta-binomial-based approaches are useful because they take into account both sampling variability among the read counts and biological variability among the replicates. Furthermore, these approaches are able to identify DM at single-base resolution from low CpG-density regions (e.g. TFBS). On the other hand, most of the beta-binomial-based approaches (except DSS-single, MACAU and GetisDMR) do not take into account the spatial correlation between the methylation levels of the CpG sites.

Hidden Markov model-based approaches

Approaches in this category use a hidden Markov model (HMM) to identify differentially methylated patterns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites as methylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methylation values. Transition probabilities among the methylation states represent the distance distribution among the DMCs, whereas emission probabilities represent the likelihood of DM for the CpG sites. High transition probabilities and low transition probabilities are used to model the neighboring CpG sites that have high similarities and low similarities within their methylation levels, respectively. Parameters are usually estimated by using established learning algorithms, whereas potential DMRs are identified using different statistical approaches.
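
The role of the transition and emission probabilities can be illustrated with a toy three-state HMM decoded by the Viterbi algorithm. Everything in this sketch is a hypothetical choice made for illustration: the Gaussian emissions on the methylation difference, the parameter values and the use of Viterbi decoding. None of the surveyed methods is reproduced here.

```python
import math

STATES = ("hypo", "none", "hyper")
# Hypothetical emission model: the methylation difference between groups is
# Gaussian around a state-specific mean with a common standard deviation.
MEANS = {"hypo": -0.5, "none": 0.0, "hyper": 0.5}
SD = 0.15
STAY = 0.9  # hypothetical probability of keeping the same state


def log_emission(state, diff):
    z = (diff - MEANS[state]) / SD
    return -0.5 * z * z - math.log(SD * math.sqrt(2.0 * math.pi))


def log_transition(prev, cur):
    prob = STAY if prev == cur else (1.0 - STAY) / (len(STATES) - 1)
    return math.log(prob)


def viterbi(diffs):
    """Most likely state sequence for a list of per-CpG differences."""
    if not diffs:
        return []
    score = {s: -math.log(len(STATES)) + log_emission(s, diffs[0])
             for s in STATES}
    backpointers = []
    for d in diffs[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES,
                            key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev]
                              + log_transition(best_prev, cur)
                              + log_emission(cur, d))
        backpointers.append(pointers)
        score = new_score
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

With a high probability of staying in the same state, a single ambiguous CpG inside a hypermethylated run is still labeled as hypermethylated, which is the sense in which neighboring sites lend information to each other.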

One of the approaches in this category, named ‘ComMet’ [64], included in the Bisulfighter methylation analysis suite [78, 79], combines all the samples within a group into one sample and identifies the DMRs by comparing a pair of two samples. This method captures the probability distribution of distances between the neighboring DMCs and adjusts the DMC chaining criteria automatically for each data set. Transition probabilities are estimated using an expectation maximization algorithm, whereas emission probabilities are estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated by incorporating an unsupervised learning algorithm. DMRs are identified by using a dynamic programming algorithm.

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. It takes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. On the other hand, one of the limitations of this approach is that it does not take into account the biological variation across replicates, which might lead to a higher number of false positives in the results [14, 43, 46].

Another approach in this category is ‘HMM-Fisher’ [80], which estimates the methylation status of the CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fisher models both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites using transition probability. HMM-Fisher estimates the transition probabilities using a Dirichlet distribution, whereas emission probabilities are computed using a truncated normal distribution. After estimating the methylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identified using FET. Identified DMCs are further grouped into DMRs if the distance between the CpG sites is <100 bases. Non-consecutive CpG sites are reported as DMCs in the output.

One of the major contributions of HMM-Fisher is that it can identify DMRs of variable size instead of depending on user-defined boundary thresholds. It takes the biological variation among the replicates into account and can provide both DMCs and DMRs as output. It can also be used to identify sample-wise methylation patterns.

‘HMM-DM’ [81] is another approach that uses HMM to identify DM. HMM-DM directly estimates the DM states of the CpG sites for each sample across the groups. In this approach, the transition probability of each CpG site only depends on the methylation state of the immediate previous CpG site. Like HMM-Fisher and ComMet, the transition probabilities are estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs with variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from their neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all the three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] for the purpose of identifying DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites of the regions. The probability that a sample is methylated at a specific location is calculated by taking the ratio of the methylation level of that sample and the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in minor samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make the approach less sensitive to such outliers. Finally, a region is called differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined by using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
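
The unweighted entropy that serves as the starting point can be sketched in a few lines. The sketch implements only the plain Shannon entropy of the normalized methylation levels; the Tukey biweight adjustment and the probability model for the cutoff are not reproduced, and the function name is hypothetical.

```python
import math


def methylation_entropy(levels):
    """Shannon entropy (in bits) of one region across samples.

    levels: average methylation level of the region in each sample.
    Each level is divided by the total across samples to obtain p_s,
    and the entropy is -sum(p_s * log2(p_s)), skipping zero terms.
    """
    total = sum(levels)
    if total <= 0:
        raise ValueError("total methylation across samples must be positive")
    probabilities = [level / total for level in levels]
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

A region with identical levels in four samples reaches the maximum of 2 bits, whereas a region methylated in a single sample has an entropy of 0, in line with the statement that lower entropy indicates a larger methylation difference.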

An improved approach in this category, ‘CpG_MPs’ [51], has been proposed from the same research group, which can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites based on their normalized methylation levels into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs by using a combinatorial algorithm with Shannon entropy. At first, the identified methylated and unmethylated regions are mapped to the reference genome, and then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds to identify DMRs; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measurement among the identified regions.

A more recent approach, ‘SMART’ [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites by estimating the methylation level of the sites based on Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation levels of that group are significantly higher (lower) than the average methylation levels of all samples, as determined by a one-sample t-test.

A major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests, such as FET, t-test and ANOVA, to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts with taking aligned read counts (e.g. output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated by using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET or ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. A major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins with taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
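
Two generic building blocks of such a sliding-window workflow, window generation and FDR correction, can be sketched as follows. The Benjamini-Hochberg procedure is used here as one common FDR method, which is our assumption about the correction applied; the function names, the window size and the step are hypothetical and are not swDMR defaults.

```python
def sliding_windows(start, end, size, step):
    """Overlapping windows [s, s + size) that fit inside [start, end)."""
    return [(s, s + size) for s in range(start, end - size + 1, step)]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values are monotone in the rank of the raw P-values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted
```

For example, `sliding_windows(0, 1000, 500, 250)` yields three windows that overlap by half their length, and each window would then receive one P-value from the chosen test before the adjustment is applied across all windows.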

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG density regions where methylation has sharp changes, such as TFBS that can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or it does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.
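
One segmentation step can be sketched as an exhaustive scan for the window whose mean differs most from the rest of the region. This is a simplified stand-in under our own assumptions: the scaled mean-difference score is a generic choice rather than metilene's scoring function, the Kolmogorov–Smirnov stopping rule and the recursion are omitted, and the function name is hypothetical.

```python
import math


def best_segment(diffs, min_cpgs=2):
    """Find the window [i, j) with the strongest mean-difference signal.

    diffs:    per-CpG difference of group mean methylation levels.
    min_cpgs: smallest window that is considered (illustrative default).
    Returns ((i, j), score), or (None, 0.0) if no window has a signal.
    """
    n = len(diffs)
    prefix = [0.0]
    for d in diffs:
        prefix.append(prefix[-1] + d)
    total = prefix[-1]
    best, best_score = None, 0.0
    for i in range(n):
        for j in range(i + min_cpgs, n + 1):
            k = j - i
            if k == n:
                continue  # the window must leave something outside
            inside = (prefix[j] - prefix[i]) / k
            outside = (total - (prefix[j] - prefix[i])) / (n - k)
            # Mean difference scaled by the sizes of the two parts.
            score = abs(inside - outside) / math.sqrt(1.0 / k + 1.0 / (n - k))
            if score > best_score:
                best, best_score = (i, j), score
    return best, best_score
```

In a full procedure, the window and the flanking parts would each be segmented again until a stopping rule applies; the quadratic scan shown here is acceptable only because pre-segmentation keeps the regions short.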

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features, such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: MethylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, MethylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates and two groups with replicates), based on the protocol used (WGBS, RRBS and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because accounting for it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach developed primarily for DMR identification that takes into account the biological variation among replicates. Among the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.
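The difference between pooling read counts and modeling each replicate separately can be illustrated with a minimal sketch. The read counts below are hypothetical and the helper names are ours; the sketch is not taken from any of the surveyed tools. It only shows that two groups can be summarized by a single pooled level each, while the between-replicate spread that a replicate-aware model would use is visible only when the replicates are kept apart.

```python
from statistics import mean, variance

# Hypothetical (methylated, total) read counts at one CpG site,
# three biological replicates per group.
group_a = [(9, 10), (2, 10), (8, 10)]
group_b = [(4, 10), (5, 10), (4, 10)]

def pooled_level(replicates):
    """Methylation level after summing reads across replicates (variability is lost)."""
    methylated = sum(m for m, _ in replicates)
    total = sum(n for _, n in replicates)
    return methylated / total

def per_replicate_levels(replicates):
    """One methylation level per replicate (variability is kept)."""
    return [m / n for m, n in replicates]

for name, replicates in (("A", group_a), ("B", group_b)):
    levels = per_replicate_levels(replicates)
    print(
        f"group {name}: pooled={pooled_level(replicates):.3f} "
        f"mean={mean(levels):.3f} variance={variance(levels):.4f}"
    )
```

In this toy example the replicates of group A disagree far more than those of group B, which a pooled count hides completely.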

Spatial correlation is another factor to consider; it provides a better estimate of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen with the RRBS protocol) because of extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.

Table 1. Summary of the important characteristics of the 22 surveyed approaches

| No. | Method and reference | Concept used | Protocol | Primary purpose | Total citations | Citations/year |
|---|---|---|---|---|---|---|
| 1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate | 175 | 43.75 |
| 2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs | 28 | 8 |
| 3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates | 156 | 39 |
| 4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction | 62 | 18 |
| 6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples | 43 | 16.1 |
| 5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates | 49 | 18.4 |
| 7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs | 31 | 13.3 |
| 8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs | 42 | 17.4 |
| 9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates | 15 | 12 |
| 10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure | 8 | 8 |
| 11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs | 3 | 3 |
| 12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly | 0 | 0 |
| 13 | ComMet [78] | HMM | Both | Identify DMRs | 24 | 8.7 |
| 14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns | 4 | 4 |
| 15 | HMM-DM [81] | HMM | Both | Identify DMRs | 4 | 4 |
| 16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs | 61 | 10.7 |
| 17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns | 30 | 7.2 |
| 18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks | 9 | 9 |
| 19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands | 27 | 7.7 |
| 20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs | 31 | 12.4 |
| 21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates | 4 | 3.2 |
| 22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples | 0 | 0 |

Note: Columns 5–10 of the table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage, identify de novo region) mark whether each method considers the characteristic, does not consider it, or whether the characteristic is not applicable. For the 9th column, a positive mark means that the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column, the marks indicate whether the method can or cannot identify de novo regions. The per-method marks for columns 5–10 are not recoverable in this copy. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.

eDMR uses the autocorrelation of the methylation data; HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use an HMM; CpG_MPs uses a hotspot extension algorithm; and SMART uses Euclidean distance based on methylation similarity to take into account the spatial correlation of the CpG sites.
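The idea of borrowing information from neighboring CpG sites can be sketched with a simple coverage-weighted kernel average. This is an illustration only: the triangular kernel, the `bandwidth` parameter and the toy counts are our assumptions, and the sketch is not the local-likelihood smoother used by BSmooth or the procedure of any other surveyed tool.

```python
def smooth_methylation(positions, methylated, total, bandwidth=100):
    """Coverage-weighted local average of methylation with a triangular kernel.

    Each site's estimate pools methylated and total reads from sites closer
    than `bandwidth` bases, down-weighting sites that are farther away.
    """
    smoothed = []
    for center in positions:
        numerator = 0.0
        denominator = 0.0
        for pos, m, n in zip(positions, methylated, total):
            weight = max(0.0, 1.0 - abs(pos - center) / bandwidth)
            numerator += weight * m
            denominator += weight * n
        smoothed.append(numerator / denominator if denominator > 0 else float("nan"))
    return smoothed

# Hypothetical data: three nearby CpG sites and one isolated site.
positions = [100, 120, 150, 400]
methylated = [8, 1, 9, 2]
total = [10, 2, 10, 10]

raw = [m / n for m, n in zip(methylated, total)]
smoothed = smooth_methylation(positions, methylated, total)
for pos, r, s in zip(positions, raw, smoothed):
    print(f"pos={pos} raw={r:.2f} smoothed={s:.2f}")
```

The low-coverage site at position 120 is pulled toward its well-covered neighbors, whereas the isolated site at position 400 has no neighbors to borrow from, which mirrors the difficulty with sparse CpG sites noted above.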

Sequencing coverage is another important factor that affects the accuracy of methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take sequencing coverage into account by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information depends not merely on the hypothesis test but on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET to the counts of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take read coverage into account despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, and the logistic regression-based and beta-binomial-based approaches are able to take coverage information into account. Some approaches also include

Figure 4. The workflow of the 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds, such as P-value cutoff of the DMCs, length of the DMRs, distance between neighboring DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher's exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey based on the data type used when modeling the methylation levels of the CpG sites.

Table 2. Comparison of the available implementations of the 22 surveyed approaches

| No. | Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date |
|---|---|---|---|---|---|---|---|
| 1 | methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs/DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016 |
| 2 | eDMR [54, 64] | R package | Standalone | Artistic, GPL | DMRs list (table) | 4 January 2013 | 4 April 2014 |
| 3 | BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016 |
| 4 | BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016 |
| 6 | DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 4 June 2012 | 17 October 2016 |
| 5 | MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015 |
| 7 | RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a) |
| 8 | methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016 |
| 9 | DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016 |
| 10 | MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015 |
| 11 | DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016 |
| 12 | GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016 |
| 13 | ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CCANS | DMRs list (table) | 12 December 2014 | 29 September 2015 |
| 14 | HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016 |
| 15 | HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016 |
| 16 | QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012 |
| 17 | CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015 |
| 18 | SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015 |
| 19 | COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016 |
| 20 | DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016 |
| 21 | swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014 |
| 22 | metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016 |

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic, non-profit organizations on non-commercial projects.
GNU, general public license; LGPL, lesser general public license; CCANS, Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license; PSFL, Python Software Foundation license; CLI, command line interface.

additional filters to remove low-coverage CpG sites before estimating methylation.
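The bias of pooled count-based tests toward deeply sequenced samples can be made concrete with a small sketch. The two-sided Fisher's exact test below is a generic textbook implementation written with the standard library, and the read counts are hypothetical; it does not reproduce how any surveyed tool applies FET.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )

# Hypothetical (methylated, total) reads per replicate at one CpG site.
group_1 = [(90, 100), (1, 10), (1, 10)]  # one replicate has 10x the coverage
group_2 = [(4, 10), (4, 10), (4, 10)]

def pool(replicates):
    """Sum reads across replicates and return (methylated, unmethylated)."""
    methylated = sum(m for m, _ in replicates)
    total = sum(n for _, n in replicates)
    return methylated, total - methylated

meth_1, unmeth_1 = pool(group_1)
meth_2, unmeth_2 = pool(group_2)

pooled_level_1 = meth_1 / (meth_1 + unmeth_1)
mean_ratio_1 = sum(m / n for m, n in group_1) / len(group_1)
p_value = fisher_exact_two_sided(meth_1, unmeth_1, meth_2, unmeth_2)

print(f"group 1 pooled level: {pooled_level_1:.3f}")
print(f"group 1 mean of per-replicate ratios: {mean_ratio_1:.3f}")
print(f"FET P-value on pooled counts: {p_value:.3g}")
```

Here two of the three replicates in group 1 are lowly methylated, yet the pooled level is dominated by the single deeply sequenced replicate, and the pooled test inherits that imbalance.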

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds give users more flexibility, proper tuning of these parameters is necessary to get robust results. In addition to the list of DMRs, some of the approaches provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
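As one common example of such a correction, the sketch below applies the Benjamini-Hochberg step-up procedure to a list of per-site P-values. The choice of this particular procedure and the input values are our assumptions for illustration; the surveyed tools differ in which correction they apply and at which level (site, region or within region).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical per-CpG P-values.
site_p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(site_p_values)
significant = [q <= 0.05 for q in adjusted]
print(adjusted)
print(significant)
```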

Identifying the best-suited approach among all those available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation changes sharply (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When the sample size is small, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein and Hebestreit [15] and Yu and Sun [16]. Klein and Hebestreit compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between sensitivity and specificity for individual methods using the receiver operating characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Owing to the complex nature of methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized input data format, building a benchmark for assessing the efficiency of these approaches is a challenging task and is beyond the scope of this survey.
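The area under the ROC curve used in such comparisons can be computed directly from regional P-values when the true status of each region is known, for example in a simulation. The sketch below uses the rank-based (Mann-Whitney) formulation of the AUC; the P-values and truth labels are hypothetical, and the evaluation protocol of the cited reviews may differ in its details.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a true DMR scores higher than a non-DMR.

    Ties between a positive and a negative region count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("both classes are required to compute the AUC")
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical regional P-values from one method and the simulated truth
# (1 = region is truly differentially methylated).
regional_p_values = [0.001, 0.20, 0.06, 0.50, 0.04]
truth = [1, 0, 1, 0, 0]

# Smaller P-values should rank higher, so use 1 - P as the score.
scores = [1.0 - p for p in regional_p_values]
print(f"AUC = {roc_auc(scores, truth):.3f}")
```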

In addition to the conceptual overview, we also summarize the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, publication date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors, and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers design experiments to generate data that are better suited for the community. In addition, this survey will guide developers in developing new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the best-suited approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.
40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.
41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.
42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.
44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.
45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.
46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.
47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.
48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.
49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.
50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.
51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.
52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.
53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.
54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.
55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.
57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.
58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.
59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.
60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.
61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.
62. Gelman A. Analysis of variance - why it is more important than ever. Ann Stat 2005;33(1):1–53.
63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.
64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.
65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.
66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.
67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.
68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.
69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.
70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.
71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.
72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.
73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.
74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.
75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.
76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.
77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.
78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.
79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.
80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher's exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.
81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.
82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.
83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.
84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.
86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.
87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.
88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.
89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.

standard genome aligners such as Bowtie [31] and Bowtie2 [32]. In contrast, wildcard aligners such as BSMAP [33], RRBSMAP [34], GSNAP [35] and RMAP [36] replace the Cs of the reference genome with the wildcard letter Y, which matches both Cs and Ts in the sequencing reads. The alignment results are usually stored in SAM/BAM file format.

Post-alignment analysis

After mapping the reads, an optional post-alignment step can be performed to extract meaningful biological information from the alignment results before DM analysis. Several post-alignment analysis tools have been developed, including BiQ Analyzer [37], QUMA [38], BRAT [29], MethyQA [39], BSPAT [40] and MethGo [41]. Most of these tools provide summary statistics, quality assessment and visualization of the methylation data. Some of these tools include extra features such as read mapping (e.g. BSPAT and BRAT), identifying DNA methylation co-occurrence patterns (e.g. BSPAT), single nucleotide polymorphism and copy number variation calling (e.g. MethGo) and detecting allele-specific methylation patterns (e.g. BSPAT).

DM analysis

After obtaining the methylation information of the CpG sites, the next downstream step is typically DM analysis, which is usually done in the form of identifying DMCs or DMRs. Identification of DMCs involves comparing the methylation level at each CpG site across the phenotypes (two or more) and applying statistical tests for hypothesis testing. Identification of DMRs is usually a two-step process: (i) the identification of DMCs and (ii) grouping the neighboring DMCs into contiguous DMRs by certain distance criteria. However, some approaches can directly identify DMRs. DMCs/DMRs occasionally can be linked to transcriptional repression of the associated genes; therefore, they provide crucial biological insights that may lead to the development of potential drug candidates [1].
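
The second step of the two-step process can be illustrated with a minimal sketch. The function name and the thresholds `max_gap` and `min_sites` are hypothetical choices made for this example only; each tool discussed below applies its own chaining criteria.

```python
def group_dmcs(positions, max_gap=100, min_sites=3):
    """Chain DMC positions on one chromosome into candidate DMRs.

    max_gap and min_sites are illustrative thresholds, not values taken
    from any particular tool. Returns (start, end, n_sites) tuples.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        # Close the running region when the next DMC is too far away.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    # Flush the last region.
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


dmcs = [100, 150, 220, 900, 950, 5000]
print(group_dmcs(dmcs))
```

DMCs that fail the `min_sites` criterion are simply dropped here; a tool may instead report them as isolated DMCs.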

To identify potential DMCs/DMRs from bisulfite sequencing data, some characteristics need to be considered. One such characteristic is the ‘spatial correlation’ between the methylation levels of the neighboring CpG sites, which plays an important role in getting an accurate estimation of the methylation levels [3, 42]. Incorporating spatial correlation in DM analysis can reduce the required sequencing depth and makes it possible to estimate the methylation status of the missing CpG sites [43]. ‘Sequencing depth’ is another important characteristic that is directly related to the certainty of the methylation scores of CpG sites. Considering sequencing depth while identifying DMRs is crucial because it accounts for the sampling variability that occurs during sequencing. Another such characteristic is ‘biological variation’ among replicates, which is crucial in identifying the regions that consistently differ between groups of samples [44, 45]. Ignoring biological variation while detecting DMRs might lead to a high number of false positives in the results [14, 43, 46]. This is because the methylation levels of the CpG sites are heterogeneous not only when the cell types are different but also when the cells are of the same type [47–50].

Classical hypothesis testing methods, such as Fisher’s exact test (FET), chi-square (χ²) test, regression approaches, t-test, moderated t-test, Goeman’s global test and analysis of variance (ANOVA), can be used to identify DM using bisulfite sequencing data [3, 46, 51, 52, 53]. These approaches can be divided into two categories based on the data type they use: count-based hypothesis tests and ratio-based hypothesis tests.

Count-based hypothesis tests

The inputs of these hypothesis testing methods are count values, which can be either the number of reads or the number of CpG sites in a predefined genomic region. FET is a classical statistical test used to determine whether there are nonrandom associations between two categorical variables. In the context of methylation analysis, we can use the data to build a contingency table, where the two rows represent the two methylation states and the two columns represent a pair of samples. When applying FET to two groups of samples, the counts for a methylation status within each group are aggregated into a single number [54]. The chi-square test is another classical method to test the relationship between two categorical variables (methylated versus unmethylated). In contrast with FET, it allows for testing across multiple samples. As pointed out by Sun et al. [17] and Hurlbert et al. [55], there are several issues related to the aggregation of read counts into a single number when applying tests of independence (FET and χ² test). First, the read counts are not independent; they represent different sets of interdependent or correlated observations. Thus, aggregating the counts violates the fundamental assumption underlying the test for independence. Second, because of the uneven coverage of individual sites, the results are biased toward the samples with higher coverage. Third, by aggregating (summing) the counts, some of the biological variation (e.g. sample size, intra-group variance) is not taken into account by the hypothesis test. Therefore, using FET and the χ² test to compare two groups of samples could lead to a high number of false positives [14, 43, 46].
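
As an illustration of the contingency-table setup described above, the following sketch computes a two-sided FET P-value for a single CpG site from first principles. The function name and the read counts are hypothetical; the two-sided rule used (summing all tables that are no more probable than the observed one) is one common convention and is assumed here.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: methylated / unmethylated read counts.
    Columns: sample 1 / sample 2.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as extreme (no more probable) as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical site: sample 1 has 18 methylated and 2 unmethylated reads,
# sample 2 has 6 methylated and 14 unmethylated reads.
p_value = fisher_exact_two_sided(18, 6, 2, 14)
print(f"P = {p_value:.2e}")
```

When replicates are summed into the two columns before calling such a function, the test sees only the pooled counts, which is precisely the aggregation issue raised above.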

Regression approaches (e.g. Poisson, quasi-Poisson, negative binomial regression) are primarily used for detecting differentially expressed genes using RNA-Seq data, but they can also be applied in the context of DM analysis [15]. For example, the read counts can be modeled using a Poisson distribution, and a modified Wald test can be used to detect DM as the difference between two Poisson means [56, 57].

Ratio-based hypothesis tests

These hypothesis tests use methylation percentage (methylation ratio) instead of count values. For a particular CpG site, the methylation percentage is calculated by taking the ratio between the methylated read counts and the total read counts of that site. To compare the methylation levels between two groups (phenotypes) of samples, classical tests such as the t-test [58, 59], moderated t-test (limma) [60] or Goeman’s global test [61] can be used. While the t-test is a classical approach to compare the means, limma and Goeman’s test are empirical Bayesian approaches that were primarily designed to detect differentially expressed genes using microarray data. When analyzing methylation levels across multiple groups of samples, ANOVA [62] can be used instead of multiple pair-wise comparisons. Compared with count-based hypothesis tests, the ratio-based tests take into account the biological variation across multiple replicates. However, because they only take into account the ratio of the reads (methylated reads versus all reads), they ignore the sequencing depth at the CpG sites.
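
The loss of depth information can be seen in a small sketch. The helper names and the read counts are hypothetical, and Welch's unequal-variance t-statistic is used here only as one representative ratio-based test.

```python
from math import sqrt
from statistics import mean, variance


def methylation_ratio(methylated, total):
    """Methylation ratio of one CpG site in one sample."""
    return methylated / total


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of methylation ratios."""
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    return (mean(group_a) - mean(group_b)) / se


# (methylated reads, total reads) per replicate; coverage varies widely.
group_a = [(8, 10), (45, 50), (170, 200)]
group_b = [(1, 5), (30, 100), (5, 20)]

ratios_a = [methylation_ratio(m, n) for m, n in group_a]
ratios_b = [methylation_ratio(m, n) for m, n in group_b]
print(welch_t(ratios_a, ratios_b))

# A site covered by 2 reads and a site covered by 100 reads yield the
# same ratio, so the test cannot tell them apart.
print(methylation_ratio(1, 2) == methylation_ratio(50, 100))
```

The replicate-to-replicate spread enters through the variance terms, which is how these tests reflect biological variation, while the total read counts never reach the test statistic.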

Although classical hypothesis testing methods are somewhat useful, straightforward and easy to use, they are not well suited to more sophisticated methylation analyses, such as identifying de novo regions, considering spatial correlation among the methylation levels of the CpG sites and estimating methylation levels of missing CpG sites. Over the past few years, several approaches have been developed to address these challenges, which are discussed and summarized in the following subsections.

Logistic regression-based approaches

Approaches in this category model the read counts of the CpG sites by using logistic regression to identify DM. One of the popular approaches in this category is ‘methylKit’ [54], which uses logistic regression to model the methylation proportion at a given base or region when biological replicates are available. In the absence of biological replicates, methylKit uses FET to identify DM. P-values are corrected using the false discovery rate (FDR) approach or the sliding linear model approach [63]. MethylKit is commonly used to identify DMCs from predefined regions (RRBS data). However, it can also be used to identify DMRs from WGBS data based on user-defined tiling windows. A major contribution of methylKit is that it can take into account the sequencing coverage. It can incorporate additional covariates into the model and work with CHG or CHH methylation. It also provides functionalities such as sample-wise methylation summary, sample clustering, annotation and visualization of DM.

Another method, named ‘eDMR’ [64], was proposed as an extension of methylKit. eDMR models the distances between the neighboring CpG sites using a bimodal normal distribution and estimates DMR boundaries using a weighted cost function. After estimating the regional boundaries, DMRs are filtered based on the mean methylation difference, the number of DMCs and the number of CpG sites. The significance of the DMRs is calculated by combining the P-values of the DMCs using the Stouffer-Liptak method [65]. The P-values for DMRs are then corrected for multiple comparisons using the FDR method. eDMR provides a list of DMRs and their annotation as output.
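
The basic Stouffer-Liptak combination can be sketched as follows. This is a minimal version that assumes independent one-sided P-values strictly between 0 and 1 and optional user-supplied weights; it does not include any adjustment for correlation between neighboring sites, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_liptak(p_values, weights=None):
    """Combine one-sided P-values with the (weighted) Stouffer-Liptak Z-method.

    Assumes independent P-values in the open interval (0, 1).
    """
    nd = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    # Convert each P-value to a Z-score; small P gives large positive Z.
    z_scores = [nd.inv_cdf(1.0 - p) for p in p_values]
    combined_z = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return 1.0 - nd.cdf(combined_z)


# Hypothetical P-values of the DMCs inside one candidate region.
print(stouffer_liptak([0.04, 0.01, 0.20, 0.03]))
```

Several moderately small P-values reinforce each other: two sites with P = 0.05 each combine to a regional P-value of about 0.01.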

Approaches in this category take sequencing coverage into account. They can incorporate additional covariates into the model as well. However, they do not consider the biological variation among the replicates. Although eDMR estimates the significance of the identified regions based on spatial autocorrelation, it does not consider the spatial correlation among the CpG sites when estimating the methylation levels.

Smoothing-based approaches

Approaches in this category assume that methylation levels of the CpG sites vary smoothly across the genome. They perform ‘smoothing’ across the samples or predefined regions, which is a technique to estimate the methylation levels of the CpG sites by borrowing information from their neighbors. Group differences across different conditions are computed based on the estimated methylation values of the CpG sites. Finally, different statistical tests are used to identify the differentially methylated sites or regions.

One of the most commonly used smoothing-based approaches is ‘BSmooth’ [43], which relies on smoothing across the genome within each sample. It looks for group differences via CpG-wise t-tests to identify DMRs between two groups. The BSmooth algorithm begins with aligning the sequencing reads to the reference genome. Two alternative pipelines are available for aligning the reads. The first pipeline, which supports gapped alignment and the alignment of paired-end bisulfite-treated reads, is based on in silico bisulfite conversion and uses the ‘Bowtie-2’ aligner to align the reads [32]. The second pipeline is based on a newly developed aligner named ‘Merman’, which supports the alignment of colorspace bisulfite reads. After aligning the reads, sample-specific quality assessment metrics are compiled. Local likelihood smoothing is applied within a smoothing window across the samples to estimate the methylation levels of the CpG sites. A signal-to-noise statistic similar to the t-test is used to identify the DMCs. Finally, DMRs are defined by merging the consecutive DMCs based on defined criteria, such as a cutoff value of the t-statistic, the maximum distance between the CpG sites and the minimum number of CpG sites.

BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. It reduces the required sequencing coverage by applying the local likelihood smoothing approach across the samples. It can also identify de novo regions from WGBS data sets. On the other hand, BSmooth lacks suitable error measurement criteria within the identified DMRs. As a result, there is no way to check whether the CpG sites inside the predicted DMRs are true DMCs or were selected erroneously. BSmooth predicts methylation values of the CpG sites based on the last observed slope. Hence, for the genomic regions that are not covered by the reads, the previously observed methylation level will continue, resulting in a biased estimation of the methylation level (i.e. extrapolated methylation values of 0 and 1) [66]. BSmooth is not applicable to data sets that do not have biological replicates. In addition, BSmooth is limited to comparisons between two groups of conditions.

Another approach in this category, ‘BiSeq’, performs the smoothing of methylation data across defined candidate regions instead of across the samples (like BSmooth) [66]. The pipeline begins with defining CpG clusters within the genome based on a minimum number of ‘frequently covered CpG sites’ (CpG sites that are covered by the majority of samples) and a proximity distance threshold defined by the user. A smoothing function is modeled for each defined cluster. While modeling the smoothing function, the coverage information for each CpG site is taken into account to make sure that a CpG site with high coverage has a greater impact on the estimated methylation level than a CpG site with low coverage. Group effects of the CpG sites are modeled using beta regression with a probit link function. DMCs are identified using the Wald test procedure. Next, a hierarchical testing procedure is applied to identify significant clusters containing at least one DMC. While testing the target regions, a weighted FDR is applied to take into account the size of individual clusters [67]. A location-wise FDR approach is then applied to trim the CpG sites that are not differentially methylated within the selected significant clusters.
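
The idea of letting well-covered sites dominate the smoothed estimate can be sketched with a simple coverage-weighted kernel average. This is a deliberately simplified stand-in and not the smoothing procedure implemented in BiSeq or BSmooth; the tricube kernel, the `bandwidth` value and the function name are assumptions made for illustration.

```python
def smooth_methylation(positions, methylated, coverage, bandwidth=80):
    """Coverage-weighted kernel estimate of the methylation level at each site.

    Each neighbor contributes its methylated and total read counts,
    down-weighted by distance with a tricube kernel, so a deeply covered
    neighbor influences the estimate more than a shallow one.
    """
    smoothed = []
    for x0 in positions:
        num = den = 0.0
        for x, m, n in zip(positions, methylated, coverage):
            u = abs(x - x0) / bandwidth
            if u < 1.0:
                k = (1.0 - u ** 3) ** 3  # tricube kernel weight
                num += k * m
                den += k * n
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed


# Hypothetical cluster: the second site has only 2 reads, so its raw ratio
# of 0.0 is pulled strongly toward its well-covered neighbors.
positions = [100, 120, 150, 170]
methylated = [45, 0, 52, 38]
coverage = [50, 2, 60, 40]
print(smooth_methylation(positions, methylated, coverage))
```

Because the kernel weights multiply read counts rather than ratios, the estimate is equivalent to a weighted mean of per-site ratios with weights proportional to kernel weight times coverage.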

One of the major contributions of the BiSeq approach is that it provides region-wise error control when testing the target regions. This approach is also capable of adding additional covariates to the regression model. In contrast, one of the limitations of the BiSeq approach is that it is only suitable for analyzing experiments that have predefined regions, such as RRBS data sets.

In general, smoothing-based approaches have the advantage of considering the spatial correlation between the methylation levels of the CpG sites. By performing smoothing, the required sequencing coverage and the variance of the methylation levels can be reduced [43]. Furthermore, they can estimate the methylation levels of missing CpG sites. On the other hand, smoothing-based approaches cannot detect the low CpG density regions where methylation has sharp changes, such as transcription factor binding sites (TFBS). TFBS are usually small (i.e. <50 bp) and might consist of a single CpG that is differentially methylated [68]. Thus, biological events involving a single CpG site might not be detected by the smoothing approaches. In addition, these approaches are not appropriate for biological systems in which the true methylation levels of the CpG sites are not spatially correlated.

Beta-binomial-based approaches

Approaches in this category characterize the methylation read counts with a beta-binomial distribution. In the absence of any biological or technical variation, the methylation proportion of a particular CpG site follows a binomial distribution because sequencing reads over a CpG site can be either methylated or unmethylated. Whenever biological and technical variation are present in the data, the methylation proportions of the CpG sites are assumed to follow a beta distribution. Therefore, in the presence of biological replicates, an appropriate statistical model for methylation analysis is the beta-binomial model, as it can take into account both sampling and biological variability.
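
A minimal sketch of the beta-binomial probability mass function makes the two layers of variability concrete. The mean/dispersion parameterization used here (alpha = mu(1 - phi)/phi, beta = (1 - mu)(1 - phi)/phi, so that the variance of the underlying proportion is mu(1 - mu)phi) is an assumption chosen for illustration; individual tools may parameterize the model differently.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, mu, phi):
    """P(k methylated reads out of n) under a beta-binomial model.

    mu  : mean methylation level of the group, in (0, 1)
    phi : dispersion in (0, 1); larger values mean more variation
          between replicates (assumed parameterization, see lead-in)
    """
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


# With the same mean, higher dispersion spreads probability toward
# extreme read counts (0 or n methylated reads).
for phi in (0.01, 0.3):
    print(phi, [round(beta_binomial_pmf(k, 10, 0.5, phi), 3) for k in range(11)])
```

As phi approaches 0 the distribution approaches the plain binomial, which corresponds to the case without biological variation described above.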

Over the past few years, several beta-binomial-based approaches have been developed to identify DM, such as DSS [69], MOABS [70], RADMeth [71], methylSig [72], DSS-single [73], MACAU [74], DSS-general [75] and GetisDMR [76]. These approaches differ from each other in the way they estimate regression parameters, calculate P-values, estimate DMR boundaries, etc.

‘DSS’ is one of the approaches in this category that relies on a beta-binomial hierarchical model to identify DM using bisulfite sequencing data. In this model, the prior distribution is constructed from the whole genome, which is either methylated or unmethylated. True methylation proportions of the CpG sites among the replicates are then modeled using the beta distribution, parameterized by a group mean and a dispersion parameter. The biological variability is captured by the beta distribution, whereas the sampling variability is captured by the binomial distribution. Variation of the methylation proportions of the CpG sites relative to the group mean is captured by the dispersion parameter, which is estimated by an empirical Bayes approach. When the sample size is small, a shrinkage approach is used to estimate the dispersion parameter to improve the overall performance. Differentially methylated CpG sites are determined by using P-values from the Wald test, which is performed by comparing the mean methylation levels between two groups. Lastly, candidate DMRs are defined by applying user-specified thresholds on DMR characteristics, among which are P-value, minimum length and minimum number of CpG sites.

The key contribution of the DSS approach is the shrinkage procedure that improves the dispersion parameter estimation. For this reason, this approach is particularly useful when the sample size is small. By applying the Wald test procedure, this approach takes into consideration the biological variation and sequencing coverage.

A more recent method, named ‘DSS-single’, is an improved version of the DSS approach that can take into account the spatial correlation among the CpG sites across the genome. In addition, DSS-single considers the within-group variation without biological replicates by using the neighboring CpG sites as ‘pseudo-replicates’. Similar to DSS, DSS-single captures the technical variability using the binomial distribution and the biological variability using the beta distribution. The beta distribution is parameterized with the group mean and the dispersion parameter. DSS-single estimates the group mean using a smoothing function and the dispersion parameter using an empirical Bayes procedure. Hypothesis testing is performed using the Wald test to identify the DMCs. Later, user-defined thresholds are applied to define the DMR boundaries and select candidate DMRs.

An even more recent variation of the DSS approach, named ‘DSS-general’, identifies differentially methylated loci (DML) from bisulfite sequencing data under a general experimental design. DSS-general identifies DML by modeling the methylation count data for each locus using beta-binomial regression with the ‘arcsine’ link function. The ‘arcsine’ link function is applied to perform a data transformation that decreases the dependency of the data variance on the mean and prepares it for the next step. Owing to this data transformation, the regression coefficients and the variance matrix can be estimated by applying the generalized least squares method, as opposed to the beta-binomial generalized linear model or logistic regression, which are limited when values are separable (e.g. values for unmethylated sites are close to 0, values for methylated sites are close to 1). Finally, the Wald test is used to perform hypothesis testing.
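
The variance-stabilizing effect of an arcsine transformation can be checked numerically. The sketch below assumes the link has the form g(mu) = arcsin(2mu - 1) and uses a first-order (delta method) approximation under purely binomial sampling; both the exact form of the link and the helper names are assumptions for illustration rather than a description of the DSS-general implementation.

```python
from math import asin, sqrt


def arcsine_link(mu):
    """Assumed form of the arcsine link: g(mu) = arcsin(2 * mu - 1)."""
    return asin(2.0 * mu - 1.0)


def transformed_variance(mu, n):
    """Delta-method variance of g(k / n) when k ~ Binomial(n, mu).

    Var[g(p_hat)] is approximately Var[p_hat] * g'(mu)^2, with
    Var[p_hat] = mu * (1 - mu) / n and g'(mu) = 1 / sqrt(mu * (1 - mu)).
    """
    raw_variance = mu * (1.0 - mu) / n
    slope = 1.0 / sqrt(mu * (1.0 - mu))
    return raw_variance * slope ** 2


# On the raw scale the variance depends on the mean; on the transformed
# scale it is approximately 1 / n regardless of the mean.
for mu in (0.05, 0.5, 0.95):
    print(mu, mu * (1 - mu) / 20, transformed_variance(mu, 20))
```

Once the variance no longer depends on the mean, a least squares fit on the transformed scale becomes reasonable, which is consistent with the one-step estimation described above.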

The key advantage of the DSS-general approach is that it is applicable to bisulfite sequencing data with multiple groups or covariates. In addition, it uses the ‘arcsine’ link function, which is more computationally efficient than the widely used ‘logit’ and ‘probit’ functions because it estimates the regression parameters in one iteration.

‘MOABS’ is another approach that relies on the beta-binomial assumption to identify DM. Similar to DSS, the prior distribution is constructed from the whole genome, resulting in a bimodal distribution. The posterior distribution follows a beta distribution, which is estimated using an empirical Bayes approach. When biological replicates are available, the posterior distribution is generated using the maximum likelihood approach. The significance of the DM between two samples is represented by a single metric named ‘credible methylation difference’, which incorporates both the biological and statistical significance of the DM. MOABS can also work with CHG or CHH methylation.

‘RADMeth’ is another analysis pipeline that relies on the beta-binomial assumption. RADMeth uses a beta-binomial regression approach with the ‘logit’ link function to model the methylation levels of the CpG sites across the samples. Regression parameters are estimated using a standard maximum likelihood approach. In the beta-binomial regression model, RADMeth incorporates the experimental factors using a model matrix. The DM of a particular site is determined by comparing two fitted regression models (i.e. a reduced model without factors and a full model with factors) using the log-likelihood ratio. Subsequently, P-values of the neighboring CpG sites are combined using the weighted Z-test (i.e. Stouffer-Liptak test [77]) to obtain the DMRs. The key contribution of this approach is the ability to analyze WGBS data in multiple-factor experiments.

‘MethylSig’ is another analysis pipeline that uses a beta-binomial model across the samples to identify either DMCs or DMRs. The pipeline begins with taking the number of Cs and Ts as input. The approach uses the beta-binomial model to estimate the methylation levels at each CpG site or region, which involves the following two steps: (i) estimate the dispersion parameter for each CpG site or region, which accounts for biological variation among the samples within a group, and (ii) calculate the group methylation level at each CpG site or region using the estimated dispersion parameters. In each step, local information can be incorporated from nearby CpG sites or regions to increase statistical power. The significance level of the methylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when the sample size is small. MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance.

‘MACAU’ is based on a binomial mixed model (BMM) that takes into account the population structure of a data set. This model is a generalized beta-binomial model with an extra term to model the population structure. In the absence of that extra term, the model reduces to a beta-binomial model. In this approach, the prior distribution is constructed from a BMM, whereas the posterior distribution is constructed from a log-normal distribution. Model parameters are estimated by using a Markov chain Monte Carlo (MCMC) algorithm-based approach. Hypothesis testing is performed by using the Wald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that it can add a predictor variable of interest to the model to check the association with any genetic background. In addition to considering biological variability among the replicates and the sampling variability among the sequencing reads, this method also takes into consideration the population variability. Furthermore, it can be applied to both WGBS and RRBS data sets.

‘GetisDMR’, a recent beta-binomial-based approach, identifies variable-size DMRs directly from WGBS data by using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatial clusters (hotspots). By incorporating this statistic into DM analysis, GetisDMR accounts for spatial correlation among the methylation levels of the CpG sites along with the biological and sampling variability. When biological replicates are available, beta-binomial regression with a logistic link function is used to model the methylation level of each CpG site. Model parameters are estimated by using the maximum likelihood function. Hypothesis testing is performed by using the likelihood ratio test. In the absence of biological replicates, methylation levels are modeled by using the binomial distribution, and hypothesis testing is performed by using FET. P-values from the hypothesis testing are further used to calculate z-scores. Finally, a local Getis-Ord statistic is computed from the z-scores to identify DMRs using the information from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e. z-scores) to compute a score of the nonrandom association between a data point and its neighbors, where a positive score shows a positive association and a negative score shows a negative association. This statistic is then used to identify data regions with points that exhibit nonrandom associations (i.e. DMRs).
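
The hotspot score can be illustrated with the standard local Getis-Ord Gi* statistic applied to per-site z-scores. The sketch below uses binary distance-based weights (a site is a neighbor if it lies within `radius` bp, itself included); this weighting scheme, the function name and the example values are assumptions, and the weights used by GetisDMR itself may differ.

```python
from math import sqrt


def getis_ord_gi_star(values, positions, radius):
    """Local Getis-Ord Gi* score for each site.

    values    : per-site z-scores
    positions : genomic coordinates of the sites
    radius    : neighborhood half-width in bp (hypothetical parameter)
    """
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for pi in positions:
        # Binary weights: 1 inside the neighborhood (including the site itself).
        w = [1.0 if abs(pj - pi) <= radius else 0.0 for pj in positions]
        sw = sum(w)
        sw2 = sum(x * x for x in w)
        numerator = sum(wj * v for wj, v in zip(w, values)) - mean * sw
        denominator = s * sqrt((n * sw2 - sw ** 2) / (n - 1))
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


z_scores = [0, 0, 0, 3, 3, 3, 0, 0, 0, 0]
positions = [i * 100 for i in range(10)]
scores = getis_ord_gi_star(z_scores, positions, radius=100)
print([round(g, 2) for g in scores])
```

Sites in the middle of a run of large z-scores receive the highest positive scores, while sites surrounded by below-average values receive negative scores, so a threshold on the score yields regions whose extent is driven by the data rather than by a fixed window.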

One of the primary strengths of GetisDMR is that it can detect DMRs with variable length instead of depending on user-specified threshold parameters. It can take into account the spatial correlation between the neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model. Furthermore, it can work with multiple groups, with or without biological replicates. One drawback of this approach is that it cannot work with enriched regions, such as RRBS data.

Beta-binomial-based approaches are useful because they take into account both sampling variability among the read counts and biological variability among the replicates. Furthermore, these approaches are able to identify DM at single-base resolution from low CpG-density regions (e.g. TFBS). On the other hand, most of the beta-binomial-based approaches (except DSS-single, MACAU and GetisDMR) do not take into account the spatial correlation between the methylation levels of the CpG sites.

Hidden Markov model-based approaches

Approaches in this category use a hidden Markov model (HMM) to identify differentially methylated patterns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites as methylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methylation values. Transition probabilities among the methylation states represent the distance distribution among the DMCs, whereas emission probabilities represent the likelihood of DM for the CpG sites. High and low transition probabilities are used to model neighboring CpG sites whose methylation levels have high and low similarity, respectively. Parameters are usually estimated by using established learning algorithms, whereas potential DMRs are identified using different statistical approaches.
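
How a three-state HMM turns noisy per-site differences into contiguous state calls can be sketched with Viterbi decoding. Everything numerical here is a hypothetical choice for illustration: Gaussian emissions centered at +0.4, -0.4 and 0, a fixed probability `stay` of remaining in the same state, and a uniform initial distribution. The tools in this category learn such parameters from the data and use different emission models, and this sketch ignores the distance between sites.

```python
from math import log, pi

STATES = ("hyper", "hypo", "none")


def log_gauss(x, mean, sd):
    """Log-density of a normal distribution."""
    return -0.5 * log(2 * pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def viterbi(diffs, stay=0.9, means=(0.4, -0.4, 0.0), sd=0.15):
    """Most likely state path for a sequence of methylation differences."""
    k = len(STATES)
    log_stay, log_move = log(stay), log((1 - stay) / (k - 1))
    score = [log(1 / k) + log_gauss(diffs[0], means[s], sd) for s in range(k)]
    back = []
    for x in diffs[1:]:
        prev, score, ptr = score, [], []
        for s in range(k):
            # Best predecessor of state s, given the transition penalty.
            cands = [prev[r] + (log_stay if r == s else log_move) for r in range(k)]
            best = max(range(k), key=cands.__getitem__)
            ptr.append(best)
            score.append(cands[best] + log_gauss(x, means[s], sd))
        back.append(ptr)
    # Trace the best path backwards from the best final state.
    state = max(range(k), key=score.__getitem__)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return [STATES[s] for s in reversed(path)]


diffs = [0.0, 0.02, 0.45, 0.38, 0.41, -0.01, 0.03]
print(viterbi(diffs))
```

The high `stay` probability plays the role of the high transition probabilities between similar neighbors mentioned above: a single noisy site is unlikely to flip the state, whereas a sustained shift does.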

One of the approaches in this category, named ‘ComMet’ [64] and included in the Bisulfighter methylation analysis suite [78, 79], combines all the samples within a group into one sample and identifies the DMRs by comparing a pair of samples. This method captures the probability distribution of distances between the neighboring DMCs and adjusts the DMC chaining criteria automatically for each data set. Transition probabilities are estimated using an expectation maximization algorithm, whereas emission probabilities are estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated by incorporating an unsupervised learning algorithm. DMRs are identified by using a dynamic programming algorithm.

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. It takes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. On the other hand, one of the limitations of this approach is that it does not take into account the biological variation across replicates, which might lead to a higher number of false positives in the results [14, 43, 46].

Another approach in this category is ‘HMM-Fisher’ [80], which estimates the methylation status of the CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fisher models both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites using transition probabilities. HMM-Fisher estimates the transition probabilities using a Dirichlet distribution, whereas emission probabilities are computed using a truncated normal distribution. After estimating the methylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identified using FET. Identified DMCs are further grouped into DMRs if the distance between the CpG sites is <100 bases. Non-consecutive CpG sites are reported as DMCs in the output.

One of the major contributions of HMM-Fisher is that it can identify DMRs of variable size instead of depending on user-defined boundary thresholds. It takes the biological variation among the replicates into account and can provide both DMCs and DMRs as output. It can also be used to identify sample-wise methylation patterns.

‘HMM-DM’ [81] is another approach that uses HMM to identify DM. HMM-DM directly estimates the DM states of the CpG sites for each sample across the groups. In this approach, the transition probability of each CpG site only depends on the methylation state of the immediately preceding CpG site. Like HMM-Fisher and ComMet, the transition probabilities are estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable-size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs with variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from their neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] for the purpose of identifying DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites of the regions. The probability that a sample is methylated at a specific location is calculated by taking the ratio of the methylation level of that sample to the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents a higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in minor samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make the approach less sensitive to such outliers. Finally, a region is considered differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined by using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
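
The unweighted entropy computation described above can be sketched as follows. The sketch covers only the original entropy formula, not the Tukey biweight correction or the cutoff selection used by QDMR; the function name and the convention of returning the maximum entropy when every sample is completely unmethylated are assumptions made for this example.

```python
from math import log2


def methylation_entropy(levels):
    """Shannon entropy (in bits) of a region's methylation levels across samples.

    Each sample's level is divided by the total across samples to obtain a
    probability; low entropy means methylation is concentrated in a few
    samples, i.e. a large difference across samples.
    """
    total = sum(levels)
    if total == 0:
        # Assumed convention: no methylation anywhere means no difference.
        return log2(len(levels))
    probs = [m / total for m in levels]
    return -sum(p * log2(p) for p in probs if p > 0)


uniform = [0.5, 0.5, 0.5, 0.5]    # same level in every sample
specific = [0.9, 0.0, 0.0, 0.0]   # methylated in a single sample
print(methylation_entropy(uniform), methylation_entropy(specific))
```

With four samples the entropy ranges from 0 bits (methylation confined to one sample) to 2 bits (identical levels in all samples), so a cutoff on the entropy separates sample-specific regions from uniformly methylated ones.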

An improved approach in this category, 'CpG_MPs' [51], has been proposed by the same research group; it can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites, based on their normalized methylation levels, into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs by using a combinatorial algorithm with Shannon entropy. At first, the identified methylated and unmethylated regions are mapped to the reference genome, and then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.
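The hotspot step of the second module can be sketched as follows. This is only an illustration of the idea for methylated hotspots: the cutoffs (0.2, 0.5, 0.8), the minimum run length and all names are assumptions rather than the values used by CpG_MPs, and the later merging of neighboring regions is omitted.

```python
def classify(level, low=0.2, mid=0.5, high=0.8):
    """Map a methylation level to one of four states (cutoffs are illustrative)."""
    if level <= low:
        return "U"    # unmethylated
    if level < mid:
        return "pU"   # partially unmethylated
    if level < high:
        return "pM"   # partially methylated
    return "M"        # methylated

def methylated_hotspots(levels, min_run=3):
    """Find runs of at least min_run methylated CpGs, scanning 5' to 3',
    and extend each run over flanking partially methylated CpGs.
    Returns half-open (start, end) index pairs."""
    states = [classify(x) for x in levels]
    regions, i, n = [], 0, len(states)
    while i < n:
        if states[i] != "M":
            i += 1
            continue
        j = i
        while j < n and states[j] == "M":
            j += 1
        if j - i >= min_run:
            start, end = i, j
            # Extend upstream and downstream over partially methylated CpGs.
            while start > 0 and states[start - 1] == "pM":
                start -= 1
            while end < n and states[end] == "pM":
                end += 1
            regions.append((start, end))
        i = j
    return regions

# Hypothetical normalized methylation levels of ten consecutive CpG sites.
example_levels = [0.1, 0.6, 0.9, 0.95, 0.85, 0.7, 0.3, 0.05, 0.9, 0.1]
hotspots = methylated_hotspots(example_levels)
```

Unmethylated hotspots would follow the same pattern with the roles of the states swapped.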

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds to identify DMRs; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measurement among the identified regions.

A more recent approach, 'SMART' [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites by estimating the methylation level of the sites based on Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation levels of that group are significantly higher (lower) than the average methylation levels of all samples, as determined by a one-sample t-test.

The major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and the spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests, such as FET, t-test and ANOVA, to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, 'COHCAP' [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts by taking aligned read counts (e.g. the output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated by using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the 'average by CpG site' pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the 'average by CpG island' pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

'DMAP' [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET or ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. The major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.
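For the pairwise case, the FET on the methylated and unmethylated counts of one window can be sketched with the standard library alone. This is a generic two-sided Fisher's exact test, not DMAP's own code; the function name and the read counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two samples; columns are methylated and unmethylated counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Hypothetical window: sample A has 45 methylated and 5 unmethylated counts,
# sample B has 12 methylated and 38 unmethylated counts.
window_p = fisher_exact_two_sided(45, 5, 12, 38)
```

The resulting P-value would then be compared with the user-defined threshold, after any multiple-testing correction that is applied.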

'swDMR' [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins by taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
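The correction and merging steps can be sketched as below, assuming the per-window P-values have already been computed. The Benjamini-Hochberg procedure is used here as one common FDR method (the source only says 'the FDR approach'), the re-testing of merged DMRs is left out, and the window coordinates, P-values and thresholds are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def merge_windows(windows, max_gap):
    """Merge (start, end) windows, sorted by start, whose gap is below max_gap."""
    merged = []
    for start, end in windows:
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Hypothetical overlapping 1000 bp windows (step 500 bp) and their raw P-values.
windows = [(0, 1000), (500, 1500), (1000, 2000), (5000, 6000), (5500, 6500)]
raw_p = [0.001, 0.004, 0.03, 0.5, 0.002]

adj_p = benjamini_hochberg(raw_p)
potential = [w for w, p in zip(windows, adj_p) if p < 0.05]
candidate_dmrs = merge_windows(potential, max_gap=100)
```

In a full pipeline each merged region would be tested again and its P-value corrected before the final call, as the text describes.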

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG density regions where methylation has sharp changes, such as TFBS that can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, 'metilene' [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples, with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or when it does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.
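The window search at the heart of the segmentation can be illustrated with a simplified sketch. It scores every window of one pre-segmented region with a generic circular-binary-segmentation-style statistic on the per-CpG mean differences; the exact scoring function, the recursion and the Kolmogorov–Smirnov stopping rule used by metilene are not reproduced, and the names and data are assumptions.

```python
from math import sqrt

def best_segment(diffs, min_cpgs=3):
    """Return (start, end, score) for the half-open window [start, end) whose
    summed mean difference deviates most from the region-wide expectation."""
    n = len(diffs)
    prefix = [0.0]
    for d in diffs:
        prefix.append(prefix[-1] + d)
    total = prefix[-1]
    best = (0, n, 0.0)
    for i in range(n):
        for j in range(i + min_cpgs, n + 1):
            k = j - i
            if k == n:
                continue  # the whole region has nothing to be compared against
            # Deviation of the window sum from what the region-wide mean predicts,
            # scaled so that windows of different lengths are comparable.
            score = abs(prefix[j] - prefix[i] - k * total / n) / sqrt(k * (1 - k / n))
            if score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical per-CpG differences in mean methylation between two groups.
mean_diffs = [0.01, -0.02, 0.0, 0.45, 0.5, 0.4, 0.48, 0.02, -0.01, 0.0]
seg_start, seg_end, seg_score = best_segment(mean_diffs)
```

The quadratic search is adequate for a short pre-segmented region; the minimum window length plays the role of the minimum segment size parameter discussed below.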

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using a beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features, such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: methylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, methylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.
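A small numerical sketch makes the loss of variability concrete. The read counts below are invented for illustration: pooling suggests a clear difference between the groups, while the per-replicate levels show that one group is highly variable.

```python
from statistics import stdev

# Hypothetical (methylated, total) read counts at one CpG site,
# three biological replicates per group.
control = [(18, 20), (2, 20), (10, 20)]
treated = [(14, 20), (15, 20), (16, 20)]

def pooled_level(reps):
    """Methylation level after summing read counts across replicates."""
    return sum(m for m, _ in reps) / sum(t for _, t in reps)

def replicate_levels(reps):
    """Per-replicate methylation levels, which keep the between-replicate spread."""
    return [m / t for m, t in reps]

pooled_difference = pooled_level(treated) - pooled_level(control)
control_spread = stdev(replicate_levels(control))
treated_spread = stdev(replicate_levels(treated))
```

Here the pooled difference is 0.25, yet the control replicates range from 0.1 to 0.9, so a test that sees only pooled counts would overstate the evidence for DM at this site.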

Spatial correlation is another factor to consider, which provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a 'smoothing' operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.

Table 1. Summary of the important characteristics of the 22 surveyed approaches

No. | Method and reference | Concept used | Protocol | Primary purpose
1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate
2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs
3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates
4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction
6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples
5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates
7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs
8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs
9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates
10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure
11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs
12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly
13 | ComMet [78] | HMM | Both | Identify DMRs
14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns
15 | HMM-DM [81] | HMM | Both | Identify DMRs
16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs
17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns
18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks
19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands
20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs
21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates
22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples

Columns not shown: biological variation, spatial distribution, additional covariates, error correction, sequencing coverage, identify de novo region, total citations and citations per year. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.

eDMR uses autocorrelation of the methylation data, HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM, CpG_MPs uses a hotspot extension algorithm and SMART uses Euclidean distance based on methylation similarity to take into account the spatial correlation of the CpG sites.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information does not depend merely on the hypothesis test but on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds, such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher's exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey, based on the data type used when modeling the methylation levels of the CpG sites.

Table 2. Comparison of the available implementations of the 22 surveyed approaches

No. | Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
1 | methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
2 | eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
3 | BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016
4 | BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
6 | DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 04 June 2012 | 17 October 2016
5 | MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
7 | RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014^a
8 | methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
9 | DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
10 | MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
11 | DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
12 | GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
13 | ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CCANS | DMRs list (table) | 12 December 2014 | 29 September 2015
14 | HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016
15 | HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016
16 | QDMR [83] | Java package | Standalone, web, CLI | Custom^b | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
17 | CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
18 | SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
19 | COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
20 | DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
21 | swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014
22 | metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

a: RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
b: Custom license stating that the software is free of charge to researchers working at academic/non-profit organizations on non-commercial projects.
GNU, general public license; LGPL, lesser general public license; CCANS, Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license; PSFL, Python Software Foundation license; CLI, command line interface.

additional filters to remove low coverage CpG sites before estimating methylation.

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.

Identifying the most suitable approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When the sample size of the data set is small, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

A comparison of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of individual methods using the receiver operator characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
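The area under the ROC curve used in such comparisons can be computed directly from regional P-values when the true status of each region is known, for example in simulated data. The sketch below uses the rank-based definition of the AUC; the function name, the P-values and the truth labels are hypothetical and are not taken from either review.

```python
def roc_auc(pvalues, is_true_dmr):
    """AUC as the probability that a true DMR receives a smaller P-value
    than a non-DMR (ties count as one half)."""
    pos = [p for p, t in zip(pvalues, is_true_dmr) if t]
    neg = [p for p, t in zip(pvalues, is_true_dmr) if not t]
    wins = sum((p < q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical regional P-values from one method and the simulated truth.
regional_p = [0.001, 0.02, 0.3, 0.04, 0.6, 0.9]
truth = [True, True, True, False, False, False]

auc = roc_auc(regional_p, truth)
```

An AUC of 1 means that every true DMR is ranked ahead of every non-DMR, and 0.5 corresponds to a random ranking.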

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers to design experiments to generate data that are better suited for the community. In addition, this survey will guide developers in creating new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.

2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.

3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.

4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.

5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.

6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.

7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.

8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.

9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces and chromatin. Cell 1999;99(5):451–4.

10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Rev Genet 2012;13(7):484–92.

11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.

12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.

13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.

14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.

15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.

16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.

17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.

18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.

19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.

20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.

21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.

22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.

23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.

24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.

25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore.

26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.

27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.

28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.

29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.

30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.

31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.

32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.

33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.

34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.

35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.

36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.

37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.

38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.

39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.

40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.

41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.

42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.

44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.

45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.

46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.

47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.

48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.

49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.

50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.

51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.

52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.

53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.

54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.

55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.

56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.

57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.

58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.

59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.

60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.

61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.

62. Gelman A. Analysis of variance–why it is more important than ever. Ann Stat 2005;33(1):1–53.

63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.

64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.

65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.

66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.

67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.

68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.

69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.

70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.

71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.

72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.

73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.

74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.

75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.

76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.

77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.

78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014;e45.

79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.

80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.

81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.

82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.

83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.

84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.

86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.

87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.

88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.

89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.

89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43


methylation levels of missing CpG sites. Over the past few years, several approaches have been developed to address these challenges, which are discussed and summarized in the following subsections.

Logistic regression-based approaches

Approaches in this category model the read counts of the CpG sites by using logistic regression to identify DM. One of the popular approaches in this category is ‘methylKit’ [54], which uses logistic regression to model the methylation proportion at a given base or region when biological replicates are available. In the absence of biological replicates, methylKit uses FET to identify DM. P-values are corrected using the false discovery rate (FDR) approach or the sliding linear model approach [63]. MethylKit is commonly used to identify DMCs from predefined regions (RRBS data). However, it can also be used to identify DMRs from WGBS data based on user-defined tiling windows. A major contribution of methylKit is that it can take into account the sequencing coverage. It can incorporate additional covariates into the model and work with CHG or CHH methylation. It also provides functionalities such as sample-wise methylation summary, sample clustering, and annotation and visualization of DM.
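The replicate-free path described above (a per-site FET followed by FDR correction) can be sketched in a few lines of Python. This is an illustration under stated assumptions and not methylKit's own code: the function names and read counts are hypothetical, and the FDR step is assumed to be the Benjamini-Hochberg procedure.

```python
from math import comb


def fisher_exact_two_sided(meth_a, unmeth_a, meth_b, unmeth_b):
    """Two-sided Fisher's exact test on a 2x2 table of read counts."""
    row_a = meth_a + unmeth_a
    row_b = meth_b + unmeth_b
    col_meth = meth_a + meth_b
    denom = comb(row_a + row_b, col_meth)

    def prob(k):
        # Hypergeometric probability of k methylated reads in sample A.
        return comb(row_a, k) * comb(row_b, col_meth - k) / denom

    lo = max(0, col_meth - row_b)
    hi = min(row_a, col_meth)
    p_obs = prob(meth_a)
    # Sum over all tables that are no more likely than the observed one.
    total = sum(prob(k) for k in range(lo, hi + 1)
                if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical CpG: 8 methylated / 2 unmethylated reads in sample A,
# 1 methylated / 5 unmethylated reads in sample B.
p_site = fisher_exact_two_sided(8, 2, 1, 5)  # about 0.035
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Because the test works on raw read counts rather than on proportions, a site with deeper coverage yields a smaller P-value for the same methylation difference, which is one way in which sequencing coverage enters the analysis.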

Another method, named ‘eDMR’ [64], was proposed as an extension of methylKit. eDMR models the distances between the neighboring CpG sites using a bimodal normal distribution and estimates DMR boundaries using a weighted cost function. After estimating the regional boundaries, DMRs are filtered based on the mean methylation difference, the number of DMCs and the number of CpG sites. The significance of each DMR is calculated by combining the P-values of its DMCs using the Stouffer-Liptak method [65]. The P-values for DMRs are then corrected for multiple comparisons using the FDR method. eDMR provides a list of DMRs and their annotation as output.
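As a rough illustration of how site-level P-values can be combined into a regional P-value, the sketch below implements the basic weighted Z-test (Stouffer-Liptak). It assumes independent one-sided P-values strictly between 0 and 1; the correction for spatial autocorrelation used in practice [65] is deliberately omitted, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_liptak(p_values, weights=None):
    """Combine one-sided P-values (each in (0, 1)) into a single P-value."""
    norm = NormalDist()
    if weights is None:
        weights = [1.0] * len(p_values)
    # Convert each P-value to a z-score: a small P gives a large positive z.
    z_scores = [norm.inv_cdf(1.0 - p) for p in p_values]
    combined_z = (sum(w * z for w, z in zip(weights, z_scores))
                  / sqrt(sum(w * w for w in weights)))
    return 1.0 - norm.cdf(combined_z)


# Two hypothetical DMCs, each individually at P = 0.05.
p_region = stouffer_liptak([0.05, 0.05])  # about 0.01
```

Consistent evidence across neighboring sites therefore yields a regional P-value smaller than any single site-level P-value.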

Approaches in this category take sequencing coverage into account. They can incorporate additional covariates into the model as well. However, they do not consider the biological variation among the replicates. Although eDMR estimates the significance of the identified regions based on spatial autocorrelation, it does not consider the spatial correlation among the CpG sites when estimating the methylation levels.

Smoothing-based approaches

Approaches in this category assume that methylation levels of the CpG sites vary smoothly across the genome. They perform ‘smoothing’ across the samples or predefined regions, which is a technique to estimate the methylation levels of the CpG sites by borrowing information from their neighbors. Group differences across different conditions are computed based on the estimated methylation values of the CpG sites. Finally, different statistical tests are used to identify the differentially methylated sites or regions.
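The idea of borrowing information from neighboring CpG sites can be illustrated with a coverage-weighted kernel average. The sketch below is a deliberately simplified stand-in for the local likelihood and regression smoothers used by the tools in this category; the tricube kernel, the bandwidth and all counts are assumptions chosen for illustration.

```python
def smooth_methylation(positions, meth, cov, bandwidth=100):
    """Kernel-weighted methylation estimate at each CpG position.

    Each neighbor contributes in proportion to its read coverage and to a
    tricube kernel of its genomic distance; sites at or beyond the
    bandwidth contribute nothing.
    """
    smoothed = []
    for x0 in positions:
        num = den = 0.0
        for x, m, n in zip(positions, meth, cov):
            u = abs(x - x0) / bandwidth
            if u >= 1.0:
                continue
            k = (1.0 - u ** 3) ** 3
            num += k * m  # kernel-weighted methylated reads
            den += k * n  # kernel-weighted total reads
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed


# Hypothetical sample: the site at 130 has no reads at all and the site
# at 120 is covered by only two reads.
positions = [100, 120, 130, 140, 1000]
meth = [9, 1, 0, 8, 0]
cov = [10, 2, 0, 10, 5]
levels = smooth_methylation(positions, meth, cov)
```

In this toy example the uncovered site at position 130 receives an estimate of roughly 0.8 from its well-covered neighbors, whereas the isolated site at position 1000 keeps its raw value of 0.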

One of the most commonly used smoothing-based approaches is ‘BSmooth’ [43], which relies on smoothing across the genome within each sample. It looks for group differences via CpG-wise t-tests to identify DMRs between two groups. The BSmooth algorithm begins with aligning the sequencing reads to the reference genome. Two alternative pipelines are available for the users to align the reads. The first pipeline, which supports gapped alignment and the alignment of paired-end bisulfite-treated reads, is based on in silico bisulfite conversion and uses the ‘Bowtie 2’ aligner to align the reads [32]. The second pipeline is based on a newly developed aligner named ‘Merman’, which supports the alignment of colorspace bisulfite reads. After aligning the reads, sample-specific quality assessment metrics are compiled. Local likelihood smoothing is applied within a smoothing window across the samples to estimate the methylation levels of the CpG sites. A signal-to-noise statistic similar to the t-test is used to identify the DMCs. Finally, DMRs are defined by merging the consecutive DMCs based on defined criteria, such as a cutoff value of the t-statistic, the maximum distance between the CpG sites and the minimum number of CpG sites.
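The final merging step can be sketched as a single pass over the per-CpG statistics. The code below is an illustrative assumption about how such a merge might be written, not BSmooth's implementation; the default thresholds and the requirement that a run keeps one direction of change are choices made for this sketch.

```python
def call_dmrs(positions, t_stats, cutoff=4.6, max_gap=300, min_cpgs=3):
    """Merge consecutive significant CpGs into candidate DMRs.

    A run is extended while |t| >= cutoff, the sign of t is unchanged and
    the distance to the previous CpG is at most max_gap. Runs shorter than
    min_cpgs are discarded. Returns (start, end, n_cpgs) tuples.
    """
    dmrs = []
    run = []       # positions in the current run of significant CpGs
    run_sign = 0
    for pos, t in zip(positions, t_stats):
        sign = 1 if t > 0 else -1
        significant = abs(t) >= cutoff
        extends = (significant and run and sign == run_sign
                   and pos - run[-1] <= max_gap)
        if extends:
            run.append(pos)
            continue
        # The current run ends here; keep it only if it is long enough.
        if len(run) >= min_cpgs:
            dmrs.append((run[0], run[-1], len(run)))
        run = [pos] if significant else []
        run_sign = sign
    if len(run) >= min_cpgs:
        dmrs.append((run[0], run[-1], len(run)))
    return dmrs


# Hypothetical t-statistics along one chromosome.
positions = [100, 150, 220, 260, 900, 950, 1000, 1400]
t_stats = [5.1, 6.0, 4.9, 1.2, -5.5, -6.1, -4.8, -7.0]
regions = call_dmrs(positions, t_stats)
# [(100, 220, 3), (900, 1000, 3)]
```

The last CpG in the example is significant but lies more than 300 bp from its predecessor, so it starts a new run that is too short to be reported.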

BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. It reduces the required sequencing coverage by applying the local likelihood smoothing approach across the samples. It can also identify de novo regions from WGBS data sets. On the other hand, BSmooth lacks suitable error measurement criteria within the identified DMRs. As a result, there is no way to check whether the identified CpG sites inside the predicted DMRs are true DMCs or were selected erroneously. BSmooth predicts methylation values of the CpG sites based on the last observed slope. Hence, for the genomic regions that are not covered by the reads, the previously observed methylation level will continue, resulting in a biased estimation of the methylation level (i.e. extrapolated methylation values of 0 and 1) [66]. BSmooth is not applicable to those data sets that do not have biological replicates. In addition, BSmooth is limited to comparisons between two groups of conditions.

Another approach in this category, ‘BiSeq’, performs the smoothing of methylation data across defined candidate regions instead of across the samples (like BSmooth) [66]. The pipeline begins with defining CpG clusters within the genome based on a minimum number of ‘frequently covered CpG sites’ (CpG sites that are covered by the majority of samples) and a proximity distance threshold defined by the user. A smoothing function is modeled for each defined cluster. While modeling the smoothing function, the coverage information for each CpG site is taken into account to make sure that a CpG site with high coverage has a greater impact on the estimated methylation level than a CpG site with low coverage. Group effects of the CpG sites are modeled using beta regression with a probit link function. DMCs are identified using the Wald test procedure. Next, a hierarchical testing procedure is applied to identify significant clusters containing at least one DMC. While testing the target regions, a weighted FDR is applied to take into account the size of individual clusters [67]. A location-wise FDR approach is applied to trim the CpG sites that are not differentially methylated within the selected significant clusters.
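The cluster definition step lends itself to a short sketch. The function below is a hypothetical illustration of the idea (keep frequently covered CpG sites, then group those that lie close together); the parameter names and defaults are assumptions, and for simplicity clusters are bounded by frequently covered sites only.

```python
def find_cpg_clusters(positions, n_samples_covering, n_samples,
                      min_frac=0.75, max_dist=100, min_sites=3):
    """Group frequently covered CpG sites that lie close together.

    positions must be sorted. A site is 'frequently covered' when at least
    min_frac of the samples have reads on it. Returns (start, end) tuples.
    """
    frequent = [p for p, k in zip(positions, n_samples_covering)
                if k / n_samples >= min_frac]
    clusters, current = [], []
    for p in frequent:
        # A gap larger than max_dist closes the current cluster.
        if current and p - current[-1] > max_dist:
            if len(current) >= min_sites:
                clusters.append((current[0], current[-1]))
            current = []
        current.append(p)
    if len(current) >= min_sites:
        clusters.append((current[0], current[-1]))
    return clusters


# Hypothetical targeted data set with four samples.
positions = [10, 40, 90, 400, 430, 470, 520, 2000]
covering = [4, 4, 3, 4, 1, 4, 3, 4]
clusters = find_cpg_clusters(positions, covering, n_samples=4)
# [(10, 90), (400, 520)]
```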

One of the major contributions of the BiSeq approach is that it provides region-wise error control measurement to test the target regions. This approach is also capable of adding additional covariates to the regression model. In contrast, one of the limitations of the BiSeq approach is that it is only suitable for analyzing experiments that have predefined regions, such as RRBS data sets.

In general, smoothing-based approaches have the advantage of considering the spatial correlation between the methylation levels of the CpG sites. By performing smoothing, the required sequencing coverage and the variance of the methylation levels can be reduced [43]. Furthermore, they can estimate the methylation levels of missing CpG sites. On the other hand, smoothing-based approaches cannot detect low CpG-density regions where methylation changes sharply, such as transcription factor binding sites (TFBS). TFBS are usually small (i.e. <50 bp) and might consist of a single CpG that is differentially methylated [68]. Thus, biological events involving a single CpG site might not be detected by the smoothing approaches. In addition, these approaches are not appropriate for biological systems in which the true methylation levels of the CpG sites are not spatially correlated.

Beta-binomial-based approaches

Approaches in this category characterize the methylation read counts using a beta-binomial distribution. In the absence of any biological or technical variation, the number of methylated reads at a particular CpG site follows a binomial distribution, because each sequencing read over a CpG site is either methylated or unmethylated. Whenever biological and technical variation are present in the data, the methylation proportions of the CpG sites are assumed to follow a beta distribution. Therefore, in the presence of biological replicates, an appropriate statistical model for methylation analysis is the beta-binomial model, as it can take into account both sampling and biological variability.
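The effect of the extra layer of variability can be checked numerically. The sketch below evaluates the beta-binomial probability mass function under an assumed mean/dispersion parameterization, in which the dispersion phi equals 1/(alpha + beta + 1), and compares its variance with the binomial variance; the function names and numbers are illustrative.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k, n, mu, phi):
    """P(k methylated reads out of n) for mean mu and dispersion phi.

    Both mu and phi are assumed to lie strictly between 0 and 1.
    """
    # Convert (mu, phi) to the usual beta shape parameters.
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def variance(pmf_values):
    """Variance of a distribution given as probabilities for k = 0, 1, ..."""
    mean = sum(k * p for k, p in enumerate(pmf_values))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))


n, mu, phi = 20, 0.7, 0.1
pmf = [beta_binomial_pmf(k, n, mu, phi) for k in range(n + 1)]
bb_var = variance(pmf)            # n*mu*(1-mu)*(1+(n-1)*phi) = 12.18
binom_var = n * mu * (1.0 - mu)   # 4.2 without biological variation
```

With these numbers the beta-binomial variance is almost three times the binomial variance, which is the overdispersion that a purely binomial test would ignore.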

Over the past few years, several beta-binomial-based approaches have been developed to identify DM, such as DSS [69], MOABS [70], RADMeth [71], methylSig [72], DSS-single [73], MACAU [74], DSS-general [75] and GetisDMR [76]. These approaches differ from each other in the way they estimate regression parameters, calculate P-values, estimate DMR boundaries, etc.

‘DSS’ is one of the approaches in this category that relies on a beta-binomial hierarchical model to identify DM using bisulfite sequencing data. In this model, the prior distribution is constructed from the whole genome, which is either methylated or unmethylated. True methylation proportions of the CpG sites among the replicates are then modeled using the beta distribution parameterized by the group mean and a dispersion parameter. The biological variability is captured by the beta distribution, whereas the sampling variability is captured by the binomial distribution. Variation of the methylation proportions of the CpG sites relative to the group mean is captured by the dispersion parameter, which is estimated by an empirical Bayes approach. When the sample size is small, a shrinkage approach is used to estimate the dispersion parameter to improve the overall performance. Differentially methylated CpG sites are determined by using P-values from the Wald test, which is performed by comparing the mean methylation levels between two groups. Lastly, candidate DMRs are defined by applying user-specified thresholds on DMR characteristics, among which are the P-value, minimum length and minimum number of CpG sites.
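A Wald test of this kind can be sketched as follows. The code assumes that the dispersion of each group has already been estimated (for instance by a shrinkage estimator, which is not reproduced here) and uses the beta-binomial variance of the pooled methylation proportion; the function names and counts are hypothetical and the sketch is not the DSS implementation.

```python
from math import sqrt
from statistics import NormalDist


def group_mean_and_var(meth, cov, phi):
    """Pooled methylation level of a group and its beta-binomial variance."""
    total = sum(cov)
    mu = sum(meth) / total
    # Var(X_i) = n_i * mu * (1 - mu) * (1 + (n_i - 1) * phi) for replicate i.
    var = sum(n * mu * (1.0 - mu) * (1.0 + (n - 1) * phi)
              for n in cov) / total ** 2
    return mu, var


def wald_test(meth1, cov1, phi1, meth2, cov2, phi2):
    """Wald statistic and two-sided P-value for a difference in group means."""
    mu1, v1 = group_mean_and_var(meth1, cov1, phi1)
    mu2, v2 = group_mean_and_var(meth2, cov2, phi2)
    if v1 + v2 == 0.0:
        raise ValueError("zero variance: both groups are fully (un)methylated")
    stat = (mu1 - mu2) / sqrt(v1 + v2)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(stat)))
    return stat, p_value


# Hypothetical CpG site with three replicates per group.
stat, p_value = wald_test([18, 16, 17], [20, 20, 20], 0.05,
                          [9, 11, 10], [20, 20, 20], 0.05)
# The same counts with a much larger dispersion give weaker evidence.
_, p_overdispersed = wald_test([18, 16, 17], [20, 20, 20], 0.5,
                               [9, 11, 10], [20, 20, 20], 0.5)
```

The comparison of the two calls shows why the dispersion estimate matters: identical read counts become far less significant when the replicates are assumed to vary more.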

The key contribution of the DSS approach is the shrinkage procedure that improves the dispersion parameter estimation. For this reason, this approach is particularly useful when the sample size is small. By applying the Wald test procedure, this approach takes into consideration the biological variation and sequencing coverage.

A more recent method, named ‘DSS-single’, is an improved version of the DSS approach, which can take into account the spatial correlation among the CpG sites across the genome. In addition, DSS-single considers the within-group variation without biological replicates by using the neighboring CpG sites as ‘pseudo-replicates’. Similar to DSS, DSS-single captures the technical variability using the binomial distribution and the biological variability using the beta distribution. The beta distribution is parameterized with the group mean and a dispersion parameter. DSS-single estimates the group mean using a smoothing function and the dispersion parameter using an empirical Bayes procedure. Hypothesis testing is performed using the Wald test to identify the DMCs. Later, user-defined thresholds are applied to define the DMR boundaries and select candidate DMRs.

An even more recent variation of the DSS approach, named ‘DSS-general’, identifies differentially methylated loci (DML) from bisulfite sequencing data under a general experimental design. DSS-general identifies DML by modeling the methylation count data for each locus using beta-binomial regression with the ‘arcsine’ link function. The ‘arcsine’ link function is applied to perform a data transformation that decreases the dependency of the data variance on the mean and prepares it for the next step. Due to this data transformation, the regression coefficients and the variance matrix can be estimated by applying the generalized least squares method, as opposed to the beta-binomial generalized linear model or logistic regression, which are limited when values are separable (e.g. values for unmethylated sites are close to 0 and values for methylated sites are close to 1). Finally, the Wald test is used to perform hypothesis testing.
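The variance-stabilizing effect of the arcsine transformation, and the resulting one-step least squares fit, can be illustrated as follows. The sketch assumes the link has the form arcsin(2p - 1), for which a delta-method argument gives an approximate variance of (1 + (n - 1)phi)/n that no longer depends on the mean. All names, counts and the dispersion value are hypothetical, and a single covariate is used so that the fit has a closed form.

```python
from math import asin, sin


def arcsine_link(p):
    """Assumed form of the link: maps a proportion in [0, 1] to [-pi/2, pi/2]."""
    return asin(2.0 * p - 1.0)


def inverse_arcsine_link(z):
    return (sin(z) + 1.0) / 2.0


def approx_variance(n, phi):
    """Delta-method variance of the transformed proportion (mean-free)."""
    return (1.0 + (n - 1) * phi) / n


def weighted_least_squares(x, z, w):
    """Fit z = b0 + b1 * x by weighted least squares in a single step."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    zbar = sum(wi * zi for wi, zi in zip(w, z)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxz = sum(wi * (xi - xbar) * (zi - zbar) for wi, xi, zi in zip(w, x, z))
    b1 = sxz / sxx
    return zbar - b1 * xbar, b1


# Hypothetical locus: six samples, covariate = group label (0 or 1).
meth = [2, 3, 2, 15, 17, 16]
cov = [20, 20, 20, 20, 20, 20]
group = [0, 0, 0, 1, 1, 1]
phi = 0.05  # assumed dispersion, taken as known for this sketch

z = [arcsine_link(m / n) for m, n in zip(meth, cov)]
w = [1.0 / approx_variance(n, phi) for n in cov]
b0, b1 = weighted_least_squares(group, z, w)
level_group0 = inverse_arcsine_link(b0)
level_group1 = inverse_arcsine_link(b0 + b1)
```

Because the weights depend only on coverage and dispersion, no iterative reweighting is needed, in contrast to a logit or probit fit whose working weights change with the current estimate of the mean.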

The key advantage of the DSS-general approach is that it is applicable to bisulfite sequencing data with multiple groups or covariates. In addition, it uses the ‘arcsine’ link function, which is more efficient than the widely used ‘logit’ and ‘probit’ functions because it estimates the regression parameters in one iteration.

‘MOABS’ is another approach that relies on the beta-binomial assumption to identify DM. Similar to DSS, the prior distribution is constructed from the whole genome, resulting in a bimodal distribution. The posterior distribution follows a beta distribution, which is estimated using an empirical Bayes approach. When biological replicates are available, the posterior distribution is generated using the maximum likelihood approach. The significance of the DM between two samples is represented by a single metric named ‘credible methylation difference’, which incorporates both the biological and statistical significance of the DM. MOABS can also work with CHG or CHH methylation.

‘RADMeth’ is another analysis pipeline that relies on the beta-binomial assumption. RADMeth uses a beta-binomial regression approach with the ‘logit’ link function to model the methylation levels of the CpG sites across the samples. Regression parameters are estimated using a standard maximum likelihood approach. In the beta-binomial regression model, RADMeth incorporates the experimental factors using a model matrix. The DM of a particular site is determined by comparing two fitted regression models (i.e. a reduced model without factors and a full model with factors) using the log-likelihood ratio. Subsequently, P-values of the neighboring CpG sites are combined using the weighted Z-test (i.e. Stouffer-Liptak test [77]) to obtain the DMRs. The key contribution of this approach is the ability to analyze WGBS data in multiple factor experiments.
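The comparison of a reduced and a full model can be illustrated with a deliberately simplified likelihood ratio test. To keep the example short, the sketch swaps the beta-binomial regression for a plain binomial likelihood with a single two-level factor, so it ignores overdispersion and is not the RADMeth model; the names and counts are hypothetical.

```python
from math import erfc, log, sqrt


def binomial_loglik(meth, cov, p):
    """Binomial log-likelihood, up to the constant binomial coefficients."""
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite at 0 or 1
    return sum(m * log(p) + (n - m) * log(1.0 - p)
               for m, n in zip(meth, cov))


def likelihood_ratio_test(meth, cov, group):
    """Two-group test; group holds a 0/1 label for each sample."""
    # Reduced model: one methylation level shared by all samples.
    ll_reduced = binomial_loglik(meth, cov, sum(meth) / sum(cov))
    # Full model: a separate methylation level for each group.
    ll_full = 0.0
    for g in (0, 1):
        m_g = [m for m, gi in zip(meth, group) if gi == g]
        n_g = [n for n, gi in zip(cov, group) if gi == g]
        ll_full += binomial_loglik(m_g, n_g, sum(m_g) / sum(n_g))
    stat = max(0.0, 2.0 * (ll_full - ll_reduced))
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value


stat, p_value = likelihood_ratio_test([18, 17, 5, 6], [20, 20, 20, 20],
                                      [0, 0, 1, 1])
_, p_null = likelihood_ratio_test([10, 10, 10, 10], [20, 20, 20, 20],
                                  [0, 0, 1, 1])
```

In a multifactor design the full model would contain several columns of the model matrix, and the degrees of freedom of the test would equal the number of parameters dropped in the reduced model.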

‘MethylSig’ is another analysis pipeline that uses a beta-binomial model across the samples to identify either DMCs or DMRs. The pipeline begins with taking the number of Cs and Ts as input. The approach uses the beta-binomial model to estimate the methylation levels at each CpG site or region, which involves the two following steps: (i) estimate the dispersion parameter for each CpG site or region, which accounts for biological variation among the samples within a group, and (ii) calculate the group methylation level at each CpG site or region using the estimated dispersion parameters. In each step, local information can be incorporated from nearby CpG sites or regions to increase statistical power. The significance level of the methylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when the sample size is small. MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance.

‘MACAU’ is based on a binomial mixed model (BMM) that takes into account the population structure of a data set. This model is a generalized beta-binomial model with an extra term to model the population structure. In the absence of that extra term, this model can be reduced to a beta-binomial model. In this approach, the prior distribution is constructed from a BMM, whereas the posterior distribution is constructed from a log-normal distribution. Model parameters are estimated by using a Markov chain Monte Carlo (MCMC) algorithm-based approach. Hypothesis testing is performed by using the Wald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that it can add a predictor variable of interest to the model to check the association with any genetic background. In addition to considering biological variability among the replicates and the sampling variability among the sequencing reads, this method also takes into consideration the population variability. Furthermore, it can be applied to both WGBS and RRBS data sets.

‘GetisDMR’, a recent beta-binomial-based approach, identifies variable-size DMRs directly from WGBS data by using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatial clusters (hotspots). By incorporating this statistic into DM analysis, GetisDMR accounts for spatial correlation among the methylation levels of the CpG sites, along with the biological and sampling variability. When biological replicates are available, beta-binomial regression with a logistic link function is used to model the methylation level of each CpG site. Model parameters are estimated by using the maximum likelihood function. Hypothesis testing is performed by using the likelihood ratio test. In the absence of biological replicates, methylation levels are modeled by using the binomial distribution, and hypothesis testing is performed by using FET. P-values from the hypothesis testing are further used to calculate z-scores. Finally, a local Getis-Ord statistic is used based on the z-scores to identify DMRs using the information from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e. z-scores) to compute a score of the nonrandom association between a data point and its neighbors, where a positive score shows a positive association and a negative score shows a negative association. This statistic is then used to identify data regions with points that exhibit nonrandom associations (i.e. DMRs).
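The local statistic itself is straightforward to compute. The sketch below uses the standard Gi* form of the Getis-Ord statistic with binary weights (1 for CpG sites within a fixed window, including the site itself); the window size and z-scores are hypothetical, and the choice of weights in GetisDMR may differ.

```python
from math import sqrt


def getis_ord_gi_star(z_scores, positions, window=200):
    """Local Getis-Ord Gi* statistic for each CpG site."""
    n = len(z_scores)
    mean = sum(z_scores) / n
    s = sqrt(sum(z * z for z in z_scores) / n - mean ** 2)
    result = []
    for xi in positions:
        # Binary spatial weights around the focal site.
        w = [1.0 if abs(xj - xi) <= window else 0.0 for xj in positions]
        sw = sum(w)
        sw2 = sum(wj * wj for wj in w)
        num = sum(wj * zj for wj, zj in zip(w, z_scores)) - mean * sw
        den = s * sqrt((n * sw2 - sw ** 2) / (n - 1))
        # den is zero when the window spans every site or all z are equal.
        result.append(num / den if den > 0 else 0.0)
    return result


# Hypothetical z-scores along ten CpG sites spaced 100 bp apart.
positions = [0, 100, 200, 300, 400, 500, 600, 700, 800, 900]
z_scores = [0.1, -0.2, 0.3, 3.0, 3.5, 2.8, 0.0, -0.1, 0.2, -0.3]
gi = getis_ord_gi_star(z_scores, positions, window=100)
```

Sites in the middle of the stretch of large z-scores obtain the highest Gi* values, whereas sites surrounded by values below the overall mean obtain negative ones; runs of sites with extreme Gi* values are the candidates for DMRs.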

One of the primary strengths of GetisDMR is that it can detect DMRs with variable length instead of depending on user-specified threshold parameters. It can take into account the spatial correlation between the neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model. Furthermore, it can work with multiple groups, with or without biological replicates. One drawback of this approach is that it cannot work with enriched regions, such as RRBS data.

Beta-binomial-based approaches are useful because they take into account both sampling variability among the read counts and biological variability among the replicates. Furthermore, these approaches are able to identify DM at single-base resolution from low CpG-density regions (e.g. TFBS). On the other hand, most of the beta-binomial-based approaches (except DSS-single, MACAU and GetisDMR) do not take into account the spatial correlation between the methylation levels of the CpG sites.
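To make the distinction between sampling and biological variability concrete, the following sketch compares a binomial model with a beta-binomial model for the number of methylated reads at a site covered by 20 reads. The parameter values (`alpha = beta = 2`, `p = 0.5`) and helper names are illustrative assumptions, not values used by any of the surveyed tools.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(k, n, alpha, beta):
    """P(k methylated reads out of n) when the methylation level ~ Beta(alpha, beta)."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))

def binomial_pmf(k, n, p):
    """P(k methylated reads out of n) for a fixed methylation level p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def variance(pmf_values):
    """Variance of a distribution given as [P(0), P(1), ..., P(n)]."""
    mean = sum(k * p for k, p in enumerate(pmf_values))
    return sum((k - mean) ** 2 * p for k, p in enumerate(pmf_values))

n_reads = 20
bb = [beta_binomial_pmf(k, n_reads, 2.0, 2.0) for k in range(n_reads + 1)]
bi = [binomial_pmf(k, n_reads, 0.5) for k in range(n_reads + 1)]
var_beta_binomial = variance(bb)   # sampling + between-replicate variability
var_binomial = variance(bi)        # sampling variability only
```

Both models have the same mean (10 methylated reads), but the beta-binomial variance is several times larger, which is the overdispersion that these approaches attribute to biological variability.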

Hidden Markov model-based approaches

Approaches in this category use a hidden Markov model (HMM) to identify differentially methylated patterns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites as methylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methylation values. Transition probabilities among the methylation states represent the distance distribution among the DMCs, whereas emission probabilities represent the likelihood of DM for the CpG sites. High and low transition probabilities are used to model neighboring CpG sites whose methylation levels have high and low similarity, respectively. Parameters are usually estimated using established learning algorithms, whereas potential DMRs are identified using different statistical approaches.
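As an illustration of how a three-state HMM assigns methylation states to consecutive CpG sites, the sketch below decodes the most likely state path with the Viterbi algorithm. The transition and emission probabilities are hypothetical numbers chosen for the example; the surveyed tools estimate such parameters from the data and may use other inference procedures (e.g. MCMC), so this is not a reimplementation of any of them.

```python
import math

def viterbi(log_start, log_trans, log_emit):
    """Most likely state path given log start, transition and per-site emission scores."""
    n_states = len(log_start)
    score = [log_start[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at site t + 1
    for obs in log_emit[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + obs[s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

STATES = ["hypo", "no change", "hyper"]
# Hypothetical parameters: neighboring sites tend to share a state.
log_start = [math.log(1 / 3)] * 3
log_trans = [[math.log(0.8 if i == j else 0.1) for j in range(3)] for i in range(3)]
# Hypothetical per-site likelihoods of (hypo, no change, hyper).
emissions = [
    [0.05, 0.90, 0.05],
    [0.05, 0.90, 0.05],
    [0.01, 0.04, 0.95],
    [0.30, 0.40, 0.30],   # ambiguous site inside a hypermethylated stretch
    [0.01, 0.04, 0.95],
    [0.05, 0.90, 0.05],
]
log_emit = [[math.log(p) for p in row] for row in emissions]
path = [STATES[s] for s in viterbi(log_start, log_trans, log_emit)]
```

The ambiguous fourth site is assigned to the hypermethylated state because its neighbors are; this is how the transition probabilities let a site borrow information from adjacent sites.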

One of the approaches in this category, named ‘ComMet’ [64] and included in the Bisulfighter methylation analysis suite [78, 79], combines all the samples within a group into one sample and identifies the DMRs by comparing a pair of samples. This method captures the probability distribution of distances between the neighboring DMCs and adjusts the DMC chaining criteria automatically for each data set. Transition probabilities are estimated using an expectation maximization algorithm, whereas emission probabilities are estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated using an unsupervised learning algorithm. DMRs are identified using a dynamic programming algorithm.

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. It takes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. On the other hand, one of the limitations of this approach is that it does not take into account the biological variation across replicates, which might lead to a higher number of false positives in the results [14, 43, 46].

Another approach in this category is ‘HMM-Fisher’ [80], which estimates the methylation status of the CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fisher models both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites using transition probabilities. HMM-Fisher estimates the transition probabilities using a Dirichlet distribution, whereas emission probabilities are computed using a truncated normal distribution. After estimating the methylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identified using FET. Identified DMCs are further grouped into DMRs if the distance between the CpG sites is <100 bases. Non-consecutive CpG sites are reported as DMCs in the output.
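The final grouping step can be sketched as follows. Given the genomic positions of the identified DMCs on one chromosome, neighboring DMCs closer than 100 bases are chained into a DMR, and isolated sites are reported as single DMCs. The function name and the return format are illustrative assumptions; only the <100-base rule comes from the description above.

```python
def group_dmcs(positions, max_gap=100):
    """Chain DMC positions into DMRs when consecutive sites are < max_gap bases apart.

    Returns (regions, singletons): regions are (start, end, n_sites) tuples,
    singletons are positions of DMCs that could not be chained.
    """
    regions, singletons = [], []

    def flush(chain):
        if len(chain) > 1:
            regions.append((chain[0], chain[-1], len(chain)))
        elif chain:
            singletons.append(chain[0])

    chain = []
    for pos in sorted(positions):
        # Start a new chain when the gap to the previous DMC is too large.
        if chain and pos - chain[-1] >= max_gap:
            flush(chain)
            chain = []
        chain.append(pos)
    flush(chain)
    return regions, singletons

# Hypothetical DMC coordinates on a single chromosome.
dmrs, isolated_dmcs = group_dmcs([100, 150, 190, 500, 1000, 1050])
```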

One of the major contributions of HMM-Fisher is that it can identify DMRs of variable size instead of depending on user-defined boundary thresholds. It takes the biological variation among the replicates into account and can provide both DMCs and DMRs as output. It can also be used to identify sample-wise methylation patterns.

‘HMM-DM’ [81] is another approach that uses HMM to identify DM. HMM-DM directly estimates the DM states of the CpG sites for each sample across the groups. In this approach, the transition probability of each CpG site depends only on the methylation state of the immediately preceding CpG site. Like HMM-Fisher and ComMet, the transition probabilities are estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable-size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs of variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from their neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] to identify DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites in each region. The probability that a sample is methylated at a specific location is calculated as the ratio of the methylation level of that sample to the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents a higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in a minority of samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make the approach less sensitive to such outliers. Finally, a region is called differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
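The unweighted entropy described above can be written in a few lines. The sketch below converts the methylation levels of one region across samples into probabilities and computes the Shannon entropy; the Tukey biweight weighting that QDMR adds on top is deliberately omitted, and the function name is an illustrative assumption.

```python
import math

def methylation_entropy(levels):
    """Shannon entropy (in bits) of a region's methylation levels across samples.

    Each sample's probability is its methylation level divided by the total
    methylation level across all samples; lower entropy indicates a larger
    methylation difference among the samples.
    """
    total = sum(levels)
    if total <= 0:
        raise ValueError("at least one sample must have a positive methylation level")
    probs = [m / total for m in levels]
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform_region = methylation_entropy([0.5, 0.5, 0.5, 0.5])     # same level in all samples
specific_region = methylation_entropy([0.9, 0.05, 0.05, 0.05])  # one hypermethylated sample
```

A region with identical levels in four samples reaches the maximum of 2 bits, whereas a region methylated mainly in one sample has a markedly lower entropy.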

An improved approach in this category, ‘CpG_MPs’ [51], has been proposed by the same research group; it can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites, based on their normalized methylation levels, into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs using a combinatorial algorithm with Shannon entropy. First, the identified methylated and unmethylated regions are mapped to the reference genome, and overlapping regions (ORs) are recorded. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measure for the identified regions.

A more recent approach, ‘SMART’ [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites, estimated from the methylation levels of the sites using Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation levels of that group are significantly higher (lower) than the average methylation levels of all samples, as determined by a one-sample t-test.

A major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and the spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests, such as FET, t-test and ANOVA, to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts by taking aligned read counts (e.g. the output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET or ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. A major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.
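For the pairwise comparison case, a two-sided Fisher's exact test on a 2 × 2 table of methylated and unmethylated read counts can be sketched with the standard library alone. This is a generic textbook implementation (summing the probabilities of all tables at most as likely as the observed one), not code from DMAP or any other surveyed tool; the function and argument names are assumptions.

```python
from math import comb

def fisher_exact_two_sided(meth1, unmeth1, meth2, unmeth2):
    """Two-sided Fisher's exact test P-value for a 2 x 2 read-count table.

    Rows are the two samples; columns are methylated and unmethylated reads.
    """
    row1, row2 = meth1 + unmeth1, meth2 + unmeth2
    col1 = meth1 + meth2
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in sample 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(meth1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed one
    # (a small relative tolerance guards against floating-point ties).
    p_value = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p_value)

p_weak = fisher_exact_two_sided(3, 1, 1, 3)       # small counts, weak evidence
p_strong = fisher_exact_two_sided(10, 0, 0, 10)   # fully methylated vs unmethylated
```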

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins by taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
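The multiple-comparison step can be illustrated with the Benjamini–Hochberg procedure, which is the most common way of computing FDR-adjusted P-values. Whether a given tool uses exactly this procedure is an assumption here; the sketch only shows how per-window P-values would be adjusted before thresholding.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending P-value
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical P-values for four windows.
window_p = [0.01, 0.04, 0.03, 0.005]
window_q = benjamini_hochberg(window_p)
candidate_windows = [i for i, q in enumerate(window_q) if q < 0.03]
```

With a hypothetical threshold of 0.03 on the adjusted values, only the first and last windows remain candidates, although all four raw P-values are below 0.05.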

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG-density regions where methylation has sharp changes, such as TFBS, which can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples, with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or when it does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.
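A heavily simplified sketch of the core search step is given below. It scans all contiguous windows of a pre-segmented region and returns the one with the strongest mean-difference signal, scored here as |sum of differences| / sqrt(window length). The scoring function, the exhaustive search and the `min_cpgs` parameter are assumptions made for illustration; metilene's actual scoring, its recursion and the two-dimensional Kolmogorov–Smirnov stopping rule are not reproduced.

```python
import math

def best_segment(mean_diff, min_cpgs=3):
    """Find the window [start, end) with the strongest mean-difference signal.

    mean_diff holds, for each CpG, the difference between the group mean
    methylation levels. Returns (score, start, end), or None when the region
    has fewer than min_cpgs sites.
    """
    n = len(mean_diff)
    prefix = [0.0]
    for d in mean_diff:
        prefix.append(prefix[-1] + d)   # prefix sums allow O(1) window sums
    best = None
    for start in range(n):
        for end in range(start + min_cpgs, n + 1):
            score = abs(prefix[end] - prefix[start]) / math.sqrt(end - start)
            if best is None or score > best[0]:
                best = (score, start, end)
    return best

# Hypothetical per-CpG differences between two groups.
diffs = [0.0, 0.01, -0.02, 0.4, 0.5, 0.45, 0.42, 0.0, -0.01, 0.02]
score, start, end = best_segment(diffs)
```

In a recursive scheme, the regions to the left and right of the returned window would be searched again until a stopping criterion (such as a minimum number of CpGs) is met.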

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using a beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features, such as the concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identification of de novo regions. The approaches are categorized into seven different categories based on the primary concepts or techniques they use to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: methylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, methylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates, and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because accounting for it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.

Spatial correlation is another factor to consider, which provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.


Table 1. Summary of the important characteristics of the 22 surveyed approaches

No. | Method and reference | Concept used | Protocol | Primary purpose
1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate
2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs
3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates
4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction
6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples
5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates
7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs
8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs
9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates
10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure
11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs
12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly
13 | ComMet [78] | HMM | Both | Identify DMRs
14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns
15 | HMM-DM [81] | HMM | Both | Identify DMRs
16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs
17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns
18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks
19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands
20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs
21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates
22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples

Note: Columns 5–10 of the table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage and identify de novo region) record whether each method considers the characteristic, does not consider it, or whether the characteristic is not applicable. For the sequencing coverage column, a method is marked as considering coverage when count-based hypothesis tests are performed. The last two columns, total citations and citations per year, represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016. The per-method entries of columns 5–12 are not reproduced here.


eDMR uses autocorrelation of the methylation data, HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM, CpG_MPs uses a hotspot extension algorithm and SMART uses Euclidean distance based on methylation similarity to take into account the spatial correlation of the CpG sites.
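The smoothing idea discussed above can be illustrated with a simple coverage-weighted kernel average. For each CpG site, methylated and total read counts of nearby sites are combined with tricube kernel weights, so that a poorly covered site borrows information from its well-covered neighbors. This is a simplified stand-in chosen for clarity: BSmooth, for example, uses local-likelihood smoothing, and the bandwidth and function name here are assumptions.

```python
def smooth_methylation(positions, meth_counts, coverages, bandwidth=100):
    """Kernel-smoothed methylation level at each CpG position.

    Neighbors within `bandwidth` bases contribute with tricube weights;
    weighting read counts (not ratios) makes high-coverage sites count more.
    """
    smoothed = []
    for center in positions:
        num = den = 0.0
        for pos, meth, cov in zip(positions, meth_counts, coverages):
            u = abs(pos - center) / bandwidth
            if u < 1.0:
                weight = (1 - u ** 3) ** 3   # tricube kernel
                num += weight * meth
                den += weight * cov
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed

# Hypothetical data: the second site is covered by a single methylated read.
positions = [0, 10, 20, 30]
meth_counts = [8, 1, 9, 7]
coverages = [10, 1, 10, 10]
raw_levels = [m / c for m, c in zip(meth_counts, coverages)]
smoothed_levels = smooth_methylation(positions, meth_counts, coverages, bandwidth=50)
```

The raw estimate at the low-coverage site is 1.0, whereas the smoothed estimate is pulled toward the level of its neighbors (about 0.8).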

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information does not depend merely on the hypothesis tests but on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage, despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher's exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey, based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

No. | Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
1 | methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
2 | eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
3 | BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016
4 | BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
6 | DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 4 June 2012 | 17 October 2016
5 | MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
7 | RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a)
8 | methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
9 | DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
10 | MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
11 | DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
12 | GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
13 | ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CC ANS | DMRs list (table) | 12 December 2014 | 29 September 2015
14 | HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016
15 | HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016
16 | QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
17 | CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
18 | SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
19 | COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
20 | DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
21 | swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014
22 | metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic/non-profit organizations on non-commercial projects.
GNU: general public license; LGPL: lesser general public license; CC ANS: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license; PSFL: Python Software Foundation license; CLI: command line interface.


additional filters to remove low-coverage CpG sites before estimating methylation.
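The bias introduced by pooling read counts can be seen in a small numerical example. The sketch below compares the pooled methylation level of a group (total methylated reads over total reads) with the mean of the per-sample ratios, and shows a simple coverage filter. The read counts, the threshold of five reads and the function names are hypothetical.

```python
def pooled_level(samples):
    """Methylation level after pooling (methylated, coverage) counts across samples."""
    return sum(meth for meth, _ in samples) / sum(cov for _, cov in samples)

def mean_of_ratios(samples):
    """Average of the per-sample methylation ratios; every sample weighs the same."""
    return sum(meth / cov for meth, cov in samples) / len(samples)

def filter_low_coverage(samples, min_coverage=5):
    """Drop observations supported by fewer than min_coverage reads."""
    return [(meth, cov) for meth, cov in samples if cov >= min_coverage]

# Hypothetical CpG site in two replicates: (methylated reads, total reads).
group = [(90, 100), (1, 10)]
pooled = pooled_level(group)       # dominated by the deeply sequenced replicate
averaged = mean_of_ratios(group)   # treats both replicates equally
kept = filter_low_coverage([(90, 100), (1, 10), (1, 2)])
```

Here the pooled estimate (91/110, about 0.83) is dominated by the replicate with 100 reads, whereas the average of the two ratios is 0.5.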

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds give users more flexibility, proper tuning of these parameters is necessary to get robust results. In addition to the list of DMRs, some of the approaches provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
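As an illustration of per-site P-value correction, the following minimal sketch applies the Benjamini-Hochberg procedure to a handful of CpG-level P-values. The choice of this procedure, the function name and the example P-values are assumptions made for illustration; individual tools may correct P-values differently.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-CpG P-values, for illustration only.
site_p = [0.001, 0.008, 0.039, 0.041, 0.60]
site_q = benjamini_hochberg(site_p)
significant = [q <= 0.05 for q in site_q]
```

In this toy example only the two smallest P-values remain below 0.05 after adjustment, although four of the five raw P-values are below 0.05.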

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When the sample size is small, DSS, MethylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of individual methods using the receiver operating characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
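The area under the ROC curve used in such comparisons can be computed directly from regional scores and ground-truth labels. The sketch below uses the rank-sum identity (the probability that a truly differential region scores higher than a non-differential one). The function name, the example P-values and the labels are hypothetical, and a known ground truth is assumed to be available, for instance from simulated data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    scores: larger means "more likely differentially methylated".
    labels: 1 for a truly differential region, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative region")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical regional P-values and ground-truth labels.
region_p = [0.001, 0.20, 0.03, 0.50, 0.04, 0.70]
truth = [1, 0, 1, 0, 0, 1]
# Smaller P-values should rank higher, so score each region by 1 - P.
auc = roc_auc([1.0 - p for p in region_p], truth)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of differential and non-differential regions.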

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers design experiments to generate data that are better suited for the community. In addition, this survey will guide developers in building new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD, Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore.
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.
40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.
41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.
42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.
44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.
45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.
46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.
47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.
48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.
49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.
50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.
51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.
52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.
53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.
54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.
55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.
57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.
58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.
59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.
60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.
61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.
62. Gelman A. Analysis of variance - why it is more important than ever. Ann Stat 2005;33(1):1–53.
63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.
64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.
65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.
66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.
67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.
68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.
69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.
70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.
71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.
72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.
73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.
74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.
75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.
76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.
77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.
78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.
79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.
80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.
81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.
82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.
83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.
84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.
85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.
86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.
87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.
88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.
89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.


(i.e. <50 bp), which might consist of a single CpG that is differentially methylated [68]. Thus, biological events involving a single CpG site might not be detected by the smoothing approaches. In addition, these approaches are not appropriate for biological systems whose true methylation levels of the CpG sites are not spatially correlated.

Beta-binomial-based approaches

Approaches in this category characterize the methylation read counts as a beta-binomial distribution. In the absence of any biological or technical variation, the number of methylated reads at a particular CpG site follows a binomial distribution, because sequencing reads over a CpG site can be either methylated or unmethylated. Whenever biological and technical variation are present in the data, methylation proportions of the CpG sites are assumed to follow a beta distribution. Therefore, in the presence of biological replicates, an appropriate statistical model for methylation analysis is the beta-binomial model, as it can take into account both sampling and biological variability.
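The effect of adding a beta layer on top of binomial sampling can be seen numerically. The sketch below evaluates a beta-binomial probability mass function parameterized by a mean methylation level mu and a dispersion phi, and compares its variance with the binomial variance at the same coverage. The parameterization (alpha = mu(1 - phi)/phi, beta = (1 - mu)(1 - phi)/phi), the function names and the example values are assumptions chosen for illustration, not the internals of any particular tool.

```python
from math import comb, exp, lgamma


def log_beta(x, y):
    """Logarithm of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_pmf(k, n, mu, phi):
    """P(k methylated reads out of n) for mean level mu and dispersion phi.

    Requires 0 < mu < 1 and 0 < phi < 1.
    """
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    return comb(n, k) * exp(log_beta(k + a, n - k + b) - log_beta(a, b))


# Hypothetical site: coverage 20, mean methylation 0.7, dispersion 0.1.
n, mu, phi = 20, 0.7, 0.1
pmf = [beta_binomial_pmf(k, n, mu, phi) for k in range(n + 1)]
total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))

binomial_variance = n * mu * (1.0 - mu)  # sampling variability only
expected_variance = binomial_variance * (1.0 + (n - 1) * phi)
```

With these values the binomial variance is 4.2 reads squared, whereas the beta-binomial variance is 12.18; the extra factor 1 + (n - 1)phi is the contribution of biological variability.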

Over the past few years, several beta-binomial-based approaches have been developed to identify DM, such as DSS [69], MOABS [70], RADMeth [71], methylSig [72], DSS-single [73], MACAU [74], DSS-general [75] and GetisDMR [76]. These approaches differ from each other in the way they estimate regression parameters, calculate P-values, estimate DMR boundaries, etc.

‘DSS’ is one of the approaches in this category that relies on a beta-binomial hierarchical model to identify DM using bisulfite sequencing data. In this model, the prior distribution is constructed from the whole genome, which is either methylated or unmethylated. True methylation proportions of the CpG sites among the replicates are then modeled using the beta distribution, parameterized by group mean and a dispersion parameter. The biological variability is captured by the beta distribution, whereas the sampling variability is captured by the binomial distribution. Variation across the methylation proportion of the CpG sites relative to the group mean is captured by the dispersion parameter, which is estimated by an empirical Bayes approach. When the sample size is small, a shrinkage approach is used to estimate the dispersion parameter to improve the overall performance. Differentially methylated CpG sites are determined by using P-values from the Wald test, which is performed by comparing the mean methylation levels between two groups. Lastly, candidate DMRs are defined by applying user-specified thresholds on DMR characteristics, among which are P-value, minimum length and minimum number of CpG sites.
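A Wald test of this kind can be sketched for a single CpG site as follows. This is a simplified illustration, not the DSS implementation: the dispersion phi is assumed to be already known for each group (the empirical Bayes shrinkage step is not shown), the group mean is taken as the pooled proportion of methylated reads, and all names and read counts are hypothetical.

```python
from math import erfc, sqrt


def group_mean_and_variance(counts, phi):
    """counts: list of (methylated_reads, total_reads), one pair per replicate.

    Returns the pooled methylation level and its variance under a
    beta-binomial model with a known dispersion phi (0 <= phi < 1).
    """
    total = sum(n for _, n in counts)
    mu = sum(x for x, _ in counts) / total
    # Var(x_i) = n_i * mu * (1 - mu) * (1 + (n_i - 1) * phi)
    var = sum(n * mu * (1.0 - mu) * (1.0 + (n - 1) * phi) for _, n in counts)
    return mu, var / total ** 2


def wald_test(group1, group2, phi1, phi2):
    """Two-sided Wald test for equal mean methylation at one CpG site."""
    mu1, var1 = group_mean_and_variance(group1, phi1)
    mu2, var2 = group_mean_and_variance(group2, phi2)
    se = sqrt(var1 + var2)
    if se == 0.0:
        raise ValueError("zero variance: both groups are fully (un)methylated")
    z = (mu1 - mu2) / se
    return z, erfc(abs(z) / sqrt(2.0))  # two-sided normal P-value


# Hypothetical read counts for three replicates per group.
case = [(18, 20), (17, 20), (19, 20)]
control = [(8, 20), (10, 20), (9, 20)]
z_stat, p_value = wald_test(case, control, phi1=0.05, phi2=0.05)
```

Because the variance grows with the dispersion, a larger phi yields a smaller test statistic for the same difference in means, which is how biological variability enters the test.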

The key contribution of the DSS approach is the shrinkage procedure that improves the dispersion parameter estimation. For this reason, this approach is particularly useful when the sample size is small. By applying the Wald test procedure, this approach takes into consideration the biological variation and sequencing coverage.

A more recent method, named ‘DSS-single’, is an improved version of the DSS approach, which can take into account the spatial correlation among the CpG sites across the genome. In addition, DSS-single considers the within-group variation without biological replicates by using the neighboring CpG sites as ‘pseudo-replicates’. Similar to DSS, DSS-single captures the technical variability using binomial distribution and the biological variability using beta distribution. The beta distribution is parameterized with the group mean and dispersion parameter. DSS-single estimates the group mean using a smoothing function and the dispersion parameter using an empirical Bayes procedure. Hypothesis testing is performed using the Wald test to identify the DMCs. Later, user-defined thresholds are applied to define the DMR boundaries and select candidate DMRs.

An even more recent variation of the DSS approach, named ‘DSS-general’, identifies differentially methylated loci (DML) from bisulfite sequencing data under general experimental design. DSS-general identifies DML by modeling the methylation count data for each locus using the beta-binomial regression with the ‘arcsine’ link function. The ‘arcsine’ link function is applied to perform a data transformation that decreases the dependency of the data variance on the mean and prepares it for the next step. Due to this data transformation, the regression coefficient and the variance matrix can be estimated by applying the generalized least square method, as opposed to the beta-binomial generalized linear model or logistic regression, which are limited when values are separable (e.g. values for unmethylated sites are close to 0, values for methylated sites are close to 1). Finally, the Wald test is used to perform hypothesis testing.
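The variance-stabilizing effect of an arcsine transformation can be checked numerically. The sketch below assumes the link takes the form g(p) = arcsin(2p - 1) and, for simplicity, pure binomial sampling with no dispersion; it computes the exact variance of the raw and the transformed methylation proportion at two different methylation levels. The function names and the chosen coverage are illustrative assumptions.

```python
from math import asin, comb


def arcsine_link(p):
    """Assumed form of the arcsine link: g(p) = arcsin(2p - 1)."""
    return asin(2.0 * p - 1.0)


def identity(p):
    return p


def binomial_variance_of(transform, n, p):
    """Exact variance of transform(k / n) when k ~ Binomial(n, p)."""
    pmf = [comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]
    vals = [transform(k / n) for k in range(n + 1)]
    mean = sum(w * v for w, v in zip(pmf, vals))
    return sum(w * (v - mean) ** 2 for w, v in zip(pmf, vals))


n = 200  # hypothetical read coverage
raw_low = binomial_variance_of(identity, n, 0.1)
raw_mid = binomial_variance_of(identity, n, 0.5)
arc_low = binomial_variance_of(arcsine_link, n, 0.1)
arc_mid = binomial_variance_of(arcsine_link, n, 0.5)
```

The variance of the raw proportion changes almost threefold between a methylation level of 0.1 and 0.5, whereas the variance of the transformed value stays close to 1/n at both levels, which is the property that makes a least-squares fit on the transformed scale reasonable.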

The key advantage of the DSS-general approach is that it is applicable to bisulfite sequencing data with multiple groups or covariates. In addition, it uses the ‘arcsine’ link function, which is more efficient than the other widely used ‘logit’ and ‘probit’ functions because it estimates the regression parameters in one iteration.

‘MOABS’ is another approach that relies on the beta-binomial assumption to identify DM. Similar to DSS, the prior distribution is constructed from the whole genome, resulting in a bimodal distribution. The posterior distribution follows a beta distribution, which is estimated using an empirical Bayes approach. When biological replicates are available, the posterior distribution is generated using the maximum likelihood approach. The significance of the DM between two samples is represented by a single metric named ‘credible methylation difference’, which incorporates both the biological and statistical significance of the DM. MOABS can also work with CHG or CHH methylation.

‘RADMeth’ is another analysis pipeline that relies on the beta-binomial assumption. RADMeth uses a beta-binomial regression approach with the ‘logit’ link function to model the methylation levels of the CpG sites across the samples. Regression parameters are estimated using a standard maximum likelihood approach. In the beta-binomial regression model, RADMeth incorporates the experimental factors using a model matrix. The DM of a particular site is determined by comparing two fitted regression models (i.e. reduced model without factors and full model with factors) using the log-likelihood ratio. Subsequently, P-values of the neighboring CpG sites are combined using the weighted Z-test (i.e. Stouffer-Liptak test [77]) to obtain the DMRs. The key contribution of this approach is the ability to analyze WGBS data in multiple factor experiments.
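The weighted Z-test used to combine neighboring P-values can be sketched as below. This minimal version assumes that the P-values are independent and one-sided; correlation between neighboring CpG sites, which matters in practice, is deliberately left out, and the function name and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def weighted_z_combine(p_values, weights=None):
    """Combine one-sided P-values with the weighted Z-test (Stouffer-Liptak).

    Each P-value must lie strictly between 0 and 1.
    """
    if weights is None:
        weights = [1.0] * len(p_values)
    std_normal = NormalDist()
    # Convert each P-value to a z-score: small P-values give large z.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    combined_z = numerator / sqrt(sum(w * w for w in weights))
    return 1.0 - std_normal.cdf(combined_z)


# Hypothetical P-values of two neighboring CpG sites.
combined_p = weighted_z_combine([0.05, 0.05])
```

Two moderately significant neighbors (P = 0.05 each) combine to a P-value of about 0.01, which illustrates how regional evidence accumulates across adjacent sites.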

‘MethylSig’ is another analysis pipeline that uses the beta-binomial model across the samples to identify either DMCs or DMRs. The pipeline begins with taking the number of Cs and Ts as input. The approach uses the beta-binomial model to estimate the methylation levels at each CpG site or region, which involves the two following steps: (i) estimate the dispersion parameter for each CpG site or region, which accounts for biological variation among the samples within a group, and (ii) calculate the group methylation level at each CpG site or region using the estimated dispersion parameters. In each step, local information can be incorporated from nearby CpG sites or regions to increase statistical power. The significance level of the methylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when the sample size is small. MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance.

‘MACAU’ is based on a binomial mixed model (BMM) that takes into account the population structure of a data set. This model is a generalized beta-binomial model consisting of an extra term to model the population structure. In the absence of that extra term, this model can be reduced to a beta-binomial model. In this approach, the prior distribution is constructed from a BMM, whereas the posterior distribution is constructed from a log-normal distribution. Model parameters are estimated by using a Markov chain Monte Carlo (MCMC) algorithm-based approach. Hypothesis testing is performed by using the Wald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that it can add a predictor variable of interest in the model to check the association with any genetic background. In addition to considering biological variability among the replicates and the sampling variability among the sequencing reads, this method also takes into consideration the population variability. Furthermore, it can be applied to both WGBS and RRBS data sets.

‘GetisDMR’, a recent beta-binomial-based approach, identifies variable-size DMRs directly from WGBS data by using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatial clusters (hotspots). By incorporating this statistic into DM analysis, GetisDMR accounts for spatial correlation among the methylation levels of the CpG sites, along with the biological and sampling variability. When biological replicates are available, beta-binomial regression with logistic link function is used to model the methylation level of each CpG site. Model parameters are estimated by using the maximum likelihood function. Hypothesis testing is performed by using the likelihood ratio test. In the absence of biological replicates, methylation levels are modeled by using binomial distribution, and hypothesis testing is performed by using FET. P-values from the hypothesis testing are further used to calculate z-scores. Finally, a local Getis-Ord statistic is used, based on the z-scores, to identify DMRs using the information from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e. z-scores) to compute a score of the nonrandom association between a data point and its neighbors, where a positive score shows a positive association and a negative score shows a negative association. This statistic is then used to identify data regions with points that exhibit nonrandom associations (i.e. DMRs).
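The local Getis-Ord statistic can be illustrated on a short sequence of per-CpG z-scores. The sketch below uses the standard Gi* formula with binary weights over a fixed window of neighboring sites; the window definition, the function name and the example z-scores are assumptions for illustration, and the weighting scheme used by GetisDMR itself may differ.

```python
from math import sqrt


def local_getis_ord(values, radius):
    """Gi* score for every position, using binary weights within +/- radius.

    Assumes the values are not all identical and that each window is
    smaller than the whole sequence.
    """
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum(v * v for v in values) / n - mean * mean)
    scores = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        w_sum = hi - lo  # binary weights: sum of w equals sum of w^2
        local_sum = sum(values[lo:hi])
        denom = s * sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append((local_sum - mean * w_sum) / denom)
    return scores


# Hypothetical z-scores along ten consecutive CpG sites.
z_scores = [0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0]
gi_star = local_getis_ord(z_scores, radius=1)
hotspot_center = gi_star.index(max(gi_star))
```

The score peaks at the center of the run of large z-scores and is negative at sites surrounded by small values, so thresholding Gi* delineates a region whose length follows the data rather than a preset window.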

One of the primary strengths of GetisDMR is that it can detect DMRs with variable length instead of depending on user-specified threshold parameters. It can take into account the spatial correlation between the neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model. Furthermore, it can work with multiple groups with or without biological replicates. One drawback of this approach is that it cannot work with enriched regions such as RRBS data.

Beta-binomial-based approaches are useful because they take into account both sampling variability among the read counts and biological variability among the replicates. Furthermore, these approaches are able to identify DM at single-base resolution from low CpG-density regions (e.g. TFBS). On the other hand, most of the beta-binomial-based approaches (except DSS-single, MACAU and GetisDMR) do not take into account the spatial correlation between the methylation levels of the CpG sites.

Hidden Markov model-based approaches

Approaches in this category use a hidden Markov model (HMM) to identify differentially methylated patterns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites as methylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methylation values. Transition probabilities among the methylation states represent the distance distribution among the DMCs, whereas emission probabilities represent the likelihood of DM for the CpG sites. High transition probabilities and low transition probabilities are used to model the neighboring CpG sites that have high similarities and low similarities within their methylation levels, respectively. Parameters are usually estimated by using established learning algorithms, whereas potential DMRs are identified using different statistical approaches.
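To make the three-state formulation concrete, the following sketch decodes the most probable state path for a sequence of per-CpG methylation differences between two groups. It is a generic illustration rather than any of the surveyed methods: Viterbi decoding is used here in place of the EM or MCMC procedures those methods rely on, the emissions are assumed to be Gaussian, and all parameter values and names are hypothetical.

```python
from math import log

STATES = ("hyper", "hypo", "none")
# Hypothetical emission means for the between-group methylation difference.
EMISSION_MEAN = {"hyper": 0.5, "hypo": -0.5, "none": 0.0}
EMISSION_SD = 0.15
STAY_PROB = 0.9  # high self-transition: neighboring CpG sites tend to agree


def log_emission(state, diff):
    # Gaussian log-density up to a constant shared by all states.
    z = (diff - EMISSION_MEAN[state]) / EMISSION_SD
    return -0.5 * z * z


def log_transition(prev, cur):
    if prev == cur:
        return log(STAY_PROB)
    return log((1.0 - STAY_PROB) / (len(STATES) - 1))


def viterbi(diffs):
    """Most probable state path for a sequence of per-CpG differences."""
    score = {s: log(1.0 / len(STATES)) + log_emission(s, diffs[0]) for s in STATES}
    back = []
    for d in diffs[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(cur, d))
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical methylation differences along ten consecutive CpG sites.
diffs = [0.0, 0.05, 0.5, 0.6, 0.55, 0.0, -0.05, -0.5, -0.6, 0.0]
states = viterbi(diffs)
```

Runs of consecutive sites assigned to the same non-neutral state are the candidate DMRs; the self-transition probability controls how readily the path switches state between neighboring sites.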

One of the approaches in this category, named ‘ComMet’ [64], included in the Bisulfighter methylation analysis suite [78, 79], combines all the samples within a group into one sample and identifies the DMRs by comparing a pair of two samples. This method captures the probability distribution of distances between the neighboring DMCs and adjusts the DMC chaining criteria automatically for each data set. Transition probabilities are estimated using an expectation maximization algorithm, whereas emission probabilities are estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated by incorporating an unsupervised learning algorithm. DMRs are identified by using a dynamic programming algorithm.

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. It takes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. On the other hand, one of the limitations of this approach is that it does not take into account the biological variation across replicates, which might lead to a higher number of false positives in the results [14, 43, 46].

Another approach in this category is ‘HMM-Fisher’ [80], which estimates the methylation status of the CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fisher models both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites using transition probabilities. HMM-Fisher estimates the transition probabilities using a Dirichlet distribution, whereas emission probabilities are computed using a truncated normal distribution. After estimating the methylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identified using FET. Identified DMCs are further grouped into DMRs if the distance between the CpG sites is <100 bases. Non-consecutive CpG sites are reported as DMCs in the output.
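The last two steps (FET on state counts, then chaining DMCs that lie <100 bases apart) can be sketched as follows. This is an illustrative reimplementation in plain Python, not HMM-Fisher's own code; the function names and the example positions are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def group_dmcs(positions, max_gap=100):
    """Chain sorted DMC positions into groups when neighbors are < max_gap bases apart."""
    if not positions:
        return []
    groups, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] < max_gap:
            current.append(pos)
        else:
            groups.append(current)
            current = [pos]
    groups.append(current)
    return groups


if __name__ == "__main__":
    # Hypothetical site: methylated in 5/5 samples of group 1 and 0/5 of group 2
    print(fisher_exact_two_sided(5, 0, 0, 5))
    print(group_dmcs([100, 150, 240, 500, 560, 900]))
```

Groups with more than one site correspond to candidate DMRs, while single-site groups would be reported as isolated DMCs.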

One of the major contributions of HMM-Fisher is that it can identify DMRs of variable size instead of depending on user-defined boundary thresholds. It takes the biological variation among the replicates into account and can provide both DMCs and DMRs as output. It can also be used to identify sample-wise methylation patterns.

‘HMM-DM’ [81] is another approach that uses HMM to identify DM. HMM-DM directly estimates the DM states of the CpG sites for each sample across the groups. In this approach, the transition probability of each CpG site only depends on the methylation state of the immediately preceding CpG site. Like HMM-Fisher and ComMet, the transition probabilities are estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable-size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs with variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from their neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] for the purpose of identifying DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites of the regions. The probability that a sample is methylated at a specific location is calculated by taking the ratio of the methylation level of that sample and the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in minor samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make the approach less sensitive to such outliers. Finally, a region is considered differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
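The unweighted entropy step described above can be written down directly. The sketch below covers only the original Shannon entropy formula; it deliberately omits QDMR's Tukey biweight weighting and its probability-model cutoff, and the function name and sample values are assumptions.

```python
import math


def methylation_entropy(levels):
    """Shannon entropy (in bits) of one region's methylation levels across samples.

    Each sample's probability is its methylation level divided by the total
    methylation level across all samples.
    """
    total = sum(levels)
    if total <= 0:
        raise ValueError("at least one sample must have a positive methylation level")
    probs = [m / total for m in levels]
    return -sum(p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    uniform = [0.8, 0.8, 0.8, 0.8]   # same level in every sample
    specific = [0.9, 0.0, 0.0, 0.0]  # methylated in a single sample only
    print(methylation_entropy(uniform), methylation_entropy(specific))
```

A region with identical levels in all samples reaches the maximum entropy (log2 of the number of samples), whereas a region methylated in only one sample has entropy zero, matching the statement that lower entropy represents higher methylation difference.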

An improved approach in this category, ‘CpG_MPs’ [51], has been proposed by the same research group; it can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites, based on their normalized methylation levels, into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs using a combinatorial algorithm with Shannon entropy. At first, the identified methylated and unmethylated regions are mapped to the reference genome, and then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.
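The seed-and-extend idea of the second module can be sketched as below. This is a simplified illustration rather than the CpG_MPs implementation: the state labels ("M", "pM", "U"), the minimum seed length and the function name are hypothetical, and the sketch handles only methylated hotspots.

```python
def find_hotspots(states, seed="M", partial="pM", min_seed=3):
    """Return (start, end) index spans of extended hotspots, end exclusive.

    A hotspot is seeded by at least min_seed consecutive CpGs in the seed
    state and then extended upstream and downstream over CpGs that are
    either in the seed state or in the matching partial state.
    """
    regions = []
    i, n = 0, len(states)
    while i < n:
        if states[i] != seed:
            i += 1
            continue
        j = i
        while j < n and states[j] == seed:
            j += 1
        if j - i >= min_seed:
            start, end = i, j
            while start > 0 and states[start - 1] in (seed, partial):
                start -= 1  # extend upstream
            while end < n and states[end] in (seed, partial):
                end += 1    # extend downstream
            regions.append((start, end))
            i = end
        else:
            i = j  # run too short to seed a hotspot
    return regions


if __name__ == "__main__":
    cpg_states = ["U", "pM", "M", "M", "M", "pM", "U", "M", "M", "U"]
    print(find_hotspots(cpg_states))
```

Running the same function with the unmethylated and partially unmethylated labels as seed and partial states would give the corresponding unmethylated hotspots.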

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds to identify DMRs; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measurement among the identified regions.

A more recent approach, ‘SMART’ [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites by estimating the methylation level of the sites based on Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation levels of that group are significantly higher (lower) than the average methylation levels of all samples, as determined by a one-sample t-test.

The major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and the spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests, such as FET, t-test and ANOVA, to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts by taking aligned read counts (e.g. output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET and ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. The major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins by taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
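The extension step, in which nearby potential DMRs are merged before being retested, can be sketched as follows. The function name, the half-open coordinate convention and the example threshold are assumptions made for illustration; swDMR's own extension function is not reproduced here.

```python
def merge_windows(windows, max_gap):
    """Merge (start, end) windows whose separating gap is smaller than max_gap.

    Windows are half-open intervals in base pairs; overlapping windows have a
    negative gap and are therefore always merged.
    """
    merged = []
    for start, end in sorted(windows):
        if merged and start - merged[-1][1] < max_gap:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    significant = [(0, 100), (50, 150), (400, 500), (520, 600)]
    print(merge_windows(significant, max_gap=50))
```

In a full pipeline, each merged interval would then be passed back to the chosen statistical test, so the final P-value reflects the new boundaries.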

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG density regions where methylation has sharp changes, such as TFBS that can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or when it does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.
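A single segmentation step can be illustrated with an exhaustive search for the contiguous window that maximizes a scaled mean-difference score. This is a simplified stand-in, assuming a score of |sum of differences| / sqrt(window length); metilene's actual scoring function and its Kolmogorov–Smirnov stopping rule are not reproduced, and the function name and example values are hypothetical.

```python
import math


def best_window(diffs, min_cpgs=3):
    """Find the half-open span (i, j) of CpG indices with the highest score.

    diffs holds the per-CpG mean methylation difference between two groups.
    Windows with fewer than min_cpgs sites are never considered.
    """
    prefix = [0.0]
    for d in diffs:
        prefix.append(prefix[-1] + d)  # prefix sums give O(1) window sums

    best_span, best_score = None, 0.0
    n = len(diffs)
    for i in range(n):
        for j in range(i + min_cpgs, n + 1):
            score = abs(prefix[j] - prefix[i]) / math.sqrt(j - i)
            if score > best_score:
                best_span, best_score = (i, j), score
    return best_span, best_score


if __name__ == "__main__":
    mean_diffs = [0.0, 0.01, -0.02, 0.5, 0.55, 0.6, 0.52, 0.0, 0.02]
    print(best_window(mean_diffs))
```

The minimum-size parameter plays the role of the predefined CpG threshold mentioned above: raising it prevents short windows from being reported, and lowering it admits them.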

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using a beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features, such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: methylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, methylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.

Spatial correlation is another factor to consider; it provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.


Table 1. Summary of the important characteristics of the 22 surveyed approaches

No. | Method and reference | Concept used | Protocol | Primary purpose | Total citations | Citation/year
1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate | 175 | 43.75
2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs | 28 | 8
3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates | 156 | 39
4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction | 62 | 18
6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples | 43 | 16.1
5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates | 49 | 18.4
7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs | 31 | 13.3
8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs | 42 | 17.4
9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates | 15 | 12
10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure | 8 | 8
11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs | 3 | 3
12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly | 0 | 0
13 | ComMet [78] | HMM | Both | Identify DMRs | 24 | 8.7
14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns | 4 | 4
15 | HMM-DM [81] | HMM | Both | Identify DMRs | 4 | 4
16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs | 61 | 10.7
17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns | 30 | 7.2
18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks | 9 | 9
19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands | 27 | 7.7
20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs | 31 | 12.4
21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates | 4 | 3.2
22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples | 0 | 0

Columns 5–10 of the table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage and identify de novo region) mark, for each method, whether the method considers the characteristic, does not consider it, or whether the characteristic is not applicable; these marks are not reproduced here. For the 9th column, a positive mark means that the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column (‘identify de novo regions’), the marks indicate whether the method can or cannot identify de novo regions. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.


eDMR uses autocorrelation of the methylation data, HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM, CpG_MPs uses a hotspot extension algorithm and SMART uses Euclidean distance based on methylation similarity to take into account spatial correlation of the CpG sites.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information is not merely dependent on the hypothesis tests but dependent on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds, such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher’s exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey, based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

No. | Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
1 | methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
2 | eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
3 | BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016
4 | BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
6 | DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 04 June 2012 | 17 October 2016
5 | MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
7 | RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a)
8 | methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
9 | DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
10 | MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
11 | DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
12 | GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
13 | ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CC ANS | DMRs list (table) | 12 December 2014 | 29 September 2015
14 | HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016
15 | HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016
16 | QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
17 | CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
18 | SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
19 | COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
20 | DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
21 | swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014
22 | metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic non-profit organizations on non-commercial projects.
GNU, general public license; LGPL, lesser general public license; CC ANS, creative commons attribution-NonCommercial-ShareAlike 3.0 unported license; PSFL, python software foundation license; CLI, command line interface.


additional filters to remove low coverage CpG sites before estimating methylation.
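The bias introduced by pooling read counts can be seen in a two-sample example. The sketch below is illustrative only; the function names and the read counts are hypothetical.

```python
def pooled_level(samples):
    """Methylation level after pooling (methylated, total) read counts."""
    methylated = sum(m for m, _ in samples)
    total = sum(n for _, n in samples)
    return methylated / total


def mean_level(samples):
    """Average of per-sample methylation ratios, ignoring coverage."""
    return sum(m / n for m, n in samples) / len(samples)


if __name__ == "__main__":
    # One deeply sequenced sample (90/100 reads methylated) and one
    # shallow sample (1/10 reads methylated) from the same group
    group = [(90, 100), (1, 10)]
    print(pooled_level(group), mean_level(group))
```

The pooled estimate is pulled toward the deeply sequenced sample, whereas the unweighted mean treats both replicates equally and discards coverage; neither behavior is free of trade-offs, which is why approaches that model coverage explicitly handle it inside the statistical model.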

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
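As an illustration of P-value correction, the sketch below implements the Benjamini–Hochberg procedure, which is one common way to control the FDR. Whether a given surveyed tool uses exactly this procedure is an assumption; the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest P-value down so that monotonicity can be enforced
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    site_pvalues = [0.01, 0.04, 0.03, 0.005]
    print(benjamini_hochberg(site_pvalues))
```

The same function can be applied to per-site or per-region P-values; the choice of which set of tests to correct over is what distinguishes the error-control strategies listed above.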

Identifying the most suitable approach among all those available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When the sample size is small, DSS, MethylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

A comparison of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between sensitivity and specificity for individual methods using the receiver operating characteristic (ROC) curve, based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Owing to the complex nature of methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized input format, building a benchmark for assessing the efficiency of these approaches is a challenging task and is beyond the scope of this survey.

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers design experiments that generate data better suited to the community. In addition, this survey will guide developers toward new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the most suitable approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Owing to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.

2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.

3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.

4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.

5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.

6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.

7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.

8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.

9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.

10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 2012;13(7):484–92.

11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.

12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.

13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.

14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.

15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.

16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.

17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.

18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.

19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.

20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.

21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.

22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.

23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.

24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.

25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore

26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.

27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.

28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.

29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.

30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.

31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.

32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.

33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.

34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.

35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.

36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.

37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.

38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.

39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.

40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.

41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.

42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.

44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.

45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.

46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.

47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.

48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.

49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.

50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.

51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.

52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.

53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.

54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.

55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.

56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.

57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.

58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.

59. Pearson ES, Hartley HO. Biometrika Tables for Statisticians (vol. 2). Biometrika Trust, page 385, 1976.

60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.

61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.

62. Gelman A. Analysis of variance: why it is more important than ever. Ann Stat 2005;33(1):1–53.

63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.

64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.

65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.

66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.

67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.

68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.

69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.

70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.

71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.

72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.

73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.

74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.

75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.

76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.

77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.

78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.

79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.

80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.

81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.

82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.

83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.

84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.

86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PLoS One 2015;10(7):e0132866.

87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.

88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.

89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.


methylation difference is calculated using the likelihood ratio test. Similar to DSS, MethylSig is useful when the sample size is small. MethylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance.

‘MACAU’ is based on a binomial mixed model (BMM) that takes into account the population structure of a data set. This model is a generalized beta-binomial model with an extra term that models the population structure. In the absence of that extra term, the model reduces to a beta-binomial model. In this approach, the prior distribution is constructed from a BMM, whereas the posterior distribution is constructed from a log-normal distribution. Model parameters are estimated by using a Markov chain Monte Carlo (MCMC) algorithm-based approach. Hypothesis testing is performed by using the Wald test. Finally, DMRs are constructed by merging the DMCs using empirical thresholds.

One advantage of this approach is that it can add a predictor variable of interest to the model to check the association with any genetic background. In addition to considering biological variability among the replicates and the sampling variability among the sequencing reads, this method also takes into consideration the population variability. Furthermore, it can be applied to both WGBS and RRBS data sets.

‘GetisDMR’, a recent beta-binomial-based approach, identifies variable-size DMRs directly from WGBS data by using a local Getis-Ord statistic, which is commonly used to identify statistically significant spatial clusters (hotspots). By incorporating this statistic into DM analysis, GetisDMR accounts for spatial correlation among the methylation levels of the CpG sites, along with the biological and sampling variability. When biological replicates are available, beta-binomial regression with a logistic link function is used to model the methylation level of each CpG site. Model parameters are estimated by using the maximum likelihood function. Hypothesis testing is performed by using the likelihood ratio test. In the absence of biological replicates, methylation levels are modeled by using a binomial distribution, and hypothesis testing is performed by using FET. P-values from the hypothesis testing are further used to calculate z-scores. Finally, a local Getis-Ord statistic is used, based on the z-scores, to identify DMRs using the information from the neighboring CpG sites. The Getis-Ord statistic uses the distribution of the data (i.e. z-scores) to compute a score of the nonrandom association between a data point and its neighbors, where a positive score shows a positive association and a negative score shows a negative association. This statistic is then used to identify data regions with points that exhibit nonrandom associations (i.e. DMRs).
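To make the last step concrete, the sketch below computes the standard local Getis-Ord statistic (Gi*) over per-site z-scores. It is an illustration, not the GetisDMR implementation: the binary distance-window weights, the `bandwidth` parameter and the function name `getis_ord_gi_star` are assumptions made here, and the conversion from P-values to z-scores is taken as already done.

```python
import math


def getis_ord_gi_star(z, positions, bandwidth):
    """Local Getis-Ord Gi* score for every site.

    z         : per-site z-scores (assumed to be precomputed from P-values)
    positions : genomic coordinates of the sites, same order as z
    bandwidth : sites within this distance (inclusive) count as neighbors;
                binary weights are an assumption of this sketch
    """
    n = len(z)
    mean = sum(z) / n
    # Global standard deviation of the z-scores (population form).
    s = math.sqrt(max(sum(v * v for v in z) / n - mean * mean, 0.0))
    scores = []
    for i in range(n):
        # Weight 1 for the site itself and its neighbors, 0 otherwise.
        w = [1.0 if abs(positions[j] - positions[i]) <= bandwidth else 0.0
             for j in range(n)]
        sum_w = sum(w)
        sum_w2 = sum(x * x for x in w)
        numerator = sum(wj * zj for wj, zj in zip(w, z)) - mean * sum_w
        denominator = s * math.sqrt(max(n * sum_w2 - sum_w ** 2, 0.0) / (n - 1))
        # Degenerate cases (constant data, or a window covering every site).
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


if __name__ == "__main__":
    z_scores = [0, 0, 0, 3, 3, 3, 0, 0, 0, 0]  # made-up values
    coords = [10 * k for k in range(10)]
    print(getis_ord_gi_star(z_scores, coords, bandwidth=10))
```

Runs of sites with large positive (or large negative) scores would be candidate regions; how the scores are thresholded and merged is tool-specific and not shown.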

One of the primary strengths of GetisDMR is that it can detect DMRs of variable length instead of depending on user-specified threshold parameters. It can take into account the spatial correlation between the neighboring CpG sites. Additionally, it can incorporate additional confounding factors into the model. Furthermore, it can work with multiple groups, with or without biological replicates. One drawback of this approach is that it cannot work with enriched regions such as RRBS data.

Beta-binomial-based approaches are useful because they take into account both sampling variability among the read counts and biological variability among the replicates. Furthermore, these approaches are able to identify DM at single-base resolution from low CpG-density regions (e.g. TFBS). On the other hand, most of the beta-binomial-based approaches (except DSS-single, MACAU and GetisDMR) do not take into account the spatial correlation between the methylation levels of the CpG sites.

Hidden Markov model-based approaches

Approaches in this category use a hidden Markov model (HMM) to identify differentially methylated patterns from bisulfite sequencing data. These approaches model the methylation levels of the CpG sites as methylation states (i.e. hypermethylation, hypomethylation and no change) instead of continuous methylation values. Transition probabilities among the methylation states represent the distance distribution among the DMCs, whereas emission probabilities represent the likelihood of DM for the CpG sites. High and low transition probabilities are used to model neighboring CpG sites whose methylation levels have high and low similarity, respectively. Parameters are usually estimated by using established learning algorithms, whereas potential DMRs are identified using different statistical approaches.

One of the approaches in this category, named ‘ComMet’ [64], included in the Bisulfighter methylation analysis suite [78, 79], combines all the samples within a group into one sample and identifies the DMRs by comparing a pair of samples. This method captures the probability distribution of distances between the neighboring DMCs and adjusts the DMC chaining criteria automatically for each data set. Transition probabilities are estimated using an expectation maximization algorithm, whereas emission probabilities are estimated from a beta-binomial mixture model. Parameters of the beta-binomial model are estimated by incorporating an unsupervised learning algorithm. DMRs are identified by using a dynamic programming algorithm.

One of the advantages of ComMet is that it does not require biological replicates to identify DMRs. It takes into account the sequencing coverage and the spatial distribution of the neighboring CpG sites. On the other hand, one of the limitations of this approach is that it does not take into account the biological variation across replicates, which might lead to a higher number of false positives in the results [14, 43, 46].

Another approach in this category is ‘HMM-Fisher’ [80], which estimates the methylation status of the CpG sites for each sample instead of combining all the samples. Similar to ComMet, HMM-Fisher models both the similarity and dissimilarity of the methylation levels of the neighboring CpG sites using transition probability. HMM-Fisher estimates the transition probabilities using a Dirichlet distribution, whereas emission probabilities are computed using a truncated normal distribution. After estimating the methylation levels of all the CpG sites for each sample, differentially methylated CpG sites are identified using FET. Identified DMCs are further grouped into DMRs if the distance between the CpG sites is <100 bases. Non-consecutive CpG sites are reported as DMCs in the output.
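The final grouping step can be sketched as follows. This is a simplified illustration rather than the HMM-Fisher code: it chains DMC coordinates on one chromosome purely by distance, ignores the direction of the methylation change, and the function name `group_dmcs` and the example coordinates are hypothetical.

```python
def group_dmcs(positions, max_gap=100):
    """Chain DMC coordinates into DMRs when adjacent DMCs are < max_gap apart.

    Returns (dmrs, singletons): dmrs is a list of (start, end, n_sites)
    tuples, and singletons lists DMCs that were not merged into any region.
    """
    dmrs, singletons = [], []

    def flush(cluster):
        # A cluster of one site stays a DMC; two or more sites form a DMR.
        if len(cluster) == 1:
            singletons.append(cluster[0])
        elif cluster:
            dmrs.append((cluster[0], cluster[-1], len(cluster)))

    cluster = []
    for pos in sorted(positions):
        if cluster and pos - cluster[-1] >= max_gap:
            flush(cluster)
            cluster = []
        cluster.append(pos)
    flush(cluster)
    return dmrs, singletons


if __name__ == "__main__":
    dmc_positions = [100, 150, 180, 500, 900, 950]  # made-up coordinates
    print(group_dmcs(dmc_positions))
```

Because the region boundaries follow from where the DMCs fall, the resulting DMRs have variable size, although the 100-base gap itself remains a fixed parameter.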

One of the major contributions of HMM-Fisher is that it can identify DMRs of variable size instead of depending on user-defined boundary thresholds. It takes the biological variation among the replicates into account and can provide both DMCs and DMRs as output. It can also be used to identify sample-wise methylation patterns.

‘HMM-DM’ [81] is another approach that uses HMM to identify DM. HMM-DM directly estimates the DM states of the CpG sites for each sample across the groups. In this approach, the transition probability of each CpG site depends only on the methylation state of the immediately preceding CpG site. Like HMM-Fisher, the transition probabilities are estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable-size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs of variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from their neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] for the purpose of identifying DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites of the regions. The probability that a sample is methylated at a specific location is calculated by taking the ratio of the methylation level of that sample to the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents a higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in minor samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make the approach less sensitive to such outliers. Finally, a region is called differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined by using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
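The unweighted ("original") entropy described above can be sketched in a few lines. This illustration covers only the plain Shannon entropy of the per-sample methylation ratios; the Tukey biweight weighting and the probability-model cutoff that QDMR adds are not reproduced, and the function name and input values are hypothetical.

```python
import math


def methylation_entropy(levels):
    """Shannon entropy (in bits) of one region's methylation across samples.

    levels: average methylation level of the region in each sample (0..1).
    Each sample's probability is its level divided by the total over samples.
    """
    total = sum(levels)
    if total == 0:
        # Region unmethylated in every sample: treated here as maximally
        # uniform (an assumption of this sketch).
        return math.log2(len(levels))
    probs = [m / total for m in levels]
    return sum(-p * math.log2(p) for p in probs if p > 0)


if __name__ == "__main__":
    uniform = [0.5, 0.5, 0.5, 0.5]    # same level in every sample
    specific = [0.9, 0.0, 0.0, 0.0]   # methylated in one sample only
    print(methylation_entropy(uniform), methylation_entropy(specific))
```

A region methylated equally in all samples reaches the maximum entropy (log2 of the number of samples), while a region methylated in a single sample has entropy zero, which is why a low value flags a candidate DMR.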

An improved approach in this category, ‘CpG_MPs’ [51], has been proposed by the same research group; it can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites, based on their normalized methylation levels, into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs by using a combinatorial algorithm with Shannon entropy. At first, the identified methylated and unmethylated regions are mapped to the reference genome, and then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measurement among the identified regions.

A more recent approach, ‘SMART’ [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites by estimating the methylation level of the sites based on Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation levels of that group are significantly higher (lower) than the average methylation levels of all samples, as determined by a one-sample t-test.

The major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and the spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests, such as FET, t-test and ANOVA, to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts with taking aligned read counts (e.g. output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated by using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (default: 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET or ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. The major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.
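For the pairwise case, FET on methylated and unmethylated counts can be sketched as follows. This is an illustrative standard-library implementation, not the code of DMAP or any other surveyed tool; the function name and the example counts are assumptions made for the sketch.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]].

    Rows are the two samples; columns are methylated and unmethylated read counts.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x methylated reads in sample 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Hypothetical region: 18/20 methylated reads in sample A versus 5/20 in sample B.
print(fisher_exact_two_sided(18, 2, 5, 15))
```

Because the test operates on pooled counts per sample, it needs no replicates, which is also why it cannot account for biological variability within a group.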

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins with taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ² test, Wilcoxon test, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
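The windowing and merging steps described above can be sketched in a few lines. This is a simplified illustration, not swDMR's implementation: the adjusted P-values are supplied directly rather than computed, the re-testing of merged regions is omitted, and all names and thresholds (`alpha`, `max_gap`, window size and step) are hypothetical.

```python
def sliding_windows(start, end, size, step):
    """Yield overlapping (start, end) windows of equal length covering [start, end)."""
    pos = start
    while pos < end:
        yield (pos, min(pos + size, end))
        pos += step


def merge_significant(windows, adj_pvalues, alpha=0.05, max_gap=100):
    """Keep windows with adjusted P-value < alpha and merge those closer than max_gap."""
    kept = sorted(w for w, p in zip(windows, adj_pvalues) if p < alpha)
    merged = []
    for s, e in kept:
        if merged and s - merged[-1][1] < max_gap:
            # Overlapping or nearby window: extend the previous candidate region.
            merged[-1] = (merged[-1][0], max(merged[-1][1], e))
        else:
            merged.append((s, e))
    return merged


windows = list(sliding_windows(0, 1000, size=200, step=100))
# Hypothetical FDR-adjusted P-values, one per window.
adj_p = [0.01, 0.03, 0.60, 0.70, 0.80, 0.02, 0.04, 0.90, 0.90, 0.90]
candidates = merge_significant(windows, adj_p)
print(candidates)
```

Both the window size and the merging distance are user-supplied, which is the source of the dependence on empirical thresholds noted for this category.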

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG density regions where methylation has sharp changes, such as TFBS that can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples, with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or when it does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.
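The recursive search for a window with the strongest mean difference signal can be illustrated with a much simplified sketch. It assumes a list of per-CpG differences between group means and scores a window by |sum| / sqrt(length); the fixed score threshold `min_score` is a stand-in for metilene's Kolmogorov–Smirnov-based stopping rule, and none of the names below come from metilene itself.

```python
from itertools import accumulate
from math import sqrt


def best_window(diffs, min_cpgs):
    """Return (score, i, j) for the window diffs[i:j] maximizing |sum| / sqrt(length)."""
    prefix = [0.0] + list(accumulate(diffs))
    best = (0.0, 0, 0)
    n = len(diffs)
    for i in range(n):
        for j in range(i + min_cpgs, n + 1):
            score = abs(prefix[j] - prefix[i]) / sqrt(j - i)
            if score > best[0]:
                best = (score, i, j)
    return best


def segment(diffs, min_cpgs=3, min_score=0.5, offset=0):
    """Recursively split a region around its strongest window.

    Returns candidate regions as (start, end) index pairs into the original list.
    """
    if len(diffs) < min_cpgs:
        return []  # too few CpGs left: stop segmenting
    score, i, j = best_window(diffs, min_cpgs)
    if score < min_score or i == j:
        return []  # no window with a sufficiently strong signal
    left = segment(diffs[:i], min_cpgs, min_score, offset)
    right = segment(diffs[j:], min_cpgs, min_score, offset + j)
    return left + [(offset + i, offset + j)] + right


# Hypothetical mean methylation differences between two groups at 10 CpG sites.
diffs = [0.0, 0.01, -0.02, 0.4, 0.5, 0.45, 0.42, 0.0, -0.01, 0.02]
print(segment(diffs))
```

The `min_cpgs` parameter plays the role of the minimum segment size: raising it can hide short regions, while lowering it admits windows supported by very few sites.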

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using a beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: methylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, methylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.
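A small numerical illustration shows why combining read counts discards biological variability. The counts below are invented for the example, and the helper names are hypothetical.

```python
from statistics import stdev


def pooled_level(replicates):
    """Methylation level after summing (methylated, total) counts over replicates."""
    meth = sum(m for m, _ in replicates)
    total = sum(n for _, n in replicates)
    return meth / total


def per_replicate_levels(replicates):
    """Methylation level of each replicate, kept separate."""
    return [m / n for m, n in replicates]


# One CpG site in two hypothetical groups of three replicates: (methylated, total).
consistent = [(10, 20), (10, 20), (10, 20)]
variable = [(2, 20), (10, 20), (18, 20)]

for name, group in (("consistent", consistent), ("variable", variable)):
    print(name, pooled_level(group), stdev(per_replicate_levels(group)))
```

Both groups pool to the same level (0.5), so a test on pooled counts cannot tell them apart, whereas the per-replicate levels reveal that the second group is far more variable.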

Spatial correlation is another factor to consider, which provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.


Table 1. Summary of the important characteristics of the 22 surveyed approaches

Method and reference | Concept used | Protocol | Primary purpose | Total citations | Citations/year
methylKit [54] | Logistic regression | Both | Identify DMCs and annotate | 175 | 43.75
eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs | 28 | 8
BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates | 156 | 39
BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction | 62 | 18
DSS [69] | Beta-binomial | Both | Identify DMLs for small samples | 43 | 16.1
MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates | 49 | 18.4
RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs | 31 | 13.3
methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs | 42 | 17.4
DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates | 15 | 12
MACAU [74] | Beta-binomial | Both | Identify DM using population structure | 8 | 8
DSS-general [75] | Beta-binomial | RRBS | Identify DMLs | 3 | 3
GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly | 0 | 0
ComMet [78] | HMM | Both | Identify DMRs | 24 | 8.7
HMM-Fisher [80] | HMM | Both | Identify DM patterns | 4 | 4
HMM-DM [81] | HMM | Both | Identify DMRs | 4 | 4
QDMR [83] | Shannon entropy | RRBS | Identify DMRs | 61 | 10.7
CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns | 30 | 7.2
SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks | 9 | 9
COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands | 27 | 7.7
DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs | 31 | 12.4
swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates | 4 | 3.2
metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples | 0 | 0

Columns 5–10 of the table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage and identify de novo region) hold one symbol per method; the symbols are not recoverable here. One symbol means that the method considers the characteristic, a second means that the method does not consider the characteristic and a third means that the characteristic is not applicable. For the 9th column, the first symbol means that the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column, the symbols mean that the method can or cannot identify de novo regions. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.


eDMR uses autocorrelation of the methylation data, HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM, CpG_MPs uses a hotspot extension algorithm and SMART uses Euclidean distance based on methylation similarity to take into account spatial correlation of the CpG sites.
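How smoothing borrows information from neighboring CpG sites can be sketched with a kernel-weighted average. This is a deliberately simple stand-in: BSmooth and BiSeq fit local regression models, whereas the sketch below only averages per-site ratios with a tricube kernel multiplied by read coverage. The function name, the bandwidth and the example data are all assumptions.

```python
def smooth_levels(positions, meth, cov, bandwidth=100):
    """Coverage-weighted tricube kernel average of methylation ratios at each position.

    positions: genomic coordinates of CpG sites; meth/cov: methylated and total reads.
    Returns None for a site with no covered neighbor inside the bandwidth.
    """
    smoothed = []
    for x0 in positions:
        num = den = 0.0
        for x, m, n in zip(positions, meth, cov):
            u = abs(x - x0) / bandwidth
            if u >= 1 or n == 0:
                continue  # outside the window, or no reads to estimate a ratio from
            w = (1 - u ** 3) ** 3 * n  # tricube kernel weight times read coverage
            num += w * (m / n)
            den += w
        smoothed.append(num / den if den > 0 else None)
    return smoothed


positions = [100, 120, 150, 180, 400]
meth = [9, 1, 8, 0, 5]
cov = [10, 1, 10, 0, 10]   # the site at 180 has no reads at all
print(smooth_levels(positions, meth, cov))
```

The uncovered site at position 180 receives an estimate from its neighbors, which is the benefit noted for missing CpG sites. The isolated site at position 400 has nothing to borrow from, which hints at why smoothing is less useful for sparse RRBS data.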

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information is not merely dependent on the hypothesis tests but dependent on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher’s exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs/DMRs list (table), DMCs/DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
eDMR [54, 64] | R package | Standalone | Artistic/GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table), DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016
BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table), DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs/DMRs list (table), DMR methylation locus (graph) | 04 June 2012 | 17 October 2016
MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs/DMRs list (table) | 27 March 2014 | 1 May 2014 (a)
methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs/DMRs list (table), CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs/DMRs list (table), DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs/DMRs list (table), DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CCANS | DMRs list (table) | 12 December 2014 | 29 September 2015
HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table), DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016
HMM-DM [81] | R scripts | Standalone | None | DMRs list (table), DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016
QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table), DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table), DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table), DMR methylation level (graph) | 6 January 2013 | 15 June 2014
metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016. (b) Custom license stating that the software is free of charge to researchers working at academic non-profit organizations on non-commercial projects. GNU, general public license; LGPL, lesser general public license; CCANS, creative commons attribution-NonCommercial-ShareAlike 3.0 unported license; PSFL, Python software foundation license; CLI, command line interface.


additional filters to remove low coverage CpG sites before estimating methylation.
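The bias of pooled counts toward highly covered samples, and a simple minimum-coverage filter, can be illustrated as follows. The read counts and the cutoff `min_cov=5` are arbitrary choices for the example, not values recommended by any of the surveyed tools.

```python
def filter_low_coverage(sites, min_cov=5):
    """Drop (methylated, total) pairs whose total read count is below min_cov."""
    return [(m, n) for m, n in sites if n >= min_cov]


def pooled_ratio(samples):
    """Methylation level from summed counts: dominated by high-coverage samples."""
    return sum(m for m, _ in samples) / sum(n for _, n in samples)


def mean_ratio(samples):
    """Unweighted mean of per-sample methylation levels."""
    return sum(m / n for m, n in samples) / len(samples)


# The same CpG site in two samples with very different sequencing depth.
samples = [(90, 100), (1, 10)]
print(pooled_ratio(samples), mean_ratio(samples))

# Sites with fewer than five reads are removed before estimation.
print(filter_low_coverage([(3, 4), (8, 10), (0, 2)]))
```

With pooled counts, the deeply sequenced sample pulls the estimate to about 0.83, while the two samples' own levels (0.9 and 0.1) average to 0.5.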

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
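As an illustration of P-value correction, the sketch below implements the Benjamini–Hochberg procedure, a widely used FDR method. The survey does not state which FDR procedure each tool applies, so this is an assumed example rather than a description of any particular implementation.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for four CpG sites (or four regions).
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

The same function applies whether the tested units are individual CpG sites or whole regions; what changes between approaches is the set of hypotheses that is corrected.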

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When the sample size is small, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein and Hebestreit [15] and Yu and Sun [16]. Klein and Hebestreit compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of individual methods using the receiver operating characteristic (ROC) curve based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
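The area under the ROC curve used in such comparisons can be computed from regional P-values and known labels. The sketch below uses the rank-based (Mann–Whitney) identity for the AUC; the P-values and truth labels are invented, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve; a higher score means more likely a true DMR.

    Uses the identity AUC = P(score of a positive > score of a negative),
    counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative region")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical regional P-values and simulated ground truth for five regions.
pvalues = [0.001, 0.02, 0.03, 0.60, 0.04]
truth = [True, False, True, False, False]
# Smaller P-values indicate stronger evidence, so negate them to obtain scores.
auc = roc_auc([-p for p in pvalues], truth)
print(auc)
```

Such an evaluation presupposes known true regions, which is why simulated data or an accepted gold standard is a prerequisite for benchmarking.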

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers design experiments to generate data that are better suited for the community. In addition, this survey will guide developers in developing new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.

40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220

41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11

42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85

Identifying differential methylation | 15

43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.

44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.

45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.

46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.

47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.

48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.

49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.

50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.

51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.

52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.

53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.

54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.

55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.

56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.

57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.

58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.

59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.

60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.

61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.

62. Gelman A. Analysis of variance—why it is more important than ever. Ann Stat 2005;33(1):1–53.

63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.

64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.

65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.

66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.

67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.

68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.

69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.

70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.

71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.

72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.

73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.

74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.

75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.

76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.

77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.

78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.

79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.

80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.

81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.

82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.

83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.

84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.

86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.

87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.

88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.

89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.


estimated from a Dirichlet distribution. In contrast, emission probabilities are estimated from a beta distribution. DM states for the CpG sites are estimated using the MCMC method. Finally, consecutive CpG sites with the same methylation status are grouped together based on user-defined thresholds to form DMRs. Similar to HMM-Fisher, HMM-DM can identify variable-size DMRs from WGBS and RRBS data. It also takes into account the biological variation among the replicates.

In general, one of the key advantages of HMM-based approaches is that they can identify DMRs of variable size, in contrast to the approaches that use a fixed window size. They consider the spatial correlation of the CpG sites by borrowing methylation information from their neighboring sites. These approaches can also identify independent DMCs or short DMRs; therefore, they can identify sharp methylation changes among the CpG sites. In addition, all three approaches discussed above are applicable to both WGBS and RRBS data sets.

Entropy-based approaches

Entropy-based approaches identify the methylation difference across multiple samples using Shannon entropy [82], which is a quantitative measure of the variation or change in a series of events. Approaches in this category are capable of providing sample-specific methylation information.

‘QDMR’ [83] was the first approach that used Shannon entropy [82] for the purpose of identifying DMRs from bisulfite sequencing data. It quantitatively identifies DMRs from predefined regions based on the average methylation levels of the CpG sites of the regions. The probability that a sample is methylated at a specific location is calculated by taking the ratio of the methylation level of that sample to the total methylation level across all samples. The original entropy formula can be used to measure the methylation difference across samples, where lower entropy represents higher methylation difference. However, this way of calculating entropy is biased toward hypermethylation in minor samples. Therefore, QDMR introduces a one-step Tukey biweight weighted mean to make the approach less sensitive to such outliers. Finally, a region is called differentially methylated if the weighted entropy for that region is smaller than a certain cutoff, which is determined by using a probability model. QDMR takes into account the biological variability across the samples. In addition to the list of DMRs, QDMR provides quantification, visualization and annotation of the DMRs for each sample. One of the limitations of this approach is that it can identify DMRs only from predefined regions (RRBS); therefore, it is unable to identify de novo regions.
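As a rough illustration of the entropy idea described above, the following Python sketch computes the plain (unweighted) Shannon entropy of one region's methylation levels across samples. It is not QDMR's implementation: the Tukey biweight weighting and the probability-model cutoff are omitted, and the function name `methylation_entropy` and the handling of a region with no methylation in any sample are assumptions made for this example.

```python
import math


def methylation_entropy(levels):
    """Shannon entropy (base 2) of one region's methylation levels across samples.

    `levels` holds one average methylation level per sample (hypothetical input
    format). Each level is turned into a probability by dividing by the total
    methylation level across all samples, as described in the text.
    """
    total = sum(levels)
    if total == 0:
        # Assumption for this sketch: no methylation anywhere means no
        # difference between samples, so return the maximum entropy.
        return math.log2(len(levels))
    entropy = 0.0
    for level in levels:
        p = level / total
        if p > 0:  # a zero probability contributes nothing to the sum
            entropy -= p * math.log2(p)
    return entropy


# Identical levels in four samples: maximum entropy, log2(4).
uniform = methylation_entropy([0.5, 0.5, 0.5, 0.5])
# Methylation confined to one sample: minimum entropy.
specific = methylation_entropy([0.9, 0.0, 0.0, 0.0])
```

Under these assumptions, the uniform region has the highest possible entropy and the sample-specific region the lowest, which matches the statement that lower entropy represents higher methylation difference.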

An improved approach in this category, ‘CpG_MPs’ [51], has been proposed by the same research group; it can identify methylation patterns across paired or multiple samples using WGBS data. This approach identifies de novo methylated and unmethylated regions using a hotspot extension algorithm based on the methylation status of the neighboring CpG sites. It combines a combinatorial algorithm with Shannon entropy to identify DMRs.

The overall workflow of CpG_MPs is divided into four modules. The first module normalizes the sequencing reads of the CpG sites into methylation levels. The second module categorizes the methylation states of the CpG sites, based on their normalized methylation levels, into four categories: unmethylated CpGs, partially unmethylated CpGs, methylated CpGs and partially methylated CpGs. CpGs are then scanned from the 5′ to the 3′ end to extract a certain number of methylated (unmethylated) CpGs to create methylated (unmethylated) hotspots. Next, the hotspots are extended both upstream and downstream to incorporate partially methylated or partially unmethylated CpGs into their corresponding hotspots. Neighboring regions with the same patterns are then combined based on a given threshold. Also, the mean value and the standard deviation of the methylation levels of the CpG sites within each region are computed. The third module identifies conservatively unmethylated regions, conservatively methylated regions and DMRs by using a combinatorial algorithm with Shannon entropy. At first, the identified methylated and unmethylated regions are mapped to the reference genome, and then overlapping regions (ORs) are recorded in the reference genome. Next, the hotspot extension technique is used to merge the neighboring ORs with the same methylation patterns across multiple samples. A modified Shannon entropy-based method is used to identify the regions that are significant across multiple samples. The fourth module analyzes sequencing features and visualizes the identified regions.

One key advantage of CpG_MPs is that it determines the DMR boundaries by applying a combinatorial algorithm instead of depending on empirical thresholds; hence, it can detect variable-length boundaries. It can also be used to identify methylation patterns for each sample. In addition, CpG_MPs considers biological variation among the replicates. However, CpG_MPs does not include any error control measurement among the identified regions.

A more recent approach, ‘SMART’ [84], extends the weighted entropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNA methylomes. The input of SMART is the sample-wise methylation status of the CpG sites. SMART first quantifies the methylation specificity across the samples using Shannon entropy with a one-step Tukey biweight weighted mean. Next, it incorporates methylation similarities between neighboring CpG sites by estimating the methylation level of the sites based on Euclidean distance. These similarity metrics and methylation specificity states are then used to segment the genome into groups of CpG sites. Finally, a group of CpG sites is called hypermethylated (hypomethylated) if the methylation level of that group is significantly higher (lower) than the average methylation level of all samples, as determined by a one-sample t-test.

A major contribution of SMART is that it can identify cell type-specific methylation marks (i.e. HyperMark and HypoMark) from a large sample cohort. Instead of depending on user-defined thresholds, it determines DMR boundaries of variable sizes by quantifying the methylation levels of the CpG sites. It also provides functional annotation of the identified methylation marks. It considers the biological variation among the replicates and the spatial correlation among the methylation levels of the CpG sites across the genome. In addition, it can be applied to both WGBS and RRBS data.

One of the key benefits of the entropy-based approaches is that they can directly identify DMRs without identifying DMCs. As a result, entropy-based approaches that can detect de novo regions (i.e. CpG_MPs and SMART) do not depend on empirical boundary estimations. Furthermore, these approaches take into account the biological variation within replicates.

Mixed statistical tests-based approaches

Approaches in this category rely on established statistical tests such as FET, t-test and ANOVA to identify DMCs/DMRs. These statistical tests are applied to CpG sites across the samples or within predefined genomic regions (i.e. fixed/variable-size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts by taking aligned read counts (e.g. output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated by using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET and ANOVA). FET is recommended for pairwise comparison, the χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. A major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins by taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using the t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or the Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
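The window-test-correct part of such a pipeline can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not swDMR's code: it handles only two samples, uses FET on read counts pooled per window, applies the Benjamini–Hochberg procedure as the FDR correction (an assumption, since the text only says "the FDR approach"), and omits the merging and re-testing of neighboring windows. All function names and the input tuple layout are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def sliding_window_dmrs(sites, window, step, alpha):
    """Return (start, end) windows whose adjusted P-value is below alpha.

    sites: (position, meth1, unmeth1, meth2, unmeth2) tuples sorted by position,
    i.e. methylated and unmethylated read counts for two samples.
    """
    if not sites:
        return []
    windows, pvalues = [], []
    start, last = sites[0][0], sites[-1][0]
    while start <= last:
        inside = [s for s in sites if start <= s[0] < start + window]
        if inside:
            # Pool read counts over the CpG sites in the window, per sample.
            a = sum(s[1] for s in inside)
            b = sum(s[2] for s in inside)
            c = sum(s[3] for s in inside)
            d = sum(s[4] for s in inside)
            windows.append((start, start + window))
            pvalues.append(fisher_exact_two_sided(a, b, c, d))
        start += step
    adjusted = benjamini_hochberg(pvalues)
    return [w for w, q in zip(windows, adjusted) if q < alpha]
```

Choosing `step` smaller than `window` gives overlapping windows, as in the description above; empty windows are skipped so that they do not inflate the number of tests being corrected.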

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low-CpG-density regions where methylation has sharp changes, such as TFBS, which can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples, with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or it does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.
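To make the recursive idea concrete, here is a heavily simplified Python sketch in the spirit of circular binary segmentation. It is not metilene's algorithm: the input is assumed to be a per-CpG mean methylation difference between two groups, the window score |sum| / sqrt(length) is a stand-in chosen for this example, and a plain score threshold (`min_score`, hypothetical) replaces the two-dimensional Kolmogorov–Smirnov stopping criterion. The function names are likewise invented for illustration.

```python
from math import sqrt


def best_window(diff, start, end, min_cpgs):
    """Find the window diff[i:j] inside [start, end) with the largest
    |sum| / sqrt(length), considering only windows of at least min_cpgs sites.
    Returns (i, j, score)."""
    prefix = [0.0]
    for value in diff[start:end]:
        prefix.append(prefix[-1] + value)
    n = end - start
    best = (start, end, -1.0)
    for i in range(n):
        for j in range(i + min_cpgs, n + 1):
            score = abs(prefix[j] - prefix[i]) / sqrt(j - i)
            if score > best[2]:
                best = (start + i, start + j, score)
    return best


def segment(diff, start, end, min_cpgs, min_score, regions):
    """Recursively cut out the strongest window and segment the three parts."""
    if end - start < min_cpgs:
        return  # too few CpGs left in this segment
    i, j, score = best_window(diff, start, end, min_cpgs)
    if score < min_score:
        return  # no convincing difference signal in this segment
    if (i, j) == (start, end):
        regions.append((i, j))  # splitting further gives no improvement
        return
    segment(diff, start, i, min_cpgs, min_score, regions)
    segment(diff, i, j, min_cpgs, min_score, regions)
    segment(diff, j, end, min_cpgs, min_score, regions)


def candidate_regions(diff, min_cpgs=2, min_score=0.5):
    """Return candidate regions as half-open (first_index, end_index) pairs."""
    regions = []
    segment(diff, 0, len(diff), min_cpgs, min_score, regions)
    return regions


# Four CpGs with a 0.5 difference flanked by unchanged CpGs.
signal = [0.0] * 4 + [0.5] * 4 + [0.0] * 4
found = candidate_regions(signal)
```

The exhaustive window search is quadratic in the segment length, which is acceptable for a toy example but not for genome-scale data; a real implementation would also test each reported region for significance.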

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using a beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflows of the approaches, including the information about genome segmentation, difference quantification and DMR calling, are described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: MethylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADmeth, MethylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, qDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because accounting for it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for the full list) take the biological variation among the replicates into account.
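A small numerical example may help show why pooling read counts loses biological variability. The Python sketch below uses made-up read counts for a single CpG site in two hypothetical groups of three replicates; the helper names are assumptions for illustration only.

```python
from statistics import stdev


def pooled_level(replicates):
    """Methylation level after combining read counts across replicates.

    replicates: list of (methylated_reads, total_reads) pairs for one CpG site.
    """
    methylated = sum(m for m, _ in replicates)
    total = sum(n for _, n in replicates)
    return methylated / total


def per_replicate_levels(replicates):
    """Methylation level of each replicate, kept separate."""
    return [m / n for m, n in replicates]


# Made-up counts: both groups have 45 methylated reads out of 90 in total.
consistent = [(15, 30), (15, 30), (15, 30)]  # every replicate is 50% methylated
variable = [(0, 30), (15, 30), (30, 30)]     # replicates range from 0% to 100%

pooled_consistent = pooled_level(consistent)
pooled_variable = pooled_level(variable)
spread_consistent = stdev(per_replicate_levels(consistent))
spread_variable = stdev(per_replicate_levels(variable))
```

After pooling, the two groups are indistinguishable (both have a level of 0.5), whereas the per-replicate standard deviations (0 versus 0.5) reveal that the second group is far less consistent. A test based on pooled counts cannot see this difference.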

Spatial correlation is another factor to consider, which provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.


Table 1. Summary of the important characteristics of the 22 surveyed approaches

| No. | Method and reference | Concept used | Protocol | Primary purpose | Total citations | Citations/year |
|---|---|---|---|---|---|---|
| 1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate | 175 | 43.75 |
| 2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs | 28 | 8 |
| 3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates | 156 | 39 |
| 4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction | 62 | 18 |
| 6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples | 43 | 16.1 |
| 5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates | 49 | 18.4 |
| 7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs | 31 | 13.3 |
| 8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs | 42 | 17.4 |
| 9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates | 15 | 12 |
| 10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure | 8 | 8 |
| 11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs | 3 | 3 |
| 12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly | 0 | 0 |
| 13 | ComMet [78] | HMM | Both | Identify DMRs | 24 | 8.7 |
| 14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns | 4 | 4 |
| 15 | HMM-DM [81] | HMM | Both | Identify DMRs | 4 | 4 |
| 16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs | 61 | 10.7 |
| 17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns | 30 | 7.2 |
| 18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks | 9 | 9 |
| 19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands | 27 | 7.7 |
| 20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs | 31 | 12.4 |
| 21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates | 4 | 3.2 |
| 22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples | 0 | 0 |

Note: Columns 5–10 of the table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage and identify de novo region) mark, for each method, whether it considers the characteristic, does not consider it, or whether the characteristic is not applicable. For the 9th column, a positive mark means that the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column, the marks indicate whether the method can or cannot identify de novo regions. The per-method marks for these columns are not legible in this copy. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.


eDMR uses autocorrelation of the methylation data, HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM, CpG_MPs uses a hotspot extension algorithm and SMART uses Euclidean distance based on methylation similarity to take into account the spatial correlation of the CpG sites.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information is not merely dependent on the hypothesis tests, but dependent on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher’s exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

| Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date |
| --- | --- | --- | --- | --- | --- | --- |
| 1 methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016 |
| 2 eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014 |
| 3 BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016 |
| 4 BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016 |
| 5 MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015 |
| 6 DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 4 June 2012 | 17 October 2016 |
| 7 RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a) |
| 8 methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016 |
| 9 DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016 |
| 10 MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015 |
| 11 DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016 |
| 12 GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016 |
| 13 ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CCANS | DMRs list (table) | 12 December 2014 | 29 September 2015 |
| 14 HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016 |
| 15 HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016 |
| 16 QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012 |
| 17 CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015 |
| 18 SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015 |
| 19 COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016 |
| 20 DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016 |
| 21 swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014 |
| 22 metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016 |

(a) RADMeth is now part of the MethPipe tool released on 6 September 2013, with the latest update on 21 October 2016. (b) Custom license stating that the software is free of charge to researchers working at academic non-profit organizations on non-commercial projects. GNU, general public license; LGPL, lesser general public license; CCANS, Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license; PSFL, Python Software Foundation license; CLI, command line interface.


additional filters to remove low-coverage CpG sites before estimating methylation.
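
As a minimal sketch of such a coverage filter, assume each CpG site is given as a hypothetical (position, methylated reads, total reads) tuple and that a cutoff of 10 reads is chosen purely for illustration. The function below discards low-coverage sites before computing methylation ratios; it does not reproduce the filter of any particular tool.

```python
def methylation_ratios(sites, min_coverage=10):
    """Keep CpG sites with enough reads and return {position: methylated / total}.

    `sites` is an iterable of (position, methylated_reads, total_reads) tuples.
    The tuple layout and the default cutoff are illustrative assumptions.
    """
    ratios = {}
    for pos, methylated, total in sites:
        if total >= min_coverage:  # drop sites supported by too few reads
            ratios[pos] = methylated / total
    return ratios


# Hypothetical input: the first site has only 2 reads and is filtered out.
sites = [(101, 1, 2), (150, 18, 20), (203, 6, 12)]
ratios = methylation_ratios(sites)
```

A stricter cutoff trades the number of retained sites for more stable ratio estimates.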

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
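
To make the idea of P-value correction concrete, the sketch below implements the Benjamini–Hochberg FDR adjustment in plain Python. It is offered as one common choice of correction, not as the procedure used by any specific surveyed tool; the P-values and the 0.05 cutoff are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-CpG (or per-region) P-values
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]
```

The same function can be applied per CpG site, per region, or within identified regions; only the set of P-values passed in changes.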

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When sample size is small in the data set, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparison of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity for individual methods using the receiver operating characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data and the lack of both a gold standard for performance evaluation and a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
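
The ROC-based comparison described here can be illustrated with a small sketch. It assumes that each candidate region has a regional P-value and a known truth label (for example from simulated data) and computes the area under the ROC curve through its rank interpretation; the values below are hypothetical and the function is not taken from either review.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a truly differential region receives a
    higher score than a non-differential one (ties count one half).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical regional P-values and ground truth (1 = truly differential)
regional_p = [0.001, 0.02, 0.30, 0.04, 0.70, 0.50]
truth = [1, 1, 1, 0, 0, 0]
# Smaller P-values should rank higher, so score each region by 1 - P.
auc = roc_auc([1 - p for p in regional_p], truth)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the area allows methods to be compared without fixing a single P-value cutoff.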

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest for progress in this area. Our main objective in this survey is to provide the community a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users to choose the best DM analysis method based on their requirements. It will help the researchers to design experiments to generate data that are better suited for the community. In addition, this survey will guide the developers to develop new efficient statistical models that identify DM by considering key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.
40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.
41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.
42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.
44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.
45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.
46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.
47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.
48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.
49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.
50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.
51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.
52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.
53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.
54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.
55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.
57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.
58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.
59. Pearson ES, Hartley HO. Biometrika Tables for Statisticians (vol. 2). Biometrika Trust, page 385, 1976.
60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.
61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.
62. Gelman A. Analysis of variance: why it is more important than ever. Ann Stat 2005;33(1):1–53.
63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.
64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.
65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.
66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.
67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.
68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.
69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.
70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.
71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.
72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.
73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.
74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.
75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.
76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.
77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.
78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.
79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.
80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.
81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.
82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.
83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.
84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.
85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.
86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.
87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.
88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.
89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.


within predefined genomic regions (i.e. fixed/variable size windows).

One of the approaches in this category, ‘COHCAP’ [46], identifies differentially methylated CpG islands from two or more groups using predefined regions. It also provides integration with gene expression data and visualization of the results. The pipeline starts with taking aligned read counts (e.g. output of the Bismark aligner [26]) as input. CpG sites are marked as methylated or unmethylated based on a user-defined threshold. P-values of the CpG sites are first calculated by using different statistical approaches (i.e. FET, ANOVA and t-test) based on the chosen experimental design. Later, the P-values are corrected using the FDR approach. CpG sites are filtered based on the P-value of the CpG site, the average methylation proportion across all the samples and the FDR value. CpG islands with a minimum number of filtered CpG sites are considered as candidate DMRs. In the ‘average by CpG site’ pipeline, P-values of the CpG sites within candidate DMRs are calculated by the previously selected statistical method. In the ‘average by CpG island’ pipeline, beta values of the filtered CpG sites within each candidate DMR are averaged, and then a P-value is calculated based on the averaged beta value. The major contribution of COHCAP is that it provides integration of gene expression data with DM analysis. In addition, it takes into account the biological variation among the replicates.

‘DMAP’ [85], another approach in this category, is a fragment-based approach primarily designed for the RRBS protocol to identify differentially methylated fragments (DMFs). Nonetheless, this approach can also detect DMRs from WGBS data. In addition to the identification of DMRs/DMFs, DMAP provides information about nearby genes and CpG sites.

The input of DMAP is methylated read counts in Bismark aligner [26] format. To identify candidate genomic regions from WGBS data, DMAP defines fixed-size windows (i.e. default 1000 bp). For RRBS data, it defines fragments of variable sizes (40–220 bp). Next, a P-value is calculated for each region or fragment based on the methylated CpG counts using a chosen statistical test (χ² test, FET and ANOVA). FET is recommended for pairwise comparison, χ² test is recommended for testing variability across multiple samples and ANOVA is recommended for comparing groups of samples. Candidate regions are selected as DMRs (for WGBS data) and DMFs (for RRBS data) based on a user-defined P-value threshold. Options to correct for multiple comparisons are also provided. The output is a list of candidate regions/fragments with their P-values and information regarding the statistical test that was applied. Furthermore, DMAP provides gene annotation features of the identified regions/fragments. The major contribution of this approach is that it can detect variable-size fragments (DMFs) from predefined regions.
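
The window-based pairwise test can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not DMAP code: CpG sites are hypothetical (position, methylated reads, unmethylated reads) tuples, counts are pooled into fixed 1000 bp windows, and a two-sided Fisher's exact test is computed directly from the hypergeometric distribution.

```python
from math import comb


def pool_counts_by_window(sites, window=1000):
    """Sum (methylated, unmethylated) read counts per fixed-size window."""
    pooled = {}
    for pos, meth, unmeth in sites:
        start = (pos // window) * window
        m, u = pooled.get(start, (0, 0))
        pooled[start] = (m + meth, u + unmeth)
    return pooled


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Hypothetical pairwise comparison of two samples
sample_a = [(120, 9, 1), (480, 9, 1), (1500, 5, 5)]
sample_b = [(120, 2, 8), (480, 3, 7), (1500, 5, 5)]
pooled_a = pool_counts_by_window(sample_a)
pooled_b = pool_counts_by_window(sample_b)
p_by_window = {
    start: fisher_exact_two_sided(*pooled_a[start], *pooled_b[start])
    for start in sorted(pooled_a) if start in pooled_b
}
```

Windows whose P-value falls below a user-defined threshold would then be reported as candidate regions, optionally after a multiple-comparison correction.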

‘swDMR’ [86], another approach in this category, integrates multiple commonly used statistical approaches to identify DMRs from WGBS data. The pipeline begins with taking the methylated read counts of each CpG site (preferably from the Bismark aligner [26]) as input, which are later converted to methylation ratios. Next, it divides the genome into multiple overlapping fragments or windows of equal length based on user-defined thresholds. A statistical approach is chosen from a list of commonly used approaches (i.e. FET, t-test, χ², Wilcoxon, ANOVA and Kruskal–Wallis test) to perform hypothesis testing within each window across two or more samples. For two samples, methylation levels of the CpG sites are compared using t-test, Wilcoxon test, χ² test or FET. For more than two samples, methylation levels are compared using either ANOVA or Kruskal–Wallis test. Therefore, for each window, swDMR provides a P-value generated using the selected statistical test. The resulting P-values are corrected for multiple comparisons using the FDR approach. The regions with corrected P-values lower than a predefined threshold are selected as potential DMRs. Using an extension function, two potential DMRs are merged if the distance between them is less than a predefined threshold. The merged DMRs are tested with the previously selected statistical test, and P-values are corrected with respect to the new DMR boundaries. Finally, the merged DMRs with the corrected P-values less than the user-defined threshold are selected as candidate DMRs. The swDMR approach can be used without biological replicates and can work with CHG or CHH methylation. It also provides functionalities such as DMR cluster analysis, visualization and annotation of DMRs.
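
The windowing and extension steps can be illustrated with a short sketch. The function names, window size, step and gap threshold are hypothetical choices for illustration and are not taken from the swDMR implementation; the re-testing of merged regions described above is omitted.

```python
def sliding_windows(chrom_length, size, step):
    """Yield (start, end) for overlapping windows of equal length."""
    for start in range(0, max(chrom_length - size, 0) + 1, step):
        yield (start, start + size)


def merge_close_regions(regions, max_gap):
    """Merge (start, end) regions whose gap is smaller than max_gap."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to (or overlapping) the previous region: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Overlapping 400 bp windows with a 200 bp step on a 1000 bp stretch
windows = list(sliding_windows(1000, size=400, step=200))

# Hypothetical windows that passed the corrected P-value threshold
potential_dmrs = [(0, 400), (200, 600), (1500, 1900), (5000, 5400)]
merged_dmrs = merge_close_regions(potential_dmrs, max_gap=1000)
```

Because the merge depends directly on the gap threshold, the resulting DMR boundaries change with that user-defined parameter.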

The key advantage of the approaches in this category is that they provide flexibility in selecting different statistical tests and methods for multiple test correction. In contrast, these approaches do not take into account the spatial correlation between the methylation levels of the neighboring CpG sites. In addition, these approaches either work on predefined regions or divide the genome into windows of fixed/variable size. Hence, they miss the low CpG density regions where methylation has sharp changes, such as TFBS that can contain a single differentially methylated CpG site [68]. Importantly, they depend on user-defined thresholds to estimate the DMR boundaries.

Binary segmentation-based approaches

Approaches in this category use a binary segmentation algorithm to recursively divide the genome to identify candidate regions from bisulfite sequencing data. The only approach in this category, ‘metilene’ [87], uses a circular binary segmentation algorithm to identify DMRs. It can be used to analyze both WGBS and RRBS experiments across multiple samples with or without replicates.

The pipeline starts with a pre-segmentation step that divides the genome into primary regions based on the available methylation information. The pre-segmented regions are then iteratively segmented using a circular binary segmentation algorithm to identify a window with the maximum mean difference signal. The segmentation is terminated when a segment has fewer CpGs than a predefined threshold or does not show any improvement in the two-dimensional Kolmogorov–Smirnov test results. The identified window is marked as a potential DMR. The output of metilene is a list of DMRs with their P-values, adjusted P-values and the P-value from a Mann–Whitney U test.

Metilene can detect de novo regions of various lengths without relying on user-defined boundary thresholds. It takes into account the variation among biological replicates. In addition, it can predict methylation levels of the missing CpG sites using a beta distribution. One of the limitations of metilene is that the result greatly depends on the minimum segment size parameter, which can lead to false negatives (if it is too high) or false positives (if it is too low). In addition, it does not consider the spatial correlation of the methylation levels of the CpG sites across biological replicates.

Discussion

In this survey, we briefly summarize 22 approaches that identify DM using bisulfite sequencing data, focusing on their important features such as concept used, protocol used, biological variability, spatial distribution, additional covariates, error correction, sequencing coverage and identifying de novo regions. The approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: methylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, methylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates, and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider because it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for full list) take the biological variation among the replicates into account.
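A small numeric sketch shows what is lost when read counts are combined across replicates. The read counts below are invented for illustration and do not come from any of the surveyed tools.

```python
from statistics import mean, variance

# Hypothetical (methylated reads, total reads) at one CpG site in three replicates.
replicates = [(9, 10), (2, 10), (40, 80)]

# Pooling: one methylation level, no information about replicate-to-replicate spread.
pooled_level = sum(m for m, _ in replicates) / sum(n for _, n in replicates)

# Per-replicate modeling: one level per replicate, so the spread can be estimated.
per_replicate = [m / n for m, n in replicates]

print(f"pooled level:          {pooled_level:.2f}")
print(f"per-replicate levels:  {per_replicate}")
print(f"mean of levels:        {mean(per_replicate):.3f}")
print(f"sample variance:       {variance(per_replicate):.3f}")
```

The pooled estimate (0.51) hides the fact that the individual replicates range from 0.2 to 0.9; a test built only on the pooled counts would treat this site as if it had been measured consistently.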

Figure 3. Pros and cons of the seven categories discussed in this survey.

Table 1. Summary of the important characteristics of the 22 surveyed approaches

No. | Method and reference | Concept used | Protocol | Primary purpose
1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate
2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs
3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates
4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction
6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples
5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates
7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs
8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs
9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates
10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure
11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs
12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly
13 | ComMet [78] | HMM | Both | Identify DMRs
14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns
15 | HMM-DM [81] | HMM | Both | Identify DMRs
16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs
17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns
18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks
19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands
20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs
21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates
22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples

[Columns 5–10 of the original table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage, identify de novo region) and the two citation columns (total citations, citations per year) are not recoverable from this copy.]

Note: In columns 5–10, marks indicate whether the method considers the characteristic, does not consider it, or whether the characteristic is not applicable. For the 9th column, a positive mark means that the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column (identify de novo regions), the marks indicate whether the method can or cannot identify de novo regions. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.

Spatial correlation is another factor to consider, which provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a ‘smoothing’ operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance, eDMR uses autocorrelation of the methylation data, HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM, CpG_MPs uses a hotspot extension algorithm and SMART uses Euclidean distance based on methylation similarity to take into account spatial correlation of the CpG sites.
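The idea of borrowing information from neighboring CpG sites can be sketched with a coverage-weighted kernel average. This is a deliberately simple stand-in and not the local-likelihood smoother used by BSmooth or the procedure used by BiSeq; the tricube kernel, the function name `smooth_methylation` and the `bandwidth` parameter are assumptions for illustration.

```python
from typing import List


def smooth_methylation(positions: List[int], meth_counts: List[int],
                       coverages: List[int], bandwidth: float) -> List[float]:
    """Coverage-weighted tricube smoothing of per-site methylation ratios."""
    smoothed = []
    for center in positions:
        num = den = 0.0
        for pos, m, n in zip(positions, meth_counts, coverages):
            u = abs(pos - center) / bandwidth
            if u >= 1.0 or n == 0:  # outside the window, or no reads at this site
                continue
            weight = (1.0 - u ** 3) ** 3 * n  # kernel weight scaled by coverage
            num += weight * (m / n)
            den += weight
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed


# The middle site has no reads; its level is estimated from its neighbors.
positions = [100, 150, 200]
meth_counts = [5, 0, 10]
coverages = [10, 0, 20]
print(smooth_methylation(positions, meth_counts, coverages, bandwidth=200))
```

The uncovered site at position 150 receives an estimate from its neighbors, which is how smoothing can fill in missing CpG sites. The same mechanism is also why sparse data are problematic: with few neighbors in the window, the estimate is driven by one or two sites.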

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher’s exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey based on the data type used when modeling the methylation levels of the CpG sites.

Table 2. Comparison of the available implementations of the 22 surveyed approaches

No. | Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
1 | methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
2 | eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
3 | BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus, methylation level (graph) | 20 July 2012 | 14 October 2016
4 | BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
6 | DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 4 June 2012 | 17 October 2016
5 | MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
7 | RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a)
8 | methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
9 | DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
10 | MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
11 | DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
12 | GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
13 | ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CC ANS | DMRs list (table) | 12 December 2014 | 29 September 2015
14 | HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus, methylation level (graph) | 25 April 2014 | 29 February 2016
15 | HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus, methylation level (graph) | 27 March 2014 | 24 March 2016
16 | QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
17 | CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
18 | SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
19 | COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
20 | DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
21 | swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014
22 | metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic/non-profit organizations on non-commercial projects.
GNU: general public license; LGPL: lesser general public license; CC ANS: Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license; PSFL: Python Software Foundation license; CLI: command line interface.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information is not merely dependent on the hypothesis tests but dependent on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include additional filters to remove low coverage CpG sites before estimating methylation.
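To make the pooling behavior of a count-based test concrete, the following sketch implements a two-sided Fisher's exact test on pooled read counts at a single CpG site, together with a simple coverage filter. It is an illustration only and not taken from any surveyed tool; the function names, the `min_reads` parameter and the read counts are hypothetical.

```python
from math import comb
from typing import List, Tuple

Counts = List[Tuple[int, int]]  # (methylated reads, total reads) per sample


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x methylated reads in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def pool(samples: Counts) -> Tuple[int, int]:
    """Pool reads across samples into (methylated, unmethylated) counts."""
    meth = sum(m for m, _ in samples)
    total = sum(n for _, n in samples)
    return meth, total - meth


def passes_coverage(samples: Counts, min_reads: int) -> bool:
    """Keep a site only if every sample has at least min_reads reads."""
    return all(n >= min_reads for _, n in samples)


group_a = [(8, 10), (7, 10)]
group_b = [(2, 10), (3, 10)]
if passes_coverage(group_a + group_b, min_reads=5):
    p_value = fisher_exact_two_sided(*pool(group_a), *pool(group_b))
    print(f"P-value: {p_value:.4f}")
```

Because `pool` simply adds reads, a sample with 100 reads would contribute ten times as much to the table as a sample with 10 reads, which is the bias toward high-coverage samples noted above.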

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
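As one concrete instance of P-value correction, the sketch below implements the Benjamini–Hochberg step-up adjustment, a widely used FDR procedure. The surveyed tools differ in which correction they apply and at which level (site or region), so the choice of this procedure and the example P-values are assumptions for illustration.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


site_p_values = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(site_p_values, benjamini_hochberg(site_p_values)):
    print(f"P = {p:<6} adjusted = {q:.3f}")
```

Sites (or regions) whose adjusted value falls below the chosen threshold would then be reported as differentially methylated.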

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.
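The reason a beta-binomial model captures both coverage and biological variability can be seen from its probability mass function: the number of reads n enters directly, and the beta prior on the methylation level adds variance beyond that of a plain binomial. The sketch below uses the standard (alpha, beta) parameterization, which is an assumption here; individual tools such as DSS or RADMeth use their own parameterizations and estimation procedures.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(k methylated reads out of n) when the methylation level ~ Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


n_reads, alpha, beta = 10, 2.0, 2.0
pmf = [beta_binomial_pmf(k, n_reads, alpha, beta) for k in range(n_reads + 1)]
mean_k = sum(k * p for k, p in enumerate(pmf))
var_k = sum((k - mean_k) ** 2 * p for k, p in enumerate(pmf))

p_level = alpha / (alpha + beta)
binomial_var = n_reads * p_level * (1 - p_level)
print(f"beta-binomial variance: {var_k:.2f}, binomial variance: {binomial_var:.2f}")
```

With the same mean methylation level of 0.5, the beta-binomial variance (7.0) is well above the binomial variance (2.5); this extra dispersion is what represents variation among biological replicates.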

When the sample size of the data set is small, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity for individual methods using the receiver operating characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data and the lack of a gold standard for performance evaluation and of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
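The ROC-based evaluation described above can be sketched as follows, assuming a data set in which the truly differentially methylated regions are known (for example, from simulation). The area under the ROC curve is computed through its rank interpretation: the probability that a true DMR receives a smaller P-value than a non-DMR. The function name and the example values are hypothetical and are not taken from the cited reviews.

```python
from typing import List


def roc_auc_from_p_values(p_values: List[float], is_true_dmr: List[bool]) -> float:
    """Area under the ROC curve when smaller P-values indicate stronger evidence."""
    pos = [p for p, truth in zip(p_values, is_true_dmr) if truth]
    neg = [p for p, truth in zip(p_values, is_true_dmr) if not truth]
    if not pos or not neg:
        raise ValueError("need at least one true and one false region")
    # Count pairs where the true DMR ranks ahead of the non-DMR; ties count half.
    wins = sum((p < q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


regional_p_values = [0.001, 0.2, 0.03, 0.5, 0.04]
truth = [True, False, False, False, True]
print(f"AUC = {roc_auc_from_p_values(regional_p_values, truth):.3f}")
```

An AUC of 1.0 would mean that every true DMR is ranked ahead of every non-DMR, while 0.5 corresponds to random ranking; comparing this value across tools on the same regions is the kind of assessment the review performs.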

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest for progress in this area. Our main objective in this survey is to provide the community a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users to choose the best DM analysis method based on their requirements. It will help the researchers to design experiments to generate data that are better suited for the community. In addition, this survey will guide the developers to develop new efficient statistical models that identify DM by considering key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore.
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.
40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.
41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.
42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.
43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.
44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.
45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.
46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.
47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.
48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.
49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.
50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.
51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.
52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.
53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.
54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.
55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.
57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.
58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.
59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.
60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.
61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.
62. Gelman A. Analysis of variance - why it is more important than ever. Ann Stat 2005;33(1):1–53.
63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.
64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.
65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.
66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.
67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.
68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.
69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.
70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.
71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.
72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.
73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.
74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.
75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.
76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.
77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.
78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.
79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.
80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher’s exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.
81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.
82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.
83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.
84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific

16 | Shafi et al

hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94

85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22

86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866

87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62

88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015

89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43


approaches are categorized into seven different categories based on their primary concepts or techniques used to identify DM. Some of the approaches involve multiple concepts to identify DM; hence, they could be assigned to multiple categories. In such cases, we categorize the approach based on the concept that the authors highlighted. Pros and cons of these categories are summarized in Figure 3. The important features of the approaches covered in this survey are summarized in Table 1. Moreover, the workflow of the approaches, including the information about genome segmentation, difference quantification and DMR calling, is described in Figure 4.

Note that there are other possible ways to categorize these approaches. For instance, this can be done based on the data type used to estimate the methylation levels of the CpG sites (count data, ratio data, and both count and ratio data). In that case, the methods will be distributed among the categories as follows: (i) count data: methylKit, eDMR, DSS, DSS-single, DSS-general, MOABS, RADMeth, methylSig, MACAU, GetisDMR, ComMet; (ii) ratio data: BSmooth, BiSeq, QDMR, CpG_MPs, SMART, HMM-Fisher, HMM-DM, COHCAP, metilene; (iii) both count and ratio data: DMAP, swDMR. A graphical representation of this classification is shown in Figure 5. Similarly, the approaches can be categorized based on the number of groups allowed (one group of samples, two groups without replicates, and two groups with replicates), based on the protocol used (WGBS, RRBS, and both WGBS and RRBS), etc.

Biological variability within the replicates is a crucial factor to consider, because accounting for it can reduce the number of false positives in the results [14, 43, 46]. If an approach takes into account each biological replicate within a group separately when modeling the methylation levels of the CpG sites, then biological variability is considered. On the other hand, biological variability is lost if an approach combines the read counts of the CpG sites across the replicates. Although classical hypothesis testing methods (e.g. t-test and ANOVA) take biological variation into account, BSmooth was the first approach primarily developed for DMR identification that takes into account the biological variation among replicates. Within the surveyed approaches, smoothing-based approaches, beta-binomial-based approaches, entropy-based approaches, etc. (see Table 1 for full list) take the biological variation among the replicates into account.
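The difference between pooling read counts and modeling each replicate separately can be illustrated with a minimal sketch. The read counts, group names and helper functions below are hypothetical and are not taken from any of the surveyed tools; the point is only that the pooled ratio hides the replicate-to-replicate spread that a per-replicate summary keeps.

```python
from statistics import mean, variance

# Hypothetical (methylated, total) read counts at one CpG site,
# three biological replicates per group.
control = [(8, 10), (9, 12), (10, 11)]
treated = [(2, 10), (20, 22), (3, 12)]

def pooled_ratio(reps):
    """Pool reads across replicates: replicate-to-replicate spread is lost."""
    return sum(m for m, _ in reps) / sum(n for _, n in reps)

def replicate_ratios(reps):
    """Keep one methylation ratio per replicate so variability stays visible."""
    return [m / n for m, n in reps]

for name, reps in (("control", control), ("treated", treated)):
    ratios = replicate_ratios(reps)
    print(name,
          "pooled=%.3f" % pooled_ratio(reps),
          "mean=%.3f" % mean(ratios),
          "variance=%.4f" % variance(ratios))
```

In this toy example the treated group has a single replicate that disagrees strongly with the other two; only the per-replicate variance reveals this, which is the kind of information a method needs to avoid calling such a site differentially methylated with undue confidence.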

Spatial correlation is another factor to consider, which provides a better estimation of the methylation levels of the CpG sites by borrowing information from their neighbors. A common way of considering spatial correlation is to perform a 'smoothing' operation before the detection of DM. In this survey, smoothing-based approaches (BSmooth and BiSeq) and a few beta-binomial-based approaches (DSS-single, MACAU and GetisDMR) fall into this category. Performing smoothing when identifying DMRs can reduce the required sequencing depth and estimate the methylation status of missing CpG sites [43]. Additionally, the smoothing procedure helps to identify relatively longer DMRs. However, this procedure is only applicable to genomes whose methylation profile is known to be smooth. Also, smoothing is not suitable for data sets whose CpG sites are sparse (commonly seen in the RRBS protocol) due to extrapolated methylation values of 0 and 1. Besides smoothing, other techniques can be applied to take spatial correlation into account. For instance,

Figure 3. Pros and cons of the seven categories discussed in this survey.


Table 1. Summary of the important characteristics of the 22 surveyed approaches

No. | Method and reference | Concept used | Protocol | Primary purpose | Total citations | Citations/year
1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate | 175 | 43.75
2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs | 28 | 8
3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates | 156 | 39
4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction | 62 | 18
6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples | 43 | 16.1
5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates | 49 | 18.4
7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs | 31 | 13.3
8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs | 42 | 17.4
9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates | 15 | 12
10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure | 8 | 8
11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs | 3 | 3
12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly | 0 | 0
13 | ComMet [78] | HMM | Both | Identify DMRs | 24 | 8.7
14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns | 4 | 4
15 | HMM-DM [81] | HMM | Both | Identify DMRs | 4 | 4
16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs | 61 | 10.7
17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns | 30 | 7.2
18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks | 9 | 9
19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands | 27 | 7.7
20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs | 31 | 12.4
21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates | 4 | 3.2
22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples | 0 | 0

[Columns 5–10 of the original table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage, identify de novo region) hold per-method marks that are not preserved in this copy.]

Note: For columns 5–10, one mark means that the method considers the characteristic and another means that the method does not consider the characteristic. For the 9th column, a further mark means that the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column (identify de novo regions), the marks mean that the method can or cannot identify de novo regions. For columns 5–10, a separate mark means the characteristic is not applicable. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.


eDMR uses autocorrelation of the methylation data; HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM; CpG_MPs uses a hotspot extension algorithm; and SMART uses Euclidean distance based on methylation similarity to take into account spatial correlation of the CpG sites.
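To make the idea of borrowing information from neighboring CpG sites concrete, the following sketch smooths per-site methylation ratios with a coverage-weighted tricube kernel. It is a deliberately simplified stand-in and not the local-likelihood smoother of BSmooth or the procedure of any other surveyed tool; the function name, the bandwidth and the toy data are assumptions made for illustration.

```python
def smooth_methylation(positions, meth, total, bandwidth=100):
    """Coverage-weighted tricube kernel average of methylation ratios.

    positions: genomic coordinates of CpG sites on one chromosome
    meth, total: methylated and total read counts per site
    bandwidth: half-width (in bases) of the smoothing window (illustrative)
    """
    smoothed = []
    for p in positions:
        num = den = 0.0
        for q, m, n in zip(positions, meth, total):
            d = abs(q - p) / bandwidth
            if d < 1.0 and n > 0:
                w = (1.0 - d ** 3) ** 3 * n  # tricube weight scaled by coverage
                num += w * (m / n)
                den += w
        smoothed.append(num / den if den > 0 else float("nan"))
    return smoothed

# Toy data: the second site has only two reads; the last site is isolated.
positions = [100, 120, 150, 400]
meth = [9, 1, 8, 0]
total = [10, 2, 10, 5]
print(smooth_methylation(positions, meth, total))
```

The low-coverage site at position 120 is pulled toward its well-covered neighbors, whereas the isolated site at position 400 has no neighbors within the window and keeps its raw value. This mirrors the caveat in the text: smoothing helps where the methylation profile is dense and smooth, and offers little where CpG sites are sparse.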

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information is not merely dependent on the hypothesis tests but dependent on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher's exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

No. | Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
1 | methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
2 | eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
3 | BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016
4 | BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
6 | DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 04 June 2012 | 17 October 2016
5 | MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
7 | RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a)
8 | methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
9 | DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
10 | MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
11 | DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
12 | GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
13 | ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CCANS | DMRs list (table) | 12 December 2014 | 29 September 2015
14 | HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016
15 | HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016
16 | QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
17 | CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
18 | SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
19 | COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
20 | DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
21 | swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014
22 | metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013 with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic/non-profit organizations on non-commercial projects.
GNU, GNU general public license; LGPL, lesser general public license; CCANS, Creative Commons Attribution-NonCommercial-ShareAlike 3.0 unported license; PSFL, Python Software Foundation license; CLI, command line interface.


additional filters to remove low coverage CpG sites before estimating methylation.
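The bias of pooled read counts toward deeply sequenced samples, and the effect of a simple coverage filter, can be seen in a few lines. The counts, the helper names and the threshold of 10 reads are illustrative assumptions and do not correspond to the defaults of any surveyed approach.

```python
from statistics import mean

# Hypothetical (methylated, total) read counts at one CpG site, one tuple per sample.
group = [(1, 5), (2, 6), (190, 200)]

def pooled_level(counts):
    """Methylation level after pooling reads, as count-based tests implicitly do."""
    return sum(m for m, _ in counts) / sum(n for _, n in counts)

def mean_of_ratios(counts):
    """Average of per-sample ratios: every sample weighs the same."""
    return mean(m / n for m, n in counts)

def filter_low_coverage(counts, min_coverage=10):
    """Drop samples whose total read count is below an illustrative threshold."""
    return [(m, n) for m, n in counts if n >= min_coverage]

print("pooled level:   %.3f" % pooled_level(group))
print("mean of ratios: %.3f" % mean_of_ratios(group))
print("after filter:  ", filter_low_coverage(group))
```

Here the pooled level is driven almost entirely by the sample with 200 reads, while the mean of ratios treats the two poorly covered samples as equally informative. Neither summary is satisfactory on its own, which is one motivation for models that carry the coverage of each sample explicitly.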

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.
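A minimal sketch of the threshold-based strategy, in which significant CpG sites are merged into regions, is given below. The function name, the parameter names and all default values (P-value cutoff, maximum gap and minimum number of DMCs) are hypothetical choices for illustration; each surveyed tool defines and tunes its own criteria.

```python
def call_dmrs(sites, p_cutoff=0.01, max_gap=300, min_dmcs=3):
    """Merge neighboring DMCs into candidate DMRs using empirical thresholds.

    sites: list of (position, p_value) for one chromosome, sorted by position
    Returns a list of (start, end, number_of_dmcs) tuples.
    """
    dmcs = [pos for pos, p in sites if p <= p_cutoff]
    regions, current = [], []
    for pos in dmcs:
        # Close the open region when the next DMC is too far away.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_dmcs:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_dmcs:
        regions.append((current[0], current[-1], len(current)))
    return regions

sites = [(100, 0.001), (150, 0.2), (220, 0.005), (300, 0.002),
         (900, 0.001), (1000, 0.004), (5000, 0.0001)]
print(call_dmrs(sites))
```

With these settings only the first cluster of three DMCs is reported; the pair at positions 900 and 1000 and the isolated site at 5000 fall short of the minimum count. Changing any of the three thresholds changes the reported regions, which illustrates why careful tuning is needed for robust results.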

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
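As one common example of P-value correction, the sketch below applies the standard Benjamini-Hochberg step-up adjustment to a list of per-site or per-region P-values. The choice of this particular procedure is an assumption made for illustration; individual tools may use other corrections (for example, weighted or sliding-window variants).

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Sites or regions whose adjusted value falls below the chosen false discovery rate are then reported; whether the correction is applied genome-wide, per region or within regions depends on the approach.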

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.
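Why a beta-binomial model captures both coverage and biological variability can be sketched numerically. The code below evaluates a beta-binomial probability mass function under a mean/dispersion parameterization (mean level `mu`, dispersion `phi`), which is one common convention and is assumed here for illustration; it is not the estimation procedure of any specific surveyed tool.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(k, n, mu, phi):
    """P(k methylated reads out of n) for mean level mu and dispersion phi in (0, 1)."""
    a = mu * (1.0 - phi) / phi
    b = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

n, mu, phi = 20, 0.7, 0.1  # coverage, mean methylation, dispersion (toy values)
pmf = [beta_binomial_pmf(k, n, mu, phi) for k in range(n + 1)]
mean_k = sum(k * p for k, p in enumerate(pmf))
var_k = sum((k - mean_k) ** 2 * p for k, p in enumerate(pmf))
print("mean reads: %.3f, variance: %.3f, binomial variance: %.3f"
      % (mean_k, var_k, n * mu * (1 - mu)))
```

The coverage `n` enters the distribution directly, and the dispersion inflates the variance beyond the binomial value by the factor 1 + (n - 1) * phi, which is how replicate-to-replicate variability is represented in this family of models.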

When sample size is small in the data set, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparison of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity for individual methods using the receiver operator characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
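The ROC-based evaluation described above can be sketched as follows, assuming a simulated data set in which the truly differentially methylated regions are known. The function name and the toy P-values are hypothetical; the area under the curve is computed in its rank (Mann-Whitney) form, with smaller P-values treated as stronger evidence.

```python
def roc_auc(p_values, is_true_dmr):
    """Area under the ROC curve when regions are ranked by ascending P-value.

    Equals the probability that a randomly chosen true DMR receives a smaller
    P-value than a randomly chosen non-DMR; ties count as one half.
    """
    pos = [p for p, t in zip(p_values, is_true_dmr) if t]
    neg = [p for p, t in zip(p_values, is_true_dmr) if not t]
    if not pos or not neg:
        raise ValueError("need at least one true and one false region")
    wins = 0.0
    for pp in pos:
        for pn in neg:
            if pp < pn:
                wins += 1.0
            elif pp == pn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

regional_p = [0.001, 0.02, 0.03, 0.5, 0.04]
truth = [True, False, True, False, False]
print("AUC = %.3f" % roc_auc(regional_p, truth))
```

Comparing this value across tools run on the same simulated regions gives a single-number summary of the sensitivity/specificity trade-off, subject to how realistic the simulation is.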

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors, and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest for progress in this area. Our main objective in this survey is to provide the community a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users to choose the best DM analysis method based on their requirements. It will help the researchers to design experiments to generate data that are better suited for the community. In addition, this survey will guide the developers to develop new, efficient statistical models that identify DM by considering key characteristics described here.

Key points

  • Identification of the fittest approach among all that are available is a challenging task in DM analysis.
  • A comprehensive benchmark of the available approaches that identify DM is greatly needed.
  • Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.
40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.
41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.
42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83

44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.

45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation, and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.

46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.

47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.

48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.

49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.

50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.

51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.

52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.

53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.

54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.

55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.

56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.

57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.

58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.

59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol 2). Biometrika Trust, page 385, 1976.

60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.

61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.

62. Gelman A. Analysis of variance: why it is more important than ever. Ann Stat 2005;33(1):1–53.

63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.

64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.

65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.

66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.

67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.

68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.

69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.

70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.

71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.

72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.

73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.

74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.

75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.

76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.

77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.

78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.

79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.

80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher's exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.

81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.

82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.

83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.

84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.

86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.

87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.

88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.

89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.


Table 1. Summary of the important characteristics of the 22 surveyed approaches

No. | Method and reference | Concept used | Protocol | Primary purpose | Total citations | Citations/year
1 | methylKit [54] | Logistic regression | Both | Identify DMCs and annotate | 175 | 43.75
2 | eDMR [64] | Logistic regression | Both | Identify DMCs and DMRs | 28 | 8
3 | BSmooth [43] | Smoothing | WGBS | Identify DMRs with replicates | 156 | 39
4 | BiSeq [66] | Smoothing | RRBS | Identify DMRs with FDR correction | 62 | 18
5 | MOABS [70] | Beta-binomial | Both | Identify DMCs with replicates | 49 | 18.4
6 | DSS [69] | Beta-binomial | Both | Identify DMLs for small samples | 43 | 16.1
7 | RADMeth [71] | Beta-binomial | WGBS | Identify DMLs and DMRs | 31 | 13.3
8 | methylSig [72] | Beta-binomial | Both | Identify DMCs and DMRs | 42 | 17.4
9 | DSS-single [73] | Beta-binomial | Both | Identify DMRs without replicates | 15 | 12
10 | MACAU [74] | Beta-binomial | Both | Identify DM using population structure | 8 | 8
11 | DSS-general [75] | Beta-binomial | RRBS | Identify DMLs | 3 | 3
12 | GetisDMR [76] | Beta-binomial | WGBS | Identify DMRs directly | 0 | 0
13 | ComMet [78] | HMM | Both | Identify DMRs | 24 | 8.7
14 | HMM-Fisher [80] | HMM | Both | Identify DM patterns | 4 | 4
15 | HMM-DM [81] | HMM | Both | Identify DMRs | 4 | 4
16 | QDMR [83] | Shannon entropy | RRBS | Identify DMRs | 61 | 10.7
17 | CpG_MPs [51] | Shannon entropy | WGBS | Identify DM patterns | 30 | 7.2
18 | SMART [84] | Shannon entropy | WGBS | Identify cell type-specific methylation marks | 9 | 9
19 | COHCAP [46] | Mixed statistics | RRBS | Identify DMCs and consistent CpG islands | 27 | 7.7
20 | DMAP [85] | Mixed statistics | Both | Identify DMRs and DMFs | 31 | 12.4
21 | swDMR [86] | Mixed statistics | WGBS | Identify DMRs without replicates | 4 | 3.2
22 | metilene [87] | Binary segmentation | Both | Identify DMRs in large groups of samples | 0 | 0

Note: Columns 5–10 of the table (biological variation, spatial distribution, additional covariates, error correction, sequencing coverage and identify de novo region) mark, for each method, whether the characteristic is considered, not considered or not applicable. For the 9th column, the mark indicates whether the method considers sequencing coverage when count-based hypothesis tests are performed. For the 10th column, it indicates whether the method can identify de novo regions. The per-method marks for these columns are not reproduced here. Total citations and citations per year represent the number of citations and the average number of citations per year, respectively, as shown on Google Scholar as of 24 October 2016.


eDMR uses autocorrelation of the methylation data, HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use HMM, CpG_MPs uses a hotspot extension algorithm and SMART uses Euclidean distance based on methylation similarity to take into account spatial correlation of the CpG sites.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and the pooled counts are biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information does not depend merely on the hypothesis test, but on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site, and then applies FET on the counts of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage, despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher's exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey, based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

No. | Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date
1 | methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016
2 | eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014
3 | BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016
4 | BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016
5 | MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015
6 | DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 04 June 2012 | 17 October 2016
7 | RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a)
8 | methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016
9 | DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016
10 | MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015
11 | DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016
12 | GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016
13 | ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CC ANS | DMRs list (table) | 12 December 2014 | 29 September 2015
14 | HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016
15 | HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016
16 | QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012
17 | CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015
18 | SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015
19 | COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016
20 | DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016
21 | swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014
22 | metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic/non-profit organizations on non-commercial projects.
GNU, general public license; LGPL, lesser general public license; CC ANS, creative commons attribution-NonCommercial-ShareAlike 3.0 unported license; PSFL, python software foundation license; CLI, command line interface.


additional filters to remove low-coverage CpG sites before estimating methylation.
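The coverage bias of pooled counts described above can be illustrated with a small sketch. It is not taken from any of the surveyed tools: the read counts are hypothetical, and `fisher_exact_two_sided` and `pool` are illustrative helper names. The sketch pools methylated and unmethylated read counts across replicates at one CpG site, applies a two-sided FET to the pooled 2x2 table, and compares the pooled methylation ratio with the unweighted mean of the per-replicate ratios.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in group 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def pool(samples):
    """Sum (methylated, unmethylated) read counts over the replicates of a group."""
    return sum(m for m, _ in samples), sum(u for _, u in samples)


# Hypothetical counts at one CpG site: (methylated reads, unmethylated reads).
group1 = [(90, 10), (1, 4)]   # replicate ratios 0.90 (100 reads) and 0.20 (5 reads)
group2 = [(5, 5), (50, 50)]   # replicate ratios 0.50 and 0.50

m1, u1 = pool(group1)
m2, u2 = pool(group2)

pooled_ratio_1 = m1 / (m1 + u1)                                    # dominated by the deep replicate
mean_ratio_1 = sum(m / (m + u) for m, u in group1) / len(group1)  # replicates weighted equally
p_value = fisher_exact_two_sided(m1, u1, m2, u2)

print(pooled_ratio_1, mean_ratio_1, p_value)
```

In this toy case the pooled ratio of group 1 is close to the ratio of its high-coverage replicate, while the equally weighted mean is much lower, which is the bias toward samples with higher sequencing coverage that the text refers to.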

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.
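As a minimal sketch of the threshold-based merging mentioned above, the following groups DMC positions on one chromosome into candidate regions. The function name `merge_dmcs`, the parameters `max_gap` and `min_sites`, and the positions are assumptions made for illustration; the surveyed tools apply their own criteria and defaults.

```python
def merge_dmcs(positions, max_gap=100, min_sites=3):
    """Group DMC positions (one chromosome) into candidate regions.

    Neighboring DMCs are merged while the distance between them is at most
    max_gap bases; a region is kept only if it holds at least min_sites DMCs.
    Returns a list of (start, end, number_of_DMCs) tuples.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the current region and start a new one.
            if len(current) >= min_sites:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_sites:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Hypothetical DMC coordinates.
dmcs = [1000, 1040, 1090, 1500, 1520, 3000, 3050, 3120, 3200]

print(merge_dmcs(dmcs, max_gap=100, min_sites=3))
print(merge_dmcs(dmcs, max_gap=50, min_sites=3))
```

With these hypothetical positions, lowering `max_gap` from 100 to 50 removes the second region entirely, which illustrates why the thresholds need tuning before the resulting DMRs can be considered robust.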

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
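To make the effect of P-value correction concrete, the sketch below applies the standard Benjamini-Hochberg step-up adjustment to a handful of hypothetical per-site P-values. This procedure is chosen here only as a familiar example of FDR control; individual tools may use other corrections, and the function name and input values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-CpG P-values.
p = [0.001, 0.008, 0.039, 0.041, 0.6]
q = benjamini_hochberg(p)

raw_hits = sum(x < 0.05 for x in p)
adjusted_hits = sum(x < 0.05 for x in q)
print(q, raw_hits, adjusted_hits)
```

In this example four sites pass an uncorrected 0.05 cutoff, but only two remain after adjustment.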

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.

When the sample size is small in the data set, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

A comparison of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of individual methods using the receiver operating characteristic (ROC) curve, based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
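The ROC-based evaluation summarized above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of either review: the regional P-values and the ground-truth labels are hypothetical, and `roc_auc` is an illustrative helper that computes the area under the ROC curve through the rank-sum identity (the fraction of truly DM and non-DM region pairs that are ranked correctly, counting ties as one half).

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: higher scores should indicate true DM regions."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative region")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # correctly ranked pair
            elif sp == sn:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical regional P-values and simulated ground truth (True = truly DM).
p_vals = [0.001, 0.20, 0.03, 0.50, 0.04, 0.90]
truth = [True, True, True, False, False, False]

# Smaller P-values mean stronger evidence, so negate them to obtain scores.
auc = roc_auc([-p for p in p_vals], truth)
print(auc)
```

An AUC of 1 would mean every truly DM region receives a smaller P-value than every non-DM region, while 0.5 corresponds to random ranking. Such a comparison requires known truth labels, which is why simulated data are typically needed.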

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers design experiments to generate data that are better suited for the community. In addition, this survey will guide developers in developing new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

• Identification of the fittest approach among all that are available is a challenging task in DM analysis.

• A comprehensive benchmark of the available approaches that identify DM is greatly needed.

• Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.

2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.

3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.

4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.

5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.

6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.

7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.

8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.

9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.

10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Rev Genet 2012;13(7):484–92.

11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.

12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.

13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.

14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.

15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.

16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.

17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.

18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.

19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.

20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit/, 2010.

21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.

22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.

23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.

24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.

25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/

26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.

27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.

28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.

29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.

30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.

31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.

32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.

33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.

34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.

35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.

36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.

37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.

38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.

39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.

40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.

41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.

42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.

43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.


80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67

81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81

82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55

83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58

84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific

16 | Shafi et al

hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94

85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22

86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866

87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62

88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015

89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43

Identifying differential methylation | 17

  • bbx013-TF1
  • bbx013-TF2
  • bbx013-TF3
  • bbx013-TF4
Page 12: A survey of the approaches for identifying differential ... · (e.g. methylation and gene expression). In this review, we focus on bisulfite sequencing-based approaches. Within the

eDMR uses autocorrelation of the methylation data; HMM-based approaches (ComMet, HMM-Fisher and HMM-DM) use an HMM; CpG_MPs uses a hotspot extension algorithm; and SMART uses Euclidean distance based on methylation similarity to take into account the spatial correlation of the CpG sites.

Sequencing coverage is another important factor that affects the accuracy of the methylation estimation. Count-based hypothesis tests (e.g. FET, χ² test) take into account sequencing coverage by simply pooling the read counts; however, these tests require grouping of read counts, and this is biased toward the samples with higher sequencing coverage. For other DM analysis approaches, consideration of coverage information is not merely dependent on the hypothesis tests, but dependent on whether coverage information is incorporated when modeling the methylation levels of the CpG sites. For example, HMM-Fisher uses methylation ratios to estimate the methylation status at each CpG site, and then applies FET on the count of the methylation states to identify DMCs. Therefore, HMM-Fisher does not take into account read coverage, despite using FET as the hypothesis test. Among the surveyed approaches, BiSeq, ComMet, DMAP, swDMR, logistic regression-based and beta-binomial-based approaches are able to take the coverage information into account. Some approaches also include

Figure 4. The workflow of 22 approaches developed for DM analysis. t-test denotes a signal-to-noise statistic similar to the classical t-test. Predefined criteria represent user-defined thresholds such as P-value cutoff of the DMCs, length of the DMRs, distance between neighbor DMRs, minimum number of DMCs per DMR, cutoff value of CDIF (only for MOABS), etc. FET denotes Fisher's exact test, HMM denotes hidden Markov model, MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible methylation difference.

Figure 5. A higher level classification of the approaches discussed in this survey based on the data type used when modeling the methylation levels of the CpG sites.


Table 2. Comparison of the available implementations of the 22 surveyed approaches

| Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date |
|---|---|---|---|---|---|---|
| 1. methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016 |
| 2. eDMR [54, 64] | R package | Standalone | Artistic GPL | DMRs list (table) | 4 January 2013 | 4 April 2014 |
| 3. BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016 |
| 4. BiSeq [88] | Bioconductor R package | Standalone | LGPL v3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016 |
| 5. MOABS [70] | C++ package and Perl script | Standalone | GNU GPL v3 | DMRs list (table) | 12 June 2013 | 30 May 2015 |
| 6. DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 04 June 2012 | 17 October 2016 |
| 7. RADMeth [71] | C++ package | Standalone | GNU GPL v3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a) |
| 8. methylSig [72] | R package | Standalone | GNU GPL v3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016 |
| 9. DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016 |
| 10. MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015 |
| 11. DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016 |
| 12. GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016 |
| 13. ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CC ANS | DMRs list (table) | 12 December 2014 | 29 September 2015 |
| 14. HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016 |
| 15. HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016 |
| 16. QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012 |
| 17. CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015 |
| 18. SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015 |
| 19. COHCAP [46] | Bioconductor R package | Standalone | GNU GPL v3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016 |
| 20. DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016 |
| 21. swDMR [86] | Perl and R scripts | Standalone | GNU GPL v3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014 |
| 22. metilene [87] | C package | Standalone | GNU GPL v2 | DMRs list (table) | 8 May 2015 | 29 April 2016 |

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.

(b) Custom license stating that the software is free of charge to researchers working at academic non-profit organizations on non-commercial projects.

GNU, general public license; LGPL, lesser general public license; CC ANS, creative commons attribution-NonCommercial-ShareAlike 3.0 unported license; PSFL, python software foundation license; CLI, command line interface.


additional filters to remove low-coverage CpG sites before estimating methylation.
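The coverage bias of pooled count-based tests can be made concrete with a small sketch. It assumes hypothetical read counts at a single CpG site and uses a plain-Python two-sided Fisher's exact test; the function names `fisher_exact_two_sided`, `pool` and `mean_ratio` are illustrative and are not taken from any surveyed tool.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in group 1.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is at most as probable as the observed one.
    p = sum(q for q in map(prob, range(lo, hi + 1)) if q <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def pool(samples):
    """Pool (methylated, unmethylated) read counts over the samples of a group."""
    meth = sum(m for m, _ in samples)
    unmeth = sum(u for _, u in samples)
    return meth, unmeth


def mean_ratio(samples):
    """Average of the per-sample methylation ratios (each sample weighted equally)."""
    return sum(m / (m + u) for m, u in samples) / len(samples)


# Hypothetical counts at one CpG site: (methylated, unmethylated) per sample.
group1 = [(2, 8), (90, 10)]    # coverage 10 and 100
group2 = [(5, 5), (50, 50)]    # coverage 10 and 100

pooled1, pooled2 = pool(group1), pool(group2)
p_value = fisher_exact_two_sided(*pooled1, *pooled2)

print("pooled ratio, group 1:", pooled1[0] / sum(pooled1))   # dominated by the deep sample
print("mean of ratios, group 1:", mean_ratio(group1))
print("FET P-value on pooled counts:", p_value)
```

In this toy case the pooled methylation level of group 1 (92/110) sits close to the high-coverage sample, while the average of the per-sample ratios is 0.55, which is the bias toward deeply sequenced samples that the text describes.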

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.
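As a minimal sketch of the threshold-based merging strategy mentioned above, the function below groups DMC positions on one chromosome into candidate DMRs. The parameter names `max_gap` and `min_dmcs` and their default values are assumptions chosen for illustration; each surveyed tool defines its own thresholds and additional criteria.

```python
def merge_dmcs(positions, max_gap=100, min_dmcs=3):
    """Merge DMC positions on one chromosome into candidate DMRs.

    Neighboring DMCs are joined while the distance between them is at most
    max_gap bases; a region is reported only if it holds at least min_dmcs DMCs.
    Returns a list of (start, end, number_of_dmcs) tuples.
    """
    regions = []
    current = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            # Gap too large: close the open region and start a new one.
            if len(current) >= min_dmcs:
                regions.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_dmcs:
        regions.append((current[0], current[-1], len(current)))
    return regions


# Hypothetical DMC coordinates on a single chromosome.
dmcs = [100, 150, 180, 1000, 1050, 5000, 5040, 5090, 5120]
print(merge_dmcs(dmcs))  # [(100, 180, 3), (5000, 5120, 4)]
```

Changing `max_gap` or `min_dmcs` changes which regions are reported, which illustrates why these empirical thresholds need careful tuning.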

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
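One common way to correct P-values across CpG sites or regions is the Benjamini-Hochberg false discovery rate procedure. The sketch below is an illustration under that assumption; the surveyed approaches differ in which correction they apply, and the function name `benjamini_hochberg` is ours.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-CpG P-values.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adj) if q <= 0.05]
print(adj)
print(significant)
```

The same function can be applied to site-level, region-level or within-region P-values; only the set of hypotheses passed in changes.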

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.
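To illustrate why a beta-binomial model captures both coverage and biological variability, the sketch below evaluates the beta-binomial probability mass function for a site with coverage `n`, mean methylation level `mu` and dispersion `phi`. The mean/dispersion parameterization (with `phi` strictly between 0 and 1) is an assumption made for this example; individual tools may parameterize the distribution differently.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_logpmf(k, n, mu, phi):
    """log P(k methylated reads out of n) for mean level mu and dispersion phi."""
    alpha = mu * (1.0 - phi) / phi
    beta = (1.0 - mu) * (1.0 - phi) / phi
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)


def moments(n, mu, phi):
    """Total probability, mean and variance obtained by summing the pmf."""
    pmf = [exp(betabinom_logpmf(k, n, mu, phi)) for k in range(n + 1)]
    mean = sum(k * p for k, p in enumerate(pmf))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
    return sum(pmf), mean, var


n, mu, phi = 20, 0.3, 0.1          # hypothetical coverage, level and dispersion
total, mean, var = moments(n, mu, phi)
binomial_var = n * mu * (1 - mu)                         # sampling noise only
overdispersed_var = binomial_var * (1 + (n - 1) * phi)   # plus biological variability

print(total, mean, var)
print(binomial_var, overdispersed_var)
```

Under this parameterization the variance of the methylated read count exceeds the binomial variance by the factor 1 + (n - 1)·phi, so both the coverage `n` and the between-replicate variability `phi` enter the model explicitly.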

When sample size is small in the data set, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of individual methods using the receiver operating characteristic (ROC) curve based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data, the lack of a gold standard for performance evaluation and the lack of a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
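The ROC-based comparison described above can be sketched as follows. The example assumes hypothetical regional P-values and truth labels, and computes the area under the ROC curve through its rank interpretation (the probability that a truly differentially methylated region receives a better score than a non-differential one). It is not the evaluation code of either review.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve; a higher score means stronger evidence for DM.

    Computed as the fraction of (positive, negative) pairs in which the
    positive region scores higher, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("both classes are needed to compute the AUC")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical regional P-values and truth labels (1 = truly differentially methylated).
p_values = [0.001, 0.02, 0.03, 0.6, 0.04]
truth = [1, 0, 1, 0, 1]

# Small P-values indicate stronger evidence, so the score is the negated P-value.
auc = roc_auc([-p for p in p_values], truth)
print(auc)
```

An AUC of 1 would mean that every true region is ranked ahead of every false one, whereas 0.5 corresponds to random ranking; computing this value per method on the same labeled regions gives a comparison of the kind reported by Klein et al.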

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers design experiments to generate data that are better suited for the community. In addition, this survey will guide developers in building new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

  • Identification of the fittest approach among all that are available is a challenging task in DM analysis.

  • A comprehensive benchmark of the available approaches that identify DM is greatly needed.

  • Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.

2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.

3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.

4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.

5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.

6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.

7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.

8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.

9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.

10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 2012;13(7):484–92.

11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.

12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.

13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.

14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.

15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.

16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.

17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.

18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.

19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.

20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.

21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.

22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.

23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.

24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.

25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore

26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.

27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.

28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.

29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.

30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.

31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.

32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.

33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.

34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.

35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.

36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.

37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.

38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.

39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.

40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.

41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.

42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.

44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.

45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.

46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.

47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.

48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.

49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.

50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.

51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.

52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.

53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.

54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.

55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.

56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.

57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.

58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.

59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.

60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.

61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.

62. Gelman A. Analysis of variance - why it is more important than ever. Ann Stat 2005;33(1):1–53.

63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.

64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.

65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.

66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.

67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.

68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.

69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.

70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.

71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.

72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.

73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.

74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.

75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.

76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.

77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.

78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.

79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.

80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher's exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.

81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.

82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.

83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.

84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.

86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.

87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.

88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.

89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.

Identifying differential methylation | 17

  • bbx013-TF1
  • bbx013-TF2
  • bbx013-TF3
  • bbx013-TF4
Page 13: A survey of the approaches for identifying differential ... · (e.g. methylation and gene expression). In this review, we focus on bisulfite sequencing-based approaches. Within the

Table 2. Comparison of the available implementations of the 22 surveyed approaches

| Method (tool) and tool reference | Platform | Availability | License | Output | Published date | Updated date |
|---|---|---|---|---|---|---|
| 1. methylKit [54] | Bioconductor R package | Standalone | Artistic v2 | DMCs, DMRs list (table); DMCs, DMRs per chromosome (graph) | 9 November 2011 | 22 October 2016 |
| 2. eDMR [54, 64] | R package | Standalone | Artistic, GPL | DMRs list (table) | 4 January 2013 | 4 April 2014 |
| 3. BSmooth (bsseq) [43] | Bioconductor R package | Standalone | Artistic v2 | DMRs list (table); DMR locus methylation level (graph) | 20 July 2012 | 14 October 2016 |
| 4. BiSeq [88] | Bioconductor R package | Standalone | LGPLv3 | DMRs list (table); DMR mean methylation (graph) | 2 April 2013 | 17 October 2016 |
| 5. MOABS [70] | C++ package and Perl script | Standalone | GNU GPLv3 | DMRs list (table) | 12 June 2013 | 30 May 2015 |
| 6. DSS [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 04 June 2012 | 17 October 2016 |
| 7. RADMeth [71] | C++ package | Standalone | GNU GPLv3 | DMCs, DMRs list (table) | 27 March 2014 | 1 May 2014 (a) |
| 8. methylSig [72] | R package | Standalone | GNU GPLv3 | DMCs, DMRs list (table); CpG sites methylation rate (graph) | 17 June 2014 | 10 June 2016 |
| 9. DSS-single (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 16 April 2015 | 17 October 2016 |
| 10. MACAU [74] | C++ package and R script | Standalone | GNU GPL | DMCs list (table) | 5 June 2015 | 9 December 2015 |
| 11. DSS-general (DSS) [69, 73, 75, 89] | Bioconductor R package | Standalone | GNU GPL | DMCs, DMRs list (table); DMR methylation locus (graph) | 29 April 2015 | 17 October 2016 |
| 12. GetisDMR [76] | C++ package and R scripts | Standalone | GNU GPL | DMRs list (table) | 28 April 2016 | 28 September 2016 |
| 13. ComMet (Bisulfighter) [78] | C++ package and Python | Standalone | CC ANS | DMRs list (table) | 12 December 2014 | 29 September 2015 |
| 14. HMM-Fisher [80] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 25 April 2014 | 29 February 2016 |
| 15. HMM-DM [81] | R scripts | Standalone | None | DMRs list (table); DMR locus methylation level (graph) | 27 March 2014 | 24 March 2016 |
| 16. QDMR [83] | Java package | Standalone, web, CLI | Custom (b) | DMRs list (table); DMR in UCSC Genome Browser (graph) | 10 May 2010 | 17 October 2012 |
| 17. CpG_MPs [51] | Java package and Perl script | Standalone, web, CLI | None | DMRs list (table) | 20 June 2011 | 1 September 2015 |
| 18. SMART (SMART-BS-Seq) [84] | Python package | Standalone | PSFL | DMRs list (table) | 17 May 2015 | 17 May 2015 |
| 19. COHCAP [46] | Bioconductor R package | Standalone | GNU GPLv3 | DMCs and DM CpG islands list (table); DM CpG islands methylation average (graph) | 9 January 2014 | 17 October 2016 |
| 20. DMAP (meth_progs_dist) [85] | C package | Standalone | None | DMRs list (table) | 14 May 2013 | 28 August 2016 |
| 21. swDMR [86] | Perl and R scripts | Standalone | GNU GPLv3 | DMRs list (table); DMR methylation level (graph) | 6 January 2013 | 15 June 2014 |
| 22. metilene [87] | C package | Standalone | GNU GPLv2 | DMRs list (table) | 8 May 2015 | 29 April 2016 |

(a) RADMeth is now part of the MethPipe tool, released on 6 September 2013, with the latest update on 21 October 2016.
(b) Custom license stating that the software is free of charge to researchers working at academic non-profit organizations on non-commercial projects.
Abbreviations: GNU GPL, GNU general public license; LGPL, lesser general public license; CC ANS, Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license; PSFL, Python Software Foundation license; CLI, command line interface.


additional filters to remove low coverage CpG sites before estimating methylation.

Identifying de novo regions is another important feature of the approaches that identify DM. Approaches that identify de novo regions use various techniques, such as merging DMCs using empirical thresholds, entropy-based algorithms and binary segmentation, to estimate DMR boundaries (see Figure 4). While empirical thresholds allow for more flexibility to the users, proper tuning of these parameters is necessary to get robust results. Some of the approaches, in addition to the list of DMRs, provide information such as the list of DMCs, genetic annotations and visualization of the DMRs.
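The threshold-based merging of DMCs mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure of any surveyed tool: the function name `merge_dmcs` and the parameters `max_gap` and `min_sites` are hypothetical, and their default values are arbitrary.

```python
from typing import List, Tuple


def merge_dmcs(positions: List[int], max_gap: int = 100,
               min_sites: int = 3) -> List[Tuple[int, int, int]]:
    """Merge DMC positions on one chromosome into candidate DMRs.

    A region is a run of DMCs in which neighbouring sites are at most
    `max_gap` bases apart; runs with fewer than `min_sites` DMCs are dropped.
    Returns (start, end, number_of_DMCs) for each retained region.
    Both thresholds are hypothetical, user-tuned parameters.
    """
    regions = []
    run: List[int] = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current run.
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_sites:
                regions.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    # Close the final run.
    if len(run) >= min_sites:
        regions.append((run[0], run[-1], len(run)))
    return regions


dmcs = [100, 150, 180, 1000, 1040, 5000, 5050, 5090, 5120]
print(merge_dmcs(dmcs))  # [(100, 180, 3), (5000, 5120, 4)]
```

Changing `max_gap` or `min_sites` changes which regions are reported, which is why the tuning of such thresholds matters for robust results.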

Error control is another important factor in DM analysis, as it reduces the number of false positives in the results. Approaches control errors by correcting P-values for each CpG site across the genome, correcting P-values for each region, correcting the P-values within the identified regions, etc.
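As one concrete instance of correcting per-site P-values across the genome, the sketch below applies the Benjamini-Hochberg step-up adjustment. The choice of this procedure is an assumption made for illustration; the surveyed tools differ in the corrections they apply, and the function name is hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    # Indices of the P-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


site_pvalues = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(site_pvalues))
```

The same function can be applied to region-level P-values; only the list of tested hypotheses changes.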

Identification of the fittest approach among all that are available is a challenging task in DM analysis. If biological replicates are available, beta-binomial approaches are suitable because they take both coverage information and biological variability among the replicates into account. In addition, they can identify low CpG density regions where methylation has sharp changes (e.g. TFBS). Within the beta-binomial-based approaches, DSS-single, MACAU and GetisDMR take spatial correlation into account. Therefore, these three approaches are more appropriate if the methylation levels of the CpG sites are known to be spatially correlated and biological replicates are available. Smoothing-based approaches, entropy-based approaches, HMM-Fisher, HMM-DM and metilene can also be applied when biological replicates are available. Similarly, if the methylation levels of the CpG sites are known to be spatially correlated, approaches that take spatial distribution into consideration, such as smoothing-based approaches, HMM-based approaches, DSS-single, MACAU, GetisDMR, CpG_MPs and SMART, should be used.
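To illustrate why a beta-binomial model captures both coverage and biological variability, the following sketch evaluates the beta-binomial probability mass function for the number of methylated reads at one site and compares its variance with the plain binomial variance. The parameterization by `alpha` and `beta` and the function names are assumptions for illustration; individual tools use their own parameterizations and estimators.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def betabinom_pmf(k: int, n: int, alpha: float, beta: float) -> float:
    """P(k methylated reads out of n) when the methylation level ~ Beta(alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta))


def betabinom_variance(n: int, alpha: float, beta: float) -> float:
    """Variance n*mu*(1-mu)*(1+(n-1)*phi), with dispersion phi = 1/(alpha+beta+1)."""
    mu = alpha / (alpha + beta)
    phi = 1.0 / (alpha + beta + 1.0)
    return n * mu * (1.0 - mu) * (1.0 + (n - 1) * phi)


coverage, alpha, beta = 20, 2.0, 3.0   # hypothetical site: 20 reads, mean level 0.4
mu = alpha / (alpha + beta)
print("binomial variance:     ", coverage * mu * (1 - mu))
print("beta-binomial variance:", betabinom_variance(coverage, alpha, beta))
```

The extra factor `1 + (n - 1) * phi` is the overdispersion attributed to variability among replicates; it grows with coverage `n`, which a plain binomial model ignores.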

When the sample size of the data set is small, DSS, methylSig and HMM-Fisher are appropriate. While DSS uses information from all CpG sites and an empirical Bayes estimate to achieve variation shrinkage, methylSig uses local information and a maximum likelihood estimator to compute both the methylation level and the variance. HMM-Fisher, on the other hand, combines two CpG sites while conducting FET if the distance between them is <100 bases. If multiple experimental factors are available in the data set, approaches such as methylKit, eDMR, BiSeq, RADMeth, MACAU, DSS-general and GetisDMR are more appropriate because they allow additional covariates in their model.
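The idea of pooling two nearby CpG sites before running Fisher's exact test (FET) can be sketched as follows. This is a simplified illustration and not HMM-Fisher's actual implementation: the tuple layout `(position, meth_A, unmeth_A, meth_B, unmeth_B)`, the function names and the `max_dist` parameter are hypothetical, and read counts are simply summed across the two sites.

```python
from math import comb
from typing import List, Tuple

Site = Tuple[int, int, int, int, int]  # (position, meth_A, unmeth_A, meth_B, unmeth_B)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more likely than the observed one.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def pooled_fet(site1: Site, site2: Site, max_dist: int = 100) -> List[float]:
    """Test two sites jointly if they are closer than max_dist bases, else separately."""
    pos1, *counts1 = site1
    pos2, *counts2 = site2
    if abs(pos2 - pos1) < max_dist:
        pooled = [x + y for x, y in zip(counts1, counts2)]
        return [fisher_exact_two_sided(*pooled)]
    return [fisher_exact_two_sided(*counts1), fisher_exact_two_sided(*counts2)]


print(pooled_fet((10, 2, 0, 0, 2), (50, 1, 1, 1, 1)))    # one pooled test
print(pooled_fet((10, 3, 1, 1, 3), (500, 3, 1, 1, 3)))   # two separate tests
```

Pooling increases the read counts entering each test, which is the motivation for combining neighbouring sites when coverage or sample size is low.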

Suitable approaches can also be chosen based on their primary purposes. For example, QDMR, CpG_MPs or HMM-Fisher can be used to identify methylation patterns from a single sample. To identify cell type-specific methylation marks from large sample cohorts, SMART is a suitable choice. To identify DM patterns (hypermethylation and hypomethylation) across two groups of samples, HMM-Fisher and HMM-DM are more appropriate. Approaches can be chosen based on the input data type as well. For instance, if the data protocol is RRBS and the purpose is to identify DMRs, then QDMR, BiSeq, DSS-general or COHCAP can be applied. To work with CHG or CHH methylation, methylKit, eDMR, MOABS, DSS, RADMeth and swDMR are recommended because they are not limited to CpG methylation.

Comparisons of some of the approaches can be found in two existing review papers: Klein et al. [15] and Yu and Sun [16]. Klein et al. compared four tools that were originally developed for DM analysis: BiSeq [88], COHCAP [46], methylKit [54] and RADMeth [71]. This review evaluates the trade-off between the sensitivity and specificity of individual methods using the receiver operating characteristic (ROC) based on the regional P-values of the identified regions. The performance of each method is then assessed by computing and comparing the area under the ROC curve. According to this review, BiSeq and RADMeth outperform COHCAP and methylKit. Yu and Sun [16] compared BSmooth, methylKit, BiSeq, HMM-Fisher and HMM-DM. According to this review, HMM-Fisher and HMM-DM achieved higher sensitivity and specificity than the other three methods. To assess the performance of all of the available approaches, a benchmark analysis is needed. Due to the complex nature of the methylation data and the lack of both a gold standard for performance evaluation and a standardized format for the input data, building a benchmark for assessing the efficiency of these approaches is a challenging task and out of the scope of this survey.
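The ROC-based evaluation described above can be illustrated with a small sketch that computes the area under the ROC curve from regional P-values and known truth labels. The data are invented for illustration, the function name is hypothetical, and the pairwise (Mann-Whitney) formulation of the AUC is one standard way to compute it, not necessarily the one used in the cited reviews.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve.

    Computed as the probability that a randomly chosen true region
    receives a higher score than a randomly chosen false region,
    counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical regional P-values and simulated truth (1 = truly differentially methylated).
pvalues = [0.001, 0.20, 0.05, 0.60, 0.04]
truth = [1, 0, 1, 0, 0]
scores = [1.0 - p for p in pvalues]  # smaller P-value means stronger evidence
print(roc_auc(scores, truth))
```

Because the AUC depends only on the ranking of the regions, any monotone decreasing transform of the P-values gives the same value.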

In addition to the conceptual overview, we also summarized the implementations of the approaches in Table 2. The summary includes platform information, license information, output format, published date and last update date. While this is a condensed view of the capabilities of these tools, it could still be expanded to include information such as consistency in the input and output formats. Such details, as well as a simulated noise-free data set with known results, are further requirements toward creating a comprehensive benchmark for assessing the practical performance of DM detection tools.

Conclusion

Epigenetic modifications are thought to play a role in developmental disorders and cancer, are likely to be influenced by environmental factors and are known to regulate gene expression. Identification of DM using bisulfite sequencing data is a crucial step in the analysis of epigenetic data. Several statistical methods have been developed to address this challenge. In this study, we survey 22 methods that identify DM from bisulfite sequencing data. All the approaches surveyed in this article were developed within the past 5 years, which shows great interest in progress in this area. Our main objective in this survey is to provide the community with a comprehensive view of the existing approaches that identify DM from bisulfite sequencing data. To do that, we classify the approaches into seven categories based on their primary concepts and features. We summarize the distinguishing characteristics, benefits and limitations of each approach and category. This survey is intended to help potential users choose the best DM analysis method based on their requirements. It will help researchers design experiments that generate data better suited for the community. In addition, this survey will guide developers in building new, efficient statistical models that identify DM by considering the key characteristics described here.

Key points

  • Identification of the fittest approach among all that are available is a challenging task in DM analysis.
  • A comprehensive benchmark of the available approaches that identify DM is greatly needed.
  • Due to the high computation cost, only a few web-based implementations of the approaches are currently available.


Funding

National Institutes of Health (RO1 DK089167, STTR R42GM087013), National Science Foundation (DBI-0965741) and Robert J. Sokol, MD Endowment in Systems Biology (to S.D.). Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

References

1. Deaton AM, Bird A. CpG islands and the regulation of transcription. Genes Dev 2011;25(10):1010–22.
2. Esteller M. Cancer epigenomics: DNA methylomes and histone-modification maps. Nat Rev Genet 2007;8(4):286–98.
3. Lister R, Pelizzola M, Dowen RH, et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009;462(7271):315–22.
4. Krueger F, Kreck B, Franke A, et al. DNA methylome analysis using short bisulfite sequencing data. Nat Methods 2012;9(2):145–51.
5. Feng S, Jacobsen SE, Reik W. Epigenetic reprogramming in plant and animal development. Science 2010;330(6004):622–7.
6. Lindroth AM, Cao X, Jackson JP, et al. Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 2001;292(5524):2077–80.
7. Breiling A, Lyko F. Epigenetic regulatory functions of DNA modifications: 5-methylcytosine and beyond. Epigenetics Chromatin 2015;8(1):24.
8. Hendrich B, Bird A. Identification and characterization of a family of mammalian methyl-CpG binding proteins. Mol Cell Biol 1998;18(11):6538–47.
9. Bird AP, Wolffe AP. Methylation-induced repression–belts, braces, and chromatin. Cell 1999;99(5):451–4.
10. Jones PA. Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nat Rev Genet 2012;13(7):484–92.
11. Harris RA, Wang T, Coarfa C, et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol 2010;28(10):1097–105.
12. Taiwo O, Wilson GA, Morris T, et al. Methylome analysis using MeDIP-seq with low DNA concentrations. Nat Protoc 2012;7(4):617–36.
13. Gu H, Bock C, Mikkelsen TS, et al. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nat Methods 2010;7(2):133–6.
14. Robinson MD, Kahraman A, Law CW, et al. Statistical methods for detecting differentially methylated loci and regions. Front Genet 2014;5:324.
15. Klein HU, Hebestreit K. An evaluation of methods to test predefined genomic regions for differential methylation in bisulfite sequencing data. Brief Bioinform 2016;17:769–807.
16. Yu X, Sun S. Comparing five statistical methods of differential methylation identification using bisulfite sequencing data. Stat Appl Genet Mol Biol 2016;15(2):173–91.
17. Sun Z, Cunningham J, Slager S, et al. Base resolution methylome profiling: considerations in platform selection, data preprocessing and analysis. Epigenomics 2015;7(5):813–28.
18. Clark SJ, Statham A, Stirzaker C, et al. DNA methylation: bisulphite modification and analysis. Nat Protoc 2006;1(5):2353–64.
19. Meissner A, Gnirke A, Bell GW, et al. Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res 2005;33(18):5868–77.
20. FASTX-Toolkit. FASTQ/A short-reads pre-processing tools. http://hannonlab.cshl.edu/fastx_toolkit, 2010.
21. Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011;27(6):863–4.
22. Cox MP, Peterson DA, Biggs PJ. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 2010;11(1):485.
23. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J 2011;17(1):10.
24. Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 2014;30(15):2114–20.
25. Trim Galore. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore.
26. Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 2011;27(11):1571–2.
27. Chen PY, Cokus SJ, Pellegrini M. BS seeker: precise mapping for bisulfite sequencing. BMC Bioinformatics 2010;11(1):203.
28. Pedersen B, Hsieh TF, Ibarra C, et al. MethylCoder: software pipeline for bisulfite-treated sequences. Bioinformatics 2011;27(17):2435–6.
29. Harris EY, Ponts N, Levchuk A, et al. BRAT: bisulfite-treated reads analysis tool. Bioinformatics 2010;26(4):572–3.
30. Hong C, Clement NL, Clement S, et al. Probabilistic alignment leads to improved accuracy and read coverage for bisulfite sequencing data. BMC Bioinformatics 2013;14(1):337.
31. Langmead B, Trapnell C, Pop M, et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 2009;10(3):R25.
32. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012;9(4):357–9.
33. Xi Y, Li W. BSMAP: whole genome bisulfite sequence MAPping program. BMC Bioinformatics 2009;10:232.
34. Xi Y, Bock C, Muller F, et al. RRBSMAP: a fast, accurate and user-friendly alignment tool for reduced representation bisulfite sequencing. Bioinformatics 2012;28(3):430–2.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010;26(7):873–81.
36. Smith AD, Chung WY, Hodges E, et al. Updates to the RMAP short-read mapping software. Bioinformatics 2009;25(21):2841–2.
37. Bock C, Reither S, Mikeska T, et al. BiQ analyzer: visualization and quality control for DNA methylation data from bisulfite sequencing. Bioinformatics 2005;21(21):4067–8.
38. Kumaki Y, Oda M, Okano M. QUMA: quantification tool for methylation analysis. Nucleic Acids Res 2008;36(Suppl 2):W170–5.
39. Sun S, Noviski A, Yu X. MethyQA: a pipeline for bisulfite-treated methylation sequencing quality assessment. BMC Bioinformatics 2013;14(1):259.
40. Hu K, Ting AH, Li J. BSPAT: a fast online tool for DNA methylation co-occurrence pattern analysis based on high-throughput bisulfite sequencing data. BMC Bioinformatics 2015;16(1):220.
41. Liao WW, Yen MR, Ju E, et al. MethGo: a comprehensive tool for analyzing whole-genome bisulfite sequencing data. BMC Genomics 2015;16(12):S11.
42. Eckhardt F, Lewin J, Cortese R, et al. DNA methylation profiling of human chromosomes 6, 20 and 22. Nat Genet 2006;38(12):1378–85.


43. Hansen KD, Langmead B, Irizarry RA. BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol 2012;13(10):R83.
44. Jaffe AE, Feinberg AP, Irizarry RA, et al. Significance analysis and statistical dissection of variably methylated regions. Biostatistics 2012;13(1):166–78.
45. Feinberg AP, Irizarry RA. Stochastic epigenetic variation as a driving force of development, evolutionary adaptation and disease. Proc Natl Acad Sci USA 2010;107(Suppl 1):1757–64.
46. Warden CD, Lee H, Tompkins JD, et al. COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res 2013;41(11):e117.
47. Cameron EE, Baylin SB, Herman JG. p15INK4B CpG island methylation in primary acute leukemia is heterogeneous and suggests density as a critical factor for transcriptional silencing. Blood 1999;94(7):2445–51.
48. Smallwood SA, Lee HJ, Angermueller C, et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat Methods 2014;11(8):817–20.
49. Varley KE, Mutch DG, Edmonston TB, et al. Intra-tumor heterogeneity of MLH1 promoter methylation revealed by deep single molecule bisulfite sequencing. Nucleic Acids Res 2009;37(14):4603–12.
50. Singer ZS, Yong J, Tischler J, et al. Dynamic heterogeneity and DNA methylation in embryonic stem cells. Mol Cell 2014;55(2):319–31.
51. Su J, Yan H, Wei Y, et al. CpG_MPs: identification of CpG methylation patterns of genomic regions from high-throughput bisulfite sequencing data. Nucleic Acids Res 2013;41(1):e4.
52. Bibikova M, Chudin E, Wu B, et al. Human embryonic stem cells have a unique epigenetic signature. Genome Res 2006;16(9):1075–83.
53. Byun HM, Siegmund KD, Pan F, et al. Epigenetic profiling of somatic tissues from human autopsy specimens identifies tissue- and individual-specific DNA methylation patterns. Hum Mol Genet 2009;18(24):4808–17.
54. Akalin A, Kormaksson M, Li S, et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol 2012;13(10):R87.
55. Hurlbert SH. Pseudoreplication and the design of ecological field experiments. Ecol Monogr 1984;54(2):187–211.
56. Soneson C, Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics 2013;14(1):91.
57. Tony Ng HK, Tang ML. Testing the equality of two Poisson means using the rate ratio. Stat Med 2005;24(6):955–65.
58. Gosset WS. The probable error of a mean. Biometrika 1908;6:1–25.
59. Pearson ES, Hartley HO. Biometrika tables for statisticians (vol. 2). Biometrika Trust, page 385, 1976.
60. Smyth GK. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat Appl Genet Mol Biol 2004;3(1):Article3.
61. Goeman JJ, Van De Geer SA, De Kort F, et al. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 2004;20(1):93–9.
62. Gelman A. Analysis of variance: why it is more important than ever. Ann Stat 2005;33(1):1–53.
63. Wang HQ, Tuominen LK, Tsai CJ. SLIM: a sliding linear model for estimating the proportion of true null hypotheses in datasets with dependence structures. Bioinformatics 2011;27(2):225–31.
64. Li S, Garrett-Bakelman FE, Akalin A, et al. An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 2013;14(Suppl 5):S10.
65. Pedersen BS, Schwartz DA, Yang IV, et al. Comb-p: software for combining, analyzing, grouping and correcting spatially correlated P-values. Bioinformatics 2012;28(22):2986–8.
66. Hebestreit K, Dugas M, Klein HU. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 2013;29(13):1647–53.
67. Benjamini Y, Hochberg Y. Multiple hypotheses testing with weights. Scand J Stat 1997;24(3):407–18.
68. Rhee HS, Franklin Pugh B. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 2011;147(6):1408–19.
69. Feng H, Conneely KN, Wu H. A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Res 2014;42(8):e69.
70. Sun D, Xi Y, Rodriguez B, et al. MOABS: model based analysis of bisulfite sequencing data. Genome Biol 2014;15(2):R38.
71. Dolzhenko E, Smith AD. Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments. BMC Bioinformatics 2014;15(1):215.
72. Park Y, Figueroa ME, Rozek LS, et al. MethylSig: a whole genome DNA methylation analysis pipeline. Bioinformatics 2014;30:2414–22.
73. Wu H, Xu T, Feng H, et al. Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res 2015;43(21):e141.
74. Lea AJ, Tung J, Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet 2015;11(11):e1005650.
75. Park Y, Wu H. Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics 2016;32(10):1446–53.
76. Wen Y, Chen F, Zhang Q, et al. Detection of differentially methylated regions in whole genome bisulfite sequencing data using local Getis-Ord statistics. Bioinformatics 2016;32:3396–404.
77. Zaykin DV. Optimally weighted Z-test is a powerful method for combining probabilities in meta-analysis. J Evol Biol 2011;24(8):1836–41.
78. Saito Y, Tsuji J, Mituyama T. Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res 2014:e45.
79. Saito Y, Mituyama T. Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions. BMC Genomics 2015;16(12):S3.
80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher's exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.
81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.
82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.
83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.
84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.
85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.
86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PloS One 2015;10(7):e0132866.
87. Juhling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.
88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.
89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.

Identifying differential methylation | 17

  • bbx013-TF1
  • bbx013-TF2
  • bbx013-TF3
  • bbx013-TF4
Page 14: A survey of the approaches for identifying differential ... · (e.g. methylation and gene expression). In this review, we focus on bisulfite sequencing-based approaches. Within the

additional filters to remove low coverage CpG sites before esti-mating methylation

Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs

Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc

Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used

When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel

Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation

Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and

RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey

In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools

Conclusion

Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here

Key points

bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis

bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed

bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available

14 | Shafi et al

Funding

National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies


  • bbx013-TF1
  • bbx013-TF2
  • bbx013-TF3
  • bbx013-TF4
Page 15: A survey of the approaches for identifying differential ... · (e.g. methylation and gene expression). In this review, we focus on bisulfite sequencing-based approaches. Within the

Funding

National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies

References1 Deaton AM Bird A CpG islands and the regulation of tran-

scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and

histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-

omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22

4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51

5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7

6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80

7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24

8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47

9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4

10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92

11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105

12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36

13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6

14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324

15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807

16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91

17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28

18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64

19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77

20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010

21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4

22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485

23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10

24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20

25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore

26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2

27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203

28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6

29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3

30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337

31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25

32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9

33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232

34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2

35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81

36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2

37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8

38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5

39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259

40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220

41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11

42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85

Identifying differential methylation | 15

43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83

44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78

45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64

46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117

47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51

48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20

49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res

200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and

DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31

51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4

52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83

53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17

54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87

55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211

56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91

57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65

58Gosset WS The probable error of a mean Biometrika190861ndash25

59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976

60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3

61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9

62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53

63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31

64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10

65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8

66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53

67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18

68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19

69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69

70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38

71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215

72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22

73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141

74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650

75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53

76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404

77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41

78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45

79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3

80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67

81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81

82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55

83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58

84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific

16 | Shafi et al

hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94

85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22

86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866

87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62

88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015

89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43

Identifying differential methylation | 17

  • bbx013-TF1
  • bbx013-TF2
  • bbx013-TF3
  • bbx013-TF4
Page 16: A survey of the approaches for identifying differential ... · (e.g. methylation and gene expression). In this review, we focus on bisulfite sequencing-based approaches. Within the

43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83

44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78

45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64

46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117

47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51

48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20

49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res

200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and

DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31

51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4

52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83

53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17

54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87

55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211

56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91

57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65

58Gosset WS The probable error of a mean Biometrika190861ndash25

59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976

60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3

61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9

62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53

63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31

64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10

65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8

66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53

67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18

68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19

69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69

70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38

71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215

72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22

73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141

74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650

75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53

76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404

77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41

78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45

79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3

80. Sun S, Yu X. HMM-Fisher: identifying differential methylation using a hidden Markov model and Fisher's exact test. Stat Appl Genet Mol Biol 2016;15(1):55–67.

81. Yu X, Sun S. HMM-DM: identifying differentially methylated regions using a hidden Markov model. Stat Appl Genet Mol Biol 2016;15(1):69–81.

82. Shannon CE. A mathematical theory of communication. ACM SIGMOBILE Mobile Comput Commun Rev 2001;5(1):3–55.

83. Zhang Y, Liu H, Lv J, et al. QDMR: a quantitative method for identification of differentially methylated regions by entropy. Nucleic Acids Res 2011;39(9):e58.

84. Liu H, Liu X, Zhang S, et al. Systematic identification and annotation of human methylation marks based on bisulfite sequencing methylomes reveals distinct roles of cell type-specific hypomethylation in the regulation of cell identity genes. Nucleic Acids Res 2016;44(1):75–94.

85. Stockwell PA, Chatterjee A, Rodger EJ, et al. DMAP: differential methylation analysis package for RRBS and WGBS data. Bioinformatics 2014;30(13):1814–22.

86. Wang Z, Li X, Jiang Y, et al. swDMR: a sliding window approach to identify differentially methylated regions based on whole genome bisulfite sequencing. PLoS One 2015;10(7):e0132866.

87. Jühling F, Kretzmer H, Bernhart SH, et al. metilene: fast and sensitive calling of differentially methylated regions from bisulfite sequencing data. Genome Res 2016;26(2):256–62.

88. Hebestreit K, Klein HU. BiSeq: processing and analyzing bisulfite sequencing data. R package version 1.14.0, 2015.

89. Wu H, Wang C, Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics 2013;14(2):232–43.
