A survey of the approaches for identifying differential
methylation using bisulfite sequencing dataAdib Shafi Cristina Mitrea Tin Nguyen and Sorin DraghiciCorresponding author Sorin Draghici Department of Computer Science and Department of Obstetrics and Gynecology Wayne State University 14thFloor Suite 14200 5057 Woodward Detroit MI 48202 Tel thorn1(313)577-2162 Fax thorn1(313) 577-6868 E-mail sorinwayneedu
Abstract
DNA methylation is an important epigenetic mechanism that plays a crucial role in cellular regulatory systems Recentadvancements in sequencing technologies now enable us to generate high-throughput methylation data and to measuremethylation up to single-base resolution This wealth of data does not come without challenges and one of the keychallenges in DNA methylation studies is to identify the significant differences in the methylation levels of the base pairsacross distinct biological conditions Several computational methods have been developed to identify differential methyla-tion using bisulfite sequencing data however there is no clear consensus among existing approaches A comprehensivesurvey of these approaches would be of great benefit to potential users and researchers to get a complete picture of theavailable resources In this article we present a detailed survey of 22 such approaches focusing on their underlying statis-tical models primary features key advantages and major limitations Importantly the intrinsic drawbacks of theapproaches pointed out in this survey could potentially be addressed by future research
Key words DNA methylation epigenetic modification differentially methylated cytosines (DMCs) differentially methylatedregions (DMRs) bisulfite sequencing
Introduction
Epigenetics is the field of study that provides information onhow where and when genes are switched on and off inside aliving cell DNA methylation is an intensively studied and wellunderstood epigenetic mechanism that plays a vital role inmany processes [1] Due to its role in regulating gene expres-sion DNA methylation is an important part of cellular processessuch as cell development and differentiation Furthermore pat-terns of hypermethylation have been identified in human can-cers which can provide novel insights into the development
and progression of such complex diseases [2] Specifically incancer one of the causes of silenced tumor suppressor genes ishypermethylation
The most studied form of DNA methylation known as5-methylcytosine (5-mc) involves the addition of a methylgroup to the 5-carbon of the cytosine (C) base of a DNA strandAlthough approximately only 5 of the cytosine bases in thehuman genome are methylated cytosine (C) followed by aguanine (G) which is known as a CpG site is methylated70ndash80 of the time [3 4] Methylation can also occur in non-CpGcontext such as CHG and CHH sites (where Hfrac14C T or A)
Adib Shafi is a PhD candidate in the Department of Computer Science at Wayne State University USA His research interests include biological pathwayanalysis finding mechanism using multi-omics data and variant analysisCristina Mitrea is a PhD candidate in the Department of Computer Science at Wayne State University USA Her work is focused on research in data miningtechniques applied to bioinformatics and computational biology Other interests include network discovery and meta-analysis applied to pathwayanalysisTin Nguyen received his PhD from the Computer Science Department at Wayne State University His research interests include computational and statis-tical methods for analyzing high-throughput data His current foci are meta-analysis and multi-omics data integrationSorin Draghici currently holds the Robert J Sokol MD Endowed Chair in Systems Biology as well as appointments as Full Professor with the Departmentof Computer Science and the Department of Obstetrics and Gynecology Wayne State University He is also the head of the Intelligent Systems andBioinformatics Laboratory in the Department of Computer Science His work is focused on research in artificial intelligence machine learning and datamining techniques applied to bioinformatics and computational biology He has published 2 best-selling books on data analysis of high-throughput gen-omics data 8 book chapters and over 190 peer-reviewed journal and conference papersSubmitted 1 September 2016 Received (in revised form) 14 January 2017
VC The Author 2017 Published by Oxford University Press All rights reserved For Permissions please email journalspermissionsoupcom
1
Briefings in Bioinformatics 2017 1ndash17
doi 101093bibbbx013Paper
especially in plants and stem cells [5 6] Recent studies havealso shown that the Ten-Eleven translocation (TET) proteins areinvolved in oxidizing 5-mc into 5-hydroxymethylcytosine (5-hmC) 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC)However the abundance level of these methylation variants (5-hmC 5-fC 5-caC) is low compared with that of 5-mc [7]Therefore our survey focuses on 5-mc methylation in CpG con-text considering most of the methods have been developed foranalyzing this type of epigenetic modification When a CpG siteis methylated in the promoter regions it typically represses thetranscriptional activity of that region by restricting the bindingof specific transcription factors (TFs) Alternatively when a CpGsite is unmethylated in promoter regions it allows for the bind-ing of those TFs [8ndash10] Given its regulatory role in cellular activ-ities identifying changes in DNA methylation across multiplebiological conditions is of great interest
The availability of the reference genome and the advancedsequencing technologies have led to methods that providehigh-resolution methylation profiles on a genome scale Basedon the resolution at which the methylation levels are measuredcurrent sequencing-based technologies can be divided in twocategories (i) enrichment-based approaches and (ii) bisulfitesequencing-based approaches [11 12] The former allows us tomeasure the methylation levels at 100ndash200 base resolutionwhile the latter allows us to measure the methylation levels atsingle-base resolution One of the challenges in measuringgenome-level methylation is the amount of biological materialneeded which has only recently reached levels feasible for clin-ical samples [13] Other challenges are related to processingdata from new technologies and integrating them with differenttypes of data in a meaningful way to provide biological insights(eg methylation and gene expression) In this review we focuson bisulfite sequencing-based approaches
Within the past few years many tools have been developedfor differential methylation (DM) analysis using bisulfitesequencing data (Figure 1) but only a few attempts have beenmade to provide a review of these approaches Robinson et al[14] provides a mini review of the approaches that identify DMbriefly discussing their methodologies and current challengesThis review not only includes the approaches that use bisulfitesequencing data but also the approaches that use DNA methy-lation arrays (Illuminarsquos 27k or 450k) and enrichment assays(MeDIP-seq) Yet the number of approaches based on bisulfitesequencing data and the number of features considered foreach approach are low Klein et al [15] evaluates nineapproaches that can possibly be used for DM analysis Howeverthe methods are limited to the scope of analyzing DM in prede-fined regions using only reduced representation bisulfitesequencing (RRBS) data Among the nine approaches only fourof them are originally designed for analyzing DM The other fiveapproaches are general approaches that can be applied for RNA-
Seq and gene expression data Yu and Sun [16] evaluate onlyfive approaches developed for the purpose of identifying differ-entially methylated regions (DMRs) Sun et al [17] briefly sum-marize the commonly used platforms for methylation profilingdata preprocessing techniques and statistical approaches forDM analysis This review provides a well-organized conceptualoverview of approaches that identify DM using bisulfitesequencing data However this survey only includes seven suchapproaches In summary all previous attempts of reviewing theapproaches that identify DM using bisulfite sequencing data arelimited in at least one of the following aspects (i) the total num-ber of approaches covered in the survey (fewer than 10 methodsreviewed) (ii) the applicability (eg only methods dealing withRRBS data) or (iii) a small number of biological features con-sidered To address these issues a comprehensive survey of theapproaches that identify DM using bisulfite sequencing data isgreatly needed
In this article we review 22 different approaches for DM ana-lysis including approaches for identifying differentially methy-lated cytosines (DMCs) DMRs (both predefined and de novoregions) and methylation patterns using bisulfite sequencingdata (whole genome bisulfite sequencing [WGBS] and RRBS) Weclassify these approaches into seven different categories basedon the primary concepts and key techniques used to identifyDM In addition we provide a short overview of several generalhypothesis-based tests which can also be applied for DM ana-lysis In the following sections first we will provide a brief over-view of bisulfite sequencing technology and the workflow ofanalyzing bisulfite sequencing data Next we will provide a sys-tematic review of the approaches highlighting their pros andcons discussing their key characteristics
Bisulfite sequencing
The gold standard for measuring cytosine methylation is bisul-fite sequencing which has the advantage of measuring methy-lation at single-base resolution In this technique DNA istreated with sodium bisulfite which deaminates unmethylatedcytosines (C) to uracils (U) leaving the methylated cytosines un-changed Uracils are read as thymines (T) during the sequencingstep Methylation level at each CpG site is estimated by simplycounting the ratio of C(CthornT) Thus this process allows se-quence-specific discrimination between methylated and unme-thylated CpG sites [18]
Several technologies have been developed for measuringDNA methylation based on bisulfite sequencing conversionThe most comprehensive protocol among them is WGBS whichprovides genome-wide DNA profiling However the applicationof this protocol on the whole genome is expensive when itcomes to studying organisms with large genomes More cost-effective protocols such as RRBS and enhanced RRBS have
Figure 1 Timeline of the approaches that identify DM using bisulfite sequencing data
2 | Shafi et al
allowed for methylation analysis with reduced sequencing re-quirements through a more targeted approach for CpG-rich gen-omic regions that meet specific length requirements [19] Thesetechniques therefore are more affordable for studies with mul-tiple replicates
The overall workflow for bisulfite sequencing data analysisis displayed in Figure 2 The overall pipeline consists of sixmajor elements (i) the input including methylation data(in FASTAFASTQ format) and the reference genome (ii) data
processing and quality control (iii) alignment of short reads tothe reference genome (iv) post-alignment analysis (v) DM ana-lysis and (vi) the output including DMCs DMRs and methylationpatterns The details of each element will be described in thefollowing sections
Pre-analysisData preprocessing
Bisulfite sequencing data consist of short read sequences in theFASTAFASTQ file format Data processing starts with perform-ing quality control operations on the raw sequencing readsincluding quality trimming and adapter trimming Quality trim-ming reduces methylation call errors by trimming the bases
that have poor quality scores whereas adapter trimming re-moves the known adapters from short reads to increase map-ping efficiency Existing tools for quality control include FASTX-Toolkit [20] PRINSEQ [21] SolexaQA [22] Cutadapt [23]Trimmomatic [24] and Trim Galore [25] Both the input and out-put of these tools are files in the FASTAFASTQ format
Read mapping
After quality control bisulfite sequencing reads can be alignedto the reference genome to estimate the methylation levelsSimply aligning these reads by using standard aligners resultsin poor mapping efficiency because the bisulfite treatmentintroduces additional discrepancies between the sequencingreads and the reference genome by converting the unmethy-lated cytosines to thymines Therefore new strategies were pro-posed for bisulfite sequencing read alignment Existing bisulfitesequencing alignment approaches can be divided in two catego-ries three-letter aligners and wildcard aligners Three-letteraligners such as Bismark [26] BS Seeker [27] MethylCoder [28]BRAT [29] and GNUMAP-bs [30] convert all Cs into Ts in the for-ward strand and all Gs into As in the reverse strand of the refer-ence genome Equivalently converted reads are then aligned tothese pre-converted forms of the reference genomes using
Figure 2 The workflow of analyzing DNA methylation using bisulfite sequencing data
Identifying differential methylation | 3
standard genome aligners such as Bowtie [31] and Bowtie2 [32]In contrast wildcard aligners such as BSMAP [33] RRBSMAP[34] GSNAP [35] and RMAP [36] replace the Cs of the referencegenome with the wildcard letter Y that matches both Cs and Tsin the sequencing reads The alignment results are usuallystored in SAMBAM file format
Post-alignment analysis
After mapping the reads an optional post-alignment step canbe performed to extract meaningful biological information fromthe alignment results before DM analysis Several post-alignment analysis tools have been developed including BiQAnalyzer [37] QUMA [38] BRAT [29] MethyQA [39] BSPAT [40]and MethGo [41] Most of these tools provide summary statis-tics quality assessment and visualization of the methylationdata Some of these tools include extra features such as readmapping (eg BSPAT and BRAT) identifying DNA methylationco-occurrence pattern (eg BSPAT) single nucleotide poly-morphisms and copy number variation calling (eg MethGo)and detecting allele-specific methylation patterns (eg BSPAT)
DM analysis
After obtaining the methylation information of the CpG sitestypically the next downstream analysis is to perform DM ana-lysis which is usually done in the form of identifying DMCs orDMRs Identification of DMCs involves comparing the methyla-tion level at each CpG site across the phenotypes (two or more)and applying statistical tests for hypothesis testingIdentification of DMRs is usually a two-step process (i) the iden-tification of DMCs and (ii) grouping the neighboring DMCs ascontiguous DMRs by certain distance criteria However someapproaches can directly identify DMRs DMCsDMRs occasion-ally can be linked to transcriptional repression of the associatedgenes therefore they provide crucial biological insights thatmay lead to the development of potential drug candidates [1]
To identify putative potential DMCsDMRs from bisulfitesequencing data some characteristics need to be consideredOne such characteristic is the lsquospatial correlationrsquo between themethylation levels of the neighboring CpG sites which plays animportant role in getting an accurate estimation of the methyla-tion levels [3 42] Incorporating spatial correlation in DM ana-lysis can reduce the required sequencing depth and canestimate the methylation status of the missing CpG sites [43]lsquoSequencing depthrsquo is another important characteristic that isdirectly related to the certainty of the methylation scores ofCpG sites Considering sequencing depth while identifyingDMRs is crucial because it can take into account the samplingvariability that occurs during sequencing Another such charac-teristic is lsquobiological variationrsquo among replicates which is cru-cial in identifying the regions that consistently differ betweengroups of samples [44 45] Ignoring biological variation whiledetecting DMRs might lead to a high number of false positivesin the results [14 43 46] This is due to the fact that the methy-lation levels of the CpG sites are heterogeneous not only whenthe cell types are different but also when the cells are of thesame type [47ndash50]
Classical hypothesis testing methods such as Fisherrsquos exacttest (FET) chi-square (v2) test regression approaches t-testmoderated t-test Goemanrsquos global test and analysis of variance(ANOVA) can be used to identify DM using bisulfite sequencingdata [3 46 51 52 53] These approaches can be divided into two
categories based on the data type they use count-based hy-pothesis tests and ratio-based hypothesis tests
Count-based hypothesis tests
Input of these hypothesis testing methods are count valueswhich can be either the number of reads or the number of CpGsites in a predefined genomic region FET is a classical statisticaltest used to determine whether there are nonrandom associ-ations between two categorical variables In the context ofmethylation analysis we can use the data to build a contin-gency table where the two rows represent the two methylationstates and the two columns represent a pair of samples Whenapplying FET for two groups of samples the counts for a methy-lation status within each group are aggregated into a singlenumber [54] Chi-square test is another classical method to testthe relationship between two categorical variables (methylatedversus unmethylated) In contrast with FET it allows for testingacross multiple samples As pointed out by Sun et al [17] andHurlbert et al [55] there are several issues related to the aggre-gation of read counts into a single number while applying testsof independence (FET and v2 test) First the read counts are notindependent they represent different sets of interdependent orcorrelated observations Thus aggregating the counts violatesthe fundamental assumption underlying the test for independ-ence Second due to uneven coverage of each individual sitethe results are biased toward the samples with higher coverageThird by aggregating (summing) the counts some of the biolo-gical variations (eg sample size intra-group variance) is nottaken into account by the hypothesis testing Therefore usingFET and v2 test to compare two groups of samples could lead toa high number of false positives [14 43 46]
Regression approaches (eg Poisson quasi-Poisson negativebinomial regression) are primarily used for detecting differen-tially expressed genes using RNA-Seq data but they can also beapplied in the context of DM analysis [15] For example the readcounts can be modeled using a Poisson distribution and a modi-fied Wald test can be used to detect DM as the difference be-tween two Poisson means [56 57]
Ratio-based hypothesis testsThese hypothesis tests use methylation percentage (methyla-tion ratio) instead of count values For a particular CpG sitemethylation percentage is calculated by taking the ratio be-tween the methylated read counts and the total read counts ofthat site To compare the methylation difference level betweentwo groups (phenotypes) of samples classical tests such ast-test [58 59] moderated t-test (limma) [60] or Goemanrsquos globaltest [61] can be used While t-test is a classical approach to com-pare the means limma and Goemanrsquos test are empiricalBayesian approaches that were primarily designed to detect dif-ferentially expressed genes using microarray data When ana-lyzing methylation levels across multiple groups of samplesANOVA [62] can be used instead of multiple pair-wise compari-sons Compared with count-based hypothesis tests the ratio-based tests take into account the biological variation acrossmultiple replicates However because they only take into ac-count the ratio of the reads (methylated reads versus all reads)they ignore the sequencing depth within the CpG sites
Although classical hypothesis testing methods are some-what useful straightforward and easy to use they are not effi-cient in more sophisticated methylation analysis such asidentifying de novo regions considering spatial correlationamong the methylation levels of the CpG sites and estimating
4 | Shafi et al
methylation levels of missing CpG sites Over the past fewyears several approaches have been developed to address thesechallenges which are discussed and summarized in the follow-ing subsections
Logistic regression-based approaches
Approaches in this category model the read counts of the CpGsites by using logistic regression to identify DM One of thepopular approaches in this category is lsquomethylKitrsquo [54] whichuses logistic regression to model the methylation proportion ata given base or region when biological replicates are availableIn the absence of biological replicates methylKit uses FET toidentify DM P-values are corrected using the false discoveryrate (FDR) approach or the sliding linear model approach [63]MethylKit is commonly used to identify DMCs from predefinedregions (RRBS data) However it can also be used to identifyDMRs from WGBS data based on user-defined tiling windowsMajor contribution of methylKit is that it can take into accountthe sequencing coverage It can incorporate additional covari-ates into the model and work with CHG or CHH methylation Italso provides functionalities such as sample-wise methylationsummary sample clustering annotation and visualization ofDM etc
Another method named lsquoeDMRrsquo [64] was proposed as an ex-tension of methylKit eDMR models the distances between theneighboring CpG sites using a bimodal normal distribution andestimates DMR boundaries using a weighted cost function Afterestimating the regional boundaries DMRs are filtered based onthe mean methylation difference the number of DMCs and thenumber of CpG sites Significance of the DMRs are calculated bycombining the P-values of the DMCs using Stouffer-Liptakmethod [65] The P-values for DMRs are then corrected for mul-tiple comparisons using the FDR method eDMR provides a listof DMRs and their annotation as output
Approaches in this category take sequencing coverage intoaccount They can incorporate additional covariates into themodel as well However they do not consider the biologicalvariation among the replicates Although eDMR estimates thesignificance of the identified regions based on spatial auto cor-relation it does not consider the spatial correlation among theCpG sites when estimating the methylation levels
Smoothing-based approaches
Approaches in this category assume that methylation levels ofthe CpG sites vary smoothly across the genome They performlsquosmoothingrsquo across the samples or predefined regions which isa technique to estimate the methylation levels of the CpG sitesby borrowing information from their neighbors Group differ-ences across different conditions are computed based on theestimated methylation values of the CpG sites Finally differentstatistical tests are used to identify the differentially methy-lated sites or regions
One of the most commonly used smoothing-basedapproaches is lsquoBSmoothrsquo [43] which relies on smoothing acrossthe genome within each sample It looks for group differencesvia CpG-wise t-tests to identify DMRs between two groups TheBSmooth algorithm begins with aligning the sequencing readsto the reference genome Two alternative pipelines are availablefor the users to align the reads The first pipeline which sup-ports gaped alignment and the alignment of the paired-endbisulfite-treated reads is based on in silico bisulfite conversionthat uses the lsquoBowtie-2rsquo aligner to align the reads [32] The
second pipeline is based on a newly developed aligner namedlsquoMermanrsquo which supports the alignment of the colorspacebisulfite reads After aligning the reads sample-specific qualityassessment metrics are compiled Local likelihood smoothing isapplied within a smoothing window across the samples to esti-mate the methylation levels of the CpG sites A signal-to-noisestatistic similar to t-test is used to identify the DMCs FinallyDMRs are defined by merging the consecutive DMCs based onsome defined criteria such as a cutoff value of the t-statisticmaximum distance between the CpG sites and minimum num-ber of CpG sites
BSmooth was the first approach primarily developed forDMR identification that takes into account the biological vari-ation among replicates It reduces the required sequencingcoverage by applying the local likelihood smoothing approachacross the samples It can also identify de novo regions fromWGBS data sets On the other hand BSmooth lacks suitableerror measurement criteria within the identified DMRs As a re-sult there is no way to check whether the identified CpG sitesinside the predicted DMRs are true DMCs or selected errone-ously BSmooth predicts methylation values of the CpG sitesbased on the last observed slope Hence for the genomic re-gions that are not covered by the reads previously observedmethylation level will continue resulting in a biased estimationof the methylation level (ie extrapolated methylation values of0 and 1) [66] BSmooth is not applicable to those data sets thatdo not have biological replicates In addition BSmooth is lim-ited to comparisons between two groups of conditions
Another approach in this category lsquoBiSeqrsquo performs thesmoothing of methylation data across defined candidate re-gions instead of across the samples (like BSmooth) [66] Thepipeline begins with defining CpG clusters within the genomebased on a minimum number of lsquofrequently covered CpG sitesrsquo(CpG sites that are covered by the majority of samples) and aproximity distance threshold defined by the user A smoothingfunction is modeled for each defined cluster While modelingthe smoothing function the coverage information for each CpGsite is taken into account to make sure that the CpG site withhigh coverage has a greater impact on the estimated methyla-tion level than the CpG site with low coverage Group effects ofthe CpG sites are modeled using beta regression with probit linkfunction DMCs are identified using Wald test procedure Nexta hierarchical testing procedure is applied to identify significantclusters containing at least one DMC While testing the targetregions weighted FDR is applied to take into account the size ofindividual clusters [67] A location-wise FDR approach is appliedto trim the CpG sites that are not differentially methylatedwithin the selected significant clusters
One of the major contributions of BiSeq approach is that itprovides region-wise error control measurement to test the tar-get regions This approach is also capable of adding additionalcovariates to the regression model In contrast one of the limi-tations of the BiSeq approach is that it is only suitable for ana-lyzing experiments that have predefined regions such as RRBSdata sets
In general smoothing-based approaches have the advantageof considering the spatial correlation between the methylationlevels of the CpG sites By performing smoothing the requiredsequencing coverage and the variance of the methylation levelscan be reduced [43] Furthermore they can estimate the methy-lation levels of missing CpG sites On the other hand smooth-ing-based approaches cannot detect the low CpG densityregions where methylation has sharp changes such as tran-scription factor binding sites (TFBS) TFBS are usually small
Identifying differential methylation | 5
(ielt50 bp) which might consist of a single CpG that is differen-tially methylated [68] Thus biological events involving a singleCpG site might not be detected by the smoothing approaches Inaddition these approaches are not appropriate for biologicalsystems whose true methylation levels of the CpG sites are notspatially correlated
Beta-binomial-based approaches
Approaches in this category characterize the methylation readcounts as a beta-binomial distribution In the absence of anybiological or technical variation methylation proportion of aparticular CpG site follows a binomial distribution becausesequencing reads over a CpG site can be either methylated orunmethylated Whenever biological and technical variation arepresent in the data methylation proportions of the CpG sitesare assumed to follow a beta distribution Therefore in the pres-ence of biological replicates an appropriate statistical model formethylation analysis is the beta-binomial model as it can takeinto account both sampling and biological variability
Over the past few years several beta-binomial-basedapproaches have been developed to identify DM such as DSS[69] MOABS [70] RADMeth [71] methylSig [72] DSS-single [73]MACAU [74] DSS-general [75] and GetisDMR [76] Theseapproaches differ from each other in the way they estimate re-gression parameters calculate P-values estimate DMR bounda-ries etc
lsquoDSSrsquo is one of the approaches in this category that relies ona beta-binomial hierarchical model to identify DM using bisul-fite sequencing data In this model the prior distribution is con-structed from the whole genome which is either methylated orunmethylated True methylation proportions of the CpG sitesamong the replicates are then modeled using the beta distribu-tion parameterized by group mean and a dispersion parameterThe biological variability is captured by the beta distributionwhereas the sampling variability is captured by the binomialdistribution Variation across the methylation proportion of theCpG sites relative to the group mean is captured by the disper-sion parameter which is estimated by an empirical Bayes ap-proach When the sample size is small a shrinkage approach isused to estimate the dispersion parameter to improve the over-all performance Differentially methylated CpG sites are deter-mined by using P-values from the Wald test which isperformed by comparing the mean methylation levels betweentwo groups Lastly candidate DMRs are defined by applyinguser-specified thresholds on DMR characteristics among whichare P-value minimum length and minimum number of CpGsites
The key contribution of the DSS approach is the shrinkageprocedure that improves the dispersion parameter estimationFor this reason this approach is particularly useful when thesample size is small By applying the Wald test procedure thisapproach takes into consideration the biological variation andsequencing coverage
A more recent method named lsquoDSS-singlersquo is an improvedversion of the DSS approach which can take into account thespatial correlation among the CpG sites across the genome Inaddition DSS-single considers the within-group variation with-out biological replicates by using the neighboring CpG sites aslsquopseudo-replicatesrsquo Similar to DSS DSS-single captures thetechnical variability using binomial distribution and the biolo-gical variability using beta distribution The beta distribution isparameterized with the group mean and dispersion parameterDSS-single estimates the group mean using a smoothing
function and the dispersion parameter using an empirical Bayesprocedure Hypothesis testing is performed using the Wald testto identify the DMCs Later user-defined thresholds are appliedto define the DMR boundaries and select candidate DMRs
An even more recent variation of DSS approach namedlsquoDSS-generalrsquo identifies differentially methylated loci (DML)from bisulfite sequencing data under general experiment de-sign DSS-general identifies DML by modeling the methylationcount data for each locus using the beta-binomial regressionwith the lsquoarcsinersquo link function The lsquoarcsinersquo link function isapplied to perform a data transformation that decreases the de-pendency of the data variance on the mean and prepares it forthe next step Due to this data transformation the regressioncoefficient and the variance matrix can be estimated by apply-ing the generalized least square method as opposed to thebeta-binomial generalized linear model or logistic regressionwhich are limited when values are separable (eg values forunmethylated sites are close to 0 values for methylated sitesare close to 1) Finally Wald test is used to perform hypothesistesting
The key advantage of DSS-general approach is that it is ap-plicable to bisulfite sequencing data with multiple groups orcovariates In addition it uses lsquoarcsinersquo link function which ismore efficient than other widely used lsquologitrsquo and lsquoprobitrsquo func-tions because it estimates the regression parameters in oneiteration
lsquoMOABSrsquo is another approach that relies on beta-binomialassumption to identify DM Similar to DSS the prior distributionis constructed from the whole genome resulting in a bimodaldistribution The posterior distribution follows a beta distribu-tion which is estimated using an empirical Bayes approachWhen biological replicates are available the posterior distribu-tion is generated using the maximum likelihood approach Thesignificance of the DM between two samples is represented by asingle metric named lsquocredible methylation differencersquo whichincorporates both the biological and statistical significance ofthe DM MOABS can also work with CHG or CHH methylation
lsquoRADMethrsquo is another analysis pipeline that relies on thebeta-binomial assumption RADMeth uses a beta-binomial re-gression approach using lsquologitrsquo link function to model themethylation levels of the CpG sites across the samplesRegression parameters are estimated using a standard max-imum likelihood approach In the beta-binomial regressionmodel RADMeth incorporates the experimental factors using amodel matrix The DM of a particular site is determined by com-paring two fitted regression models (ie reduced model withoutfactors and full model with factors) using the log-likelihoodratio Subsequently P-values of the neighboring CpG sites arecombined using the weighted Z-test (ie Stouffer-Liptak test[77]) to obtain the DMRs The key contribution of this approachis the ability to analyze WGBS data in multiple factorexperiments
lsquoMethylSigrsquo is another analysis pipeline that uses beta-binomial model across the samples to identify either DMCs orDMRs The pipeline begins with taking the number of Cs and Tsas input The approach uses the beta-binomial model to esti-mate the methylation levels at each CpG site or region whichinvolves the two following steps (i) estimate the dispersion par-ameter for each CpG site or region which accounts for biologicalvariation among the samples within a group and (ii) calculatethe group methylation level at each CpG site or region using theestimated dispersion parameters In each step local informa-tion can be incorporated from nearby CpG sites or regions to in-crease statistical power The significance level of the
6 | Shafi et al
methylation difference is calculated using the likelihood ratiotest Similar to DSS MethylSig is useful when the sample size issmall MethylSig uses local information and a maximum likeli-hood estimator to compute both the methylation level and thevariance
lsquoMACAUrsquo is based on binomial mixed model (BMM) thattakes into account the population structures from a data setThis model is a generalized beta-binomial model consisting ofan extra term to model the population structure In the absenceof that extra term this model can be reduced to a beta-binomialmodel In this approach the prior distribution is constructedfrom a BMM whereas the posterior distribution is constructedfrom a log-normal distribution Model parameters are estimatedby using a Markov chain Monte Carlo (MCMC) algorithm-basedapproach Hypothesis testing is performed by using Wald testFinally DMRs are constructed by merging the DMCs using em-pirical thresholds
One advantage of this approach is that it can add a predictorvariable of interest in the model to check the association withany genetic background In addition to considering biologicalvariability among the replicates and the sampling variabilityamong the sequencing reads this method also takes into con-sideration the population variability Furthermore it can beapplied to both WGBS and RRBS data sets
lsquoGetisDMRrsquo a recent beta-binomial-based approach identi-fies variable-size DMRs directly from WGBS data by using a localGetis-Ord statistic which is commonly used to identify statistic-ally significant spatial clusters (hotspots) By incorporating thisstatistic into DM analysis GetisDMR accounts for spatial correl-ation among the methylation levels of the CpG sites along withthe biological and sampling variability When biological repli-cates are available beta-binomial regression with logistic linkfunction is used to model the methylation level of each CpGsite Model parameters are estimated by using the maximumlikelihood function Hypothesis testing is performed by usingthe likelihood ratio test In the absence of biological replicatesmethylation levels are modeled by using binomial distributionand hypothesis testing is performed by using FET P-valuesfrom the hypothesis testing are further used to calculatez-scores Finally a local Getis-Ord statistic is used based on thez-scores to identify DMRs using the information from the neigh-boring CpG sites The Getis-Ord statistic uses the distribution ofthe data (ie z-scores) to compute a score of the nonrandom as-sociation between a data point and its neighbors where a posi-tive score shows a positive association and a negative scoreshows a negative association This statistic is then used to iden-tify data regions with points that exhibit nonrandom associ-ations (ie DMRs)
One of the primary strengths of GetisDMR is that it can de-tect DMRs with variable length instead of depending on user-specified threshold parameters It can take into account thespatial correlation between the neighboring CpG sitesAdditionally it can incorporate additional confounding factorsinto the model Furthermore it can work with multiple groupswith or without biological replicates One drawback of this ap-proach is that it cannot work with enriched regions such asRRBS data
Beta-binomial-based approaches are useful because theytake into account both sampling variability among the readcounts and biological variability among the replicatesFurthermore these approaches are able to identify DM at sin-gle-base resolution from low CpG-density regions (eg TFBS)On the other hand most of the beta-binomial-based approaches(except DSS-single MACAU and GetisDMR) do not take into
account the spatial correlation between the methylation levelsof the CpG sites
Hidden Markov model-based approaches
Approaches in this category use hidden Markov model (HMM) toidentify differentially methylated patterns from bisulfitesequencing data These approaches model the methylation lev-els of the CpG sites as methylation states (ie hypermethyla-tion hypomethylation and no change) instead of continuousmethylation values Transition probabilities among the methy-lation states represent the distance distribution among theDMCs whereas emission probabilities represent the likelihoodof DM for the CpG sites High transition probabilities and lowtransition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within theirmethylation levels respectively Parameters are estimated usu-ally by using established learning algorithms whereas potentialDMRs are identified using different statistical approaches
One of the approaches in this category named lsquoComMetrsquo [64]included in the Bisulfighter methylation analysis suite [78 79]combines all the samples within a group into one sample andidentifies the DMRs by comparing a pair of two samples Thismethod captures the probability distribution of distances be-tween the neighboring DMCs and adjusts the DMC chaining cri-teria automatically for each data set Transition probabilitiesare estimated using an expectation maximization algorithmwhereas emission probabilities are estimated from a beta-binomial mixture model Parameters of the beta-binomialmodel are estimated by incorporating an unsupervised learningalgorithm DMRs are identified by using a dynamic program-ming algorithm
One of the advantages of ComMet is that it does not requirebiological replicates to identify DMRs It takes into account thesequencing coverage and the spatial distribution of the neigh-boring CpG sites On the other hand one of the limitations ofthis approach is that it does not take into account the biologicalvariation across replicates which might lead to higher numberof false positives in the results [14 43 46]
Another approach in this category is lsquoHMM-Fisherrsquo [80]which estimates the methylation status of the CpG sites foreach sample instead of combining all the samples Similar toComMet HMM-Fisher models both the similarity and dissimi-larity of the methylation levels of the neighboring CpG sitesusing transition probability HMM-Fisher estimates the transi-tion probabilities using a Dirichlet distribution whereas emis-sion probabilities are computed using a truncated normaldistribution After estimating the methylation levels of all theCpG sites for each sample differentially methylated CpG sitesare identified using FET Identified DMCs are further groupedinto DMRs if the distance between the CpG sites is lt100 basesNon-consecutive CpG sites are reported as DMCs in the output
One of the major contributions of HMM-Fisher is that it canidentify DMRs of variable size instead of depending on user-defined boundary thresholds It takes the biological variationamong the replicates into account and can provide both DMCsand DMRs as output It can also be used to identify sample-wisemethylation patterns
lsquoHMM-DMrsquo [81] is another approach that uses HMM to iden-tify DM HMM-DM directly estimates the DM states of the CpGsites for each sample across the groups In this approach thetransition probability of each CpG site only depends on themethylation state of the immediate previous CpG site LikeHMM-Fisher and ComMet the transition probabilities are
Identifying differential methylation | 7
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
especially in plants and stem cells [5 6] Recent studies havealso shown that the Ten-Eleven translocation (TET) proteins areinvolved in oxidizing 5-mc into 5-hydroxymethylcytosine (5-hmC) 5-formylcytosine (5-fC) and 5-carboxylcytosine (5-caC)However the abundance level of these methylation variants (5-hmC 5-fC 5-caC) is low compared with that of 5-mc [7]Therefore our survey focuses on 5-mc methylation in CpG con-text considering most of the methods have been developed foranalyzing this type of epigenetic modification When a CpG siteis methylated in the promoter regions it typically represses thetranscriptional activity of that region by restricting the bindingof specific transcription factors (TFs) Alternatively when a CpGsite is unmethylated in promoter regions it allows for the bind-ing of those TFs [8ndash10] Given its regulatory role in cellular activ-ities identifying changes in DNA methylation across multiplebiological conditions is of great interest
The availability of the reference genome and the advancedsequencing technologies have led to methods that providehigh-resolution methylation profiles on a genome scale Basedon the resolution at which the methylation levels are measuredcurrent sequencing-based technologies can be divided in twocategories (i) enrichment-based approaches and (ii) bisulfitesequencing-based approaches [11 12] The former allows us tomeasure the methylation levels at 100ndash200 base resolutionwhile the latter allows us to measure the methylation levels atsingle-base resolution One of the challenges in measuringgenome-level methylation is the amount of biological materialneeded which has only recently reached levels feasible for clin-ical samples [13] Other challenges are related to processingdata from new technologies and integrating them with differenttypes of data in a meaningful way to provide biological insights(eg methylation and gene expression) In this review we focuson bisulfite sequencing-based approaches
Within the past few years many tools have been developedfor differential methylation (DM) analysis using bisulfitesequencing data (Figure 1) but only a few attempts have beenmade to provide a review of these approaches Robinson et al[14] provides a mini review of the approaches that identify DMbriefly discussing their methodologies and current challengesThis review not only includes the approaches that use bisulfitesequencing data but also the approaches that use DNA methy-lation arrays (Illuminarsquos 27k or 450k) and enrichment assays(MeDIP-seq) Yet the number of approaches based on bisulfitesequencing data and the number of features considered foreach approach are low Klein et al [15] evaluates nineapproaches that can possibly be used for DM analysis Howeverthe methods are limited to the scope of analyzing DM in prede-fined regions using only reduced representation bisulfitesequencing (RRBS) data Among the nine approaches only fourof them are originally designed for analyzing DM The other fiveapproaches are general approaches that can be applied for RNA-
Seq and gene expression data Yu and Sun [16] evaluate onlyfive approaches developed for the purpose of identifying differ-entially methylated regions (DMRs) Sun et al [17] briefly sum-marize the commonly used platforms for methylation profilingdata preprocessing techniques and statistical approaches forDM analysis This review provides a well-organized conceptualoverview of approaches that identify DM using bisulfitesequencing data However this survey only includes seven suchapproaches In summary all previous attempts of reviewing theapproaches that identify DM using bisulfite sequencing data arelimited in at least one of the following aspects (i) the total num-ber of approaches covered in the survey (fewer than 10 methodsreviewed) (ii) the applicability (eg only methods dealing withRRBS data) or (iii) a small number of biological features con-sidered To address these issues a comprehensive survey of theapproaches that identify DM using bisulfite sequencing data isgreatly needed
In this article we review 22 different approaches for DM ana-lysis including approaches for identifying differentially methy-lated cytosines (DMCs) DMRs (both predefined and de novoregions) and methylation patterns using bisulfite sequencingdata (whole genome bisulfite sequencing [WGBS] and RRBS) Weclassify these approaches into seven different categories basedon the primary concepts and key techniques used to identifyDM In addition we provide a short overview of several generalhypothesis-based tests which can also be applied for DM ana-lysis In the following sections first we will provide a brief over-view of bisulfite sequencing technology and the workflow ofanalyzing bisulfite sequencing data Next we will provide a sys-tematic review of the approaches highlighting their pros andcons discussing their key characteristics
Bisulfite sequencing
The gold standard for measuring cytosine methylation is bisul-fite sequencing which has the advantage of measuring methy-lation at single-base resolution In this technique DNA istreated with sodium bisulfite which deaminates unmethylatedcytosines (C) to uracils (U) leaving the methylated cytosines un-changed Uracils are read as thymines (T) during the sequencingstep Methylation level at each CpG site is estimated by simplycounting the ratio of C(CthornT) Thus this process allows se-quence-specific discrimination between methylated and unme-thylated CpG sites [18]
Several technologies have been developed for measuringDNA methylation based on bisulfite sequencing conversionThe most comprehensive protocol among them is WGBS whichprovides genome-wide DNA profiling However the applicationof this protocol on the whole genome is expensive when itcomes to studying organisms with large genomes More cost-effective protocols such as RRBS and enhanced RRBS have
Figure 1 Timeline of the approaches that identify DM using bisulfite sequencing data
2 | Shafi et al
allowed for methylation analysis with reduced sequencing re-quirements through a more targeted approach for CpG-rich gen-omic regions that meet specific length requirements [19] Thesetechniques therefore are more affordable for studies with mul-tiple replicates
The overall workflow for bisulfite sequencing data analysisis displayed in Figure 2 The overall pipeline consists of sixmajor elements (i) the input including methylation data(in FASTAFASTQ format) and the reference genome (ii) data
processing and quality control (iii) alignment of short reads tothe reference genome (iv) post-alignment analysis (v) DM ana-lysis and (vi) the output including DMCs DMRs and methylationpatterns The details of each element will be described in thefollowing sections
Pre-analysisData preprocessing
Bisulfite sequencing data consist of short read sequences in theFASTAFASTQ file format Data processing starts with perform-ing quality control operations on the raw sequencing readsincluding quality trimming and adapter trimming Quality trim-ming reduces methylation call errors by trimming the bases
that have poor quality scores whereas adapter trimming re-moves the known adapters from short reads to increase map-ping efficiency Existing tools for quality control include FASTX-Toolkit [20] PRINSEQ [21] SolexaQA [22] Cutadapt [23]Trimmomatic [24] and Trim Galore [25] Both the input and out-put of these tools are files in the FASTAFASTQ format
Read mapping
After quality control bisulfite sequencing reads can be alignedto the reference genome to estimate the methylation levelsSimply aligning these reads by using standard aligners resultsin poor mapping efficiency because the bisulfite treatmentintroduces additional discrepancies between the sequencingreads and the reference genome by converting the unmethy-lated cytosines to thymines Therefore new strategies were pro-posed for bisulfite sequencing read alignment Existing bisulfitesequencing alignment approaches can be divided in two catego-ries three-letter aligners and wildcard aligners Three-letteraligners such as Bismark [26] BS Seeker [27] MethylCoder [28]BRAT [29] and GNUMAP-bs [30] convert all Cs into Ts in the for-ward strand and all Gs into As in the reverse strand of the refer-ence genome Equivalently converted reads are then aligned tothese pre-converted forms of the reference genomes using
Figure 2 The workflow of analyzing DNA methylation using bisulfite sequencing data
Identifying differential methylation | 3
standard genome aligners such as Bowtie [31] and Bowtie2 [32]In contrast wildcard aligners such as BSMAP [33] RRBSMAP[34] GSNAP [35] and RMAP [36] replace the Cs of the referencegenome with the wildcard letter Y that matches both Cs and Tsin the sequencing reads The alignment results are usuallystored in SAMBAM file format
Post-alignment analysis
After mapping the reads an optional post-alignment step canbe performed to extract meaningful biological information fromthe alignment results before DM analysis Several post-alignment analysis tools have been developed including BiQAnalyzer [37] QUMA [38] BRAT [29] MethyQA [39] BSPAT [40]and MethGo [41] Most of these tools provide summary statis-tics quality assessment and visualization of the methylationdata Some of these tools include extra features such as readmapping (eg BSPAT and BRAT) identifying DNA methylationco-occurrence pattern (eg BSPAT) single nucleotide poly-morphisms and copy number variation calling (eg MethGo)and detecting allele-specific methylation patterns (eg BSPAT)
DM analysis
After obtaining the methylation information of the CpG sitestypically the next downstream analysis is to perform DM ana-lysis which is usually done in the form of identifying DMCs orDMRs Identification of DMCs involves comparing the methyla-tion level at each CpG site across the phenotypes (two or more)and applying statistical tests for hypothesis testingIdentification of DMRs is usually a two-step process (i) the iden-tification of DMCs and (ii) grouping the neighboring DMCs ascontiguous DMRs by certain distance criteria However someapproaches can directly identify DMRs DMCsDMRs occasion-ally can be linked to transcriptional repression of the associatedgenes therefore they provide crucial biological insights thatmay lead to the development of potential drug candidates [1]
To identify putative potential DMCsDMRs from bisulfitesequencing data some characteristics need to be consideredOne such characteristic is the lsquospatial correlationrsquo between themethylation levels of the neighboring CpG sites which plays animportant role in getting an accurate estimation of the methyla-tion levels [3 42] Incorporating spatial correlation in DM ana-lysis can reduce the required sequencing depth and canestimate the methylation status of the missing CpG sites [43]lsquoSequencing depthrsquo is another important characteristic that isdirectly related to the certainty of the methylation scores ofCpG sites Considering sequencing depth while identifyingDMRs is crucial because it can take into account the samplingvariability that occurs during sequencing Another such charac-teristic is lsquobiological variationrsquo among replicates which is cru-cial in identifying the regions that consistently differ betweengroups of samples [44 45] Ignoring biological variation whiledetecting DMRs might lead to a high number of false positivesin the results [14 43 46] This is due to the fact that the methy-lation levels of the CpG sites are heterogeneous not only whenthe cell types are different but also when the cells are of thesame type [47ndash50]
Classical hypothesis testing methods such as Fisherrsquos exacttest (FET) chi-square (v2) test regression approaches t-testmoderated t-test Goemanrsquos global test and analysis of variance(ANOVA) can be used to identify DM using bisulfite sequencingdata [3 46 51 52 53] These approaches can be divided into two
categories based on the data type they use count-based hy-pothesis tests and ratio-based hypothesis tests
Count-based hypothesis tests
Input of these hypothesis testing methods are count valueswhich can be either the number of reads or the number of CpGsites in a predefined genomic region FET is a classical statisticaltest used to determine whether there are nonrandom associ-ations between two categorical variables In the context ofmethylation analysis we can use the data to build a contin-gency table where the two rows represent the two methylationstates and the two columns represent a pair of samples Whenapplying FET for two groups of samples the counts for a methy-lation status within each group are aggregated into a singlenumber [54] Chi-square test is another classical method to testthe relationship between two categorical variables (methylatedversus unmethylated) In contrast with FET it allows for testingacross multiple samples As pointed out by Sun et al [17] andHurlbert et al [55] there are several issues related to the aggre-gation of read counts into a single number while applying testsof independence (FET and v2 test) First the read counts are notindependent they represent different sets of interdependent orcorrelated observations Thus aggregating the counts violatesthe fundamental assumption underlying the test for independ-ence Second due to uneven coverage of each individual sitethe results are biased toward the samples with higher coverageThird by aggregating (summing) the counts some of the biolo-gical variations (eg sample size intra-group variance) is nottaken into account by the hypothesis testing Therefore usingFET and v2 test to compare two groups of samples could lead toa high number of false positives [14 43 46]
Regression approaches (eg Poisson quasi-Poisson negativebinomial regression) are primarily used for detecting differen-tially expressed genes using RNA-Seq data but they can also beapplied in the context of DM analysis [15] For example the readcounts can be modeled using a Poisson distribution and a modi-fied Wald test can be used to detect DM as the difference be-tween two Poisson means [56 57]
Ratio-based hypothesis testsThese hypothesis tests use methylation percentage (methyla-tion ratio) instead of count values For a particular CpG sitemethylation percentage is calculated by taking the ratio be-tween the methylated read counts and the total read counts ofthat site To compare the methylation difference level betweentwo groups (phenotypes) of samples classical tests such ast-test [58 59] moderated t-test (limma) [60] or Goemanrsquos globaltest [61] can be used While t-test is a classical approach to com-pare the means limma and Goemanrsquos test are empiricalBayesian approaches that were primarily designed to detect dif-ferentially expressed genes using microarray data When ana-lyzing methylation levels across multiple groups of samplesANOVA [62] can be used instead of multiple pair-wise compari-sons Compared with count-based hypothesis tests the ratio-based tests take into account the biological variation acrossmultiple replicates However because they only take into ac-count the ratio of the reads (methylated reads versus all reads)they ignore the sequencing depth within the CpG sites
Although classical hypothesis testing methods are some-what useful straightforward and easy to use they are not effi-cient in more sophisticated methylation analysis such asidentifying de novo regions considering spatial correlationamong the methylation levels of the CpG sites and estimating
4 | Shafi et al
methylation levels of missing CpG sites Over the past fewyears several approaches have been developed to address thesechallenges which are discussed and summarized in the follow-ing subsections
Logistic regression-based approaches
Approaches in this category model the read counts of the CpGsites by using logistic regression to identify DM One of thepopular approaches in this category is lsquomethylKitrsquo [54] whichuses logistic regression to model the methylation proportion ata given base or region when biological replicates are availableIn the absence of biological replicates methylKit uses FET toidentify DM P-values are corrected using the false discoveryrate (FDR) approach or the sliding linear model approach [63]MethylKit is commonly used to identify DMCs from predefinedregions (RRBS data) However it can also be used to identifyDMRs from WGBS data based on user-defined tiling windowsMajor contribution of methylKit is that it can take into accountthe sequencing coverage It can incorporate additional covari-ates into the model and work with CHG or CHH methylation Italso provides functionalities such as sample-wise methylationsummary sample clustering annotation and visualization ofDM etc
Another method named lsquoeDMRrsquo [64] was proposed as an ex-tension of methylKit eDMR models the distances between theneighboring CpG sites using a bimodal normal distribution andestimates DMR boundaries using a weighted cost function Afterestimating the regional boundaries DMRs are filtered based onthe mean methylation difference the number of DMCs and thenumber of CpG sites Significance of the DMRs are calculated bycombining the P-values of the DMCs using Stouffer-Liptakmethod [65] The P-values for DMRs are then corrected for mul-tiple comparisons using the FDR method eDMR provides a listof DMRs and their annotation as output
Approaches in this category take sequencing coverage intoaccount They can incorporate additional covariates into themodel as well However they do not consider the biologicalvariation among the replicates Although eDMR estimates thesignificance of the identified regions based on spatial auto cor-relation it does not consider the spatial correlation among theCpG sites when estimating the methylation levels
Smoothing-based approaches
Approaches in this category assume that methylation levels ofthe CpG sites vary smoothly across the genome They performlsquosmoothingrsquo across the samples or predefined regions which isa technique to estimate the methylation levels of the CpG sitesby borrowing information from their neighbors Group differ-ences across different conditions are computed based on theestimated methylation values of the CpG sites Finally differentstatistical tests are used to identify the differentially methy-lated sites or regions
One of the most commonly used smoothing-basedapproaches is lsquoBSmoothrsquo [43] which relies on smoothing acrossthe genome within each sample It looks for group differencesvia CpG-wise t-tests to identify DMRs between two groups TheBSmooth algorithm begins with aligning the sequencing readsto the reference genome Two alternative pipelines are availablefor the users to align the reads The first pipeline which sup-ports gaped alignment and the alignment of the paired-endbisulfite-treated reads is based on in silico bisulfite conversionthat uses the lsquoBowtie-2rsquo aligner to align the reads [32] The
second pipeline is based on a newly developed aligner namedlsquoMermanrsquo which supports the alignment of the colorspacebisulfite reads After aligning the reads sample-specific qualityassessment metrics are compiled Local likelihood smoothing isapplied within a smoothing window across the samples to esti-mate the methylation levels of the CpG sites A signal-to-noisestatistic similar to t-test is used to identify the DMCs FinallyDMRs are defined by merging the consecutive DMCs based onsome defined criteria such as a cutoff value of the t-statisticmaximum distance between the CpG sites and minimum num-ber of CpG sites
BSmooth was the first approach primarily developed forDMR identification that takes into account the biological vari-ation among replicates It reduces the required sequencingcoverage by applying the local likelihood smoothing approachacross the samples It can also identify de novo regions fromWGBS data sets On the other hand BSmooth lacks suitableerror measurement criteria within the identified DMRs As a re-sult there is no way to check whether the identified CpG sitesinside the predicted DMRs are true DMCs or selected errone-ously BSmooth predicts methylation values of the CpG sitesbased on the last observed slope Hence for the genomic re-gions that are not covered by the reads previously observedmethylation level will continue resulting in a biased estimationof the methylation level (ie extrapolated methylation values of0 and 1) [66] BSmooth is not applicable to those data sets thatdo not have biological replicates In addition BSmooth is lim-ited to comparisons between two groups of conditions
Another approach in this category lsquoBiSeqrsquo performs thesmoothing of methylation data across defined candidate re-gions instead of across the samples (like BSmooth) [66] Thepipeline begins with defining CpG clusters within the genomebased on a minimum number of lsquofrequently covered CpG sitesrsquo(CpG sites that are covered by the majority of samples) and aproximity distance threshold defined by the user A smoothingfunction is modeled for each defined cluster While modelingthe smoothing function the coverage information for each CpGsite is taken into account to make sure that the CpG site withhigh coverage has a greater impact on the estimated methyla-tion level than the CpG site with low coverage Group effects ofthe CpG sites are modeled using beta regression with probit linkfunction DMCs are identified using Wald test procedure Nexta hierarchical testing procedure is applied to identify significantclusters containing at least one DMC While testing the targetregions weighted FDR is applied to take into account the size ofindividual clusters [67] A location-wise FDR approach is appliedto trim the CpG sites that are not differentially methylatedwithin the selected significant clusters
One of the major contributions of BiSeq approach is that itprovides region-wise error control measurement to test the tar-get regions This approach is also capable of adding additionalcovariates to the regression model In contrast one of the limi-tations of the BiSeq approach is that it is only suitable for ana-lyzing experiments that have predefined regions such as RRBSdata sets
In general smoothing-based approaches have the advantageof considering the spatial correlation between the methylationlevels of the CpG sites By performing smoothing the requiredsequencing coverage and the variance of the methylation levelscan be reduced [43] Furthermore they can estimate the methy-lation levels of missing CpG sites On the other hand smooth-ing-based approaches cannot detect the low CpG densityregions where methylation has sharp changes such as tran-scription factor binding sites (TFBS) TFBS are usually small
Identifying differential methylation | 5
(ielt50 bp) which might consist of a single CpG that is differen-tially methylated [68] Thus biological events involving a singleCpG site might not be detected by the smoothing approaches Inaddition these approaches are not appropriate for biologicalsystems whose true methylation levels of the CpG sites are notspatially correlated
Beta-binomial-based approaches
Approaches in this category characterize the methylation readcounts as a beta-binomial distribution In the absence of anybiological or technical variation methylation proportion of aparticular CpG site follows a binomial distribution becausesequencing reads over a CpG site can be either methylated orunmethylated Whenever biological and technical variation arepresent in the data methylation proportions of the CpG sitesare assumed to follow a beta distribution Therefore in the pres-ence of biological replicates an appropriate statistical model formethylation analysis is the beta-binomial model as it can takeinto account both sampling and biological variability
Over the past few years several beta-binomial-basedapproaches have been developed to identify DM such as DSS[69] MOABS [70] RADMeth [71] methylSig [72] DSS-single [73]MACAU [74] DSS-general [75] and GetisDMR [76] Theseapproaches differ from each other in the way they estimate re-gression parameters calculate P-values estimate DMR bounda-ries etc
lsquoDSSrsquo is one of the approaches in this category that relies ona beta-binomial hierarchical model to identify DM using bisul-fite sequencing data In this model the prior distribution is con-structed from the whole genome which is either methylated orunmethylated True methylation proportions of the CpG sitesamong the replicates are then modeled using the beta distribu-tion parameterized by group mean and a dispersion parameterThe biological variability is captured by the beta distributionwhereas the sampling variability is captured by the binomialdistribution Variation across the methylation proportion of theCpG sites relative to the group mean is captured by the disper-sion parameter which is estimated by an empirical Bayes ap-proach When the sample size is small a shrinkage approach isused to estimate the dispersion parameter to improve the over-all performance Differentially methylated CpG sites are deter-mined by using P-values from the Wald test which isperformed by comparing the mean methylation levels betweentwo groups Lastly candidate DMRs are defined by applyinguser-specified thresholds on DMR characteristics among whichare P-value minimum length and minimum number of CpGsites
The key contribution of the DSS approach is the shrinkageprocedure that improves the dispersion parameter estimationFor this reason this approach is particularly useful when thesample size is small By applying the Wald test procedure thisapproach takes into consideration the biological variation andsequencing coverage
A more recent method named lsquoDSS-singlersquo is an improvedversion of the DSS approach which can take into account thespatial correlation among the CpG sites across the genome Inaddition DSS-single considers the within-group variation with-out biological replicates by using the neighboring CpG sites aslsquopseudo-replicatesrsquo Similar to DSS DSS-single captures thetechnical variability using binomial distribution and the biolo-gical variability using beta distribution The beta distribution isparameterized with the group mean and dispersion parameterDSS-single estimates the group mean using a smoothing
function and the dispersion parameter using an empirical Bayesprocedure Hypothesis testing is performed using the Wald testto identify the DMCs Later user-defined thresholds are appliedto define the DMR boundaries and select candidate DMRs
An even more recent variation of DSS approach namedlsquoDSS-generalrsquo identifies differentially methylated loci (DML)from bisulfite sequencing data under general experiment de-sign DSS-general identifies DML by modeling the methylationcount data for each locus using the beta-binomial regressionwith the lsquoarcsinersquo link function The lsquoarcsinersquo link function isapplied to perform a data transformation that decreases the de-pendency of the data variance on the mean and prepares it forthe next step Due to this data transformation the regressioncoefficient and the variance matrix can be estimated by apply-ing the generalized least square method as opposed to thebeta-binomial generalized linear model or logistic regressionwhich are limited when values are separable (eg values forunmethylated sites are close to 0 values for methylated sitesare close to 1) Finally Wald test is used to perform hypothesistesting
The key advantage of DSS-general approach is that it is ap-plicable to bisulfite sequencing data with multiple groups orcovariates In addition it uses lsquoarcsinersquo link function which ismore efficient than other widely used lsquologitrsquo and lsquoprobitrsquo func-tions because it estimates the regression parameters in oneiteration
lsquoMOABSrsquo is another approach that relies on beta-binomialassumption to identify DM Similar to DSS the prior distributionis constructed from the whole genome resulting in a bimodaldistribution The posterior distribution follows a beta distribu-tion which is estimated using an empirical Bayes approachWhen biological replicates are available the posterior distribu-tion is generated using the maximum likelihood approach Thesignificance of the DM between two samples is represented by asingle metric named lsquocredible methylation differencersquo whichincorporates both the biological and statistical significance ofthe DM MOABS can also work with CHG or CHH methylation
lsquoRADMethrsquo is another analysis pipeline that relies on thebeta-binomial assumption RADMeth uses a beta-binomial re-gression approach using lsquologitrsquo link function to model themethylation levels of the CpG sites across the samplesRegression parameters are estimated using a standard max-imum likelihood approach In the beta-binomial regressionmodel RADMeth incorporates the experimental factors using amodel matrix The DM of a particular site is determined by com-paring two fitted regression models (ie reduced model withoutfactors and full model with factors) using the log-likelihoodratio Subsequently P-values of the neighboring CpG sites arecombined using the weighted Z-test (ie Stouffer-Liptak test[77]) to obtain the DMRs The key contribution of this approachis the ability to analyze WGBS data in multiple factorexperiments
lsquoMethylSigrsquo is another analysis pipeline that uses beta-binomial model across the samples to identify either DMCs orDMRs The pipeline begins with taking the number of Cs and Tsas input The approach uses the beta-binomial model to esti-mate the methylation levels at each CpG site or region whichinvolves the two following steps (i) estimate the dispersion par-ameter for each CpG site or region which accounts for biologicalvariation among the samples within a group and (ii) calculatethe group methylation level at each CpG site or region using theestimated dispersion parameters In each step local informa-tion can be incorporated from nearby CpG sites or regions to in-crease statistical power The significance level of the
6 | Shafi et al
methylation difference is calculated using the likelihood ratiotest Similar to DSS MethylSig is useful when the sample size issmall MethylSig uses local information and a maximum likeli-hood estimator to compute both the methylation level and thevariance
lsquoMACAUrsquo is based on binomial mixed model (BMM) thattakes into account the population structures from a data setThis model is a generalized beta-binomial model consisting ofan extra term to model the population structure In the absenceof that extra term this model can be reduced to a beta-binomialmodel In this approach the prior distribution is constructedfrom a BMM whereas the posterior distribution is constructedfrom a log-normal distribution Model parameters are estimatedby using a Markov chain Monte Carlo (MCMC) algorithm-basedapproach Hypothesis testing is performed by using Wald testFinally DMRs are constructed by merging the DMCs using em-pirical thresholds
One advantage of this approach is that it can add a predictorvariable of interest in the model to check the association withany genetic background In addition to considering biologicalvariability among the replicates and the sampling variabilityamong the sequencing reads this method also takes into con-sideration the population variability Furthermore it can beapplied to both WGBS and RRBS data sets
lsquoGetisDMRrsquo a recent beta-binomial-based approach identi-fies variable-size DMRs directly from WGBS data by using a localGetis-Ord statistic which is commonly used to identify statistic-ally significant spatial clusters (hotspots) By incorporating thisstatistic into DM analysis GetisDMR accounts for spatial correl-ation among the methylation levels of the CpG sites along withthe biological and sampling variability When biological repli-cates are available beta-binomial regression with logistic linkfunction is used to model the methylation level of each CpGsite Model parameters are estimated by using the maximumlikelihood function Hypothesis testing is performed by usingthe likelihood ratio test In the absence of biological replicatesmethylation levels are modeled by using binomial distributionand hypothesis testing is performed by using FET P-valuesfrom the hypothesis testing are further used to calculatez-scores Finally a local Getis-Ord statistic is used based on thez-scores to identify DMRs using the information from the neigh-boring CpG sites The Getis-Ord statistic uses the distribution ofthe data (ie z-scores) to compute a score of the nonrandom as-sociation between a data point and its neighbors where a posi-tive score shows a positive association and a negative scoreshows a negative association This statistic is then used to iden-tify data regions with points that exhibit nonrandom associ-ations (ie DMRs)
One of the primary strengths of GetisDMR is that it can de-tect DMRs with variable length instead of depending on user-specified threshold parameters It can take into account thespatial correlation between the neighboring CpG sitesAdditionally it can incorporate additional confounding factorsinto the model Furthermore it can work with multiple groupswith or without biological replicates One drawback of this ap-proach is that it cannot work with enriched regions such asRRBS data
Beta-binomial-based approaches are useful because theytake into account both sampling variability among the readcounts and biological variability among the replicatesFurthermore these approaches are able to identify DM at sin-gle-base resolution from low CpG-density regions (eg TFBS)On the other hand most of the beta-binomial-based approaches(except DSS-single MACAU and GetisDMR) do not take into
account the spatial correlation between the methylation levelsof the CpG sites
Hidden Markov model-based approaches
Approaches in this category use hidden Markov model (HMM) toidentify differentially methylated patterns from bisulfitesequencing data These approaches model the methylation lev-els of the CpG sites as methylation states (ie hypermethyla-tion hypomethylation and no change) instead of continuousmethylation values Transition probabilities among the methy-lation states represent the distance distribution among theDMCs whereas emission probabilities represent the likelihoodof DM for the CpG sites High transition probabilities and lowtransition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within theirmethylation levels respectively Parameters are estimated usu-ally by using established learning algorithms whereas potentialDMRs are identified using different statistical approaches
One of the approaches in this category named lsquoComMetrsquo [64]included in the Bisulfighter methylation analysis suite [78 79]combines all the samples within a group into one sample andidentifies the DMRs by comparing a pair of two samples Thismethod captures the probability distribution of distances be-tween the neighboring DMCs and adjusts the DMC chaining cri-teria automatically for each data set Transition probabilitiesare estimated using an expectation maximization algorithmwhereas emission probabilities are estimated from a beta-binomial mixture model Parameters of the beta-binomialmodel are estimated by incorporating an unsupervised learningalgorithm DMRs are identified by using a dynamic program-ming algorithm
One of the advantages of ComMet is that it does not requirebiological replicates to identify DMRs It takes into account thesequencing coverage and the spatial distribution of the neigh-boring CpG sites On the other hand one of the limitations ofthis approach is that it does not take into account the biologicalvariation across replicates which might lead to higher numberof false positives in the results [14 43 46]
Another approach in this category is lsquoHMM-Fisherrsquo [80]which estimates the methylation status of the CpG sites foreach sample instead of combining all the samples Similar toComMet HMM-Fisher models both the similarity and dissimi-larity of the methylation levels of the neighboring CpG sitesusing transition probability HMM-Fisher estimates the transi-tion probabilities using a Dirichlet distribution whereas emis-sion probabilities are computed using a truncated normaldistribution After estimating the methylation levels of all theCpG sites for each sample differentially methylated CpG sitesare identified using FET Identified DMCs are further groupedinto DMRs if the distance between the CpG sites is lt100 basesNon-consecutive CpG sites are reported as DMCs in the output
One of the major contributions of HMM-Fisher is that it canidentify DMRs of variable size instead of depending on user-defined boundary thresholds It takes the biological variationamong the replicates into account and can provide both DMCsand DMRs as output It can also be used to identify sample-wisemethylation patterns
lsquoHMM-DMrsquo [81] is another approach that uses HMM to iden-tify DM HMM-DM directly estimates the DM states of the CpGsites for each sample across the groups In this approach thetransition probability of each CpG site only depends on themethylation state of the immediate previous CpG site LikeHMM-Fisher and ComMet the transition probabilities are
Identifying differential methylation | 7
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
allowed for methylation analysis with reduced sequencing re-quirements through a more targeted approach for CpG-rich gen-omic regions that meet specific length requirements [19] Thesetechniques therefore are more affordable for studies with mul-tiple replicates
The overall workflow for bisulfite sequencing data analysisis displayed in Figure 2 The overall pipeline consists of sixmajor elements (i) the input including methylation data(in FASTAFASTQ format) and the reference genome (ii) data
processing and quality control (iii) alignment of short reads tothe reference genome (iv) post-alignment analysis (v) DM ana-lysis and (vi) the output including DMCs DMRs and methylationpatterns The details of each element will be described in thefollowing sections
Pre-analysisData preprocessing
Bisulfite sequencing data consist of short read sequences in theFASTAFASTQ file format Data processing starts with perform-ing quality control operations on the raw sequencing readsincluding quality trimming and adapter trimming Quality trim-ming reduces methylation call errors by trimming the bases
that have poor quality scores whereas adapter trimming re-moves the known adapters from short reads to increase map-ping efficiency Existing tools for quality control include FASTX-Toolkit [20] PRINSEQ [21] SolexaQA [22] Cutadapt [23]Trimmomatic [24] and Trim Galore [25] Both the input and out-put of these tools are files in the FASTAFASTQ format
Read mapping
After quality control bisulfite sequencing reads can be alignedto the reference genome to estimate the methylation levelsSimply aligning these reads by using standard aligners resultsin poor mapping efficiency because the bisulfite treatmentintroduces additional discrepancies between the sequencingreads and the reference genome by converting the unmethy-lated cytosines to thymines Therefore new strategies were pro-posed for bisulfite sequencing read alignment Existing bisulfitesequencing alignment approaches can be divided in two catego-ries three-letter aligners and wildcard aligners Three-letteraligners such as Bismark [26] BS Seeker [27] MethylCoder [28]BRAT [29] and GNUMAP-bs [30] convert all Cs into Ts in the for-ward strand and all Gs into As in the reverse strand of the refer-ence genome Equivalently converted reads are then aligned tothese pre-converted forms of the reference genomes using
Figure 2 The workflow of analyzing DNA methylation using bisulfite sequencing data
Identifying differential methylation | 3
standard genome aligners such as Bowtie [31] and Bowtie2 [32]In contrast wildcard aligners such as BSMAP [33] RRBSMAP[34] GSNAP [35] and RMAP [36] replace the Cs of the referencegenome with the wildcard letter Y that matches both Cs and Tsin the sequencing reads The alignment results are usuallystored in SAMBAM file format
Post-alignment analysis
After mapping the reads an optional post-alignment step canbe performed to extract meaningful biological information fromthe alignment results before DM analysis Several post-alignment analysis tools have been developed including BiQAnalyzer [37] QUMA [38] BRAT [29] MethyQA [39] BSPAT [40]and MethGo [41] Most of these tools provide summary statis-tics quality assessment and visualization of the methylationdata Some of these tools include extra features such as readmapping (eg BSPAT and BRAT) identifying DNA methylationco-occurrence pattern (eg BSPAT) single nucleotide poly-morphisms and copy number variation calling (eg MethGo)and detecting allele-specific methylation patterns (eg BSPAT)
DM analysis
After obtaining the methylation information of the CpG sitestypically the next downstream analysis is to perform DM ana-lysis which is usually done in the form of identifying DMCs orDMRs Identification of DMCs involves comparing the methyla-tion level at each CpG site across the phenotypes (two or more)and applying statistical tests for hypothesis testingIdentification of DMRs is usually a two-step process (i) the iden-tification of DMCs and (ii) grouping the neighboring DMCs ascontiguous DMRs by certain distance criteria However someapproaches can directly identify DMRs DMCsDMRs occasion-ally can be linked to transcriptional repression of the associatedgenes therefore they provide crucial biological insights thatmay lead to the development of potential drug candidates [1]
To identify putative potential DMCsDMRs from bisulfitesequencing data some characteristics need to be consideredOne such characteristic is the lsquospatial correlationrsquo between themethylation levels of the neighboring CpG sites which plays animportant role in getting an accurate estimation of the methyla-tion levels [3 42] Incorporating spatial correlation in DM ana-lysis can reduce the required sequencing depth and canestimate the methylation status of the missing CpG sites [43]lsquoSequencing depthrsquo is another important characteristic that isdirectly related to the certainty of the methylation scores ofCpG sites Considering sequencing depth while identifyingDMRs is crucial because it can take into account the samplingvariability that occurs during sequencing Another such charac-teristic is lsquobiological variationrsquo among replicates which is cru-cial in identifying the regions that consistently differ betweengroups of samples [44 45] Ignoring biological variation whiledetecting DMRs might lead to a high number of false positivesin the results [14 43 46] This is due to the fact that the methy-lation levels of the CpG sites are heterogeneous not only whenthe cell types are different but also when the cells are of thesame type [47ndash50]
Classical hypothesis testing methods such as Fisherrsquos exacttest (FET) chi-square (v2) test regression approaches t-testmoderated t-test Goemanrsquos global test and analysis of variance(ANOVA) can be used to identify DM using bisulfite sequencingdata [3 46 51 52 53] These approaches can be divided into two
categories based on the data type they use count-based hy-pothesis tests and ratio-based hypothesis tests
Count-based hypothesis tests
Input of these hypothesis testing methods are count valueswhich can be either the number of reads or the number of CpGsites in a predefined genomic region FET is a classical statisticaltest used to determine whether there are nonrandom associ-ations between two categorical variables In the context ofmethylation analysis we can use the data to build a contin-gency table where the two rows represent the two methylationstates and the two columns represent a pair of samples Whenapplying FET for two groups of samples the counts for a methy-lation status within each group are aggregated into a singlenumber [54] Chi-square test is another classical method to testthe relationship between two categorical variables (methylatedversus unmethylated) In contrast with FET it allows for testingacross multiple samples As pointed out by Sun et al [17] andHurlbert et al [55] there are several issues related to the aggre-gation of read counts into a single number while applying testsof independence (FET and v2 test) First the read counts are notindependent they represent different sets of interdependent orcorrelated observations Thus aggregating the counts violatesthe fundamental assumption underlying the test for independ-ence Second due to uneven coverage of each individual sitethe results are biased toward the samples with higher coverageThird by aggregating (summing) the counts some of the biolo-gical variations (eg sample size intra-group variance) is nottaken into account by the hypothesis testing Therefore usingFET and v2 test to compare two groups of samples could lead toa high number of false positives [14 43 46]
Regression approaches (eg Poisson quasi-Poisson negativebinomial regression) are primarily used for detecting differen-tially expressed genes using RNA-Seq data but they can also beapplied in the context of DM analysis [15] For example the readcounts can be modeled using a Poisson distribution and a modi-fied Wald test can be used to detect DM as the difference be-tween two Poisson means [56 57]
Ratio-based hypothesis testsThese hypothesis tests use methylation percentage (methyla-tion ratio) instead of count values For a particular CpG sitemethylation percentage is calculated by taking the ratio be-tween the methylated read counts and the total read counts ofthat site To compare the methylation difference level betweentwo groups (phenotypes) of samples classical tests such ast-test [58 59] moderated t-test (limma) [60] or Goemanrsquos globaltest [61] can be used While t-test is a classical approach to com-pare the means limma and Goemanrsquos test are empiricalBayesian approaches that were primarily designed to detect dif-ferentially expressed genes using microarray data When ana-lyzing methylation levels across multiple groups of samplesANOVA [62] can be used instead of multiple pair-wise compari-sons Compared with count-based hypothesis tests the ratio-based tests take into account the biological variation acrossmultiple replicates However because they only take into ac-count the ratio of the reads (methylated reads versus all reads)they ignore the sequencing depth within the CpG sites
Although classical hypothesis testing methods are some-what useful straightforward and easy to use they are not effi-cient in more sophisticated methylation analysis such asidentifying de novo regions considering spatial correlationamong the methylation levels of the CpG sites and estimating
4 | Shafi et al
methylation levels of missing CpG sites Over the past fewyears several approaches have been developed to address thesechallenges which are discussed and summarized in the follow-ing subsections
Logistic regression-based approaches
Approaches in this category model the read counts of the CpGsites by using logistic regression to identify DM One of thepopular approaches in this category is lsquomethylKitrsquo [54] whichuses logistic regression to model the methylation proportion ata given base or region when biological replicates are availableIn the absence of biological replicates methylKit uses FET toidentify DM P-values are corrected using the false discoveryrate (FDR) approach or the sliding linear model approach [63]MethylKit is commonly used to identify DMCs from predefinedregions (RRBS data) However it can also be used to identifyDMRs from WGBS data based on user-defined tiling windowsMajor contribution of methylKit is that it can take into accountthe sequencing coverage It can incorporate additional covari-ates into the model and work with CHG or CHH methylation Italso provides functionalities such as sample-wise methylationsummary sample clustering annotation and visualization ofDM etc
Another method named lsquoeDMRrsquo [64] was proposed as an ex-tension of methylKit eDMR models the distances between theneighboring CpG sites using a bimodal normal distribution andestimates DMR boundaries using a weighted cost function Afterestimating the regional boundaries DMRs are filtered based onthe mean methylation difference the number of DMCs and thenumber of CpG sites Significance of the DMRs are calculated bycombining the P-values of the DMCs using Stouffer-Liptakmethod [65] The P-values for DMRs are then corrected for mul-tiple comparisons using the FDR method eDMR provides a listof DMRs and their annotation as output
Approaches in this category take sequencing coverage intoaccount They can incorporate additional covariates into themodel as well However they do not consider the biologicalvariation among the replicates Although eDMR estimates thesignificance of the identified regions based on spatial auto cor-relation it does not consider the spatial correlation among theCpG sites when estimating the methylation levels
Smoothing-based approaches
Approaches in this category assume that methylation levels ofthe CpG sites vary smoothly across the genome They performlsquosmoothingrsquo across the samples or predefined regions which isa technique to estimate the methylation levels of the CpG sitesby borrowing information from their neighbors Group differ-ences across different conditions are computed based on theestimated methylation values of the CpG sites Finally differentstatistical tests are used to identify the differentially methy-lated sites or regions
One of the most commonly used smoothing-basedapproaches is lsquoBSmoothrsquo [43] which relies on smoothing acrossthe genome within each sample It looks for group differencesvia CpG-wise t-tests to identify DMRs between two groups TheBSmooth algorithm begins with aligning the sequencing readsto the reference genome Two alternative pipelines are availablefor the users to align the reads The first pipeline which sup-ports gaped alignment and the alignment of the paired-endbisulfite-treated reads is based on in silico bisulfite conversionthat uses the lsquoBowtie-2rsquo aligner to align the reads [32] The
second pipeline is based on a newly developed aligner namedlsquoMermanrsquo which supports the alignment of the colorspacebisulfite reads After aligning the reads sample-specific qualityassessment metrics are compiled Local likelihood smoothing isapplied within a smoothing window across the samples to esti-mate the methylation levels of the CpG sites A signal-to-noisestatistic similar to t-test is used to identify the DMCs FinallyDMRs are defined by merging the consecutive DMCs based onsome defined criteria such as a cutoff value of the t-statisticmaximum distance between the CpG sites and minimum num-ber of CpG sites
BSmooth was the first approach primarily developed forDMR identification that takes into account the biological vari-ation among replicates It reduces the required sequencingcoverage by applying the local likelihood smoothing approachacross the samples It can also identify de novo regions fromWGBS data sets On the other hand BSmooth lacks suitableerror measurement criteria within the identified DMRs As a re-sult there is no way to check whether the identified CpG sitesinside the predicted DMRs are true DMCs or selected errone-ously BSmooth predicts methylation values of the CpG sitesbased on the last observed slope Hence for the genomic re-gions that are not covered by the reads previously observedmethylation level will continue resulting in a biased estimationof the methylation level (ie extrapolated methylation values of0 and 1) [66] BSmooth is not applicable to those data sets thatdo not have biological replicates In addition BSmooth is lim-ited to comparisons between two groups of conditions
Another approach in this category lsquoBiSeqrsquo performs thesmoothing of methylation data across defined candidate re-gions instead of across the samples (like BSmooth) [66] Thepipeline begins with defining CpG clusters within the genomebased on a minimum number of lsquofrequently covered CpG sitesrsquo(CpG sites that are covered by the majority of samples) and aproximity distance threshold defined by the user A smoothingfunction is modeled for each defined cluster While modelingthe smoothing function the coverage information for each CpGsite is taken into account to make sure that the CpG site withhigh coverage has a greater impact on the estimated methyla-tion level than the CpG site with low coverage Group effects ofthe CpG sites are modeled using beta regression with probit linkfunction DMCs are identified using Wald test procedure Nexta hierarchical testing procedure is applied to identify significantclusters containing at least one DMC While testing the targetregions weighted FDR is applied to take into account the size ofindividual clusters [67] A location-wise FDR approach is appliedto trim the CpG sites that are not differentially methylatedwithin the selected significant clusters
One of the major contributions of BiSeq approach is that itprovides region-wise error control measurement to test the tar-get regions This approach is also capable of adding additionalcovariates to the regression model In contrast one of the limi-tations of the BiSeq approach is that it is only suitable for ana-lyzing experiments that have predefined regions such as RRBSdata sets
In general smoothing-based approaches have the advantageof considering the spatial correlation between the methylationlevels of the CpG sites By performing smoothing the requiredsequencing coverage and the variance of the methylation levelscan be reduced [43] Furthermore they can estimate the methy-lation levels of missing CpG sites On the other hand smooth-ing-based approaches cannot detect the low CpG densityregions where methylation has sharp changes such as tran-scription factor binding sites (TFBS) TFBS are usually small
Identifying differential methylation | 5
(ielt50 bp) which might consist of a single CpG that is differen-tially methylated [68] Thus biological events involving a singleCpG site might not be detected by the smoothing approaches Inaddition these approaches are not appropriate for biologicalsystems whose true methylation levels of the CpG sites are notspatially correlated
Beta-binomial-based approaches
Approaches in this category characterize the methylation readcounts as a beta-binomial distribution In the absence of anybiological or technical variation methylation proportion of aparticular CpG site follows a binomial distribution becausesequencing reads over a CpG site can be either methylated orunmethylated Whenever biological and technical variation arepresent in the data methylation proportions of the CpG sitesare assumed to follow a beta distribution Therefore in the pres-ence of biological replicates an appropriate statistical model formethylation analysis is the beta-binomial model as it can takeinto account both sampling and biological variability
Over the past few years several beta-binomial-basedapproaches have been developed to identify DM such as DSS[69] MOABS [70] RADMeth [71] methylSig [72] DSS-single [73]MACAU [74] DSS-general [75] and GetisDMR [76] Theseapproaches differ from each other in the way they estimate re-gression parameters calculate P-values estimate DMR bounda-ries etc
lsquoDSSrsquo is one of the approaches in this category that relies ona beta-binomial hierarchical model to identify DM using bisul-fite sequencing data In this model the prior distribution is con-structed from the whole genome which is either methylated orunmethylated True methylation proportions of the CpG sitesamong the replicates are then modeled using the beta distribu-tion parameterized by group mean and a dispersion parameterThe biological variability is captured by the beta distributionwhereas the sampling variability is captured by the binomialdistribution Variation across the methylation proportion of theCpG sites relative to the group mean is captured by the disper-sion parameter which is estimated by an empirical Bayes ap-proach When the sample size is small a shrinkage approach isused to estimate the dispersion parameter to improve the over-all performance Differentially methylated CpG sites are deter-mined by using P-values from the Wald test which isperformed by comparing the mean methylation levels betweentwo groups Lastly candidate DMRs are defined by applyinguser-specified thresholds on DMR characteristics among whichare P-value minimum length and minimum number of CpGsites
The key contribution of the DSS approach is the shrinkageprocedure that improves the dispersion parameter estimationFor this reason this approach is particularly useful when thesample size is small By applying the Wald test procedure thisapproach takes into consideration the biological variation andsequencing coverage
A more recent method named lsquoDSS-singlersquo is an improvedversion of the DSS approach which can take into account thespatial correlation among the CpG sites across the genome Inaddition DSS-single considers the within-group variation with-out biological replicates by using the neighboring CpG sites aslsquopseudo-replicatesrsquo Similar to DSS DSS-single captures thetechnical variability using binomial distribution and the biolo-gical variability using beta distribution The beta distribution isparameterized with the group mean and dispersion parameterDSS-single estimates the group mean using a smoothing
function and the dispersion parameter using an empirical Bayesprocedure Hypothesis testing is performed using the Wald testto identify the DMCs Later user-defined thresholds are appliedto define the DMR boundaries and select candidate DMRs
An even more recent variation of DSS approach namedlsquoDSS-generalrsquo identifies differentially methylated loci (DML)from bisulfite sequencing data under general experiment de-sign DSS-general identifies DML by modeling the methylationcount data for each locus using the beta-binomial regressionwith the lsquoarcsinersquo link function The lsquoarcsinersquo link function isapplied to perform a data transformation that decreases the de-pendency of the data variance on the mean and prepares it forthe next step Due to this data transformation the regressioncoefficient and the variance matrix can be estimated by apply-ing the generalized least square method as opposed to thebeta-binomial generalized linear model or logistic regressionwhich are limited when values are separable (eg values forunmethylated sites are close to 0 values for methylated sitesare close to 1) Finally Wald test is used to perform hypothesistesting
The key advantage of DSS-general approach is that it is ap-plicable to bisulfite sequencing data with multiple groups orcovariates In addition it uses lsquoarcsinersquo link function which ismore efficient than other widely used lsquologitrsquo and lsquoprobitrsquo func-tions because it estimates the regression parameters in oneiteration
lsquoMOABSrsquo is another approach that relies on beta-binomialassumption to identify DM Similar to DSS the prior distributionis constructed from the whole genome resulting in a bimodaldistribution The posterior distribution follows a beta distribu-tion which is estimated using an empirical Bayes approachWhen biological replicates are available the posterior distribu-tion is generated using the maximum likelihood approach Thesignificance of the DM between two samples is represented by asingle metric named lsquocredible methylation differencersquo whichincorporates both the biological and statistical significance ofthe DM MOABS can also work with CHG or CHH methylation
lsquoRADMethrsquo is another analysis pipeline that relies on thebeta-binomial assumption RADMeth uses a beta-binomial re-gression approach using lsquologitrsquo link function to model themethylation levels of the CpG sites across the samplesRegression parameters are estimated using a standard max-imum likelihood approach In the beta-binomial regressionmodel RADMeth incorporates the experimental factors using amodel matrix The DM of a particular site is determined by com-paring two fitted regression models (ie reduced model withoutfactors and full model with factors) using the log-likelihoodratio Subsequently P-values of the neighboring CpG sites arecombined using the weighted Z-test (ie Stouffer-Liptak test[77]) to obtain the DMRs The key contribution of this approachis the ability to analyze WGBS data in multiple factorexperiments
lsquoMethylSigrsquo is another analysis pipeline that uses beta-binomial model across the samples to identify either DMCs orDMRs The pipeline begins with taking the number of Cs and Tsas input The approach uses the beta-binomial model to esti-mate the methylation levels at each CpG site or region whichinvolves the two following steps (i) estimate the dispersion par-ameter for each CpG site or region which accounts for biologicalvariation among the samples within a group and (ii) calculatethe group methylation level at each CpG site or region using theestimated dispersion parameters In each step local informa-tion can be incorporated from nearby CpG sites or regions to in-crease statistical power The significance level of the
6 | Shafi et al
methylation difference is calculated using the likelihood ratiotest Similar to DSS MethylSig is useful when the sample size issmall MethylSig uses local information and a maximum likeli-hood estimator to compute both the methylation level and thevariance
lsquoMACAUrsquo is based on binomial mixed model (BMM) thattakes into account the population structures from a data setThis model is a generalized beta-binomial model consisting ofan extra term to model the population structure In the absenceof that extra term this model can be reduced to a beta-binomialmodel In this approach the prior distribution is constructedfrom a BMM whereas the posterior distribution is constructedfrom a log-normal distribution Model parameters are estimatedby using a Markov chain Monte Carlo (MCMC) algorithm-basedapproach Hypothesis testing is performed by using Wald testFinally DMRs are constructed by merging the DMCs using em-pirical thresholds
One advantage of this approach is that it can add a predictorvariable of interest in the model to check the association withany genetic background In addition to considering biologicalvariability among the replicates and the sampling variabilityamong the sequencing reads this method also takes into con-sideration the population variability Furthermore it can beapplied to both WGBS and RRBS data sets
lsquoGetisDMRrsquo a recent beta-binomial-based approach identi-fies variable-size DMRs directly from WGBS data by using a localGetis-Ord statistic which is commonly used to identify statistic-ally significant spatial clusters (hotspots) By incorporating thisstatistic into DM analysis GetisDMR accounts for spatial correl-ation among the methylation levels of the CpG sites along withthe biological and sampling variability When biological repli-cates are available beta-binomial regression with logistic linkfunction is used to model the methylation level of each CpGsite Model parameters are estimated by using the maximumlikelihood function Hypothesis testing is performed by usingthe likelihood ratio test In the absence of biological replicatesmethylation levels are modeled by using binomial distributionand hypothesis testing is performed by using FET P-valuesfrom the hypothesis testing are further used to calculatez-scores Finally a local Getis-Ord statistic is used based on thez-scores to identify DMRs using the information from the neigh-boring CpG sites The Getis-Ord statistic uses the distribution ofthe data (ie z-scores) to compute a score of the nonrandom as-sociation between a data point and its neighbors where a posi-tive score shows a positive association and a negative scoreshows a negative association This statistic is then used to iden-tify data regions with points that exhibit nonrandom associ-ations (ie DMRs)
One of the primary strengths of GetisDMR is that it can de-tect DMRs with variable length instead of depending on user-specified threshold parameters It can take into account thespatial correlation between the neighboring CpG sitesAdditionally it can incorporate additional confounding factorsinto the model Furthermore it can work with multiple groupswith or without biological replicates One drawback of this ap-proach is that it cannot work with enriched regions such asRRBS data
Beta-binomial-based approaches are useful because theytake into account both sampling variability among the readcounts and biological variability among the replicatesFurthermore these approaches are able to identify DM at sin-gle-base resolution from low CpG-density regions (eg TFBS)On the other hand most of the beta-binomial-based approaches(except DSS-single MACAU and GetisDMR) do not take into
account the spatial correlation between the methylation levelsof the CpG sites
Hidden Markov model-based approaches
Approaches in this category use hidden Markov model (HMM) toidentify differentially methylated patterns from bisulfitesequencing data These approaches model the methylation lev-els of the CpG sites as methylation states (ie hypermethyla-tion hypomethylation and no change) instead of continuousmethylation values Transition probabilities among the methy-lation states represent the distance distribution among theDMCs whereas emission probabilities represent the likelihoodof DM for the CpG sites High transition probabilities and lowtransition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within theirmethylation levels respectively Parameters are estimated usu-ally by using established learning algorithms whereas potentialDMRs are identified using different statistical approaches
One of the approaches in this category named lsquoComMetrsquo [64]included in the Bisulfighter methylation analysis suite [78 79]combines all the samples within a group into one sample andidentifies the DMRs by comparing a pair of two samples Thismethod captures the probability distribution of distances be-tween the neighboring DMCs and adjusts the DMC chaining cri-teria automatically for each data set Transition probabilitiesare estimated using an expectation maximization algorithmwhereas emission probabilities are estimated from a beta-binomial mixture model Parameters of the beta-binomialmodel are estimated by incorporating an unsupervised learningalgorithm DMRs are identified by using a dynamic program-ming algorithm
One of the advantages of ComMet is that it does not requirebiological replicates to identify DMRs It takes into account thesequencing coverage and the spatial distribution of the neigh-boring CpG sites On the other hand one of the limitations ofthis approach is that it does not take into account the biologicalvariation across replicates which might lead to higher numberof false positives in the results [14 43 46]
Another approach in this category is lsquoHMM-Fisherrsquo [80]which estimates the methylation status of the CpG sites foreach sample instead of combining all the samples Similar toComMet HMM-Fisher models both the similarity and dissimi-larity of the methylation levels of the neighboring CpG sitesusing transition probability HMM-Fisher estimates the transi-tion probabilities using a Dirichlet distribution whereas emis-sion probabilities are computed using a truncated normaldistribution After estimating the methylation levels of all theCpG sites for each sample differentially methylated CpG sitesare identified using FET Identified DMCs are further groupedinto DMRs if the distance between the CpG sites is lt100 basesNon-consecutive CpG sites are reported as DMCs in the output
One of the major contributions of HMM-Fisher is that it canidentify DMRs of variable size instead of depending on user-defined boundary thresholds It takes the biological variationamong the replicates into account and can provide both DMCsand DMRs as output It can also be used to identify sample-wisemethylation patterns
lsquoHMM-DMrsquo [81] is another approach that uses HMM to iden-tify DM HMM-DM directly estimates the DM states of the CpGsites for each sample across the groups In this approach thetransition probability of each CpG site only depends on themethylation state of the immediate previous CpG site LikeHMM-Fisher and ComMet the transition probabilities are
Identifying differential methylation | 7
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
standard genome aligners such as Bowtie [31] and Bowtie2 [32]In contrast wildcard aligners such as BSMAP [33] RRBSMAP[34] GSNAP [35] and RMAP [36] replace the Cs of the referencegenome with the wildcard letter Y that matches both Cs and Tsin the sequencing reads The alignment results are usuallystored in SAMBAM file format
Post-alignment analysis
After mapping the reads an optional post-alignment step canbe performed to extract meaningful biological information fromthe alignment results before DM analysis Several post-alignment analysis tools have been developed including BiQAnalyzer [37] QUMA [38] BRAT [29] MethyQA [39] BSPAT [40]and MethGo [41] Most of these tools provide summary statis-tics quality assessment and visualization of the methylationdata Some of these tools include extra features such as readmapping (eg BSPAT and BRAT) identifying DNA methylationco-occurrence pattern (eg BSPAT) single nucleotide poly-morphisms and copy number variation calling (eg MethGo)and detecting allele-specific methylation patterns (eg BSPAT)
DM analysis
After obtaining the methylation information of the CpG sitestypically the next downstream analysis is to perform DM ana-lysis which is usually done in the form of identifying DMCs orDMRs Identification of DMCs involves comparing the methyla-tion level at each CpG site across the phenotypes (two or more)and applying statistical tests for hypothesis testingIdentification of DMRs is usually a two-step process (i) the iden-tification of DMCs and (ii) grouping the neighboring DMCs ascontiguous DMRs by certain distance criteria However someapproaches can directly identify DMRs DMCsDMRs occasion-ally can be linked to transcriptional repression of the associatedgenes therefore they provide crucial biological insights thatmay lead to the development of potential drug candidates [1]
To identify putative potential DMCsDMRs from bisulfitesequencing data some characteristics need to be consideredOne such characteristic is the lsquospatial correlationrsquo between themethylation levels of the neighboring CpG sites which plays animportant role in getting an accurate estimation of the methyla-tion levels [3 42] Incorporating spatial correlation in DM ana-lysis can reduce the required sequencing depth and canestimate the methylation status of the missing CpG sites [43]lsquoSequencing depthrsquo is another important characteristic that isdirectly related to the certainty of the methylation scores ofCpG sites Considering sequencing depth while identifyingDMRs is crucial because it can take into account the samplingvariability that occurs during sequencing Another such charac-teristic is lsquobiological variationrsquo among replicates which is cru-cial in identifying the regions that consistently differ betweengroups of samples [44 45] Ignoring biological variation whiledetecting DMRs might lead to a high number of false positivesin the results [14 43 46] This is due to the fact that the methy-lation levels of the CpG sites are heterogeneous not only whenthe cell types are different but also when the cells are of thesame type [47ndash50]
Classical hypothesis testing methods such as Fisherrsquos exacttest (FET) chi-square (v2) test regression approaches t-testmoderated t-test Goemanrsquos global test and analysis of variance(ANOVA) can be used to identify DM using bisulfite sequencingdata [3 46 51 52 53] These approaches can be divided into two
categories based on the data type they use count-based hy-pothesis tests and ratio-based hypothesis tests
Count-based hypothesis tests
Input of these hypothesis testing methods are count valueswhich can be either the number of reads or the number of CpGsites in a predefined genomic region FET is a classical statisticaltest used to determine whether there are nonrandom associ-ations between two categorical variables In the context ofmethylation analysis we can use the data to build a contin-gency table where the two rows represent the two methylationstates and the two columns represent a pair of samples Whenapplying FET for two groups of samples the counts for a methy-lation status within each group are aggregated into a singlenumber [54] Chi-square test is another classical method to testthe relationship between two categorical variables (methylatedversus unmethylated) In contrast with FET it allows for testingacross multiple samples As pointed out by Sun et al [17] andHurlbert et al [55] there are several issues related to the aggre-gation of read counts into a single number while applying testsof independence (FET and v2 test) First the read counts are notindependent they represent different sets of interdependent orcorrelated observations Thus aggregating the counts violatesthe fundamental assumption underlying the test for independ-ence Second due to uneven coverage of each individual sitethe results are biased toward the samples with higher coverageThird by aggregating (summing) the counts some of the biolo-gical variations (eg sample size intra-group variance) is nottaken into account by the hypothesis testing Therefore usingFET and v2 test to compare two groups of samples could lead toa high number of false positives [14 43 46]
Regression approaches (eg Poisson quasi-Poisson negativebinomial regression) are primarily used for detecting differen-tially expressed genes using RNA-Seq data but they can also beapplied in the context of DM analysis [15] For example the readcounts can be modeled using a Poisson distribution and a modi-fied Wald test can be used to detect DM as the difference be-tween two Poisson means [56 57]
Ratio-based hypothesis testsThese hypothesis tests use methylation percentage (methyla-tion ratio) instead of count values For a particular CpG sitemethylation percentage is calculated by taking the ratio be-tween the methylated read counts and the total read counts ofthat site To compare the methylation difference level betweentwo groups (phenotypes) of samples classical tests such ast-test [58 59] moderated t-test (limma) [60] or Goemanrsquos globaltest [61] can be used While t-test is a classical approach to com-pare the means limma and Goemanrsquos test are empiricalBayesian approaches that were primarily designed to detect dif-ferentially expressed genes using microarray data When ana-lyzing methylation levels across multiple groups of samplesANOVA [62] can be used instead of multiple pair-wise compari-sons Compared with count-based hypothesis tests the ratio-based tests take into account the biological variation acrossmultiple replicates However because they only take into ac-count the ratio of the reads (methylated reads versus all reads)they ignore the sequencing depth within the CpG sites
Although classical hypothesis testing methods are some-what useful straightforward and easy to use they are not effi-cient in more sophisticated methylation analysis such asidentifying de novo regions considering spatial correlationamong the methylation levels of the CpG sites and estimating
4 | Shafi et al
methylation levels of missing CpG sites Over the past fewyears several approaches have been developed to address thesechallenges which are discussed and summarized in the follow-ing subsections
Logistic regression-based approaches
Approaches in this category model the read counts of the CpGsites by using logistic regression to identify DM One of thepopular approaches in this category is lsquomethylKitrsquo [54] whichuses logistic regression to model the methylation proportion ata given base or region when biological replicates are availableIn the absence of biological replicates methylKit uses FET toidentify DM P-values are corrected using the false discoveryrate (FDR) approach or the sliding linear model approach [63]MethylKit is commonly used to identify DMCs from predefinedregions (RRBS data) However it can also be used to identifyDMRs from WGBS data based on user-defined tiling windowsMajor contribution of methylKit is that it can take into accountthe sequencing coverage It can incorporate additional covari-ates into the model and work with CHG or CHH methylation Italso provides functionalities such as sample-wise methylationsummary sample clustering annotation and visualization ofDM etc
Another method named lsquoeDMRrsquo [64] was proposed as an ex-tension of methylKit eDMR models the distances between theneighboring CpG sites using a bimodal normal distribution andestimates DMR boundaries using a weighted cost function Afterestimating the regional boundaries DMRs are filtered based onthe mean methylation difference the number of DMCs and thenumber of CpG sites Significance of the DMRs are calculated bycombining the P-values of the DMCs using Stouffer-Liptakmethod [65] The P-values for DMRs are then corrected for mul-tiple comparisons using the FDR method eDMR provides a listof DMRs and their annotation as output
Approaches in this category take sequencing coverage intoaccount They can incorporate additional covariates into themodel as well However they do not consider the biologicalvariation among the replicates Although eDMR estimates thesignificance of the identified regions based on spatial auto cor-relation it does not consider the spatial correlation among theCpG sites when estimating the methylation levels
Smoothing-based approaches
Approaches in this category assume that methylation levels ofthe CpG sites vary smoothly across the genome They performlsquosmoothingrsquo across the samples or predefined regions which isa technique to estimate the methylation levels of the CpG sitesby borrowing information from their neighbors Group differ-ences across different conditions are computed based on theestimated methylation values of the CpG sites Finally differentstatistical tests are used to identify the differentially methy-lated sites or regions
One of the most commonly used smoothing-basedapproaches is lsquoBSmoothrsquo [43] which relies on smoothing acrossthe genome within each sample It looks for group differencesvia CpG-wise t-tests to identify DMRs between two groups TheBSmooth algorithm begins with aligning the sequencing readsto the reference genome Two alternative pipelines are availablefor the users to align the reads The first pipeline which sup-ports gaped alignment and the alignment of the paired-endbisulfite-treated reads is based on in silico bisulfite conversionthat uses the lsquoBowtie-2rsquo aligner to align the reads [32] The
second pipeline is based on a newly developed aligner namedlsquoMermanrsquo which supports the alignment of the colorspacebisulfite reads After aligning the reads sample-specific qualityassessment metrics are compiled Local likelihood smoothing isapplied within a smoothing window across the samples to esti-mate the methylation levels of the CpG sites A signal-to-noisestatistic similar to t-test is used to identify the DMCs FinallyDMRs are defined by merging the consecutive DMCs based onsome defined criteria such as a cutoff value of the t-statisticmaximum distance between the CpG sites and minimum num-ber of CpG sites
BSmooth was the first approach primarily developed forDMR identification that takes into account the biological vari-ation among replicates It reduces the required sequencingcoverage by applying the local likelihood smoothing approachacross the samples It can also identify de novo regions fromWGBS data sets On the other hand BSmooth lacks suitableerror measurement criteria within the identified DMRs As a re-sult there is no way to check whether the identified CpG sitesinside the predicted DMRs are true DMCs or selected errone-ously BSmooth predicts methylation values of the CpG sitesbased on the last observed slope Hence for the genomic re-gions that are not covered by the reads previously observedmethylation level will continue resulting in a biased estimationof the methylation level (ie extrapolated methylation values of0 and 1) [66] BSmooth is not applicable to those data sets thatdo not have biological replicates In addition BSmooth is lim-ited to comparisons between two groups of conditions
Another approach in this category lsquoBiSeqrsquo performs thesmoothing of methylation data across defined candidate re-gions instead of across the samples (like BSmooth) [66] Thepipeline begins with defining CpG clusters within the genomebased on a minimum number of lsquofrequently covered CpG sitesrsquo(CpG sites that are covered by the majority of samples) and aproximity distance threshold defined by the user A smoothingfunction is modeled for each defined cluster While modelingthe smoothing function the coverage information for each CpGsite is taken into account to make sure that the CpG site withhigh coverage has a greater impact on the estimated methyla-tion level than the CpG site with low coverage Group effects ofthe CpG sites are modeled using beta regression with probit linkfunction DMCs are identified using Wald test procedure Nexta hierarchical testing procedure is applied to identify significantclusters containing at least one DMC While testing the targetregions weighted FDR is applied to take into account the size ofindividual clusters [67] A location-wise FDR approach is appliedto trim the CpG sites that are not differentially methylatedwithin the selected significant clusters
One of the major contributions of BiSeq approach is that itprovides region-wise error control measurement to test the tar-get regions This approach is also capable of adding additionalcovariates to the regression model In contrast one of the limi-tations of the BiSeq approach is that it is only suitable for ana-lyzing experiments that have predefined regions such as RRBSdata sets
In general smoothing-based approaches have the advantageof considering the spatial correlation between the methylationlevels of the CpG sites By performing smoothing the requiredsequencing coverage and the variance of the methylation levelscan be reduced [43] Furthermore they can estimate the methy-lation levels of missing CpG sites On the other hand smooth-ing-based approaches cannot detect the low CpG densityregions where methylation has sharp changes such as tran-scription factor binding sites (TFBS) TFBS are usually small
Identifying differential methylation | 5
(ielt50 bp) which might consist of a single CpG that is differen-tially methylated [68] Thus biological events involving a singleCpG site might not be detected by the smoothing approaches Inaddition these approaches are not appropriate for biologicalsystems whose true methylation levels of the CpG sites are notspatially correlated
Beta-binomial-based approaches
Approaches in this category characterize the methylation readcounts as a beta-binomial distribution In the absence of anybiological or technical variation methylation proportion of aparticular CpG site follows a binomial distribution becausesequencing reads over a CpG site can be either methylated orunmethylated Whenever biological and technical variation arepresent in the data methylation proportions of the CpG sitesare assumed to follow a beta distribution Therefore in the pres-ence of biological replicates an appropriate statistical model formethylation analysis is the beta-binomial model as it can takeinto account both sampling and biological variability
Over the past few years several beta-binomial-basedapproaches have been developed to identify DM such as DSS[69] MOABS [70] RADMeth [71] methylSig [72] DSS-single [73]MACAU [74] DSS-general [75] and GetisDMR [76] Theseapproaches differ from each other in the way they estimate re-gression parameters calculate P-values estimate DMR bounda-ries etc
lsquoDSSrsquo is one of the approaches in this category that relies ona beta-binomial hierarchical model to identify DM using bisul-fite sequencing data In this model the prior distribution is con-structed from the whole genome which is either methylated orunmethylated True methylation proportions of the CpG sitesamong the replicates are then modeled using the beta distribu-tion parameterized by group mean and a dispersion parameterThe biological variability is captured by the beta distributionwhereas the sampling variability is captured by the binomialdistribution Variation across the methylation proportion of theCpG sites relative to the group mean is captured by the disper-sion parameter which is estimated by an empirical Bayes ap-proach When the sample size is small a shrinkage approach isused to estimate the dispersion parameter to improve the over-all performance Differentially methylated CpG sites are deter-mined by using P-values from the Wald test which isperformed by comparing the mean methylation levels betweentwo groups Lastly candidate DMRs are defined by applyinguser-specified thresholds on DMR characteristics among whichare P-value minimum length and minimum number of CpGsites
The key contribution of the DSS approach is the shrinkageprocedure that improves the dispersion parameter estimationFor this reason this approach is particularly useful when thesample size is small By applying the Wald test procedure thisapproach takes into consideration the biological variation andsequencing coverage
A more recent method named lsquoDSS-singlersquo is an improvedversion of the DSS approach which can take into account thespatial correlation among the CpG sites across the genome Inaddition DSS-single considers the within-group variation with-out biological replicates by using the neighboring CpG sites aslsquopseudo-replicatesrsquo Similar to DSS DSS-single captures thetechnical variability using binomial distribution and the biolo-gical variability using beta distribution The beta distribution isparameterized with the group mean and dispersion parameterDSS-single estimates the group mean using a smoothing
function and the dispersion parameter using an empirical Bayesprocedure Hypothesis testing is performed using the Wald testto identify the DMCs Later user-defined thresholds are appliedto define the DMR boundaries and select candidate DMRs
An even more recent variation of DSS approach namedlsquoDSS-generalrsquo identifies differentially methylated loci (DML)from bisulfite sequencing data under general experiment de-sign DSS-general identifies DML by modeling the methylationcount data for each locus using the beta-binomial regressionwith the lsquoarcsinersquo link function The lsquoarcsinersquo link function isapplied to perform a data transformation that decreases the de-pendency of the data variance on the mean and prepares it forthe next step Due to this data transformation the regressioncoefficient and the variance matrix can be estimated by apply-ing the generalized least square method as opposed to thebeta-binomial generalized linear model or logistic regressionwhich are limited when values are separable (eg values forunmethylated sites are close to 0 values for methylated sitesare close to 1) Finally Wald test is used to perform hypothesistesting
The key advantage of DSS-general approach is that it is ap-plicable to bisulfite sequencing data with multiple groups orcovariates In addition it uses lsquoarcsinersquo link function which ismore efficient than other widely used lsquologitrsquo and lsquoprobitrsquo func-tions because it estimates the regression parameters in oneiteration
lsquoMOABSrsquo is another approach that relies on beta-binomialassumption to identify DM Similar to DSS the prior distributionis constructed from the whole genome resulting in a bimodaldistribution The posterior distribution follows a beta distribu-tion which is estimated using an empirical Bayes approachWhen biological replicates are available the posterior distribu-tion is generated using the maximum likelihood approach Thesignificance of the DM between two samples is represented by asingle metric named lsquocredible methylation differencersquo whichincorporates both the biological and statistical significance ofthe DM MOABS can also work with CHG or CHH methylation
lsquoRADMethrsquo is another analysis pipeline that relies on thebeta-binomial assumption RADMeth uses a beta-binomial re-gression approach using lsquologitrsquo link function to model themethylation levels of the CpG sites across the samplesRegression parameters are estimated using a standard max-imum likelihood approach In the beta-binomial regressionmodel RADMeth incorporates the experimental factors using amodel matrix The DM of a particular site is determined by com-paring two fitted regression models (ie reduced model withoutfactors and full model with factors) using the log-likelihoodratio Subsequently P-values of the neighboring CpG sites arecombined using the weighted Z-test (ie Stouffer-Liptak test[77]) to obtain the DMRs The key contribution of this approachis the ability to analyze WGBS data in multiple factorexperiments
lsquoMethylSigrsquo is another analysis pipeline that uses beta-binomial model across the samples to identify either DMCs orDMRs The pipeline begins with taking the number of Cs and Tsas input The approach uses the beta-binomial model to esti-mate the methylation levels at each CpG site or region whichinvolves the two following steps (i) estimate the dispersion par-ameter for each CpG site or region which accounts for biologicalvariation among the samples within a group and (ii) calculatethe group methylation level at each CpG site or region using theestimated dispersion parameters In each step local informa-tion can be incorporated from nearby CpG sites or regions to in-crease statistical power The significance level of the
6 | Shafi et al
methylation difference is calculated using the likelihood ratiotest Similar to DSS MethylSig is useful when the sample size issmall MethylSig uses local information and a maximum likeli-hood estimator to compute both the methylation level and thevariance
lsquoMACAUrsquo is based on binomial mixed model (BMM) thattakes into account the population structures from a data setThis model is a generalized beta-binomial model consisting ofan extra term to model the population structure In the absenceof that extra term this model can be reduced to a beta-binomialmodel In this approach the prior distribution is constructedfrom a BMM whereas the posterior distribution is constructedfrom a log-normal distribution Model parameters are estimatedby using a Markov chain Monte Carlo (MCMC) algorithm-basedapproach Hypothesis testing is performed by using Wald testFinally DMRs are constructed by merging the DMCs using em-pirical thresholds
One advantage of this approach is that it can add a predictorvariable of interest in the model to check the association withany genetic background In addition to considering biologicalvariability among the replicates and the sampling variabilityamong the sequencing reads this method also takes into con-sideration the population variability Furthermore it can beapplied to both WGBS and RRBS data sets
lsquoGetisDMRrsquo a recent beta-binomial-based approach identi-fies variable-size DMRs directly from WGBS data by using a localGetis-Ord statistic which is commonly used to identify statistic-ally significant spatial clusters (hotspots) By incorporating thisstatistic into DM analysis GetisDMR accounts for spatial correl-ation among the methylation levels of the CpG sites along withthe biological and sampling variability When biological repli-cates are available beta-binomial regression with logistic linkfunction is used to model the methylation level of each CpGsite Model parameters are estimated by using the maximumlikelihood function Hypothesis testing is performed by usingthe likelihood ratio test In the absence of biological replicatesmethylation levels are modeled by using binomial distributionand hypothesis testing is performed by using FET P-valuesfrom the hypothesis testing are further used to calculatez-scores Finally a local Getis-Ord statistic is used based on thez-scores to identify DMRs using the information from the neigh-boring CpG sites The Getis-Ord statistic uses the distribution ofthe data (ie z-scores) to compute a score of the nonrandom as-sociation between a data point and its neighbors where a posi-tive score shows a positive association and a negative scoreshows a negative association This statistic is then used to iden-tify data regions with points that exhibit nonrandom associ-ations (ie DMRs)
One of the primary strengths of GetisDMR is that it can de-tect DMRs with variable length instead of depending on user-specified threshold parameters It can take into account thespatial correlation between the neighboring CpG sitesAdditionally it can incorporate additional confounding factorsinto the model Furthermore it can work with multiple groupswith or without biological replicates One drawback of this ap-proach is that it cannot work with enriched regions such asRRBS data
Beta-binomial-based approaches are useful because theytake into account both sampling variability among the readcounts and biological variability among the replicatesFurthermore these approaches are able to identify DM at sin-gle-base resolution from low CpG-density regions (eg TFBS)On the other hand most of the beta-binomial-based approaches(except DSS-single MACAU and GetisDMR) do not take into
account the spatial correlation between the methylation levelsof the CpG sites
Hidden Markov model-based approaches
Approaches in this category use hidden Markov model (HMM) toidentify differentially methylated patterns from bisulfitesequencing data These approaches model the methylation lev-els of the CpG sites as methylation states (ie hypermethyla-tion hypomethylation and no change) instead of continuousmethylation values Transition probabilities among the methy-lation states represent the distance distribution among theDMCs whereas emission probabilities represent the likelihoodof DM for the CpG sites High transition probabilities and lowtransition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within theirmethylation levels respectively Parameters are estimated usu-ally by using established learning algorithms whereas potentialDMRs are identified using different statistical approaches
One of the approaches in this category named lsquoComMetrsquo [64]included in the Bisulfighter methylation analysis suite [78 79]combines all the samples within a group into one sample andidentifies the DMRs by comparing a pair of two samples Thismethod captures the probability distribution of distances be-tween the neighboring DMCs and adjusts the DMC chaining cri-teria automatically for each data set Transition probabilitiesare estimated using an expectation maximization algorithmwhereas emission probabilities are estimated from a beta-binomial mixture model Parameters of the beta-binomialmodel are estimated by incorporating an unsupervised learningalgorithm DMRs are identified by using a dynamic program-ming algorithm
One of the advantages of ComMet is that it does not requirebiological replicates to identify DMRs It takes into account thesequencing coverage and the spatial distribution of the neigh-boring CpG sites On the other hand one of the limitations ofthis approach is that it does not take into account the biologicalvariation across replicates which might lead to higher numberof false positives in the results [14 43 46]
Another approach in this category is lsquoHMM-Fisherrsquo [80]which estimates the methylation status of the CpG sites foreach sample instead of combining all the samples Similar toComMet HMM-Fisher models both the similarity and dissimi-larity of the methylation levels of the neighboring CpG sitesusing transition probability HMM-Fisher estimates the transi-tion probabilities using a Dirichlet distribution whereas emis-sion probabilities are computed using a truncated normaldistribution After estimating the methylation levels of all theCpG sites for each sample differentially methylated CpG sitesare identified using FET Identified DMCs are further groupedinto DMRs if the distance between the CpG sites is lt100 basesNon-consecutive CpG sites are reported as DMCs in the output
One of the major contributions of HMM-Fisher is that it canidentify DMRs of variable size instead of depending on user-defined boundary thresholds It takes the biological variationamong the replicates into account and can provide both DMCsand DMRs as output It can also be used to identify sample-wisemethylation patterns
lsquoHMM-DMrsquo [81] is another approach that uses HMM to iden-tify DM HMM-DM directly estimates the DM states of the CpGsites for each sample across the groups In this approach thetransition probability of each CpG site only depends on themethylation state of the immediate previous CpG site LikeHMM-Fisher and ComMet the transition probabilities are
Identifying differential methylation | 7
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
methylation levels of missing CpG sites Over the past fewyears several approaches have been developed to address thesechallenges which are discussed and summarized in the follow-ing subsections
Logistic regression-based approaches
Approaches in this category model the read counts of the CpGsites by using logistic regression to identify DM One of thepopular approaches in this category is lsquomethylKitrsquo [54] whichuses logistic regression to model the methylation proportion ata given base or region when biological replicates are availableIn the absence of biological replicates methylKit uses FET toidentify DM P-values are corrected using the false discoveryrate (FDR) approach or the sliding linear model approach [63]MethylKit is commonly used to identify DMCs from predefinedregions (RRBS data) However it can also be used to identifyDMRs from WGBS data based on user-defined tiling windowsMajor contribution of methylKit is that it can take into accountthe sequencing coverage It can incorporate additional covari-ates into the model and work with CHG or CHH methylation Italso provides functionalities such as sample-wise methylationsummary sample clustering annotation and visualization ofDM etc
Another method named lsquoeDMRrsquo [64] was proposed as an ex-tension of methylKit eDMR models the distances between theneighboring CpG sites using a bimodal normal distribution andestimates DMR boundaries using a weighted cost function Afterestimating the regional boundaries DMRs are filtered based onthe mean methylation difference the number of DMCs and thenumber of CpG sites Significance of the DMRs are calculated bycombining the P-values of the DMCs using Stouffer-Liptakmethod [65] The P-values for DMRs are then corrected for mul-tiple comparisons using the FDR method eDMR provides a listof DMRs and their annotation as output
Approaches in this category take sequencing coverage intoaccount They can incorporate additional covariates into themodel as well However they do not consider the biologicalvariation among the replicates Although eDMR estimates thesignificance of the identified regions based on spatial auto cor-relation it does not consider the spatial correlation among theCpG sites when estimating the methylation levels
Smoothing-based approaches
Approaches in this category assume that methylation levels ofthe CpG sites vary smoothly across the genome They performlsquosmoothingrsquo across the samples or predefined regions which isa technique to estimate the methylation levels of the CpG sitesby borrowing information from their neighbors Group differ-ences across different conditions are computed based on theestimated methylation values of the CpG sites Finally differentstatistical tests are used to identify the differentially methy-lated sites or regions
One of the most commonly used smoothing-basedapproaches is lsquoBSmoothrsquo [43] which relies on smoothing acrossthe genome within each sample It looks for group differencesvia CpG-wise t-tests to identify DMRs between two groups TheBSmooth algorithm begins with aligning the sequencing readsto the reference genome Two alternative pipelines are availablefor the users to align the reads The first pipeline which sup-ports gaped alignment and the alignment of the paired-endbisulfite-treated reads is based on in silico bisulfite conversionthat uses the lsquoBowtie-2rsquo aligner to align the reads [32] The
second pipeline is based on a newly developed aligner namedlsquoMermanrsquo which supports the alignment of the colorspacebisulfite reads After aligning the reads sample-specific qualityassessment metrics are compiled Local likelihood smoothing isapplied within a smoothing window across the samples to esti-mate the methylation levels of the CpG sites A signal-to-noisestatistic similar to t-test is used to identify the DMCs FinallyDMRs are defined by merging the consecutive DMCs based onsome defined criteria such as a cutoff value of the t-statisticmaximum distance between the CpG sites and minimum num-ber of CpG sites
BSmooth was the first approach primarily developed forDMR identification that takes into account the biological vari-ation among replicates It reduces the required sequencingcoverage by applying the local likelihood smoothing approachacross the samples It can also identify de novo regions fromWGBS data sets On the other hand BSmooth lacks suitableerror measurement criteria within the identified DMRs As a re-sult there is no way to check whether the identified CpG sitesinside the predicted DMRs are true DMCs or selected errone-ously BSmooth predicts methylation values of the CpG sitesbased on the last observed slope Hence for the genomic re-gions that are not covered by the reads previously observedmethylation level will continue resulting in a biased estimationof the methylation level (ie extrapolated methylation values of0 and 1) [66] BSmooth is not applicable to those data sets thatdo not have biological replicates In addition BSmooth is lim-ited to comparisons between two groups of conditions
Another approach in this category lsquoBiSeqrsquo performs thesmoothing of methylation data across defined candidate re-gions instead of across the samples (like BSmooth) [66] Thepipeline begins with defining CpG clusters within the genomebased on a minimum number of lsquofrequently covered CpG sitesrsquo(CpG sites that are covered by the majority of samples) and aproximity distance threshold defined by the user A smoothingfunction is modeled for each defined cluster While modelingthe smoothing function the coverage information for each CpGsite is taken into account to make sure that the CpG site withhigh coverage has a greater impact on the estimated methyla-tion level than the CpG site with low coverage Group effects ofthe CpG sites are modeled using beta regression with probit linkfunction DMCs are identified using Wald test procedure Nexta hierarchical testing procedure is applied to identify significantclusters containing at least one DMC While testing the targetregions weighted FDR is applied to take into account the size ofindividual clusters [67] A location-wise FDR approach is appliedto trim the CpG sites that are not differentially methylatedwithin the selected significant clusters
One of the major contributions of BiSeq approach is that itprovides region-wise error control measurement to test the tar-get regions This approach is also capable of adding additionalcovariates to the regression model In contrast one of the limi-tations of the BiSeq approach is that it is only suitable for ana-lyzing experiments that have predefined regions such as RRBSdata sets
In general smoothing-based approaches have the advantageof considering the spatial correlation between the methylationlevels of the CpG sites By performing smoothing the requiredsequencing coverage and the variance of the methylation levelscan be reduced [43] Furthermore they can estimate the methy-lation levels of missing CpG sites On the other hand smooth-ing-based approaches cannot detect the low CpG densityregions where methylation has sharp changes such as tran-scription factor binding sites (TFBS) TFBS are usually small
Identifying differential methylation | 5
(ielt50 bp) which might consist of a single CpG that is differen-tially methylated [68] Thus biological events involving a singleCpG site might not be detected by the smoothing approaches Inaddition these approaches are not appropriate for biologicalsystems whose true methylation levels of the CpG sites are notspatially correlated
Beta-binomial-based approaches
Approaches in this category characterize the methylation readcounts as a beta-binomial distribution In the absence of anybiological or technical variation methylation proportion of aparticular CpG site follows a binomial distribution becausesequencing reads over a CpG site can be either methylated orunmethylated Whenever biological and technical variation arepresent in the data methylation proportions of the CpG sitesare assumed to follow a beta distribution Therefore in the pres-ence of biological replicates an appropriate statistical model formethylation analysis is the beta-binomial model as it can takeinto account both sampling and biological variability
Over the past few years several beta-binomial-basedapproaches have been developed to identify DM such as DSS[69] MOABS [70] RADMeth [71] methylSig [72] DSS-single [73]MACAU [74] DSS-general [75] and GetisDMR [76] Theseapproaches differ from each other in the way they estimate re-gression parameters calculate P-values estimate DMR bounda-ries etc
lsquoDSSrsquo is one of the approaches in this category that relies ona beta-binomial hierarchical model to identify DM using bisul-fite sequencing data In this model the prior distribution is con-structed from the whole genome which is either methylated orunmethylated True methylation proportions of the CpG sitesamong the replicates are then modeled using the beta distribu-tion parameterized by group mean and a dispersion parameterThe biological variability is captured by the beta distributionwhereas the sampling variability is captured by the binomialdistribution Variation across the methylation proportion of theCpG sites relative to the group mean is captured by the disper-sion parameter which is estimated by an empirical Bayes ap-proach When the sample size is small a shrinkage approach isused to estimate the dispersion parameter to improve the over-all performance Differentially methylated CpG sites are deter-mined by using P-values from the Wald test which isperformed by comparing the mean methylation levels betweentwo groups Lastly candidate DMRs are defined by applyinguser-specified thresholds on DMR characteristics among whichare P-value minimum length and minimum number of CpGsites
The key contribution of the DSS approach is the shrinkageprocedure that improves the dispersion parameter estimationFor this reason this approach is particularly useful when thesample size is small By applying the Wald test procedure thisapproach takes into consideration the biological variation andsequencing coverage
A more recent method named lsquoDSS-singlersquo is an improvedversion of the DSS approach which can take into account thespatial correlation among the CpG sites across the genome Inaddition DSS-single considers the within-group variation with-out biological replicates by using the neighboring CpG sites aslsquopseudo-replicatesrsquo Similar to DSS DSS-single captures thetechnical variability using binomial distribution and the biolo-gical variability using beta distribution The beta distribution isparameterized with the group mean and dispersion parameterDSS-single estimates the group mean using a smoothing
function and the dispersion parameter using an empirical Bayesprocedure Hypothesis testing is performed using the Wald testto identify the DMCs Later user-defined thresholds are appliedto define the DMR boundaries and select candidate DMRs
An even more recent variation of DSS approach namedlsquoDSS-generalrsquo identifies differentially methylated loci (DML)from bisulfite sequencing data under general experiment de-sign DSS-general identifies DML by modeling the methylationcount data for each locus using the beta-binomial regressionwith the lsquoarcsinersquo link function The lsquoarcsinersquo link function isapplied to perform a data transformation that decreases the de-pendency of the data variance on the mean and prepares it forthe next step Due to this data transformation the regressioncoefficient and the variance matrix can be estimated by apply-ing the generalized least square method as opposed to thebeta-binomial generalized linear model or logistic regressionwhich are limited when values are separable (eg values forunmethylated sites are close to 0 values for methylated sitesare close to 1) Finally Wald test is used to perform hypothesistesting
The key advantage of DSS-general approach is that it is ap-plicable to bisulfite sequencing data with multiple groups orcovariates In addition it uses lsquoarcsinersquo link function which ismore efficient than other widely used lsquologitrsquo and lsquoprobitrsquo func-tions because it estimates the regression parameters in oneiteration
lsquoMOABSrsquo is another approach that relies on beta-binomialassumption to identify DM Similar to DSS the prior distributionis constructed from the whole genome resulting in a bimodaldistribution The posterior distribution follows a beta distribu-tion which is estimated using an empirical Bayes approachWhen biological replicates are available the posterior distribu-tion is generated using the maximum likelihood approach Thesignificance of the DM between two samples is represented by asingle metric named lsquocredible methylation differencersquo whichincorporates both the biological and statistical significance ofthe DM MOABS can also work with CHG or CHH methylation
lsquoRADMethrsquo is another analysis pipeline that relies on thebeta-binomial assumption RADMeth uses a beta-binomial re-gression approach using lsquologitrsquo link function to model themethylation levels of the CpG sites across the samplesRegression parameters are estimated using a standard max-imum likelihood approach In the beta-binomial regressionmodel RADMeth incorporates the experimental factors using amodel matrix The DM of a particular site is determined by com-paring two fitted regression models (ie reduced model withoutfactors and full model with factors) using the log-likelihoodratio Subsequently P-values of the neighboring CpG sites arecombined using the weighted Z-test (ie Stouffer-Liptak test[77]) to obtain the DMRs The key contribution of this approachis the ability to analyze WGBS data in multiple factorexperiments
lsquoMethylSigrsquo is another analysis pipeline that uses beta-binomial model across the samples to identify either DMCs orDMRs The pipeline begins with taking the number of Cs and Tsas input The approach uses the beta-binomial model to esti-mate the methylation levels at each CpG site or region whichinvolves the two following steps (i) estimate the dispersion par-ameter for each CpG site or region which accounts for biologicalvariation among the samples within a group and (ii) calculatethe group methylation level at each CpG site or region using theestimated dispersion parameters In each step local informa-tion can be incorporated from nearby CpG sites or regions to in-crease statistical power The significance level of the
6 | Shafi et al
methylation difference is calculated using the likelihood ratiotest Similar to DSS MethylSig is useful when the sample size issmall MethylSig uses local information and a maximum likeli-hood estimator to compute both the methylation level and thevariance
lsquoMACAUrsquo is based on binomial mixed model (BMM) thattakes into account the population structures from a data setThis model is a generalized beta-binomial model consisting ofan extra term to model the population structure In the absenceof that extra term this model can be reduced to a beta-binomialmodel In this approach the prior distribution is constructedfrom a BMM whereas the posterior distribution is constructedfrom a log-normal distribution Model parameters are estimatedby using a Markov chain Monte Carlo (MCMC) algorithm-basedapproach Hypothesis testing is performed by using Wald testFinally DMRs are constructed by merging the DMCs using em-pirical thresholds
One advantage of this approach is that it can add a predictorvariable of interest in the model to check the association withany genetic background In addition to considering biologicalvariability among the replicates and the sampling variabilityamong the sequencing reads this method also takes into con-sideration the population variability Furthermore it can beapplied to both WGBS and RRBS data sets
lsquoGetisDMRrsquo a recent beta-binomial-based approach identi-fies variable-size DMRs directly from WGBS data by using a localGetis-Ord statistic which is commonly used to identify statistic-ally significant spatial clusters (hotspots) By incorporating thisstatistic into DM analysis GetisDMR accounts for spatial correl-ation among the methylation levels of the CpG sites along withthe biological and sampling variability When biological repli-cates are available beta-binomial regression with logistic linkfunction is used to model the methylation level of each CpGsite Model parameters are estimated by using the maximumlikelihood function Hypothesis testing is performed by usingthe likelihood ratio test In the absence of biological replicatesmethylation levels are modeled by using binomial distributionand hypothesis testing is performed by using FET P-valuesfrom the hypothesis testing are further used to calculatez-scores Finally a local Getis-Ord statistic is used based on thez-scores to identify DMRs using the information from the neigh-boring CpG sites The Getis-Ord statistic uses the distribution ofthe data (ie z-scores) to compute a score of the nonrandom as-sociation between a data point and its neighbors where a posi-tive score shows a positive association and a negative scoreshows a negative association This statistic is then used to iden-tify data regions with points that exhibit nonrandom associ-ations (ie DMRs)
One of the primary strengths of GetisDMR is that it can de-tect DMRs with variable length instead of depending on user-specified threshold parameters It can take into account thespatial correlation between the neighboring CpG sitesAdditionally it can incorporate additional confounding factorsinto the model Furthermore it can work with multiple groupswith or without biological replicates One drawback of this ap-proach is that it cannot work with enriched regions such asRRBS data
Beta-binomial-based approaches are useful because theytake into account both sampling variability among the readcounts and biological variability among the replicatesFurthermore these approaches are able to identify DM at sin-gle-base resolution from low CpG-density regions (eg TFBS)On the other hand most of the beta-binomial-based approaches(except DSS-single MACAU and GetisDMR) do not take into
account the spatial correlation between the methylation levelsof the CpG sites
Hidden Markov model-based approaches
Approaches in this category use hidden Markov model (HMM) toidentify differentially methylated patterns from bisulfitesequencing data These approaches model the methylation lev-els of the CpG sites as methylation states (ie hypermethyla-tion hypomethylation and no change) instead of continuousmethylation values Transition probabilities among the methy-lation states represent the distance distribution among theDMCs whereas emission probabilities represent the likelihoodof DM for the CpG sites High transition probabilities and lowtransition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within theirmethylation levels respectively Parameters are estimated usu-ally by using established learning algorithms whereas potentialDMRs are identified using different statistical approaches
One of the approaches in this category named lsquoComMetrsquo [64]included in the Bisulfighter methylation analysis suite [78 79]combines all the samples within a group into one sample andidentifies the DMRs by comparing a pair of two samples Thismethod captures the probability distribution of distances be-tween the neighboring DMCs and adjusts the DMC chaining cri-teria automatically for each data set Transition probabilitiesare estimated using an expectation maximization algorithmwhereas emission probabilities are estimated from a beta-binomial mixture model Parameters of the beta-binomialmodel are estimated by incorporating an unsupervised learningalgorithm DMRs are identified by using a dynamic program-ming algorithm
One of the advantages of ComMet is that it does not requirebiological replicates to identify DMRs It takes into account thesequencing coverage and the spatial distribution of the neigh-boring CpG sites On the other hand one of the limitations ofthis approach is that it does not take into account the biologicalvariation across replicates which might lead to higher numberof false positives in the results [14 43 46]
Another approach in this category is lsquoHMM-Fisherrsquo [80]which estimates the methylation status of the CpG sites foreach sample instead of combining all the samples Similar toComMet HMM-Fisher models both the similarity and dissimi-larity of the methylation levels of the neighboring CpG sitesusing transition probability HMM-Fisher estimates the transi-tion probabilities using a Dirichlet distribution whereas emis-sion probabilities are computed using a truncated normaldistribution After estimating the methylation levels of all theCpG sites for each sample differentially methylated CpG sitesare identified using FET Identified DMCs are further groupedinto DMRs if the distance between the CpG sites is lt100 basesNon-consecutive CpG sites are reported as DMCs in the output
One of the major contributions of HMM-Fisher is that it canidentify DMRs of variable size instead of depending on user-defined boundary thresholds It takes the biological variationamong the replicates into account and can provide both DMCsand DMRs as output It can also be used to identify sample-wisemethylation patterns
lsquoHMM-DMrsquo [81] is another approach that uses HMM to iden-tify DM HMM-DM directly estimates the DM states of the CpGsites for each sample across the groups In this approach thetransition probability of each CpG site only depends on themethylation state of the immediate previous CpG site LikeHMM-Fisher and ComMet the transition probabilities are
Identifying differential methylation | 7
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
(ielt50 bp) which might consist of a single CpG that is differen-tially methylated [68] Thus biological events involving a singleCpG site might not be detected by the smoothing approaches Inaddition these approaches are not appropriate for biologicalsystems whose true methylation levels of the CpG sites are notspatially correlated
Beta-binomial-based approaches
Approaches in this category characterize the methylation readcounts as a beta-binomial distribution In the absence of anybiological or technical variation methylation proportion of aparticular CpG site follows a binomial distribution becausesequencing reads over a CpG site can be either methylated orunmethylated Whenever biological and technical variation arepresent in the data methylation proportions of the CpG sitesare assumed to follow a beta distribution Therefore in the pres-ence of biological replicates an appropriate statistical model formethylation analysis is the beta-binomial model as it can takeinto account both sampling and biological variability
Over the past few years several beta-binomial-basedapproaches have been developed to identify DM such as DSS[69] MOABS [70] RADMeth [71] methylSig [72] DSS-single [73]MACAU [74] DSS-general [75] and GetisDMR [76] Theseapproaches differ from each other in the way they estimate re-gression parameters calculate P-values estimate DMR bounda-ries etc
lsquoDSSrsquo is one of the approaches in this category that relies ona beta-binomial hierarchical model to identify DM using bisul-fite sequencing data In this model the prior distribution is con-structed from the whole genome which is either methylated orunmethylated True methylation proportions of the CpG sitesamong the replicates are then modeled using the beta distribu-tion parameterized by group mean and a dispersion parameterThe biological variability is captured by the beta distributionwhereas the sampling variability is captured by the binomialdistribution Variation across the methylation proportion of theCpG sites relative to the group mean is captured by the disper-sion parameter which is estimated by an empirical Bayes ap-proach When the sample size is small a shrinkage approach isused to estimate the dispersion parameter to improve the over-all performance Differentially methylated CpG sites are deter-mined by using P-values from the Wald test which isperformed by comparing the mean methylation levels betweentwo groups Lastly candidate DMRs are defined by applyinguser-specified thresholds on DMR characteristics among whichare P-value minimum length and minimum number of CpGsites
The key contribution of the DSS approach is the shrinkageprocedure that improves the dispersion parameter estimationFor this reason this approach is particularly useful when thesample size is small By applying the Wald test procedure thisapproach takes into consideration the biological variation andsequencing coverage
A more recent method named lsquoDSS-singlersquo is an improvedversion of the DSS approach which can take into account thespatial correlation among the CpG sites across the genome Inaddition DSS-single considers the within-group variation with-out biological replicates by using the neighboring CpG sites aslsquopseudo-replicatesrsquo Similar to DSS DSS-single captures thetechnical variability using binomial distribution and the biolo-gical variability using beta distribution The beta distribution isparameterized with the group mean and dispersion parameterDSS-single estimates the group mean using a smoothing
function and the dispersion parameter using an empirical Bayesprocedure Hypothesis testing is performed using the Wald testto identify the DMCs Later user-defined thresholds are appliedto define the DMR boundaries and select candidate DMRs
An even more recent variation of DSS approach namedlsquoDSS-generalrsquo identifies differentially methylated loci (DML)from bisulfite sequencing data under general experiment de-sign DSS-general identifies DML by modeling the methylationcount data for each locus using the beta-binomial regressionwith the lsquoarcsinersquo link function The lsquoarcsinersquo link function isapplied to perform a data transformation that decreases the de-pendency of the data variance on the mean and prepares it forthe next step Due to this data transformation the regressioncoefficient and the variance matrix can be estimated by apply-ing the generalized least square method as opposed to thebeta-binomial generalized linear model or logistic regressionwhich are limited when values are separable (eg values forunmethylated sites are close to 0 values for methylated sitesare close to 1) Finally Wald test is used to perform hypothesistesting
The key advantage of DSS-general approach is that it is ap-plicable to bisulfite sequencing data with multiple groups orcovariates In addition it uses lsquoarcsinersquo link function which ismore efficient than other widely used lsquologitrsquo and lsquoprobitrsquo func-tions because it estimates the regression parameters in oneiteration
lsquoMOABSrsquo is another approach that relies on beta-binomialassumption to identify DM Similar to DSS the prior distributionis constructed from the whole genome resulting in a bimodaldistribution The posterior distribution follows a beta distribu-tion which is estimated using an empirical Bayes approachWhen biological replicates are available the posterior distribu-tion is generated using the maximum likelihood approach Thesignificance of the DM between two samples is represented by asingle metric named lsquocredible methylation differencersquo whichincorporates both the biological and statistical significance ofthe DM MOABS can also work with CHG or CHH methylation
lsquoRADMethrsquo is another analysis pipeline that relies on thebeta-binomial assumption RADMeth uses a beta-binomial re-gression approach using lsquologitrsquo link function to model themethylation levels of the CpG sites across the samplesRegression parameters are estimated using a standard max-imum likelihood approach In the beta-binomial regressionmodel RADMeth incorporates the experimental factors using amodel matrix The DM of a particular site is determined by com-paring two fitted regression models (ie reduced model withoutfactors and full model with factors) using the log-likelihoodratio Subsequently P-values of the neighboring CpG sites arecombined using the weighted Z-test (ie Stouffer-Liptak test[77]) to obtain the DMRs The key contribution of this approachis the ability to analyze WGBS data in multiple factorexperiments
lsquoMethylSigrsquo is another analysis pipeline that uses beta-binomial model across the samples to identify either DMCs orDMRs The pipeline begins with taking the number of Cs and Tsas input The approach uses the beta-binomial model to esti-mate the methylation levels at each CpG site or region whichinvolves the two following steps (i) estimate the dispersion par-ameter for each CpG site or region which accounts for biologicalvariation among the samples within a group and (ii) calculatethe group methylation level at each CpG site or region using theestimated dispersion parameters In each step local informa-tion can be incorporated from nearby CpG sites or regions to in-crease statistical power The significance level of the
6 | Shafi et al
methylation difference is calculated using the likelihood ratiotest Similar to DSS MethylSig is useful when the sample size issmall MethylSig uses local information and a maximum likeli-hood estimator to compute both the methylation level and thevariance
lsquoMACAUrsquo is based on binomial mixed model (BMM) thattakes into account the population structures from a data setThis model is a generalized beta-binomial model consisting ofan extra term to model the population structure In the absenceof that extra term this model can be reduced to a beta-binomialmodel In this approach the prior distribution is constructedfrom a BMM whereas the posterior distribution is constructedfrom a log-normal distribution Model parameters are estimatedby using a Markov chain Monte Carlo (MCMC) algorithm-basedapproach Hypothesis testing is performed by using Wald testFinally DMRs are constructed by merging the DMCs using em-pirical thresholds
One advantage of this approach is that it can add a predictorvariable of interest in the model to check the association withany genetic background In addition to considering biologicalvariability among the replicates and the sampling variabilityamong the sequencing reads this method also takes into con-sideration the population variability Furthermore it can beapplied to both WGBS and RRBS data sets
lsquoGetisDMRrsquo a recent beta-binomial-based approach identi-fies variable-size DMRs directly from WGBS data by using a localGetis-Ord statistic which is commonly used to identify statistic-ally significant spatial clusters (hotspots) By incorporating thisstatistic into DM analysis GetisDMR accounts for spatial correl-ation among the methylation levels of the CpG sites along withthe biological and sampling variability When biological repli-cates are available beta-binomial regression with logistic linkfunction is used to model the methylation level of each CpGsite Model parameters are estimated by using the maximumlikelihood function Hypothesis testing is performed by usingthe likelihood ratio test In the absence of biological replicatesmethylation levels are modeled by using binomial distributionand hypothesis testing is performed by using FET P-valuesfrom the hypothesis testing are further used to calculatez-scores Finally a local Getis-Ord statistic is used based on thez-scores to identify DMRs using the information from the neigh-boring CpG sites The Getis-Ord statistic uses the distribution ofthe data (ie z-scores) to compute a score of the nonrandom as-sociation between a data point and its neighbors where a posi-tive score shows a positive association and a negative scoreshows a negative association This statistic is then used to iden-tify data regions with points that exhibit nonrandom associ-ations (ie DMRs)
One of the primary strengths of GetisDMR is that it can de-tect DMRs with variable length instead of depending on user-specified threshold parameters It can take into account thespatial correlation between the neighboring CpG sitesAdditionally it can incorporate additional confounding factorsinto the model Furthermore it can work with multiple groupswith or without biological replicates One drawback of this ap-proach is that it cannot work with enriched regions such asRRBS data
Beta-binomial-based approaches are useful because theytake into account both sampling variability among the readcounts and biological variability among the replicatesFurthermore these approaches are able to identify DM at sin-gle-base resolution from low CpG-density regions (eg TFBS)On the other hand most of the beta-binomial-based approaches(except DSS-single MACAU and GetisDMR) do not take into
account the spatial correlation between the methylation levelsof the CpG sites
Hidden Markov model-based approaches
Approaches in this category use hidden Markov model (HMM) toidentify differentially methylated patterns from bisulfitesequencing data These approaches model the methylation lev-els of the CpG sites as methylation states (ie hypermethyla-tion hypomethylation and no change) instead of continuousmethylation values Transition probabilities among the methy-lation states represent the distance distribution among theDMCs whereas emission probabilities represent the likelihoodof DM for the CpG sites High transition probabilities and lowtransition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within theirmethylation levels respectively Parameters are estimated usu-ally by using established learning algorithms whereas potentialDMRs are identified using different statistical approaches
One of the approaches in this category named lsquoComMetrsquo [64]included in the Bisulfighter methylation analysis suite [78 79]combines all the samples within a group into one sample andidentifies the DMRs by comparing a pair of two samples Thismethod captures the probability distribution of distances be-tween the neighboring DMCs and adjusts the DMC chaining cri-teria automatically for each data set Transition probabilitiesare estimated using an expectation maximization algorithmwhereas emission probabilities are estimated from a beta-binomial mixture model Parameters of the beta-binomialmodel are estimated by incorporating an unsupervised learningalgorithm DMRs are identified by using a dynamic program-ming algorithm
One of the advantages of ComMet is that it does not requirebiological replicates to identify DMRs It takes into account thesequencing coverage and the spatial distribution of the neigh-boring CpG sites On the other hand one of the limitations ofthis approach is that it does not take into account the biologicalvariation across replicates which might lead to higher numberof false positives in the results [14 43 46]
Another approach in this category is lsquoHMM-Fisherrsquo [80]which estimates the methylation status of the CpG sites foreach sample instead of combining all the samples Similar toComMet HMM-Fisher models both the similarity and dissimi-larity of the methylation levels of the neighboring CpG sitesusing transition probability HMM-Fisher estimates the transi-tion probabilities using a Dirichlet distribution whereas emis-sion probabilities are computed using a truncated normaldistribution After estimating the methylation levels of all theCpG sites for each sample differentially methylated CpG sitesare identified using FET Identified DMCs are further groupedinto DMRs if the distance between the CpG sites is lt100 basesNon-consecutive CpG sites are reported as DMCs in the output
One of the major contributions of HMM-Fisher is that it canidentify DMRs of variable size instead of depending on user-defined boundary thresholds It takes the biological variationamong the replicates into account and can provide both DMCsand DMRs as output It can also be used to identify sample-wisemethylation patterns
lsquoHMM-DMrsquo [81] is another approach that uses HMM to iden-tify DM HMM-DM directly estimates the DM states of the CpGsites for each sample across the groups In this approach thetransition probability of each CpG site only depends on themethylation state of the immediate previous CpG site LikeHMM-Fisher and ComMet the transition probabilities are
Identifying differential methylation | 7
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
methylation difference is calculated using the likelihood ratiotest Similar to DSS MethylSig is useful when the sample size issmall MethylSig uses local information and a maximum likeli-hood estimator to compute both the methylation level and thevariance
lsquoMACAUrsquo is based on binomial mixed model (BMM) thattakes into account the population structures from a data setThis model is a generalized beta-binomial model consisting ofan extra term to model the population structure In the absenceof that extra term this model can be reduced to a beta-binomialmodel In this approach the prior distribution is constructedfrom a BMM whereas the posterior distribution is constructedfrom a log-normal distribution Model parameters are estimatedby using a Markov chain Monte Carlo (MCMC) algorithm-basedapproach Hypothesis testing is performed by using Wald testFinally DMRs are constructed by merging the DMCs using em-pirical thresholds
One advantage of this approach is that it can add a predictorvariable of interest in the model to check the association withany genetic background In addition to considering biologicalvariability among the replicates and the sampling variabilityamong the sequencing reads this method also takes into con-sideration the population variability Furthermore it can beapplied to both WGBS and RRBS data sets
lsquoGetisDMRrsquo a recent beta-binomial-based approach identi-fies variable-size DMRs directly from WGBS data by using a localGetis-Ord statistic which is commonly used to identify statistic-ally significant spatial clusters (hotspots) By incorporating thisstatistic into DM analysis GetisDMR accounts for spatial correl-ation among the methylation levels of the CpG sites along withthe biological and sampling variability When biological repli-cates are available beta-binomial regression with logistic linkfunction is used to model the methylation level of each CpGsite Model parameters are estimated by using the maximumlikelihood function Hypothesis testing is performed by usingthe likelihood ratio test In the absence of biological replicatesmethylation levels are modeled by using binomial distributionand hypothesis testing is performed by using FET P-valuesfrom the hypothesis testing are further used to calculatez-scores Finally a local Getis-Ord statistic is used based on thez-scores to identify DMRs using the information from the neigh-boring CpG sites The Getis-Ord statistic uses the distribution ofthe data (ie z-scores) to compute a score of the nonrandom as-sociation between a data point and its neighbors where a posi-tive score shows a positive association and a negative scoreshows a negative association This statistic is then used to iden-tify data regions with points that exhibit nonrandom associ-ations (ie DMRs)
One of the primary strengths of GetisDMR is that it can de-tect DMRs with variable length instead of depending on user-specified threshold parameters It can take into account thespatial correlation between the neighboring CpG sitesAdditionally it can incorporate additional confounding factorsinto the model Furthermore it can work with multiple groupswith or without biological replicates One drawback of this ap-proach is that it cannot work with enriched regions such asRRBS data
Beta-binomial-based approaches are useful because theytake into account both sampling variability among the readcounts and biological variability among the replicatesFurthermore these approaches are able to identify DM at sin-gle-base resolution from low CpG-density regions (eg TFBS)On the other hand most of the beta-binomial-based approaches(except DSS-single MACAU and GetisDMR) do not take into
account the spatial correlation between the methylation levelsof the CpG sites
Hidden Markov model-based approaches
Approaches in this category use hidden Markov model (HMM) toidentify differentially methylated patterns from bisulfitesequencing data These approaches model the methylation lev-els of the CpG sites as methylation states (ie hypermethyla-tion hypomethylation and no change) instead of continuousmethylation values Transition probabilities among the methy-lation states represent the distance distribution among theDMCs whereas emission probabilities represent the likelihoodof DM for the CpG sites High transition probabilities and lowtransition probabilities are used to model the neighboring CpGsites that have high similarities and low similarities within theirmethylation levels respectively Parameters are estimated usu-ally by using established learning algorithms whereas potentialDMRs are identified using different statistical approaches
One of the approaches in this category named lsquoComMetrsquo [64]included in the Bisulfighter methylation analysis suite [78 79]combines all the samples within a group into one sample andidentifies the DMRs by comparing a pair of two samples Thismethod captures the probability distribution of distances be-tween the neighboring DMCs and adjusts the DMC chaining cri-teria automatically for each data set Transition probabilitiesare estimated using an expectation maximization algorithmwhereas emission probabilities are estimated from a beta-binomial mixture model Parameters of the beta-binomialmodel are estimated by incorporating an unsupervised learningalgorithm DMRs are identified by using a dynamic program-ming algorithm
One of the advantages of ComMet is that it does not requirebiological replicates to identify DMRs It takes into account thesequencing coverage and the spatial distribution of the neigh-boring CpG sites On the other hand one of the limitations ofthis approach is that it does not take into account the biologicalvariation across replicates which might lead to higher numberof false positives in the results [14 43 46]
Another approach in this category is lsquoHMM-Fisherrsquo [80]which estimates the methylation status of the CpG sites foreach sample instead of combining all the samples Similar toComMet HMM-Fisher models both the similarity and dissimi-larity of the methylation levels of the neighboring CpG sitesusing transition probability HMM-Fisher estimates the transi-tion probabilities using a Dirichlet distribution whereas emis-sion probabilities are computed using a truncated normaldistribution After estimating the methylation levels of all theCpG sites for each sample differentially methylated CpG sitesare identified using FET Identified DMCs are further groupedinto DMRs if the distance between the CpG sites is lt100 basesNon-consecutive CpG sites are reported as DMCs in the output
One of the major contributions of HMM-Fisher is that it canidentify DMRs of variable size instead of depending on user-defined boundary thresholds It takes the biological variationamong the replicates into account and can provide both DMCsand DMRs as output It can also be used to identify sample-wisemethylation patterns
lsquoHMM-DMrsquo [81] is another approach that uses HMM to iden-tify DM HMM-DM directly estimates the DM states of the CpGsites for each sample across the groups In this approach thetransition probability of each CpG site only depends on themethylation state of the immediate previous CpG site LikeHMM-Fisher and ComMet the transition probabilities are
Identifying differential methylation | 7
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
estimated from a Dirichlet distribution In contrast emissionprobabilities are estimated from a beta distribution DM statesfor the CpG sites are estimated using the MCMC methodFinally consecutive CpG sites with same methylation status aregrouped together based on user-defined thresholds to formDMRs Similar to HMM-Fisher HMM-DM can identify variablesize DMRs from WGBS and RRBS data It also takes into accountthe biological variation among the replicates
In general one of the key advantages of HMM-basedapproaches is that they can identify DMRs with variable size incontrast to the approaches that use a fixed window size Theyconsider the spatial correlation of the CpG sites by borrowingmethylation information from their neighboring sites Theseapproaches can also identify independent DMCs or short DMRstherefore they can identify sharp methylation changes amongthe CpG sites In addition all the three approaches discussedabove are applicable to both WGBS and RRBS data sets
Entropy-based approaches
Entropy-based approaches identify the methylation differenceacross multiple samples using Shannon entropy [82] which is aquantitative measure of the variation or change in a series ofevents Approaches in this category are capable of providingsample-specific methylation information
lsquoQDMRrsquo [83] was the first approach that used Shannon en-tropy [82] for the purpose of identifying DMRs from bisulfitesequencing data It quantitatively identifies DMRs from prede-fined regions based on the average methylation levels of theCpG sites of the regions The probability that a sample is methy-lated at a specific location is calculated by taking the ratio of themethylation level of that sample and the total methylation levelacross all samples The original entropy formula can be used tomeasure the methylation difference across samples wherelower entropy represents higher methylation differenceHowever this way of calculating entropy is biased towardhypermethylation in minor samples Therefore QDMR intro-duces a one-step Tukey biweight weighted mean to make theirapproach less sensitive to such outliers Finally a region is dif-ferentially methylated if the weighted entropy for that region issmaller than a certain cutoff which is determined by using aprobability model QDMR takes into account the biological vari-ability across the samples In addition to the list of DMRs QDMRprovides quantification visualization and annotation of theDMRs for each sample One of the limitations of this approachis that it can identify DMRs only from predefined regions(RRBS) therefore it is unable to identify de novo regions
An improved approach in this category lsquoCpG_MPsrsquo [51] hasbeen proposed from the same research group which can iden-tify methylation patterns across paired or multiple samplesusing WGBS data This approach identifies de novo methylatedand unmethylated regions using hotspot extension algorithmbased on the methylation status of the neighboring CpG sitesIt combines a combinatorial algorithm with Shannon entropyto identify DMRs
The overall workflow of CpG_MPs is divided into four mod-ules The first module normalizes the sequencing reads of theCpG sites into methylation levels The second module categor-izes the methylation states of the CpG sites based on their nor-malized methylation levels into four categories such asunmethylated CpGs partially unmethylated CpGs methylatedCpGs and partially methylated CpGs CpGs are then scannedfrom 50 to 30 end to extract a certain number of methylated(unmethylated) CpGs to create methylated (unmethylated)
hotspots Next the hotspots are extended both upstream anddownstream to incorporate partially methylated or partiallyunmethylated CpGs into their corresponding hotspotsNeighboring regions with the same patterns are then combinedbased on a given threshold Also the mean value and the stand-ard deviation of the methylation levels of the CpG sites withineach region are computed The third module identifiesconservatively unmethylated regions conservatively methy-lated regions and DMRs by using a combinatorial algorithmwith Shannon entropy At first the identified methylated andunmethylated regions are mapped to the reference genome andthen overlapping regions (ORs) are recorded in the referencegenome Next the hotspot extension technique is used tomerge the neighboring ORs with the same methylation patternsacross multiple samples A modified Shannon entropy-basedmethod is used to identify the regions that are significant acrossmultiple samples The fourth module analyzes sequencing fea-tures and visualizes the identified regions
One key advantage of CpG_MPs is that it determines theDMR boundaries by applying combinatorial algorithm instead ofdepending on empirical thresholds to identify DMRs hence itcan detect variable-length boundaries It can also be used toidentify methylation patterns for each sample In additionCpG_MPs considers biological variation among the replicatesHowever CpG_MPs does not include any error control measure-ment among the identified regions
A more recent approach lsquoSMARTrsquo [84] extends the weightedentropy concept introduced by QDMR to determine cell type-specific methylation patterns from a large number of DNAmethylomes The input of SMART is the sample-wise methyla-tion status of the CpG sites SMART first quantifies the methyla-tion specificity across the samples using Shannon entropy witha one-step Tukey biweight weighted mean Next it incorporatesmethylation similarities between neighboring CpG sites by esti-mating the methylation level of the sites based on Euclideandistance These similarity metrics and methylation specificitystates are then used to segment the genome into groups of CpGsites Finally a group of CpG sites is called hypermethylated(hypomethylated) if the methylation levels of that group is sig-nificantly higher (lower) than the average methylation levels ofall samples determined by one sample t-test
Major contribution of SMART is that it can identify cell type-specific methylation marks (ie HyperMark and HypoMark)from a large sample cohort Instead of depending on user-defined thresholds it determines DMR boundaries of variablesizes by quantifying the methylation levels of the CpG sites Italso provides functional annotation of the identified methyla-tion marks It considers the biological variation among the repli-cates and spatial correlation among the methylation levels ofthe CpG sites across the genome In addition it can be appliedto both WGBS and RRBS data
One of the key benefits of the entropy-based approaches isthat they can directly identify DMRs without identifying DMCsAs a result entropy-based approaches that can detect de novoregions (ie CpG_MPs and SMART) do not depend on empiricalboundary estimations Furthermore these approaches take intoaccount the biological variation within replicates
Mixed statistical tests-based approaches
Approaches in this category rely on established statistical testssuch as FET t-test and ANOVA to identify DMCsDMRs Thesestatistical tests are applied to CpG sites across the samples or
8 | Shafi et al
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
within predefined genomic regions (ie fixedvariable sizewindows)
One of the approaches in this category lsquoCOHCAPrsquo [46] iden-tifies differentially methylated CpG islands from two or moregroups using predefined regions It also provides integrationwith gene expression data and visualization of the results Thepipeline starts with taking aligned read counts (eg output ofBismark aligner [26]) as input CpG sites are marked as methy-lated or unmethylated based on a user-defined threshold P-val-ues of the CpG sites are first calculated by using differentstatistical approaches (ie FET ANOVA and t-test) based on thechosen experimental design Later the P-values are correctedusing the FDR approach CpG sites are filtered based on P-valueof the CpG site average methylation proportion across all thesamples and FDR value CpG islands with a minimum number offiltered CpG sites are considered as candidate DMRs In the lsquoaver-age by CpG sitersquo pipeline P-values of the CpG sites within candi-date DMRs are calculated by the previously selected statisticalmethod In the lsquoaverage by CpG islandrsquo pipeline beta values ofthe filtered CpG sites within each candidate DMR are averagedand then a P-value is calculated based on the averaged betavalue The major contribution of COHCAP is that it provides inte-gration of gene expression data with DM analysis In addition ittakes into account the biological variation among the replicates
lsquoDMAPrsquo [85] another approach in this category is afragment-based approach primarily designed for the RRBSprotocol to identify differentially methylated fragments (DMFs)Nonetheless this approach can also detect DMRs from WGBSdata In addition to the identification of DMRsDMFs DMAP pro-vides information about nearby genes and CpG sites
The input of DMAP is methylated read counts in Bismarkaligner [26] format To identify candidate genomic regions fromWGBS data DMAP defines fixed-size windows (ie default1000 bp) For RRBS data it defines fragments of variable sizes(40ndash220 bp) Next a P-value is calculated for each region or frag-ment based on the methylated CpG counts using a chosen stat-istical test (v2 test FET and ANOVA) FET is recommended forpairwise comparison v2 test is recommended for testing vari-ability across multiple samples and ANOVA is recommendedfor comparing groups of samples Candidate regions are se-lected as DMRs (for WGBS data) and DMFs (for RRBS data) basedon a user-defined P-value threshold Options to correct for mul-tiple comparisons are also provided The output is a list of can-didate regionsfragments with their P-values and informationregarding the statistical test that was applied FurthermoreDMAP provides gene annotation features of the identified re-gionsfragments Major contribution of this approach is that itcan detect variable-size fragments (DMFs) from predefinedregions
lsquoswDMRrsquo [86] another approach in this category integratesmultiple commonly used statistical approaches to identifyDMRs from WGBS data The pipeline begins with taking themethylated read counts of each CpG site (preferably from theBismark aligner [26]) as input which are later converted tomethylation ratios Next it divides the genome into multipleoverlapping fragments or windows of equal length based onuser-defined thresholds A statistical approach is chosen from alist of commonly used approaches (ie FET t-test v2 WilcoxonANOVA and KruskalndashWallis test) to perform hypothesis testingwithin each window across two or more samples For two sam-ples methylation levels of the CpG sites are compared using t-test Wilcoxon test v2 test or FET For more than two samplesmethylation levels are compared using either ANOVA orKruskalndashWallis test Therefore for each window swDMR
provides a P-value generated using the selected statistical testThe resulting P-values are corrected for multiple comparisonsusing the FDR approach The regions with corrected P-valueslower than a predefined threshold are selected as potentialDMRs Using an extension function two potential DMRs aremerged if the distance between them is less than a predefinedthreshold The merged DMRs are tested with the previously se-lected statistical test and P-values are corrected with respect tothe new DMR boundaries Finally the merged DMRs with thecorrected P-values less than the user-defined threshold are se-lected as candidate DMRs swDMR approach can be used with-out biological replicates and can work with CHG or CHHmethylation It also provides functionalities such as DMR clus-ter analysis visualization and annotation of DMRs
The key advantage of the approaches in this category is thatthey provide flexibility in selecting different statistical testsand methods for multiple test correction In contrast theseapproaches do not take into account the spatial correlation be-tween the methylation levels of the neighboring CpG sites Inaddition these approaches either work on predefined regions ordivide the genome into windows of fixedvariable size Hencethey miss the low CpG density regions where methylation hassharp changes such as TFBS that can contain a single differen-tially methylated CpG site [68] Importantly they depend onuser-defined thresholds to estimate the DMR boundaries
Binary segmentation-based approaches
Approaches in this category use binary segmentation algorithm torecursively divide the genome to identify candidate regions frombisulfite sequencing data The only approach in this categorylsquometilenersquo [87] uses a circular binary segmentation algorithm toidentify DMRs It can be used to analyze both WGBS and RRBS ex-periments across multiple samples with or without replicates
The pipeline starts with a pre-segmentation step that div-ides the genome into primary regions based on the availablemethylation information The pre-segmented regions are theniteratively segmented using a circular binary segmentation al-gorithm to identify a window with the maximum mean differ-ence signal The segmentation is terminated when a segmenthas less number of CpGs than a predefined threshold or itdoes not show any improvement in the two-dimensionalKolmogorovndashSmirnov test results The identified window ismarked as a potential DMR The output of metilene is a list ofDMRs with their P-values adjusted P-values and the P-valuefrom a MannndashWhitney U test
Metilene can detect de novo regions of various lengths with-out relying on user-defined boundary thresholds It takes intoaccount the variation among biological replicates In addition itcan predict methylation levels of the missing CpG sites usingbeta distribution One of the limitations of metilene is that theresult greatly depends on the minimum segment size param-eter which can lead to false negatives (if it is too high) or falsepositives (if it is too low) In addition it does not consider thespatial correlation of the methylation levels of the CpG sitesacross biological replicates
Discussion
In this survey we briefly summarize 22 approaches that identifyDM using bisulfite sequencing data focusing on their importantfeatures such as concept used protocol used biological vari-ability spatial distribution additional covariates error correc-tion sequencing coverage and identifying de novo regions The
Identifying differential methylation | 9
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
approaches are categorized into seven different categoriesbased on their primary concepts or techniques used to identifyDM Some of the approaches involve multiple concepts to iden-tify DM hence they could be assigned to multiple categoriesOn such cases we categorize the approach based on the conceptthat the authors highlighted Pros and cons of these categoriesare summarized in Figure 3 The important features of theapproaches covered in this survey are summarized in Table 1Moreover the workflow of the approaches including the infor-mation about genome segmentation difference quantificationand DMR calling are described in Figure 4
Note that there are other possible ways to categorize theseapproaches For instance this can be done based on the datatype used to estimate the methylation levels of the CpG sites(count data ratio data and both count and ratio data) In thatcase the methods will be distributed among the categories asfollows (i) count data MethylKit eDMR DSS DSS-single DSS-general MOABS RADmeth MethylSig MACAU GetisDMRComMet (ii) ratio data BSmooth BiSeq qDMR CpG_MPsSMART HMM-Fisher HMM-DM COHCAP metilene (iii) bothcount and ratio data DMAP swDMR A graphical representationof this classification is shown in Figure 5 Similarly theapproaches can be categorized based on the number of groupsallowed (one group of samples two groups without replicatesand two groups with replicates) based on the protocol used(WGBS RRBS and both WGBS and RRBS) etc
Biological variability within the replicates is a crucial factorto consider because it can reduce the number of false positivesin the results [14 43 46] If an approach takes into account each
biological replicate within a group separately when modelingthe methylation levels of the CpG sites then biological variabil-ity is considered On the other hand biological variability is lostif an approach combines the read counts of the CpG sites acrossthe replicates Although classical hypothesis testing methods(eg t-test and ANOVA) take biological variation into accountBSmooth was the first approach primarily developed for DMRidentification that takes into account the biological variationamong replicates Within the surveyed approaches smoothing-based approaches beta-binomial-based approaches entropy-based approaches etc (see Table 1 for full list) take the biolo-gical variation among the replicates into account
Spatial correlation is another factor to consider which pro-vides a better estimation of the methylation levels of the CpGsites by borrowing information from their neighbors A commonway of considering spatial correlation is to perform lsquosmoothingrsquooperation before the detection of DM In this survey smooth-ing-based approaches (BSmooth and BiSeq) and a few beta-bi-nomial-based approaches (DSS-single MACAU and GetisDMR)fall into this category Performing smoothing when identifyingDMRs can reduce the required sequencing depth and estimatethe methylation status of missing CpG sites [43] Additionallysmoothing procedure helps to identify relatively longer DMRsHowever this procedure is only applicable for the genomewhose methylation profile is known to be smooth Also smooth-ing is not suitable for the data sets whose CpG sites are sparse(commonly seen in RRBS protocol) due to extrapolated methyla-tion values of 0 and 1 Besides smoothing other techniques canbe applied to take spatial correlation into account For instance
Figure 3 Pros and cons of the seven categories discussed in this survey
10 | Shafi et al
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
Tab
le1
Sum
mar
yo
fth
eim
po
rtan
tch
arac
teri
stic
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
dan
dre
fere
nce
Co
nce
pt
use
dPr
oto
col
Prim
ary
pu
rpo
seB
iolo
gica
lva
riat
ion
Spat
ial
dis
trib
uti
on
Ad
dit
ion
alco
vari
ates
Erro
rco
rrec
tio
nSe
qu
enci
ng
cove
rage
Iden
tify
deno
vore
gio
n
To
tal
cita
tio
ns
Cit
atio
n
year
1m
eth
ylK
it[5
4]Lo
gist
icre
gres
sio
nB
oth
Iden
tify
DM
Cs
and
ann
ota
te
17
543
75
2eD
MR
[64]
Logi
stic
regr
essi
on
Bo
thId
enti
fyD
MC
san
dD
MR
s
28
83
BSm
oo
th[4
3]Sm
oo
thin
gW
GB
SId
enti
fyD
MR
sw
ith
rep
lica
tes
156
39
4B
iSeq
[66]
Smo
oth
ing
RR
BS
Iden
tify
DM
Rs
wit
hFD
Rco
rrec
tio
n
62
18
6D
SS[6
9]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MLs
for
smal
lsa
mp
les
4316
1
5M
OA
BS
[70]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
Cs
wit
hre
pli
cate
s
49
184
7R
AD
Met
h[7
1]B
eta-
bin
om
ial
WG
BS
Iden
tify
DM
Lsan
dD
MR
s
31
133
8m
eth
ylSi
g[7
2]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MC
san
dD
MR
s
42
174
9D
SS-s
ingl
e[7
3]B
eta-
bin
om
ial
Bo
thId
enti
fyD
MR
sw
ith
ou
tre
pli
cate
s
15
12
10M
AC
AU
[74]
Bet
a-bi
no
mia
lB
oth
Iden
tify
DM
usi
ng
po
pu
la-
tio
nst
ruct
ure
88
11D
SS-g
ener
al[7
5]B
eta-
bin
om
ial
RR
BS
Iden
tify
DM
Ls
3
312
Get
isD
MR
[76]
Bet
a-bi
no
mia
lW
GB
SId
enti
fyD
MR
sd
irec
tly
00
13C
om
Met
[78]
HM
MB
oth
Iden
tify
DM
Rs
248
714
HM
M-F
ish
er[8
0]H
MM
Bo
thId
enti
fyD
Mp
atte
rns
44
15H
MM
-DM
[81]
HM
MB
oth
Iden
tify
DM
Rs
44
16Q
DM
R[8
3]Sh
ann
on
entr
op
yR
RB
SId
enti
fyD
MR
s
61
107
17C
pG
_MPs
[51]
Shan
no
nen
tro
py
WG
BS
Iden
tify
DM
pat
tern
s
30
72
18SM
AR
T[8
4]Sh
ann
on
entr
op
yW
GB
SId
enti
fyce
llty
pe-
spec
ific
met
hyl
atio
nm
arks
99
19C
OH
CA
P[4
6]M
ixed
stat
isti
csR
RB
SId
enti
fyD
MC
san
dco
n-
sist
ent
Cp
Gis
lan
ds
277
7
20D
MA
P[8
5]M
ixed
stat
isti
csB
oth
Iden
tify
DM
Rs
and
DM
Fs
3112
421
swD
MR
[86]
Mix
edst
atis
tics
WG
BS
Iden
tify
DM
Rs
wit
ho
ut
rep
lica
tes
4
32
22m
etil
ene
[87]
Bin
ary
segm
enta
tio
nB
oth
Iden
tify
DM
Rs
inla
rge
gro
up
so
fsa
mp
les
00
For
colu
mn
s5ndash
10
m
ean
sth
atth
em
eth
od
con
sid
ers
the
char
acte
rist
ican
d
mea
ns
that
the
met
ho
dd
oes
no
tco
nsi
der
the
char
acte
rist
ic
For
the
9th
colu
mn
m
ean
sth
atth
em
eth
od
con
sid
ers
seq
uen
cin
gco
vera
gew
hen
cou
nt-
base
dh
ypo
thes
iste
sts
are
per
form
edF
or
the
10th
colu
mn
id
enti
fyde
novo
regi
on
s
mea
ns
that
the
met
ho
dca
nan
d
mea
ns
that
the
met
ho
dca
nn
ot
iden
tify
deno
vore
gio
ns
For
colu
mn
s5ndash
10
mea
ns
the
char
acte
rist
ic
isn
ot
app
lica
ble
To
talc
itat
ion
san
dci
tati
on
sp
erye
arre
pre
sen
tth
en
um
ber
of
cita
tio
ns
and
the
aver
age
nu
mbe
ro
fci
tati
on
sp
erye
arr
esp
ecti
vely
as
sho
wn
on
goo
gle
sch
ola
ras
of
24O
cto
ber
2016
Identifying differential methylation | 11
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
eDMR uses autocorrelation of the methylation data HMM-basedapproaches (ComMet HMM-Fisher and HMM-DM) use HMMCpG_MPs uses hotspot extension algorithm and SMART usesEuclidean distance based on methylation similarity to take intoaccount spatial correlation of the CpG sites
Sequencing coverage is another important factor that affectsthe accuracy of the methylation estimation Count-based hy-pothesis tests (eg FET v2 test) take into account sequencingcoverage by simply pooling the read counts however thesetests require grouping of read counts and this is biased towardthe samples with higher sequencing coverage For other DManalysis approaches consideration of coverage information isnot merely dependent on the hypothesis tests but dependenton whether coverage information is incorporated when model-ing the methylation levels of the CpG sites For example HMM-Fisher uses methylation ratios to estimate the methylationstatus at each CpG sites and then applies FET on the count ofthe methylation states to identify DMCs Therefore HMM-Fisher does not take into account read coverage despite usingFET as the hypothesis test Among the surveyed approachesBiSeq ComMet DMAP swDMR logistic regression-based andbeta-binomial-based approaches are able to take the coverageinformation into account Some approaches also include
Figure 4 The workflow of 22 approaches developed for DM analysis t-test denotes a signal-to-noise statistic similar to the classical t-test Predefined criteria represent
user-defined thresholds such as P-value cutoff of the DMCs length of the DMRs distance between neighbor DMRs minimum number of DMCs per DMR cutoff value of
CDIF (only for MOABS) etc FET denotes Fisherrsquos exact test HMM denotes hidden Markov model MCMC denotes Markov Chain Monte Carlo and CDIF denotes credible
methylation difference
Figure 5 A higher level classification of the approaches discussed in this survey
based on the data type used when modeling the methylation levels of the CpG sites
12 | Shafi et al
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
Tab
le2
Co
mp
aris
on
of
the
avai
labl
eim
ple
men
tati
on
so
fth
e22
surv
eyed
app
roac
hes
Met
ho
d(t
oo
l)an
dto
olr
efer
ence
Plat
form
Ava
ilab
ilit
yLi
cen
seO
utp
ut
Publ
ish
edd
ate
Up
dat
edd
ate
1m
eth
ylK
it[5
4]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
Art
isti
cv2
DM
Cs
DM
Rs
list
(tab
le)
DM
Cs
DM
Rs
per
chro
mo
som
e(g
rap
h)
9N
ove
mbe
r20
1122
Oct
obe
r20
16
2eD
MR
[54
64]
Rp
acka
geSt
and
alo
ne
Art
isti
cG
PLD
MR
sli
st(t
able
)4
Jan
uar
y20
134
Ap
ril2
014
3B
Smo
oth
(bss
eq)[
43]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eA
rtis
tic
v2D
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
20Ju
ly20
1214
Oct
obe
r20
16
4B
iSeq
[88]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eLG
PLv3
DM
Rs
list
(tab
le)
DM
Rm
ean
met
hyl
atio
n(g
rap
h)
2A
pri
l201
317
Oct
obe
r20
16
6D
SS[6
973
75
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
04Ju
ne
2012
17O
cto
ber
2016
5M
OA
BS
[70]
Cthornthorn
pac
kage
and
Perl
scri
pt
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)12
Jun
e20
1330
May
2015
7R
AD
Met
h[7
1]Cthornthorn
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)27
Mar
ch20
141
May
2014
a
8m
eth
ylSi
g[7
2]R
pac
kage
Stan
dal
on
eG
NU
GPL
v3D
MC
sD
MR
sli
st(t
able
)C
pG
site
sm
eth
ylat
ion
rate
(gra
ph
)17
Jun
e20
1410
Jun
e20
16
9D
SS-s
ingl
e(D
SS)[
697
375
89]
Bic
on
du
cto
rR
pac
kage
Stan
dal
on
eG
NU
GPL
DM
Cs
DM
Rs
list
(tab
le)
DM
R
met
hyl
atio
nl
ocu
s(g
rap
h)
16A
pri
l201
517
Oct
obe
r20
16
10M
AC
AU
[74]
Cthornthorn
pac
kage
and
Rsc
rip
tSt
and
alo
ne
GN
UG
PLD
MC
sli
st(t
able
)5
Jun
e20
159
Dec
embe
r20
1511
DSS
-gen
eral
(DSS
)[69
73
758
9]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLD
MC
sD
MR
sli
st(t
able
)D
MR
m
eth
ylat
ion
lo
cus
(gra
ph
)29
Ap
ril2
015
17O
cto
ber
2016
12G
etis
DM
R[7
6]Cthornthorn
pac
kage
and
Rsc
rip
tsSt
and
alo
ne
GN
UG
PLD
MR
sli
st(t
able
)28
Ap
ril2
016
28Se
pte
mbe
r20
1613
Co
mM
et(B
isu
lfigh
ter)
[78]
Cthornthorn
pac
kage
and
Pyth
on
Stan
dal
on
eC
CA
NS
DM
Rs
list
(tab
le)
12D
ecem
ber
2014
29Se
pte
mbe
r20
1514
HM
M-F
ish
er[8
0]R
scri
pts
Stan
dal
on
eN
on
eD
MR
sli
st(t
able
)D
MR
lo
cus
met
hyl
atio
nle
vel(
grap
h)
25A
pri
l201
429
Febr
uar
y20
16
15H
MM
-DM
[81]
Rsc
rip
tsSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
DM
Rl
ocu
sm
eth
ylat
ion
leve
l(gr
aph
)27
Mar
ch20
1424
Mar
ch20
16
16Q
DM
R[8
3]Ja
vap
acka
geSt
and
alo
ne
web
CLI
Cu
sto
mb
DM
Rs
list
(tab
le)
DM
Rin
UC
SCG
eno
me
Bro
wse
r(g
rap
h)
10M
ay20
1017
Oct
obe
r20
12
17C
pG
_MPs
[51]
Java
pac
kage
and
Perl
scri
pt
Stan
dal
on
ew
ebC
LIN
on
eD
MR
sli
st(t
able
)20
Jun
e20
111
Sep
tem
ber
2015
18SM
AR
T(S
MA
RT
-BS-
Seq
)[84
]Py
tho
np
acka
geSt
and
alo
ne
PSFL
DM
Rs
list
(tab
le)
17M
ay20
1517
May
2015
19C
OH
CA
P[4
6]B
ico
nd
uct
or
Rp
acka
geSt
and
alo
ne
GN
UG
PLv3
DM
Cs
and
DM
Cp
Gis
lan
ds
list
(tab
le)
DM
Cp
Gis
lan
ds
met
hyl
atio
nav
erag
e(g
rap
h)
9Ja
nu
ary
2014
17O
cto
ber
2016
20D
MA
P(m
eth
_pro
gs_d
ist)
[85]
Cp
acka
geSt
and
alo
ne
No
ne
DM
Rs
list
(tab
le)
14M
ay20
1328
Au
gust
2016
21sw
DM
R[8
6]Pe
rlan
dR
scri
pts
Stan
dal
on
eG
NU
GPL
v3D
MR
sli
st(t
able
)D
MR
met
hyl
atio
nle
vel(
grap
h)
6Ja
nu
ary
2013
15Ju
ne
2014
22m
etil
ene
[87]
Cp
acka
geSt
and
alo
ne
GN
UG
PLv2
DM
Rs
list
(tab
le)
8M
ay20
1529
Ap
ril2
016
aR
AD
Met
his
no
wp
art
of
the
Met
hPi
pe
too
lrel
ease
do
n6
Sep
tem
ber
2013
wit
hth
ela
test
up
dat
eo
n21
Oct
obe
r20
16
bC
ust
om
lice
nse
stat
ing
that
the
soft
war
eis
free
of
char
geto
rese
arch
ers
wo
rkin
gat
acad
emic
no
n-p
rofi
to
rgan
izat
ion
so
nn
on
-co
mm
erci
alp
roje
cts
GN
Ug
ener
alp
ubl
icli
cen
seL
GPL
les
ser
gen
eral
pu
blic
lice
nse
CC
AN
Scr
eati
veco
mm
on
sat
trib
uti
on
-No
nC
om
mer
cial
-Sh
areA
like
30
un
po
rted
lice
nse
PSF
Lp
yth
on
soft
war
efo
un
dat
ion
lice
nse
CLI
co
mm
and
lin
ein
terf
ace
Identifying differential methylation | 13
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
additional filters to remove low coverage CpG sites before esti-mating methylation
Identifying de novo regions is another important feature ofthe approaches that identify DM Approaches that identify denovo regions use various techniques such as merging DMCsusing empirical thresholds entropy-based algorithms and bin-ary segmentation to estimate DMR boundaries (see Figure 4)While empirical thresholds allow for more flexibility to theusers proper tuning of these parameters is necessary to get ro-bust results Some of the approaches in addition to the list ofDMRs provide information such as the list of DMCs genetic an-notations and visualization of the DMRs
Error control is another important factor in DM analysis as itreduces the number of false positives in the results Approachescontrol errors by correcting P-values for each CpG site acrossthe genome correcting P-values for each region correcting theP-values within the identified regions etc
Identification of the fittest approach among all that are avail-able is a challenging task in DM analysis If biological replicatesare available beta-binomial approaches are suitable becausethey take both coverage information and biological variabilityamong the replicates into account In addition they can identifylow CpG density regions where methylation has sharp changes(eg TFBS) Within the beta-binomial-based approaches DSS-single MACAU and GetisDMR take spatial correlation into ac-count Therefore these three approaches are more appropriate ifthe methylation levels of the CpG sites are known to be spatiallycorrelated and biological replicates are available Smoothing-based approaches entropy-based approaches HMM-FisherHMM-DM and metilene can also be applied when biological repli-cates are available Similarly if the methylation levels of the CpGsites are known to be spatially correlated approaches that takespatial distribution into consideration such as smoothing-basedapproaches HMM-based approaches DSS-single MACAUGetisDMR CpG_MPs and SMART should be used
When sample size is small in the data set DSS MethylSigand HMM-Fisher are appropriate While DSS uses informationfrom all CpG sites and an empirical Bayes estimate to achievevariation shrinkage methylSig uses local information and amaximum likelihood estimator to compute both the methyla-tion level and the variance HMM-Fisher on the other handcombines two CpG sites while conducting FET if the distance be-tween them is lt100 bases If multiple experimental factors areavailable in the data set approaches such as methylKit eDMRBiSeq RADMeth MACAU DSS-general and GetisDMR are moreappropriate because they allow additional covariates in theirmodel
Suitable approaches can also be chosen based on their pri-mary purposes For example QDMR CpG_MPs or HMM-Fishercan be used to identify methylation patterns from a single sam-ple To identify cell type-specific methylation marks from largesample cohorts SMART is a suitable choice To identify DM pat-terns (hypermethylation and hypomethylation) across twogroups of samples HMM-Fisher and HMM-DM are more appro-priate Approaches can be chosen based on the input data typeas well For instance if the data protocol is RRBS and the pur-pose is to identify DMRs then QDMR BiSeq DSS-general orCOHCAP can be applied To work with CHG or CHH methylationmethylKit eDMR MOABS DSS RADMeth and swDMR are rec-ommended because they are not limited to CpG methylation
Comparison of some of the approaches can be found fromtwo existing review papers Klein et al [15] and Yu and Sun [16]Klein et al compared four tools that are originally developed forDM analysis BiSeq [88] COHCAP [46] methylKit [54] and
RADMeth [71] This review evaluates the trade-off between thesensitivity and specificity for individual methods using the re-ceiver operator characteristic (ROC) based on the regional P-val-ues of the identified regions The performance of each methodis then assessed by computing and comparing the area underthe ROC curve According to this review BiSeq and RADMethoutperform COHCAP and methylKit Yu and Sun [16] comparedBSmooth methylKit BiSeq HMM-Fisher and HMM-DMAccording to this review HMM-Fisher and HMM-DM achievedhigher sensitivity and specificity than the other three methodsTo assess the performance of all of the available approaches abenchmark analysis is needed Due to the complex nature ofthe methylation data and lack of a gold standard for perform-ance evaluation and standardized format of the input databuilding a benchmark for assessing the efficiency of theseapproaches is a challenging task and out of the scope of thissurvey
In addition to the conceptual overview we also summarizedthe implementations of the approaches in Table 2 The sum-mary includes platform information license information out-put format published date and last update date While this is acondensed view of the capabilities of these tools it could still beexpanded to include information such as consistency in the in-put and output formats Such details as well as a simulatednoise-free data set with known results are further requirementstoward creating a comprehensive benchmark for assessing thepractical performance of DM detection tools
Conclusion
Epigenetic modifications are thought to play a role in develop-mental disorders and cancer are likely to be influenced by en-vironmental factors and are known to regulate gene expressionIdentification of DM using bisulfite sequencing data is a crucialstep in the analysis of epigenetic data Several statistical meth-ods have been developed to address this challenge In thisstudy we survey 22 methods that identify DM from bisulfitesequencing data All the approaches surveyed in this articlewere developed within the past 5 years which shows greatinterest for progress in this area Our main objective in this sur-vey is to provide the community a comprehensive view of theexisting approaches that identify DM from bisulfite sequencingdata To do that we classify the approaches into seven catego-ries based on their primary concepts and features We summar-ize the distinguishing characteristics benefits and limitationsof each approach and category This survey is intended to helppotential users to choose the best DM analysis method based ontheir requirements It will help the researchers to design experi-ments to generate data that are better suited for the commu-nity In addition this survey will guide the developers todevelop new efficient statistical models that identify DM byconsidering key characteristics described here
Key points
bull Identification of the fittest approach among all thatare available is a challenging task in DM analysis
bull A comprehensive benchmark of the available approachesthat identify DM is greatly needed
bull Due to the high computation cost only a few web-based implementations of the approaches are cur-rently available
14 | Shafi et al
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
Funding
National Institutes of Health (RO1 DK089167 STTRR42GM087013) National Science Foundation (DBI-0965741)and Robert J Sokol MD Endowment in Systems Biology (toSD) Any opinions findings conclusions or recommenda-tions expressed in this material are those of the authors anddo not necessarily reflect the views of any of the fundingagencies
References1 Deaton AM Bird A CpG islands and the regulation of tran-
scription Genes Dev 201125(10)1010ndash222 Esteller M Cancer epigenomics DNA methylomes and
histone-modification maps Nat Rev Genet 20078(4)286ndash983 Lister R Pelizzola M Dowen RH et al Human DNA methyl-
omes at base resolution show widespread epigenomic differ-ences Nature 2009462(7271)315ndash22
4 Krueger F Kreck B Franke A et al DNA methylome analysisusing short bisulfite sequencing data Nat Methods20129(2)145ndash51
5 Feng S Jacobsen SE Reik W Epigenetic reprogramming inplant and animal development Science 2010330(6004)622ndash7
6 Lindroth AM Cao X Jackson JP et al Requirement ofCHROMOMETHYLASE3 for maintenance of CpXpG methyla-tion Science 2001292(5524)2077ndash80
7 Breiling A Lyko F Epigenetic regulatory functions of DNAmodifications 5-methylcytosine and beyond EpigeneticsChromatin 20158(1)24
8 Hendrich B Bird A Identification and characterization of afamily of mammalian methyl-CpG binding proteins Mol CellBiol 199818(11)6538ndash47
9 Bird AP Wolffe AP Methylation-induced repressionndashbeltsbraces and chromatin Cell 199999(5)451ndash4
10 Jones PA Functions of DNA methylation islands startsites gene bodies and beyond Nature Rev Genet201213(7)484ndash92
11Harris RA Wang T Coarfa C et al Comparison of sequencing-based methods to profile DNA methylation and identificationof monoallelic epigenetic modifications Nat Biotechnol201028(10)1097ndash105
12Taiwo O Wilson GA Morris T et al Methylome analysis usingMeDIP-seq with low DNA concentrations Nat Protoc20127(4)617ndash36
13Gu H Bock C Mikkelsen TS et al Genome-scale DNA methy-lation mapping of clinical samples at single-nucleotide reso-lution Nat Methods 20107(2)133ndash6
14Robinson MD Kahraman A Law CW et al Statistical methodsfor detecting differentially methylated loci and regions FrontGenet 20145324
15Klein HU Hebestreit K An evaluation of methods to test pre-defined genomic regions for differential methylation in bisul-fite sequencing data Brief Bioinform 201617769ndash807
16Yu X Sun S Comparing five statistical methods of differentialmethylation identifi- cation using bisulfite sequencing dataStat Appl Genet Mol Biol 201615(2)173ndash91
17Sun Z Cunningham J Slager S et al Base resolution methyl-ome profiling considerations in platform selection data pre-processing and analysis Epigenomics 20157(5)813ndash28
18Clark SJ Statham A Stirzaker C et al DNA methylation bisul-phite modification and analysis Nat Protoc 20061(5)2353ndash64
19Meissner A Gnirke A Bell GW et al Reduced representationbisulfite sequencing for comparative high-resolution DNAmethylation analysis Nucleic Acids Res 200533(18)5868ndash77
20 FASTX-Toolkit FASTQA short-reads pre-processing toolshttphannonlabcshledufastx_toolkit 2010
21Schmieder R Edwards R Quality control and preprocessingof metagenomic datasets Bioinformatics 201127(6)863ndash4
22Cox MP Peterson DA Biggs PJ SolexaQA at-a-glance qualityassessment of Illumina second-generation sequencing dataBMC Bioinformatics 201011(1)485
23Martin M Cutadapt removes adapter sequences from high-throughput sequencing reads EMBnet J 201117(1)10
24Bolger AM Lohse M Usadel B Trimmomatic a exible trimmerfor Illumina sequence data Bioinformatics 201430(15)2114ndash20
25 Trim Galore httpwwwbioinformaticsbabrahamacukprojectstrim_galore
26Krueger F Andrews SR Bismark a exible aligner and methy-lation caller for bisulfite-seq applications Bioinformatics201127(11)1571ndash2
27Chen PY Cokus SJ Pellegrini M BS seeker precise mappingfor bisulfite sequencing BMC Bioinformatics 201011(1)203
28Pedersen B Hsieh TF Ibarra C et al MethylCoder softwarepipeline for bisulfitetreated sequences Bioinformatics201127(17)2435ndash6
29Harris EY Ponts N Levchuk A et al BRAT bisulfite-treatedreads analysis tool Bioinformatics 201026(4)572ndash3
30Hong C Clement NL Clement S et al Probabilistic alignmentleads to improved accuracy and read coverage for bisulfitesequencing data BMC Bioinformatics 201314(1)337
31Langmead B Trapnell C Pop M et al Ultrafast and memory-efficient alignment of short DNA sequences to the humangenome Genome Biol 200910(3)R25
32Langmead B Salzberg SL Fast gapped-read alignment withBowtie 2 Nat Methods 20129(4)357ndash9
33Xi Y Li W BSMAP whole genome bisulfite sequenceMAPping program BMC Bioinformatics 200910232
34Xi Y Bock C Muller F et al RRBSMAP a fast accurate anduser-friendly alignment tool for reduced representationbisulfite sequencing Bioinformatics 201228(3)430ndash2
35Wu TD Nacu S Fast and SNP-tolerant detection of complexvariants and splicing in short reads Bioinformatics201026(7)873ndash81
36Smith AD Chung WY Hodges E et al Updates to the RMAPshort-read mapping software Bioinformatics 200925(21)2841ndash2
37Bock C Reither S Mikeska T et al BiQ analyzer visualizationand quality control for DNA methylation data from bisulfitesequencing Bioinformatics 200521(21)4067ndash8
38Kumaki Y Oda M Okano M QUMA quantification tool formethylation analysis Nucleic Acids Res 200836(Suppl2)W170ndash5
39Sun S Noviski A Yu X MethyQA a pipeline for bisulfite-treated methylation sequencing quality assessment BMCBioinformatics 201314(1)259
40Hu K Ting AH Li J BSPAT a fast online tool for DNA methyla-tion co-occurrence pattern analysis based on high-throughputbisulfite sequencing data BMC Bioinformatics 201516(1)220
41Liao WW Yen MR Ju E et al MethGo a comprehensive toolfor analyzing wholegenome bisulfite sequencing data BMCGenomics 201516(12)S11
42Eckhardt F Lewin J Cortese R et al DNA methylation profil-ing of human chromosomes 6 20 and 22 Nat Genet200638(12)1378ndash85
Identifying differential methylation | 15
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
43Hansen KD Langmead B Irizarry RA BSmooth from wholegenome bisulfite sequencing reads to differentially methy-lated regions Genome Biol 201213(10)R83
44 Jaffe AE Feinberg AP Irizarry RA et al Significance analysisand statistical dissection of variably methylated regionsBiostatistics 201213(1)166ndash78
45Feinberg AP Irizarry RA Stochastic epigenetic variation as adriving force of development evolutionary adaptation anddisease Proc Natl Acad Sci USA 2010107(Suppl 1)1757ndash64
46Warden CD Lee H Tompkins JD et al COHCAP an integrativegenomic pipeline for single-nucleotide resolution DNAmethylation analysis Nucleic Acids Res 201341(11)e117
47Cameron EE Baylin SB Herman JG p15INK4B CpG islandmethylation in primary acute leukemia is heterogeneous andsuggests density as a critical factor for transcriptional silenc-ing Blood 199994(7)2445ndash51
48Smallwood SA Lee HJ Angermueller C et al Single-cellgenome-wide bisulfite sequencing for assessing epigeneticheterogeneity Nat Methods 201411(8)817ndash20
49Varley KE Mutch DG Edmonston TB et al Intra-tumor het-erogeneity of MLH1 promoter methylation revealed by deepsingle molecule bisulfite sequencing Nucleic Acids Res
200937(14)4603ndash1250Singer ZS Yong J Tischler J et al Dynamic heterogeneity and
DNA methylation in embryonic stem cells Mol Cell201455(2)319ndash31
51Su J Yan H Wei Y et al CpG_MPs identification of CpG methy-lation patterns of genomic regions from high-throughputbisulfite sequencing data Nucleic Acids Res 201341(1)e4
52Bibikova M Chudin E Wu B et al Human embryonic stemcells have a unique epigenetic signature Genome Res200616(9)1075ndash83
53Byun HM Siegmund KD Pan F et al Epigenetic profiling ofsomatic tissues from human autopsy specimens identifiestissue-and individual-specific DNA methylation patternsHum Mol Genet 200918(24)4808ndash17
54Akalin A Kormaksson M Li S et al methylKit a comprehen-sive R package for the analysis of genome-wide DNA methy-lation profiles Genome Biol 201213(10)R87
55Hurlbert SH Pseudoreplication and the design of ecologicalfield experiments Ecol Monogr 198454(2)187ndash211
56Soneson C Delorenzi M A comparison of methods for differ-ential expression analysis of RNA-seq data BMCBioinformatics 201314(1)91
57Tony Ng HK Tang ML Testing the equality of two Poissonmeans using the rate ratio Stat Med 200524(6)955ndash65
58Gosset WS The probable error of a mean Biometrika190861ndash25
59Pearson ES Hartley HO Biometrika tables for statisticians (vol2) Biometrika Trust page 385 1976
60Smyth GK Linear models and empirical Bayes methods forassessing differential expression in microarray experimentsStat Appl Genet Mol Biol 20043(1)Article3
61Goeman JJ Van De Geer SA De Kort F et al A global test forgroups of genes testing association with a clinical outcomeBioinformatics 200420(1)93ndash9
62Gelman A Analysis of variancemdashwhy it is more importantthan ever Ann Stat 200533(1)1ndash53
63Wang HQ Tuominen LK Tsai CJ SLIM a sliding linear model forestimating the proportion of true null hypotheses in datasetswith dependence structures Bioinformatics 201127(2)225ndash31
64Li S Garrett-Bakelman FE Akalin A et al An optimized algo-rithm for detecting and annotating regional differentialmethylation BMC Bioinformatics 201314(Suppl 5)S10
65Pedersen BS Schwartz DA Yang IV et al Comb-p softwarefor combining analyzing grouping and correcting spatiallycorrelated P-values Bioinformatics 201228(22)2986ndash8
66Hebestreit K Dugas M Klein HU Detection of significantlydifferentially methylated regions in targeted bisulfitesequencing data Bioinformatics 201329(13)1647ndash53
67Benjamini Y Hochberg Y Multiple hypotheses testing withweights Scand J Stat 199724(3)407ndash18
68Rhee HS Franklin Pugh B Comprehensive genome-wide pro-tein-DNA interactions detected at single-nucleotide reso-lution Cell 2011147(6)1408ndash19
69Feng H Conneely KN Wu H A Bayesian hierarchical modelto detect differentially methylated loci from single nucleotideresolution sequencing data Nucleic Acids Res 201442(8)e69
70Sun D Xi Y Rodriguez B et al MOABS model based analysisof bisulfite sequencing data Genome Biol 201415(2)R38
71Dolzhenko E Smith AD Using beta-binomial regression forhigh-precision differential methylation analysis in multifac-tor whole-genome bisulfite sequencing experiments BMCBioinformatics 201415(1)215
72Park Y Figueroa ME Rozek LS et al MethylSig a whole gen-ome DNA methylation analysis pipeline Bioinformatics2014302414ndash22
73Wu H Xu T Feng H et al Detection of differentially methy-lated regions from whole-genome bisulfite sequencing datawithout replicates Nucleic Acids Res 201543(21)e141
74Lea AJ Tung J Zhou X A flexible efficient binomial mixedmodel for identifying differential DNA methylation in bisul-fite sequencing data PLoS Genet 201511(11)e1005650
75Park Y Wu H Differential methylation analysis for BS-seqdata under general experimental design Bioinformatics201632(10)1446ndash53
76Wen Y Chen F Zhang Q et al Detection of differentially methy-lated regions in whole genome bisulfite sequencing data usinglocal Getis-Ord statistics Bioinformatics 2016323396ndash404
77Zaykin DV Optimally weighted Z-test is a powerful methodfor combining probabilities in meta-analysis J Evol Biol201124(8)1836ndash41
78Saito Y Tsuji J Mituyama T Bisulfighter accurate detectionof methylated cytosines and differentially methylated re-gions Nucleic Acids Res 2014e45
79Saito Y Mituyama T Detection of differentially methylatedregions from bisulfite-seq data by hidden Markov modelsincorporating genome-wide methylation level distributionsBMC Genomics 201516(12)S3
80Sun S Yu X HMM-Fisher identifying differential methylationusing a hidden Markov model and Fisherrsquos exact test StatAppl Genet Mol Biol 201615(1)55ndash67
81Yu X Sun S HMM-DM identifying differentially methylatedregions using a hidden Markov model Stat Appl Genet Mol Biol201615(1)69ndash81
82Shannon CE A mathematical theory of communication ACMSIGMOBILE Mobile Comput Commun Rev 20015(1)3ndash55
83Zhang Y Liu H Lv J et al QDMR a quantitative method foridentification of differentially methylated regions by entropyNucleic Acids Res 201139(9)e58
84Liu H Liu X Zhang S et al Systematic identification and anno-tation of human methylation marks based on bisulfite sequenc-ing methylomes reveals distinct roles of cell type-specific
16 | Shafi et al
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-
hypomethylation in the regulation of cell identity genes NucleicAcids Res 201644(1)75ndash94
85Stockwell PA Chatterjee A Rodger EJ et al DMAP differentialmethylation analysis package for RRBS and WGBS dataBioinformatics 201430(13)1814ndash22
86Wang Z Li X Jiang Y et al swDMR a sliding windowapproach to identify differentially methylated regionsbased on whole genome bisulfite sequencing PloS One201510(7)e0132866
87 Juhling F Kretzmer H Bernhart SH et al metilene fast andsensitive calling of differentially methylated regions frombisulfite sequencing data Genome Res 201626(2)256ndash62
88Hebestreit K Klein HU BiSeq processing and analyzingbisulfite sequencing data R package version 1140 2015
89Wu H Wang C Wu Z A new shrinkage estimator for disper-sion improves differential expression detection in RNA-seqdata Biostatistics 201314(2)232ndash43
Identifying differential methylation | 17
- bbx013-TF1
- bbx013-TF2
- bbx013-TF3
- bbx013-TF4
-