normalization of illumina 450 dna methylation data

Analysis of normalization techniques for Illumina Infinium 450k DNA methylation data: beta-mixture quantile normalization

Related papers

• A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450k DNA methylation data: Andrew E. Teschendorff, Francesco Marabita, Matthias Lechner, Thomas Bartlett, Jesper Tegner, David Gomez-Cabrero, and Stephan Beck (2012)

• Evaluation of the infinium methylation 450k technology. Epigenomics, 3, 771–784: Dedeurwaerder,S. et al. (2011)

• Complete pipeline for infinium human methylation 450k beadchip data processing using subset quantile normalization for accurate dna methylation estimation. Epigenomics, 4, 325–341: Touleimat,N. and Tost,J. (2012)

• Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11, 587: Du,P. et al. (2010)

• Applications of beta-mixture models in bioinformatics: Ji, Y. et al. (2005)

• SWAN: Subset quantile Within Array Normalization for Illumina Infinium HumanMethylation 450k beadchips Maksimovic et al. 2012

• The minfi User's Guide Analyzing Illumina 450k Methylation Arrays Hansen et al. (2011)

Background information

• DNA methylation – addition of a methyl group which affects gene expression.

• Beta Value (β): β = M/(M + U + α)

• Measure of methylation for each CpG

• Where M = Methylated intensity and U = Unmethylated intensity

• 27k array design (old)

• Infinium I assays only: M and U same color, different beads

• 450k array design (new)

• Hybrid of the Infinium I and Infinium II assays. Two different assays on the same array

• Infinium II assays: M and U different color, same bead. Single probe pair for each CpG site

• Allows (for 12 samples in parallel) assessment of the methylation status of more than 480,000 cytosines distributed over the whole genome.

• Covers 99% of all RefSeq genes, average of 17 probes per gene.
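As a minimal sketch of the β-value definition above, the following Python computes β from methylated and unmethylated intensities. The function name is illustrative, and the offset α = 100 is an assumption here (the default commonly attributed to Illumina, as discussed in Du et al., 2010); the slides do not state a value.

```python
def beta_value(meth, unmeth, alpha=100.0):
    """Return beta = M / (M + U + alpha) for one CpG.

    meth and unmeth are the methylated and unmethylated intensities.
    alpha is a small offset that stabilises the ratio when both
    intensities are low (assumed default: 100).
    """
    return meth / (meth + unmeth + alpha)


# A fully methylated site with strong signal gives a value close to 1,
# an unmethylated site gives a value close to 0.
print(beta_value(9900.0, 0.0))   # high beta
print(beta_value(0.0, 9900.0))   # low beta
```

Because of the offset, β never reaches exactly 1 even when the unmethylated intensity is zero.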

Illumina Infinium 450k DNA methylation Beadchip

• Useful tool in EWAS studies

• Can provide more insight than 27k DNA methylation Beadchip.

• Problem: the two different probe designs cause the methylation values derived from them to exhibit different distributions

• β-values obtained from Infinium II probes were less accurate and reproducible than those obtained from Infinium I probes. Confirmed in at least two papers

• “Evaluation of the Infinium Methylation 450K technology” Dedeurwaerder et al., 2011

• “Complete pipeline for Infinium® Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation” Touleimat,N. and Tost,J. (2012)

• Inf1 probes can report for a wider range of β-values, reflecting all possible methylation states even after adjustment for differences in biological characteristics such as CpG density

• Because of this, Inf2 probes may not be able to report with the same sensitivity as Inf1 probes, as shown in the following graphs.

Infinium I vs Infinium II β Values

Dedeurwaerder et al., 2011

Infinium I vs Infinium II β Values

Touleimat,N. and Tost,J. (2012)

How to account for variation?

• The extra source of variation between type1 and type2 probes should be accounted for by normalizing each.

• Normalization means adjusting values measured on different scales to a notionally common scale. In more complicated cases, normalization may refer to more sophisticated adjustments where the intention is to bring the entire probability distributions of adjusted values into alignment.

• Several methods have been developed to normalize between type1 and type2 probe data

• Peak-based-correction (PBC) – Adjust type2 probes based on type1 probe peak values

• Subset Quantile Normalization (SQN) – Adjust type2 probes quantile rank based on similar type1 probe’s quantile rank

• Beta-MIxture Quantile dilation (BMIQ) – Adjust type2 probe distribution based on type1 distribution

Normalization technique PBC

• Peak-based-correction (PBC) - Proposed by Dedeurwaerder et al., 2011, to rescale the Infinium II data on the basis of the Infinium I density distribution modes. 4 steps to PBC:

1) Convert βvalues to Mvalues: Mvalue = log2(βvalue/(1 – βvalue))

2) Determine peaks from Infinium I and II independently using kernel density estimation with Gaussian smoothing function and a band-width = 0.5. Unmethylated peak summits were computed as SU = argmax (density Mvalue) for negative Mvalues for both Infinium I and II. Similarly, methylated peak summits were computed as SM = argmax (density Mvalue) for positive Mvalues


3) Rescale raw Mvalues using peak summits as reference to get corrected Mvalues

• The corrected Mvalues were then obtained by rescaling independently negative and positive Mvalues using the distance between the peak summits and zero.

• For negative Mvalues the corrected Mvalues were computed as follows: corrected Mvalue = Mvalue/σU where σU is the distance between the peak summit and zero (σU = 0 - SU).

• Corrected positive Mvalues were computed using the formula: corrected Mvalue = Mvalue/σM with σM = SM - 0.


4) Rescale corrected M-values to match Infinium I range, then convert back to β-values

• To convert the corrected M-values back to β-values, the M-values were first rescaled to match the Infinium I range. Negative M-values were rescaled by the Infinium I σU (rescaled M-value = corrected M-value × σU) and positive M-values by the Infinium I σM (rescaled M-value = corrected M-value × σM). Finally, rescaled M-values were converted to β-values by means of the relation $\beta = 2^{M}/(2^{M} + 1)$, where M is the rescaled M-value.
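The four PBC steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation (the slides note that PBC is implemented in the R package IMA): the function names are hypothetical, the kernel density is evaluated on a simple grid, and the grid step of 0.05 is an arbitrary choice. Input β-values must lie strictly between 0 and 1, and both probe types must have probes on each side of M = 0.

```python
import math


def beta_to_m(beta):
    """Step 1: M-value = log2(beta / (1 - beta))."""
    return math.log2(beta / (1.0 - beta))


def m_to_beta(m):
    """Inverse relation: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def peak_summit(m_values, bandwidth=0.5, grid_step=0.05):
    """Step 2: argmax of a Gaussian kernel density estimate on a grid."""
    lo, hi = min(m_values), max(m_values)
    n_steps = int((hi - lo) / grid_step)
    best_x, best_density = lo, -1.0
    for k in range(n_steps + 1):
        x = lo + k * grid_step
        density = sum(math.exp(-0.5 * ((x - m) / bandwidth) ** 2)
                      for m in m_values)
        if density > best_density:
            best_x, best_density = x, density
    return best_x


def peak_based_correction(beta_type1, beta_type2):
    """Rescale type2 beta-values using type1 and type2 peak summits."""
    m1 = [beta_to_m(b) for b in beta_type1]
    m2 = [beta_to_m(b) for b in beta_type2]
    # Distances between each peak summit and zero (all positive).
    sigma_u1 = 0.0 - peak_summit([m for m in m1 if m < 0])
    sigma_m1 = peak_summit([m for m in m1 if m > 0]) - 0.0
    sigma_u2 = 0.0 - peak_summit([m for m in m2 if m < 0])
    sigma_m2 = peak_summit([m for m in m2 if m > 0]) - 0.0
    corrected = []
    for m in m2:
        if m < 0:
            rescaled = m / sigma_u2 * sigma_u1   # steps 3 and 4, negative side
        else:
            rescaled = m / sigma_m2 * sigma_m1   # steps 3 and 4, positive side
        corrected.append(m_to_beta(rescaled))
    return corrected
```

Note that the sketch fails in exactly the situation the slides warn about later: if a probe type has no well-defined peak on one side of zero, the summit estimate is unreliable.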

(A) Bar plots indicating the range of β-values generated for HCT116 wild-type (WT) sample (r3) with the Infinium I and Infinium II assays. (B) Density plots of the β-values for the two Infinium assay types considered for HCT116 WT sample (r3). (C) Box plots of probe-wise variance between the three replicates of HCT116 WT (r1, r2 and r3) probes (outliers not drawn). On the left part of the figure, β-values have undergone no correction (raw data); on the right part, they have been subjected to the peak-based correction.

Data: eight tumor samples, eight normal breast tissue samples


• PBC efficiently corrects for the InfI/InfII shift and improves results.

• PBC implemented in R package Illumina Methylation Analyzer (IMA)

• However, two recent studies have exposed potential problems with PBC

• PBC depends on bimodal shape of methylation density profiles. It breaks down when the methylation density distribution does not exhibit well-defined peaks/modes (Maksimovic et al., 2012, Touleimat and Tost, 2012)

• One proposed solution is Subset Quantile Normalization (SQN) (Touleimat and Tost, 2012)

• Another solution is Beta-MIxture Quantile dilation (BMIQ) (Teschendorff et al. 2012)

Normalization technique SQN

• In general, β-values distributions should be normalized using standard approaches, such as quantile normalization for inter-sample normalization. However, three constraints prohibit such a straightforward approach for the two different assays on the 450k beadchip:

1) The numbers of InfI (28%) and InfII (72%) probes differ, which prevents computing a common set of reference quantiles

2) The population to ‘correct’ (InfII) is the larger one and may therefore bias the distribution of the other population (InfI)

3) There is a large imbalance in the proportions of Inf I and Inf II probes covering the different CpG and gene-sequence regions

• A global standardization of methylation values distributions may lead to a dramatic loss of information because the variation of the methylation status may be specific for probes covering different subcategories of CpG

• SQN was proposed to solve the first two issues: the signal is split between type1 and type2 probes, and the type2 probes are ‘anchored’ to the more stable and accurate type1 probes.

How does SQN work?

• Reference quantiles of a target set of features are estimated from the smaller set of features used as ‘anchors’ that are considered to be more reliable and stable.

• Modifies the values of the target distribution based on rank equivalence

• Correct the data so that non-anchor and anchor probes of the same percentiles will have the same value.

• Use InfI signals as the anchors to estimate a reference distribution of quantiles, then use this reference to estimate a target distribution of quantiles for InfII probes

• This should provide an accurate normalization of InfI/InfII probes and correct for the shift

• Implemented two versions of SQN approach using provided Illumina annotations.

1) Based on the ‘relation to CpG’

2) Based on the ‘relation to gene sequence’

Touleimat,N. and Tost,J. (2012)
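The rank-equivalence idea behind SQN can be sketched as below. This is a simplified illustration, not the pipeline of Touleimat and Tost: the function names are hypothetical, ties are not treated specially, and the mapping is applied to one flat set of values rather than separately within each annotation category.

```python
def quantile(sorted_values, p):
    """Linearly interpolated quantile of sorted_values at p in [0, 1]."""
    pos = p * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def anchor_to_reference(anchor_values, target_values):
    """Give each target value the anchor quantile of the same percentile.

    anchor_values: values of the anchor probes (for example InfI).
    target_values: values of the probes to correct (for example InfII).
    The output keeps the input order of target_values.
    """
    reference = sorted(anchor_values)
    n = len(target_values)
    order = sorted(range(n), key=lambda i: target_values[i])
    normalized = [0.0] * n
    for rank, i in enumerate(order):
        p = rank / (n - 1) if n > 1 else 0.5
        normalized[i] = quantile(reference, p)
    return normalized
```

Because the reference quantiles are interpolated, the anchor and target sets may have different sizes, which is the first constraint listed for the 450k array. The mapping preserves the ranking of the target values.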

Verification of SQN

• Touleimat,N. and Tost,J. (2012) verified their results using Pyrosequencing

• A technique which, according to their paper, provides high quantitative precision and provides data with single-nucleotide resolution.

• Chose 13 probes for comparison which had to meet the following criteria:

• Stable methylation values between samples of the same phenotype (β SD < 0.1)

• Differentially methylated (differential methylation > 20%) between samples of different phenotypes

• Most importantly, a large difference between median β-values obtained with each variant of their preprocessing pipeline

• Their results, Table 1 on next slide, show SQN using the relation to CpG annotations to identify category-related anchors provided the greatest number of closest methylation values (n = 7) to those obtained by pyrosequencing for the very same CpG.

• Note: With the exception of normalization method F, most performed fairly well with G being best

Median of paired differences of Methylation values to PS

Verification of SQN Cont.

• Their results, Table 2 on next slide, also show the SQN approach, together with the peak-based correction approach, provided the smallest absolute differences in the methylation values when compared with pyrosequencing-based methylation values.

• Note: Most performed fairly well with G and E tied for best results

Median of DNA methylation differences

Subset Quantile Normalization (SQN) Results

• In general, SQN works well and avoids the sensitivity to variations in the shape of the methylation density curves seen with PBC.

• However, SQN requires a separate normalization to be performed on selected subsets of probes that are matched for biological characteristics (e.g. CpG density).

• SQN depends on a priori choices of which biological characteristics to use when matching the type1 and type2 distribution

• Another model, BMIQ, is assumption-free, as it does not require a separate normalization to be performed

Beta-MIxture Quantile dilation (BMIQ)

• The proposed technique aims to adjust the beta-values of type2 design probes into a statistical distribution characteristic of type1 probes, in order to make their statistical distributions comparable. 3 steps:

1. Assign probes to methylation states

2. Transform probabilities into quantiles

3. Perform methylation dependent dilation transformation to preserve the monotonicity and continuity of the data


• Authors verified data by comparing results from tumor tissue samples to other known methods. After assessment, BMIQ improves on ‘no normalization’ and compares favorably to other methods of normalization with:

• Improved robustness of the normalization procedure

• Reduced technical variation and bias of type2 probe values

• Elimination of type1 enrichment bias caused by lower dynamic range of type2 probes

• Code available at http://code.google.com/p/bmiq/downloads/list

BMIQ INPUT

• ### beta.v: vector consisting of beta-values for a given sample. NAs are not allowed. Beta-values that are exactly 0 or 1 will be replaced by the min positive or max value below 1, respectively.

• ### design.v: corresponding vector specifying probe design type (1=type1,2=type2). This must be of the same length as beta.v and in the same order.

• ### doH: perform normalization for hemimethylated type2 probes. By default TRUE.

• ### nfit: number of probes of a given design to use for the fitting. Default is 50000. Smaller values (~10000) will make BMIQ run faster at the expense of a small loss in accuracy. For most applications, 10000 is ok.

• ### nL: number of states in beta mixture model. 3 by default. At present BMIQ only works for nL=3.

• ### th1.v: thresholds used for the initialization of the EM-algorithm, they should represent best guesses for calling type1 probes hemi-methylated and methylated, and will be refined by the EM algorithm. Default values work well in most cases.

• ### th2.v: thresholds used for the initialization of the EM-algorithm, they should represent best guesses for calling type2 probes hemi-methylated and methylated, and will be refined by the EM algorithm. By default this is null, and the thresholds are estimated based on th1.v and a modified PBC correction method.

• ### niter: maximum number of EM iterations to do. By default 5.

• ### tol: tolerance threshold for EM algorithm. By default 0.001.

• ### plots: logical specifying whether to plot the fits and normalized profiles out. By default TRUE.

• ### sampleID: the ID of the sample being normalized.


• ### OUTPUT

• ### A list with the following elements:

• ### nbeta: the normalized beta-profile for the sample

• ### class1: the assigned methylation state of type1 probes

• ### class2: the assigned methylation state of type2 probes

• ### av1: mean beta-values for the nL classes for type1 probes.

• ### av2: mean beta-values for the nL classes for type2 probes.

• ### hf: the "Hubble" dilation factor

• ### th1: estimated thresholds used for type1 probes

• ### th2: estimated thresholds used for type2 probes
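The requirement stated for beta.v (no NAs, and exact 0s and 1s replaced by the smallest positive value and the largest value below 1) can be illustrated with a short Python sketch. This is not the BMIQ code, which is distributed as R; the function name is hypothetical and missing values are assumed to appear as None or NaN.

```python
def clean_beta_vector(beta_v):
    """Replace exact 0s and 1s as described for the beta.v input.

    Raises ValueError if a missing value is present, since NAs are not
    allowed, or if no value lies strictly between 0 and 1.
    """
    if any(b is None or b != b for b in beta_v):   # b != b detects NaN
        raise ValueError("missing values are not allowed")
    interior = [b for b in beta_v if 0.0 < b < 1.0]
    if not interior:
        raise ValueError("no beta-value strictly between 0 and 1")
    min_positive = min(interior)
    max_below_one = max(interior)
    return [min_positive if b <= 0.0 else
            max_below_one if b >= 1.0 else b
            for b in beta_v]
```

Keeping every value strictly inside (0, 1) matters because the beta density and the M-value transformation are not finite at the boundaries.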

The BMIQ paper used ten 450k datasets

• Datasets 1 and 2: (BT) and (CL) are subsets of the dataset considered in Dedeurwaerder et al. (2011): eight fresh frozen (FF) breast tumors and eight normal breast tissue specimens [hereafter referred to as (BT)], as well as the three replicates from the HCT116 WT cell-line [hereafter referred to as (CL)]. For these cell-lines, matched bisulphite pyrosequencing (BPS) data were available for nine type2 probes.

• Datasets 3 and 4: (FFPE) and (FF) consist of 32 formalin-fixed paraffin-embedded (FFPE) head and neck cancers (HNCs), of which 18 were HPV+ and 14 HPV-, as well as five fresh frozen HNCs (FF), of which 2 were HPV+ and 3 HPV-. Available from GEO under accession number GSE38271.

• Dataset 5: (GBM) consists of 81 glioblastoma multiformes (GBMs) (Turcan et al., 2012), 49 of which were categorized as CpG island methylator positive (CIMP+) and 32 as CIMP-.

• Datasets 6–10: TCGA, LIV, LC, BLDC, HCC samples are all from the TCGA: Dataset6 (TCGA) consists of 10 samples as provided in the Bioconductor data package TCGAmethylation 450k, Dataset7 (LIV) consists of nine normal liver tissue samples from Batch203 in the TCGA data portal, Dataset8 (LC) consists of 22 lung cancer samples from Batch196, Dataset9 (BLDC) consists of 12 bladder cancer samples from Batch86 and Dataset10 (HCC) consists of 10 hepatocellular carcinoma samples from Batch153.

BMIQ normalization criteria

i. Must allow for the different biological characteristics of type1 and type2 probes

• Type1 probes are significantly more likely to map to CpG islands than type2 probes, and hence the relative proportion of methylated and unmethylated probes will vary between the two designs. In the case of the type2 probes, this means that these proportions must be invariant under the normalization transformation.

ii. The transformation of the type2 probe values should reduce the bias

• This amounts to matching the density distributions of the two design types, especially at the unmethylated and methylated extremes.

iii. The transformation must be monotonic

• Relative ranking of beta values of the type2 probes must be invariant under the transformation.

BMIQ normalization strategy

• Fit a three state beta mixture model (unmethylated-U, hemimethylated-H, fully methylated-M) to type1 and type2 probes separately using three steps

• Note: Let $\{(a^{I}_{U}, b^{I}_{U}), (a^{I}_{H}, b^{I}_{H}), (a^{I}_{M}, b^{I}_{M})\}$ denote the parameters of the three beta distributions for the type1 probes, and similarly let $\{(a^{II}_{U}, b^{II}_{U}), (a^{II}_{H}, b^{II}_{H}), (a^{II}_{M}, b^{II}_{M})\}$ describe the estimated parameters of the three beta components for the type2 probes. State membership of individual probes is determined by the maximum probability criterion.

Beta Distribution

• Family of continuous probability distributions defined on the interval [0, 1], parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution.

http://en.wikipedia.org/wiki/Beta_distribution
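For reference, the standard density of the beta distribution described above can be written as follows. Here α and β are the two shape parameters, not the methylation β-value or the offset α used in the β-value formula.

```latex
f(x;\alpha,\beta) \;=\; \frac{x^{\alpha-1}\,(1-x)^{\beta-1}}{B(\alpha,\beta)},
\qquad
B(\alpha,\beta) \;=\; \frac{\Gamma(\alpha)\,\Gamma(\beta)}{\Gamma(\alpha+\beta)},
\qquad 0 \le x \le 1,\ \alpha > 0,\ \beta > 0 .
```

The mean is α/(α + β), so a component with α much smaller than β concentrates near 0 (unmethylated-like) and one with α much larger than β concentrates near 1 (methylated-like).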

BMIQ normalization strategy 3 steps cont.

1. For those type2 probes assigned to the U-state, transform their probabilities of belonging to the U-state to quantiles using the inverse of the cumulative beta distribution with beta parameters $(a^{I}_{U}, b^{I}_{U})$ estimated from the type1 U component. Let $n^{II}_{U}$ denote the normalized values of the type2 U-probes.

2. For those type2 probes assigned to the M-state, transform their probabilities of belonging to the M-state to quantiles using the inverse of the cumulative beta distribution with beta parameters $(a^{I}_{M}, b^{I}_{M})$ estimated from the type1 M component. Let $n^{II}_{M}$ denote the normalized values of the type2 M-probes.

3. For the type2 probes assigned to the H-state, we perform a dilation (scale) transformation to ‘fit’ the data into the ‘gap’ with endpoints defined by $\max\{n^{II}_{U}\}$ and $\min\{n^{II}_{M}\}$.
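One way to read steps 1 and 2 in symbols is sketched below. It assumes that the "probability" of a type2 probe is its cumulative probability under the fitted type2 component, and it writes F for the cumulative beta distribution function; the per-probe index i and this exact form are assumptions of the sketch, and the separate treatment of the two tails is left out.

```latex
% Step 1 (U-state): probability under the type2 fit, quantile under the type1 fit
p_i = F\!\left(\beta_i;\ a^{II}_{U},\ b^{II}_{U}\right),
\qquad
n^{II}_{U,i} = F^{-1}\!\left(p_i;\ a^{I}_{U},\ b^{I}_{U}\right)

% Step 2 (M-state): same construction with the M components
p_i = F\!\left(\beta_i;\ a^{II}_{M},\ b^{II}_{M}\right),
\qquad
n^{II}_{M,i} = F^{-1}\!\left(p_i;\ a^{I}_{M},\ b^{I}_{M}\right)
```

Since F and its inverse are both increasing, each mapping preserves the ranking of the probes within a state, in line with the monotonicity criterion.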

BMIQ normalization procedure, First model each β value

Aside: Expectation Maximization using Beta Mixture

• EM uses a beta-mixture model (from Ji et al. 2005).

• The beta-mixture model deals with a vector of correlation coefficients of gene-expression levels. Correlation coefficients are assumed to come from multiple underlying probability distributions, in our case, beta distributions. To fit the beta distribution, for each correlation coefficient x_i, we apply a linear transformation y_i = (x_i + 1)/2, so that the range of the transformed values is between 0 and 1. The index i represents the gene with respect to which the correlation coefficient is calculated. Let {y_i}, i = 1, ..., n, denote the transformed correlation coefficients, modeled under a mixture of L beta distributions, where n is the total number of observations and L is the number of components in the mixture.

• The mixture log-likelihood, Equation (1), and the density of the beta distribution were shown as images on the original slide.

• Use the expectation maximization algorithm (Dempster et al., 1977) to iteratively maximize the log-likelihood and update the conditional probability that y_i comes from the l-th component.

• Consists of 4 steps, ending with: repeat the M-step and E-step until the change in the value of the log-likelihood in Equation (1) is negligible.

Ji et al. 2005

• The EM algorithm yields the final estimated posterior probability $z^{*}_{il}$, the value of which represents the posterior probability that correlation coefficient y_i comes from component l.
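The EM loop for a beta mixture can be sketched in pure Python as follows. This is an illustration only, not the procedure of Ji et al. or the BMIQ code: the function names are hypothetical, and the M-step uses a weighted method-of-moments update for the shape parameters as a simple stand-in for the exact maximization, so the log-likelihood is not guaranteed to increase at every iteration. All input values must lie strictly between 0 and 1.

```python
import math


def beta_pdf(y, a, b):
    """Density of the beta distribution with shape parameters a and b."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(y)
                    + (b - 1.0) * math.log(1.0 - y))


def e_step(ys, weights, params):
    """Posterior probability z[i][l] that ys[i] comes from component l."""
    z = []
    for y in ys:
        joint = [w * beta_pdf(y, a, b) for w, (a, b) in zip(weights, params)]
        total = sum(joint)
        z.append([j / total for j in joint])
    return z


def m_step(ys, z):
    """Update weights and shapes (method-of-moments approximation)."""
    n, n_comp = len(ys), len(z[0])
    weights, params = [], []
    for l in range(n_comp):
        resp = [row[l] for row in z]
        total = sum(resp)
        mean = sum(r * y for r, y in zip(resp, ys)) / total
        var = sum(r * (y - mean) ** 2 for r, y in zip(resp, ys)) / total
        common = mean * (1.0 - mean) / var - 1.0
        params.append((mean * common, (1.0 - mean) * common))
        weights.append(total / n)
    return weights, params


def log_likelihood(ys, weights, params):
    return sum(math.log(sum(w * beta_pdf(y, a, b)
                            for w, (a, b) in zip(weights, params)))
               for y in ys)


def fit_beta_mixture(ys, weights, params, max_iter=50, tol=1e-3):
    """Alternate E- and M-steps until the log-likelihood change is below tol."""
    previous = log_likelihood(ys, weights, params)
    for _ in range(max_iter):
        z = e_step(ys, weights, params)
        weights, params = m_step(ys, z)
        current = log_likelihood(ys, weights, params)
        if abs(current - previous) < tol:
            break
        previous = current
    return weights, params, e_step(ys, weights, params)
```

Each probe can then be assigned to the component with the largest posterior probability, which corresponds to the maximum probability criterion mentioned for BMIQ.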

BMIQ normalization procedure

• Results from the EM algorithm are two-tailed, so the beta values need to be subdivided into those which fall left or right of the mean, with unmethylated values to the left and methylated values to the right.

• Use these to normalize U and M beta values

• Now need to normalize H beta values

• Normalized beta-values for the H-probes are given by the conformal (shift+dilation) transformation based on the max{U} and min{M} normalized values, which define the endpoints of the gap

• This conformal transformation involves a non-uniform rescaling of the H probe beta values since it depends on the beta-value of the probe. This is absolutely key in order to avoid gaps or holes from emerging in the normalized distribution

• It is important to normalize with respect to the tail in which the beta value falls, because the left tail of the methylated type2 distribution is generally not well described by a beta-distribution, presumably a result of dye bias. The same holds for the right tail of the unmethylated distribution.
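The shift-plus-dilation applied to the H probes can be sketched as a linear map from the original range of the H values onto the gap between the normalized U and M probes. The function name is hypothetical, and the small offsets that a full implementation would keep between the H probes and their U and M neighbours are omitted here for brevity.

```python
def dilate_h_probes(h_betas, gap_low, gap_high):
    """Map H-probe beta-values linearly onto [gap_low, gap_high].

    gap_low:  largest normalized value among the U probes.
    gap_high: smallest normalized value among the M probes.
    Requires at least two distinct values in h_betas.
    """
    old_low, old_high = min(h_betas), max(h_betas)
    factor = (gap_high - gap_low) / (old_high - old_low)  # dilation factor
    return [gap_low + factor * (b - old_low) for b in h_betas]


# The H probes are stretched to fill the gap without changing their order.
print(dilate_h_probes([0.30, 0.40, 0.50], 0.20, 0.80))
```

The amount by which each probe moves depends on its own beta-value, while the map as a whole stays monotonic and leaves no hole between the U, H and M groups.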

BMIQ normalization procedure

• Resulting thresholds would normally fall within the ranges 0.2–0.3 and 0.6–0.8, respectively. Having thus identified reasonable initial estimates for the weights $\{\pi^{II}_{U}, \pi^{II}_{H}, \pi^{II}_{M}\}$, the algorithm will then automatically determine the unmethylated, hemimethylated and methylated fractions for each sample individually.

Improved robustness of BMIQ

• BMIQ does not use the type1 modes to adjust the type2 data, and hence BMIQ normalization of the type2 probes generated a much smoother density distribution, suggestive of an improved normalization framework (Fig. 1B)

BMIQ reduces technical variation (ERROR)

• BMIQ not only led to a significant improvement, but was also marginally better than PBC (Fig. 2B)

Manhattan distance – distance between two points in a grid based on a strictly horizontal and/or vertical path

BMIQ reduces bias of type2 methylation values

• BMIQ significantly reduced the bias of type2 values (Fig. 3), although there was no improvement over PBC itself

BMIQ eliminates the type1 enrichment bias

• To assess any potential bias towards type1 probes, the authors computed, for a given number of top-ranked probes, the odds ratio (OR) of relative enrichment of type1 over type2 probes. BMIQ successfully avoided any type1/type2 enrichment bias in all three datasets, indicative of an improved normalization of type2 values

Reduced technical variability within probe clusters

• Defined probe clusters as contiguous regions containing at least seven probes with no two adjacent probes separated by >300bp.

• Within these probe clusters, paper posited that pairs of adjacent probes, one from each design and within 200 bp of each other, should have similar methylation values.

• To compare normalization algorithms, the authors evaluated which one minimizes the absolute difference in methylation between such closely adjacent type1-type2 pairs
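The probe-cluster definition and the pairing rule above can be sketched as follows. The function names are hypothetical, all positions are assumed to lie on a single chromosome, and "adjacent" is read as neighbouring probes in genomic order.

```python
def probe_clusters(positions, max_gap=300, min_probes=7):
    """Group probe positions into contiguous clusters.

    A cluster is a run of probes in which no two adjacent probes are
    separated by more than max_gap bp; runs with fewer than min_probes
    probes are discarded.
    """
    clusters, current = [], []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_probes:
                clusters.append(current)
            current = []
        current.append(pos)
    if len(current) >= min_probes:
        clusters.append(current)
    return clusters


def adjacent_mixed_pairs(cluster, design, max_dist=200):
    """Neighbouring probes of different design within max_dist bp.

    design maps a probe position to 1 (type1) or 2 (type2).
    """
    return [(a, b) for a, b in zip(cluster, cluster[1:])
            if design[a] != design[b] and b - a <= max_dist]


def mean_abs_difference(pairs, beta):
    """Mean absolute methylation difference over the pairs (beta: pos -> value)."""
    return sum(abs(beta[a] - beta[b]) for a, b in pairs) / len(pairs)
```

A normalization method can then be scored by computing `mean_abs_difference` on its normalized values; a smaller value means adjacent type1 and type2 probes agree more closely.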

BMIQ robustly identifies features associated with HPV status

• Paper attempted to verify that a reduction in technical variation obtained with BMIQ is not at the expense of reduced biological signal.

• Used a training test set strategy to identify features in a training set and calling them true positives if validated in a test set.

• Allows for a comparison of sensitivity and positive predictive value (PPV) between the different normalization methods.

• BMIQ identified more differentially methylated features than PBC or SWAN, not at the expense of a smaller PPV, and so, overall, BMIQ identified more true positives

Results

• Because of the different nature of type1 and type2 probes on the Illumina 450k Methylation Beadchip, a different kind of normalization is necessary than what was used on 27k data

• There are several methods to do this; each is better than performing quantile normalization without discriminating probe types.

• Normalization with regard to probe type improved robustness, reduced technical variation, reduced bias of type2 methylation values and eliminated type1 enrichment bias
