Statistics for epigenetic epidemiology

Upload: doantu

Post on 17-Mar-2018

Overview

• (Some) considerations for epigenetic analyses

• Common techniques being utilised

• Examples

Study designs

• Locus-specific analyses

  • Candidate gene studies and validation of microarray data

  • Pyrosequencing, Sequenom Epityper

• High throughput microarray analyses

  • Exploratory

  • Illumina (27K, 450K, VeraCode)

• Methylation metric

  • Continuous values ranging from 0-1 (or 0-100)

  • Interpreted as methylation percentage

  • One for each interrogated CpG site

  • Beta value = methylated signal / (unmethylated signal + methylated signal + 100)
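The beta value formula can be sketched in a few lines of Python. This is a minimal illustration, not any platform's actual implementation: the function name `beta_value` and the example intensities are hypothetical, and the offset of 100 is taken from the formula on the slide.

```python
def beta_value(methylated: float, unmethylated: float, offset: float = 100.0) -> float:
    """Return the beta value for one CpG site from its two signal intensities.

    The offset (100 on the slide) keeps the ratio stable when both
    intensities are close to zero.
    """
    if methylated < 0 or unmethylated < 0:
        raise ValueError("signal intensities must be non-negative")
    return methylated / (unmethylated + methylated + offset)


# Hypothetical intensities, for illustration only.
print(round(beta_value(methylated=9000, unmethylated=900), 3))  # 0.9
```

Because of the offset, the result always stays below 1, and a site with no signal at all gets a beta value of 0 rather than an undefined ratio.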

Non-normality

• Often not normally distributed

• Can be profoundly skewed, bimodal or otherwise multimodal

[Figure: Example histograms with normal density curves for cg00025357, cg00026546, TACSTD2 and cg00000165. x-axis: Beta Value; y-axis: Density.]

Heteroscedasticity

• Variance across the spectrum of methylation is not equal

• Extreme values of highly methylated and unmethylated DNA show reduced variance compared to intermediate values

[Figure: Standard deviation plotted against mean beta-values across 175 individuals. Panels: Infinium 27k Chip; GoldenGate (1.5k).]

Du et al. BMC Bioinformatics. 2010; 11:587

Relton et al. PLoS ONE. 2012; 7(3):e31821

Non-parametric tests

• Confirm normality first with a distribution diagnostic test

  • Kolmogorov–Smirnov test, Shapiro–Wilk test, Shapiro–Francia test

• Otherwise, perform non-parametric tests instead

Model | Parametric | Non-parametric equivalent
One group or paired/correlated groups: difference in means | Paired t-test | Sign test; Wilcoxon signed-rank
Two independent groups: difference in means | Unpaired t-test | Mann-Whitney U (aka Wilcoxon rank-sum test)
Multiple independent groups: difference in means | One-way ANOVA | Kruskal-Wallis
Multiple independent groups: equality of variances | Bartlett | Levene’s
Paired/correlated groups: correlation | Pearson's r | Spearman’s rho; Kendall's tau
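To make the idea of a "non-parametric equivalent" concrete, the sketch below computes Spearman's rho as Pearson's r applied to ranks. It is a minimal standard-library illustration; the helper names (`midranks`, `pearson_r`, `spearman_rho`) and the data are hypothetical, and it returns only the coefficient, not a p-value.

```python
from statistics import mean


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the midranks of x and y."""
    return pearson_r(midranks(x), midranks(y))


# Hypothetical data: a monotone but non-linear relationship.
methylation = [0.10, 0.20, 0.30, 0.40, 0.90]
phenotype = [1.0, 4.0, 9.0, 16.0, 81.0]
print(spearman_rho(methylation, phenotype))  # 1.0
```

Because only the ordering of the values is used, the skewed value at 0.90 does not pull the rank-based coefficient away from 1, whereas Pearson's r on the raw values is below 1.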

Regression modelling

• Complex regression modelling may be unavoidable

• Two noteworthy points:

1. Assumptions of regression

  • Constant variance of residuals/errors (aka homoscedasticity)

  • Normality of residuals/errors

  • It’s the residuals/errors that matter, not the distribution of the raw variables

2. Exposure vs. outcome

  • Exposure: predictor variables, independent variables

  • Outcome: response variables, dependent variables

  • [Models with skewed exposure variables are less susceptible to violations]

Regression modelling

Possible solutions/options

• Try non-parametric/alternative regression models

• Beta distribution

• Generalised linear models

• Generalised least squares models

• Test the robustness of the model e.g. with resampling methods (such as: bootstrapping, permutations, cross validation)

• Validate with non-parametric and stratification analyses

• Manipulate the methylation data

• Categorise into groups (e.g. tertiles, quartiles; low, medium, high)

• Transformations

• (switch the model around such that methylation is a predictor)
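As one example of the resampling methods listed above, the following is a minimal sketch of a two-sided permutation test for a difference in mean methylation between two groups. The function name `permutation_test`, its parameters and the data are hypothetical; it is an illustration of the general technique under those assumptions, not the procedure used in any of the cited studies.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical beta values for two groups.
exposed = [0.10, 0.12, 0.11, 0.13, 0.09, 0.10]
unexposed = [0.80, 0.82, 0.81, 0.79, 0.83, 0.80]
p = permutation_test(exposed, unexposed, n_permutations=2000)
print(p < 0.05)  # True
```

The test makes no normality assumption: the reference distribution is built from the data by shuffling group labels, which is why it is a useful robustness check alongside a parametric model.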

Issues with transformations

• Common transformations don’t help

• Difficult to interpret the results

• May limit power of the analyses

[Figure: Histogram of beta values for cg00002593 (x-axis: Beta Value; y-axis: Density), and histograms by transformation of the transformed methylation values for cg00002593. Panels: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic.]

Transformation | Formula | chi2(2) | P(chi2)
raw | | 17.57 | 0.000
cubic | beta^3 | 15.25 | 0.000
square | beta^2 | 16.41 | 0.000
identity | | 17.57 | 0.000
square root | sqrt(beta) | 18.15 | 0.000
log | log(beta) | 18.72 | 0.000
1/(square root) | 1/sqrt(beta) | 19.30 | 0.000
inverse | 1/beta | 19.87 | 0.000
1/square | 1/beta^2 | 21.00 | 0.000
1/cubic | 1/beta^3 | 22.11 | 0.000

Other transformations

Transformations addressing the issue of heteroscedasticity

• M-value (Du et al. BMC Bioinformatics 2010 11:587)

• Arcsine (variance-stabilising transformation) (Lin et al. Nucleic Acids Res. 2008 Feb;36(2):e11)
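A minimal sketch of the M-value transformation follows, assuming the commonly used relationship M = log2(beta / (1 - beta)), that is, a base-2 logit of the beta value with the offset term ignored. The function names are hypothetical. The arcsine transformation is not shown because the slides do not state its exact form.

```python
import math


def m_value(beta: float) -> float:
    """Base-2 logit of a beta value: M = log2(beta / (1 - beta))."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


def beta_from_m(m: float) -> float:
    """Inverse transformation: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Hypothetical beta values across the methylation range.
for beta in (0.05, 0.5, 0.8, 0.95):
    print(beta, round(m_value(beta), 2))
```

The transformation stretches the compressed ends of the 0-1 scale: a beta value of 0.5 maps to 0, while values near 0 or 1 map to large negative or positive M-values.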

[Figure: Histograms for cg00002593 after transformation (y-axis: Density). Panels: M-value; arcsin transformed beta value.]

Multiple comparisons

Alpha inflation = the more tests you perform, the more likely you are to see a false-positive effect

• Family wise error rate (FWER):

• Relates to the overall type 1 error for the group of tests performed

• The probability of any one of the tests being significant when in fact all hypotheses are null (or the overall hypothesis is null)

• Controls the overall chance of reporting a false positive finding

• Popular example: Bonferroni correction (α / number of tests)

  • E.g. α = 0.05 and 8 tests

  • Individual test error rate = 0.05/8 = 0.00625

  • FWER is maintained at 0.05

• Does not require the individual tests to be independent, but can be overly stringent when they are correlated

• Important in validation and replication studies
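The Bonferroni arithmetic above can be sketched as follows. The function names and the list of p-values are hypothetical and serve only to illustrate the calculation.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the FWER at alpha."""
    return alpha / n_tests


def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: each p multiplied by the number of tests, capped at 1."""
    n_tests = len(p_values)
    return [min(1.0, p * n_tests) for p in p_values]


# Hypothetical p-values from 8 tests.
p_values = [0.001, 0.004, 0.012, 0.02, 0.03, 0.2, 0.5, 0.9]
threshold = bonferroni_threshold(0.05, len(p_values))
print(threshold)                               # 0.00625
print(sum(p <= threshold for p in p_values))   # 2
```

Comparing raw p-values with the reduced threshold and comparing adjusted p-values with the original alpha are two views of the same decision rule.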

Multiple comparisons

• False discovery rate (FDR):

• Controls the expected proportion of incorrectly rejected null hypotheses

• The proportion of false positives among all significant tests

• Reports q-values

• The probability that an individual finding is in fact a false positive

• Set the FDR at a level you are happy with by deciding on a value of q to accept as significant e.g. 0.05 (much like a p-value)

• Popular example: Benjamini & Hochberg (and variations on this)

• Less conservative than FWER

• Greater power but with an increased risk of type 1 errors

• Used in exploratory/hypothesis generating studies which require replication

• [Both methods are based on parametric assumptions]

  • Non-parametric data may require permutations instead
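A minimal sketch of the Benjamini & Hochberg step-up adjustment is given below. The function name `benjamini_hochberg` and the p-values are hypothetical; the adjusted values it returns are what many packages report as q-values, although some q-value methods differ in detail.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from 8 tests.
p_values = [0.001, 0.004, 0.012, 0.02, 0.03, 0.2, 0.5, 0.9]
q_values = benjamini_hochberg(p_values)
print(sum(q <= 0.05 for q in q_values))  # 5
```

With these illustrative p-values, five tests pass an FDR threshold of 0.05, whereas a per-test threshold of 0.05/8 would retain only the two smallest, which shows the gain in power that comes with accepting some false discoveries.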

Analysis types

Locus-specific and moderate-throughput analyses

• Interested in univariate associations between methylation at individual CpG sites and phenotype

• Popular options

• Sign test, Mann-Whitney and Kruskal-Wallis for comparing average methylation levels across categorical groups

• Levene Test for Equality of Variances: compare degree of variance between groups

• Spearman’s correlation to compare with continuous phenotypes

• (Regression)
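For a single CpG site and two groups, the Mann-Whitney U statistic can be sketched by direct pair counting. This is an illustrative standard-library version with hypothetical names and data; it returns only the statistic, and a p-value would come from exact tables or a normal approximation in a statistics package.

```python
def mann_whitney_u(group_a, group_b):
    """Return (U_a, U_b) for two independent groups.

    U_a counts the pairs (a, b) in which a exceeds b, with ties counted
    as one half; U_a + U_b always equals len(group_a) * len(group_b).
    """
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    return u_a, len(group_a) * len(group_b) - u_a


# Hypothetical beta values at one CpG site.
cases = [0.82, 0.85, 0.88, 0.91]
controls = [0.70, 0.74, 0.79, 0.84]
print(mann_whitney_u(cases, controls))  # (15.0, 1.0)
```

A U value close to 0 or to the product of the group sizes indicates that one group's values are almost entirely above the other's, regardless of how the values are distributed.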

Analysis types

High-throughput candidate gene and microarray analyses

• Interested in identifying features and patterns within the methylation profile and reducing the dimensionality of the data

• Popular options

• Supervised clustering (aka class discrimination): aims to predict the known phenotype based on the DNA methylation profile.

• Unsupervised clustering (aka class discovery, e.g. hierarchical clustering): groups samples together based on how similar their DNA methylation data are, irrespective of any other phenotype. The defined clusters can then be analysed with regard to the phenotypes of interest.

• Often presented along with heatmaps which enable visual inspection of patterns within the samples and methylation profiles.

• Principal component analysis (PCA): aims to represent a large number of correlated measures (e.g. CpG sites) by a smaller set of uncorrelated variables, which together capture the majority of the variance present in the dataset. The resulting variables can then be utilised in subsequent analyses involving phenotypes of interest.

• Surrogate Variable Analysis (SVA) and Independent SVA (I-SVA): aims to remove the variance caused by confounding factors (e.g. technical error/artefact) and leave only biological variation for subsequent analysis.

Example 1

Variation, patterns, and temporal stability of DNA methylation: considerations for epigenetic epidemiology

• Netherlands twin biobank samples (n 100)

• 164 CpG sites spanning 16 candidate gene loci

• Sequenom Epityper

• Assessed inter-individual variability, blood cell heterogeneity, stability of DNA methylation over time and correlation between blood & buccal cells

• Presented a range of summary statistics and performed various non-parametric tests such as

• Spearman’s correlation

• Levene’s test for equality of variance

Talens et al. FASEB, 2010

Example 1

Inter-individual variation in DNA methylation

• Summarised the distribution of methylation across the candidate CpG sites

• Average methylation levels ranged from 0 to 98%; SDs ranged from 0 to 15%

• Presented the data graphically using dot plots

Talens et al. FASEB, 2010

Figure 1: Inter-individual variation in DNA methylation

Example 1

Stability over time and differences between tissues

• Paired samples from two time points (10-20 years apart)

• Paired samples from blood and buccal cells

• Average difference and SD of differences were greater between tissues than over time within tissue.

Table 3: Comparison of DNA methylation in blood samples. Values are mean ± SD

Locus | Old blood, methylation (%) | New blood, methylation (%) | Difference (%) | Spearman’s ρ
IL10 | 22.4 ± 9.0 | 25.2 ± 6.6 | −2.8 ± 9.1 | 0.422
IGF2R | 65.8 ± 16.8 | 67.6 ± 17.7 | −1.8 ± 8.1 | 0.883
LEP | 20.0 ± 11.5 | 21.8 ± 13.0 | −1.8 ± 6.1 | 0.895

Table 4: Comparison of DNA methylation in recent blood and buccal cell samples. Values are mean ± SD

Locus | Blood, methylation (%) | Buccal cells, methylation (%) | Difference (%) | Spearman’s ρ
IL10 | 24.8 ± 7.1 | 64.9 ± 18.8 | −40.1 ± 17.1 | 0.442
IGF2R | 68.3 ± 17.4 | 81.6 ± 12.4 | −13.3 ± 10.6 | 0.827
LEP | 21.7 ± 13.7 | 11.8 ± 8.8 | 9.9 ± 9.9 | 0.798

Talens et al. FASEB, 2010

Example 2

DNA Methylation Patterns in Human Fetal Heart Development

• DNA samples from 13 heart and 13 paired placenta samples

• Taken at 3 gestational time-points: 7-8 weeks, 9-11 weeks, 12-14 weeks

• VeraCode Custom panel

• 96 CpG sites spanning 31 heart-related candidate genes

• Q1: Does DNA methylation in heart tissue differ over time?

  • Kruskal-Wallis: compared average methylation levels across groups

  • Levene’s test: compared degree of variance in methylation across groups

• Q2: Does DNA methylation in heart tissue differ from that of paired placenta tissue?

  • Spearman’s: assessed correlation between heart and placenta methylation levels

  • Wilcoxon signed-rank: compared average methylation levels across tissues

Spencer et al. in prep.

Example 2

Summary and descriptive statistics of sample and DNA methylation profile

[Figure 1. Overall methylation profile in fetal heart tissue samples]

[Figure 2. Distribution of methylation in fetal heart tissue samples for each CpG site]

Spencer et al. in prep.

Example 2

Differences over time and between tissues

Table 1: Association between methylation levels and developmental stage in heart tissue

CpG site | 7-8 weeks: N | 7-8 weeks: Mean (SD) | 9-11 weeks: N | 9-11 weeks: Mean (SD) | 12-14 weeks: N | 12-14 weeks: Mean (SD) | Kruskal-Wallis χ² | Kruskal-Wallis P-value | Levene’s test F | Levene’s test P-value
1 | 6 | 0.37 (0.07) | 5 | 0.30 (0.07) | 2 | 0.21 (0.06) | 6.58 | 0.04 | 0.44 | 0.66
2 | 6 | 0.87 (0.03) | 5 | 0.87 (0.03) | 2 | 0.69 (0.12) | 4.74 | 0.09 | 6.76 | 0.01
3 | 6 | 0.75 (0.02) | 5 | 0.72 (0.02) | 2 | 0.63 (0.12) | 6.13 | 0.05 | 26.50 | 1.01E-04
4 | 6 | 0.28 (0.15) | 5 | 0.53 (0.08) | 2 | 0.34 (0.15) | 7.22 | 0.03 | 3.19 | 0.08

Table 2: Comparison between paired heart and placenta tissues

All samples

CpG site | N | Mean (SE) difference | Wilcoxon signed-rank z | P | Spearman’s rho | P
1 | 10 | -0.01 (0.05) | -0.15 | 0.88 | -0.41 | 0.24
2 | 10 | 0.15 (0.02) | 2.80 | 0.01 | -0.16 | 0.65
3 | 10 | -0.05 (0.01) | -2.50 | 0.01 | 0.02 | 0.96
4 | 9 | -0.06 (0.08) | -1.36 | 0.17 | 0.25 | 0.52

7-8 weeks

CpG site | N | Mean (SE) difference | Wilcoxon signed-rank z | P | Spearman’s rho | P
1 | 5 | 0.03 (0.06) | 0.40 | 0.69 | -0.60 | 0.28
2 | 5 | 0.13 (0.02) | 2.02 | 0.04 | -0.10 | 0.87
3 | 5 | -0.04 (0.01) | -2.02 | 0.04 | 0.50 | 0.39
4 | 4 | -0.12 (0.10) | -1.10 | 0.27 | -0.40 | 0.60

9-11 weeks

CpG site | N | Mean (SE) difference | Wilcoxon signed-rank z | P | Spearman’s rho | P
1 | 5 | -0.05 (0.07) | -0.94 | 0.35 | -0.30 | 0.62
2 | 5 | 0.17 (0.03) | 2.02 | 0.04 | -0.10 | 0.87
3 | 5 | -0.06 (0.02) | -1.48 | 0.14 | -0.60 | 0.28
4 | 5 | -0.01 (0.12) | -0.67 | 0.50 | 0.00 | 1.00

Spencer et al. in prep.

Example 3

Fryer et al. Epigenetics, 2011

Quantitative, high-resolution epigenetic profiling of CpG loci identifies associations with cord blood plasma homocysteine and birth weight in humans

• DNA samples from 12 fetal cord blood samples

• Analysed ~6.5k sites from Infinium 27k chip

• Performed strict QC measures because of the small sample set

• Unsupervised hierarchical cluster analysis identified 2 distinct clusters of samples

• Compared clinical phenotypes of interest (namely: birthweight centile) across these two groups using Mann-Whitney U tests

• Univariate linear regression analysis with methylation at individual CpG sites as the predictor

Example 4

Tobacco-Smoking-Related Differential DNA Methylation: 27K Discovery and Replication

Breitling et al. AJHG, 2011

• DNA samples from 180 subjects

• Infinium 27k chip

• Mixed linear model: methylation as outcome, smoking status and sex as fixed effects, batch as random effect

• Validated with non-parametric Kruskal-Wallis tests, stratified on sex

• No correction for multiple testing but performed validation and independent replication

Software

• R

  • Freely available statistics platform

  • http://www.r-project.org/

  • Numerous packages written and available for download

  • http://cran.r-project.org/

  • http://www.bioconductor.org/

  • Data processing and analysis:

  • minfi, lumi, methylumi, IMA, limma, HumMeth27QCReport

• Other statistics platforms can perform many of the analyses

  • Stata, SPSS, Minitab, SAS

• Can perform some stats in the accompanying platform software

  • GenomeStudio, Pyro Q-CpG software

Summary

Flowchart for analysis of quantitative DNA methylation. Adapted from E Andres Houseman, 2012 (Chapter 5 from Epigenetic Epidemiology. Karin B. Michels (ed). Springer, 2012)

(It's also worth looking for reviews by Kimberly D Siegmund)