Statistics for epigenetic epidemiology
Overview
• (Some) considerations for epigenetic analyses
• Common techniques being utilised
• Examples
• Locus-specific analyses
  • Candidate gene studies and validation of microarray data
  • Pyrosequencing, Sequenom Epityper
• High-throughput microarray analyses
  • Exploratory
  • Illumina (27K, 450K, VeraCode)
• Methylation metric
  • Continuous values ranging from 0-1 (or 0-100)
  • Interpreted as methylation percentage
  • One for each interrogated CpG site
  • Beta value = methylated signal / (unmethylated signal + methylated signal + 100)
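As a minimal sketch of the metric above, the beta value can be computed directly from the two signal intensities. The function name and the example intensities are illustrative assumptions; only the formula, including the offset of 100, comes from the slide.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Beta value = M / (U + M + offset), following the formula above."""
    if methylated < 0 or unmethylated < 0:
        raise ValueError("signal intensities must be non-negative")
    return methylated / (unmethylated + methylated + offset)


# Hypothetical signal intensities (methylated, unmethylated) for three CpG sites
signals = [(900.0, 0.0), (450.0, 450.0), (0.0, 900.0)]
betas = [beta_value(m, u) for m, u in signals]
percentages = [100.0 * b for b in betas]  # the same metric on a 0-100 scale
```

Because of the offset in the denominator, the beta value stays below 1 even when the unmethylated signal is zero.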
Study designs
Non-normality
• Often not normally distributed
• Can be profoundly skewed, bimodal, other-modal
[Figure: Example histograms with normal density curves. Panels: cg00025357, cg00026546, TACSTD2, cg00000165. x-axis: Beta Value; y-axis: Density]
Heteroscedasticity
• Variance across the spectrum of methylation is not equal
• Extreme values of highly methylated and unmethylated DNA show reduced variance compared to intermediate values
[Figure: Standard deviation of beta values plotted against the mean. Panels: Infinium 27k Chip; GoldenGate (1.5k). x-axis: Mean Beta-values across 175 individuals; y-axis: Standard Deviation]
Du et al. BMC Bioinformatics. 2010; 11:587
Relton et al. PLoS ONE. 2012; 7(3):e31821
Non-parametric tests
• Check normality first with a distribution diagnostic test
  • Kolmogorov–Smirnov test, Shapiro–Wilk test, Shapiro–Francia test
• If the data are not normally distributed, perform non-parametric tests instead
Model | Parametric | Non-parametric equivalent
One group or paired/correlated groups: difference in means | Paired t-test | Sign test; Wilcoxon signed-rank
Two independent groups: difference in means | Unpaired t-test | Mann-Whitney U (aka Wilcoxon rank-sum test)
Multiple independent groups: difference in means | One-way ANOVA | Kruskal-Wallis
Multiple independent groups: equality of variances | Bartlett | Levene’s
Paired/correlated groups: correlation | Pearson’s r | Spearman’s rho; Kendall’s tau
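To illustrate one row of the table, the following is a sketch of a Mann-Whitney U test with an exact two-sided p-value, obtained by enumerating every way of splitting the pooled ranks into two groups. The function names and the beta values are hypothetical, and the enumeration is only practical for small samples; for larger samples an established statistics package would be the usual choice.

```python
from itertools import combinations


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2.0 for v in values]


def mann_whitney_exact(group_a, group_b):
    """Return (U for group_a, exact two-sided p-value)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    centre = n_a * n_b / 2.0  # expected value of U under the null hypothesis

    def u_stat(indices):
        # Rank sum of the chosen group minus its minimum possible rank sum
        return sum(ranks[i] for i in indices) - n_a * (n_a + 1) / 2.0

    observed = u_stat(range(n_a))
    splits = list(combinations(range(n_a + n_b), n_a))
    # Count splits at least as far from the centre as the observed one
    extreme = sum(
        1 for s in splits
        if abs(u_stat(s) - centre) >= abs(observed - centre) - 1e-12
    )
    return observed, extreme / len(splits)


# Hypothetical beta values for two independent groups (not real data)
u, p = mann_whitney_exact([0.21, 0.25, 0.30], [0.55, 0.61, 0.70])
```

With three samples per group and complete separation, only 2 of the 20 possible splits are as extreme as the observed one, so the smallest attainable two-sided p-value is 0.1. This shows how little power very small groups have.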
Regression modelling
• Complex regression modelling may be unavoidable
• Two noteworthy points:
1. Assumptions of regression
  • Constant variance of residuals/errors (aka homoscedasticity)
  • Normality of residuals/errors
  • It’s the residuals/errors that are important
2. Exposure vs. outcome
  • Exposure: predictor variables, independent variables
  • Outcome: response variables, dependent variables
  • [Models with skewed exposure variables are less susceptible to violations]
Regression modelling
Possible solutions/options
• Try non-parametric/alternative regression models
• Beta distribution
• Generalised linear models
• Generalised least squares models
• Test the robustness of the model e.g. with resampling methods (such as: bootstrapping, permutations, cross validation)
• Validate with non-parametric and stratification analyses
• Manipulate the methylation data
• Categorise into groups (e.g. tertiles, quartiles; low, medium, high)
• Transformations
• (switch the model around such that methylation is a predictor)
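One of the resampling options listed above, bootstrapping, can be sketched as a percentile bootstrap confidence interval for a regression slope with methylation as the outcome. The function names, default settings, and any data passed in are assumptions for illustration. The sketch uses a simple least-squares slope and is not a recommendation of a particular model.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slope_ci(x, y, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the slope, resampling (x, y)
    pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if max(xs) == min(xs):
            continue  # slope is undefined when every resampled x is identical
        slopes.append(ols_slope(xs, [y[i] for i in idx]))
    slopes.sort()
    lower = slopes[int((alpha / 2) * n_boot)]
    upper = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

Here `x` would be the exposure and `y` the beta values at one CpG site. If the interval changes markedly from the model-based one, that suggests the regression assumptions are doing a lot of work.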
Issues with transformations
• Common transformations don’t help
• Difficult to interpret the results
• May limit power of the analyses
[Figure: Histogram of beta values for cg00002593. x-axis: Beta Value; y-axis: Density]
[Figure: Histograms by transformation for cg00002593. Panels: cubic, square, identity, sqrt, log, 1/sqrt, inverse, 1/square, 1/cubic. x-axis: Transformed methylation values; y-axis: Density]
Transformation | Formula | chi2(2) | P(chi2)
raw | | 17.57 | 0.000
cubic | beta^3 | 15.25 | 0.000
square | beta^2 | 16.41 | 0.000
identity | | 17.57 | 0.000
square root | sqrt(beta) | 18.15 | 0.000
log | log(beta) | 18.72 | 0.000
1/(square root) | 1/sqrt(beta) | 19.30 | 0.000
inverse | 1/beta | 19.87 | 0.000
1/square | 1/beta^2 | 21.00 | 0.000
1/cubic | 1/beta^3 | 22.11 | 0.000
Other transformations
Transformations addressing the issue of heteroscedasticity
• M-value (Du et al. BMC Bioinformatics 2010 11:587)
• Arcsine (variance-stabilising transformation) (Lin et al. Nucleic Acids Res. 2008 Feb;36(2):e11)
[Figure: Histograms of transformed methylation for cg00002593. Panels: M-value; arcsin transformed beta value. y-axis: Density]
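A minimal sketch of these two transformations follows. The M-value is written using the logit-style relationship M = log2(beta / (1 - beta)) described by Du et al. The slide does not state which arcsine form was used, so the arcsine square-root version below is an assumption (it is a common variance-stabilising choice for proportions). All function names are illustrative.

```python
import math


def m_value(beta):
    """M-value: log2 of the ratio of methylated to unmethylated proportion."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


def beta_from_m(m):
    """Inverse of m_value, mapping an M-value back to the 0-1 beta scale."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def arcsine_sqrt(beta):
    """Arcsine square-root transform (assumed form, not confirmed by the slide)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0 and 1")
    return math.asin(math.sqrt(beta))
```

A beta value of 0.5 maps to an M-value of 0, and values near 0 or 1 are stretched towards large negative or positive M-values, which is what counteracts the reduced variance at the extremes. Results on the M-value scale can be mapped back with `beta_from_m` for reporting as methylation percentages.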
Multiple comparisons
Alpha inflation = the more tests you perform, the more likely you are to see a false-positive effect
• Family-wise error rate (FWER):
  • Relates to the overall type 1 error for the group of tests performed
  • The probability of at least one of the tests being significant when in fact all hypotheses are null (or the overall hypothesis is null)
  • Controls the overall chance of reporting a false positive finding
  • Popular example: Bonferroni correction (α / number of tests)
    • E.g. α = 0.05 and 8 tests
    • Individual test error rate = 0.05/8 = 0.00625
    • FWER is maintained at 0.05
  • Can be overly stringent when the individual tests are not independent
  • Important in validation and replication studies
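The Bonferroni arithmetic above can be sketched as follows. The function name and the eight p-values are hypothetical, chosen only to match the 8-test example on the slide.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the per-test threshold and a significant/not flag for each test."""
    threshold = alpha / len(p_values)
    return threshold, [p <= threshold for p in p_values]


# Eight hypothetical p-values (illustrative only)
p_values = [0.001, 0.004, 0.006, 0.012, 0.030, 0.049, 0.200, 0.700]
threshold, significant = bonferroni(p_values)
# threshold is 0.05 / 8 = 0.00625, so only the first three tests pass
```

Six of these p-values are below 0.05 when taken one at a time, but only three survive the correction.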
Multiple comparisons
• False discovery rate (FDR):
  • Controls the expected proportion of incorrectly rejected null hypotheses
  • The expected proportion of false positives among all significant tests
  • Reports q-values
    • The q-value of a finding is the smallest FDR at which that finding would be called significant
  • Set the FDR at a level you are happy with by deciding on a value of q to accept as significant e.g. 0.05 (much like a p-value)
  • Popular example: Benjamini & Hochberg (and variations on this)
  • Less conservative than FWER
  • Greater power, but at the cost of an increased type 1 error rate
  • Used in exploratory/hypothesis-generating studies which require replication
• [Both methods are based on parametric assumptions]
  • Non-parametric data may require permutations instead
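As a sketch of the Benjamini & Hochberg procedure mentioned above, the function below returns BH-adjusted p-values, which are often reported as q-values. The function name and the example p-values are assumptions for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Four hypothetical p-values (illustrative only)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
# approximately [0.02, 0.04, 0.04, 0.02]
```

Findings whose adjusted value falls at or below the chosen FDR level (e.g. 0.05) are declared significant. In this example all four would pass, whereas a Bonferroni threshold of 0.05/4 = 0.0125 would retain only two.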
Analysis types
Locus-specific and moderate-throughput analyses
• Interested in univariate associations between methylation at individual CpG sites and phenotype
• Popular options
• Sign test, Mann-Whitney and Kruskal-Wallis for comparing average methylation levels across categorical groups
• Levene Test for Equality of Variances: compare degree of variance between groups
• Spearman’s correlation to compare with continuous phenotypes
• (Regression)
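For the continuous-phenotype case, a sketch of Spearman's correlation is given below, computed as the Pearson correlation of tie-averaged ranks. The function names and example values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2.0 for v in values]


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical methylation (beta) values and a continuous phenotype
methylation = [0.10, 0.40, 0.30, 0.80]
phenotype = [10.0, 35.0, 30.0, 90.0]
rho = spearman_rho(methylation, phenotype)  # perfectly monotone, so rho is 1
```

Because only the ranks are used, the result is unaffected by skewed or bimodal methylation distributions and by any monotone transformation of either variable.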
Analysis types
High-throughput candidate gene and microarray analyses
• Interested in identifying features and patterns within the methylation profile and reducing the dimensionality of the data
• Popular options
• Supervised clustering (aka class discrimination): aims to predict the known phenotype based on the DNA methylation profile.
• Unsupervised clustering (aka class discovery, e.g. hierarchical clustering): groups samples together based on how similar their DNA methylation data are, irrespective of phenotype. The resulting clusters can then be analysed with regard to the phenotypes of interest.
• Often presented along with heatmaps which enable visual inspection of patterns within the samples and methylation profiles.
• Principal component analysis (PCA): aims to represent a large number of correlated measures (e.g. CpG sites) by a smaller set of uncorrelated variables, which together capture the majority of the variance present in the dataset. The resulting variables can then be utilised in subsequent analyses involving phenotypes of interest.
• Surrogate Variable Analysis (SVA) and Independent SVA (I-SVA): aims to remove the variance caused by confounding factors (e.g. technical error/artefact) and leave only biological variation for subsequent analysis.
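The idea behind PCA can be sketched by extracting only the first principal component with power iteration on the sample covariance matrix. This is an illustrative simplification under stated assumptions: the function name, starting vector, and iteration count are arbitrary choices, and a dedicated library routine would normally be used to obtain all components.

```python
def first_principal_component(rows, n_iter=200):
    """Return (loadings, share of total variance) for the first principal
    component. `rows` holds one sample per row and one CpG site per column."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Arbitrary starting direction; it must not be orthogonal to the answer
    vec = [1.0 / (j + 1) for j in range(p)]
    for _ in range(n_iter):
        nxt = [sum(cov[a][b] * vec[b] for b in range(p)) for a in range(p)]
        norm = sum(v * v for v in nxt) ** 0.5
        if norm == 0.0:
            raise ValueError("no variance along the starting direction")
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[a] * sum(cov[a][b] * vec[b] for b in range(p))
                     for a in range(p))
    total_variance = sum(cov[a][a] for a in range(p))
    return vec, eigenvalue / total_variance
```

The projection of each centred sample onto the loadings gives one score per sample, and that score, rather than the full set of correlated CpG sites, can then be tested against the phenotype.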
Example 1
Variation, patterns, and temporal stability of DNA methylation: considerations for epigenetic epidemiology
• Netherlands twin biobank samples (n 100)
• 164 CpG sites spanning 16 candidate gene loci
• Sequenom Epityper
• Assessed inter-individual variability, blood cell heterogeneity, stability of DNA methylation over time and correlation between blood & buccal cells
• Presented a range of summary statistics and performed various non-parametric tests such as
• Spearman’s correlation
• Levene’s test for equality of variance
Talens et al. FASEB, 2010
Example 1
Inter-individual variation in DNA methylation
• Summarised the distribution of methylation across the candidate CpG sites
• Average methylation levels ranged from 0 to 98%, SD ranged from 0 to 15%
• Presented the data graphically using dot plots
Talens et al. FASEB, 2010
Figure 1: Inter-individual variation in DNA methylation
Example 1
Stability over time and differences between tissues
• Paired samples from two time points (10-20 years apart)
• Paired samples from blood and buccal cells
• Average difference and SD of differences were greater between tissues than over time within tissue.
Talens et al. FASEB, 2010
Table 3: Comparison of DNA methylation in blood samples. Values are mean ± SD
Locus | Old blood, methylation (%) | New blood, methylation (%) | Difference (%) | Spearman’s ρ
IL10 | 22.4 ± 9.0 | 25.2 ± 6.6 | −2.8 ± 9.1 | 0.422
IGF2R | 65.8 ± 16.8 | 67.6 ± 17.7 | −1.8 ± 8.1 | 0.883
LEP | 20.0 ± 11.5 | 21.8 ± 13.0 | −1.8 ± 6.1 | 0.895
…
Table 4: Comparison of DNA methylation in recent blood and buccal cell samples. Values are mean ± SD
Locus | Blood, methylation (%) | Buccal cells, methylation (%) | Difference (%) | Spearman’s ρ
IL10 | 24.8 ± 7.1 | 64.9 ± 18.8 | −40.1 ± 17.1 | 0.442
IGF2R | 68.3 ± 17.4 | 81.6 ± 12.4 | −13.3 ± 10.6 | 0.827
LEP | 21.7 ± 13.7 | 11.8 ± 8.8 | 9.9 ± 9.9 | 0.798
Example 2
DNA Methylation Patterns in Human Fetal Heart Development
• DNA samples from 13 heart and 13 paired placenta samples
• Taken at 3 gestational time-points: 7-8 weeks, 9-11 weeks, 12-14 weeks
• VeraCode Custom panel
• 96 CpG sites spanning 31 heart-related candidate genes
• Q1: Does DNA methylation in heart tissue differ over time?
  • Kruskal-Wallis: compared average methylation levels across groups
  • Levene’s test: compared degree of variance in methylation across groups
• Q2: Does DNA methylation in heart tissue differ from that of paired placenta tissue?
  • Spearman’s: assessed correlation between heart and placenta methylation levels
  • Wilcoxon signed-rank: compared average methylation levels across tissues
Spencer et al. in prep.
Example 2
Spencer et al. in prep.
Figure 1. Overall methylation profile in fetal heart tissue samples
Figure 2. Distribution of methylation in fetal heart tissue samples for each CpG site
Summary and descriptive statistics of sample and DNA methylation profile
Example 2
Spencer et al. in prep.
Differences over time and between tissues
Table 1: Association between methylation levels and developmental stage in heart tissue
CpG site | 7-8 weeks: N | Mean (sd) | 9-11 weeks: N | Mean (sd) | 12-14 weeks: N | Mean (sd) | Kruskal-Wallis: χ² | P-value | Levene’s test: F | P-value
1 | 6 | 0.37 (0.07) | 5 | 0.30 (0.07) | 2 | 0.21 (0.06) | 6.58 | 0.04 | 0.44 | 0.66
2 | 6 | 0.87 (0.03) | 5 | 0.87 (0.03) | 2 | 0.69 (0.12) | 4.74 | 0.09 | 6.76 | 0.01
3 | 6 | 0.75 (0.02) | 5 | 0.72 (0.02) | 2 | 0.63 (0.12) | 6.13 | 0.05 | 26.50 | 1.01E-04
4 | 6 | 0.28 (0.15) | 5 | 0.53 (0.08) | 2 | 0.34 (0.15) | 7.22 | 0.03 | 3.19 | 0.08
Table 2: Comparison between paired heart and placenta tissues
All samples
CpG site | N | Mean (SE) difference | Wilcoxon signed-rank: z | P | Spearman’s: rho | P
1 | 10 | -0.01 (0.05) | -0.15 | 0.88 | -0.41 | 0.24
2 | 10 | 0.15 (0.02) | 2.80 | 0.01 | -0.16 | 0.65
3 | 10 | -0.05 (0.01) | -2.50 | 0.01 | 0.02 | 0.96
4 | 9 | -0.06 (0.08) | -1.36 | 0.17 | 0.25 | 0.52
7-8 weeks
CpG site | N | Mean (SE) difference | Wilcoxon signed-rank: z | P | Spearman’s: rho | P
1 | 5 | 0.03 (0.06) | 0.40 | 0.69 | -0.60 | 0.28
2 | 5 | 0.13 (0.02) | 2.02 | 0.04 | -0.10 | 0.87
3 | 5 | -0.04 (0.01) | -2.02 | 0.04 | 0.50 | 0.39
4 | 4 | -0.12 (0.10) | -1.10 | 0.27 | -0.40 | 0.60
9-11 weeks
CpG site | N | Mean (SE) difference | Wilcoxon signed-rank: z | P | Spearman’s: rho | P
1 | 5 | -0.05 (0.07) | -0.94 | 0.35 | -0.30 | 0.62
2 | 5 | 0.17 (0.03) | 2.02 | 0.04 | -0.10 | 0.87
3 | 5 | -0.06 (0.02) | -1.48 | 0.14 | -0.60 | 0.28
4 | 5 | -0.01 (0.12) | -0.67 | 0.50 | 0.00 | 1.00
Example 3
Fryer et al. Epigenetics, 2011
Quantitative, high-resolution epigenetic profiling of CpG loci identifies associations with cord blood plasma homocysteine and birth weight in humans
• DNA samples from 12 fetal cord blood samples
• Analysed ~6.5k sites from Infinium 27k chip
• Performed strict QC measures because of the small sample set
• Unsupervised hierarchical cluster analysis identified 2 distinct clusters of samples
• Compared clinical phenotypes of interest (namely: birthweight centile) across these two groups using Mann-Whitney U tests
• Univariate linear regression analysis with methylation at individual CpG sites as the predictor
Example 4
Tobacco-Smoking-Related Differential DNA Methylation: 27K Discovery and Replication
Breitling et al. AJHG, 2011
• DNA samples from 180 subjects
• Infinium 27k chip
• Mixed linear model: methylation as outcome, smoking status and sex as fixed effects, batch as random effect
• Validated with non-parametric Kruskal-Wallis tests, stratified on sex
• No correction for multiple testing but performed validation and independent replication
Software
• R
  • Freely available statistics platform
  • http://www.r-project.org/
  • Numerous packages written and available for download
    • http://cran.r-project.org/
    • http://www.bioconductor.org/
  • Data processing and analysis:
    • minfi, lumi, methylumi, IMA, limma, HumMeth27QCReport
• Other statistics platforms can perform many of the analyses
  • Stata, SPSS, Minitab, SAS
• Can perform some stats in the accompanying platform software
  • GenomeStudio, Pyro Q-CpG software
Summary
Flowchart for analysis of quantitative DNA methylation. Adapted from E Andres Houseman, 2012 (Chapter 5 from Epigenetic Epidemiology. Karin B. Michels (ed). Springer, 2012). (It’s also worth looking for reviews by Kimberly D Siegmund.)