forest approach to genetic studies

48
Forest Approach to Genetic Studies Heping Zhang Presented at IMS Genomic Workshop, NUS Singapore, June 8, 2009 d Xiang Chen, Ching-Ti Liu, Minghui Wang, Meizhuo Z

Upload: ivor-cross

Post on 03-Jan-2016

14 views

Category:

Documents


0 download

DESCRIPTION

Forest Approach to Genetic Studies. And Xiang Chen, Ching-Ti Liu, Minghui Wang, Meizhuo Zhang. Heping Zhang. Presented at IMS Genomic Workshop, NUS Singapore, June 8, 2009. Outline. Background for genetic studies of complex traits Recursive partitioning, trees, and forests - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Forest Approach to Genetic Studies

Forest Approach to Genetic StudiesForest Approach to Genetic Studies

Heping ZhangHeping Zhang

Presented at IMS Genomic Workshop, NUS Singapore, June 8, 2009

And Xiang Chen, Ching-Ti Liu, Minghui Wang, Meizhuo Zhang

Page 2: Forest Approach to Genetic Studies

2

OutlineOutline

Background for genetic studies of complex traits

Recursive partitioning, trees, and forests

Challenges & solutions in genetic studies

A case study

Page 3: Forest Approach to Genetic Studies

Complex TraitsComplex Traits

Diseases that do not follow Mendelian Inheritance Pattern

Genetic factors, Environment factors, G-G and G-E interactions

Interactions: effects that deviate from the additive effects of single effects

3 of 48

Page 4: Forest Approach to Genetic Studies

4

Successes in Genetic Studies of Complex TraitsSuccesses in Genetic Studies of Complex Traits

Genetic variants have been identified for Age-related Macular Degeneration, Diabetes, Inflammatory Bowel Disorders, etc.

Page 5: Forest Approach to Genetic Studies

5

SNP and Complex TraitsSNP and Complex Traits

http://en.wikipedia.org/wiki/Single_nucleotide_polymorphism

Page 6: Forest Approach to Genetic Studies

6

SNPs and HaplotypesSNPs and Haplotypes

Page 7: Forest Approach to Genetic Studies

7

Gold MiningGold Mining

Page 8: Forest Approach to Genetic Studies

8

Regression approachRegression approach

1

2

25

26

72

*

*

*

*

*

~

~

~

~

~

Page 9: Forest Approach to Genetic Studies

9

Classic Modeling vs Genomic Association AnalysisClassic Modeling vs Genomic Association Analysis

In classic statistical modeling, we tend to have an adequate sample size for estimating parameters of interest. Often, we have hundreds or thousands of observations for the inference on a few parameters. We can try to settle an “optimal” model.In genomic studies, we have more and more variables (gene based) but the access to the number of study subjects remains the same. One model can no longer provide an adequate summary of the information.

Page 10: Forest Approach to Genetic Studies

Recursive PartitioningRecursive Partitioning

A technique to identify heterogeneity in the data and fit a simple model (such as constant or linear) locally, and this avoids pre-specifying a systematic component.

10 of 48

Page 11: Forest Approach to Genetic Studies

11

Leukemia DataLeukemia Data

Source: http://www-genome.wi.mit.edu/cancer

Contents:

• 25 mRNA - acute myeloid leukemia (AML)

• 38 - B-cell acute lymphoblastic leukemia (B-ALL)

• 9 - T-cell acute lymphoblastic leukemia (T-ALL)

• 7,129 genes

Question: are the microarray data useful in classifying different types of leukemia?

Page 12: Forest Approach to Genetic Studies

12

3-D View3-D View

AML

T-ALLB-ALL

Page 13: Forest Approach to Genetic Studies

13

Click to see the diagram

Node SplittingNode Splitting

Page 14: Forest Approach to Genetic Studies

14

Tree StructureTree Structure

Node 7Node 6Node 5Node 4

Node 1

Node 3Node 2

Page 15: Forest Approach to Genetic Studies

15

ForestsForests

To identify a constellation of models that collectively help us understand the data. For example, in gene expression profiling, we can select and rank the genes whose expressions show a great promise of classifying tumor cells.

Page 16: Forest Approach to Genetic Studies

16

Bagging (Bootstrap Aggregating)Bagging (Bootstrap Aggregating)

Cancer NormalHigh

Low

A random tree

A Random Forest

Repetition

A tree

Page 17: Forest Approach to Genetic Studies

17

Choose 20 best splits

Choose 3 best splits for each daughter node

1.1

2.13.13.23.3

1.1

2.1

1.1

2.2

1.1

2.3

1.1

2.2 3.43.53.6

1.1

2.3 3.73.83.9

. . .1.1 1.2 1.3 1.4 1.5

For the highlighted daughter nodes, we choose three best splits

Deterministic ForestDeterministic Forest

Page 18: Forest Approach to Genetic Studies

Challenge I: Memory ConstraintChallenge I: Memory Constraint

• The number of SNPs makes it impossible to conduct a full genomewide association study in standard desktop computers.

• Data security requirements often do not allow the analysis done in computers with huge memory.

• We need a simple but efficient memory management design.

18 of 48

Page 19: Forest Approach to Genetic Studies

How to Use Memory Efficiently?How to Use Memory Efficiently?

0 (AA), 1 (AB), 2 (BB) & 3 (missing)

2 0 3 1 0 …

1 0 0 0 1 1 0 1 0 0 …

byte bit

Com

pression D

ecompression

19 of 48

Page 20: Forest Approach to Genetic Studies

WillowsWillows

20 of 48

Page 21: Forest Approach to Genetic Studies

Williows GUIWilliows GUI

21

Page 22: Forest Approach to Genetic Studies

Williows OutputWilliows Output

22 of 48

Page 23: Forest Approach to Genetic Studies

Challenge II: Haplotype CertaintyChallenge II: Haplotype Certainty

SNPs Directly observed No uncertainty

Less informativeTree approaches

Haplotypes Inferred from SNPs

Uncertain More informative Forest approaches

23 of 48

Page 24: Forest Approach to Genetic Studies

24

Forest Forming SchemeForest Forming Scheme

Unphased data

Estimated haplotype frequencies

Reconstructed phased data 1

Reconstructed phased data 2

Reconstructed phased data 3

Reconstructed phased data 4

Reconstructed phased data n

……

Importance index for haplotype 1

Importance index for haplotype 2

Importance index for haplotype 3

Importance index for haplotype k

……

Tree 1

Tree 2

Tree 3

Tree 4

Tree n

……

Page 25: Forest Approach to Genetic Studies

25

Haplotype Frequency EstimationHaplotype Frequency Estimation

Existing haplotype frequency estimation software that output a set of haplotype pairs with corresponding frequencies for each subject in each region.

We used SNPHAP (Clayton 2006)

Page 26: Forest Approach to Genetic Studies

26

Unphased to Phased DataUnphased to Phased Data

One unphased data expands to a large number of phased datasets.

In each region, an individual’s haplotype pair is randomly selected based on the estimated frequencies to account for the uncertainty of the haplotypes.

Haplotypes with low frequencies (~5-10%) should have some representations.

Page 27: Forest Approach to Genetic Studies

27

Trees Based on Phased DataTrees Based on Phased Data

A tree is grown for each phased data set.

A random forest is formed for all phased data sets.

Page 28: Forest Approach to Genetic Studies

28

Inference from the ForestInference from the Forest

ce.independen of statistictest - theof value

theis and node ofdepth theis where

,2

in tree haplotype of Importance

2

2t

by split is ,

2

tL

V

Th

t

htTtt

Lh

t

Page 29: Forest Approach to Genetic Studies

29

Significance LevelSignificance Level

Distribution of the maximum haplotype importance

under null hypothesis is determined by permutation.

First, disease status is permuted among study

subjects while keeping the genome intact for all

individuals.

Then, each of the permuted data set is treated in the

same way as the original data.

Page 30: Forest Approach to Genetic Studies

Simulation Studies (2 loci)Simulation Studies (2 loci)• 300 cases and 300 controls • Each region has 3 SNPs• 12 interaction models from Knapp et. al. (1994) and Becker et. al. (2005)• 2 additive models with background penetrance• 3 scenarios

• Neither region is in LD with the disease allele• One of the regions is in LD (D’ = 0.5) with the disease allele• Both regions are in LD (D’ = 0.5) with the disease allele

30 of 48

Page 31: Forest Approach to Genetic Studies

31

Simulation Studies (2 loci)Simulation Studies (2 loci)

00f01f 02f

10f 11f12f

20f 21f 22f

Region 2

0 1 2

Region 1

0

1

2

Penetrance

Page 32: Forest Approach to Genetic Studies

32

Simulation Studies (2 loci)Simulation Studies (2 loci)00f01f02f10f11f12f20f21f22f

ffModel

Ep-1 0 0 0 0 0 0.707 0.210 0.210

Ep-2 0 0 0 0 0 0 0 0.778 0.600 0.199

Ep-3 0 0 0 0 0 0 0 0 0.900 0.577 0.577

Ep-4 0 0 0 0 0 0.911 0.372 0.243

Ep-5 0 0 0 0 0 0 0.799 0.349 0.349

Ep-6 0 0 0 0 0 1.000 0.190 0.190

Het-1 0 0.495 0.053 0.053

Het-2 0 0 0.660 0.279 0.040

Het-3 0 0 0 0 1.000 0.194 0.194

S-1 0 0.522 0.052 0.052

S-2 1 1 1 0 0 0.574 0.228 0.045

S-3 1 1 1 0 0 0 0.512 0.194 0.194

Ad-1 0.04 0.304 0.02 0.01 0.01 0.01 0.799 0.349 0.349

Ad-2 0.15 0.324 0.10 0.05 0.05 0.05 0.799 0.349 0.349

ff

f

f

ff

f f

ff

f

f

f

ff

f ff f f f

f

ff

f

f

ff

f

fff

fff

ffffff

ff

ffff

ff

f

f

ggg

gggg

1p 2p

22 ffg

Page 33: Forest Approach to Genetic Studies

33

Simulation Studies (2 loci)Simulation Studies (2 loci)

Benckmark:

FAMHAP software from Becker et. al. (2005)

Page 34: Forest Approach to Genetic Studies

34

Result for Scenario IResult for Scenario I

False positive rate:

Our method: < 1%

FAMHAP: > 5%

Page 35: Forest Approach to Genetic Studies

35

Result for Scenario IIResult for Scenario II

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Ep-1 Ep-2 Ep-3 Ep-4 Ep-5 Ep-6 Het-1

Identify the correcthaplotype (Forest)

Identify an incorrecthaplotype (Forest)

Identify SNPs in thecorrect region(FAMHAP)

Identify SNPs in theneutral region(FAMHAP)

Page 36: Forest Approach to Genetic Studies

36

Result for Scenario IIResult for Scenario II

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Het-2 Het-3 S-1 S-2 S-3 Ad-1 Ad-2

Identify the correcthaplotype (Forest)

Identify an incorrecthaplotype (Forest)

Identify SNPs in thecorrect region(FAMHAP)

Identify SNPs in theneutral region(FAMHAP)

Page 37: Forest Approach to Genetic Studies

37

Result for Scenario IIIResult for Scenario III

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Po

wer

Ep-1 Ep-2 Ep-3 Ep-4 Ep-5 Ep-6 Het-1

Identify at leastone haplotype(Forest)Identify bothhaplotypes(Forest)Identify SNPs in atleast one region(FAMPHAP)Identify SNPs inboth regions(FAMHAP)

Page 38: Forest Approach to Genetic Studies

38

Result for Scenario IIIResult for Scenario III

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Po

wer

Het-2 Het-3 S-1 S-2 S-3 Ad-1 Ad-2

Identify at leastone haplotype(Forest)Identify bothhaplotypes(Forest)Identify SNPs in atleast one region(FAMHAP)Identify SNPs inboth regions(FAMHAP)

Page 39: Forest Approach to Genetic Studies

Real Case StudyReal Case Study

Age-related macular degeneration (AMD)Leading cause of vision loss in elderlyAffects more than 1.75 million individuals in the United StatesProjected to about 3 million by 2020

Klein et al. (2005)Case-control (96 AMD cases, 50 controls)~100,000 SNPs for each individualCFH gene identified

39 of 48

Page 40: Forest Approach to Genetic Studies

40

Analysis ProcedureAnalysis Procedure

Willows programEach SNP is used as one covariateTwo SNPs identified as potentially associated with AMD (rs1329428 on chromosome 1 and rs10272438 on chromosome 7)

Hapview program: LD block construction6-SNP block for rs132942811-SNP block for rs10272438

Page 41: Forest Approach to Genetic Studies

41

ResultResult

Two haplotypes are identifiedMost significant: ACTCCG in region 1 (p-value = 2e-6)

Identical to Klein et. al. (2005)Located in CFH gene

Another significant haplotype: TCTGGACGACA, in region 2 (p-value = 0.0024)

Not reported beforeProtectiveLocated in BBS9 gene

Page 42: Forest Approach to Genetic Studies

42

Expected FrequenciesExpected Frequencies

Page 43: Forest Approach to Genetic Studies

43

RemarksRemarks

A

A

B

Page 44: Forest Approach to Genetic Studies

Main ReferencesMain References

• Chen, Liu, Zhang, Zhang, PNAS, 2007• Zhang, Wang, Chen, BMC Bioinformatics,

2009• Chen, et al., BMC Genetics, 2009

44 of 48

Page 45: Forest Approach to Genetic Studies

45

BooksBooks

Zhang HP and Singer B. Recursive Partitioning in the Health Sciences. Springer, 1999.

Page 46: Forest Approach to Genetic Studies

46

Trees in Genetic StudiesTrees in Genetic Studies

Zhang and Bonney (2000)Nelson et al. (2001)Bastone et al. (2004)Cook, Zee and Ridker (2004)Foulkes, De Gruttola and Hertogs (2004)

Page 47: Forest Approach to Genetic Studies

47

References on ForestsReferences on Forests

Breiman L. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

Zhang HP. Classification trees for multiple binary responses. Journal of the American Statistical Association, 93: 180-193, 1998.

Zhang HP et al. Cell and Tumor Classification using Gene Expression Data: Construction of Forests. Proceedings of the National Academy of Sciences USA, 100: 4168-4172, 2003.

Page 48: Forest Approach to Genetic Studies

Thank you!Thank you!

48 of 48