Page 1: Recursive Partitioning And Its Applications in Genetic Studies

Recursive Partitioning And Its Applications in Genetic Studies

Chin-Pei Tsai

Assistant Professor, Department of Applied Mathematics

Providence University

Page 2: Recursive Partitioning And Its Applications in Genetic Studies

OUTLINE

- Genetic data
- Example
- Basic ideas of recursive partitioning
- Applications in genetic studies
  - linkage analysis
  - association analysis
- Recursive-partitioning based tools for data analyses

Page 3: Recursive Partitioning And Its Applications in Genetic Studies

Tree-based Analyses in Genetic Studies

Genetic Data

Nuclear Family

[Figure: pedigree of a nuclear family with parents 1 and 2 and children 3, 4, 5 and 6]

Family  ID  Father  Mother  Sex  Affected
1       1   0       0       1    2
1       2   0       0       2    1
1       3   1       2       1    2
1       4   1       2       2    1
1       5   1       2       1    1
1       6   1       2       2    2
2       1   0       0       1    2
2       2   0       0       2    1
2       3   1       2       1    1
2       4   1       2       2    2

Page 4: Recursive Partitioning And Its Applications in Genetic Studies

Genetic Data

[Table: the pedigree records of Page 3 (family, individual, father, mother, sex and affection status for the ten members of families 1 and 2), with marker genotypes added]

Genotype

1 7
1 7
2 2
2 2
2 6 3 3
7 2 2 3
1 6 2 3
1 2 2 3
7 2 2 3
3 4 2 5
3 2 4 4
3 3 2 4
3 2 5 4

Page 5: Recursive Partitioning And Its Applications in Genetic Studies

Application of Recursive Partitioning in Microarray Data (Zhang et al., PNAS, 2001)

Gene expression profiles of 2,000 genes in 22 normal and 40 colon cancer tissues

Purpose: to predict whether a new tissue is normal or cancerous

Page 6: Recursive Partitioning And Its Applications in Genetic Studies

Automatically Selected Tree (by RTREE)

(CT = cancer tissue, NT = normal tissue)

Node 1 (CT: 40, NT: 22), split on M26383 > 60
  Node 2 (CT: 0, NT: 14)
  Node 3 (CT: 40, NT: 8), split on R15447 > 290
    Node 4 (CT: 10, NT: 8), split on M28214 > 770
      Node 6 (CT: 10, NT: 1)
      Node 7 (CT: 0, NT: 7)
    Node 5 (CT: 30, NT: 0)

Page 7: Recursive Partitioning And Its Applications in Genetic Studies

[Figure: scatter plot of log(R15447) against log(M26383), showing the regions of Node 2 and Node 3]

Page 8: Recursive Partitioning And Its Applications in Genetic Studies

[Figure: scatter plot of log(M28214) against log(R15447), showing the regions of Node 5, Node 6 and Node 7]

Page 9: Recursive Partitioning And Its Applications in Genetic Studies

3-D Representation of Tree

Page 10: Recursive Partitioning And Its Applications in Genetic Studies

Concluding Remarks

The three genes, IL-8 (M26383), CANX (R15447) and RAB3B (M28214), were chosen from 2,000 genes.

Using three genes can achieve high classification accuracy.

These three genes are related to tumors.

Page 11: Recursive Partitioning And Its Applications in Genetic Studies

Basic Ideas in Classification Trees

Tree Growing

Impurity functions: entropy

For a binary outcome, y = 0, 1, let p = proportion of (y = 1).

Entropy: $-p \log(p) - (1-p) \log(1-p)$, where $0 \log(0) = 0$

[Figure: entropy plotted as a function of p on the interval from 0 to 1]

Splitting criterion

Goodness of split = weighted sum of node impurities
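The entropy impurity above can be sketched in a few lines of Python. This is a minimal illustration, not code from RTREE; the function name `entropy` is a hypothetical choice, and the natural logarithm is assumed because it reproduces the value .6931 quoted for a 50/50 node.

```python
import math

def entropy(p):
    """Entropy impurity of a node in which a proportion p has y = 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Convention 0 * log(0) = 0: terms with zero proportion are skipped,
    # so a pure node (p = 0 or p = 1) has impurity 0.
    return sum(-q * math.log(q) for q in (p, 1.0 - p) if q > 0.0)

# Impurity is largest for an evenly mixed node and zero for a pure node.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  entropy = {entropy(p):.4f}")
```

The function is symmetric in p and 1 - p, so it does not matter which class is labeled y = 1.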

Page 12: Recursive Partitioning And Its Applications in Genetic Studies

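Node Impurity

Root node: 11 cancer subjects, 10 normal subjects

By      Left (cancer, normal)  Right (cancer, normal)  Entropy left  Entropy right
Gender  10, 9                  1, 1                    .6918         .6931
Race    9, 7                   2, 3                    .6853         .6365
Smoked  9, 1                   2, 9                    .3251         .4741
Age     7, 7                   4, 3                    .6931         .6829

Example: split by Gender (Male)

Entropy (left) = $-\frac{10}{19}\log\frac{10}{19} - \frac{9}{19}\log\frac{9}{19} = 0.6918$

Entropy (right) = $-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = 0.6931$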

Page 13: Recursive Partitioning And Its Applications in Genetic Studies

Goodness of Split

Goodness of split s = p(L)i(L) + p(R)i(R)

By      Entropy i(t) left  Entropy i(t) right  Weight p(t) left  Weight p(t) right  s
Gender  .6918              .6931               19/21             2/21               .6919
Race    .6853              .6365               16/21             5/21               .6737
Smoked  .3251              .4741               10/21             11/21              .4031
Age     .6931              .6829               14/21             7/21               .6897

No split: .6920

Smoked gives the smallest weighted impurity, so it is the best of the four candidate splits.
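As a worked check of s = p(L)i(L) + p(R)i(R), the sketch below recomputes the goodness of split from the (cancer, normal) counts given on the Node Impurity slide. The function names are hypothetical and the natural logarithm is assumed. Computed this way, Gender, Smoked, Age and the no-split value agree with the slide to about three decimals. The right-hand Race node (2 cancer, 3 normal) comes out near .673 rather than the .6365 shown, so the Race score is correspondingly higher than .6737. The ranking of the splits is not affected.

```python
import math

def entropy_from_counts(cancer, normal):
    """Entropy impurity of a node holding the given class counts."""
    total = cancer + normal
    return sum(-(c / total) * math.log(c / total)
               for c in (cancer, normal) if c > 0)

def goodness_of_split(left, right):
    """Weighted sum of daughter-node impurities; smaller is better."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return ((n_left / n) * entropy_from_counts(*left)
            + (n_right / n) * entropy_from_counts(*right))

# (cancer, normal) counts in the left and right daughter nodes.
splits = {
    "Gender": ((10, 9), (1, 1)),
    "Race": ((9, 7), (2, 3)),
    "Smoked": ((9, 1), (2, 9)),
    "Age": ((7, 7), (4, 3)),
}

scores = {name: goodness_of_split(left, right)
          for name, (left, right) in splits.items()}
no_split = entropy_from_counts(11, 10)
best = min(scores, key=scores.get)

for name, s in scores.items():
    print(f"{name:<7} s = {s:.4f}")
print(f"No split: {no_split:.4f}")
print("Best split:", best)
```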

Page 14: Recursive Partitioning And Its Applications in Genetic Studies

Tree Pruning

- Fisher Exact Test
- Misclassification cost and rate
- Cost-complexity and complexity parameter
- Optimal sub-trees

Page 15: Recursive Partitioning And Its Applications in Genetic Studies

Genetic Data

[Table: pedigree records and marker genotypes, repeated from Page 4]

Page 16: Recursive Partitioning And Its Applications in Genetic Studies

Key Idea in Tree-based Analysis

If a marker locus is close to a disease locus, then individuals from a given family who are phenotypically similar are expected to be genotypically more similar than expected by chance.

[Figure: pedigree with individuals 1 to 4, with a sib pair marked]

Page 17: Recursive Partitioning And Its Applications in Genetic Studies

Tree-based Linkage Analysis

Unit of observation: sib pair

Response variable: y takes three possible values depending on whether none, one, or both sibs are affected, which we arbitrarily coded as 0, 1, and 2.

Covariate: the expected IBD (identity by descent) sharing at each marker locus
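The response coding for sib pairs can be illustrated as follows. This is a sketch under stated assumptions: affection status is assumed to follow the common pedigree-file convention of 2 = affected and 1 = unaffected, and the function name and the example sibship are hypothetical.

```python
from itertools import combinations

# Assumed coding of affection status: 2 = affected, 1 = unaffected.
AFFECTED = 2

def sib_pair_responses(sibs):
    """Map each sib pair to y = number of affected sibs in the pair (0, 1 or 2).

    sibs: dict mapping sib ID to affection code.
    """
    return {
        (a, b): int(sibs[a] == AFFECTED) + int(sibs[b] == AFFECTED)
        for a, b in combinations(sorted(sibs), 2)
    }

# Hypothetical sibship of four children; sibs 3 and 6 are affected.
example_sibs = {3: 2, 4: 1, 5: 1, 6: 2}
responses = sib_pair_responses(example_sibs)
for pair, y in responses.items():
    print(pair, "y =", y)
```

A sibship of k children contributes k(k-1)/2 sib pairs, which is why the number of pairs can exceed the number of families.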

Page 18: Recursive Partitioning And Its Applications in Genetic Studies

Identity by Descent (IBD)

Genes (or alleles) inherited by relatives from the same ancestor are identical by descent. Two sibs can share at most one IBD gene from the father and at most one from the mother. Thus, 0, 1, or 2 genes can be shared IBD by two siblings.

Example: father's genotype 1 2, mother's genotype 3 4

Sib 1  Sib 2  IBD
1 3    2 4    0
1 3    2 3    1
1 3    1 3    2
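The IBD counts in the example can be reproduced with a short sketch. It only covers the fully informative case in which the two parents carry four distinct alleles, so the parental origin of each sib allele is unambiguous. With less informative parents the sharing has to be estimated, which is where the expected IBD sharing used as a covariate comes in. The function name is hypothetical.

```python
def ibd_sharing(father, mother, sib1, sib2):
    """Number of alleles (0, 1 or 2) two sibs share identical by descent.

    Assumes the parents carry four distinct alleles, so each sib allele
    can be traced to exactly one parent.
    """
    if len(set(father) | set(mother)) != 4:
        raise ValueError("parents must carry four distinct alleles")

    def parental_origin(sib):
        paternal = [a for a in sib if a in father]
        maternal = [a for a in sib if a in mother]
        if len(paternal) != 1 or len(maternal) != 1:
            raise ValueError("sib genotype is inconsistent with the parents")
        return paternal[0], maternal[0]

    pat1, mat1 = parental_origin(sib1)
    pat2, mat2 = parental_origin(sib2)
    # One shared allele for a matching paternal copy, one for a matching maternal copy.
    return int(pat1 == pat2) + int(mat1 == mat2)

father, mother = (1, 2), (3, 4)
for sib1, sib2 in [((1, 3), (2, 4)), ((1, 3), (2, 3)), ((1, 3), (1, 3))]:
    print(sib1, sib2, "IBD =", ibd_sharing(father, mother, sib1, sib2))
```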

Page 19: Recursive Partitioning And Its Applications in Genetic Studies

The Gilles de la Tourette Syndrome (GTS) Phenotype data (Joint work with Zhang et al., 2002)

Genome scan of the hoarding phenotype collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG)

We used data from 223 individuals in 51 families with 77 sib pairs.

Hoarding is a component of obsessive-compulsive disorder.

Genotypes are allele sizes from 370 markers on 22 chromosomes.

Page 20: Recursive Partitioning And Its Applications in Genetic Studies

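The Gilles de la Tourette Syndrome Phenotype data

Linkage Tree

Overall p-value = 2.63e-6

Each node lists the number of sib pairs in the three response categories; splits are on IBD sharing at a marker, shown with the split p-values.

Root: 23, 28, 26
  Split on IBD sharing at D5SMfd154 > 1.9 (P = 0.0011)
  Daughter node: 7, 0, 8
  Daughter node: 16, 28, 18
    Split on D5S408 > 0 (P = 0.0034)
    Daughter node: 0, 8, 0
    Daughter node: 16, 20, 18
      Split on D4S1652 > 1.16 (P = 0.0078)
      Daughter node: 10, 3, 4
      Daughter node: 6, 17, 14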

Page 21: Recursive Partitioning And Its Applications in Genetic Studies

Tree-based Association Study

The response variable is affection status.

The covariates include gender, the parental phenotypes, race and the variables constructed using the marker information.

If a marker has n distinct alleles, then n covariates, each taking a value of 0, 1 or 2, are constructed for this marker: the k-th covariate counts the copies of allele k in the genotype. For example, if n=7, then the 7 covariates take values (0,0,0,1,0,1,0) for a genotype of 4/6 and (0,0,0,0,0,0,2) for a genotype of 7/7.
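The allele-count coding can be written as a small helper. This is a sketch with a hypothetical function name, and it assumes the alleles of a marker are labeled 1 to n.

```python
def allele_count_covariates(genotype, n_alleles):
    """Return n covariates; entry k-1 counts the copies of allele k (0, 1 or 2)."""
    counts = [0] * n_alleles
    for allele in genotype:
        if not 1 <= allele <= n_alleles:
            raise ValueError(f"allele {allele} is outside 1..{n_alleles}")
        counts[allele - 1] += 1
    return tuple(counts)

print(allele_count_covariates((4, 6), 7))  # genotype 4/6
print(allele_count_covariates((7, 7), 7))  # genotype 7/7
```

Each covariate vector sums to 2, one count per copy of the marker carried by the individual.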

Page 22: Recursive Partitioning And Its Applications in Genetic Studies

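The Gilles de la Tourette Syndrome Phenotype data

Association Tree

Overall p-value = 1.03e-7

Split variables (copies of allele): D4S403-5, D4S2632-5, D4S2431-10, D5S816-7

Split conditions and split p-values: > 0, P = 2e-4; > 1,NA, P = 0.016; > 0, P = 0.0023; D5S816-7 > 0,NA, P = 0.0017

Node counts:

Root: 85, 135
  Daughter node: 39, 29
  Daughter node: 46, 106
    Daughter node: 0, 18
    Daughter node: 46, 88
      Daughter node: 0, 11
      Daughter node: 46, 77
        Daughter node: 19, 54
        Daughter node: 27, 23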

Page 23: Recursive Partitioning And Its Applications in Genetic Studies

Why Recursive Partitioning?

Attempt to discover possibly very complex structure in huge databases
 - genotypes for hundreds of markers
 - expression profiles for thousands of genes
 - all are possible predictors (continuous, categorical)

No need to do transformation

Impervious to outliers

Easy to use

Easy to interpret

Page 24: Recursive Partitioning And Its Applications in Genetic Studies

Recursive partitioning based tools for data analysis

Classification and regression
 - RTREE (http://peace.med.yale.edu)
 - CART

Longitudinal data analysis
 - MASAL (http://peace.med.yale.edu)

Survival Analysis
 - STREE (http://peace.med.yale.edu)

Multivariate Adaptive Regression Splines
 - MASAL (http://peace.med.yale.edu)
 - MARS

Page 25: Recursive Partitioning And Its Applications in Genetic Studies

References

Books

L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, 1984, Classification and Regression Trees, Wadsworth, California.

H. Zhang and B. Singer, 1999, Recursive Partitioning in the Health Sciences, Springer, New York.

T. Hastie, R. Tibshirani and J. Friedman, 2001, The Elements of Statistical Learning, Springer, New York.

Page 26: Recursive Partitioning And Its Applications in Genetic Studies

References

Papers

Zhang, Tsai, Yu, and Bonney, 2001, Genetic Epidemiology, 21, Supplement 1, S317-S322.

Zhang, Leckman, Pauls, Tsai, Kidd, Campos and The TSAICG, 2002, American Journal of Human Genetics, 70, 896-904.

Zhang, Yu, Singer and Xiong, 2001, Proc Natl Acad Sci U S A, 98, 6730-6735.

Tsai, Acharyya, Yu and Zhang, 2002, In Recent Research Developments in Human Genetics.

Page 27: Recursive Partitioning And Its Applications in Genetic Studies

Recent Development

Instability of Trees (high variance)
 - Bagging: averages many trees to reduce variance (Breiman, 1996)
 - Boosting (Breiman, 1998; Mason et al., 2000; Friedman et al., 1998)
 - Random forest (Breiman, 1999)

Lack of Smoothness
 - MARS procedure (Zhang & Singer, 1999; Hastie et al., 2001)

Difficulty in Capturing Additive Structure
 - MARS procedure
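The idea behind bagging can be sketched with one-split trees (stumps) on a single covariate. This is a toy illustration rather than Breiman's implementation: all function names and the data are hypothetical, each stump is fitted to a bootstrap resample, and for classification the "average" is taken as a majority vote.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """One-split tree: choose the threshold with the fewest training errors."""
    best = None
    for thr in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, thr, left_label, right_label)
    return best[1:]

def predict_stump(stump, x):
    thr, left_label, right_label = stump
    return left_label if x <= thr else right_label

def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Majority vote over the individual stumps."""
    votes = Counter(predict_stump(s, x) for s in stumps)
    return votes.most_common(1)[0][0]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = bagged_stumps(xs, ys)
print([predict_bagged(stumps, x) for x in (1, 8)])
```

Individual stumps place their threshold differently from one resample to the next. The vote smooths out that variation, which is the variance reduction referred to above.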

Page 28: Recursive Partitioning And Its Applications in Genetic Studies

Competitive Tree for Colon Data

Page 29: Recursive Partitioning And Its Applications in Genetic Studies

[Figure: correlation (vertical axis, 0.0 to 1.0) plotted over a horizontal axis from 0 to 2000, with curves labeled M26383, R15447 and M28214]

Page 30: Recursive Partitioning And Its Applications in Genetic Studies

Competitive Tree

Split variables: R87126, X15183, T62947

Branch labels: (372, 1052], >1052, >457, >28

Node 1 (CT: 40, NT: 22)
  Node 2 (CT: 34, NT: 3)
    Node 5 (CT: 0, NT: 3)
    Node 6 (CT: 34, NT: 0)
  Node 3 (CT: 6, NT: 13)
    Node 7 (CT: 0, NT: 13)
    Node 8 (CT: 6, NT: 0)
  Node 4 (CT: 0, NT: 6)

Page 31: Recursive Partitioning And Its Applications in Genetic Studies

3-D Representation of Tree