recursive partitioning and its applications in genetic studies

Post on 23-Jan-2016


Recursive Partitioning And Its Applications in Genetic Studies

Chin-Pei Tsai

Assistant Professor, Department of Applied Mathematics

Providence University

OUTLINE

- Genetic data
- Example
- Basic ideas of recursive partitioning
- Applications in genetic studies: linkage analysis, association analysis
- Recursive-partitioning based tools for data analyses

Genetic Data

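Nuclear Family

[Figure: pedigree diagrams of two nuclear families; parents 1 and 2 with offspring 3 to 6 in the first family and 3 to 4 in the second.]

Family  ID  Father  Mother  Sex  Affected
1       1   0       0       1    2
1       2   0       0       2    1
1       3   1       2       1    2
1       4   1       2       2    1
1       5   1       2       1    1
1       6   1       2       2    2
2       1   0       0       1    2
2       2   0       0       2    1
2       3   1       2       1    1
2       4   1       2       2    2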

Tree-based Analyses in Genetic Studies

Genetic Data

[Figure: pedigree records for the two nuclear families (family, individual, father, mother, sex, affection status), with a Genotype column of marker allele pairs added for each individual.]

Application of Recursive Partitioning in Microarray Data (Zhang et al., PNAS, 2001)

Gene expression profiles of 2,000 genes in 22 normal and 40 colon cancer tissues

Purpose: to predict whether a new tissue is normal or cancerous

Automatically Selected Tree (by RTREE)

Node 1: CT 40, NT 22; split on M26383 > 60
  Node 2: CT 0, NT 14
  Node 3: CT 40, NT 8; split on R15447 > 290
    Node 5: CT 30, NT 0
    Node 4: CT 10, NT 8; split on M28214 > 770
      Node 7: CT 0, NT 7
      Node 6: CT 10, NT 1

(CT: colon cancer tissues; NT: normal tissues)

3-D Representation of Tree

[Figure: 3-D plot of the tissues on the axes log(M26383), log(R15447), and log(M28214), with the regions for Nodes 2, 3, 5, 6, and 7 marked.]

The three genes, IL-8 (M26383), CANX (R15447) and RAB3B (M28214), were chosen from 2,000 genes.

Concluding Remarks

Using three genes can achieve high classification accuracy.

These three genes are related to tumors.
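The accuracy claim can be checked against the terminal-node counts of the automatically selected tree. The sketch below assumes each terminal node is labeled by its majority class; the function name is illustrative, and the result is a resubstitution figure on the 62 tissues used to grow the tree, not a cross-validated estimate.

```python
# Terminal-node counts (CT = cancer tissues, NT = normal tissues) taken from
# the automatically selected tree on the slides.
terminal_nodes = {
    "Node 2": (0, 14),
    "Node 5": (30, 0),
    "Node 6": (10, 1),
    "Node 7": (0, 7),
}


def resubstitution_accuracy(nodes):
    """Label each terminal node by its majority class and count correct tissues."""
    total = sum(ct + nt for ct, nt in nodes.values())
    correct = sum(max(ct, nt) for ct, nt in nodes.values())
    return correct, total, correct / total


correct, total, accuracy = resubstitution_accuracy(terminal_nodes)
print(f"{correct} of {total} tissues classified correctly ({accuracy:.3f})")
```

Under these assumptions only the single normal tissue in Node 6 is misclassified, so 61 of the 62 tissues are classified correctly.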

Tree Growing

Impurity functions: entropy

For a binary outcome y = 0, 1, let p be the proportion of observations with y = 1.

Entropy: $-p \log(p) - (1-p) \log(1-p)$, where $0 \log(0) = 0$

[Figure: entropy plotted against p from 0 to 1, with p = 1/2 marked.]

Splitting criterion

Goodness of split = weighted sum of node impurities

Basic Ideas in Classification Trees

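Node Impurity

Cancer subjects: 11. Normal subjects: 10.

Counts (cancer, normal) in the left and right daughter nodes for each candidate split:

Split by   Left     Right
Gender     10, 9    1, 1
Race       9, 7     2, 3
Smoked     9, 1     2, 9
Age        7, 7     4, 3

[Figure: the root node (11, 10) split on Gender (Male) into a left node (10, 9) and a right node (1, 1).]

Entropy for the Gender split:

left: $-\frac{9}{19}\log\frac{9}{19} - \frac{10}{19}\log\frac{10}{19} = 0.6918$

right: $-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = 0.6931$

Goodness of Split

Goodness of split: s = p(L)i(L) + p(R)i(R)

           Weight p(t)        Entropy i(t)
Split by   left     right     left     right     s
Gender     19/21    2/21      .6918    .6931     .6919
Race       16/21    5/21      .6853    .6365     .6737
Smoked     10/21    11/21     .3251    .4741     .4031
Age        14/21    7/21      .6931    .6829     .6897

No split: .6920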
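The entropy and goodness-of-split figures can be recomputed from the (cancer, normal) counts. This is a minimal sketch using natural logarithms; the function names are illustrative and not part of RTREE. The recomputed values match the slide to within rounding for Gender, Smoked, and Age. For Race, the listed right-node counts (2, 3) give an entropy of about 0.6730, not the tabulated .6365 (which is the entropy at p = 1/3), so the recomputed s for Race is about 0.6824 rather than .6737. Smoked is the best split either way.

```python
import math


def entropy(cancer, normal):
    """Entropy impurity -p log(p) - (1-p) log(1-p), taking 0 log(0) = 0."""
    n = cancer + normal
    result = 0.0
    for count in (cancer, normal):
        if count > 0:
            p = count / n
            result -= p * math.log(p)
    return result


def goodness_of_split(left, right):
    """Weighted sum of daughter-node impurities: s = p(L) i(L) + p(R) i(R)."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    return (n_left / n) * entropy(*left) + (n_right / n) * entropy(*right)


# (cancer, normal) counts in the left and right daughter nodes, from the slide.
splits = {
    "Gender": ((10, 9), (1, 1)),
    "Race": ((9, 7), (2, 3)),
    "Smoked": ((9, 1), (2, 9)),
    "Age": ((7, 7), (4, 3)),
}

scores = {name: goodness_of_split(left, right)
          for name, (left, right) in splits.items()}
for name, s in scores.items():
    print(f"{name:8s} s = {s:.4f}")
print(f"No split s = {entropy(11, 10):.4f}")

# The smallest weighted impurity marks the best split.
best = min(scores, key=scores.get)
print("Best split:", best)
```

Every candidate split lowers the impurity relative to the unsplit node, but only Smoked does so by a substantial margin.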

Tree Pruning

- Fisher exact test
- Misclassification cost and rate
- Cost-complexity and complexity parameter
- Optimal sub-trees
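Cost-complexity pruning can be illustrated with the colon-data tree from the microarray example. The sketch below assumes the standard cost-complexity measure from Breiman et al. (1984), R_alpha(T) = R(T) + alpha * (number of terminal nodes), with R(T) taken as the resubstitution count of misclassified tissues under unit misclassification cost. The subtree labels and function names are illustrative.

```python
# Pruned subtrees of the selected colon-data tree, each described by the
# (CT, NT) counts of its terminal nodes as read from the slides.
subtrees = {
    "root only": [(40, 22)],
    "1 split": [(0, 14), (40, 8)],
    "2 splits": [(0, 14), (30, 0), (10, 8)],
    "3 splits (full)": [(0, 14), (30, 0), (10, 1), (0, 7)],
}


def misclassified(leaves):
    """Tissues misclassified when each leaf is labeled by its majority class."""
    return sum(min(ct, nt) for ct, nt in leaves)


def cost_complexity(leaves, alpha):
    """R_alpha(T) = R(T) + alpha * (number of terminal nodes)."""
    return misclassified(leaves) + alpha * len(leaves)


def best_subtree(alpha):
    """Subtree with the smallest cost-complexity for a given alpha."""
    return min(subtrees, key=lambda name: cost_complexity(subtrees[name], alpha))


for alpha in (0, 4, 20):
    print(f"alpha = {alpha:2d}: {best_subtree(alpha)}")
```

As the complexity parameter alpha grows, the optimal subtree shrinks from the full tree toward the root. The two-split subtree is never optimal here, because its second split removes no misclassified tissues on its own.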

Genetic Data

[Figure: pedigree records for the two nuclear families (family, individual, father, mother, sex, affection status), with a Genotype column of marker allele pairs added for each individual.]

Key Idea in Tree-based Analysis

If a marker locus is close to a disease locus, then individuals from a given family who are phenotypically similar are expected to be genotypically more similar than expected by chance.

[Figure: sib pair diagram.]

Tree-based Linkage Analysis

Unit of observation: sib pair

The response variable y takes three possible values, arbitrarily coded as 0, 1, and 2, according to whether none, one, or both sibs are affected.

Covariate: the expected IBD (identity by descent) sharing at each marker locus

Identity by Descent (IBD)

Genes (or alleles) are identical by descent if relatives inherited them from the same ancestor. Two sibs can share at most one gene IBD from the father and at most one from the mother. Thus two siblings share 0, 1, or 2 genes IBD.

Father's genotype: 1 2. Mother's genotype: 3 4.

Sib 1   Sib 2   IBD
1 3     2 4     0
1 3     2 3     1
1 3     1 3     2
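The IBD counts in this example can be reproduced with a short sketch. It assumes the fully informative case shown on the slide, where the parents carry four distinct alleles, so sharing the same allele implies sharing it by descent. The function names are illustrative. The second function assumes that "expected IBD sharing" means the mean of the IBD distribution, P(IBD = 1) + 2 P(IBD = 2), which is what has to be used when the IBD state cannot be read off directly.

```python
def ibd_sharing(sib1, sib2):
    """Count the alleles two sibs share identical by descent.

    Each genotype is (paternal_allele, maternal_allele). Assumes the parents
    carry four distinct alleles, so identical alleles are identical by descent.
    """
    paternal_shared = int(sib1[0] == sib2[0])
    maternal_shared = int(sib1[1] == sib2[1])
    return paternal_shared + maternal_shared


def expected_ibd(p0, p1, p2):
    """Expected IBD sharing given the probabilities of sharing 0, 1, or 2 alleles."""
    if abs(p0 + p1 + p2 - 1.0) > 1e-9:
        raise ValueError("IBD probabilities must sum to 1")
    return p1 + 2 * p2


# Father 1 2, mother 3 4, as on the slide.
for sib1, sib2 in [((1, 3), (2, 4)), ((1, 3), (2, 3)), ((1, 3), (1, 3))]:
    print(sib1, sib2, "IBD =", ibd_sharing(sib1, sib2))

# With no marker information, a sib pair shares 0, 1, or 2 alleles IBD with
# probabilities 1/4, 1/2, 1/4, so the expected sharing is 1.
print(expected_ibd(0.25, 0.5, 0.25))
```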

The Gilles de la Tourette Syndrome (GTS) Phenotype data (Joint work with Zhang et al., 2002)

Genome scan of the hoarding phenotype collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG)

We used data from 223 individuals in 51 families with 77 sib pairs.

Hoarding is a component of obsessive-compulsive disorder.

Genotypes are allele sizes from 370 markers on 22 chromosomes.

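The Gilles de la Tourette Syndrome Phenotype data

Linkage Tree

Each node lists three sib-pair counts (77 pairs at the root).

Root: 23, 28, 26; split on IBD sharing at D5SMfd154 > 1.9 (P = 0.0011)
  Node: 7, 0, 8
  Node: 16, 28, 18; split on IBD sharing at D5S408 > 0 (P = 0.0034)
    Node: 0, 8, 0
    Node: 16, 20, 18; split on IBD sharing at D4S1652 > 1.16 (P = 0.0078)
      Node: 10, 3, 4
      Node: 6, 17, 14

The P values shown are split p-values. Overall p-value = 2.63e-6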

Tree-based Association Study

The response variable is affection status.

The covariates include gender, the parental phenotypes, race, and the variables constructed from the marker information.

If a marker has n distinct alleles, n covariates are constructed for it, each counting the copies (0, 1, or 2) of one allele in the genotype. For example, if n = 7, the 7 covariates take the values (0,0,0,1,0,1,0) for a genotype of 4/6 and (0,0,0,0,0,0,2) for a genotype of 7/7.
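The construction of the marker covariates can be sketched as follows. The function name is illustrative, and the sketch assumes the alleles of a marker are labeled 1 to n, as in the example above.

```python
def allele_count_covariates(genotype, n_alleles):
    """Return n covariates giving the number of copies (0, 1, or 2) of each allele.

    `genotype` is a pair of allele labels in 1..n_alleles (labeling assumed).
    """
    counts = [0] * n_alleles
    for allele in genotype:
        if not 1 <= allele <= n_alleles:
            raise ValueError(f"allele {allele} is outside 1..{n_alleles}")
        counts[allele - 1] += 1
    return tuple(counts)


print(allele_count_covariates((4, 6), 7))  # genotype 4/6
print(allele_count_covariates((7, 7), 7))  # genotype 7/7
```

The two calls reproduce the vectors (0,0,0,1,0,1,0) and (0,0,0,0,0,0,2) given in the text.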

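The Gilles de la Tourette Syndrome Phenotype data

Association Tree

[Figure: association tree. Each node lists two counts. The root (85, 135) splits into (39, 29) and (46, 106); (46, 106) into (0, 18) and (46, 88); (46, 88) into (0, 11) and (46, 77); (46, 77) into (19, 54) and (27, 23). Splits are on the number of copies of alleles D4S403-5, D4S2632-5, D4S2431-10, and D5S816-7. Split conditions and p-values shown: > 0 (P = 2e-4); > 1 or NA (P = 0.016); > 0 (P = 0.0023); and, for D5S816-7, > 0 or NA (P = 0.0017).]

Overall p-value = 1.03e-7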

Why Recursive Partitioning?

Attempts to discover possibly very complex structure in huge databases
- genotypes for hundreds of markers
- expression profiles for thousands of genes
- all are possible predictors (continuous, categorical)

No need to transform variables

Impervious to outliers

Easy to use

Easy to interpret

Recursive partitioning based tools for data analysis

Classification and regression: RTREE (http://peace.med.yale.edu), CART

Longitudinal data analysis: MASAL (http://peace.med.yale.edu)

Survival analysis: STREE (http://peace.med.yale.edu)

Multivariate adaptive regression splines: MASAL (http://peace.med.yale.edu), MARS

References: Books

L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, 1984, Classification and Regression Trees, Wadsworth, California.

H. Zhang and B. Singer, 1999, Recursive Partitioning in the Health Sciences, Springer, New York.

T. Hastie, R. Tibshirani and J. Friedman, 2001, The Elements of Statistical Learning, Springer, New York.

References: Papers

Zhang, Tsai, Yu, and Bonney, 2001, Genetic Epidemiology, 21, Supplement 1, S317-S322.

Zhang, Leckman, Pauls, Tsai, Kidd, Campos and The TSAICG, 2002, American Journal of Human Genetics, 70, 896-904.

Zhang, Yu, Singer and Xiong, 2001, Proc Natl Acad Sci U S A, 98, 6730-6735.

Tsai, Acharyya, Yu and Zhang, 2002, In Recent Research Developments in Human Genetics.

Recent Development

Instability of trees (high variance)
- Bagging: averages many trees to reduce variance (Breiman, 1996)
- Boosting (Breiman, 1998; Mason et al., 2000; Friedman et al., 1998)
- Random forest (Breiman, 1999)

Lack of smoothness
- MARS procedure (Zhang & Singer, 1999; Hastie et al., 2001)

Difficulty in capturing additive structure
- MARS procedure
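The bagging idea can be sketched in a few lines. This is an illustration under stated assumptions, not Breiman's implementation: each "tree" is a single-split stump on one predictor, the bagged classifier takes a majority vote over stumps fitted to bootstrap samples, and the toy data are synthetic rather than taken from the slides. All names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-split tree: predict `hi` when x > cut, otherwise `lo`."""
    best = None
    for cut in sorted(set(xs)):
        for lo, hi in ((0, 1), (1, 0)):
            errors = sum((hi if x > cut else lo) != y for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, cut, lo, hi)
    _, cut, lo, hi = best
    return lambda x: hi if x > cut else lo


def bagged_predictor(xs, ys, n_trees=25, seed=0):
    """Fit stumps to bootstrap samples and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        votes = Counter(stump(x) for stump in stumps)
        return votes.most_common(1)[0][0]

    return predict


# Synthetic toy data (not from the slides): larger x tends to mean class 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
predict = bagged_predictor(xs, ys)
print([predict(x) for x in xs])
```

A single stump's cut point moves with the particular sample it is fitted to. Voting over many bootstrap fits smooths out that dependence, which is the variance reduction the slide refers to.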

Competitive Tree for Colon Data

[Figure: correlation (vertical axis, 0.0 to 1.0) plotted over a horizontal axis running from 0 to 2000, with panels or curves labeled M26383, R15447, and M28214.]

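Competitive Tree

Node 1: CT 40, NT 22; split on R87126 into three branches (cut points 372 and 1052)
  Node 2: CT 34, NT 3; split into Node 5 (CT 0, NT 3) and Node 6 (CT 34, NT 0)
  Node 3: CT 6, NT 13; split into Node 7 (CT 0, NT 13) and Node 8 (CT 6, NT 0)
  Node 4: CT 0, NT 6

The second-level splits use X15183 and T62947; the thresholds > 457 and > 28 are shown in the figure.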

3-D Representation of Tree
