Recursive Partitioning And Its Applications in Genetic Studies
Chin-Pei Tsai
Assistant Professor
Department of Applied Mathematics
Providence University
OUTLINE
Genetic data
Example
Basic ideas of recursive partitioning
Applications in genetic studies
  linkage analysis
  association analysis
Recursive-partitioning based tools for data analyses
Genetic Data
Nuclear Family
Family  Individual  Father  Mother  Sex  Affected
1       1           0       0       1    2
1       2           0       0       2    1
1       3           1       2       1    2
1       4           1       2       2    1
1       5           1       2       1    1
1       6           1       2       2    2
2       1           0       0       1    2
2       2           0       0       2    1
2       3           1       2       1    1
2       4           1       2       2    2
[Figure: pedigree diagram with parents 1 and 2 and children 3, 4, 5, and 6]
Tree-based Analyses in Genetic Studies
Genetic Data
[Table: the pedigree records from the preceding Genetic Data slide (family, father, mother, sex, affection status, individual), with an added Genotype column]
Genotype entries as listed: 1 7; 1 7; 2 2; 2 2; 2 6 3 3; 7 2 2 3; 1 6 2 3; 1 2 2 3; 7 2 2 3; 3 4 2 5; 3 2 4 4; 3 3 2 4; 3 2 5 4
Application of Recursive Partitioning in Microarray Data (Zhang et al., PNAS, 2001)
Data: gene expression profiles of 2,000 genes in 22 normal and 40 colon cancer tissues
Purpose: to predict whether a new tissue is normal or cancerous
Automatically Selected Tree (by RTREE)
CT: colon cancer tissues; NT: normal tissues
Node 1 (CT: 40, NT: 22), split: M26383 > 60
  Node 2 (CT: 0, NT: 14)
  Node 3 (CT: 40, NT: 8), split: R15447 > 290
    Node 4 (CT: 10, NT: 8), split: M28214 > 770
      Node 6 (CT: 10, NT: 1)
      Node 7 (CT: 0, NT: 7)
    Node 5 (CT: 30, NT: 0)
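3-D Representation of Tree
[Figure: two panels. Panel 1: log(M26383) on the horizontal axis against log(R15447) on the vertical axis, showing Nodes 2 and 3. Panel 2: log(R15447) on the horizontal axis against log(M28214) on the vertical axis, showing Nodes 5, 6, and 7.]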
The three genes, IL-8 (M26383), CANX (R15447) and RAB3B (M28214), were chosen from 2,000 genes.
Concluding Remarks
Using three genes can achieve high classification accuracy.
These three genes are related to tumors.
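As a rough check on the accuracy claim, the terminal-node counts of the RTREE tree can be turned into a resubstitution (training-sample) accuracy. This is only a sketch: it assumes each terminal node predicts its majority class, the name `terminal_nodes` is illustrative, and the resulting figure is not a cross-validated estimate.

```python
# Terminal-node counts (cancer tissues, normal tissues) from the RTREE tree.
terminal_nodes = {
    "Node 2": (0, 14),
    "Node 5": (30, 0),
    "Node 6": (10, 1),
    "Node 7": (0, 7),
}

def resubstitution_accuracy(nodes):
    """Accuracy on the training sample when each node predicts its majority class."""
    total = sum(ct + nt for ct, nt in nodes.values())
    correct = sum(max(ct, nt) for ct, nt in nodes.values())
    return correct, total, correct / total

correct, total, accuracy = resubstitution_accuracy(terminal_nodes)
print(f"{correct}/{total} tissues classified correctly ({accuracy:.3f})")
```

Under these assumptions only the single normal tissue in Node 6 is misclassified.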
Tree Growing
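Impurity function: entropy
For a binary outcome y = 0, 1, let p be the proportion of subjects in the node with y = 1.
Entropy: $i(t) = -p \log(p) - (1-p) \log(1-p)$, where $0 \log(0) = 0$
[Figure: entropy plotted against p for p between 0 and 1, with p = 1/2 marked]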
Splitting criterion
Goodness of split = weighted sum of the daughter-node impurities
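The entropy impurity can be sketched in a few lines of Python. The function name `entropy` is illustrative, and the natural logarithm is assumed because it reproduces the value .6931 that the slides give for p = 1/2.

```python
import math

def entropy(p):
    """Entropy impurity of a node whose proportion of y = 1 is p (natural log)."""
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # convention: 0 * log(0) = 0
            total -= q * math.log(q)
    return total

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}  entropy = {entropy(p):.4f}")
```

The impurity is zero for a pure node (p = 0 or p = 1) and largest when the two classes are evenly mixed (p = 1/2).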
Basic Ideas in Classification Trees
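Example: 21 subjects (11 cancer subjects, 10 normal subjects)

Candidate splits: number of (cancer, normal) subjects in each daughter node

By      left    right
Gender  10, 9   1, 1
Race    9, 7    2, 3
Smoked  9, 1    2, 9
Age     7, 7    4, 3

Node Impurity: Entropy (i(t))

For the split by Gender (Male):
left: $-\frac{9}{19}\log\frac{9}{19} - \frac{10}{19}\log\frac{10}{19} = 0.6918$
right: $-\frac{1}{2}\log\frac{1}{2} - \frac{1}{2}\log\frac{1}{2} = 0.6931$

By      left    right
Gender  .6918   .6931
Race    .6853   .6730
Smoked  .3251   .4741
Age     .6931   .6829

Goodness of Split: $s = p(L)\,i(L) + p(R)\,i(R)$

By      Weight p(L)  Weight p(R)  s
Gender  19/21        2/21         .6919
Race    16/21        5/21         .6824
Smoked  10/21        11/21        .4031
Age     14/21        7/21         .6897
No split: .6920

Splitting by Smoked gives the smallest weighted impurity, so it is the best of the four candidate splits.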
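The goodness-of-split calculation can be recomputed directly from the daughter-node counts. The sketch below assumes the counts are ordered (cancer, normal) and uses the natural logarithm; the function and variable names are illustrative.

```python
import math

def entropy(counts):
    """Entropy impurity i(t) of a node given its class counts (natural log)."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def goodness_of_split(left, right):
    """Weighted impurity s = p(L) i(L) + p(R) i(R); smaller is better."""
    n = sum(left) + sum(right)
    return sum(left) / n * entropy(left) + sum(right) / n * entropy(right)

# (cancer, normal) counts in the left and right daughter nodes, from the slide.
splits = {
    "Gender": ((10, 9), (1, 1)),
    "Race":   ((9, 7), (2, 3)),
    "Smoked": ((9, 1), (2, 9)),
    "Age":    ((7, 7), (4, 3)),
}

scores = {name: goodness_of_split(l, r) for name, (l, r) in splits.items()}
for name, s in scores.items():
    print(f"{name:<7} s = {s:.4f}")
best = min(scores, key=scores.get)
print("No split:", round(entropy((11, 10)), 4), "| best split:", best)
```

Choosing the split with the smallest weighted impurity is equivalent to choosing the largest reduction from the unsplit node's impurity.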
Tree Pruning
Fisher exact test
Misclassification cost and rate
Cost-complexity and complexity parameter
Optimal sub-trees
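To illustrate the cost-complexity idea, the sketch below applies the measure of Breiman et al. (1984), R_alpha(T) = R(T) + alpha * (number of terminal nodes), to nested subtrees of the colon-cancer tree from the microarray example. The assumptions are that R(T) is the resubstitution misclassification rate with majority-vote labels and that the alpha values are arbitrary illustrations. This is not the procedure RTREE itself applies.

```python
# Candidate subtrees of the colon-cancer tree, each described by the
# (cancer, normal) counts in its terminal nodes.
N = 62  # 40 cancer + 22 normal tissues
subtrees = {
    "full tree (nodes 2, 5, 6, 7)": [(0, 14), (30, 0), (10, 1), (0, 7)],
    "pruned at node 4 (nodes 2, 5, 4)": [(0, 14), (30, 0), (10, 8)],
    "pruned at node 3 (nodes 2, 3)": [(0, 14), (40, 8)],
    "root only (node 1)": [(40, 22)],
}

def misclassification_rate(leaves):
    """Resubstitution error when each terminal node predicts its majority class."""
    return sum(min(ct, nt) for ct, nt in leaves) / N

def cost_complexity(leaves, alpha):
    """R_alpha(T) = R(T) + alpha * (number of terminal nodes)."""
    return misclassification_rate(leaves) + alpha * len(leaves)

def best_subtree(alpha):
    """Subtree with the smallest cost-complexity for a given alpha."""
    return min(subtrees, key=lambda name: cost_complexity(subtrees[name], alpha))

for alpha in (0.0, 0.1, 0.3):  # illustrative values of the complexity parameter
    print(f"alpha = {alpha:.1f}: {best_subtree(alpha)}")
```

As alpha grows, the penalty on tree size outweighs the gain in accuracy and progressively smaller subtrees become optimal.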
Genetic Data
[Table: pedigree records with marker genotypes, repeated from the earlier Genetic Data slide]
Key Idea in Tree-based Analysis
If a marker locus is close to a disease locus, then individuals from a given family who are phenotypically similar are expected to be genotypically more similar than expected by chance.
[Figure: sibs 1, 2, 3, and 4, illustrating a sib pair]
Covariate: the expected IBD (identity by descent) sharing at each marker locus
Tree-based Linkage Analysis
Unit of observation: sib pair
The response variable y takes three possible values depending on whether none, one, or both sibs are affected, which we arbitrarily coded as 0, 1, and 2.
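A minimal sketch of how the sib-pair observations and their response values could be formed from one sibship. The sibship below is hypothetical, and the function name `sib_pair_responses` is illustrative.

```python
from itertools import combinations

def sib_pair_responses(affected_by_sib):
    """Map each sib pair to y = number of affected sibs in the pair (0, 1, or 2)."""
    pairs = {}
    sibs = sorted(affected_by_sib.items())
    for (sib_a, aff_a), (sib_b, aff_b) in combinations(sibs, 2):
        pairs[(sib_a, sib_b)] = int(aff_a) + int(aff_b)
    return pairs

# Hypothetical sibship: sib ID -> affected (True) or unaffected (False).
sibship = {3: True, 4: False, 5: False, 6: True}
responses = sib_pair_responses(sibship)
for pair, y in responses.items():
    print(pair, "y =", y)
```

A sibship of four sibs contributes six sib pairs, so one family can supply several units of observation.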
Identity by Descent (IBD)
Genes (or alleles) are identical by descent when relatives inherit them from the same ancestor. Two sibs can share at most one allele IBD from the father and at most one from the mother, so they share 0, 1, or 2 alleles IBD.
Father's genotype: 1 2
Mother's genotype: 3 4

Sib 1   Sib 2   IBD
1 3     2 4     0
1 3     2 3     1
1 3     1 3     2
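The IBD counts in this example can be reproduced with a short sketch. It assumes, as in the slide, that the four parental alleles are all distinct, so that an identical allele label implies inheritance from the same parent. The second function computes the expected IBD sharing used as a covariate, given sharing probabilities that in practice would be estimated from the marker data. Both function names are illustrative.

```python
def ibd_sharing(sib1, sib2):
    """IBD count for two sibs given as (paternal allele, maternal allele).

    Assumes the four parental alleles are all distinct, so that carrying the
    same allele label means the allele was inherited from the same parent.
    """
    paternal_shared = int(sib1[0] == sib2[0])
    maternal_shared = int(sib1[1] == sib2[1])
    return paternal_shared + maternal_shared

def expected_ibd(p0, p1, p2):
    """Expected number of alleles shared IBD, given P(IBD = 0), P(IBD = 1), P(IBD = 2)."""
    assert abs(p0 + p1 + p2 - 1.0) < 1e-9
    return 1 * p1 + 2 * p2

# Father 1/2, mother 3/4, as in the slide.
print(ibd_sharing((1, 3), (2, 4)))    # sibs 1/3 and 2/4 -> 0
print(ibd_sharing((1, 3), (2, 3)))    # sibs 1/3 and 2/3 -> 1
print(ibd_sharing((1, 3), (1, 3)))    # sibs 1/3 and 1/3 -> 2
print(expected_ibd(0.25, 0.5, 0.25))  # prior probabilities for full sibs -> 1.0
```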
The Gilles de la Tourette Syndrome (GTS) Phenotype data (Joint work with Zhang et al., 2002)
Genome scan of the hoarding phenotype collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG)
We used data from 223 individuals in 51 families with 77 sib pairs.
Hoarding is a component of obsessive-compulsive disorder.
Genotypes are allele sizes from 370 markers on 22 chromosomes.
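The Gilles de la Tourette Syndrome Phenotype data: Linkage Tree
Each node lists its three sib-pair counts; each split is shown with its split p-value.
Root (23, 28, 26), split: IBD sharing at D5SMfd154 > 1.9, P = 0.0011
  (7, 0, 8)
  (16, 28, 18), split: D5S408 > 0, P = 0.0034
    (0, 8, 0)
    (16, 20, 18), split: D4S1652 > 1.16, P = 0.0078
      (10, 3, 4)
      (6, 17, 14)
Overall p-value = 2.63e-6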
Tree-based Association Study
The response variable is affection status.
The covariates include gender, the parental phenotypes, race, and variables constructed from the marker information.
If a marker has n distinct alleles, then n covariates are constructed for it, each taking the value 0, 1, or 2 (the number of copies of the corresponding allele). For example, if n = 7, the 7 covariates take the values (0,0,0,1,0,1,0) for a genotype of 4/6 and (0,0,0,0,0,0,2) for a genotype of 7/7.
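The allele-count coding can be sketched as follows, assuming the alleles of a marker are labeled 1 through n. The function name `allele_count_covariates` is illustrative.

```python
def allele_count_covariates(genotype, n_alleles):
    """Return n covariates; entry k-1 is the number of copies (0, 1, or 2) of allele k."""
    counts = [0] * n_alleles
    for allele in genotype:
        counts[allele - 1] += 1
    return tuple(counts)

print(allele_count_covariates((4, 6), 7))  # genotype 4/6 -> (0, 0, 0, 1, 0, 1, 0)
print(allele_count_covariates((7, 7), 7))  # genotype 7/7 -> (0, 0, 0, 0, 0, 0, 2)
```

Each covariate can then be offered to the tree as a candidate split, such as "more than 0 copies of allele 5".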
The Gilles de la Tourette Syndrome Phenotype data: Association Tree
Node counts (two counts per node):
(85, 135)
  (39, 29)
  (46, 106)
    (0, 18)
    (46, 88)
      (0, 11)
      (46, 77)
        (19, 54)
        (27, 23)
Splits on copies of an allele, with split p-values:
D4S403-5 > 0, P = 2e-4
D4S2632-5 > 1,NA, P = 0.016
D4S2431-10 > 0, P = 0.0023
D5S816-7 > 0,NA, P = 0.0017
Overall p-value = 1.03e-7
Why Recursive Partitioning?
Attempts to discover possibly very complex structure in huge databases
- genotypes for hundreds of markers
- expression profiles for thousands of genes
- all possible predictors (continuous, categorical)
No need to do transformation
Impervious to outliers
Easy to use
Easy to interpret
Recursive partitioning based tools for data analysis
Classification and regression: RTREE (http://peace.med.yale.edu), CART
Longitudinal data analysis: MASAL (http://peace.med.yale.edu)
Survival analysis: STREE (http://peace.med.yale.edu)
Multivariate adaptive regression splines: MASAL (http://peace.med.yale.edu), MARS
References: Books
L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone, 1984, Classification and Regression Trees, Wadsworth, California.
H. Zhang and B. Singer, 1999, Recursive Partitioning in the Health Sciences, Springer, New York.
T. Hastie, R. Tibshirani and J. Friedman, 2001, The Elements of Statistical Learning, Springer, New York.
References: Papers
Zhang, Tsai, Yu, and Bonney, 2001, Genetic Epidemiology, 21, Supplement 1, S317-S322.
Zhang, Leckman, Pauls, Tsai, Kidd, Campos and The TSAICG, 2002, American Journal of Human Genetics, 70, 896-904.
Zhang, Yu, Singer and Xiong, 2001, Proc Natl Acad Sci U S A, 98, 6730-6735.
Tsai, Acharyya, Yu and Zhang, 2002, in Recent Research Developments in Human Genetics.
Recent Developments
Instability of trees (high variance)
  Bagging: averages many trees to reduce variance (Breiman, 1996)
  Boosting (Breiman, 1998; Mason et al., 2000; Friedman et al., 1998)
  Random forest (Breiman, 1999)
Lack of smoothness
  MARS procedure (Zhang & Singer, 1999; Hastie et al., 2001)
Difficulty in capturing additive structure
  MARS procedure
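The bagging idea can be sketched with one-split trees (stumps) fitted to bootstrap resamples and combined by majority vote. Everything here is an illustrative assumption rather than the cited authors' implementation: the toy data are hypothetical, the stump learner minimizes training error rather than entropy, and the names `fit_stump` and `bagged_classifier` are made up for the example.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Best single-threshold rule on x, chosen by training error."""
    best = None
    for c in sorted(set(xs)):
        for above in (0, 1):  # label predicted when x > c
            errors = sum((above if x > c else 1 - above) != y
                         for x, y in zip(xs, ys))
            if best is None or errors < best[0]:
                best = (errors, c, above)
    _, c, above = best
    return lambda x: above if x > c else 1 - above

def bagged_classifier(xs, ys, n_trees=25, seed=0):
    """Fit stumps on bootstrap resamples and combine them by majority vote."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: Counter(s(x) for s in stumps).most_common(1)[0][0]

# Toy one-dimensional data (hypothetical): low values are class 0, high values class 1.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
predict = bagged_classifier(xs, ys)
print(predict(0), predict(11))
```

Each resample can move the fitted threshold, which is the instability noted in the list; voting over many resampled trees smooths out that variation.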
Competitive Tree for Colon Data
[Figure: correlation (vertical axis, 0.0 to 1.0) plotted over a horizontal axis running from 0 to 2000; labeled genes: M26383, R15447, M28214]
Node 1 (CT: 40, NT: 22)
  Node 2 (CT: 34, NT: 3)
    Node 5 (CT: 0, NT: 3)
    Node 6 (CT: 34, NT: 0)
  Node 3 (CT: 6, NT: 13)
    Node 7 (CT: 0, NT: 13)
    Node 8 (CT: 6, NT: 0)
  Node 4 (CT: 0, NT: 6)
Splitting genes: R87126, X15183, T62947
Split labels: (372, 1052], >1052, >457, >28
3-D Representation of Tree