1
TreeDT: Gene Mapping by Tree Disequilibrium Test
Authors:
Petteri Sevon, Dept. of Computer Science & Finnish Genome Center, Univ. of Helsinki
Hannu T.T. Toivonen, Nokia Research Center, Univ. of Helsinki
Vesa Ollikainen, Finnish Genome Center, Univ. of Helsinki
Advisor: Dr. Hsu
Graduate: Cheng-Wen Hong
2
Outline
• 1. Motivation
• 2. Objective
• 3. Introduction
• 4. Problem Background
• 5. Method
• 6. Algorithms
• 7. Related Work
• 8. Experiment
• 9. Conclusions
• 10. Personal Opinion
3
Motivation
• The USA and England will finish mapping the human genome in 2003. In the long term, geneticists will study human gene sequence variation, the inheritance of complex traits, and the discovery of new disease susceptibility genes. This work is immensely important for human health.
4
Objective
• We present a novel gene mapping method, TreeDT, that is effective at locating a disease-susceptibility gene for a given disease.
• The gene and its proteins can then be analyzed to understand the disease-causing mechanisms and to design new medicines.
5
Introduction
• (1). Gene mapping aims at discovering a statistical connection from a given disease to a narrow region in the genome (chromosomes).
• (2). Genetic markers along chromosomes provide data that can be used to discover associations between patient phenotypes (diseased vs. healthy) and chromosomal regions (i.e. potential disease gene loci).
• (3). We introduce TreeDT, a novel method for gene mapping. It analyses the observed strings of markers using tree patterns that reflect the possible genetic history of a disease susceptibility (DS) gene, and it locates the DS gene locus effectively.
6
• (4). The contributions of TreeDT are:
• (1). A novel approach to gene mapping using tree patterns.
• (2). An efficient algorithm for generating and testing tree patterns.
• (3). A method for estimating the statistical significance of findings.
7
Problem Background
• (1). Marker Data: A genetic marker is a short polymorphic region in the DNA, denoted here by M1, M2, … The different variants of DNA that different people have at the marker are alleles, denoted in our examples by 1, 2, 3, …. The collection of markers is a marker map, and the corresponding alleles of a chromosome constitute its haplotype (Figure 1).
• The input data consists of haplotypes of diseased and control persons.
8
Problem Background
• (2). Linkage disequilibrium
• All the current carriers of a DS gene have inherited it from a founder who introduced the gene mutation to the population (Figure 2).
• A haplotype that lies close to the mutation locus tends to stay linked with it over the generations. This is linkage disequilibrium (LD), the non-random association between nearby markers.
• (3). Gene Mapping
• Linkage analysis is used to determine the relative position between two genes on a chromosome.
9
Problem Background
• (4). Summary of Background and Problem
• Closely located markers can be very informative: given an ancestor with a mutated gene, the descendants that inherit the gene are also likely to inherit alleles of nearby markers.
• The LD-based gene mapping problem is now as follows.
• The input consists of a marker map, a set of disease-associated haplotypes, and a set of control haplotypes on the given map. The task is to predict the location of a disease susceptibility gene on the map.
10
Method
• Based on the observed haplotypes, TreeDT evaluates the most likely coalescence tree at a number of locations along the analyzed chromosome, and then assesses the subtree clustering of disease-associated haplotypes in these trees (using the tree disequilibrium test, intended for predicting the DS gene location).
11
Method
• (1). Haplotype Prefix Trees: Given a location (potential gene locus) in the chromosome, the haplotypes to the right (or to the left) of the location can be organized into a prefix tree (Figures 3 and 4).
• TreeDT builds two prefix trees, one to the left and one to the right, between each pair of consecutive markers and tests their disequilibrium.
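The prefix-tree construction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `build_prefix_tree` function, the gap-index convention for `location`, and the toy haplotypes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a haplotype prefix tree (hypothetical structure)."""
    children: dict = field(default_factory=dict)  # allele -> child Node
    n: int = 0  # number of haplotypes passing through this node
    a: int = 0  # number of disease-associated haplotypes among them


def build_prefix_tree(haplotypes, statuses, location, direction="right"):
    """Organize marker strings read away from `location` into a prefix tree.

    Assumption: `location` is a gap index, so markers 0..location-1 lie to
    the left of the candidate locus and markers location.. lie to the right.
    """
    root = Node()
    for hap, diseased in zip(haplotypes, statuses):
        if direction == "right":
            alleles = hap[location:]
        else:
            alleles = hap[:location][::-1]  # read leftwards from the gap
        node = root
        node.n += 1
        node.a += int(diseased)
        for allele in alleles:
            node = node.children.setdefault(allele, Node())
            node.n += 1
            node.a += int(diseased)
    return root


# Toy data (made up): four haplotypes over four markers.
haps = [(1, 2, 1, 3), (1, 2, 2, 3), (2, 2, 1, 3), (1, 1, 1, 1)]
status = [True, True, False, False]
right = build_prefix_tree(haps, status, location=2, direction="right")
left = build_prefix_tree(haps, status, location=2, direction="left")
```

Each node keeps the counts n and a for the haplotypes below it, which is the information a disequilibrium test on subtrees needs.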
12
Method
• (2). Tree Disequilibrium Test (for a haplotype prefix tree T)
• H0: The disease-association statuses are randomly distributed in the leaves of T.
• H1: The distribution of the disease-association statuses deviates in some subtrees of T from the overall distribution of statuses.
• For measuring the disequilibrium, the test statistic Zk for a tree with k deviant subtrees T1, .., Tk is
$$Z_k = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \frac{a_i - n_i p}{\sqrt{n_i p (1 - p)}}$$
where ai is the number of disease-associated haplotypes and ni the total number of haplotypes in subtree Ti ∈ S (the set of deviant subtrees), and p is the proportion of disease-associated haplotypes in the sample.
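As a small worked example, the statistic can be evaluated directly from the (ai, ni) counts of the chosen subtrees. The function name `z_statistic` and the numbers below are made up for illustration; the formula is the normalized sum of per-subtree z-scores, Zk = (1/√k) Σ (ai − ni p) / √(ni p (1 − p)).

```python
import math


def z_statistic(subtrees, p):
    """Z_k for a list of (a_i, n_i) pairs, one pair per deviant subtree.

    p is the proportion of disease-associated haplotypes in the sample.
    """
    k = len(subtrees)
    total = sum((a - n * p) / math.sqrt(n * p * (1 - p)) for a, n in subtrees)
    return total / math.sqrt(k)


# Toy numbers (made up): half of the sample is disease-associated.
p = 0.5
z1 = z_statistic([(9, 10)], p)          # one subtree, 9 of 10 diseased
z2 = z_statistic([(9, 10), (4, 4)], p)  # add a second subtree, 4 of 4 diseased
```

Here z1 is about 2.53 and z2 about 3.20. The two values have different k, so they are not directly comparable, which is why a separate significance estimate is needed for each k.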
13
Method
• (3). Significance Test
• (a) Zk is a measure of the disequilibrium of a given tree, at a certain location in the chromosome, with a given number k of deviant subtrees.
• (b) TreeDT finds for each k the set S of subtrees that maximizes Zk. (Zk can be efficiently maximized simultaneously for all k using a recursive algorithm.)
• (c) Since the Zk's for different degrees of freedom k are not comparable and the distribution of the maximized Zk is very complex, TreeDT estimates the p value for each maximized Zk (under H0). The p values are estimated by a permutation test.
• (d) In order to get a single p value for the disequilibrium at a given location, a combined measure is used: the product of the lowest p values over all k from each side (the left and right trees).
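A permutation test of this kind can be sketched as follows. This is a generic illustration rather than TreeDT's actual procedure: `permutation_p_value`, `combined_measure`, the toy statistic, and the add-one correction in the estimate are assumptions made for the example.

```python
import random


def permutation_p_value(statistic, statuses, n_perm=1000, seed=0):
    """Estimate P(statistic >= observed) under H0 by shuffling statuses.

    `statistic` is any function of the status list, for example a maximized
    Z_k computed on a fixed tree whose leaves are in the same order.
    """
    rng = random.Random(seed)
    observed = statistic(statuses)
    shuffled = list(statuses)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random reassignment of statuses to leaves
        if statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (hits + 1) / (n_perm + 1)


def combined_measure(p_left, p_right):
    """Product of the lowest p value over all k from each side.

    p_left and p_right map k to the estimated p value for that side's tree.
    """
    return min(p_left.values()) * min(p_right.values())


# Toy statistic (made up): diseased count among the first five leaves.
statuses = [1] * 5 + [0] * 5
p_val = permutation_p_value(lambda s: sum(s[:5]), statuses)
combo = combined_measure({1: 0.20, 2: 0.03}, {1: 0.10, 2: 0.40})
```

The combined measure is itself only a score, so its significance would again have to be judged against permuted data, which is what makes the tests nested.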
14
Method
• (e) The output of TreeDT is essentially a list of locations ranked by p value. A point prediction for the DS gene location is obtained by taking the best location. A (potentially fragmented) region of length L is obtained by taking the best locations until a length of L is covered.
• (f) All three nested p value tests (for each tree and k, for each location, and for the best location) can be carried out efficiently.
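Turning the ranked list into a point prediction and a region of length L might look like the sketch below. The interval representation (start, end, p value) and the function names are hypothetical, and the p values are made up.

```python
def point_prediction(locations):
    """Return the (start, end) interval with the lowest p value."""
    start, end, _ = min(locations, key=lambda loc: loc[2])
    return (start, end)


def predict_region(locations, target_length):
    """Take the best-ranked intervals until `target_length` is covered.

    locations: list of (start_cM, end_cM, p_value), one per gap between
    consecutive markers. The result may be fragmented.
    """
    ranked = sorted(locations, key=lambda loc: loc[2])
    chosen, covered = [], 0.0
    for start, end, _ in ranked:
        if covered >= target_length:
            break
        chosen.append((start, end))
        covered += end - start
    return sorted(chosen)


# Toy p values (made up) for five 1 cM intervals.
locs = [(0, 1, 0.40), (1, 2, 0.01), (2, 3, 0.03), (3, 4, 0.50), (4, 5, 0.02)]
best = point_prediction(locs)    # (1, 2)
region = predict_region(locs, 2)  # [(1, 2), (4, 5)], a fragmented region
```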
15
Algorithms
• (1). Constructing Haplotype Prefix Trees
• The haplotype prefix trees to the left and right of each analyzed location can be efficiently identified using a string-sorting algorithm.
• (2). An Algorithm for Maximizing the Tree Disequilibrium Statistic Zk
• It is essential that the time complexity of the algorithm for maximizing Zk is as low as possible, because it must be executed for each tree, location, and permutation in turn.
• (3). INPUT: A haplotype prefix tree T
• OUTPUT: Maximum values of Zk in the tree T for each k.
• The time complexity of the algorithm is O(n*n), where n is the number of leaves (haplotypes) in the tree.
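One way to maximize Zk for all k at once is a bottom-up recursion over the tree, sketched below. This is an assumed dynamic-programming formulation, not necessarily the authors' algorithm: it picks k disjoint subtrees maximizing the signed sum of per-subtree z-scores, so it only rewards an excess of disease-associated haplotypes. The `Node` class and function names are hypothetical.

```python
import math
from dataclasses import dataclass, field

NEG = float("-inf")


@dataclass
class Node:
    a: int  # disease-associated haplotypes below this node
    n: int  # all haplotypes below this node
    children: list = field(default_factory=list)


def best_sums(node, p):
    """best[j] = max sum of z-scores over j disjoint subtrees inside node."""
    best = [0.0]
    for child in node.children:
        cb = best_sums(child, p)
        # Max-plus merge: split the j subtrees between earlier children
        # and this child in every possible way.
        merged = [NEG] * (len(best) + len(cb) - 1)
        for i, x in enumerate(best):
            for j, y in enumerate(cb):
                merged[i + j] = max(merged[i + j], x + y)
        best = merged
    # Alternatively, take the whole subtree rooted here as the single pick.
    own = (node.a - node.n * p) / math.sqrt(node.n * p * (1 - p))
    if len(best) == 1:
        best.append(own)
    else:
        best[1] = max(best[1], own)
    return best


def max_z(root, p, max_k):
    """Maximum Z_k for each k = 1..max_k (as far as the tree allows)."""
    best = best_sums(root, p)
    top = min(max_k, len(best) - 1)
    return {k: best[k] / math.sqrt(k) for k in range(1, top + 1)}


# Toy tree (made up): 8 haplotypes, 4 of them disease-associated.
tree = Node(4, 8, [Node(3, 4, [Node(2, 2), Node(1, 2)]), Node(1, 4)])
result = max_z(tree, p=0.5, max_k=3)
```

In this formulation the merge work at a node is bounded by the product of the sizes of the merged lists, which is the usual argument for a roughly quadratic total cost in the number of leaves.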
16
Algorithms
• (4). Multiple Nested Permutation Tests
• The straightforward algorithm for a three-level nested permutation test using nested loops would have time complexity proportional to n*n*n, where n is the number of permutations at each level.
17
Related Work
• (1). Several statistical methods have been proposed to detect LD around a DS gene, but these methods are computationally heavy.
• (2). Haplotype Pattern Mining (HPM) is based on analyzing the LD of sets of haplotype patterns.
• (3). Transmission/Disequilibrium Tests (TDT) are an established way of testing association and linkage in a sample where linkage disequilibrium exists between the mutation locus and nearby marker loci.
• (4). m-TDT is a multipoint variant of TDT that detects LD using haplotypes of several alleles.
18
Experiments
• We compare TreeDT empirically to TDT, to m-TDT, and to HPM.
• We evaluate the methods on simulated data (simulated to resemble a realistic population isolate).
• We use 100 data sets. Each data set consists of 200 disease-associated and 200 control chromosomes. The length of the analyzed region was 100 cM, with a map of 101 equidistantly spaced markers, each having 5 alleles.
19
Analysis of TreeDT
• (1). First we assess the prediction accuracy (power) of TreeDT with different values of A, the proportion of disease-associated chromosomes that actually carry the mutation. For A = 20% or 15% the accuracy is very good. With lower values of A the accuracy decreases, until with A = 5% (challenging) the gene can be localized within a reasonable accuracy of 10-20 cM in only 20-30% of the data sets.
20
Analysis of TreeDT
• (2). We evaluate the effect of the only parameter of TreeDT, the number of deviant subtrees (founders) that are searched for in each tree (Figure 5B).
• As we increase the number of founders (deviant subtrees), evidence about the gene location becomes more fragmented, but the upper limit of 6 subtrees gives consistently competitive results.
21
Analysis of TreeDT
• Figure 5C shows the experimental relationship between power (the ratio of true positives to all positives) and overall p (the ratio of false positives to all negatives). For higher values of A the classification accuracy is extremely good, but at A = 5% (challenging) the classification is no better than random guessing.
22
Comparison to other methods
• (1). TreeDT, HPM and m-TDT have practically identical performance in localizing the DS gene in the baseline setting (Figure 6A). TDT is clearly inferior to the other methods.
23
Comparison to other methods
• (2). In a test setting with three founders who introduced the mutation to the population (Figure 6B), TreeDT has an edge over HPM, which in turn has an edge over m-TDT. TDT barely beats random guessing.
24
Comparison to other methods
• (3). We compare the methods with a large amount of missing data (Figure 6C). HPM is the most robust with respect to missing data, but TreeDT is not much weaker than HPM. The performance of m-TDT degrades much more clearly.
• Comparisons (1), (2) and (3) show that TreeDT is very competitive.
25
Conclusions
• (1). TreeDT is a novel method for gene mapping, and our experiments show that TreeDT is effective in extreme conditions for gene mapping problems: with lots of noise (only 10%-20% of affected chromosomes carry the mutation, and lots of missing data) and with small sample sizes (200 affected and 200 control chromosomes).
• (2). TreeDT is competitive with other recent data mining methods.
26
Personal Opinion
• We could look for a better statistic for the Tree Disequilibrium Test, one for which:
• (1). The distribution of the maximized statistic is very simple, so that p values can be computed with low time complexity.
• (2). The maximized statistics are comparable across different degrees of freedom.
• (3). We could also look for other methods that do not use trees.