
TreeDT: Gene Mapping by Tree Disequilibrium Test

Authors:

Petteri Sevon, Dept. of Computer Science & Finnish Genome Center, Univ. of Helsinki

Hannu T.T. Toivonen, Nokia Research Center, Univ. of Helsinki

Vesa Ollikainen, Finnish Genome Center, Univ. of Helsinki

Advisor: Dr. Hsu

Graduate: Cheng-Wen Hong


Outline

• 1. Motivation
• 2. Objective
• 3. Introduction
• 4. Problem Background
• 5. Method
• 6. Algorithms
• 7. Related Work
• 8. Experiments
• 9. Conclusions
• 10. Personal Opinion


Motivation

• The USA and England will finish mapping the human genome in 2003. In the long term, geneticists will study human gene sequence variation, the inheritance of complex traits, and the discovery of new disease susceptibility genes. This work is of immense importance for human health.


Objective

• We present a novel gene mapping method, TreeDT. It is effective at locating a disease susceptibility gene for a given disease.

• The gene and its proteins can then be analyzed to understand the disease-causing mechanisms and to design new medicines.


Introduction

• (1) Gene mapping aims at discovering a statistical connection from a given disease to a narrow region in the genome (chromosomes).

• (2) Genetic markers along chromosomes provide data that can be used to discover associations between patient phenotypes (diseased vs. healthy) and chromosomal regions (i.e. potential disease gene loci).

• (3) We introduce TreeDT, a novel method for gene mapping. It analyzes the observed strings of markers using tree patterns that reflect the possible genetic history of a disease susceptibility (DS) gene, and it locates the DS gene locus effectively.


• (4) The contributions of TreeDT are:

• (1) A novel approach to gene mapping using tree patterns.

• (2) An efficient algorithm for generating and testing tree patterns.

• (3) A method for estimating the statistical significance of findings.


Problem Background

• (1) Marker Data: A genetic marker is a short polymorphic region in the DNA, denoted here by M1, M2, … The different variants of DNA that different people have at a marker are called alleles, denoted in our examples by 1, 2, 3, … The collection of markers is a marker map, and the corresponding alleles on a given chromosome constitute its haplotype (Figure 1).

• The input data consists of haplotypes of diseased and control persons.
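The input described above can be pictured as a list of allele strings with a disease-association flag. The following is only a minimal sketch: the `Haplotype` type, the field names, and the toy allele values are illustrative assumptions, not part of TreeDT or of the original slides.

```python
from typing import List, NamedTuple


class Haplotype(NamedTuple):
    alleles: List[int]  # allele observed at markers M1, M2, ... in map order
    diseased: bool      # True = disease-associated, False = control


# Hypothetical toy sample over a four-marker map M1..M4 (not real data).
sample = [
    Haplotype([1, 2, 2, 1], True),
    Haplotype([1, 2, 3, 1], True),
    Haplotype([2, 1, 2, 3], False),
    Haplotype([1, 1, 2, 3], False),
]


def disease_proportion(haplotypes):
    """Proportion p of disease-associated haplotypes in the sample."""
    return sum(h.diseased for h in haplotypes) / len(haplotypes)
```

The proportion returned by `disease_proportion` corresponds to the quantity p that the test statistic on the Method slides uses.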


Problem Background

• (2) Linkage Disequilibrium: All the current carriers of a DS gene have inherited it from a founder who introduced the gene mutation to the population (Figure 2).

• If a haplotype stays linked with the mutation locus over the generations, there is linkage disequilibrium (LD), a non-random association between nearby markers.

• (3) Gene Mapping: Using linkage analysis to determine the relative position between two genes on a chromosome.


Problem Background

• (4) Summary of Background and Problem

• Markers located near the gene can be very informative: given an ancestor with a mutated gene, the descendants that inherit the gene are also likely to inherit alleles of nearby markers.

• The LD-based gene mapping problem is as follows. The input consists of a marker map, a set of disease-associated haplotypes, and a set of control haplotypes on the given map. The task is to predict the location of a disease susceptibility gene on the map.


Method

• Based on the observed haplotypes, TreeDT evaluates the most likely coalescence tree at a number of locations along the analyzed chromosome, and then assesses the subtree clustering of disease-associated haplotypes in these trees using the tree disequilibrium test, which is intended for predicting the DS gene location.


Method

• (1) Haplotype Prefix Trees: Given a location (a potential gene locus) in the chromosome, the haplotypes to the right (or to the left) of the location can be organized into a prefix tree (Figures 3 and 4).

• TreeDT builds two prefix trees, one to the left and one to the right, between each pair of consecutive markers and tests their disequilibrium.
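As a rough illustration of the two prefix trees at one location, the sketch below builds plain tries from the allele strings read away from the location. The dictionary node layout (`n`, `a`, `children`) and the function names are assumptions made for this example. The slides do not say how TreeDT stores its trees.

```python
def new_node():
    # n = haplotypes passing through this node, a = disease-associated ones
    return {"n": 0, "a": 0, "children": {}}


def build_prefix_tree(strings, statuses):
    """Insert every allele string into a trie, counting statuses per node."""
    root = new_node()
    for alleles, diseased in zip(strings, statuses):
        node = root
        node["n"] += 1
        node["a"] += int(diseased)
        for allele in alleles:
            node = node["children"].setdefault(allele, new_node())
            node["n"] += 1
            node["a"] += int(diseased)
    return root


def trees_at(haplotypes, statuses, gap):
    """Left and right prefix trees for the location just before marker index `gap`.

    The right tree reads markers gap, gap+1, ... and the left tree reads
    markers gap-1, gap-2, ... so both strings start next to the location.
    """
    left = [h[:gap][::-1] for h in haplotypes]
    right = [h[gap:] for h in haplotypes]
    return build_prefix_tree(left, statuses), build_prefix_tree(right, statuses)
```

Each subtree then groups the haplotypes that share the same alleles next to the location, which is what the disequilibrium test examines.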


Method

• (2) Tree Disequilibrium Test (for a haplotype prefix tree T)

• H0: The disease-association statuses are randomly distributed in the leaves of T.

• H1: The distribution of the disease-association statuses deviates in some subtrees of T from the overall distribution of statuses.

• Measuring the disequilibrium: the test statistic Zk is computed for a tree with k deviant subtrees T1, …, Tk, where ai is the number of disease-associated haplotypes and ni the total number of haplotypes in subtree Ti ∈ S, and p is the proportion of disease-associated haplotypes in the sample:

$$Z_k = \frac{1}{\sqrt{k}} \sum_{i=1}^{k} \frac{a_i - n_i p}{\sqrt{n_i p (1 - p)}}$$
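To make the statistic concrete, here is a small sketch that evaluates Zk for a given set of subtrees, assuming the usual reading of the formula (a sum of standardized deviations a_i − n_i·p, scaled by 1/√k). The function name and the `(a_i, n_i)` tuple input are choices made for this example.

```python
import math


def z_statistic(subtrees, p):
    """Z_k for k subtrees given as (a_i, n_i) pairs and overall proportion p."""
    k = len(subtrees)
    total = sum((a - n * p) / math.sqrt(n * p * (1 - p)) for a, n in subtrees)
    return total / math.sqrt(k)
```

For example, with p = 0.5 a single subtree holding two haplotypes that are both disease-associated gives (2 − 1)/√0.5 = √2, while adding a second subtree of two control haplotypes contributes −√2 and brings the sum back to zero.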


Method

• (3) Significance Test

• (a) Zk is a measure of the disequilibrium of a given tree, at a certain location in the chromosome, with a given number k of deviant subtrees.

• (b) For each k, TreeDT finds the set S of subtrees that maximizes Zk. (Zk can be efficiently maximized simultaneously for all k using a recursive algorithm.)

• (c) Since the Zk's for different degrees of freedom k are not comparable and the distribution of the maximized Zk is very complex, TreeDT estimates a p value for each maximized Zk under H0 by a permutation test.

• (d) To get a single p value for the disequilibrium at a given location, a combined measure is used: the product of the lowest p values over all k from each side.
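Steps (c) and (d) can be sketched as follows. This is a generic permutation test, not the authors' implementation: the `statistic` callback, the seed handling, and the (hits + 1)/(n_perm + 1) estimate are assumptions chosen for the example.

```python
import random


def permutation_p_value(statistic, statuses, n_perm=1000, seed=0):
    """Estimate P(statistic >= observed) under H0 by shuffling the statuses."""
    rng = random.Random(seed)
    observed = statistic(statuses)
    shuffled = list(statuses)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # random relabeling of diseased vs. control
        if statistic(shuffled) >= observed:
            hits += 1
    # Assumed convention: count the observed labeling as one permutation.
    return (hits + 1) / (n_perm + 1)


def combined_measure(p_left, p_right):
    """Product of the lowest p values over all k from the left and right trees.

    p_left and p_right map k to the estimated p value of the maximized Z_k.
    """
    return min(p_left.values()) * min(p_right.values())
```

In TreeDT the `statistic` would be the maximized Zk of a tree for a fixed k. Any function of the status list can be plugged in to try the sketch.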


Method

• (e) The output of TreeDT is essentially a list of locations ranked by p value. A point prediction for the DS gene location is obtained by taking the best location; a (potentially fragmented) region of length L is obtained by taking the best locations until a length of L is covered.

• (f) All three nested p value tests (for each tree and k, for each location, and for the best location) can be carried out efficiently.
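The way a point prediction and a region of length L fall out of the ranked list in (e) can be sketched as below. The `(start, end, p_value)` interval format and the function name are assumptions for illustration. Each interval stands for one analyzed location between two consecutive markers.

```python
def predict(intervals, target_length):
    """Return the best location and a region covering at least target_length.

    intervals: list of (start, end, p_value), one per analyzed location.
    """
    ranked = sorted(intervals, key=lambda iv: iv[2])  # lowest p value first
    chosen, covered = [], 0.0
    for start, end, _p in ranked:
        if covered >= target_length:
            break
        chosen.append((start, end))
        covered += end - start
    return ranked[0], sorted(chosen)
```

With intervals `[(0, 1, 0.5), (1, 2, 0.01), (2, 3, 0.2), (3, 4, 0.02)]` and a target length of 2, the best location is `(1, 2, 0.01)` and the region is `[(1, 2), (3, 4)]`, which shows how the predicted region can be fragmented.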


Algorithms

• (1) Constructing Haplotype Prefix Trees: The haplotype prefix trees to the left and right of each analyzed location can be efficiently identified using a string-sorting algorithm.

• (2) An Algorithm for Maximizing the Tree Disequilibrium Statistic Zk: It is essential that the time complexity of the algorithm for maximizing Zk is as low as possible, because it must be executed for each tree, location, and permutation in turn.

• (3) INPUT: A haplotype prefix tree T.

• OUTPUT: Maximum values of Zk in the tree T for each k.

• The time complexity of the algorithm is O(n²), where n is the number of leaves (haplotypes) in the tree.
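One way to obtain the maximum of Zk for every k in a single pass is a bottom-up dynamic program over the tree. The sketch below is an assumption about how such a recursion could look, not the algorithm from the paper, and makes no claim of matching its complexity. It reuses a hypothetical node layout (`n`, `a`, `children`) and caps k at `max_k`.

```python
import math

NEG = float("-inf")


def node_z(node, p):
    """Standardized deviation of one subtree from the overall proportion p."""
    return (node["a"] - node["n"] * p) / math.sqrt(node["n"] * p * (1 - p))


def best_sums(node, p, max_k):
    """best[j] = largest total deviation over j disjoint subtrees inside node."""
    merged = [0.0] + [NEG] * max_k
    for child in node["children"].values():
        child_best = best_sums(child, p, max_k)
        combined = [NEG] * (max_k + 1)
        for i, x in enumerate(merged):
            for j, y in enumerate(child_best):
                if i + j > max_k:
                    break
                if x != NEG and y != NEG:
                    combined[i + j] = max(combined[i + j], x + y)
        merged = combined
    # Alternatively, take the whole subtree rooted here as the single choice.
    merged[1] = max(merged[1], node_z(node, p))
    return merged


def max_z(root, p, max_k):
    """Maximum Z_k for each feasible k, dividing the best sum by sqrt(k)."""
    sums = best_sums(root, p, max_k)
    return {k: sums[k] / math.sqrt(k) for k in range(1, max_k + 1) if sums[k] != NEG}
```

Because 1/√k is constant for a fixed k, maximizing the sum of deviations for each k is enough. The recursion depth equals the tree depth, which is bounded by the number of markers.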


Algorithms

• (4) Multiple Nested Permutation Tests: The straightforward algorithm for a three-level nested permutation test using nested loops would have time complexity proportional to n³, where n is the number of permutations at each level.
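A common way to avoid nested loops is to compute the statistic once per permutation and reuse the same permutations at every level. The slides do not describe how TreeDT does this, so the two-level sketch below is an assumption with hypothetical function names. `stats_by_k[k][0]` is the observed value and the remaining entries come from the shared permutations.

```python
from bisect import bisect_left, bisect_right


def upper_tail_p(values):
    """For every entry, the fraction of entries that are at least as large."""
    ordered = sorted(values)
    n = len(values)
    return [(n - bisect_left(ordered, v)) / n for v in values]


def lower_tail_p(values):
    """For every entry, the fraction of entries that are at least as small."""
    ordered = sorted(values)
    n = len(values)
    return [bisect_right(ordered, v) / n for v in values]


def nested_p(stats_by_k):
    """p value of the observed minimum-over-k p value, from shared permutations."""
    per_k = [upper_tail_p(values) for values in stats_by_k.values()]
    best = [min(column) for column in zip(*per_k)]  # lowest p over k, per labeling
    return lower_tail_p(best)[0]
```

Sorting replaces the inner loops, so each level costs on the order of n log n per k instead of multiplying the number of permutations at every level.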


Related Work

• (1) Several statistical methods have been proposed to detect LD around a DS gene, but these methods are computationally heavy.

• (2) Haplotype Pattern Mining (HPM) is based on analyzing the LD of sets of haplotype patterns.

• (3) Transmission/Disequilibrium Tests (TDT) are an established way of testing association and linkage in a sample where linkage disequilibrium exists between the mutation locus and nearby marker loci.

• (4) m-TDT is a multipoint variant of TDT that detects LD using haplotypes of several alleles.


Experiments

• We compare TreeDT empirically to TDT, m-TDT, and HPM.

• We evaluate the methods on simulated data, generated to resemble a realistic population isolate.

• We use 100 data sets. Each data set consists of 200 disease-associated and 200 control chromosomes. The length of the analyzed region is 100 cM, with a map of 101 equidistantly spaced markers, each having 5 alleles.


Analysis of TreeDT

• (1) First we assess the prediction accuracy (power) of TreeDT with different values of A, the proportion of disease-associated chromosomes that actually carry the mutation. For A = 20% or 15% the accuracy is very good. With lower values of A the accuracy decreases, until with A = 5% (a challenging setting) the gene can be localized within a reasonable accuracy of 10-20 cM in only 20-30% of the data sets.


Analysis of TreeDT

• (2) We evaluate the effect of the only parameter of TreeDT, the number of deviant subtrees (founders) that are searched for in each tree (Figure 5B).

• As we increase the number of founders (deviant subtrees), evidence about the gene location becomes more fragmented, but the upper limit of 6 subtrees gives consistently competitive results.


Analysis of TreeDT

• Figure 5C shows the experimental relationship between power (ratio of true positives to all positives) and overall p (ratio of false positives to all negatives). For higher values of A the classification accuracy is extremely good, but with A = 5% (challenging) the classification is no better than random guessing.
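The two ratios plotted against each other can be computed as in the sketch below, assuming each data set yields one overall p value and a known flag for whether a gene is actually present. The function name, argument names, and threshold are illustrative.

```python
def power_and_false_positive_rate(p_values, has_gene, threshold):
    """Power = true positives / all positives, rate = false positives / all negatives."""
    called = [p <= threshold for p in p_values]
    true_pos = sum(1 for c, g in zip(called, has_gene) if c and g)
    false_pos = sum(1 for c, g in zip(called, has_gene) if c and not g)
    positives = sum(1 for g in has_gene if g)
    negatives = len(has_gene) - positives
    return true_pos / positives, false_pos / negatives
```

Sweeping `threshold` from 0 to 1 traces out a curve of this kind. A method that is no better than random guessing gives roughly equal values for the two ratios.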


Comparison to other methods

• (1) TreeDT, HPM, and m-TDT have practically identical performance in localizing the DS gene in the baseline setting (Figure 6A). TDT is clearly inferior to the other methods.


Comparison to other methods

• (2) In a test setting with three founders who introduced the mutation to the population (Figure 6B), TreeDT has an edge over HPM, which in turn has an edge over m-TDT. TDT barely beats random guessing.


Comparison to other methods

• (3) We compare the methods with a large amount of missing data (Figure 6C). HPM is the most robust with respect to missing data, but TreeDT is not much weaker than HPM. The performance of m-TDT degrades much more clearly.

• Comparisons (1), (2), and (3) show that TreeDT is very competitive.


Conclusions

• (1) TreeDT is a novel method for gene mapping, and our experiments show that TreeDT is effective in extreme conditions for gene mapping problems: with lots of noise (only 10%-20% of affected chromosomes carry the mutation, lots of missing data) and with small sample sizes (200 affected and 200 control chromosomes).

• (2) TreeDT is competitive with other recent data mining methods.


Personal Opinion

• We could look for a better statistic for the Tree Disequilibrium Test:

• (1) The distribution of the maximized statistic should be very simple, so that p values can be computed with low time complexity.

• (2) The maximized statistics should be comparable across different degrees of freedom.

• (3) We do not use the tree method to find other methods.
