Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs

Post on 10-Feb-2016


DESCRIPTION

Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs. Tamar Barzuza (1), Jacques S. Beckmann (2,3), Ron Shamir (4), Itsik Pe’er (5). (1) Computer Science and Applied Mathematics, Weizmann Institute of Science; (2) Molecular Genetics, Weizmann Institute of Science

TRANSCRIPT

Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs

Tamar Barzuza (1), Jacques S. Beckmann (2,3), Ron Shamir (4), Itsik Pe’er (5)

(1) Computer Science and Applied Mathematics, Weizmann Institute of Science
(2) Molecular Genetics, Weizmann Institute of Science
(3) Génétique Médicale, Universitätsspital Lausanne
(4) School of Computer Science, Tel-Aviv University
(5) Medical and Population Genetics Group, Broad Institute

Overview
  Introduction
  Xor PPH
    Theoretical outlines and results
    Experimental results
  Informative SNPs
    Theoretical results
  Summary and Future research

Chromosomes

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA

AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGATTAGCTGCCACA

AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA

SNP: Single nucleotide polymorphism

Alleles of the six chromosomes at the four variable positions (one column per SNP):
  SNP 1: ATTAAA
  SNP 2: GTTTGG
  SNP 3: AACCCC
  SNP 4: CCCTTT


Haplotypes, Genotypes and XOR-Genotypes

  SNP       1        2        3        4
  Alleles   ATTAAA   GTTTGG   AACCCC   CCCTTT
  Binary    100111   100011   001111   111000

Genotype:      A/T  T/G  A  C
Haplotypes:    A G A C
               T T A C
XOR-Genotype:  Het  Het  Hom  Hom

Haplotypes, Genotypes and XOR-Genotypes (binary encoding)

  SNP       1        2        3        4
  Alleles   ATTAAA   GTTTGG   AACCCC   CCCTTT
  Binary    100111   100011   001111   111000

Genotype:      2 2 0 1
Haplotypes:    1 1 0 1
               0 0 0 1
XOR-Genotype:  {1, 2}
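The encoding on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `genotype` and `xor_genotype` are made up here, and the code assumes haplotypes are already given as 0/1 lists, with 2 marking a heterozygous site as in the slide.

```python
def genotype(h1, h2):
    """Per-SNP genotype: 0 or 1 where the haplotypes agree, 2 where they differ."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def xor_genotype(h1, h2):
    """Set of 1-based SNP indices at which the two haplotypes differ (the het SNPs)."""
    return {i for i, (a, b) in enumerate(zip(h1, h2), start=1) if a != b}


# The two haplotypes of the slide, A G A C and T T A C, in binary form.
h1 = [1, 1, 0, 1]
h2 = [0, 0, 0, 1]

print(genotype(h1, h2))      # genotype 2 2 0 1 on the slide
print(xor_genotype(h1, h2))  # xor-genotype {1, 2} on the slide
```

The xor-genotype keeps only which SNPs are heterozygous, so it carries less information than the genotype: the homozygous values (0 and 1 at SNPs 3 and 4) are dropped.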

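Perfect Phylogeny

Haplotypes (rows) on SNPs 1-5 (columns):
  1 0 0 0 0
  1 0 0 1 0
  1 0 1 0 0
  1 1 0 0 0
  1 0 0 1 1
  0 0 0 1 0

Tree (each edge is labeled with the SNP that changes along it):
  1 0 0 1 0
    4: 1→0   1 0 0 0 0
      2: 0→1   1 1 0 0 0
      3: 0→1   1 0 1 0 0
    1: 1→0   0 0 0 1 0
    5: 0→1   1 0 0 1 1

[Figure: the tree redrawn with SNPs only as edge labels]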
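Whether a binary haplotype matrix fits a perfect phylogeny can be checked with the standard four-gamete condition: no pair of SNP columns may show all four combinations 00, 01, 10 and 11. The slides do not state this test; it is added here as a well-known characterization for binary data when the root is not fixed, and the function name is hypothetical.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """Return True if no two SNP columns exhibit all four gametes 00, 01, 10, 11."""
    num_snps = len(haplotypes[0])
    for i, j in combinations(range(num_snps), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


# The six haplotypes of the Perfect Phylogeny slide.
slide_haplotypes = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
]

print(passes_four_gamete_test(slide_haplotypes))
```

The check is quadratic in the number of SNPs, which is fine for a sanity test but is not the method used by the algorithms cited in this talk.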

Previous work

Haplotyping: haplotypes from genotypes
  Input:  Genotypes G = {G_1, …, G_n} on SNPs S = {s_1, …, s_m}
  Output: Find the haplotypes H = {H_1, …, H_2n} that gave rise to G

General heuristics: Clark '90; Excoffier + Slatkin '95

PPH: Perfect phylogeny haplotyping (n genotypes, m SNPs):
  Gusfield 2002       O(nm α(n,m))
  Bafna et al. 2002   O(nm^2)
  Eskin et al. 2003   O(nm^2)

Graph Realization

Previous work
  Tutte 1959                 O(n^2 m)
  Gavril and Tamari 1983     O(nm^2)
  Bixby and Wagner 1988      O(nm α(n,m))

The graph realization problem:
  Input: A hypergraph H = ({1, …, m}, P),
         P = {P_1, P_2, …, P_n}, P_i ⊆ {1, …, m}
  Goal:  A tree T = (V, E) with E = N s.t. P_i labels a path in T

Example:
  Input:  { {1, 2}, {2, 3} }
  Output: [Figure: trees whose edges are labeled 1, 2, 3]
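To make the goal concrete, the following sketch verifies a proposed realization: given a tree whose edges carry the labels 1, …, m, it checks that every hyperedge P_i is the label set of a path. It only checks a candidate tree and does not construct one. The representation (a dict from edge label to its two endpoints) and the function names are assumptions made for this example.

```python
from collections import Counter


def is_path(tree_edges, labels):
    """True if the edges with the given labels form a single path in the tree.

    tree_edges maps each edge label to its endpoints (u, v) and is assumed
    to describe a tree, so any subset of edges is cycle-free.
    """
    degree = Counter()
    for label in labels:
        u, v = tree_edges[label]
        degree[u] += 1
        degree[v] += 1
    if not degree:
        return False
    endpoints = sum(1 for d in degree.values() if d == 1)
    # In a forest, maximum degree 2 means every component is a path;
    # exactly two vertices of degree 1 means there is only one component.
    return max(degree.values()) <= 2 and endpoints == 2


def realizes(tree_edges, hyperedges):
    """True if every hyperedge labels a path in the tree."""
    return all(is_path(tree_edges, p) for p in hyperedges)


hyperedges = [{1, 2}, {2, 3}]
path_tree = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d")}   # edges 1-2-3 in a row
star_tree = {1: ("x", "a"), 2: ("x", "b"), 3: ("x", "c")}   # three edges at one vertex

print(realizes(path_tree, hyperedges), realizes(star_tree, hyperedges))
```

Both trees realize the example input, which shows that a graph realization need not be unique.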

Overview
  Introduction
  Xor PPH
    Theoretical outlines and results
    Experimental results
  Informative SNPs
    Theoretical results
  Summary and Future research

XPPH - Xor perfect phylogeny haplotyping

Xor-haplotyping: haplotypes from xor-genotypes
  Input: 1. Xor-genotype data (can be obtained by DHPLC)
         2. Three genotypes
  Goal:  Resolve the haplotypes and their perfect phylogeny

  Haplotypes (?)         Xor-genotypes   Genotypes
  1 1 0 1 / 0 0 0 1      {1, 2}          0/1  0/1  0    1
  0 1 0 1 / 0 0 0 0      {2, 4}          0    0/1  0    0/1
  0 1 1 1 / 0 0 0 0      {2, 3, 4}       0    0/1  0/1  0/1
  1 1 0 1 / 0 0 0 0      {1, 2, 4}       0/1  0/1  0    0/1
  1 1 0 1 / 0 1 0 1      {1}             0/1  1    0    0


XPPH - Xor perfect phylogeny haplotyping

Strategy:
  1. Input: Xor-genotype data
     Goal:  Find the perfect phylogeny
  2. Additional input: 3 genotypes
     Goal:  Find haplotypes

Step 1:
  Xor-genotype = {Het SNPs} = A path in the perfect phylogeny
  ⇒ Build a tree from its paths ⇒ Graph realization
  Input reduction: Merge SNPs that are equivalent in the xor-data
  Proof: Unique graph realization solution ⇒ A perfect phylogeny
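The observation behind Step 1, that a xor-genotype is exactly the set of edge labels on the tree path between the individual's two haplotypes, can be checked on the perfect phylogeny tree shown earlier in the talk. The sketch below hard-codes that tree; the helper names are illustrative and not part of GREAL.

```python
def xor_genotype(h1, h2):
    """1-based indices of the SNPs at which two haplotype strings differ."""
    return {i for i, (a, b) in enumerate(zip(h1, h2), start=1) if a != b}


def path_labels(adjacency, start, goal):
    """SNP labels on the unique tree path from start to goal."""
    stack = [(start, None, [])]
    while stack:
        node, parent, labels = stack.pop()
        if node == goal:
            return set(labels)
        for neighbour, snp in adjacency[node]:
            if neighbour != parent:
                stack.append((neighbour, node, labels + [snp]))
    return None


# Edges of the tree from the Perfect Phylogeny slide: (haplotype, haplotype, SNP).
edges = [
    ("10010", "10000", 4),
    ("10010", "00010", 1),
    ("10010", "10011", 5),
    ("10000", "11000", 2),
    ("10000", "10100", 3),
]
adjacency = {}
for u, v, snp in edges:
    adjacency.setdefault(u, []).append((v, snp))
    adjacency.setdefault(v, []).append((u, snp))

print(path_labels(adjacency, "11000", "10011"))
print(xor_genotype("11000", "10011"))
```

Because the two sets agree for every pair of haplotypes, recovering the tree from xor-genotypes alone becomes the problem of building a tree from a collection of its paths.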

GREAL
  Find graph realization or determine that none exists
  Count number of graph realization solutions for data
  Stable and fast
  Available at http://www.cs.tau.ac.il/~rshamir/greal/

Simulations
  Simulate data of n individuals using Hudson 2002
  Remove all SNPs with <5% minor allele frequency
  Apply GREAL: Is there a single solution?
  Repeat 5000 times for each n
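The filtering step of this protocol is easy to sketch. The code below assumes the simulated haplotypes are already available as 0/1 lists; the simulator and GREAL are not reproduced, and the function name and return format are made up for illustration.

```python
def filter_by_maf(haplotypes, min_maf=0.05):
    """Drop SNP columns whose minor allele frequency is below min_maf.

    Returns the 0-based indices of the kept SNPs and the reduced haplotypes.
    """
    n = len(haplotypes)
    num_snps = len(haplotypes[0])
    kept = []
    for j in range(num_snps):
        ones = sum(h[j] for h in haplotypes)
        maf = min(ones, n - ones) / n
        if maf >= min_maf:
            kept.append(j)
    reduced = [[h[j] for j in kept] for h in haplotypes]
    return kept, reduced


sample = [
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
]
# With a 30% threshold: column 0 (MAF 0) and column 1 (MAF 0.25) are removed.
print(filter_by_maf(sample, min_maf=0.3))
```

Rare SNPs are removed before the uniqueness question is asked, so the reported percentages refer to common variants only.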

We implemented Gavril & Tamari’s algorithm (1983) for graph realization: O(m^2 n)
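The input reduction mentioned under Step 1 (merging SNPs that are equivalent in the xor-data) can be sketched as a preprocessing pass before graph realization. The sketch assumes that "equivalent" means "occurring in exactly the same xor-genotypes"; that reading, and all names below, are assumptions rather than a description of the GREAL code.

```python
from collections import defaultdict


def merge_equivalent_snps(xor_genotypes):
    """Group SNPs that occur in exactly the same xor-genotypes.

    Returns the equivalence classes and the xor-genotypes rewritten so that
    each class is represented by its smallest SNP.
    """
    occurrences = defaultdict(set)
    for index, x in enumerate(xor_genotypes):
        for snp in x:
            occurrences[snp].add(index)

    groups = defaultdict(list)
    for snp in sorted(occurrences):
        groups[frozenset(occurrences[snp])].append(snp)

    classes = sorted(groups.values())
    representative = {snp: cls[0] for cls in classes for snp in cls}
    reduced = [{representative[snp] for snp in x} for x in xor_genotypes]
    return classes, reduced


classes, reduced = merge_equivalent_snps([{1, 2, 3}, {2, 3}, {1}])
print(classes)   # SNPs 2 and 3 always appear together
print(reduced)
```

SNPs in the same class cannot be ordered relative to each other from xor-data alone, so treating them as one edge removes an obvious source of multiple solutions.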

Results
[Figure: The percentage of single solutions vs sample size]
R.H. Chung and D. Gusfield 2003

XPPH
Step 2: Perfect phylogeny → Haplotypes?

[Figure: a tree with edges labeled 1, 2, 3]
Xor-genotypes: {1, 2}, {1, 3}, {2, 3}

Two haplotype sets are consistent with these xor-genotypes:
  0 0 0        1 0 0
  1 1 0        0 1 0
  1 0 1        0 0 1

Resolution up to bit flipping: gives the haplotypes' structure
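Why xor-data can only resolve haplotypes up to bit flipping is easy to see in code: flipping one SNP in every haplotype leaves all pairwise xor-genotypes unchanged. The sketch uses the two haplotype sets of this slide; the function names are illustrative.

```python
from itertools import combinations


def all_xor_genotypes(haplotypes):
    """Xor-genotypes (sets of 1-based het SNPs) of every pair of haplotypes."""
    return [
        frozenset(i for i, (a, b) in enumerate(zip(h1, h2), start=1) if a != b)
        for h1, h2 in combinations(haplotypes, 2)
    ]


def flip_snp(haplotypes, snp):
    """Flip the given 1-based SNP in every haplotype."""
    return [
        [1 - value if i == snp - 1 else value for i, value in enumerate(h)]
        for h in haplotypes
    ]


first = [[0, 0, 0], [1, 1, 0], [1, 0, 1]]
second = flip_snp(first, 1)

print(second)
print(all_xor_genotypes(first) == all_xor_genotypes(second))
```

Flipping SNP 1 turns the first haplotype set of the slide into the second one, and both produce the xor-genotypes {1, 2}, {1, 3}, {2, 3}.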

XPPH

[Figure: a tree with edges labeled 1, 2, 3]
Xor-genotypes: {1, 2}, {1, 3}, {2, 3}

Genotype:    1 2 2
Haplotypes:  1 x x
             1 x x
             0 x x

SNP #1 homozygous ⇒ Can infer SNP #1 for all haplotypes
Need individuals with ∩ xor-genotypes (= {het SNPs}) = ∅
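How a single full genotype removes the bit-flipping ambiguity at its homozygous SNPs can be sketched as follows. The input is a candidate haplotype set that is correct up to flipping whole SNP columns, plus one genotyped individual; wherever that individual is homozygous, the column is flipped if needed. The function and its arguments are hypothetical and only illustrate the idea on the slide.

```python
def anchor_with_genotype(haplotypes, pair, genotype):
    """Fix the SNP columns at which the genotyped individual is homozygous.

    haplotypes: candidate resolution, correct up to flipping whole columns.
    pair:       indices of the two haplotypes carried by the individual.
    genotype:   per-SNP call, 0 or 1 if homozygous, 2 if heterozygous.
    Returns the corrected haplotypes and the 1-based SNPs that were resolved.
    """
    i, j = pair
    fixed = [list(h) for h in haplotypes]
    resolved = []
    for s, call in enumerate(genotype):
        if call == 2:
            continue  # a heterozygous site gives no information about the flip
        if fixed[i][s] != fixed[j][s]:
            raise ValueError("candidate haplotypes contradict a homozygous call")
        if fixed[i][s] != call:
            for h in fixed:
                h[s] = 1 - h[s]
        resolved.append(s + 1)
    return fixed, resolved


candidate = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# The individual with xor-genotype {2, 3} carries haplotypes 1 and 2 (0-based)
# and has genotype 1 2 2, as on the slide.
fixed, resolved = anchor_with_genotype(candidate, (1, 2), [1, 2, 2])
print(fixed, resolved)
```

Only SNP 1 is settled by this individual. SNPs 2 and 3 remain ambiguous, which is why the genotyped individuals must be chosen so that every SNP is homozygous in at least one of them.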

XPPH
Step 2: Perfect phylogeny → Haplotypes?

Theorem: ∩ xor-genotypes = ∅ ⇒ there are three xor-genotypes with empty intersection

Proof: (!) xor-genotypes are tree paths (otherwise: NP-hard)
  (1) The intersection of two tree paths is an interval
  (2) Pick X_1 arbitrarily, take X_1 ∩ X_2, X_1 ∩ X_3, …, X_1 ∩ X_n
  (3) X_L ends first, X_R begins last
      X_1 ∩ X_L ∩ X_R = ∅

[Figure: the paths X_1, X_L and X_R in the tree]

Find 3 individuals to genotype in O(nm)
Resolve the haplotypes

[Figure: the paths X_1, X_L and X_R in the tree]
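The three proof steps translate into a short selection routine. The sketch below assumes the order of the SNPs along the path X_1 is already known (for example from the realized tree), that at least three xor-genotypes are given, and that every X_1 ∩ X_i is a contiguous interval of X_1 as stated in step (1). The function name and interface are assumptions.

```python
def three_with_empty_intersection(x1_order, xor_genotypes):
    """Pick three xor-genotypes whose common intersection is empty.

    x1_order:      SNPs of X_1 = xor_genotypes[0], in order along its tree path.
    xor_genotypes: list of sets of SNPs, each a path in the same tree.
    Returns three indices, or None if all xor-genotypes share a SNP of X_1.
    """
    position = {snp: k for k, snp in enumerate(x1_order)}
    ends_first = None    # (end position, index) of the interval that ends first
    begins_last = None   # (begin position, index) of the interval that begins last
    for index in range(1, len(xor_genotypes)):
        hits = [position[s] for s in xor_genotypes[index] if s in position]
        if not hits:
            # X_1 and this path are already disjoint; any third path will do.
            third = 1 if index != 1 else 2
            return (0, index, third)
        begin, end = min(hits), max(hits)
        if ends_first is None or end < ends_first[0]:
            ends_first = (end, index)
        if begins_last is None or begin > begins_last[0]:
            begins_last = (begin, index)
    if ends_first is not None and begins_last is not None:
        if ends_first[0] < begins_last[0]:
            return (0, ends_first[1], begins_last[1])
    return None


print(three_with_empty_intersection([1, 2], [{1, 2}, {1, 3}, {2, 3}]))
print(three_with_empty_intersection([1, 2, 3], [{1, 2, 3}, {1, 2}, {2, 3}]))
```

Each xor-genotype is scanned once, which is consistent with the O(nm) bound quoted on the slide. In the second call every path contains SNP 2, so no suitable triple exists and the sketch returns None.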

Overview
  Introduction
  Xor PPH
    Theoretical outlines and results
    Experimental results
  Informative SNPs
    Theoretical results
  Summary and Future research

Informative SNPs (Bafna et al. 2003):

Input:  1. Haplotypes H = {H_1, …, H_n} on SNPs S = {s_1, …, s_m}
        2. A set of interesting SNPs S" ⊆ S
Output: Minimal set S' ⊆ S \ S" that distinguishes the same haplotypes as S"

Haplotypes 1-4 (rows) on SNPs 1-5 (columns):
  SNPs   1 2 3 4 5
         1 0 0 0 0
         0 0 1 0 0
         0 0 0 1 1
         0 1 0 1 0

Not perfect phylogeny: NP-hard (MINIMUM TEST SET)
Perfect phylogeny, 1 interesting SNP: O(nm), Bafna et al. 2003
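The definition can be made concrete with a brute-force search. This is not the algorithm of Bafna et al.: it is an exponential-time illustration that tries subsets of S \ S" in order of increasing size. It assumes "distinguishes the same haplotypes" means that every pair of haplotypes separated by some interesting SNP is also separated by some chosen SNP; the function names are hypothetical.

```python
from itertools import combinations


def distinguished_pairs(haplotypes, snps):
    """Pairs of haplotype indices that differ at one or more of the given 1-based SNPs."""
    return {
        (a, b)
        for a, b in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[a][s - 1] != haplotypes[b][s - 1] for s in snps)
    }


def minimal_informative_set(haplotypes, interesting):
    """Smallest set of non-interesting SNPs separating every pair the interesting SNPs separate."""
    num_snps = len(haplotypes[0])
    target = distinguished_pairs(haplotypes, interesting)
    candidates = [s for s in range(1, num_snps + 1) if s not in interesting]
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if target <= distinguished_pairs(haplotypes, subset):
                return set(subset)
    return None


# The haplotype matrix shown on the slide.
slide_matrix = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 1, 0],
]

best = minimal_informative_set(slide_matrix, interesting={4})
print(best)
```

With SNP 4 as the only interesting SNP, no single other SNP suffices on this matrix, and the search returns a pair of SNPs.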

Informative SNPs (generalization of previous definition):

Input:  1. Haplotypes H = {H_1, …, H_n} on SNPs S = {s_1, …, s_m}
        2. A set of interesting SNPs S" ⊆ S
        3. A perfect phylogeny for H
        4. A cost function C: S → R+
Output: S' ⊆ S \ S" with minimal cost that distinguishes the same haplotypes as S"

Haplotypes 1-4 (rows) on SNPs 1-5 (columns):
  SNPs   1 2 3 4 5
         1 0 0 0 0
         0 0 1 0 0
         0 0 0 1 1
         0 1 0 1 0

We find informative SNPs set
  Of minimal cost
  For any number of interesting SNPs
  In O(m)
By a dynamic programming algorithm that climbs up the perfect phylogeny tree

We prove that the definition of informative SNPs generalizes to a more practical definition
  Under the perfect phylogeny model, informative SNPs on genotypes and haplotypes are equivalent

Summary
  Xor-haplotyping:
    Definition
    Resolve haplotypes given xor-data and 3 genotypes in O(nm α(m,n))
    Implementation
    Experimental results
  Selection of tag SNPs:
    Generalize to
      arbitrary cost
      many interesting SNPs
    Find optimal informative SNPs set in O(m) time
    Combinatorial observation allows practical uses

Future research
  Relax the strong assumption of perfect phylogeny
  Deal with data errors and missing data
  Obtain empirical results for the theoretical work on informative SNPs
    Preliminary results show that blocks of up to 600 SNPs are distinguishable by ~20 informative SNPs

Theorem: All genotypes are distinct within a block
Proof: Assume to the contrary equivalency of two:

[Figure: Haplotype Pair 1 and Haplotype Pair 2, giving rise to Genotype 1 and Genotype 2]
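The theorem can be checked on small examples. The sketch assumes that "within a block" means the haplotypes obey a perfect phylogeny, and it reuses the six haplotypes from the Perfect Phylogeny slide earlier in the talk. The function names are illustrative.

```python
from itertools import combinations_with_replacement


def genotype(h1, h2):
    """Per-SNP genotype: 0 or 1 where the haplotypes agree, 2 where they differ."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def genotypes_all_distinct(haplotypes):
    """True if no two different haplotype pairs give the same genotype."""
    seen = set()
    for h1, h2 in combinations_with_replacement(haplotypes, 2):
        g = genotype(h1, h2)
        if g in seen:
            return False
        seen.add(g)
    return True


perfect = [
    (1, 0, 0, 0, 0),
    (1, 0, 0, 1, 0),
    (1, 0, 1, 0, 0),
    (1, 1, 0, 0, 0),
    (1, 0, 0, 1, 1),
    (0, 0, 0, 1, 0),
]
# All four gametes on two SNPs: not a perfect phylogeny.
not_perfect = [(1, 1), (0, 0), (1, 0), (0, 1)]

print(genotypes_all_distinct(perfect), genotypes_all_distinct(not_perfect))
```

In the second set the pairs (11, 00) and (10, 01) both give the genotype 2 2, which is the kind of collision the perfect phylogeny assumption rules out.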
