Giuseppe Lancia, University of Udine
The phasing of heterozygous traits: Algorithms and Complexity


Post on 11-Jan-2016


TRANSCRIPT

Page 1: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Giuseppe Lancia
University of Udine

The phasing of heterozygous traits: Algorithms and Complexity

Page 2: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

- The genomic age has allowed us to look at ourselves in a detailed, comparative way

- All humans are >99% identical at the genome level

- Small changes in a genome can make a big difference in how we look and who we are

Page 3: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

What makes us different from each other?

The answer is

POLYMORPHISMS

Page 4: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

This is true for humans

as well as for other species

Page 5: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Polymorphisms are features existing in different “flavours”, which make us all look (and be) different

Examples are eye color, blood type, hair, etc.

In fact, polymorphisms in the way we look (phenotypes) are determined by polymorphisms in our genome

Page 6: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

For a given polymorphism, say eye color, the possible forms are called alleles

We all inherit two alleles (paternal and maternal)

If they are identical: HOMOZYGOUS

If they are different: HETEROZYGOUS

Page 7: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: mother, father, and a homozygous child]

Page 8: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: two families (mother, father, child), one with a homozygous child and one with a heterozygous child. Legend: Dominant, Recessive]

Page 9: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: three families (mother, father, child), with a homozygous, a heterozygous, and a homozygous child. Legend: Dominant, Recessive]

Page 10: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: three families (mother, father, child), with a homozygous, a heterozygous, and a homozygous child. Legend: Dominant, Recessive]

Page 11: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: three families (mother, father, child) with the alleles replaced by question marks]

Page 12: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: three families (mother, father, child) with the alleles replaced by question marks]

Page 13: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Single Nucleotide Polymorphisms

Page 14: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

Single Nucleotide Polymorphism (SNP)

Page 15: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

Page 16: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

- SNPs are the predominant form of human variation

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

- Used in drug design, disease studies, forensics, evolutionary studies...

- On average, one every 1,000 bases

Page 17: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

atcggcttagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgt

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

- SNPs are the predominant form of human variation

- Used in drug design, disease studies, forensics, evolutionary studies...

- On average, one every 1,000 bases

Page 18: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

- SNPs are the predominant form of human variation

- Used in drug design, disease studies, forensics, evolutionary studies...

- On average, one every 1,000 bases

Page 19: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

HAPLOTYPE: chromosome content at SNP sites

Page 20: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

HAPLOTYPE: chromosome content at SNP sites

GENOTYPE: “union” of 2 haplotypes

{c}{g,t}

{a,c}{g,t}

{a}{g}

{a}{g,t} {a}{t}

{a,c}{g}

{a,c}{g}

Page 21: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

{a,c}{g,t}

{a}{g,t}

{c}{g,t}

{a}{g}

{a}{t}

{a,c}{g}

{a,c}{g}

CHANGE OF SYMBOLS: each SNP takes only two values in a population (bio).

Call them 0 and 1. Also, call 2 the fact that a site is heterozygous

HAPLOTYPE: string over 0, 1
GENOTYPE: string over 0, 1, 2

Page 22: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

{a,c}{g,t}

{a}{g,t}

{c}{g,t}

{a}{g}

{a}{t}

{a,c}{g}

{a,c}{g}

CHANGE OF SYMBOLS: each SNP takes only two values in a population (bio).

Call them 0 and 1. Also, call 2 the fact that a site is heterozygous

HAPLOTYPE: string over 0, 1
GENOTYPE: string over 0, 1, 2 where 0={0}, 1={1}, 2={0,1}

Page 23: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

10 11

01 10

01 00

11 11

10 00

10 00

10 10

02

22

10

12 11

20

20

CHANGE OF SYMBOLS: each SNP takes only two values in a population (bio).

Call them 0 and 1. Also, call 2 the fact that a site is heterozygous

HAPLOTYPE: string over 0, 1
GENOTYPE: string over 0, 1, 2 where 0={0}, 1={1}, 2={0,1}

Page 24: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

10 11

01 10

01 00

11 00

00 10

10 10

02

22

10

12

22

20

ALGEBRA OF HAPLOTYPES:

Homozygous sites: 0 + 0 = 0, 1 + 1 = 1

Heterozygous (ambiguous) sites: 0 + 1 = 2, 1 + 0 = 2
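The algebra above can be sketched as a small function on strings. This is only an illustration: the name `combine` and the string encoding of haplotypes are assumptions, not part of the slides.

```python
def combine(h1: str, h2: str) -> str:
    """Site-wise 'sum' of two haplotypes (strings over 0/1) into a genotype.

    Equal alleles give a homozygous site (0 or 1);
    different alleles give a heterozygous site, written 2.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


if __name__ == "__main__":
    print(combine("11101", "10000"))
```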

Page 25: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Phasing the alleles

Genotype 12202 can be phased as:

11101 + 10000
11100 + 10001
11001 + 10100
11000 + 10101

For k heterozygous (ambiguous) sites, there are 2^(k-1) possible phasings
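A minimal sketch of how the phasings of one genotype could be enumerated. The function name `phasings` is hypothetical; fixing the first heterozygous site to 0 in the first haplotype is just one way to avoid listing each unordered pair twice.

```python
from itertools import product


def phasings(g: str) -> list[tuple[str, str]]:
    """All unordered haplotype pairs (h1, h2) that resolve genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]  # no ambiguous site: a single, trivial phasing
    pairs = []
    # The first heterozygous site is fixed to 0 in h1; the remaining
    # k-1 sites are free, which gives 2^(k-1) pairs.
    for rest in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, bit in zip(het, ("0",) + rest):
            h1[i] = bit
            h2[i] = "1" if bit == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


if __name__ == "__main__":
    for h1, h2 in phasings("12202"):
        print(h1, "+", h2)
```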

Page 26: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

THE PHASING (or HAPLOTYPING) PROBLEM

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

Given the genotypes of k individuals, determine the phasings of all heterozygous sites.

This yields a set H of (at most) 2k haplotypes. H is a resolution of G.

Page 27: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

The input is GENOTYPE data

00011

11011

21221

22221

11221

INPUT: G = { 11221, 22221, 11011, 21221, 00011 }

Page 28: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

The input is GENOTYPE data

INPUT: G = { 11221, 22221, 11011, 21221, 00011 }

11221 = 11011 + 11101
22221 = 00011 + 11101
11011 = 11011 + 11011
21221 = 11011 + 01101
00011 = 00011 + 00011

OUTPUT: H = { 11011, 11101, 00011, 01101 }

Each genotype is resolved by two haplotypes

We will define some objectives for H
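Whether a set H is a resolution of G can be checked mechanically. The sketch below assumes haplotypes and genotypes are strings of equal length; the names `resolves` and `is_resolution` are made up for this example.

```python
def resolves(h1: str, h2: str, g: str) -> bool:
    """True if haplotypes h1 and h2 together give genotype g."""
    if not (len(h1) == len(h2) == len(g)):
        return False
    for a, b, c in zip(h1, h2, g):
        if c == "2":
            if a == b:  # ambiguous site needs two different alleles
                return False
        elif not (a == b == c):  # homozygous site needs two equal alleles
            return False
    return True


def is_resolution(H: set[str], G: list[str]) -> bool:
    """True if every genotype in G is resolved by some pair from H."""
    return all(any(resolves(h1, h2, g) for h1 in H for h2 in H) for g in G)


if __name__ == "__main__":
    G = ["11221", "22221", "11011", "21221", "00011"]
    H = {"11011", "11101", "00011", "01101"}
    print(is_resolution(H, G))
```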

Page 29: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

OBJECTIVES

-- without objectives/constraints, the haplotyping problem would be (mathematically) trivial

E.g., always put 0 above and 1 below:

22021 = 00001 + 11011
12022 = 10000 + 11011

-- the objectives/constraints must be “driven by biology”

Page 30: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

OBJECTIVES

1°) Clark’s inference rule

2°) Parsimony: minimize |H|

3°) Perfect Phylogeny

4°) Disease Association

Page 31: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Obj: Clark’s rule

Page 32: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Inference Rule

for a compatible pair h, g

1011001011 +    known haplotype h
********** =
1221001212      known (ambiguous) genotype g

Page 33: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Inference Rule

for a compatible pair h, g

1011001011 +    known haplotype h
1101001110 =    new (derived) haplotype h’
1221001212      known (ambiguous) genotype g

We write h + h’ = g
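The inference rule can be illustrated with a few lines of code. This is a sketch only: `infer` is a hypothetical name, and returning `None` for an incompatible pair is a choice made for this example.

```python
from typing import Optional


def infer(h: str, g: str) -> Optional[str]:
    """Return h' such that h + h' = g, or None if h is not compatible with g."""
    if len(h) != len(g):
        return None
    derived = []
    for a, c in zip(h, g):
        if c == "2":
            derived.append("1" if a == "0" else "0")  # take the other allele
        elif a == c:
            derived.append(a)  # homozygous site: same allele
        else:
            return None  # h disagrees with g on a homozygous site
    return "".join(derived)


if __name__ == "__main__":
    print(infer("1011001011", "1221001212"))
```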

Page 34: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

Page 35: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Page 36: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Bootstrap haplotypes: 0000, 1000
Genotypes: 2200, 1122

Page 37: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Bootstrap haplotypes: 0000, 1000
Genotypes: 2200, 1122
Derived: 1100

Page 38: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Bootstrap haplotypes: 0000, 1000
Genotypes: 2200, 1122
Derived: 1100, 1111   SUCCESS

Page 39: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Bootstrap haplotypes: 0000, 1000
Genotypes: 2200, 1122

Page 40: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Bootstrap haplotypes: 0000, 1000
Genotypes: 2200, 1122
Derived: 0100

Page 41: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

Bootstrap haplotypes: 0000, 1000
Genotypes: 2200, 1122
Derived: 0100   FAILURE (can’t resolve 1122)

Page 42: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g) obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible.

The number of genotypes solved depends on the order of application.

OBJ: find the order of application of the rule that leaves the fewest elements in G
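The order dependence can be seen by exhaustively trying every choice in step 3 on the small example with bootstrap haplotypes 0000, 1000 and genotypes 2200, 1122. The sketch below is a brute-force illustration with hypothetical names (`infer`, `leftovers`); it takes exponential time and is not one of the methods discussed in these slides.

```python
from typing import Optional


def infer(h: str, g: str) -> Optional[str]:
    """Clark's rule: return h' with h + h' = g, or None if incompatible."""
    out = []
    for a, c in zip(h, g):
        if c == "2":
            out.append("1" if a == "0" else "0")
        elif a == c:
            out.append(a)
        else:
            return None
    return "".join(out)


def leftovers(H: frozenset, G: frozenset) -> set:
    """Numbers of unresolved genotypes reachable over all application orders."""
    moves = []
    for g in G:
        for h in H:
            derived = infer(h, g)
            if derived is not None:
                moves.append((derived, g))
    if not moves:
        return {len(G)}  # the rule no longer applies: the run stops here
    results = set()
    for derived, g in moves:
        results |= leftovers(H | {derived}, G - {g})
    return results


if __name__ == "__main__":
    H = frozenset({"0000", "1000"})
    G = frozenset({"2200", "1122"})
    print(sorted(leftovers(H, G)))
```

One order leaves no genotype unresolved (SUCCESS) and another leaves one (FAILURE), matching the two runs shown on the earlier slides.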

Page 43: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

The problem was studied by Gusfield (ISMB 2000, and Journal of Comp. Biol., 2001)

- the problem is APX-hard

- it corresponds to finding the largest forest in a graph with haplotypes as nodes and arcs for possible derivations

- solved via an ILP of exponential size (practical for small real instances)

Page 44: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

2nd Obj: Max Parsimony

Page 45: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

- Clark conjectured that the solution (when found) uses the minimum number of haplotypes

- this is clearly false

- a solution with few haplotypes is biologically relevant (as we all descend from a small set of ancestors)

Page 46: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

011101

111111

011000

010001

010011

111111

Page 47: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

011101

111111

011000

010001

010011

111111

022

222

012

221

011111 022211

012022

012

222

Page 48: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

2nd Objective (parsimony): minimize |H|

Page 49: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

1. The problem is APX-Hard

Reduction from VERTEX-COVER

Page 50: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

Page 51: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

Columns: A B C D E *

Page 52: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

Columns: A B C D E *

Rows (one per edge): AB, BC, AE, DE, AD

Page 53: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

Columns: A B C D E *

Rows (one per edge): AB, BC, AE, DE, AD

Rows (one per node): A, B, C, D, E

Page 54: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

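[Figure: graph with nodes A, B, C, D, E]

     A B C D E *
AB   2 2 . . . .
BC   . 2 2 . . .
AE   2 . . . 2 .
DE   . . . 2 2 .
AD   2 . . 2 . .

Rows (one per node): A, B, C, D, E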

Page 55: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

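[Figure: graph with nodes A, B, C, D, E]

     A B C D E *
AB   2 2 . . . .
BC   . 2 2 . . .
AE   2 . . . 2 .
DE   . . . 2 2 .
AD   2 . . 2 . .

A    0 . . . . .
B    . 0 . . . .
C    . . 0 . . .
D    . . . 0 . .
E    . . . . 0 .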

Page 56: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

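[Figure: graph with nodes A, B, C, D, E]

     A B C D E *
AB   2 2 . . . 2
BC   . 2 2 . . 2
AE   2 . . . 2 2
DE   . . . 2 2 2
AD   2 . . 2 . 2

A    0 . . . . 0
B    . 0 . . . 0
C    . . 0 . . 0
D    . . . 0 . 0
E    . . . . 0 0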

Page 57: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

     A B C D E *
AB   2 2 1 1 1 2
BC   1 2 2 1 1 2
AE   2 1 1 1 2 2
DE   1 1 1 2 2 2
AD   2 1 1 2 1 2

A    0 1 1 1 1 0
B    1 0 1 1 1 0
C    1 1 0 1 1 0
D    1 1 1 0 1 0
E    1 1 1 1 0 0

Page 58: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

     A B C D E *
AB   2 2 1 1 1 2
BC   1 2 2 1 1 2
AE   2 1 1 1 2 2
DE   1 1 1 2 2 2
AD   2 1 1 2 1 2

A    0 1 1 1 1 0
B    1 0 1 1 1 0
C    1 1 0 1 1 0
D    1 1 1 0 1 0
E    1 1 1 1 0 0

G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes

Page 59: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

     A B C D E *
AB   2 2 1 1 1 2
BC   1 2 2 1 1 2
AE   2 1 1 1 2 2
DE   1 1 1 2 2 2
AD   2 1 1 2 1 2

A    0 1 1 1 1 0
B    1 0 1 1 1 0
C    1 1 0 1 1 0
D    1 1 1 0 1 0
E    1 1 1 1 0 0

G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes

Page 60: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

[Figure: graph with nodes A, B, C, D, E]

     A B C D E *
AB   2 2 1 1 1 2
BC   1 2 2 1 1 2
AE   2 1 1 1 2 2
DE   1 1 1 2 2 2
AD   2 1 1 2 1 2

A    0 1 1 1 1 0
B    1 0 1 1 1 0
C    1 1 0 1 1 0
D    1 1 1 0 1 0
E    1 1 1 1 0 0
A’   0 1 1 1 1 1
B’   1 0 1 1 1 1
E’   1 1 1 1 0 1

G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes
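The construction in the example can be sketched in code, covering only the easy direction (a node cover gives a set of |V| + k haplotypes). All names (`build_instance`, `haplotypes_from_cover`) are hypothetical, and the cover {A, B, E} is read off the primed rows in the example.

```python
def combine(h1: str, h2: str) -> str:
    """Site-wise sum of two haplotypes: equal alleles stay, different give 2."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def build_instance(vertices, edges):
    """Haplotypes v, v' for each node and one genotype per edge."""
    cols = list(vertices) + ["*"]

    def hap(v, star):
        # 0 in the node's own column, `star` in the * column, 1 elsewhere
        return "".join("0" if c == v else star if c == "*" else "1" for c in cols)

    base = {v: hap(v, "0") for v in vertices}
    primed = {v: hap(v, "1") for v in vertices}
    geno = {(u, v): "".join("2" if c in (u, v, "*") else "1" for c in cols)
            for u, v in edges}
    return base, primed, geno


def haplotypes_from_cover(vertices, edges, cover):
    """Build H = {v : v in V} + {v' : v in cover}; check it explains every edge."""
    base, primed, geno = build_instance(vertices, edges)
    H = set(base.values()) | {primed[v] for v in cover}
    for (u, v), g in geno.items():
        if u in cover:
            pair = (primed[u], base[v])
        elif v in cover:
            pair = (base[u], primed[v])
        else:
            raise ValueError(f"edge {u}{v} is not covered")
        assert combine(*pair) == g
    return H


if __name__ == "__main__":
    V = "ABCDE"
    E = [("A", "B"), ("B", "C"), ("A", "E"), ("D", "E"), ("A", "D")]
    print(len(haplotypes_from_cover(V, E, {"A", "B", "E"})))
```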

Page 61: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

A basic ILP formulation

Page 62: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Expand your input G in all possible ways

220 120 022

A basic ILP formulation

Page 63: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Expand your input G in all possible ways

220: 010 + 100, 000 + 110
120: 100 + 110
022: 000 + 011, 001 + 010

A basic ILP formulation

Page 64: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Variables: x_h (one per haplotype h) and y_{h1,h2} (one per pair h1, h2)

Expand your input G in all possible ways

220: 010 + 100, 000 + 110
120: 100 + 110
022: 000 + 011, 001 + 010

A basic ILP formulation
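For an instance this small, the expansion step and the parsimony objective can be illustrated by brute force instead of an ILP solver. The sketch below is only that: `expansions` and `min_haplotype_set` are hypothetical names, and trying every combination of pairs is exponential.

```python
from itertools import product


def expansions(g: str) -> list[tuple[str, str]]:
    """All unordered haplotype pairs that resolve genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for rest in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, bit in zip(het, ("0",) + rest):
            h1[i] = bit
            h2[i] = "1" if bit == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def min_haplotype_set(G: list[str]) -> set[str]:
    """Pick one pair per genotype so that the union of haplotypes is smallest."""
    best = None
    for choice in product(*(expansions(g) for g in G)):
        H = {h for pair in choice for h in pair}
        if best is None or len(H) < len(best):
            best = H
    return best


if __name__ == "__main__":
    print(sorted(min_haplotype_set(["220", "120", "022"])))
```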

Page 65: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

The resulting Integer Program (IP1):
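A sketch of how such a program is commonly written with the variables x_h and y_{h1,h2}. The notation P(g) for the set of pairs that resolve genotype g is an assumption introduced here, and this may differ in detail from IP1.

```latex
\begin{align*}
\min \quad & \sum_{h} x_h \\
\text{s.t.} \quad
  & \sum_{(h_1,h_2) \in P(g)} y_{h_1,h_2} \ge 1
      && \forall g \in G \\
  & y_{h_1,h_2} \le x_{h_1}, \quad y_{h_1,h_2} \le x_{h_2}
      && \forall g \in G,\ (h_1,h_2) \in P(g) \\
  & x_h \in \{0,1\}, \quad y_{h_1,h_2} \in \{0,1\}
\end{align*}
```

Each genotype must select at least one resolving pair, and a pair can be selected only if both of its haplotypes are counted in the objective. Since P(g) can contain exponentially many pairs, the program has exponential size.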

Page 66: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity
Page 67: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Other ILP formulations are possible, e.g. polynomial-size ILP formulations

Page 68: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity
Page 69: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

3rd Obj: Perfect Phylogeny

Page 70: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

- Parsimony does not take into account mutations/evolution of haplotypes

- parsimony is very reliable on “small” haplotype blocks

- when haplotypes are large (span several SNPs), we should consider evolutionary events and recombination

- the cleanest model for evolution is the perfect phylogeny

Page 71: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

3rd objective is based on perfect phylogeny

- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

Page 72: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

3rd objective is based on perfect phylogeny

- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

[Figure: tree with edges labeled “has 2 legs”, “has tail”, “flies”]

Page 73: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

[Figure: tree with edges labeled “has 2 legs”, “has tail”, “flies”]

But… a new species may come along so that no perfect phylogeny is possible…

Page 74: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Theorem: such a matrix has a p.p. iff there is no 4x2 minor with rows 00, 10, 01, 11

          two legs   tail   flies
Human        1        0       0
Mouse        0        1       0
Spider       0        0       0
Eagle        1        0       1

Page 75: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Theorem: such a matrix has a p.p. iff there is no 4x2 minor with rows 00, 10, 01, 11

               two legs   tail   flies
Human             1        0       0
Mouse             0        1       0
Spider            0        0       0
Eagle             1        0       1
Mickey mouse      1        1       0
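The condition in the theorem can be tested directly by looking at every pair of columns. A minimal sketch, assuming the matrix is given as a list of 0/1 tuples; the function name `has_perfect_phylogeny` is hypothetical.

```python
from itertools import combinations


def has_perfect_phylogeny(rows) -> bool:
    """True if no pair of columns contains all four rows 00, 01, 10, 11."""
    n_cols = len(rows[0])
    for i, j in combinations(range(n_cols), 2):
        seen = {(r[i], r[j]) for r in rows}
        if len(seen) == 4:  # entries are binary, so 4 distinct pairs = all of them
            return False
    return True


if __name__ == "__main__":
    # Columns: two legs, tail, flies
    species = [
        (1, 0, 0),  # Human
        (0, 1, 0),  # Mouse
        (0, 0, 0),  # Spider
        (1, 0, 1),  # Eagle
    ]
    print(has_perfect_phylogeny(species))
    print(has_perfect_phylogeny(species + [(1, 1, 0)]))  # add Mickey mouse
```

With Mickey mouse added, the columns "two legs" and "tail" contain all four rows 00, 10, 01, 11, so no perfect phylogeny exists.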

Page 76: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

Page 77: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale: we assume haplotypes have evolved independently along a tree)

0 1 2 0
2 1 0 2
2 0 2 0

Page 78: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale: we assume haplotypes have evolved independently along a tree)

0 1 2 0 = 0 1 0 0 + 0 1 1 0
2 1 0 2 = 1 1 0 1 + 0 1 0 0
2 0 2 0 = 1 0 0 0 + 0 0 1 0

Page 79: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale: we assume haplotypes have evolved independently along a tree)

0 1 2 0 = 0 1 0 0 + 0 1 1 0
2 1 0 2 = 1 1 0 1 + 0 1 0 0
2 0 2 0 = 1 0 0 0 + 0 0 1 0

NOT a perfect phylogeny solution!

Page 80: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale: we assume haplotypes have evolved independently along a tree)

0 1 2 0
0 1 0 2
0 0 0 2

Page 81: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale: we assume haplotypes have evolved independently along a tree)

0 1 2 0
0 1 0 2
0 0 0 2

0 1 0 0
0 1 1 0
0 1 0 0

1 1 0 1
0 0 0 0
0 0 0 1

A perfect phylogeny

Page 82: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Page 83: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of a combinatorial nature

- There is a graph whose nodes are the SNPs (columns) and whose edges are of two types (forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way

22 → 00 + 11 or 22 → 01 + 10

- a complex visit of the graph decides how to phase free SNPs

Page 84: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

4th Obj: Disease Association

Page 85: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes

Page 86: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

00011

02011 21221

02201

11221

INPUT: GD = {11221,21221,02011}, GH = {11221,02201,00011}

11221

Page 87: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

INPUT: GD = {11221,21221,02011}, GH = {11221,02201,00011}

Diseased:
11221 = 11011 + 11101
21221 = 11011 + 01101
02011 = 01011 + 00011

Healthy:
11221 = 11001 + 11111
02201 = 01101 + 00001
00011 = 00011 + 00011

OUTPUT: H = { 11011,01011,00001,11111,11101,00011,01101}

H contains HD, s.t. each diseased individual has >=1 haplotype in HD and each healthy individual has none
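The condition on HD can be checked with a short function. This is a sketch under stated assumptions: the name `explains_disease` is hypothetical, and the candidate HD = {11011, 01011} is chosen here for illustration only.

```python
def explains_disease(HD, diseased, healthy) -> bool:
    """True if every diseased pair has a haplotype in HD and no healthy pair does."""
    every_diseased_hit = all(any(h in HD for h in pair) for pair in diseased)
    no_healthy_hit = not any(h in HD for pair in healthy for h in pair)
    return every_diseased_hit and no_healthy_hit


if __name__ == "__main__":
    # Haplotype pairs resolving GD = {11221, 21221, 02011}
    diseased = [("11011", "11101"), ("11011", "01101"), ("01011", "00011")]
    # Haplotype pairs resolving GH = {11221, 02201, 00011}
    healthy = [("11001", "11111"), ("01101", "00001"), ("00011", "00011")]
    # Candidate set of diseased haplotypes (an assumption for this example)
    print(explains_disease({"11011", "01011"}, diseased, healthy))
```

Note that the genotype 11221 occurs in both groups: the diseased and the healthy individual differ only in how it is phased.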

Page 88: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity
Page 89: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Theorem 1 is proved via a reduction from 3-SAT

Theorem 2 has a mathematical proof (coloring argument) with little relation to biology: there is an R (depending on the input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3

This means the model must be refined!

Page 90: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity
Page 91: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Summary:

- haplotyping in silico is needed for economic reasons

- several objectives, all biologically driven

- nice combinatorial problems (mostly from the binary nature of SNPs)

- these problems are technology-dependent and may become obsolete (hopefully after we have retired)

Page 92: Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity

Thanks