giuseppe lancia university of udine the phasing of heterozygous traits: algorithms and complexity

Post on 11-Jan-2016

212 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Giuseppe LanciaUniversity of Udine

The phasing of The phasing of heterozygous heterozygous

traits: traits: Algorithms and ComplexityAlgorithms and Complexity

-The genomic age has allowed to look at ourselves in a detailed, comparative way

-All humans are >99% identical at genome level

-Small changes in a genome can make a big difference in how we look and who we are

What makes us different from each other?

The answer is

POLYMORPHISMSPOLYMORPHISMS

This is true for humans

as well as for other species

Polymorphisms are features existing in different“flavours”, that make us all look (and be) different

Examples can be eye-color, blood type, hair, etc…

In fact, polymorphisms in the way we look (phenotyes) are determined by polymorphisms in our genome

For a given polymorhism, say the eye-color, thepossible forms are called alleles

We all inherit two alleles (paternal and maternal)

identical HOMOZYGOUS

If they are

different HETEROZYGOUS

{

mother

father

childHomozygous

mother

father

childHomozygous

mother

father

childHeterozygous

Dominant Recessive

mother

father

childHomozygous

mother

father

childHeterozygous

mother

father

childHomozygous

Dominant Recessive

mother

father

childHomozygous

mother

father

childHeterozygous

mother

father

childHomozygous

Dominant Recessive

mother

father

child

mother

father

child

mother

father

child

??

??

??

??

??

??

mother

father

child

mother

father

child

mother

father

child

??

??

??

??

??

??

SingleSingle NucleotideNucleotidePolymorphismsPolymorphisms

At DNA level, a polymorphism is a sequence of nucleotidesvarying in a population.

The shortest possible sequence has only 1 nucleotide, hence

SSingle NNucleotide PPolymorphism (SNP)

At DNA level, a polymorphism is a sequence of nucleotidesvarying in a population.

The shortest possible sequence has only 1 nucleotide, hence

SSingle NNucleotide PPolymorphism (SNP)

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

- SNPs are predominant form of human variations

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

- Used for drug design, study disease, forensic, evolutionary...

- On average one every 1,000 bases

atcggcttagttagggcacaggacgtac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggattagttagggcacaggacgt

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggcttagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

- SNPs are predominant form of human variations

- Used for drug design, study disease, forensic, evolutionary...

- On average one every 1,000 bases

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

- SNPs are predominant form of human variations

- Used for drug design, study disease, forensic, evolutionary...

- On average one every 1,000 bases

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

HAPLOTYPEHAPLOTYPE: chromosome content at SNP sites

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

HAPLOTYPEHAPLOTYPE: chromosome content at SNP sites

GENOTYPEGENOTYPE: “union” of 2 haplotypes

{c}{g,t}

{a,c}{g,t}

{a}{g}

{a}{g,t} {a}{t}

{a,c}{g}

{a,c}{g}

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

{a,c}{g,t}

{a}{g,t}

{c}{g,t}

{a}{g}

{a}{t}

{a,c}{g}

{a,c}{g}

CHANGE OF SYMBOLSCHANGE OF SYMBOLS: each SNP only two values in a population (bio).

Call them 0 and 1. Also, call 2 the fact that a site is heterozygous

HAPLOTYPEHAPLOTYPE: string over 0, 1GENOTYPEGENOTYPE: string over 0, 1, 2

ag at

ct ag

ct cg

at at

ag cg

ag cg

ag ag

{a,c}{g,t}

{a}{g,t}

{c}{g,t}

{a}{g}

{a}{t}

{a,c}{g}

{a,c}{g}

CHANGE OF SYMBOLSCHANGE OF SYMBOLS: each SNP only two values in a population (bio).

Call them 0 and 1. Also, call 2 the fact that a site is heterozygous

HAPLOTYPEHAPLOTYPE: string over 0, 1GENOTYPEGENOTYPE: string over 0, 1, 2 where 0={0}, 1={1}, 2={0,1}

10 11

01 10

01 00

11 11

10 00

10 00

10 10

02

22

10

12 11

20

20

CHANGE OF SYMBOLSCHANGE OF SYMBOLS: each SNP only two values in a population (bio).

Call them 0 and 1. Also, call 2 the fact that a site is heterozygous

HAPLOTYPEHAPLOTYPE: string over 0, 1GENOTYPEGENOTYPE: string over 0, 1, 2 where 0={0}, 1={1}, 2={0,1}

10 11

01 10

01 00

11 00

00 10

10 10

02

22

10

12

22

20

0 + 0 =--- 0

1 + 1 =--- 1

0 + 1 + 1 = 0 = --- --- 2 2

ALGEBRA OF HAPLOTYPES:

Homozygous sites Heterozygous (ambiguous) sites

12202

1110110000

1110010001

1100110100

1100010101

Phasing the allelesPhasing the alleles

For k heterozygous (ambiguous) sites, there are 2k-1 possible phasings

THE PHASING (or HAPLOTYPING) PROBLEMTHE PHASING (or HAPLOTYPING) PROBLEM

Given genotypes of k individuals, determine the phasings

of all heterozygous sites.

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.

The input is GENOTYPE data

00011

11011

21221

22221

11221

INPUT: G = { 11221, 22221, 11011, 21221, 00011 }

The input is GENOTYPE data

1101111101

00011

0001111101

1101101101

1101111011

0001100011

11011

21221

22221

11221

OUTPUT: H = { 11011, 11101, 00011, 01101}

INPUT: G = { 11221, 22221, 11011, 21221, 00011 }

Each genotype is resolved by two haplotypes

We will define some objectives for H

--without objectives/constraints, the haplotyping problem would be (mathematically)trivial

OBJECTIVES

22021 00001 11011

E.g., always put 0 above and 1 below

12022 10000 11011

--the objectives/constraints must be “driven by biology”

2°) 2°) (parsimony): minimize |H|

1°) 1°) Clark’s inference rule

3°) Perfect Phylogeny3°) Perfect Phylogeny

4°) Disease Association4°) Disease Association

OBJECTIVES

Obj: Clark’s ruleObj: Clark’s rule

1st1st

1011001011 +********** =1221001212

known haplotype h

known (ambiguos) genotype g

Inference RuleInference Rule

for a compatible pair h , g

1011001011 +1101001110 =1221001212

known haplotype h

known (ambiguos) genotype g

Inference RuleInference Rule

for a compatible pair h , g

new (derived) haplotype h’

We write h + h’ = g

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

0000100022001122

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

0000100022001122

1100

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000100022001122

1100 1111 SUCCESS

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

0000100022001122

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

0000100022001122

0100

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic

0000100022001122

0100 FAILURE (can’t resolve 1122 )

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

1. Start with H = “bootstrap” haplotypes2. while Clark’s rule applies to a pair (h, g) in H x G3. apply the rule to any such (h, g) obtaining h’4. set H = H + {h’} and G = G - {g}5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic: the algorithm could end without explainingall genotypes even if an explanation was possible.

The number of genotypes solved depends on order of application.

1st Objective (Clark, 1990)1st Objective (Clark, 1990)

OBJ: find order of application rule that leaves the fewest elements in GOBJ: find order of application rule that leaves the fewest elements in G

The problem was studied by Gusfield(ISMB 2000, and Journal of Comp. Biol., 2001)

- problem is APX-hard

- it corresponds to finding largest forest in a graph with haplotypes as nodes and arcs for possible derivations

-solved via ILP of exponential-size (practical for small real instances)

Obj: Max ParsimonyObj: Max Parsimony

2nd2nd

- Clark conjectured solution (when found) uses min # of haplotypes

- this is clearly false

- solution with few haplotypes is biologically relevant (as we all descend from a small set of ancestors)

011101

111111

011000

010001

010011

111111

011101

111111

011000

010001

010011

111111

022

222

012

221

011111 022211

012022

012

222

minimize |H|

2nd Objective (parsimony)2nd Objective (parsimony) :

1. The problem is APX-Hard1. The problem is APX-Hard

Reduction from VERTEX-COVER

A

B

C

D E

A

B

C

D E

A B C D E *

A

B

C

D E

A B C D E *

AB BC AE DE AD

A

B

C

D E

A B C D E *

AB BC AE DE AD

A B C D E

A

B

C

D E

A B C D E *

AB 2 2BC 2 2AE 2 2DE 2 2AD 2 2

ABCDE

A

B

C

D E

A B C D E *

AB 2 2BC 2 2AE 2 2DE 2 2AD 2 2

A 0B 0C 0D 0E 0

A

B

C

D E

A B C D E *

AB 2 2 2 BC 2 2 2 AE 2 2 2 DE 2 2 2 AD 2 2 2

A 0 0 B 0 0C 0 0 D 0 0 E 0 0

A

B

C

D E

A B C D E *

AB 2 2 1 1 1 2BC 1 2 2 1 1 2AE 2 1 1 1 2 2DE 1 1 1 2 2 2 AD 2 1 1 2 1 2

A 0 1 1 1 1 0 B 1 0 1 1 1 0C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0

A

B

C

D E

A B C D E *

AB 2 2 1 1 1 2BC 1 2 2 1 1 2AE 2 1 1 1 2 2DE 1 1 1 2 2 2 AD 2 1 1 2 1 2

A 0 1 1 1 1 0 B 1 0 1 1 1 0C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes

A

B

C

D E

A B C D E *

AB 2 2 1 1 1 2BC 1 2 2 1 1 2AE 2 1 1 1 2 2DE 1 1 1 2 2 2 AD 2 1 1 2 1 2

A 0 1 1 1 1 0 B 1 0 1 1 1 0C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes

A

B

C

D E

A B C D E *

AB 2 2 1 1 1 2BC 1 2 2 1 1 2AE 2 1 1 1 2 2DE 1 1 1 2 2 2 AD 2 1 1 2 1 2

A 0 1 1 1 1 0 B 1 0 1 1 1 0C 1 1 0 1 1 0 D 1 1 1 0 1 0 E 1 1 1 1 0 0 A’ 0 1 1 1 1 1B’ 1 0 1 1 1 1E’ 1 1 1 1 0 1

G = (V,E) has a node cover X of size k there is a set H of |V | + k haplotypes that explain all genotypes

A basic ILP formulation

Expand your input G in all possible ways

220 120 022

A basic ILP formulation

Expand your input G in all possible ways

010 + 100, 000 + 110100 + 110 000 + 011, 001 + 010

220 120 022

A basic ILP formulation

hx

21,hh

hx

yhh 21 ,

Expand your input G in all possible ways

010 + 100, 000 + 110100 + 110 000 + 011, 001 + 010

220 120 022

A basic ILP formulation

The resulting Integer Program (IP1):

Other ILP formulation are possible. E.g. POLY-SIZE ILP formulations

Obj: Perfect PhylogenyObj: Perfect Phylogeny

3rd3rd

- Parsimony does not take into account mutations/evolution of haplotypes

- parsimony is very relialable on “small” haplotype blocks

- when haplotypes are large (span several SNPs, we should consider evolutionionary events and recombination)

- the cleanest model for evolution is the perfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

3rd objective is based on perfect phylogenyperfect phylogeny

- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

3rd objective is based on perfect phylogenyperfect phylogeny

has tailflies

- A phylogeny expalains set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

has 2 legs

But…a new species may come along so that noPerfect phylogeny is possible…

has tailflies

TheoremTheorem: such matrix has p.p. iff there is not a 00 4x2 minor 10 01 11

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

two legs

tail

flies

TheoremTheorem: such matrix has p.p. iff there is not a 00 4x2 minor 10 01 11

Human 1 0 0

Mouse 0 1 0

Spider 0 0 0

Eagle 1 0 1

Mickey mouse 1 1 0

two legs

tail

flies

We can consider each SNP as a binary feature

Objective:Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

We can consider each SNP as a binary feature

Objective:Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

0 1 2 02 1 0 22 0 2 0

We can consider each SNP as a binary feature

Objective:Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

0 1 0 00 1 1 01 1 0 10 1 0 01 0 0 00 0 1 0

0 1 2 02 1 0 22 0 2 0

We can consider each SNP as a binary feature

Objective:Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

0 1 2 02 1 0 22 0 2 0

0 1 0 00 1 1 01 1 0 10 1 0 0 1 0 0 00 0 1 0

NOT a perfect phylogeny solution !

We can consider each SNP as a binary feature

Objective:Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

0 1 2 0 0 1 0 20 0 0 2

We can consider each SNP as a binary feature

Objective:Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)

0 1 2 0 0 1 0 20 0 0 2

0 1 0 0 0 1 1 00 1 0 0

1 1 0 1 0 0 0 00 0 0 1

A perfect phylogeny

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of combinatorial nature

- There is a graph for which SNPs are columns and edges are of two types (forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way

22 00 + 11 or 22 01 + 10

- a complex visit of the graph decides how to phase free SNPs

Obj: Disease AssociationObj: Disease Association

4th4th

Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes

00011

02011 21221

02201

11221

INPUT: GD = {11221,21221,02011}, GH = {11221,02201,00011}

11221

1101111101

00011

0110100001

1101101101

0101100011

0001100011

02011 21221

02201

11221

OUTPUT: H = { 11011,01011,00001,11111,11101,00011,01101}

H contains HD, s.t. each diseased has >=1 haplotype in HD and each healty none

INPUT: GD = {11221,21221,02011}, GH = {11221,02201,00011}

1100111111

11221

Theorem 1 is proved via a reduction from 3 SAT

Theorem 2 has a mathematical proof (coloring argument) with little relation to biology:There is R (depending on input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3

This means the model must be refined!

Summary:

- haplotyping in-silico needed for economical reasons

- several objectives, all biologically driven

- nice combinatorial problems (mostly from binary nature of SNPs)

- these problems are technology-dependant and may become obsolete (hopefully after we have retired)

ThanksThanks

top related