The Phasing of Heterozygous Traits: Algorithms and Complexity

Giuseppe Lancia
University of Udine

- The genomic age has allowed us to look at ourselves in a detailed, comparative way

- All humans are >99% identical at the genome level

- Small changes in a genome can make a big difference in how we look and who we are

What makes us different from each other?

The answer is

POLYMORPHISMS

This is true for humans

as well as for other species

Polymorphisms are features existing in different “flavours”, that make us all look (and be) different

Examples can be eye color, blood type, hair, etc…

In fact, polymorphisms in the way we look (phenotypes) are determined by polymorphisms in our genome

For a given polymorphism, say the eye color, the possible forms are called alleles

We all inherit two alleles (paternal and maternal)

If they are identical: HOMOZYGOUS

If they are different: HETEROZYGOUS

[Figure: inheritance diagrams in which mother and father each pass one allele to the child, giving homozygous or heterozygous children; alleles are marked as dominant or recessive]

Single Nucleotide Polymorphisms

At the DNA level, a polymorphism is a sequence of nucleotides varying in a population.

The shortest possible sequence has only 1 nucleotide, hence

Single Nucleotide Polymorphism (SNP)

atcggattagttagggcacaggacggac

atcggattagttagggcacaggacgtac

atcggcttagttagggcacaggacgtac

atcggcttagttagggcacaggacggac
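
As a small illustration, the SNP sites in the four sequences above can be located by comparing them column by column. This is only a sketch: the helper name `snp_sites` is hypothetical, and it assumes the sequences are already aligned and of equal length.

```python
def snp_sites(sequences):
    """Return the 1-based positions at which aligned sequences differ."""
    length = len(sequences[0])
    assert all(len(s) == length for s in sequences), "sequences must be aligned"
    return [i + 1 for i in range(length)
            if len({s[i] for s in sequences}) > 1]


# The four sequences shown on the slide.
seqs = [
    "atcggattagttagggcacaggacggac",
    "atcggattagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacgtac",
    "atcggcttagttagggcacaggacggac",
]

sites = snp_sites(seqs)
# Alleles observed at each SNP site.
alleles = {p: sorted({s[p - 1] for s in seqs}) for p in sites}
print(sites, alleles)
```

On these sequences the two varying positions carry the alleles a/c and g/t, which are the values used in the genotype examples that follow.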

- SNPs are the predominant form of human variation

- Used for drug design, the study of disease, forensics, evolutionary studies...

- On average, one every 1,000 bases

atcggattagttagggcacaggacgt

HAPLOTYPE: chromosome content at SNP sites

GENOTYPE: “union” of 2 haplotypes

Example genotypes: {a,c}{g,t}  {a}{g,t}  {c}{g,t}  {a}{g}  {a}{t}  {a,c}{g}

CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact).

Call them 0 and 1. Also, call 2 the fact that a site is heterozygous

HAPLOTYPE: string over 0, 1
GENOTYPE: string over 0, 1, 2, where 0={0}, 1={1}, 2={0,1}

ALGEBRA OF HAPLOTYPES:

Homozygous sites: 0 + 0 = 0, 1 + 1 = 1

Heterozygous (ambiguous) sites: 0 + 1 = 2, 1 + 0 = 2
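
The algebra can be written as a one-line site-by-site rule. The sketch below assumes haplotypes and genotypes are represented as Python strings over the symbols used on the slides; the function name `combine` is hypothetical.

```python
def combine(h1, h2):
    """Genotype of two haplotypes: equal alleles are kept, different ones become '2'."""
    assert len(h1) == len(h2), "haplotypes must have the same length"
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(combine("11101", "10000"))
```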

11101 + 10000
11100 + 10001
11001 + 10100
11000 + 10101

(four different pairs, all with the same sum 12202)

Phasing the alleles

For k heterozygous (ambiguous) sites, there are 2^(k-1) possible phasings
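
The count can be checked by enumeration. The following sketch (with a hypothetical helper `phasings`, and genotypes as strings over 0, 1, 2) fixes the first heterozygous site to 0 in the first haplotype, so that each unordered pair is generated exactly once.

```python
from itertools import product


def phasings(genotype):
    """Yield every unordered haplotype pair (h1, h2) with h1 + h2 = genotype."""
    het = [i for i, c in enumerate(genotype) if c == "2"]
    if not het:
        # No ambiguous site: the only phasing is two copies of the genotype.
        yield genotype, genotype
        return
    # The first heterozygous site is fixed to 0 in h1 to avoid mirrored pairs.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


pairs = list(phasings("12202"))
for h1, h2 in pairs:
    print(h1, "+", h2)
```

For a genotype with k = 3 ambiguous sites, such as 12202, this yields 2^(3-1) = 4 pairs.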

THE PHASING (or HAPLOTYPING) PROBLEM

Given genotypes of k individuals, determine the phasings of all heterozygous sites.

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

This yields a set H of (at most) 2k haplotypes. H is a resolution of G.

The input is GENOTYPE data

INPUT: G = { 11221, 22221, 11011, 21221, 00011 }

11221 = 11011 + 11101
22221 = 00011 + 11101
11011 = 11011 + 11011
21221 = 11011 + 01101
00011 = 00011 + 00011

OUTPUT: H = { 11011, 11101, 00011, 01101 }

Each genotype is resolved by two haplotypes
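
Whether a candidate set H is a resolution of G can be checked mechanically. This is a minimal sketch, assuming string-encoded haplotypes and genotypes; the names `combine` and `resolves` are hypothetical.

```python
def combine(h1, h2):
    """Genotype of two haplotypes: equal alleles are kept, different ones become '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def resolves(H, G):
    """True if every genotype in G is the sum of two (possibly equal) haplotypes of H."""
    return all(any(combine(a, b) == g for a in H for b in H) for g in G)


G = ["11221", "22221", "11011", "21221", "00011"]
H = ["11011", "11101", "00011", "01101"]

print(resolves(H, G))       # all five genotypes are explained
print(resolves(H[:3], G))   # without 01101, genotype 21221 is not explained
```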

OBJECTIVES

We will define some objectives for H

- without objectives/constraints, the haplotyping problem would be (mathematically) trivial

E.g., always put 0 above and 1 below at each heterozygous site:

22021 = 00001 + 11011
12022 = 10000 + 11011

- the objectives/constraints must be “driven by biology”

OBJECTIVES

1°) Clark’s inference rule

2°) Parsimony: minimize |H|

3°) Perfect Phylogeny

4°) Disease Association

1st Obj: Clark’s rule

Inference Rule, for a compatible pair h, g:

  1011001011    known haplotype h
+ 1101001110    new (derived) haplotype h’
= 1221001212    known (ambiguous) genotype g

We write h + h’ = g
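
The inference rule can be sketched as a small function. The names `compatible` and `derive` are hypothetical; the assumption is that h is compatible with g when it agrees with g at every homozygous site.

```python
def compatible(h, g):
    """h is compatible with g if it matches g at every non-ambiguous site."""
    return len(h) == len(g) and all(gc == "2" or gc == hc for hc, gc in zip(h, g))


def derive(h, g):
    """Return the haplotype h' with h + h' = g, or None if h and g are incompatible."""
    if not compatible(h, g):
        return None
    flip = {"0": "1", "1": "0"}
    # At ambiguous sites h' takes the opposite allele, elsewhere it copies g.
    return "".join(flip[hc] if gc == "2" else gc for hc, gc in zip(h, g))


print(derive("1011001011", "1221001212"))
```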

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.     apply the rule to any such (h, g) obtaining h’
4.     set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible.

Example with bootstrap haplotypes 0000, 1000 and genotypes 2200, 1122:

0000 + 1100 = 2200, then 1100 + 1111 = 1122    SUCCESS

1000 + 0100 = 2200    FAILURE (can’t resolve 1122)

The number of genotypes solved depends on order of application.
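
The dependence on the order of application can be reproduced with a direct sketch of the loop. This is an illustration under assumptions, not Clark's original implementation: `derive`, `clark` and the `choose` callback (which picks one applicable pair) are hypothetical names, and the example uses the haplotypes 0000, 1000 and the genotypes 2200, 1122 from the slide.

```python
def derive(h, g):
    """Return h' with h + h' = g, or None if h disagrees with g at a homozygous site."""
    if any(gc != "2" and gc != hc for hc, gc in zip(h, g)):
        return None
    flip = {"0": "1", "1": "0"}
    return "".join(flip[hc] if gc == "2" else gc for hc, gc in zip(h, g))


def clark(H, G, choose):
    """Apply the inference rule until no pair (h, g) is applicable.

    Returns the final haplotype list and the genotypes left unresolved.
    """
    H, G = list(H), list(G)
    while True:
        options = [(h, g) for h in H for g in G if derive(h, g) is not None]
        if not options:
            return H, G  # an empty G means SUCCESS
        h, g = choose(options)
        new_h = derive(h, g)
        if new_h not in H:
            H.append(new_h)
        G.remove(g)


H0 = ["0000", "1000"]
G0 = ["2200", "1122"]

first = clark(H0, G0, lambda opts: opts[0])    # starts from (0000, 2200)
last = clark(H0, G0, lambda opts: opts[-1])    # starts from (1000, 2200)
print(first)
print(last)
```

The first run resolves both genotypes, while the second gets stuck with 1122 unresolved, although the input is the same.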

OBJ: find the order of rule applications that leaves the fewest elements in G

The problem was studied by Gusfield (ISMB 2000, and Journal of Comp. Biol., 2001)

- the problem is APX-hard

- it corresponds to finding the largest forest in a graph with haplotypes as nodes and arcs for possible derivations

- solved via an exponential-size ILP (practical for small real instances)

2nd Obj: Max Parsimony

- Clark conjectured that the solution (when found) uses the min # of haplotypes

- this is clearly false

- a solution with few haplotypes is biologically relevant (as we all descend from a small set of ancestors)

[Figure: example genotypes and candidate resolving haplotypes]

2nd Objective (parsimony): minimize |H|

1. The problem is APX-Hard

Reduction from VERTEX-COVER

Graph with nodes A, B, C, D, E and edges AB, BC, AE, DE, AD. One genotype row per edge, one haplotype row per node; * is an extra column:

       A B C D E *
AB     2 2 1 1 1 2
BC     1 2 2 1 1 2
AE     2 1 1 1 2 2
DE     1 1 1 2 2 2
AD     2 1 1 2 1 2

A      0 1 1 1 1 0
B      1 0 1 1 1 0
C      1 1 0 1 1 0
D      1 1 1 0 1 0
E      1 1 1 1 0 0

G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes

Example: the node cover X = {A, B, E} (k = 3) adds the haplotypes A’, B’, E’:

       A B C D E *
A’     0 1 1 1 1 1
B’     1 0 1 1 1 1
E’     1 1 1 1 0 1

Together with the node haplotypes A to E, these explain all edge genotypes, e.g. AB = A + B’.
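
The construction can be checked on the example graph with a short sketch. All function names here (`reduction`, `cover_haplotypes`, `explains`) are hypothetical, and the code only verifies one direction of the statement on this single instance: a node cover yields |V| + k haplotypes that explain every edge genotype.

```python
def combine(h1, h2):
    """Genotype of two haplotypes: equal alleles are kept, different ones become '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def reduction(nodes, edges):
    """Build one haplotype per node and one genotype per edge; last column is '*'."""
    idx = {v: i for i, v in enumerate(nodes)}
    n = len(nodes)
    node_hap = {v: "".join("0" if i == idx[v] else "1" for i in range(n)) + "0"
                for v in nodes}
    edge_gen = {u + v: "".join("2" if i in (idx[u], idx[v]) else "1"
                               for i in range(n)) + "2"
                for u, v in edges}
    return node_hap, edge_gen


def cover_haplotypes(node_hap, cover):
    """For each cover node v, the primed haplotype v' has the '*' bit set to 1."""
    return {v + "'": node_hap[v][:-1] + "1" for v in cover}


def explains(H, genotypes):
    """True if every genotype is the sum of two haplotypes of H."""
    return all(any(combine(a, b) == g for a in H for b in H) for g in genotypes)


nodes = "ABCDE"
edges = [("A", "B"), ("B", "C"), ("A", "E"), ("D", "E"), ("A", "D")]
node_hap, edge_gen = reduction(nodes, edges)

H = set(node_hap.values()) | set(cover_haplotypes(node_hap, "ABE").values())
print(len(H), explains(H, edge_gen.values()))

# {A, B} is not a node cover (edge DE is uncovered), so DE stays unexplained.
H_bad = set(node_hap.values()) | set(cover_haplotypes(node_hap, "AB").values())
print(explains(H_bad, edge_gen.values()))
```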

A basic ILP formulation

Expand your input G in all possible ways

220: 010 + 100, 000 + 110
120: 100 + 110
022: 000 + 011, 001 + 010

Variable y_{h1,h2} for each pair h1 + h2 above

The resulting Integer Program (IP1):

Other ILP formulations are possible, e.g. POLY-SIZE ILP formulations
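
The expansion step can be sketched in a few lines. Note that the code below does not build or solve the integer program: as a stand-in it uses brute force over one phasing per genotype, which is only feasible for tiny inputs such as G = {220, 120, 022}. The names `phasings` and `min_haplotype_set` are hypothetical.

```python
from itertools import product


def phasings(genotype):
    """Yield every unordered haplotype pair (h1, h2) with h1 + h2 = genotype."""
    het = [i for i, c in enumerate(genotype) if c == "2"]
    if not het:
        yield genotype, genotype
        return
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


def min_haplotype_set(genotypes):
    """Brute force: try every choice of one phasing per genotype, keep the smallest H."""
    options = [list(phasings(g)) for g in genotypes]
    best = None
    for choice in product(*options):
        used = {h for pair in choice for h in pair}
        if best is None or len(used) < len(best):
            best = used
    return best


G = ["220", "120", "022"]
expanded = {g: list(phasings(g)) for g in G}
best = min_haplotype_set(G)
print(expanded)
print(sorted(best))
```

The number of choices grows exponentially with the number of ambiguous sites, which is why an ILP over the expanded pairs is used in practice.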

3rd Obj: Perfect Phylogeny

- Parsimony does not take into account mutations/evolution of haplotypes

- parsimony is very reliable on “small” haplotype blocks

- when haplotypes are large (span several SNPs), we should consider evolutionary events and recombination

- the cleanest model for evolution is the perfect phylogeny

- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

3rd objective is based on perfect phylogeny

[Figure: phylogenetic tree whose edges are labeled with the features “has 2 legs”, “has tail”, “flies”]

But… a new species may come along so that no perfect phylogeny is possible…

Theorem: such a matrix has a perfect phylogeny iff it has no 4x2 minor with rows 00, 10, 01, 11

                two legs   tail   flies
Human              1        0       0
Mouse              0        1       0
Spider             0        0       0
Eagle              1        0       1
Mickey mouse       1        1       0     (new species: no perfect phylogeny is possible)
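
The condition in the theorem can be tested directly by looking at every pair of columns. This is a minimal sketch of that check as stated on the slide; the function name `has_perfect_phylogeny` and the tuple encoding of the rows are assumptions made for illustration.

```python
from itertools import combinations


def has_perfect_phylogeny(rows):
    """Forbidden-minor test: no pair of columns may show all of 00, 01, 10, 11."""
    n_cols = len(rows[0])
    for i, j in combinations(range(n_cols), 2):
        if len({(r[i], r[j]) for r in rows}) == 4:
            return False
    return True


# Columns: two legs, tail, flies (values as on the slide).
species = {
    "Human": (1, 0, 0),
    "Mouse": (0, 1, 0),
    "Spider": (0, 0, 0),
    "Eagle": (1, 0, 1),
}
print(has_perfect_phylogeny(list(species.values())))

# Adding Mickey mouse creates the forbidden minor in the first two columns.
species["Mickey mouse"] = (1, 1, 0)
print(has_perfect_phylogeny(list(species.values())))
```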

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale: we assume haplotypes have evolved independently along a tree)

0120 = 0100 + 0110
2102 = 1101 + 0100
2020 = 1000 + 0010

NOT a perfect phylogeny solution!

[Figure: genotypes resolved by a set of haplotypes that admits a perfect phylogeny]

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of combinatorial nature

- There is a graph whose nodes are the SNPs (columns) and whose edges are of two types (forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way

22 = 00 + 11 or 22 = 01 + 10

- a complex visit of the graph decides how to phase free SNPs

4th Obj: Disease Association

Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes

INPUT: GD = {11221, 21221, 02011} (diseased), GH = {11221, 02201, 00011} (healthy)

Diseased:
11221 = 11011 + 11101
21221 = 11011 + 01101
02011 = 01011 + 00011

Healthy:
11221 = 11001 + 11111
02201 = 01101 + 00001
00011 = 00011 + 00011

OUTPUT: H = { 11011, 01011, 00001, 11111, 11101, 00011, 01101, 11001 }

H contains HD, s.t. each diseased individual has >= 1 haplotype in HD and each healthy individual has none
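
Once the individuals are phased, the condition on HD is easy to check. The sketch below is an illustration with a hypothetical helper `disease_haplotypes`; the haplotype pairs are read off the slide's example, and which of the two phasings of 11221 belongs to the diseased individual is an assumption (it is the only assignment for which the check succeeds).

```python
def disease_haplotypes(diseased, healthy):
    """Return the largest valid HD for phased individuals, or None if none exists.

    A haplotype may be in HD only if no healthy individual carries it;
    every diseased individual must carry at least one haplotype of HD.
    """
    healthy_haps = {h for pair in healthy for h in pair}
    hd = {h for pair in diseased for h in pair} - healthy_haps
    if all(any(h in hd for h in pair) for pair in diseased):
        return hd
    return None


diseased = [("11011", "11101"), ("11011", "01101"), ("01011", "00011")]
healthy = [("11001", "11111"), ("01101", "00001"), ("00011", "00011")]
hd = disease_haplotypes(diseased, healthy)
print(sorted(hd))

# Swapping the two phasings of 11221 makes the instance infeasible.
swapped = disease_haplotypes(
    [("11001", "11111"), ("11011", "01101"), ("01011", "00011")],
    [("11011", "11101"), ("01101", "00001"), ("00011", "00011")],
)
print(swapped)
```

The hard part of the problem is choosing the phasings, not this final check: two individuals with the same genotype 11221 end up on different sides only because their haplotypes differ.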

Theorem 1 is proved via a reduction from 3-SAT

Theorem 2 has a mathematical proof (coloring argument) with little relation to biology: there is R (depending on the input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3

This means the model must be refined!

Summary:

- in silico haplotyping is needed for economic reasons

- several objectives, all biologically driven

- nice combinatorial problems (mostly from the binary nature of SNPs)

- these problems are technology-dependent and may become obsolete (hopefully after we have retired)

Thanks
