Post on 15-Jan-2016

Page 1: PH.D candidate: Lan Liu

Some Algorithmic Problems Concerning the Inference and Analysis of TagSNPs, Haplotypes and Pedigrees

PH.D candidate: Lan Liu

Advisor: Tao Jiang

Page 2: PH.D candidate: Lan Liu

Outline

- The haplotype inference problem
- The tagSNP selection problem
- The minimum common integer partition problem

Page 3: PH.D candidate: Lan Liu

Outline

- The haplotype inference problem
  - Biological background
  - Approximation and complexity of MRHC
  - Efficient algorithms for ZRHC
  - A linear-time algorithm for loop-free ZRHC
- The tagSNP selection problem
- The minimum common integer partition problem

Page 4: PH.D candidate: Lan Liu

Introduction: Basic concepts

Example: Mendelian experiment

[Figure: a two-locus example labeling a locus, a genotype, and its paternal and maternal haplotypes; it marks homozygous and heterozygous loci, with PS value = 1 for the phase "2 1" and PS value = 0 for the phase "1 2".]

Mendelian Law: one haplotype comes from the mother and the other comes from the father.

Page 5: PH.D candidate: Lan Liu

Notations and Recombinant

[Figure: a genotype and one of its haplotype configurations, followed by two father-mother-child examples: one in which the child's haplotype is inherited with 0 recombinants and one in which it is inherited with 1 recombinant.]

Page 6: PH.D candidate: Lan Liu

Pedigree

An example: British Royal Family

[Figure: pedigree of the British Royal Family, with members Elizabeth II of the United Kingdom; Prince Philip, Duke of Edinburgh; Prince Charles, Prince of Wales; Diana, Princess of Wales; Camilla, Duchess of Cornwall; Princess Anne, Princess Royal; Captain Mark Phillips; Commander Timothy Laurence; Prince Andrew, Duke of York; Sarah Margaret Ferguson; Prince Edward, Earl of Wessex; Sophie Rhys-Jones; Prince William of Wales; Prince Henry of Wales; Peter Phillips; Zara Phillips; Princess Beatrice of York; Princess Eugenie of York; Lady Louise Windsor.]

A mating loop: a cycle inside the pedigree.

Page 7: PH.D candidate: Lan Liu

Haplotype Reconstruction

- Haplotype: useful, expensive
- Genotype: cheaper to obtain

[Figure: genotypes of two individuals labeled M and C, each heterozygous "1 2" at two loci, with two candidate haplotype configurations, (a) and (b).]

Reconstruct haplotypes from genotypes

Page 8: PH.D candidate: Lan Liu

Problem Definitions

MRHC: Given a pedigree and the genotype information for each member, find a haplotype configuration for each member that obeys the Mendelian law and minimizes the number of recombinants.

ZRHC: zero-recombinant

Loop-free-ZRHC: zero recombinant, pedigree with no mating loops
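To make the objective concrete, the following is a minimal sketch of one way to count recombinants for a single transmitted haplotype. It assumes haplotypes are given as equal-length strings of alleles; the function name `min_recombinants` is hypothetical and not from the talk.

```python
def min_recombinants(parent_hap_a, parent_hap_b, child_hap):
    """Fewest switches between the two parental haplotypes needed to copy
    child_hap locus by locus. Returns None if some locus of child_hap
    matches neither parental haplotype (a Mendelian inconsistency)."""
    INF = float("inf")
    # cost[s]: fewest switches so far if the current locus is copied from strand s
    cost = [0, 0]
    for a, b, c in zip(parent_hap_a, parent_hap_b, child_hap):
        matches = [a == c, b == c]
        new_cost = [INF, INF]
        for s in (0, 1):
            if matches[s]:
                # either stay on strand s, or switch over from the other strand
                new_cost[s] = min(cost[s], cost[1 - s] + 1)
        cost = new_cost
    best = min(cost)
    return None if best == INF else best
```

With parental haplotypes 1111 and 2222, a transmitted haplotype 1111 needs no switch, while 1122 needs one, which matches the 0-recombinant and 1-recombinant examples on the earlier slide.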

Page 10: PH.D candidate: Lan Liu

Approximation and Complexity of MRHC

The known hardness results for MRHC

- 2-locus-MRHC: NP-hard [LJ03]
- Tree-MRHC with bounded #members: P [LJ03]
- Tree-MRHC with bounded #loci: P [DLJ03]
- Tree-MRHC: NP-hard [DLJ03]

2-locus-MRHC: 2 loci
Tree-MRHC: pedigree having no mating loops

Page 11: PH.D candidate: Lan Liu

Our Hardness and Approximation Results

Hardness
- Binary-tree-MRHC: NP-hard

Lower bound of approximation ratio
- 2-locus-MRHC*: any f(n), assuming P ≠ NP. The lower bound holds for 2-locus-MRHC*(4,1).
- Binary-tree-MRHC*: any f(n), assuming P ≠ NP. The lower bound holds for Binary-tree-MRHC*(1,1).
- 2-locus-MRHC: any constant, assuming P ≠ NP and the Unique Games Conjecture [Khot02]. The lower bound holds for 2-locus-MRHC(16,15).
- Tree-MRHC: any constant, assuming P ≠ NP and the Unique Games Conjecture. The lower bound holds for Tree-MRHC(1,u) and Tree-MRHC(u,1).

Upper bound of approximation ratio
- 2-locus-MRHC: $O(\sqrt{\log n})$

Definitions
- Tree-MRHC: no mating loop
- Binary-tree-MRHC: 1 mate, 1 child
- Binary-tree-MRHC*: 1 mate, 1 child, missing data
- 2-locus-MRHC: 2 loci
- 2-locus-MRHC*: 2 loci with missing data

Page 13: PH.D candidate: Lan Liu

The ZRHC problem

Problem definition: Given a pedigree and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Page 14: PH.D candidate: Lan Liu

Previous work

- Li and Jiang introduced a system of linear equations over $\mathbb{F}_2$ and presented an $O(m^3 n^3)$ time algorithm for ZRHC [LJ03], where m is #loci and n is #members in the pedigree.
- Recently, Chan et al. proposed a linear-time algorithm [CCC+06], which only works for pedigrees without mating loops.
- Methods based on fast matrix multiplication could achieve an asymptotic running time of $O(k^{2.376})$ on k equations with k unknowns.
- The Lanczos and conjugate gradient algorithms are only heuristics [GV96]. The Wiedemann algorithm has expected quadratic running time [W86].

Page 15: PH.D candidate: Lan Liu

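Our Result

We present a much faster algorithm for ZRHC with running time $O(mn^2 + n^3 \log^2 n \log\log n)$.

[Figure: the linear system Ax=b is reduced in two steps. Transformation: from O(mn) equations on O(mn) unknowns to O(mn) equations on O(n) unknowns. Redundancy elimination: from O(mn) equations to $O(n \log^2 n \log\log n)$ equations on O(n) unknowns.]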

Page 16: PH.D candidate: Lan Liu

The New Linear System

m: #loci
n: #members in pedigree

Unknowns
- $p_j$: the paternal haplotype vector of a member j.
- $h_{j_1,j}$: the scalar encoding the inheritance information between a parent $j_1$ and a child j.

Page 17: PH.D candidate: Lan Liu

The New Linear System

[Figure: a father $j_1$, a mother $j_2$, and their child j, each with 4 loci. Member j has haplotypes $p_j$ and $p_j + w_j$ (likewise for $j_1$ and $j_2$), and the two parent-child edges carry the variables $h_{j_1,j}$ and $h_{j_2,j}$. The example highlights $p_{j_1,2} = 1$ and $p_{j_1,3} = 0$.]

Page 18: PH.D candidate: Lan Liu

The Linear System

O(mn) equations on O(mn) unknowns.

Given a homozygous locus i on a member j (with a child j1), pj[i] and pj1[i] are pre-determined.

Page 19: PH.D candidate: Lan Liu

Pedigree Graph

[Figure: a pedigree on 9 members with their genotypes, shown next to the corresponding pedigree graph G.]

#edges ≤ 2n

Page 20: PH.D candidate: Lan Liu

Locus Graph

Locus graph $G_i = (V, E_i)$, where $E_i = \{(k,j) \mid k \text{ is a parent of } j,\ w_k[i] = 1\}$.

Example: Locus graph for the 3rd locus

[Figure: (a) genotype information at the locus for the 9 members; (b) the corresponding locus graph, whose edges carry the h-variables $h_{1,4}$, $h_{4,9}$, $h_{6,8}$ and $h_{8,9}$; zero-weight edges are marked.]

p-variables: variables on vertices.
h-variables: variables on edges, shared by all locus graphs.

Page 21: PH.D candidate: Lan Liu

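An Observation

For any cycle, or any path connecting two pre-determined vertices, in a locus graph, the sum of the h-variables along the path is a constant. We can use paths to denote constraints!

(proof sketch) Assume the path in locus graph $G_i$ connects two pre-determined vertices $j_0$ and $j_k$. Each edge on the path gives one equation over $\mathbb{F}_2$:

$$
\begin{aligned}
p_{j_0}[i] + d_{j_0,j_1} + h_{j_0,j_1} &= p_{j_1}[i] \\
p_{j_1}[i] + d_{j_1,j_2} + h_{j_1,j_2} &= p_{j_2}[i] \\
&\;\;\vdots \\
p_{j_{k-1}}[i] + d_{j_{k-1},j_k} + h_{j_{k-1},j_k} &= p_{j_k}[i]
\end{aligned}
$$

Adding these equations cancels the intermediate p-variables:

$$
h_{j_0,j_1} + h_{j_1,j_2} + \cdots + h_{j_{k-1},j_k} = p_{j_0}[i] + p_{j_k}[i] + d_{j_0,j_1} + d_{j_1,j_2} + \cdots + d_{j_{k-1},j_k}
$$

The right-hand side is a constant because $p_{j_0}[i]$ and $p_{j_k}[i]$ are pre-determined and the d-terms are known.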

Page 22: PH.D candidate: Lan Liu

Examples of Linear Constraints

[Figure: three locus graphs on the 9-member pedigree, with pre-determined vertices labeled 0 or 1 and undetermined vertices labeled "?".]

(a) 1st locus graph: $h_{6,8} + h_{8,9} = 1$

(b) 2nd locus graph: $h_{3,5} + h_{3,6} + h_{2,5} + h_{2,6} = 0$

(c) 3rd locus graph: $h_{4,9} + h_{2,4} + h_{2,5} + h_{3,5} + h_{3,6} + h_{6,8} = 0$

Page 23: PH.D candidate: Lan Liu

Linear Constraints

Obviously, the linear constraints are necessary. We can also show that these constraints are sufficient.

Moreover, we can bound #constraints in each locus graph by O(n), while a trivial analysis gives an upper bound of $O(n^2)$.

Total #constraints = O(mn).

Page 24: PH.D candidate: Lan Liu

The ZRHC-PHASE algorithm

Algorithm ZRHC_PHASE

input: a pedigree G=(V,E) and genotype {gj}

output: a general solution of {pj}

begin

Step 1. Preprocessing

Step 2. Linear constraint generation on h-variables

Step 3. Solve h-variables by Gaussian Elimination

Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.

end

Our method: solve h-variables and p-variables separately.
O(mn) linear equations on O(n) h-variables.

Traditional method: solve h-variables and p-variables together.
O(mn) equations on O(mn) unknowns: O(mn) p-variables and O(n) h-variables.
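Step 3 solves a linear system over $\mathbb{F}_2$. The sketch below shows plain Gaussian elimination on such a system, with each equation stored as an integer bitmask. It is an illustration only: the function name `solve_gf2` and the bitmask encoding are assumptions, and it returns one particular solution (free variables set to 0) rather than the general solution the algorithm outputs.

```python
def solve_gf2(rows, num_vars):
    """Solve a linear system over GF(2).

    rows: list of (mask, rhs); bit v of mask is the coefficient of
    variable v, and rhs is 0 or 1.
    Returns one solution as a list of bits, or None if inconsistent."""
    pivots = {}  # highest set bit -> (mask, rhs) of the row owning that pivot
    for mask, rhs in rows:
        while mask:
            p = mask.bit_length() - 1
            if p not in pivots:
                pivots[p] = (mask, rhs)
                break
            pivot_mask, pivot_rhs = pivots[p]
            mask ^= pivot_mask  # eliminate the leading variable
            rhs ^= pivot_rhs
        else:
            # the row reduced to 0 = rhs; rhs == 1 means a contradiction
            if rhs:
                return None
    x = [0] * num_vars
    # every non-pivot bit in a pivot row is lower than the pivot,
    # so solving in increasing pivot order is a valid back-substitution
    for p in sorted(pivots):
        mask, rhs = pivots[p]
        value = rhs
        for v in range(p):
            if (mask >> v) & 1:
                value ^= x[v]
        x[p] = value
    return x
```

For example, the system x0 + x1 = 1, x1 + x2 = 0, x0 + x2 = 1 is encoded as `[(0b011, 1), (0b110, 0), (0b101, 1)]`; the third equation is the sum of the first two, so it is detected as redundant during elimination.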

Page 25: PH.D candidate: Lan Liu

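Our Method

[Figure: the linear system Ax=b is reduced in two steps. Transformation: from O(mn) equations on O(mn) unknowns to O(mn) equations on O(n) unknowns. Redundancy elimination: from O(mn) equations to $O(n \log^2 n \log\log n)$ equations on O(n) unknowns.]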

Page 26: PH.D candidate: Lan Liu

Redundant Equation Elimination

An observation

Given a cycle $j_0, j_1, \ldots, j_k$, assume that there are constraints among each pair of vertices. Originally, there are $O(k^2)$ constraints. Notice that they are not independent. We can replace the original constraints by an equivalent set of constraints of size O(k).

[Figure: a cycle on vertices $j_0, j_1, j_2, \ldots, j_{k-2}, j_{k-1}, j_k$, with constraints such as $j_0 \sim j_2$, $j_2 \sim j_{k-1}$ and $j_0 \sim j_{k-1}$.]

Remove the redundant equations without solving them!

Key lemma

Page 27: PH.D candidate: Lan Liu

Redundant Equation Elimination

Given a spanning tree, the stretch of an edge (k, j) is defined as the length of the unique path between k and j on the tree.

Elkin, Emek, Spielman and Teng show that any graph has a low-stretch spanning tree with average stretch $O(\log^2 n \log\log n)$.

The number of irredundant constraints can be bounded by the sum of cycle lengths, which is further bounded by the sum of stretches, $O(n \log^2 n \log\log n)$.
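The definition of stretch can be illustrated with a short sketch. Note the assumption: it uses an ordinary BFS spanning tree of a connected graph, not the low-stretch tree construction cited above, so it only shows how stretch is measured. The names `bfs_tree` and `stretch` are hypothetical.

```python
from collections import deque

def bfs_tree(n, edges, root=0):
    """Build a BFS spanning tree of a connected graph on vertices 0..n-1.
    Returns (parent, depth) arrays; the root has parent -1."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent = [-1] * n
    depth = [0] * n
    seen = [False] * n
    seen[root] = True
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v] = u
                depth[v] = depth[u] + 1
                queue.append(v)
    return parent, depth

def stretch(u, v, parent, depth):
    """Length of the unique tree path between u and v."""
    length = 0
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u
        u = parent[u]  # move the deeper endpoint one step towards the root
        length += 1
    return length
```

On a 4-cycle, three edges form the tree and have stretch 1, while the remaining edge closes the cycle and has stretch 3.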

Page 29: PH.D candidate: Lan Liu

The Loop-Free ZRHC problem

Problem definition: Given a pedigree without mating loops and the genotype information for each member, find a recombination-free haplotype configuration for each member that obeys the Mendelian law of inheritance.

Page 30: PH.D candidate: Lan Liu

Constraint Graphs

Given the constraints in a pedigree graph, we can construct the corresponding constraint graph.

Pedigree graph -> Constraint graph
- vertex v -> vertex v
- a constraint for the path connecting vertices j and k, with the sum of h-variables along the path being b -> an edge (j, k) with weight b

An example

Constraints (path: sum of h-variables)
- (1,5): 1
- (1,2): 1
- (2,4): 0
- (2,5): 0

[Figure: (a) a pedigree graph on vertices 1 to 5 with these constraints; (b) the corresponding constraint graph.]

Page 31: PH.D candidate: Lan Liu

A Key Lemma

There exists a solution to the loop-free ZRHC problem if and only if the weight sum of every cycle C is 0 in the corresponding constraint graph.

(proof sketch)

"=>": Each h-variable occurs an even number of times in the constraint set S corresponding to C, so the sum of the h-variables in S is 0 over $\mathbb{F}_2$. This sum is equal to the weight sum of C. Hence the weight sum of C is 0.

"<=": Done by a construction later.

[Figure: (a) the pedigree graph on vertices 1 to 5; (b) the corresponding constraint graph.]

The constraints in S are not independent!
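The cycle condition of the lemma can be tested without enumerating cycles. The sketch below is one standard way to do so, not necessarily the construction used in the talk: it labels the vertices of the constraint graph by BFS and reports a violation as soon as some edge disagrees with the labels. The function name `cycles_sum_to_zero` is hypothetical.

```python
from collections import deque

def cycles_sum_to_zero(num_vertices, weighted_edges):
    """Return True iff every cycle of the constraint graph has weight sum 0 (mod 2).

    weighted_edges: list of (j, k, b) with b in {0, 1}."""
    adj = [[] for _ in range(num_vertices)]
    for j, k, b in weighted_edges:
        adj[j].append((k, b))
        adj[k].append((j, b))
    label = [None] * num_vertices
    for start in range(num_vertices):
        if label[start] is not None:
            continue
        label[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, b in adj[u]:
                if label[v] is None:
                    label[v] = label[u] ^ b
                    queue.append(v)
                elif label[v] != label[u] ^ b:
                    # this edge closes a cycle whose weight sum is 1
                    return False
    return True
```

With the example constraints from the earlier slide, (1,2) with weight 1, (1,5) with weight 1, (2,4) with weight 0 and (2,5) with weight 0, the only cycle is 1-2-5 with weight sum 1 + 0 + 1 = 0, so the check succeeds.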

Page 32: PH.D candidate: Lan Liu

A Mapping from Constraints to Edges

The constraints forming a spanning forest in the constraint graph are sufficient to represent all constraints.

There are at most n-1 independent constraints. We can construct an injective mapping f from the independent constraints to edges in the pedigree graph.

Each constraint is mapped to an edge on the path corresponding to the constraint.

Example (constraint path, sum of h-variables -> mapped edge)
- (1,2), sum 1 -> edge (2,3)
- (2,4), sum 0 -> edge (3,4)
- (2,5), sum 0 -> edge (4,5)

[Figure: (a) a spanning forest for the constraint graph; (b) the pedigree graph.]

Page 33: PH.D candidate: Lan Liu

The ZRHC-PHASE algorithm

Algorithm ZRHC_PHASE

input: a pedigree G=(V,E) and genotype {gj}

output: a general solution of {pj}

begin

Step 1. Preprocessing

Step 2. Linear constraint generation on h-variables

Step 3. Solve h-variables by Gaussian Elimination

Step 4. Solve the p-variables by propagation from pre-determined p-variables to others.

end

It takes $O(n^3)$ time!

Page 34: PH.D candidate: Lan Liu

Solving h-variables

In order to obtain a linear-time algorithm, we want to avoid the Gaussian elimination method.

An observation: Given a constraint along a path $j_0, j_1, \ldots, j_{k-1}, j_k$:

$$h_{j_0,j_1} + h_{j_1,j_2} + \cdots + h_{j_{k-1},j_k} = b$$

We can solve the constraint in the following way:

- Assign the h-variables on edges $(j_0,j_1), (j_1,j_2), \ldots, (j_{k-2},j_{k-1})$ arbitrarily.
- Assign the h-variable on the last edge $(j_{k-1},j_k)$ a fixed value to satisfy the constraint: $h_{j_{k-1},j_k} = h_{j_0,j_1} + \cdots + h_{j_{k-2},j_{k-1}} + b$.
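The observation translates directly into code. This is a minimal sketch for a single path constraint over $\mathbb{F}_2$; the function name `assign_path` and the use of a seeded random generator for the "arbitrary" values are assumptions made for illustration.

```python
import random

def assign_path(num_edges, b, rng=None):
    """Assign 0/1 values to the h-variables on a path with num_edges >= 1 edges
    so that their sum over GF(2) equals b."""
    if rng is None:
        rng = random.Random(0)
    # all edges but the last get arbitrary values
    values = [rng.randint(0, 1) for _ in range(num_edges - 1)]
    # the last edge is forced: it equals b plus the sum of the others (mod 2)
    last = b
    for h in values:
        last ^= h
    values.append(last)
    return values
```

Whatever values the first k-1 edges receive, the last edge absorbs the difference, so the constraint always holds.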

Page 35: PH.D candidate: Lan Liu

Solving h-variables Based on the Mapping f

We have constructed the injective mapping f : S -> E, where S is the constraint set and E is the edge set.

We solve h-variables as follows:
- For each h-variable corresponding to an edge e not in f(S), assign an arbitrary value.
- For each h-variable corresponding to an edge e in f(S), assign a fixed value based on the constraint $f^{-1}(e)$, such that the constraint is satisfied.

h-variables can be solved by a single BFS traversal.

Example (constraint path, sum of h-variables -> mapped edge)
- (1,2), sum 1 -> edge (2,3)
- (2,4), sum 0 -> edge (3,4)
- (2,5), sum 0 -> edge (4,5)

[Figure: the pedigree graph on vertices 1 to 5 with a value assigned to each edge; edges in f(S) are distinguished from edges not in f(S).]

Page 37: PH.D candidate: Lan Liu

Motivation

With the rapid development of genotyping technologies, there are more than 10 million verified single-nucleotide polymorphisms (SNPs) in dbSNP.

We aim to select a subset of informative SNPs (i.e. tagSNPs) to reduce the cost of genotyping all SNPs and performing disease association mapping.

Page 38: PH.D candidate: Lan Liu

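$r^2$ Linkage Disequilibrium Statistics

Given a pair of genetic markers 1 and 2, with alleles A/a at one marker and B/b at the other, the haplotype frequencies and their marginals are:

         B       b
    A    p_AB    p_Ab    p_A.
    a    p_aB    p_ab    p_a.
         p_.B    p_.b

$r^2$ statistic:

$$r^2 = \frac{(p_{AB} - p_{A\cdot}\, p_{\cdot B})^2}{p_{A\cdot}(1 - p_{A\cdot})\; p_{\cdot B}(1 - p_{\cdot B})}$$

If $r^2$ is no less than a given threshold $r_0$, marker 1 (or marker 2) can tag marker 2 (or marker 1, respectively).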
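As a worked illustration of the formula, the sketch below computes $r^2$ from the four haplotype frequencies of the table. The function name `r_squared` is hypothetical, and returning `None` when a marker is monomorphic (zero denominator) is an assumption of this sketch.

```python
def r_squared(p_AB, p_Ab, p_aB, p_ab):
    """r^2 linkage disequilibrium statistic from the four haplotype frequencies
    (which are expected to sum to 1)."""
    p_A = p_AB + p_Ab  # marginal frequency of allele A
    p_B = p_AB + p_aB  # marginal frequency of allele B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        return None  # one of the markers is monomorphic; r^2 is undefined
    d = p_AB - p_A * p_B
    return d * d / denom
```

Two markers that always occur together, such as frequencies (0.5, 0, 0, 0.5), give $r^2 = 1$, while independent markers, such as (0.25, 0.25, 0.25, 0.25), give $r^2 = 0$.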

Page 39: PH.D candidate: Lan Liu

The TagSNP Selection Problem

Given a set V of SNP markers and LD patterns $E = \{(v_{j_1}, v_{j_2}) \mid r^2(v_{j_1}, v_{j_2}) \ge r_0,\ v_{j_1}, v_{j_2} \in V\}$ for a given threshold $r_0$, we want to select a subset V' of minimum cardinality such that for any v in V there exists a v' in V' with $r^2(v, v') \ge r_0$.

If we define G=(V, E), a tagSNP set is equivalent to a dominating set on G.
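Since a tagSNP set is a dominating set, the classic greedy heuristic for dominating sets gives a simple baseline. The sketch below is that textbook heuristic, shown for illustration only; it is not claimed to be the GreedyTag algorithm of the talk, and the function name is hypothetical.

```python
def greedy_dominating_set(vertices, edges):
    """Greedy heuristic: repeatedly pick the vertex that covers the most
    still-uncovered vertices (a vertex always covers itself)."""
    closed = {v: {v} for v in vertices}  # closed neighbourhoods
    for u, v in edges:
        closed[u].add(v)
        closed[v].add(u)
    uncovered = set(vertices)
    chosen = []
    while uncovered:
        best = max(vertices, key=lambda v: len(closed[v] & uncovered))
        chosen.append(best)
        uncovered -= closed[best]
    return chosen
```

On a hypothetical star-shaped LD graph where SNP 3 is in strong LD with every other SNP, the heuristic selects SNP 3 alone. The greedy choice is not guaranteed to be minimum in general.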

[Figure: (a) six SNP markers and their LD patterns in a population; (b) the tagSNPs selected for the population.]

Page 40: PH.D candidate: Lan Liu

TagSNP Selection across Populations

In two populations with different evolutionary histories, a pair of SNPs having remarkably different marker frequencies and very weak LD may show strong LD in the admixed population.

Therefore, tagSNPs picked from the combined populations or one of the populations might not be sufficient to capture the variations in all populations.

Page 41: PH.D candidate: Lan Liu

Problem Definition

Given a set of SNP markers and LD patterns in multiple populations, we want to find a minimum-cardinality set of SNPs that is a tagSNP set in every one of the populations.

The above problem is called the minimum common tagSNP selection problem (MCTS).

[Figure: (a) SNP markers 1 to 6 and their LD patterns in two populations, Population 1 and Population 2; (b) the minimum tagSNP set for these two populations.]

Page 42: PH.D candidate: Lan Liu

Our Algorithms

The MCTS problem can be easily formulated by integer linear programming.

We first apply some data reduction rules, then use one of the following algorithms:
- A greedy algorithm: GreedyTag
- A Lagrangian relaxation algorithm: LRTag

We calculate both the upper bound (i.e. the number of the tagSNPs obtained by our algorithms) and the lower bound (i.e. the minimum number of tagSNPs needed).

Lower bound: GreedyTag_lb and LRTag_lb
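One natural way to write MCTS as an integer linear program is sketched below. The notation is an assumption introduced here, not taken from the slides: $x_v$ indicates whether SNP $v$ is selected, $G_k = (V, E_k)$ is the LD graph of population $k$, and $N_k[v]$ is the closed neighbourhood of $v$ in $G_k$ (that is, $v$ together with every SNP in strong LD with it in population $k$). The sketch also assumes every SNP is present in every population.

```latex
\begin{aligned}
\text{minimize}   \quad & \sum_{v \in V} x_v \\
\text{subject to} \quad & \sum_{u \in N_k[v]} x_u \;\ge\; 1
                          && \text{for every population } k \text{ and every } v \in V, \\
                        & x_v \in \{0, 1\} && \text{for every } v \in V .
\end{aligned}
```

Each constraint says that SNP $v$ must be tagged in population $k$ by at least one selected SNP. Relaxing $x_v \in \{0,1\}$ to $0 \le x_v \le 1$ gives a linear program whose optimum is a lower bound on the number of tagSNPs needed.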

Page 43: PH.D candidate: Lan Liu

Experimental Result

We apply our algorithms on real HapMap data (release #19, NCBI build 34, October 2005).

There are four populations in the HapMap data:
- CEU: people of European descent.
- CHB: Chinese people from Beijing.
- JPT: Japanese people from Tokyo.
- YRI: Yoruba people of Ibadan, Nigeria.

We get tagSNPs for the following two datasets:
- ENCODE regions: all 10 ENCODE regions, with 10,859 markers in total.
- Human genome: chromosomes 1 – 22, with 2,862,454 markers in total.

Page 44: PH.D candidate: Lan Liu

Experimental Results for ENCODE Regions

We compare our GreedyTag and LRTag with MultiPop-TagSelect (MPS).

With the $r^2$ threshold at 0.5, the gap between LRTag_lb and LRTag is at most two for each ENCODE region and six in total over all ENCODE regions. With the $r^2$ threshold at 0.8, there is no gap.

Page 45: PH.D candidate: Lan Liu

Experimental Results for Human Genome

On the entire human genome with 2,862,454 SNPs, the gap between our solution and the lower bound is 1061 SNPs with the $r^2$ threshold at 0.5, and 142 SNPs with the $r^2$ threshold at 0.8.

The numbers of tagSNPs selected by our algorithms are almost optimal.

Page 47: PH.D candidate: Lan Liu

Problem Definitions

P(n): given a positive integer n, a partition of n is a multiset of positive integers $\{n_1, n_2, \ldots, n_r\}$ such that $\sum_{i=1}^{r} n_i = n$.

Example: given n=4, {2,2} is a P(4); given n=3, {3} is a P(3).

IP(S): given a multiset $S = \{x_1, \ldots, x_m\}$, an integer partition of S is a disjoint union of partitions $P(x_1), \ldots, P(x_m)$.

Example: given S = {3, 3, 4}, {2,2,3,3} is an IP({3,3,4}).

Page 48: PH.D candidate: Lan Liu

Examples

CIP(S1, S2, …, Sk): given multisets S1, S2, …, Sk, a common integer partition is a multiset that is an integer partition of every one of them.

Example: given S = {3, 3, 4} and T = {2, 2, 6}, {2,2,3,3} is a CIP(S,T); {1,1,2,2,4} is also a CIP(S,T).

MCIP(S1, S2, …, Sk): a common integer partition with the minimum cardinality.

Example: {2,2,3,3} is an MCIP(S,T).

#P(100) = 190,569,292

MCIP is NP-hard.
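The definitions can be checked mechanically on small inputs. The sketch below is a brute-force backtracking test, exponential in the worst case and meant only to verify examples such as the ones above; the function names are hypothetical, and all inputs are assumed to be lists of positive integers.

```python
def is_integer_partition(parts, S):
    """True if the multiset `parts` can be split into groups whose sums are
    exactly the elements of the multiset S."""
    if sum(parts) != sum(S):
        return False
    parts = sorted(parts, reverse=True)  # placing large parts first prunes early
    remaining = list(S)                  # capacity still open in each element of S

    def place(i):
        if i == len(parts):
            return True  # the sums match, so every capacity is now 0
        tried = set()
        for j, r in enumerate(remaining):
            if r >= parts[i] and r not in tried:
                tried.add(r)  # equal capacities are interchangeable
                remaining[j] -= parts[i]
                found = place(i + 1)
                remaining[j] += parts[i]
                if found:
                    return True
        return False

    return place(0)

def is_common_integer_partition(parts, multisets):
    """True if `parts` is an integer partition of every multiset."""
    return all(is_integer_partition(parts, S) for S in multisets)
```

Both {2,2,3,3} and {1,1,2,2,4} pass the check for S = {3,3,4} and T = {2,2,6}. No common integer partition of size 3 exists here, because a size-3 partition of a 3-element multiset must equal it and S differs from T, which is why {2,2,3,3} is minimum.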

Page 49: PH.D candidate: Lan Liu

Biological Applications (1)

The distance between two strings, e.g. the genetic distance between two genomes:

a b c d e f g h i j k h

h i j k h e f g a b c d

Minimum Common Substring Partition

Page 50: PH.D candidate: Lan Liu

Biological Applications(2)

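MCIP is a special case of Minimum Common Substring Partition (MCSP).

MCIP(S',T') with $S' = \{x_1, x_2, \ldots, x_m\}$ and $T' = \{y_1, y_2, \ldots, y_n\}$ corresponds to MCSP(S,T) with

$$S = \underbrace{aa\ldots a}_{x_1} \vdash \underbrace{aa\ldots a}_{x_2} \vdash \cdots \vdash \underbrace{aa\ldots a}_{x_m}$$

$$T = \underbrace{aa\ldots a}_{y_1} \dashv \underbrace{aa\ldots a}_{y_2} \dashv \cdots \dashv \underbrace{aa\ldots a}_{y_n}$$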

Page 51: PH.D candidate: Lan Liu

Our Result

2-MCIP: MCIP on two input multisets

k-MCIP: MCIP on k input multisets

APX-hard: there is a constant c > 1 such that the problem cannot be approximated within ratio c unless P = NP.

- 2-MCIP: approximation upper bound 5/4; approximation lower bound APX-hard
- k-MCIP (k>2): approximation upper bound 3k(k-1)/(3k-2); approximation lower bound APX-hard

Page 52: PH.D candidate: Lan Liu

Conclusion and Future Work

- The haplotype inference problem
  - Biological background
  - Approximation and complexity of MRHC
  - Efficient algorithms for ZRHC
  - A linear-time algorithm for loop-free ZRHC
- The tagSNP selection problem
- The minimum common integer partition problem

Page 53: PH.D candidate: Lan Liu

References

L. Liu and T. Jiang. Linear-Time Reconstruction of Zero-Recombinant Mendelian Inheritance on Pedigrees without Mating Loops. In submission.

L. Liu, Y. Wu, S. Lonardi and T. Jiang. Efficient Algorithms for Genome-wide TagSNP Selection across Populations via Linkage Disequilibrium Criterion. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).

Y. Wu, L. Liu, T. Close and S. Lonardi. Deconvoluting the BAC-gene Relationship Using a Physical Map. To appear in Proc. of the 6th Annual International Conference on Computational Systems Bioinformatics (CSB'2007).

J. Xiao, L. Liu, L. Xia and T. Jiang. Fast Elimination of Redundant Linear Equations and Reconstruction of Recombination-Free Mendelian Inheritance on a Pedigree. In Proc. of the ACM-SIAM Symposium on Discrete Algorithms (SODA'2007), pp. 655-664.

X. Chen, L. Liu, Z. Liu and T. Jiang. On the Minimum Common Integer Partition Problem. In Proc. of the 6th Conference on Algorithms and Complexity, Rome, Italy, pp. 236-247.

L. Liu, X. Chen, J. Xiao and T. Jiang. Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem. In Proc. of the 16th Annual International Symposium on Algorithms and Computation (ISAAC'05), pp. 370-379. [Best paper nominations: 5.35%]. To appear in Theoretical Computer Science.

Page 54: PH.D candidate: Lan Liu

Thanks for your time and attention!