Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution Yun S. Song, Yufeng Wu, Dan Gusfield UC Davis ISMB 2005

Page 1

Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution

Yun S. Song, Yufeng Wu, Dan Gusfield

UC Davis

ISMB 2005

Page 2

Meiotic Recombination (single-crossover)

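Recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species.

Estimating the frequency and the location of recombination is central to modern-day genetics (e.g. disease association mapping).

[Figure: single-crossover recombination between two sequences with sites 1 to L; the recombinant consists of the prefix of one sequence up to the breakpoint b and the suffix of the other.]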
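The single-crossover event on this slide can be sketched in a few lines of Python. The function name and the convention that b counts the sites taken from the prefix parent are assumptions made for illustration; they are not taken from the slides.

```python
def single_crossover(prefix_parent: str, suffix_parent: str, b: int) -> str:
    """Return the recombinant of two equal-length sequences.

    Assumed convention: the first b sites come from prefix_parent and
    the remaining sites come from suffix_parent, with 1 <= b < L.
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same length L")
    if not 1 <= b < len(prefix_parent):
        raise ValueError("breakpoint b must satisfy 1 <= b < L")
    return prefix_parent[:b] + suffix_parent[b:]


# Hypothetical parents, chosen only to make the breakpoint visible.
child = single_crossover("00000", "11111", 2)
```

With these two parents the child carries two sites from the first parent followed by three sites from the second.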

Page 3

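SNP Sequences

Assumption: at most one mutation per site.

Possible gametic types: { 00, 01, 10, 11 }

[Figure, left: s1 = 00, s2 = 01, s3 = 10, s4 = 10. A tree rooted at 00 with one mutation at site 1 and one at site 2 has leaves 10, 10, 01, 00.]

[Figure, right: s1 = 00, s2 = 01, s3 = 10, s4 = 11. All four possible gametic types are present; the history shown has leaves 10, 11, 01, 00 and uses a recombination in addition to the two mutations. Legend: Recombination, Mutation.]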
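The check illustrated on this slide (whether a pair of sites shows all four gametic types) can be sketched as follows. The function name is hypothetical; the two data sets are the ones read off the slide.

```python
def has_all_four_gametes(rows, i, j):
    """True if sites i and j jointly show 00, 01, 10 and 11.

    rows is a list of equal-length strings over {'0', '1'}.
    Under the at-most-one-mutation-per-site assumption, seeing all
    four types at a pair of sites means a recombination is needed.
    """
    seen = {r[i] + r[j] for r in rows}
    return seen >= {"00", "01", "10", "11"}


left_example = ["00", "01", "10", "10"]   # explained by mutations alone
right_example = ["00", "01", "10", "11"]  # all four gametic types
```

Here `has_all_four_gametes(left_example, 0, 1)` is false and `has_all_four_gametes(right_example, 0, 1)` is true.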

Page 4

Minimizing Recombinations

Given a set M of sequences, what is the minimum number Rmin(M) of recombinations needed for constructing evolutionary histories that explain M?

Kreitman’s data from the adh locus of D. melanogaster (1983)

M = [Figure: matrix of 11 sequences, rows labeled 1 to 11.]

Minimization is NP-hard. (Wang et al 2000, Semple 2004)

Page 5

Bounds on the Minimum Number Rmin(M) of Recombinations

M, a set of sequences

L(M) ≤ Rmin(M) ≤ U(M)

Lower bound L(M): There are many methods.

Minimum Rmin(M): No efficient method.

Upper bound U(M): Novel.

Our Contribution:

Efficient, practical algorithms for computing lower and upper bounds on Rmin(M).

Key idea: If L(M) = U(M), then we know Rmin(M).

Empirical observation: L(M) = U(M) frequently for a surprisingly large range of data.

Page 6

The Composite Method (Myers & Griffiths 2003)

Let L(I) be a “local” recombination lower bound for I.

1. Given a set of intervals, and

2. for each interval I, a number L(I)

Composite Problem: Find the minimum number of vertical lines so that every I intersects at least L(I) vertical lines.

The composite recombination lower bound on Rmin(M) is given by a solution to the composite problem.

[Figure: intervals over the columns of M annotated with local bounds L(I); the transcript preserves only the numbers 2, 1, 2, 2, 2, 31 and 8.]
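A minimal sketch of one way to solve the composite problem: a right-endpoint greedy, which is a standard approach for this kind of interval problem when several vertical lines may share a position. It is not claimed to be the procedure used by Myers & Griffiths or by RecMin. The input format, the function name and the example intervals are assumptions.

```python
def composite_bound(intervals):
    """Greedy solution to the composite problem.

    intervals: list of (left, right, need) with integer site indices,
    left < right, and need = local bound L(I).
    Assumed convention: a vertical line at position p lies between
    sites p and p + 1, so it intersects (left, right) iff left <= p < right.
    Returns (total number of lines, {position: lines placed there}).
    """
    lines = {}
    # Process intervals by right endpoint; put any missing lines as far
    # right as possible so they also help intervals that end later.
    for left, right, need in sorted(intervals, key=lambda t: t[1]):
        have = sum(n for p, n in lines.items() if left <= p < right)
        if have < need:
            lines[right - 1] = lines.get(right - 1, 0) + (need - have)
    return sum(lines.values()), lines


# Hypothetical local bounds on three overlapping intervals.
total, placement = composite_bound([(0, 2, 1), (1, 3, 1), (0, 3, 2)])
```

For these hypothetical intervals the greedy places one line between sites 1 and 2 and one between sites 2 and 3, giving a composite bound of 2.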

Page 7

Optimal Haplotype Bound as L(I)

S = a subset of columns in I.

Haplotype Bound h(S) = (number of distinct rows restricted to S) − (number of distinct columns in S) − 1.

Optimal Haplotype Bound Opt(I) = maximum value of h(S) over all subsets S of columns in interval I.

Example (interval I):

1 0 0 1 0 1
1 1 1 0 1 0
0 0 1 0 0 0
1 0 1 0 1 1
0 1 0 1 1 1
1 1 1 0 1 1

S = columns {1, 3}: distinct rows 10, 11, 01, 00, so h(S) = 4 − 2 − 1 = 1

S = columns {2, 3, 6}: distinct rows 001, 110, 010, 011, 101, 111, so h(S) = 6 − 3 − 1 = 2
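The haplotype bound defined on this slide can be computed directly from its definition. The sketch below uses the 6 x 6 example matrix from the slide with 0-based column indices; the function names are hypothetical, and the exhaustive search over subsets is only meant for tiny inputs.

```python
from itertools import combinations

EXAMPLE = [
    (1, 0, 0, 1, 0, 1),
    (1, 1, 1, 0, 1, 0),
    (0, 0, 1, 0, 0, 0),
    (1, 0, 1, 0, 1, 1),
    (0, 1, 0, 1, 1, 1),
    (1, 1, 1, 0, 1, 1),
]


def haplotype_bound(rows, cols):
    """h(S) = distinct rows restricted to S - distinct columns in S - 1."""
    distinct_rows = {tuple(r[c] for c in cols) for r in rows}
    distinct_cols = {tuple(r[c] for r in rows) for c in cols}
    return len(distinct_rows) - len(distinct_cols) - 1


def optimal_haplotype_bound(rows):
    """Opt(I) by brute force over all non-empty column subsets."""
    n = len(rows[0])
    return max(
        haplotype_bound(rows, cols)
        for k in range(1, n + 1)
        for cols in combinations(range(n), k)
    )


h_two_columns = haplotype_bound(EXAMPLE, (0, 2))       # columns {1, 3}
h_three_columns = haplotype_bound(EXAMPLE, (1, 2, 5))  # columns {2, 3, 6}
```

The two subsets reproduce the values 1 and 2 shown on the slide. The brute-force search takes time exponential in the number of columns, so it does not scale beyond small intervals.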

Page 8

Myers & Griffiths: For every interval I, restrict the maximum size (s) of S and the maximum distance (w) between the leftmost and the rightmost columns in S, that is, |S| ≤ s and d ≤ w, where d is that distance.

Implemented in RecMin, along with the composite method.

Computing this optimal haplotype bound Opt(I) is NP-hard. (Bafna & Bansal 2005)

RecMin is a tremendous improvement over previous practical lower bound methods. But,

1. The user is instructed to experiment with parameters s and w until the bound does not change.

2. That does not guarantee that the bound could not be improved by further increasing the parameters.

Page 9

(Implemented in HapBound)

1. No parameters.

2. Much faster than RecMin.

3. Implements additional ideas that produce lower bounds even better than the optimal haplotype bound.

How to derive sharper bounds? In the composite method, check if each local bound L(I) is in fact equal to Rmin(I), and if not, increase L(I) by one. (–S and –M options)

We can compute the optimal haplotype bound exactly.

We cast the problem as a classic set cover problem that can be formulated as an ILP problem, with 1 variable per column and 1 inequality per pair of rows.

Can use either GNU ILP Solver or CPLEX.
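The set cover view can be illustrated without an ILP solver. In the sketch below each column plays the role of a 0/1 variable and each pair of distinct rows contributes one constraint: at least one chosen column must tell the two rows apart. An exhaustive search stands in for the solver, so this is only an illustration for small inputs and not HapBound's implementation; the function names and the way the bound is assembled from the cover are assumptions.

```python
from itertools import combinations

EXAMPLE = [
    (1, 0, 0, 1, 0, 1),
    (1, 1, 1, 0, 1, 0),
    (0, 0, 1, 0, 0, 0),
    (1, 0, 1, 0, 1, 1),
    (0, 1, 0, 1, 1, 1),
    (1, 1, 1, 0, 1, 1),
]


def min_distinguishing_columns(rows):
    """Smallest column set that separates every pair of distinct rows."""
    distinct = sorted(set(map(tuple, rows)))
    n_cols = len(distinct[0])
    # One constraint per pair of rows: the columns where they differ.
    constraints = [
        {c for c in range(n_cols) if a[c] != b[c]}
        for a, b in combinations(distinct, 2)
    ]
    for k in range(n_cols + 1):
        for cols in combinations(range(n_cols), k):
            chosen = set(cols)
            if all(chosen & diff for diff in constraints):
                return cols
    raise AssertionError("distinct rows always differ in some column")


def haplotype_bound_via_cover(rows):
    """(number of distinct rows) - (size of a minimum cover) - 1."""
    cover = min_distinguishing_columns(rows)
    return len(set(map(tuple, rows))) - len(cover) - 1


cover = min_distinguishing_columns(EXAMPLE)
bound = haplotype_bound_via_cover(EXAMPLE)
```

On the example matrix a minimum cover has three columns, giving 6 − 3 − 1 = 2.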

Page 10

RecMin vs. HapBound on the human LPL data (Nickerson et al., 1998)

88 sequences, 48 sites

Program                        Lower Bound   Time
RecMin –s 8 –w 12 (default)    59            3 sec
RecMin –s 25 –w 25             75            7944 sec
RecMin –s 48 –w 48             No result     5 days
HapBound                       75            31 sec
HapBound –S                    78            1643 sec

Page 11

Upper Bound on Rmin(M)

Branch and Bound construction of genealogies backwards in time (using an alternating series of coalescent, mutation, and recombination events).

[Figure: a genealogy with leaves ordered 1, 9, 7, 10, 3, 6, 5, 2, 8, 4. Legend: Mutation, Recombination.]
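The two recombination-free steps of such a backwards construction can be sketched as a data reduction. This is only an illustration under stated assumptions (0 is the ancestral state at every site, and the function name is hypothetical); the recombination step and the branch and bound search are not shown, and this is not SHRUB's code.

```python
def reduce_backwards(rows):
    """Undo mutations and coalescences until neither applies.

    Mutation step: drop every column with at most one 1, since it can
    be explained by a single mutation on one lineage (or by none).
    Coalescent step: merge identical rows into one.
    If a single row remains, no recombination is needed.
    """
    rows = [tuple(r) for r in rows]
    while True:
        before = rows
        n_cols = len(rows[0]) if rows else 0
        keep = [c for c in range(n_cols) if sum(r[c] for r in rows) > 1]
        rows = [tuple(r[c] for c in keep) for r in rows]
        rows = list(dict.fromkeys(rows))  # merge duplicates, keep order
        if rows == before:
            return rows


tree_like = reduce_backwards([(0, 0), (0, 1), (1, 0), (1, 0)])
stuck = reduce_backwards([(0, 0), (0, 1), (1, 0), (1, 1)])
```

The first data set collapses to a single row, so mutations and coalescences alone explain it. The second, which contains all four gametic types, cannot be reduced, and a recombination event would have to be tried next.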

Page 12

B&B uses recombination lower bounds and randomization.

Implemented in SHRUB (Simulated History Recombination Upper Bound)

SHRUB constructs genealogies that can be viewed using an open source program. Each genealogy contains U(M) recombination vertices.

Page 13

Kreitman’s ADH data (1983)

11 alleles of the alcohol dehydrogenase locus of Drosophila melanogaster. (43 sites)

There is only one previously implemented method that computes Rmin(M) exactly. (Song and Hein, 2003)

That method took about 1.5 GB of memory and 30 minutes of CPU time to find Rmin(M) = 7.

Page 14

We tried 9 different implemented lower bound methods, aside from HapBound. They all produced either 5 or 6.

Both HapBound (with –M option) and SHRUB produced 7 and took only a fraction of a second to analyze this data set.

L(M) = U(M) = 7, so Rmin(M) = 7.

[Figure: an evolutionary history, found by SHRUB, with 7 recombination events; axis label: Time.]

It corresponds to the most parsimonious history.

Page 15

The Human LPL Data (Nickerson et al., 1998)

(88 Sequences, 48 sites)

(We ignored insertion/deletion, unphased sites, and sites with missing data.)

[Figure: bounds for the LPL data. Legend: composite optimal haplotype bounds; our new lower bound (HapBound –S –M); upper bound (SHRUB).]

Page 16

Match frequency for simulated data

ρ = scaled recombination rate. θ = scaled mutation rate.

Frequency of having L(M) = U(M)

Used Hudson’s MS to generate 1000 simulated datasets for each pair of θ and ρ.

For < 5, our lower and upper bounds match over 90% of the time.

This is significant progress, as there currently exists no other method that can find Rmin(M) for more than 9 sequences after some data reduction.

n = number of sequences

Page 17

Software

HapBound and SHRUB can be found at

wwwcsif.cs.ucdavis.edu/~gusfield/