Computational Molecular Biology Non-unique Probe Selection via Group Testing

Upload: asher-richardson

Post on 02-Jan-2016

Page 1: Computational Molecular Biology Non-unique Probe Selection via Group Testing

Computational Molecular Biology

Non-unique Probe Selection via Group Testing

My T. [email protected]

DNA Microarrays

DNA Microarrays are small, solid supports onto which the sequences from thousands of different genes are immobilized, or attached, at fixed locations.

They contain a very large number of genes on a small chip.

A tool for performing large numbers of DNA-RNA hybridization experiments in parallel.

Applications

Quantitative analysis of expression levels of individual genes:
  The comparison of cell samples from different tissues.
  Computational diagnostics.

Qualitative analysis of an unknown sample:
  Identification of micro-bacterial organisms.
  Detection of contamination of biotechnological products.
  Identification of viral subtypes.

Unique Probes vs. Non-unique Probes

Unique probes:
  Gene-specific probes or signature probes.
  Difficult to find such probes.

Non-unique probes:
  Hybridize to more than one target.
  Difficult to design the test based on non-unique probes.

Probe-Target Matrix

12 probe candidates, 4 targets (genes).

For a target set S, define P(S) as the set of probes reacting to any target in S.
  P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}.
  P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}.

Symmetric set difference: P({1, 2}) ∆ P({2, 3}) = {2, 5, 6, 10}. These are the probes that separate the two sets.
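The symmetric difference on this slide can be checked directly with Python sets. This is only a sketch: the two probe sets are copied from the slide, and the variable names are illustrative.

```python
# Probe sets copied from the slide; the underlying 12 x 4 matrix is not shown there.
P_12 = {1, 2, 3, 4, 7, 8, 9, 10, 12}   # P({1, 2})
P_23 = {1, 3, 4, 5, 6, 7, 8, 9, 12}    # P({2, 3})

# Symmetric difference: probes that react to exactly one of the two target sets.
separating = P_12 ^ P_23
print(sorted(separating))  # [2, 5, 6, 10]
```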

Probe-Target Matrix

[Figure: a non-unique probe for each sequence group. Probes p1 to p5 are drawn against sequences s1 to s9, which fall into group1, group2 and group3.]

Columns t1 to t9 are labeled group1, group2, group3 on the slide.

      t1  t2  t3  t4  t5  t6  t7  t8  t9
p1     1   1   0   0   0   1   0   0   0
p2     0   0   0   0   1   1   1   0   0
p3     1   0   1   0   0   0   0   1   0
p4     0   1   1   1   0   0   0   0   0
p5     0   0   0   0   0   0   1   1   1
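As a rough illustration, P(S) can be computed from the five-probe, nine-target matrix on this slide. The function name `probes_reacting` is hypothetical, and reading the fifth row as probe p5 is an assumption.

```python
# Incidence matrix from the slide: rows are probes, columns are targets t1..t9.
# Treating the fifth row as p5 is an assumption about the slide's labelling.
H = [
    [1, 1, 0, 0, 0, 1, 0, 0, 0],  # p1
    [0, 0, 0, 0, 1, 1, 1, 0, 0],  # p2
    [1, 0, 1, 0, 0, 0, 0, 1, 0],  # p3
    [0, 1, 1, 1, 0, 0, 0, 0, 0],  # p4
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # p5
]

def probes_reacting(H, targets):
    """Return P(S): 1-based indices of probes reacting to any target in S."""
    return {i + 1 for i, row in enumerate(H) if any(row[t - 1] for t in targets)}

# Targets t1 and t6 are separated by probes p2 and p3.
sep_1_6 = probes_reacting(H, {1}) ^ probes_reacting(H, {6})

# The target sets {t1, t2} and {t2, t3} trigger the same probes, so these
# five probes alone cannot tell them apart.
sep_12_23 = probes_reacting(H, {1, 2}) ^ probes_reacting(H, {2, 3})
```

An empty symmetric difference, as for the second pair, means no probe in the matrix distinguishes the two target sets.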

The Problem

Given a sample with m items and a set of n non-unique probes

Goal: determine the presence or absence of targets

The Approach

3 Steps:
  1. Pre-select suitable probe candidates and compute the probe-target incidence matrix H.
  2. Select a minimal set of probes and compute a suitable design matrix D (a sub-matrix of H).
  3. Decode the result.

An Example

[Figure: matrix H and its sub-matrix D.]

Observation

What property must a sub-matrix D have?

If we want to identify at most d targets without testing errors, D should be either a d-separable matrix or a d-disjunct matrix.

With at most e errors, the matrix should be e-error-correcting. That is, the Hamming distance between any two unions (of at most d columns) must be at least 2e + 1.
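The two properties can be tested by brute force on small matrices. The sketch below assumes the usual definitions: a matrix is d-disjunct if no column is contained in the union of any d other columns, and d-separable if the unions of distinct sets of at most d columns are all distinct (which variant of separability the slide intends is an assumption). The function names are hypothetical, and the running time is exponential in d.

```python
from itertools import combinations

def column_support(H, j):
    """Rows (probes) in which column j has a 1."""
    return frozenset(i for i, row in enumerate(H) if row[j])

def is_d_disjunct(H, d):
    n = len(H[0])
    cols = [column_support(H, j) for j in range(n)]
    for j in range(n):
        others = [k for k in range(n) if k != j]
        # Checking groups of exactly d others suffices: unions only grow.
        for group in combinations(others, min(d, len(others))):
            union = frozenset().union(*(cols[k] for k in group))
            if cols[j] <= union:      # column j is hidden by the group
                return False
    return True

def is_d_separable(H, d):
    n = len(H[0])
    cols = [column_support(H, j) for j in range(n)]
    seen = set()
    for size in range(d + 1):         # all column sets of size 0..d
        for group in combinations(range(n), size):
            union = frozenset().union(*(cols[k] for k in group))
            if union in seen:         # two target sets give the same outcome
                return False
            seen.add(union)
    return True

# Five-probe, nine-target matrix from the Probe-Target Matrix slide.
H = [
    [1, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

On this matrix the two notions differ: it is 1-separable (all columns are distinct and non-empty) but not 1-disjunct, because column t4 is contained in column t2.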

Non-unique probe selection via group testing

Given a matrix H, find a submatrix D such that D is d-separable or d-disjunct with the minimum number of rows.

(This definition extends easily to the error-correcting model.)

Min-d-DS

Minimum-d-Disjunct Submatrix: Given a binary matrix M, find a submatrix H with the minimum number of rows and the same number of columns such that H is d-disjunct.

Complexity

Theorem: Min-d-DS is NP-hard for any fixed d ≥ 1.

Proof: Reduce 3-dimensional matching to it.

[Figure: the reduction, showing the sets X, Y, Z, elements a, b, c and the resulting 0/1 matrix entries.]

Min-1-DS is NP-hard.

Approximation

Pair: (c0, <c1…cd>), a column c0 together with d other columns c1, …, cd.

Cover: a probe is said to cover a pair (c0, <c1…cd>) if its entry at c0 is 1 and its entries at c1, …, cd are all 0.

Greedy approach: While some pair is still uncovered, at each iteration choose a probe that covers the most uncovered pairs.

Approximation ratio: 1 + (d+1) ln n.

If NP is not contained in DTIME(n^{log log n}), then no polynomial-time approximation has performance ratio (1-ε) ln n for any ε > 0.
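The greedy rule can be sketched in a few lines. This is an illustrative implementation under stated assumptions, not the authors' code: rows are probes, columns are targets, the function name `greedy_d_disjunct_rows` is hypothetical, and all pairs are enumerated explicitly, so it is only practical for small n and d.

```python
from itertools import combinations

def greedy_d_disjunct_rows(H, d):
    """Greedily pick rows of H until every pair (c0, <c1..cd>) is covered.

    Returns the chosen row indices, or None if H itself is not d-disjunct.
    """
    n = len(H[0])
    pairs = {(c0, rest)
             for c0 in range(n)
             for rest in combinations([c for c in range(n) if c != c0], d)}

    def covered_by(row):
        # A row covers a pair if it has 1 at c0 and 0 at every other column.
        return {(c0, rest) for (c0, rest) in pairs
                if row[c0] == 1 and all(row[c] == 0 for c in rest)}

    cover = [covered_by(row) for row in H]
    uncovered, chosen = set(pairs), []
    while uncovered:
        best = max(range(len(H)), key=lambda r: len(cover[r] & uncovered))
        gain = cover[best] & uncovered
        if not gain:          # no probe helps: some pair can never be covered
            return None
        chosen.append(best)
        uncovered -= gain
    return chosen

# Hypothetical 4-probe, 3-target example.
demo = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
chosen = greedy_d_disjunct_rows(demo, 1)
```

Here the greedy choice keeps the three unit rows and skips the fourth probe, which never covers more uncovered pairs than the others.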

Pool Size is 2

Consider the case where each probe can hybridize with exactly 2 targets.

Min-1-separable submatrix is also called the minimum test cover.

The minimum test cover is APX-complete.

In this case, Min-d-DS is actually polynomial-time solvable.

Lemma

Consider a collection C of pools of size at most 2. Let G be the graph with all items as vertices and all pools of size 2 as edges. Then C gives a d-disjunct matrix if and only if every item not in a singleton pool has degree at least d+1 in G.
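The lemma turns the d-disjunct test into a degree count. The sketch below assumes pools are given as Python sets of item labels; the function name and the example pools are hypothetical.

```python
from collections import Counter

def is_d_disjunct_by_degree(items, pools, d):
    """Apply the lemma to a collection of pools of size at most 2."""
    distinct = {frozenset(p) for p in pools}          # ignore repeated pools
    singletons = {next(iter(p)) for p in distinct if len(p) == 1}
    degree = Counter()
    for p in distinct:
        if len(p) == 2:                               # a size-2 pool is an edge
            for v in p:
                degree[v] += 1
    return all(v in singletons or degree[v] >= d + 1 for v in items)

# Hypothetical example: a triangle on items 1, 2, 3 plus the edge {2, 4}.
items = [1, 2, 3, 4]
pools = [{2, 3}, {1, 2}, {1, 3}, {2, 4}]
without_singleton = is_d_disjunct_by_degree(items, pools, 1)
with_singleton = is_d_disjunct_by_degree(items, pools + [{4}], 1)
```

Item 4 has degree 1, so the collection is not 1-disjunct until the singleton pool {4} is added.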

Proof

Suppose there exists an item a0 that is not in any singleton pool of C and whose degree in G is at most d. Let (a0, a1), (a0, a2), ..., (a0, ak), with k ≤ d, be all the edges of G at a0. Then the column with label a0 is contained in the union of the columns with labels a1, a2, ..., ak. Therefore, C does not form a d-disjunct matrix.

Example matrix from the slide:

0 1 1 0
1 1 0 0
1 0 1 0
0 1 0 1

Conversely, if no such item exists, then every item is either in a singleton pool or of degree at least d+1. In the former case, the singleton pool does not contain any other item, and in the latter case, for any other items a1, a2, ..., ad, there is a pool of size 2 that contains a0 and does not contain any of a1, a2, ..., ad. Hence, C forms a d-disjunct matrix.

Theorem

Min-d-DS is polynomial-time solvable in the case that all given pools have size exactly 2.

Proof: Given a graph G representing M, finding a minimum d-DS is equivalent to finding a subgraph H with the minimum number of edges such that every vertex has degree at least d + 1.

This is equivalent to maximizing the number of edges in G - H such that every vertex v has degree at most dG(v) - d - 1 in G - H, where dG(v) is the degree of v in G. The latter is a degree-constrained subgraph (b-matching) problem, which is solvable in polynomial time.
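The equivalence used in the proof can be written out as a short calculation. The symbol F for the complementary edge set G - H is notation introduced here for illustration only.

```latex
% F := G - H is the set of edges of G that are not kept in H.
\begin{align*}
\deg_H(v) \ge d+1
  &\iff \deg_F(v) = d_G(v) - \deg_H(v) \le d_G(v) - d - 1
  && \text{for every vertex } v, \\
|E(H)| &= |E(G)| - |E(F)|
  && \text{so minimizing } |E(H)| \text{ is the same as maximizing } |E(F)|.
\end{align*}
```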

Complexity

Min-1-DS is NP-hard in the case that all given pools have size at most 2.

Proof: Reduce Vertex-Cover.

Min-d-DS is MAX SNP-complete in the case that all given pools have size at most 2 for d ≥ 2.

Proof: Reduce VC-CUBIC: given a cubic graph G, find the minimum vertex-cover of G.

Approximation

Consists of 2 steps:

Step 1: Compute a minimum solution H of the polynomial-time solvable problem mentioned earlier.

Step 2: Choose all singleton pools at vertices with degree less than d + 1 in H.

Approximation Ratio Analysis

Suppose all given pools have size at most 2. Let s be the number of given singleton pools. Then any feasible solution of Min-d-DS contains at least s + (n-s)(d+1)/2 pools.

Proof

Suppose C is a feasible solution of Min-d-DS. By Lemma 1, every item is either in a singleton pool or in at least d+1 pools of size 2. Suppose C contains s singleton pools. Then C contains at least s + (n-s)(d+1)/2 pools.
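The counting behind this bound can be illustrated numerically. The helper below is a sketch with a hypothetical name; rounding the size-2 term up to an integer is an added assumption, justified only because the number of pools is a whole number.

```python
def min_pools_lower_bound(n, s, d):
    """Lower bound s + (n - s)(d + 1)/2 on the pools in a feasible solution."""
    # Each of the n - s items outside a singleton pool needs at least d + 1
    # pools of size 2, and each such pool is shared by exactly two items.
    size2_pools = ((n - s) * (d + 1) + 1) // 2   # ceiling of (n - s)(d + 1) / 2
    return s + size2_pools

# Example: 10 items, 2 singleton pools, d = 2 gives 2 + 8 * 3 / 2 = 14 pools.
bound = min_pools_lower_bound(10, 2, 2)
```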

Theorem

The feasible solution obtained in the above algorithm is a polynomial-time approximation with performance ratio 1+2/(d+1).

Proof

Suppose H contains m edges and k vertices of degree at least d+1.

Suppose an optimal solution contains s* singletons and m* pools of size 2.

Then m ≤ m* and (n-k)-s* ≤ 2m*/(d+1). Therefore

(n-k)+m ≤ s*+m*+2m*/(d+1)
        ≤ (s*+m*)(1+2/(d+1)).

More Challenging

Experimental Errors

False negative: The pool (probe) contains some positive targets but returns a negative outcome.

False positive: The pool contains only negative targets but returns a positive outcome.

An e-Error Correcting Model

Assume that there are at most e errors in testing.

(d,e)-disjunct matrix: for any column tj, tj must have at least e + 1 1-entries not contained in the union of any other d columns.

Theorem: A (d+e)-disjunct matrix without any isolated column is a (d,e)-disjunct matrix.
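The (d,e)-disjunct condition can be checked by brute force on small matrices. This is a sketch with a hypothetical function name; rows are pools, columns are items, and the example matrices are made up for illustration.

```python
from itertools import combinations

def is_de_disjunct(H, d, e):
    """True if every column keeps at least e + 1 ones outside any d other columns."""
    n = len(H[0])
    cols = [{i for i, row in enumerate(H) if row[j]} for j in range(n)]
    for j in range(n):
        others = [k for k in range(n) if k != j]
        for group in combinations(others, min(d, len(others))):
            union = set().union(*(cols[k] for k in group))
            if len(cols[j] - union) < e + 1:   # too few private 1-entries
                return False
    return True

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
doubled = identity + identity      # every pool repeated twice
```

Repeating each pool of the identity matrix twice gives every column two private rows, which is enough to tolerate one error but not two.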

Decoding Algorithm with e Errors

Theorem: If the number of errors is at most e, then the number of negative pools containing a positive item is always smaller than the number of negative pools containing a negative item.

Algorithm: Assume there are exactly d positive items. Compute the number of negative results containing each item and select the d items with the smallest counts.

Time complexity: O(tn)
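A minimal sketch of this counting decoder follows, assuming the outcomes are given as a list of booleans, one per pool. The function name and the test matrix are hypothetical; the final sort adds O(n log n) on top of the O(tn) counting pass.

```python
def decode_exactly_d(H, outcomes, d):
    """Return the d items that appear in the fewest negative pools."""
    n = len(H[0])
    negative_count = [0] * n
    for row, positive in zip(H, outcomes):      # one pass over all t pools
        if not positive:
            for j, bit in enumerate(row):
                if bit:
                    negative_count[j] += 1
    ranked = sorted(range(n), key=lambda j: negative_count[j])
    return sorted(ranked[:d])

# Hypothetical (1,1)-disjunct design: each of 3 items tested alone, twice.
H = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 2
# Item 0 is the only positive; the first pool wrongly reads negative.
outcomes = [False, False, False, True, False, False]
decoded = decode_exactly_d(H, outcomes, 1)
```

Despite the false negative, item 0 still appears in fewer negative pools (one) than items 1 and 2 (two each), so it is recovered.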

In S(d,n) sample

What if the sample contains at most d positives (not exactly d)?

The previous theorem still holds; however, the decoding algorithm no longer works, because the number of items to select is not known.

If we still decode on a (d,e)-disjunct matrix, the time complexity is O((n + t)te), where t is the number of selected probes.

Decoding

Theorem: Suppose testing is done on a (d, 2e)-disjunct matrix H with at most e errors. Then a positive item appears in at most e negative results.

Proof. Since there are at most e errors, a positive item can appear in at most e negative results (all of them due to errors). A negative item, however, appears in at least 2e + 1 - e = e + 1 > e negative results. Hence an item is positive if and only if it appears in at most e negative results.

Decoding Algorithm

Algorithm: For each item, we just need to count the number of negative results containing it. If this number is at most e, then this item is positive.
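The threshold decoder can be sketched as follows. It uses the bound from the decoding theorem (a positive item appears in at most e negative results); the function name and the test design are hypothetical.

```python
def decode_with_threshold(H, outcomes, e):
    """Items appearing in at most e negative pools are declared positive."""
    n = len(H[0])
    negative_count = [0] * n
    for row, positive in zip(H, outcomes):      # single pass: O(tn)
        if not positive:
            for j, bit in enumerate(row):
                if bit:
                    negative_count[j] += 1
    return [j for j in range(n) if negative_count[j] <= e]

# Hypothetical (1,2)-disjunct design: each of 3 items tested alone, three times.
H = [[1, 0, 0], [0, 1, 0], [0, 0, 1]] * 3
# Item 0 is positive; its first pool wrongly reads negative (one error, e = 1).
outcomes = [False, False, False, True, False, False, True, False, False]
decoded = decode_with_threshold(H, outcomes, 1)
```

Unlike the select-the-d-smallest rule, this version does not need to know how many positives there are, so it also returns an empty list when the sample contains none.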

Linear Decoding.