computational molecular biology non-unique probe selection via group testing
Post on 02-Jan-2016
216 Views
Preview:
TRANSCRIPT
Computational Molecular Biology
Non-unique Probe Selection via Group Testing
My T. Thaimythai@cise.ufl.edu
2
My T. Thaimythai@cise.ufl.edu
3
My T. Thaimythai@cise.ufl.edu
4
My T. Thaimythai@cise.ufl.edu
5
DNA Microarrays
DNA Microarrays are small, solid supports onto which the sequences from thousands of different genes are immobilized, or attached, at fixed locations.
Contain a very large number of genes in a small size chip.
A tool for performing large numbers of DNA-RNA hybridization experiments in parallel.
My T. Thaimythai@cise.ufl.edu
6
Applications Quantitative analysis of expression levels of individual
genesThe comparison of cell samples from different tissues.Computational diagnostics.
Qualitative analysis of an unknown sample Identification of micro-bacterial organisms.Detection of contamination of biotechnological products. Identification of viral subtypes.
My T. Thaimythai@cise.ufl.edu
7
Unique Probes vs. Non-unique Probes
Unique probes Gene-specific probes or signature probes. Difficult to find such probes
Non-unique probes Hybridize to more than one target. Difficult to design the test based on non-unique
probes
My T. Thaimythai@cise.ufl.edu
8
Probe-Target Matrix 12 probe candidates. 4 targets (genes). For target set S, define P(S) as set of
probes reacting to any target in S. P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}. P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}. Symmetric set difference: P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}. Probes
that separate two sets.
My T. Thaimythai@cise.ufl.edu
9
Probe-Target Matrix
Non-unique probe for each sequence group
probe1(p1)probe2(p2)probe3(p3)
s1group1group2group3
s2s3s4s5s6s7s8s9
probe4(p4)probe5(p5)
group1 group2 group3
t1 t2 t3 t4 t5 t6 t7 t8 t9
p1 1 1 0 0 0 1 0 0 0
p2 0 0 0 0 1 1 1 0 0
p3 1 0 1 0 0 0 0 1 0
p4 0 1 1 1 0 0 0 0 0
p3 0 0 0 0 0 0 1 1 1
My T. Thaimythai@cise.ufl.edu
10
The Problem
Given a sample with m items and a set of n non-unique proble
Goal: determine the presence or absence of targets
My T. Thaimythai@cise.ufl.edu
11
The Approach
3 Steps: Pre-select suitable probe candidates and compute
the probe-target incidence matrix H. Select a minimal set of probes and compute a
suitable design matrix D (a sub-matrix of H). Decode the result.
My T. Thaimythai@cise.ufl.edu
12
An Example
Matrix H sub-matrix D
My T. Thaimythai@cise.ufl.edu
13
Observation
What is the property of a sub-matrix D? If we want to identify at most d targets without
testing errors, D should be either d-separable matrix or d-disjunct matrix
With at most e errors, the matrix should have the e-error correcting. That is, the Hamming distance between any two unions must be at least 2e + 1
My T. Thaimythai@cise.ufl.edu
14
Non-unique probe selection via group testing
Given a matrix H Find a submatrix D such that D is d-separable
or d-disjunct with the minimum number of rows
(We can easily extend this definition to the error correcting model)
My T. Thaimythai@cise.ufl.edu
15
Min-d-DS
Minimum-d-Disjunct Submatrix: Given a binary matrix M, find a submatrix H with
minimum number of rows and the same number of columns such that H is d-disjunct
My T. Thaimythai@cise.ufl.edu
16
Complexity
Theorem: Min-d-DS is NP-hard for any fixed d ≥ 1 Proof: Reduce the 3 dimensional matching into it
Y ZX
a b c
0
1 1
1 1
1 1
1
Min-1-DS is NP-hard
My T. Thaimythai@cise.ufl.edu
17
My T. Thaimythai@cise.ufl.edu
18
My T. Thaimythai@cise.ufl.edu
19
Complexity
My T. Thaimythai@cise.ufl.edu
20
My T. Thaimythai@cise.ufl.edu
21
My T. Thaimythai@cise.ufl.edu
22
Approximation
Pair: (c0, <c1…cd>)
Cover: a probe is said to cover a pair (c0, <c1…cd>) if the incident entry at c0 is 1 where the rest is 0.
Greedy approach: While all pairs not covered yet, at each iteration:
Choose a probe that can cover the most un-covered pairs
Approximation ratio: 1 + (d+1)ln n If NP is not contained by DTIME(n^{log log n}), then
no approximation has performance (1-ε)ln n for any ε > 0.
My T. Thaimythai@cise.ufl.edu
23
Pool Size is 2
Consider a case when each probe can hybridize with exactly 2 targets
Min-1-separable submatrix is also called the minimum test cover
The minimum test cover is APX-complete
The Min-d-DS is really polynomial-time solvable.
My T. Thaimythai@cise.ufl.edu
24
Lemma
Consider a collection C of pools of size at most 2. Let G be the graph with all items as vertices and all pools of size 2 as edges. Then
C gives a d-disjunct matrix if and only if every item not in a singleton pool has degree at least d+1 in G.
My T. Thaimythai@cise.ufl.edu
25
Proof
matrix.disjunct - a formnot does Therefore,
. labels with columns ofunion in the
contained is labelh column witThen .at of
edges all be )( )()()(
Let .most at is in degree its and of pool
singletonany in not iteman exists thereSuppose
21
00
02010
0
dC
, a, , aa
aaG
dk ,aa, , , aa, ,aa
dGC
a
k
k
0 1 1 0
1 1 0 0
1 0 1 0
0 1 0 1
My T. Thaimythai@cise.ufl.edu
26
matrix.disjunct - a form Hence,
. of anyonecontain not does and
contains 2 size of pool a is there,
itemsother any for case,latter in the and
item,other any contain not does poolsingleton the
case,former In the .1least at degree
ofor poolsingleton ain either is itemevery
then exists, iteman such no if ,Conversely
1 0
21
0
dC
, a, aa
, a, , aa
d
d
a
d
d
My T. Thaimythai@cise.ufl.edu
27
Theorem
Min-d-DS is polynomial-time solvable in the case that all given pools have size exactly 2
Proof: Given a graph G representing Mthen finding a minimum d-DS is equivalent to
finding a subgraph H with minimum number of edges such that all the vertices has a degree at least d + 1
Equivalent to maximize number of edges in G – H such that every vertex v has the degree at most dG(v) – d -1 in G - H
My T. Thaimythai@cise.ufl.edu
28
Complexity
Min-1-DS is NP-hard in the case that all given pools have size at most 2.
Proof: Reduce Vertex-Cover
Min-d-DS is MAX SNP-complete in the case that all given pools have size at most 2 for d ≥ 2
Proof:Reduce VC-CUBICGiven a cubic graph G, find the minimum vertex-cover of
G
My T. Thaimythai@cise.ufl.edu
29
Approximation
Consist of 2 steps:Step 1
Compute a minimum solution of the polynomial-time solvable problem as mentioned
Step 2Choose all singleton pools at vertices with degree
less than d+ 1 in H
My T. Thaimythai@cise.ufl.edu
30
Approximation Ratio Analysis
Suppose all given pools have size at most 2. Let s be the number of given singleton pools. Then any feasible solution of Min-d-DS contains at least s+ (n-s)(d+1)/2 pools.
My T. Thaimythai@cise.ufl.edu
31
Proof
pools. 21least at
contains Then pools.singleton contains
Suppose 2. size of pools 1least at in or pool
singleton ain either is itemevery 1, LemmaBy
DS.--Min ofsolution feasible a is Suppose
)/(n-s)(ds
Cs C
d
dC
My T. Thaimythai@cise.ufl.edu
32
Theorem
The feasible solution obtained in the above algorithm is a polynomial-time approximation with performance ratio 1+2/(d+1).
My T. Thaimythai@cise.ufl.edu
33
Proof
Suppose H contains m edges and k vertices of degree at least d+1.
Suppose an optimal solution containing s* singletons and m* pools of size 2.
Then m < m* and (n-k)-s*< 2m*/(d+1).
(n-k)+m < s*+m*+ 2m*/(d+1)
< (s*+m*)(1+2/(d+1)).
My T. Thaimythai@cise.ufl.edu
34
More Challenging
Experimental Errors False negative:
Pool (probes) contains some positive targets But return the negative outcome
False positive: Pool contains all negative targets But return the positive outcome
My T. Thaimythai@cise.ufl.edu
35
An e-Error Correcting Model
Assume that there is at most e errors in testing (d,e)-disjunct matrix: for any column tj , tj must
have at least e + 1 1-entries not contained in the union of other d columns.
Theorem: (d+e)-disjunct matrix without any isolated column is (d,e)-disjunct matrix
My T. Thaimythai@cise.ufl.edu
36
Decoding Algorithm with e Errors
Theorem: If the number of errors is at most e, then the number of negative pools containing a positive item is always smaller than the number of negative pools containing a negative item
Algorithm: Assume there is exactly d positive ones compute the number of negative results containing each item
and select d smallest ones.
Time complexity: O(tn)
My T. Thaimythai@cise.ufl.edu
37
In S(d,n) sample
What if the sample contain at most d positive (not exactly d positive)
The previous theorem holds, however, the decoding algorithm will not work
If still decoding on (d,e)-disjunct, the time complexity is O((n + t)te) where t is the number of selected probes
My T. Thaimythai@cise.ufl.edu
38
Decoding
Theorem: Suppose testing done on a (d, 2e)-disjunct matrix H with at most e errors, a positive item will appear in at most e negative results.
Proof. Since there are at most e errors, a target can appear in at most e negative results (due to errors). However, a negative item appears in at least 2e +1- e = e +1 > e negative results. It implies that a positive item appears in at most e negative results.
My T. Thaimythai@cise.ufl.edu
39
Decoding Algorithm
Algorithm: For each item, we just need to count the number of negative results containing it. If this number is less than e, then this item is positive.
Linear Decoding.
top related