Efficient Probe Selection in Micro-array Design

Speaker: Cindy Y. Li. Joint work with: Leszek Gąsieniec, Paul Sant, and Prudence Wong. Special thanks go to: David Peleg. Algorithmics Group, Dept. of Computer Science, University of Liverpool. http://www.csc.liv.ac.uk/~cindy

Page 1: Efficient Probe Selection in Micro-array Design

http://www.csc.liv.ac.uk/~cindy

Efficient Probe Selection in Micro-array Design

Speaker: Cindy Y. Li

Joint work with: Leszek Gąsieniec, Paul Sant, and Prudence Wong

Special thanks go to: David Peleg

Algorithmics Group, Dept. of Computer Science, University of Liverpool

Page 2: Efficient Probe Selection in Micro-array Design

Talk Overview

Background: Microarrays & Hybridization

Problem Statement

Our Approach

Experimental Work

Conclusion

Page 3: Efficient Probe Selection in Micro-array Design

Hybridization Process

DNA 5’... TGTGCTTGACAACATAGTTG... 3’

|| | |

Short DNA Fragments 3’-CTACGGACCGAT-5’

A single-stranded DNA probe (middle panel) is linked to an enzyme and allowed to base pair (hybridize) with the mRNA. After a series of washes, only fragments that are hybridized with the target mRNA remain.

Page 4: Efficient Probe Selection in Micro-array Design

Tool: DNA Microarrays

Labeled DNA/RNA mixture flushed over array of short DNA fragments

Laser activation of fluorescent labels

Page 5: Efficient Probe Selection in Micro-array Design

Talk Overview

Background: Microarrays & Hybridization

Problem Statement

Our Approach

Experimental Work

Conclusion

Page 6: Efficient Probe Selection in Micro-array Design

Probe concept

A probe is a substring of a gene, which acts as its fingerprint (a.k.a., signature)

Probes are relatively short DNA sequences. Usually, a probe is ~ 20-25 base pairs long.

For example:

DNA ...TGTGCTTGGCAACATAGATAGATGC...

Probe TGCTTGGCAACATAGATAGA
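As a quick check of the example above, the probe can be located inside the DNA fragment with a plain substring search. This is a minimal Python sketch; the variable names are illustrative and do not come from the talk.

```python
# Illustrative only: both strings are copied from the slide example.
dna = "TGTGCTTGGCAACATAGATAGATGC"
probe = "TGCTTGGCAACATAGATAGA"

# str.find returns the 0-based start index of the first occurrence, or -1.
position = dna.find(probe)
is_substring = position != -1
probe_length = len(probe)
```

Here the probe is a 20-base substring of the fragment, which is at the lower end of the ~20-25 range mentioned above.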

Page 7: Efficient Probe Selection in Micro-array Design

Finding unique probes

[Figure: probes P1, P2, P3, P4, P5 and genes G1, G2, G3, G4]

We are interested in finding a single (or a small group of) unique probe(s) for each gene

The search process should be both time and space efficient

Page 8: Efficient Probe Selection in Micro-array Design

Finding unique probes

Given a database S of gene sequences

For each sequence g in S, try to find a single probe P which hybridizes only with g

If P cross-hybridizes with some other sequences in S (i.e., P has a close occurrence in S), then find a small set of probes that uniquely identifies g.

Sometimes multiple probes are required due to the error-prone wet lab environment

Page 9: Efficient Probe Selection in Micro-array Design

The use of probes

The uniqueness of probes allows us to identify the genes taking part in the experiment in the wet lab

I.e., seeing the trace (green color) of a number of probes on the microarray we can identify precisely which genes were involved in the experiment

Page 10: Efficient Probe Selection in Micro-array Design

Finding Unique Probes - Performance Measure

Each gene in the database S should be uniquely identified by the smallest possible number of probes

The search for probes should be time/space efficient

The time of the search for probes should be “fairly” independent of the length of the probes

All probes should be far (Hamming distance) from each other

Probes should satisfy some extra (e.g., related to hybridization process) conditions

Naive approach: Scan through the whole length-n genome for every length-m probe and determine whether the Hamming distance is big enough, which takes O(mn^2) time. For example, this takes 72 hours for the S. pombe genome of length 7.1 x 10^6 bps, and is thus impractical for large genomes.
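The quadratic cost of the naive approach is easy to see in code. The following is a minimal Python sketch, not the speaker's implementation: the function names and the threshold parameter `d` are assumptions, and the genome is treated as one string with every other window (including overlapping ones) as a potential cross-hybridization site.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def naive_unique_windows(genome: str, m: int, d: int) -> list:
    """Start positions of length-m windows whose Hamming distance to every
    other length-m window exceeds d (hypothetical helper).

    There are about n windows, each compared with about n others at a cost
    of m per comparison, which gives the O(m * n^2) bound."""
    n = len(genome)
    starts = range(n - m + 1)
    unique = []
    for i in starts:
        candidate = genome[i:i + m]
        if all(hamming(candidate, genome[j:j + m]) > d
               for j in starts if j != i):
            unique.append(i)
    return unique
```

For example, in `"ACGTACGT"` with `m = 4` and `d = 0`, the window `ACGT` occurs twice and is rejected, while the three windows in between are kept.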

Page 11: Efficient Probe Selection in Micro-array Design

Previous Work – Approaches based on

Suffix array and fast pattern matching [Li F. and Stormo G., 2001]

BLAST to avoid cross-hybridization [Rouillard J. M., Herbert C. J. and Zuker M., 2002]

Longest common substrings [Rahmann S. 2002]

Various filtering techniques [Lockhart DJ et al, 1996]

Methods based on pigeon hole principle [Lee W. H. and Sung W. K., 2003]

etc

Page 12: Efficient Probe Selection in Micro-array Design

Previous Work – The probe selection criteria

No single base exceeds 50% of the probe size

The length of any contiguous As and Ts or Cs and Gs is less than 25% of the probe size

(G+C)% is between 40% and 60% of the probe

Sensitivity - No self-complementarity within the probe sequence

Homogeneity - Melting Temperature not being too low or too high

Specificity – probes are unique to each gene
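The first three criteria are simple composition checks and can be sketched directly. This Python sketch is an illustration under assumptions: the function name is hypothetical, the "contiguous As and Ts or Cs and Gs" rule is read as "the longest run made up only of A/T, or only of C/G", and the sensitivity, homogeneity and specificity criteria are not covered.

```python
from itertools import groupby


def passes_basic_criteria(probe: str) -> bool:
    """Composition checks only; assumes an upper-case string over A, C, G, T."""
    m = len(probe)
    if m == 0:
        return False
    # No single base exceeds 50% of the probe size.
    if max(probe.count(base) for base in "ACGT") > 0.5 * m:
        return False
    # Longest run drawn only from {A, T} or only from {C, G} must be
    # shorter than 25% of the probe size (one reading of the slide).
    classes = ("AT" if base in "AT" else "CG" for base in probe)
    longest_run = max(len(list(run)) for _, run in groupby(classes))
    if longest_run >= 0.25 * m:
        return False
    # (G+C)% must lie between 40% and 60%.
    gc_fraction = (probe.count("G") + probe.count("C")) / m
    return 0.40 <= gc_fraction <= 0.60
```

A probe such as `AATTAATTCCGGCCGGACGT` has balanced base counts but fails the run-length rule, because its first eight bases are all A or T.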

Page 13: Efficient Probe Selection in Micro-array Design

Previous Work – Test data

Test data

Genome Name        Genome Length    Number of Genes
E. coli            4,752,411        5,253
S. pombe           13,149,651       521
Yeast              8,783,280        5,888

Page 14: Efficient Probe Selection in Micro-array Design

Previous Work – Test data

[Figure: histogram for Yeast; x-axis: gene length, in bins <1K, [1K, 2K), [2K, 3K), [3K, 4K), [4K, 5K), [5K, 6K), [6K, 7K), >= 7K; y-axis: # of genes, ticks from 0 to 2500]

Total length: 8,783,280

Total # of genes: 5,888

Page 15: Efficient Probe Selection in Micro-array Design

Previous Work

Methods compared: Li and Stormo (BIBE, 2000); Rouillard et al. (Bioinformatics, 2002); Rahmann (WABI, 2002); Lee and Sung (CSB, 2003)

Genome               Reported probe length and time
E. coli              23-bps, 1.5 days; 50-bps, 31 minutes
Yeast                24-bps, 4 days; 50-bps, 1 day; 50-bps, 49 minutes
Neurospora crassa    25-bps, 4 hours; 50-bps, 3.5 hours

# of probes: Top 10; All probes; Top 50; All probes

Page 16: Efficient Probe Selection in Micro-array Design

Talk Overview

Background: Microarrays & Hybridization

Problem Statement

Our new alternative approach

- main observations

- the algorithm

Experimental work

Conclusion

Page 17: Efficient Probe Selection in Micro-array Design

Main Observations

In general, randomness helps!

80% of “randomly” (based on our algorithm) chosen candidates for probes satisfy the probe selection criteria related to the hybridization process

[this suggests that random sequences are more likely to hybridize properly]

The expected Hamming distance between two randomly chosen sequences of length n over a 4-letter alphabet is ~ 3n/4.

[this suggests that randomly chosen probes will be far from each other]
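The 3n/4 figure follows from linearity of expectation, assuming each position of both sequences is drawn independently and uniformly from the four letters (an idealization; real genomic sequence is not uniform). Writing X and Y for the two random sequences:

```latex
\[
\mathbb{E}\bigl[H(X, Y)\bigr]
  = \sum_{i=1}^{n} \Pr[X_i \neq Y_i]
  = \sum_{i=1}^{n} \left(1 - \tfrac{1}{4}\right)
  = \frac{3n}{4}.
\]
```

For n = 20, for example, two random sequences are expected to differ in 15 positions.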

Page 18: Efficient Probe Selection in Micro-array Design

An interesting observation

In general, fragments of DNA sequences representing genes are more deterministic (contain more organized information) compared to the rest of the sequence.

On the contrary, the best probes (signatures) representing genes are very likely to be random or almost random!

Page 19: Efficient Probe Selection in Micro-array Design

The Algorithm

(*) For every gene g in the database S:

a) generate a random base-pair sequence of length m

b) find the closest (in Hamming distance) length-m substring P in gene g

c) check P for good probe criteria [80% pass this test]. If P does not pass the criteria, go to a)

d) cross-hybridization checking for P [98% pass this test]. For every length-m substring Q in other sequences S-{g}:

If H(P,Q) > d for every such Q, P is chosen as the probe for g, go to (*)

Otherwise, P can possibly cross-hybridize and we must generate another length-m random substring P', go to a)
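Steps a) to d) can be sketched as a short brute-force loop. This is a minimal Python illustration under assumptions, not the speaker's implementation: the function names, the `max_tries` cap, the `seed` parameter and the pluggable `passes_criteria` callback are all hypothetical, and the exhaustive scans below do not reflect whatever data structures make the actual method time and space efficient.

```python
import random


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def windows(seq, m):
    """All length-m substrings of seq."""
    return [seq[i:i + m] for i in range(len(seq) - m + 1)]


def select_probe(g, others, m, d, passes_criteria, rng, max_tries=1000):
    """Try to pick one probe for gene g; None if max_tries is exhausted."""
    candidates = windows(g, m)
    if not candidates:
        return None  # gene is shorter than the probe length
    for _ in range(max_tries):
        # a) random sequence of length m
        r = "".join(rng.choice("ACGT") for _ in range(m))
        # b) closest length-m substring of g, by Hamming distance
        p = min(candidates, key=lambda w: hamming(r, w))
        # c) good-probe criteria
        if not passes_criteria(p):
            continue
        # d) P must be farther than d from every window of every other gene
        if all(hamming(p, q) > d for s in others for q in windows(s, m)):
            return p
    return None


def select_probes(genes, m, d, passes_criteria=lambda p: True, seed=0):
    """genes maps a gene name to its sequence; returns name -> probe or None."""
    rng = random.Random(seed)
    return {
        name: select_probe(
            g, [s for other, s in genes.items() if other != name],
            m, d, passes_criteria, rng)
        for name, g in genes.items()
    }
```

With two clearly different toy genes, `select_probes({"g1": "AAAAAAAA", "g2": "CCCCCCCC"}, m=4, d=2)` returns one probe per gene; with two identical genes every candidate cross-hybridizes and both entries come back as `None`, mirroring the "duplicated / very similar" genes that end up without probes.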

Page 20: Efficient Probe Selection in Micro-array Design

The algorithm

(*) For every gene g in the database S:

a) generate a random base-pair sequence of length m

b) find the closest length-m substring P in gene g

c) check P for good probe criteria; if P does not pass the criteria, go to a)

[Figure: random sequence R; gene g containing substring P]

Page 21: Efficient Probe Selection in Micro-array Design

The algorithm

d) Check P for cross-hybridization. For every length-m substring Q in other sequences (S - {g}):

If H(P,Q) > d, P is chosen as the probe for g, go to (*);

Otherwise, P can possibly cross-hybridize and we must generate another length-m random substring, go to a)

[Figure: P in gene g compared against the background sequences g1, g2, …, gi. P is far from g1 (√); P is far from g2 (√); a substring Q of gi has H(P,Q) < d (X), so another length-m random substring is generated]

Page 22: Efficient Probe Selection in Micro-array Design

Talk Overview

Background: Microarrays & Hybridization

Problem Statement

Algorithm

Experimental Work

Conclusion

Page 23: Efficient Probe Selection in Micro-array Design

Experimental Work

For Yeast:

1.80% of genes have no probes (duplicated / very similar / too short)

Apart from that:

98.0% of genes need only one probe

1.5% of genes need two probes

0.5% of genes need three probes

Similar results for the E. coli genome

Page 24: Efficient Probe Selection in Micro-array Design

Talk Overview

Background: Microarrays & Hybridization

Problem Statement

Algorithm

Experimental Work

Conclusion

Page 25: Efficient Probe Selection in Micro-array Design

Conclusion

Almost all (98%) genes can be uniquely identified by a single probe; the others need at most three probes

Our method is:

Suitable for large-scale probe design

Fairly independent of the length of probes

Both time and space efficient

Useful in the design of fault-tolerant systems of probes

Page 26: Efficient Probe Selection in Micro-array Design

Ongoing Work

Distinguish multiple targets in a sample

[Figure: genes g1, g2, g3 with probes P1, P2, P1’, P2’]

Page 27: Efficient Probe Selection in Micro-array Design

Questions

Page 28: Efficient Probe Selection in Micro-array Design

Thank You!

Presented By Cindy Y. Li

Page 29: Efficient Probe Selection in Micro-array Design

Self-complementarity

Probe 5' TTTCAGTAATAAAAGATTTCTGT 3'

||||

Probe 3' TGTCTTTAGAAAAATTAGACTTT 5'
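One simple way to quantify self-complementarity is to look for the longest stretch of the probe whose reverse complement also occurs in the probe, since such a stretch can base pair with another copy of the probe (or with the probe itself). The Python sketch below is an assumption about how this could be measured, not the test used in the talk; the function names are hypothetical, and the slides do not state which stretch length counts as too long.

```python
# Watson-Crick complement table for upper-case DNA.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_self_complementary_stretch(probe: str) -> int:
    """Length of the longest substring of probe whose reverse complement
    also occurs in probe. Equivalent to the longest common substring of
    the probe and its reverse complement."""
    rc = reverse_complement(probe)
    n = len(probe)
    best = 0
    for i in range(n):
        # Only try lengths that would improve on the current best.
        j = i + best + 1
        while j <= n and probe[i:j] in rc:
            best = j - i
            j += 1
    return best
```

A probe could then be rejected when this value exceeds a chosen cutoff; hairpin loop constraints and mismatches are ignored in this simplification.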

Page 30: Efficient Probe Selection in Micro-array Design

Melting Temperature

TM can be used as a parameter to evaluate probe hybridization behavior

TM is calculated for each probe as (SantaLucia et al., 1996)

[equation not preserved in this transcript]

ΔH is the sum of the nearest neighbor enthalpy changes

ΔS is the sum of the nearest neighbor entropy changes

R is the Gas Constant (1.987 cal deg^-1 mol^-1)

CT is the total molar concentration of strands ( )
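The equation image on this slide did not survive extraction. For orientation only, the nearest-neighbor expression commonly quoted for non-self-complementary strands, using the four quantities defined above, is shown below. Whether the slide used exactly this variant (the factor of 4, the conversion to degrees Celsius, or an additional salt correction) is an assumption.

```latex
\[
T_M = \frac{\Delta H}{\Delta S + R \,\ln\!\left(C_T / 4\right)} - 273.15
\]
```

Here ΔH is taken in cal/mol and ΔS in cal/(deg·mol), so that the fraction is in kelvin and the subtraction of 273.15 converts it to degrees Celsius.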

Page 31: Efficient Probe Selection in Micro-array Design

Melting Temperature: thermodynamic stability (nearest neighbour)

       A       T       C       G
A    -1.2    -0.9    -1.5    -1.5
T    -0.9    -1.2    -1.5    -1.7
C    -1.7    -1.5    -2.1    -2.8
G    -1.5    -1.5    -2.3    -2.1

TTTCAGTAATTAAAAAGATTTCTGT

-1.2 -1.5-1.7 kcal/mol
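The table lookup illustrated on this slide can be sketched as a sum over adjacent base pairs. This Python sketch assumes that the row of the table is the first base and the column the second base of each step; that reading is consistent with the values -1.2, -1.5, -1.7 shown for the steps TT, TC, CA at the start of the example sequence, but the orientation is not stated on the slide. The function names are hypothetical.

```python
# Nearest-neighbour values in kcal/mol, copied from the slide's table.
# Assumption: outer key = first base of the step, inner key = second base.
NN_ENERGY = {
    "A": {"A": -1.2, "T": -0.9, "C": -1.5, "G": -1.5},
    "T": {"A": -0.9, "T": -1.2, "C": -1.5, "G": -1.7},
    "C": {"A": -1.7, "T": -1.5, "C": -2.1, "G": -2.8},
    "G": {"A": -1.5, "T": -1.5, "C": -2.3, "G": -2.1},
}


def stacking_energies(seq: str) -> list:
    """One table value per pair of adjacent bases, left to right."""
    return [NN_ENERGY[a][b] for a, b in zip(seq, seq[1:])]


def total_stacking_energy(seq: str) -> float:
    """Sum of the nearest-neighbour values over the whole sequence."""
    return sum(stacking_energies(seq))
```

Calling `stacking_energies` on the slide's probe gives one value per adjacent pair, and `total_stacking_energy` adds them up; initiation and end corrections, if any were used in the talk, are not included.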