ADVANCED COMPUTATIONAL BIOLOGY PROJECT PRESENTATION Team Members: Joshua Wu 11174269 Shuyu (Christine) Xu 11161640

Upload: andrew-waters

Post on 23-Dec-2015

Page 1: Team Members: Joshua Wu 11174269 Shuyu (Christine) Xu 11161640

ADVANCED COMPUTATIONAL BIOLOGY

PROJECT PRESENTATION

Team Members:

Joshua Wu 11174269

Shuyu (Christine) Xu 11161640

OVERVIEW

Project Description
Introduction
Motivation
Bioinformatics Application
Explicit vs Implicit
Problem Analysis
Implement Files
Experimental Results
Conclusion
Possible Future Work

Project Description

Explicit suffix trees: suppose that we want to store explicitly all strings that are edge labels of a suffix tree.

The main question of this project is how much space explicit suffix trees require compared with implicit suffix trees, which store each edge label as a pair of positions in the string.

Implement a suffix tree algorithm and run it on substrings of real data.

Introduction

Any string of length m can be decomposed into m suffixes, and these suffixes can be stored in a suffix tree.

Construction time: O(m) (m is the length of the string)

Search time: O(n) (n is the length of the pattern)
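
As a minimal sketch of the idea on this slide (not the presenters' program, and not a linear-time construction), the suffixes of a string can be listed directly, and a pattern occurs in the string exactly when it is a prefix of some suffix. The function names `suffixes` and `occurs` are hypothetical and chosen only for illustration. This naive check does not reach the O(m) and O(n) bounds quoted above; a real suffix tree answers the same query by walking at most n characters down from the root.

```python
def suffixes(text):
    """Return the m suffixes of a string of length m, longest first."""
    return [text[i:] for i in range(len(text))]


def occurs(pattern, text):
    """Naive check: the pattern occurs if it is a prefix of some suffix."""
    return any(s.startswith(pattern) for s in suffixes(text))


print(suffixes("ABC$"))      # ['ABC$', 'BC$', 'C$', '$']
print(occurs("BC", "ABC$"))  # True
print(occurs("CB", "ABC$"))  # False
```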

Motivation

"Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]" -- Nat Goodman (Genome Technology).

Bioinformatics Application

1. multiple genome alignment (Michael Hohl et al., 2002)

2. selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002)

3. identification of sequence repeats (Kurtz and Schleiermacher, 1999)

Explicit vs Implicit

String: ABC$ (positions 1 2 3 4)

Explicit edge labels: ABC$, BC$, C$, $

Implicit edge labels: (1,4), (2,4), (3,4), (4,4)
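
The relationship between the two representations can be sketched as follows: an implicit label is a 1-based, inclusive pair of positions, and slicing the string with that pair recovers the explicit label. The helper name `explicit_label` is hypothetical, and counting one unit per character for the explicit form is an assumption made here for comparison only.

```python
def explicit_label(text, start, end):
    """Recover an explicit edge label from a 1-based, inclusive index pair."""
    return text[start - 1:end]


text = "ABC$"
implicit = [(1, 4), (2, 4), (3, 4), (4, 4)]
explicit = [explicit_label(text, s, e) for s, e in implicit]
print(explicit)  # ['ABC$', 'BC$', 'C$', '$']

# Assumed cost model: one unit per stored character or per stored number.
explicit_cost = sum(len(label) for label in explicit)
implicit_cost = 2 * len(implicit)
print(explicit_cost, implicit_cost)  # 10 8
```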

Problem Analysis

Best case for both explicit and implicit suffix trees: all characters are different

The best case is unlikely with DNA inputs, which use a total of 4 characters

Worst case: the same character throughout

Assumptions

In implicit trees, each number counts as one unit of storage, whatever its magnitude (for example, the number 10 counts as 1 unit).

The sequence contains only letters of the alphabet.

Example: all different characters (best case)

String: ABCD$ (positions 1 2 3 4 5)

Implicit edge labels: (1,5), (2,5), (3,5), (4,5), (5,5)

N: string length. N = 5, Memory = 10

Example

String: ABCABC$ (positions 1 2 3 4 5 6 7)

Edges from the root: (1,3), (2,3), (6,6), (7,7)

Edges below each of (1,3), (2,3) and (6,6): (4,7) and (7,7)

N: string length. N = 7, Memory = 20

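Example: all same character (worst case)

String: AAAA$ (positions 1 2 3 4 5)

Level 1 edges: (1,1), (5,5)
Level 2 edges: (2,2), (5,5)
Level 3 edges: (3,3), (5,5)
Level 4 edges: (4,5), (5,5)

N: string length. N = 5, 6, 7 gives Memory = 16, 20, 24

Memory = 4N - 4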
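
The memory figures in the three examples can be checked with a small sketch. It builds the tree naively by inserting every suffix into a compressed trie (quadratic time, so this is not the linear-time algorithm and not necessarily what the presenters' program does), then counts two numbers per edge for the implicit form. All names (`Node`, `build_suffix_tree`, `edge_labels`, `implicit_memory`, `explicit_memory`) are hypothetical, and charging one unit per character in `explicit_memory` is an assumption.

```python
class Node:
    """A suffix tree node; children maps a first character to an edge."""

    def __init__(self):
        # char -> (start, end, child), 0-based with end exclusive
        self.children = {}


def build_suffix_tree(text):
    """Insert every suffix into a compressed trie (naive, quadratic time)."""
    if text.count("$") != 1 or not text.endswith("$"):
        raise ValueError("text must end with a single '$' terminator")
    n = len(text)
    root = Node()
    for i in range(n):
        node, pos = root, i
        while True:
            first = text[pos]
            if first not in node.children:
                node.children[first] = (pos, n, Node())  # new leaf edge
                break
            start, end, child = node.children[first]
            k = 0
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:
                node, pos = child, pos + k  # whole edge matched, descend
                continue
            # Mismatch inside the edge: split it and hang a new leaf.
            mid = Node()
            mid.children[text[start + k]] = (start + k, end, child)
            mid.children[text[pos + k]] = (pos + k, n, Node())
            node.children[first] = (start, start + k, mid)
            break
    return root


def edge_labels(text):
    """Return every edge as a 1-based, inclusive (start, end) pair."""
    pairs, stack = [], [build_suffix_tree(text)]
    while stack:
        node = stack.pop()
        for start, end, child in node.children.values():
            pairs.append((start + 1, end))
            stack.append(child)
    return pairs


def implicit_memory(text):
    """Two numbers per edge, one unit each (the slides' assumption)."""
    return 2 * len(edge_labels(text))


def explicit_memory(text):
    """One unit per stored character (an assumption of this sketch)."""
    return sum(end - start + 1 for start, end in edge_labels(text))


for s in ["ABCD$", "ABCABC$", "AAAA$", "AAAAA$", "AAAAAA$"]:
    print(s, implicit_memory(s))  # 10, 20, 16, 20, 24
```

The index pair chosen for a repeated label can differ from the slide, for example (3,3) instead of (6,6) for the edge C in ABCABC$, because any occurrence of the label is a valid reference; the edge count and the memory total are unaffected.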

Program Input Data

DNA and protein sequences from all kinds of creatures:

Homo sapiens, Monkeys, Chickens, …

Sample input: Homo sapiens

cagctcctgagactgctggcatgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaagtggacctcagacatggctcagccataggacctgccacacaagcagccgtggacacaacgcccactaccacctcccacatggaaatgtatcctcaaaccgtttaatcaataa

Sample result

Sample input 2: plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Sample output:

Homo sapiens

Sample Input: Homo Sapiens

atgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaa

Comparisons: Homo Sapiens

Comparisons: Homo Sapiens

Monkey Virus

Sample Input: Monkey Virus

GGSCFKCGKKGHFAKNCHEHAHNNAEPKVPGLCPRCKRGKHWANECKSKTDNQGNPIPPH

Monkey Virus

Plants

Sample Input: Plants

EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG

Plants

Tobacco

Sample input: tobacco

SYSITTPSQFVFLSSAWADPIELINLCTNALGNQFQTQQARTVVQRQFSEVWKPSPQVTVRFPDSDFKVYRYNAVLDPLVTALLGAFDTRNRIIEVENQANPTTAETLDATRRVDDATVAIRSAINNLIVELIRGTGSYNRSSFESSSGLVWTSGPAT

Tobacco

Insects

Sample Input: Insects

DCLSGRYKGPCAVWDNETCRRVCKEEGRSSGHCSPSLKCWCEGC

Insects

Birds

Sample Input: Birds

IDTCRLPSDRGRCKASFERWYFNGRTCAKFIYGGCGGNGNKFPTQEACMKRCAKA

Birds

SARS

Sample Input: SARS

ALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEV

SARS

Fish

Sample Input: Fish

GHHHHHHLEDPSGGTPYIGSKISLISKAEIRYEGILYTIDTENSTVALAKVRSFGTEDRPTDRPIAPRDETFEYIIFRGSDIKDLTVCEPPKPIM

Fish

Chicken

Sample Input: Chicken

RVKRVWPLVIRTVIAGYNLYRAIKKK

Chicken

files

Code

Results

Conclusion

Explicit suffix trees require more space than implicit suffix trees on real data.

Data comparison: the worst case is DNA input (least variety of characters)

Result: implicit trees should be used to reduce storage use

[Chart: variety of string vs tree size. x-axis: # of alphabets (ticks 1 to 25); y-axis ticks 0 to 3000]

Conclusion

Application: it is easier to compare structures for implicit than for explicit suffix trees (number comparisons)

Saves space

Easy to implement

Further improvement?

Possible Future Work

Program speed is too slow

The interface of our program should be improved. (Matlab)

More variety of input

THANK YOU!