Sudhindra R. Gadagkar, Ph.D.

Sudhindra R. Gadagkar, Ph.D. Computational Biology University of Dayton


Post on 02-Aug-2015


Page 1: Sudhindra R. Gadagkar, Ph.D

Sudhindra R. Gadagkar, Ph.D.

Computational Biology

University of Dayton

Page 2: Sudhindra R. Gadagkar, Ph.D

Some background material…

• BS in Fisheries Science from University of Agricultural Sciences, Bangalore, India

• MS in Fisheries Science (Statistics)

Page 3: Sudhindra R. Gadagkar, Ph.D

Tilapia (Oreochromis niloticus)

Page 4: Sudhindra R. Gadagkar, Ph.D

Genetics of fish behavior

Page 5: Sudhindra R. Gadagkar, Ph.D

Ph.D. research (contd.)

• Complex behaviors are heritable (behaviors governed by genes)

• Behavior and growth rate are correlated at the genetic level

(the same gene(s) are responsible for both traits or they are closely linked)

Page 6: Sudhindra R. Gadagkar, Ph.D

Post-doctoral research in Bioinformatics at Arizona State University

Page 7: Sudhindra R. Gadagkar, Ph.D

What I do now

• There is information in DNA and this information is used by the body.

Source for image: www.nigms.nih.gov/.../ genetics/science.html

Page 8: Sudhindra R. Gadagkar, Ph.D

• DNA is an incredibly long strand, made up of four different molecules (called nucleotides), abbreviated as A, C, G and T.

• For example, the DNA from the longest human chromosome is 12 cm long!

Page 9: Sudhindra R. Gadagkar, Ph.D

• Each cell of the human body contains DNA.

• The total length of all this DNA is >3 billion nucleotides!

• That’s a large number!

Page 10: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• A DNA sequence can look like the following:

• ACTGTTTGAAATTGACCCAGCACTTCTCCCTCGCGCAGACAGAGAGCAGTGTAG

ACGGAGCCTTAATCGCTAGAGCGAATCCCGATGCCCCACCTTCCGTCGGTGCATAAGTCGCACGGCGTCTCCCCCCCGTATGTGGTCTTAGGTAACCGCCGCCGGGCGTAGGGTTCACGGTCGAGGATGAAGATGGCGATTCGTCACCTCGCCAACGGGAGGGACCTCATTCGATCGATCCGCAAGTCTTCGCGGGAGCTCGTCATGCGGAACGCAGGAGACAACACTCTGCGTCGGATGCGCGCCGTATCAGTCGGGTGAGGCACGCCTAGCGATTCGACCTTAATTCCCGGACGCGACGCGAGGAGTTGGGAGATTGCTGCCCAAACCGGTCCGCGCTACTTAGGCTGCCGGACCCTTCTCGCCCCACGGGTGGCGGTGGTAATAGAGTTGGCCCGCCCTCTATGTGTCGGAAAGGGGGAGCCGGGGGCCGTGAGGATGCCCACACTGTCGGCGAGACCATGCTATCGAGCCTCCCTGGGACCCTCGGGGACTTTAGTTCCCACTCGGTTGGGGATTCAGTAGCCACGAATCAGACCGCCCCGGGTGGGGGCTTCGTCGTCTTGTCTTTCCAGCCCCCCTCTACTCTTCCTACTACGCCCGTCTGTCGAGGGTGCCGAGCGCGCAGTGTGCTCCCAGCGGCTCGTGCCAGGTTAGGTAGCCATATGTATTTATCGGCTGAGGACCGCCCGCCGTGTACCGACGATTTTGTTATAATTCTAGAGATGGGCTGGCACTTACCTGCTAGGTTTCTTGTCTGCTATGACTCGTGCGAACAGTCTTACTCTTGGCACAGCCGCGATGGCGATGGTTTAGCGGTTCCCATGGGGGGAATCGCGCGACGGCACCCAGTTCTGTTTCGACCGGACCCTGCTTACTCCTGGCCGAGAGGCCTCATTCTCGTTCGAGTCGATCGCTTATGTTATCGCGCCATTGGGAGTGCTCTGACCAATTACCGACCCGGAGTGTG

Page 11: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• What if we try to write down the entire sequence (all 3 billion-plus nucleotides)?

• After all, now we do know what the entire human DNA sequence is.

Page 12: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• Let’s see…if we can fit 75 letters in each line and if there are 50 lines in a page, then a page will contain 3,750 nucleotides.

• That does sound like a lot (the earlier slide had 1024).

Page 13: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• A book that contains 100 pages can hold 100 x 3,750 = 375,000 nucleotides.

• That is a lot!

• How thick do you think a book of 100 pages might be?

• An inch maybe.

• We need to write down at least 3 billion letters.

Page 14: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• Therefore, we need (3,000,000,000)/375,000 = 8,000 books, each about an inch thick

• = 8,000 inches

• = 667 feet.
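The arithmetic on the last few slides can be checked with a short script. This is only a sketch that reuses the slides' own round numbers (75 letters per line, 50 lines per page, 100 pages per book, roughly one inch per book); the variable names are made up for illustration.

```python
# Round figures taken from the slides; the one-inch thickness is the slide's rough guess.
LETTERS_PER_LINE = 75
LINES_PER_PAGE = 50
PAGES_PER_BOOK = 100
INCHES_PER_BOOK = 1
GENOME_LENGTH = 3_000_000_000  # "at least 3 billion letters"

letters_per_page = LETTERS_PER_LINE * LINES_PER_PAGE    # nucleotides on one page
letters_per_book = letters_per_page * PAGES_PER_BOOK    # nucleotides in one book
books_needed = GENOME_LENGTH // letters_per_book        # books for one strand
stack_height_feet = books_needed * INCHES_PER_BOOK / 12  # 12 inches per foot

print(letters_per_page, letters_per_book, books_needed, round(stack_height_feet))
```

Changing any one assumption (say, thinner books or smaller print) rescales the stack proportionally, so the conclusion is about order of magnitude rather than an exact height.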

Page 16: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• ... is 555 feet!

• So imagine a stack of books taller than the Washington Monument crammed with letters – no spaces, no commas, no paragraphs.

Page 17: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• And we would have written down the data for one strand in one cell of one human being!

• We need to understand this data.

• Remember, there are no words, no punctuation, no “parts of speech” in this “text”.

• Yet, we have to make sense out of this information.

Page 18: Sudhindra R. Gadagkar, Ph.D

Another example

• This is the evolutionary tree of primates.

• There are 10 species here whose evolutionary relationship we are interested in.

Source for image: locus.umdnj.edu/nigms/ special/primate.html

Page 19: Sudhindra R. Gadagkar, Ph.D

How many possible trees?

• Do you know how many possible ways there are for drawing the evolutionary history (“tree”) for 10 species?

Formula: $\dfrac{(2n-3)!}{2^{\,n-2}\,(n-2)!}$

where n is the number of species
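To get a feel for how fast this count grows, here is a minimal sketch that evaluates (2n-3)! / (2^(n-2) (n-2)!) for a few values of n. It assumes the formula counts rooted, strictly bifurcating trees, which is the standard reading of this expression but is not stated on the slide; the function name `rooted_tree_count` is hypothetical.

```python
from math import factorial


def rooted_tree_count(n: int) -> int:
    """Evaluate (2n-3)! / (2^(n-2) * (n-2)!) for n >= 2 species.

    Assumption: the formula counts rooted, bifurcating trees.
    """
    if n < 2:
        raise ValueError("need at least two species")
    # The division is always exact, so integer division loses nothing.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


for n in (3, 4, 5, 10):
    print(n, rooted_tree_count(n))
```

For the 10 primate species this gives 34,459,425 candidate trees, and Python's arbitrary-precision integers let the same function be called for much larger n without overflow.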

Page 20: Sudhindra R. Gadagkar, Ph.D

How many trees!

[Figure: No. of Possible Trees (vertical axis, in powers of 10) plotted against No. of Sequences (horizontal axis, 0 to 400). Reference magnitudes marked on the plot: millions; billions; 5 x 10^11 stars in the Milky Way; 5 x 10^30 prokaryotes living today; 10^37 atoms in the bodies of all humans by year 2035; 10^79 atoms in the universe.]

How many trees represent the true relationship?

Page 21: Sudhindra R. Gadagkar, Ph.D

• And only one of them is the correct tree because evolution has happened only once.

• And we need to find it!

Page 22: Sudhindra R. Gadagkar, Ph.D

One final example

Page 23: Sudhindra R. Gadagkar, Ph.D

Pairwise Alignment – contd.

• Consider these two DNA sequences:

– AATCTATA

– AAGATA

• We want to compare them site by site, so we need to align them by introducing gaps.

• Gaps can be introduced in various places, and in various combinations, as shown next.

Page 24: Sudhindra R. Gadagkar, Ph.D

Pairwise Alignment – contd.

AATCTATA

AAG--ATA

AATCTATA

AA-G-ATA

AATCTATA

AA--GATA
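The three alignments above are only a few of the possibilities. As a rough sketch, the code below enumerates every way of padding the shorter sequence with two gaps and counts matching columns. Two simplifications are assumptions of this example, not of the slides: gaps are placed only in the shorter sequence, and an alignment is scored simply by its number of identical columns. The helper names are hypothetical.

```python
from itertools import combinations


def gapped_versions(seq: str, total_len: int):
    """Yield every way of padding seq with '-' up to total_len characters."""
    n_gaps = total_len - len(seq)
    for gap_positions in combinations(range(total_len), n_gaps):
        gaps = set(gap_positions)
        letters = iter(seq)
        # Walk the columns left to right, emitting a gap or the next letter.
        yield "".join("-" if i in gaps else next(letters) for i in range(total_len))


def matches(a: str, b: str) -> int:
    """Count aligned columns where both sequences carry the same letter."""
    return sum(x == y and x != "-" for x, y in zip(a, b))


long_seq, short_seq = "AATCTATA", "AAGATA"
candidates = list(gapped_versions(short_seq, len(long_seq)))
best_score = max(matches(long_seq, c) for c in candidates)
best = [c for c in candidates if matches(long_seq, c) == best_score]

print(len(candidates), "candidate alignments; best match count:", best_score)
for c in best:
    print(long_seq, c, sep="\n", end="\n\n")
```

Even this restricted version yields 28 candidates for an 8-letter and a 6-letter sequence, and the count grows quickly with sequence length.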

Page 25: Sudhindra R. Gadagkar, Ph.D

Pairwise Alignment – contd.

• Clearly, if the sequences are long, it would become impossible to introduce gaps manually; we would need a computer to help us find the optimal gaps.

• But let us first see what is involved in asking the computer to do this.

• One way, the looooooooong way, is to:

– introduce gaps in every possible position.

Page 26: Sudhindra R. Gadagkar, Ph.D

The Brute Force Method (the Perspiration approach)

• For the long way, to get an idea of what is involved, let us first look at the first position.

• There are three possible choices:

1. gap in the first sequence
2. gap in the second sequence, or
3. gap in neither sequence

That is,

• - A A

• A - A

Page 27: Sudhindra R. Gadagkar, Ph.D

The Brute Force Method (the Perspiration approach)

• These options are the same for every position.

• Therefore, the number of possible paths, y, for a pair of sequences of length 1 base is 3.

• If the sequences are 2 bases long it is 3^2 = 9.

Page 28: Sudhindra R. Gadagkar, Ph.D

The Brute Force Method (the Perspiration approach)

• In general, if they are n bases long, then there are 3^n paths.

• If n = 20, then y = 3^20 ≈ 3.5 x 10^9

Page 29: Sudhindra R. Gadagkar, Ph.D

The Brute Force Method (the Perspiration approach)

• If n = 200, then y = 3^200 ≈ 2.7 x 10^95

• If one path takes 1 nanosecond (10^-9 seconds), then for a pair of sequences that is 200 bases long, the computer will need

– 8.4 x 10^78 years!!
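The figures on these slides can be reproduced in a few lines. The sketch below uses the slides' own model (3 choices per position, 1 nanosecond per path); the year length of 365.25 days and the function names are assumptions made for this example.

```python
SECONDS_PER_PATH = 1e-9                # one nanosecond per path, as on the slide
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length


def brute_force_paths(n: int) -> int:
    """Number of paths when each of n positions has 3 choices."""
    return 3 ** n


def brute_force_years(n: int) -> float:
    """Years needed to visit every path at one nanosecond per path."""
    return brute_force_paths(n) * SECONDS_PER_PATH / SECONDS_PER_YEAR


for n in (1, 2, 20, 200):
    print(f"n={n}: {brute_force_paths(n):.3e} paths, {brute_force_years(n):.3e} years")
```

Note how modest the step from n = 20 to n = 200 looks, yet the running time moves from a few seconds to far longer than any physical timescale: that is what exponential growth in n means in practice.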

Page 30: Sudhindra R. Gadagkar, Ph.D

Let’s get some perspective

• Needs a super-human effort, eh?

• That’s absolutely right!

• That super-human is the computer.

• But it’s not enough to just use the computer to solve such problems.

• The computer does not have to work hard. It needs to work smart!
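One widely used way of "working smart" on the alignment problem is dynamic programming, often called the Needleman-Wunsch approach. The slides do not name a method, so the following is an illustrative sketch rather than the speaker's own algorithm, and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example. Instead of trying every path, it fills in a table with one cell per pair of positions.

```python
def alignment_score(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b, found by dynamic programming.

    The scoring values are assumptions made for this sketch.
    """
    # prev[j] = best score for aligning the rows of a processed so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix: all gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap      # a[i-1] is paired with a gap
            gap_in_a = curr[j - 1] + gap  # b[j-1] is paired with a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(alignment_score("AATCTATA", "AAGATA"))
```

For two sequences of 200 bases this fills about 200 x 200 = 40,000 cells, compared with the roughly 3^200 paths of the brute-force count, because each cell reuses the best answers already found for shorter prefixes.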

Page 31: Sudhindra R. Gadagkar, Ph.D

Need Computer!