Class 35: Decoding DNA. David Evans, http://www.cs.virginia.edu/evans. Sign up for your PS8 team design review! DNA Helix Photomosaic from cover of Nature, 15 Feb 2001 (made by Eric Lander). CS150: Computer Science, University of Virginia Computer Science.
![Page 1: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/1.jpg)
David Evans
http://www.cs.virginia.edu/evans
CS150: Computer Science
University of Virginia Computer Science
Class 35: Decoding DNA
DNA Helix Photomosaic from cover of Nature, 15 Feb 2001 (made by Eric Lander)
Sign up for your PS8 team design review!
![Page 2: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/2.jpg)
CS150 Fall 2005: Lecture 35: Decoding DNA
Speculations
• Must study math for 15+ years before understanding an (important) open problem
  – Was ~10 until Andrew Wiles proved Fermat’s Last Theorem
• Must study physics for ~6 years before understanding an open problem
• Must study computer science for 1 semester before understanding the most important open problem
  – Unless you’re a 6-year-old at Cracker Barrel
• But every 5-year-old understands the most important open problems in biology!
![Page 3: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/3.jpg)
Biology’s Open Problem
Which came first, the chicken or the egg?
How can a (relatively) simple, single cell turn into a chicken?
![Page 4: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/4.jpg)
Brief History of Biology
Timeline (marks at 1850, 1950, 2000): Life is about magic (“vitalism”) → Life is about chemistry → Life is about information → Life is about computation
• Most biologists work on classification
• Aristotle (~300 BC): genera and species
• Descartes (1641): explain life mechanically
• Schrödinger (1944): life is information; crack the information code
• Watson and Crick (1953): DNA stores the information
![Page 5: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/5.jpg)
DNA
• Sequence of nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T)
• Two strands: A must attach to T, and G must attach to C
[Diagram: base pairs joining the two strands, labeled A, G, T, C]
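The pairing rule on this slide can be sketched in a few lines of Python. The function names `complement` and `reverse_complement` are illustrative choices, not something from the lecture; the second function reflects the fact that the two strands run in opposite directions.

```python
# Pairing rule from the slide: A attaches to T, G attaches to C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand):
    """Return the partner base for each position, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand):
    """Return the partner strand as read in its own direction (strands are antiparallel)."""
    return complement(strand)[::-1]

print(complement("AGTC"))          # TCAG
print(reverse_complement("AGTC"))  # GACT
```

Because each base determines its partner, one strand carries all the information needed to rebuild the other.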
![Page 6: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/6.jpg)
Central Dogma of Biology
• RNA makes copies of DNA segments
• RNA describes sequences of amino acids
• Chains of amino acids make proteins
DNA → (Transcription) → RNA → (Translation) → Protein
Image from http://www.umich.edu/~protein/
![Page 7: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/7.jpg)
Encoding Proteins
• There are 4 nucleotides: adenine (A), guanine (G), cytosine (C), and thymine (T) (replaced with uracil (U) in RNA)
• There are 20 different amino acids, and a stop marker (to separate proteins)
• How many nucleotides are needed to encode one amino acid?
  – with 2, could encode 16 things: 4 * 4
  – with 3, could encode 64 things: 4 * 4 * 4
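The counting argument above can be checked with a short sketch. The helper name `min_codon_length` is hypothetical; it simply finds the smallest word length whose number of combinations covers the 21 meanings (20 amino acids plus stop).

```python
def min_codon_length(symbols_needed, alphabet_size=4):
    """Smallest length L such that alphabet_size ** L >= symbols_needed."""
    length = 1
    while alphabet_size ** length < symbols_needed:
        length += 1
    return length

# 20 amino acids + 1 stop marker = 21 things to encode with 4 nucleotides.
print(min_codon_length(21))  # 3, since 4 * 4 = 16 < 21 <= 64 = 4 * 4 * 4
```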
![Page 8: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/8.jpg)
Codons
• Three nucleotides encode an amino acid
• But, there are only 20 amino acids, so there may be several different ways to encode the same one
From http://web.mit.edu/esgbio/www/dogma/dogma.html
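As a rough illustration of codons and their redundancy, here is a minimal sketch of transcription and translation. The table is deliberately partial (only ten of the 64 codons, chosen for the example), and the function names and the sample DNA string are made up for illustration.

```python
# Partial codon table (assumption: only the codons this example needs).
# Note the redundancy: two codons for Phe, four for Gly, three for stop.
CODON_TABLE = {
    "AUG": "Met",
    "UUU": "Phe", "UUC": "Phe",
    "GGU": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "UAA": "STOP", "UAG": "STOP", "UGA": "STOP",
}

def transcribe(dna):
    """DNA to RNA: thymine (T) is replaced with uracil (U)."""
    return dna.replace("T", "U")

def translate(rna):
    """Read the RNA three bases at a time until a stop codon appears."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "STOP":
            break
        protein.append(amino)
    return protein

print(translate(transcribe("ATGTTTTTCGGAGGGTAA")))
# ['Met', 'Phe', 'Phe', 'Gly', 'Gly']
```

The codons UUU and UUC differ, as do GGA and GGG, yet each pair yields the same amino acid.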
![Page 9: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/9.jpg)
Shortest (Known) Life Program
• Nanoarchaeum equitans
  – 490,885 bases (522 genes) = 490,885 * ¼ * 21/64 ≈ 40,268 bytes
  – Parasite: no metabolic capacity, must steal from host
  – Complete components for information processing: transcription, replication, enzymes for DNA repair
• Size of compiled C++ “Hello World”:
  – Windows (bcc32): 112,640 bytes
  – Linux (g++): 11,358 bytes
http://www.mediscover.net/Extremophiles.cfm
KO Stetter and Dr Rachel Reinhard
![Page 10: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/10.jpg)
How Big is the Make-a-Human Program?
• 3 Billion Base Pairs
  – Each nucleotide is 2 bits (4 possibilities)
  – 3 B pairs * 1 byte/4 pairs = 750 MB
• Every sequence of 3 base pairs encodes one of 20 amino acids (or a stop codon)
  – Only 21 distinct meanings, but 4^3 = 64 possible codons
  – So, really only 750 MB * (21/64) ≈ 250 MB
• Most of it (> 95%) is probably junk
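The size estimates can be reproduced with a few lines of arithmetic. This sketch follows the slides' own back-of-the-envelope method (2 bits per base, then a 21/64 scaling for codon redundancy); the function names are illustrative, and MB is taken to mean 10^6 bytes.

```python
BITS_PER_BASE = 2  # 4 possible nucleotides, so 2 bits each

def raw_bytes(bases):
    """Bytes needed at 2 bits per base (4 bases per byte)."""
    return bases * BITS_PER_BASE / 8

def slide_estimate_bytes(bases):
    """The slides' rough estimate: scale by 21 meanings out of 64 codons."""
    return raw_bytes(bases) * 21 / 64

human = 3_000_000_000
print(raw_bytes(human) / 1e6)             # 750.0 (MB)
print(slide_estimate_bytes(human) / 1e6)  # about 246 MB, rounded to ~250 on the slide
print(round(slide_estimate_bytes(490_885)))  # 40268 bytes for Nanoarchaeum equitans
```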
![Page 11: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/11.jpg)
1 CD ~ 650 MB
![Page 12: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/12.jpg)
People are almost all the Same
• Genetic code for 2 humans differs in only 2.1 million bases
  – ~4 million bits ≈ 0.5 MB
• How big is 0.5 MB?
  – 1/3 of a floppy disk
  – ~22 times the size of the PS6 adventure game code
![Page 13: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/13.jpg)
Is DNA Really a Programming Language?
![Page 14: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/14.jpg)
Stuff Programming Languages are Made Of
• Primitives
  – Codons (sequence of 3 nucleotides that encodes an amino acid)
• Means of Combination
  – ?? Morphogenesis? Not well understood (by anyone).
• Means of Abstraction
  – DNA itself: separates proteins from their encoding
  – Genes: group DNA by function (sort of)
  – Chromosomes: package Genes together
  – Organisms: packages for reproducing Genes
This is where most of the expressiveness comes from!
![Page 15: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/15.jpg)
Jacob and Monod, 1959
• Not so simple: cells in an organism have the same DNA, but do different things
  – Structural genes: make proteins that make us
  – Regulator genes: control rate of transcription of other genes
“The genome contains not only a series of blue-prints, but a coordinated program of protein synthesis and the means for controlling its execution.”
François Jacob and Jacques Monod, 1961
![Page 16: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/16.jpg)
Split Genes
• Richard Roberts and Phillip Sharp, 1977
• Not so simple: genome is spaghetti code (exons) with lots of noops/comments (introns)
• Exons can be spliced together in different ways before translation
• Possible to produce 100s of different proteins from one gene
![Page 17: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/17.jpg)
Saccharomyces Cerevisiae (Yeast) Protein Interactions
4825 proteins, ~15,000 interactions
Bader and Hogue, Nature 2002
![Page 18: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/18.jpg)
Most Important Science/Technology Races
• 1930-40s: Decryption, Nazis vs. British
  – Winner: British
  – Reason: Bletchley Park had computers (and Alan Turing), Nazis didn’t
• 1940s: Atomic Bomb, Nazis vs. US
  – Winner: US
  – Reason: Heisenberg miscalculated, US had better physicists, computers, resources
• 1960s: Moon Landing, Soviet Union vs. US
  – Winner: US
  – Reason: Many; better computing was a big one
• 1990s-2001: Sequencing Human Genome
![Page 19: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/19.jpg)
Human Genome Race
Francis Collins (Director of public National Center for Human Genome Research) (Picture from UVa Graduation 2001)
• UVa CLAS 1970
• Yale PhD
• Tenured Professor at U. Michigan
vs.
Craig Venter (President of Celera Genomics)
• San Mateo College
• Court-martialed
• Denied tenure at SUNY Buffalo
![Page 20: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/20.jpg)
Reading the Genome
Whitehead Institute, MIT
![Page 21: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/21.jpg)
Gene Reading Machines
• One read: about 700 base pairs
• But… don’t know where they are on the chromosome
Actual Genome: AGGCATACCAGAATACCCGTGATCCAGAATAAGC
Read 1: ACCAGAATACC
Read 2: TCCAGAATAA
Read 3: TACCCGTGATCCA
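To make the "don't know where they are" point concrete, the sketch below locates each read in the example genome, something the sequencing machine cannot do because it never sees the full genome. The helper `find_all` is an illustrative name; the strings are the ones on the slide.

```python
GENOME = "AGGCATACCAGAATACCCGTGATCCAGAATAAGC"
READS = ["ACCAGAATACC", "TCCAGAATAA", "TACCCGTGATCCA"]

def find_all(text, pattern):
    """Return every start index at which pattern occurs in text."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions

for read in READS:
    print(read, find_all(GENOME, read))

# A repeated region is what makes placement ambiguous:
print(find_all(GENOME, "CCAGAATA"))  # [7, 23]
```

Reads 1 and 2 both contain the repeated stretch CCAGAATA, which is why overlaps alone do not always pin down where a fragment belongs.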
![Page 22: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/22.jpg)
Genome Assembly
Input: Genome fragments (but without knowing where they are from)
Output: The full genome
Read 1: ACCAGAATACC
Read 2: TCCAGAATAA
Read 3: TACCCGTGATCCA
![Page 23: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/23.jpg)
Genome Assembly
Input: Genome fragments (but without knowing where they are from)
Output: The smallest genome sequence such that all the fragments are substrings.
Read 1: ACCAGAATACC
Read 2: TCCAGAATAA
Read 3: TACCCGTGATCCA
![Page 24: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/24.jpg)
Common Superstring
Input: A set of n substrings and a maximum length k.
Output: A string of length at most k that contains all the substrings, or “no” if no such string exists.
Substrings:
ACCAGAATACC
TCCAGAATAA
TACCCGTGATCCA
k = 26: ACCAGAATACCCGTGATCCAGAATAA
![Page 25: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/25.jpg)
Common Superstring
Input: A set of n substrings and a maximum length k.
Output: A string of length at most k that contains all the substrings, or “no” if no such string exists.
Substrings:
ACCAGAATACC
TCCAGAATAA
TACCCGTGATCCA
k = 25: Not possible
![Page 26: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/26.jpg)
Common Superstring
• In NP:
  – Easy to verify a “yes” solution: just check the letters match up, and count the superstring length
• NP-Complete
  – Similar to Smiley Puzzle!
  – Could transform 3SAT into Common Superstring problem
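The "easy to verify" claim can be sketched as a short checker. The function name `verify_common_superstring` is hypothetical; it does exactly what the slide describes: check that every piece appears and count the length.

```python
def verify_common_superstring(pieces, candidate, k):
    """Accept a claimed "yes" answer: candidate has length <= k and contains every piece."""
    return len(candidate) <= k and all(piece in candidate for piece in pieces)

pieces = ["ACCAGAATACC", "TCCAGAATAA", "TACCCGTGATCCA"]
answer = "ACCAGAATACCCGTGATCCAGAATAA"

print(verify_common_superstring(pieces, answer, 26))  # True
print(verify_common_superstring(pieces, answer, 25))  # False: the candidate is too long
```

Checking a proposed answer takes time polynomial in the input size, even though finding one may not.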
![Page 27: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/27.jpg)
Shortest Common Superstring
Input: A set of n substrings
Output: The shortest string that contains all the substrings.
Substrings:
ACCAGAATACC
TCCAGAATAA
TACCCGTGATCCA
Shortest superstring: ACCAGAATACCCGTGATCCAGAATAA
![Page 28: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/28.jpg)
Shortest Common Superstring
• Also is NP-Complete:
function scsuperstring (pieces)
  maxlen = sum of lengths of all pieces
  for k = 1 to k = maxlen step 1 do
    if (commonSuperstring (pieces, k))
      return commonSuperstring (pieces, k)
  end for
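A runnable sketch of the pseudocode above, in Python. The slide does not say how `commonSuperstring` works; here it is assumed to be a brute-force search over every ordering of the pieces, which is only practical for a handful of strings. All function names are illustrative.

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def merge_in_order(pieces):
    """Glue pieces left to right, reusing as much overlap as possible."""
    result = pieces[0]
    for piece in pieces[1:]:
        if piece in result:
            continue  # already covered
        result += piece[overlap(result, piece):]
    return result

def common_superstring(pieces, k):
    """Decision problem: a superstring of length <= k, or None (brute force)."""
    for order in permutations(pieces):
        candidate = merge_in_order(order)
        if len(candidate) <= k:
            return candidate
    return None

def sc_superstring(pieces):
    """Try k = 1, 2, ... until the decision problem says yes."""
    maxlen = sum(len(p) for p in pieces)
    for k in range(1, maxlen + 1):
        answer = common_superstring(pieces, k)
        if answer is not None:
            return answer

reads = ["ACCAGAATACC", "TCCAGAATAA", "TACCCGTGATCCA"]
print(sc_superstring(reads))          # ACCAGAATACCCGTGATCCAGAATAA
print(common_superstring(reads, 25))  # None
```

The loop makes at most `maxlen` calls to the decision procedure, so the optimization problem costs only polynomially more than the decision problem.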
![Page 29: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/29.jpg)
Human Genome
• 3 Billion base pairs
• 600-700 bases per read
• ~8X coverage required
• So, n ≈ 37 Million sequence fragments
• Celera used 27.2 Million reads (but could get more than 700 bases per read)
> (/ (* 8 (* 3 1000 1000 1000)) 650)
36923076 12/13
![Page 30: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/30.jpg)
Give up?
No known efficient way to solve an NP-Complete problem (best known solutions are O(2^n), with n ≈ 20 Million)
[Chart: 2^n on a log scale from 1 to 1E+30, for n = 2 to 128, compared with the time since the “Big Bang”]
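To get a feel for the chart, the sketch below finds the first n at which 2^n exceeds the number of seconds since the Big Bang. Two assumptions that are not on the slide: one step takes one second, and the universe is about 13.8 billion years old. The function name is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 13.8e9  # assumption: commonly cited estimate

def first_n_exceeding(limit):
    """Smallest n with 2 ** n > limit."""
    n = 1
    while 2 ** n <= limit:
        n += 1
    return n

seconds_since_big_bang = AGE_OF_UNIVERSE_YEARS * SECONDS_PER_YEAR
print(first_n_exceeding(seconds_since_big_bang))  # 59
```

Under those assumptions an exponential algorithm runs out of universe at n = 59, far short of millions of fragments.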
![Page 31: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/31.jpg)
Approaches
• Human Genome Project (Collins)
  – Start by producing a genome map (using biology, chemistry, etc.) to have a framework for knowing where the fragments should go
• Celera Solution (Venter)
  – Approximate: we can’t guarantee finding the shortest possible, but we can develop clever algorithms that get close most of the time in O(n log n)
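One way to picture the "approximate" idea is the textbook greedy heuristic: repeatedly merge the two pieces with the largest overlap. This is not Celera's actual assembler, and this naive version is far slower than O(n log n) because it recomputes every pairwise overlap each round; it is only a sketch with illustrative names.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_superstring(pieces):
    """Greedy merge heuristic: no guarantee of the shortest result."""
    # Drop duplicates and pieces already contained in another piece.
    unique = list(dict.fromkeys(pieces))
    strings = [p for p in unique if not any(p != q and p in q for q in unique)]
    while len(strings) > 1:
        best = (-1, 0, 1)  # (overlap length, left index, right index)
        for i, a in enumerate(strings):
            for j, b in enumerate(strings):
                if i != j:
                    size = overlap(a, b)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        merged = strings[i] + strings[j][size:]
        strings = [s for idx, s in enumerate(strings) if idx not in (i, j)]
        strings.append(merged)
    return strings[0] if strings else ""

reads = ["ACCAGAATACC", "TCCAGAATAA", "TACCCGTGATCCA"]
print(greedy_superstring(reads))  # ACCAGAATACCCGTGATCCAGAATAA
```

On the three example reads the greedy result happens to match the 26-base optimum; on other inputs it can be longer than the shortest superstring.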
![Page 32: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/32.jpg)
Result: Draw
President Clinton announces Human Genome Sequence essentially complete (with Venter and Collins), June 26, 2000
But, Human Genome Project mostly adopted Venter’s approach.
![Page 33: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/33.jpg)
So Why Haven’t We Cured Cancer Yet?
[Bar chart: 2004 Projected Budgets for NIH, NSF, and DARPA, in $ billions (axis from 0 to 30)]
![Page 34: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/34.jpg)
Why Biologists Haven’t Done Much Useful with the Human Genome Yet
They are trying to debug highly concurrent, asynchronous, type-unsafe, multiple entry/exit, self-modifying programs that create programs that create programs running on an undocumented, unstable, environmentally-sensitive OS by looking at the bits (and just figuring out the shape of a protein is an NP-hard problem)
![Page 35: David Evans cs.virginia/evans](https://reader035.vdocument.in/reader035/viewer/2022062408/56813827550346895d9fd654/html5/thumbnails/35.jpg)
Charge
• Meet with your project teams before the design review meeting
  – You don’t need a formal presentation, but should have notes prepared