*less than 10% dinosaur content
Jeffrey Boucher. Talk Outline. Talk 1: “How to Raise the Dead: The Nuts & Bolts of Ancestral Sequence Reconstruction”. Talk 2: Ancestral Sequence Reconstruction Lab. Talk 3: “Ancestral Sequence Reconstruction: What is it Good for?”
![Page 1: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/1.jpg)
![Page 2: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/2.jpg)
*
*Less than 10% Dinosaur content
Jeffrey Boucher
![Page 3: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/3.jpg)
Talk Outline
• Talk 1:
– “How to Raise the Dead: The Nuts & Bolts of Ancestral Sequence Reconstruction”
• Talk 2:
– Ancestral Sequence Reconstruction Lab
• Talk 3:
– “Ancestral Sequence Reconstruction: What is it Good for?”
![Page 4: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/4.jpg)
How to Raise the Dead: The Nuts and Bolts of Ancestral Sequence Reconstruction
Jeffrey Boucher
Theobald Laboratory
![Page 5: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/5.jpg)
Orientation for the Talk
• The Central Dogma:
DNA → RNA → Protein
![Page 6: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/6.jpg)
Orientation for the Talk (cont.)
• Chemistry of side chains governs structure/function
• Mutations to sequences occur over time
![Page 7: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/7.jpg)
We Live in The Sequencing Era
[Figure: GenBank Database Growth by Year. x-axis: Year (1982 to 2010); y-axis: Number of Entries (0 to 150,000,000)]
Since inception, database size has doubled every 18 months.
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
![Page 8: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/8.jpg)
What Can We Learn From This Data?
• Individually…not much
• Too many sequences to characterize individually
– Today: 1.5E8 sequences ÷ 7E9 people ≈ 1 sequence/50 people
– By 2019: 1.2E9 sequences ÷ 7.5E9 people ≈ 1 sequence/6 people
>gi|93209601|gb|ABF00156.1| pancreatic ribonuclease precursor subtype Na [Nasalis larvatus]
MALDKSVILLPLLVVVLLVLGWAQPSLGRESRAEKFQRQHMDSGSSPSSSSTYCNQMMKRRNMTQGRCKPVNTFVHEPLVDVQNVCFQEKVTCKNGQTNCFKSNSRMHITDCRLTNGSKYPNCAYRTTPKERHIIVACEGSPYVPVHFDASVEDST
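The sequences-per-person arithmetic on this slide can be checked with a short sketch. It assumes the 18-month doubling time quoted for GenBank and, as an assumption not stated on the slide, that "today" is about 4.5 years (three doublings) before 2019; `project_entries` is a hypothetical helper name.

```python
def project_entries(n0: float, years: float, doubling_time: float = 1.5) -> float:
    """Database size after `years`, doubling every `doubling_time` years."""
    return n0 * 2 ** (years / doubling_time)

today = 1.5e8                        # sequences today (slide value)
later = project_entries(today, 4.5)  # assumed span: 4.5 years = three doublings

print(f"projected entries: {later:.2e}")
print(f"people per sequence today: {7e9 / today:.0f}")
print(f"people per sequence later: {7.5e9 / later:.2f}")
```

Three doublings turn 1.5E8 into 1.2E9, which matches the 2019 figure on the slide; the slide rounds the resulting ratios to 1 sequence per 50 and per 6 people.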
![Page 9: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/9.jpg)
Bioinformatics!
• Bioinformatic methods developed to deal with this backlog
• Methods covered:
– Sequence Alignment (& BLAST)
– Phylogenetics
– Sequence Reconstruction
![Page 10: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/10.jpg)
Sequence Alignment
• How can we compare sequences?
• Simple scoring function
– 1 for match
– 0 for mismatch
[Figure: Orangutan and Chimpanzee sequences aligned and scored position by position with 1s and 0s; total score = 5]
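A minimal sketch of the 1/0 scoring function, assuming the two sequences are already aligned to the same length. The two strings are toy inputs for illustration, not the Orangutan and Chimpanzee sequences from the figure.

```python
def identity_score(seq_a: str, seq_b: str) -> int:
    """Score 1 for each matching position and 0 for each mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)

# Toy sequences that differ at positions 4 and 8.
score = identity_score("MALDKSVILL", "MALEKSVLLL")
print(score)
```

Every mismatch costs the same here, which is the limitation the next slides address.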
![Page 11: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/11.jpg)
Not All Mismatches Are Created Equal
• How can scoring function account for this?
[Figure: Orangutan and Chimpanzee alignment with two mismatches marked *: Aspartate/Glutamate vs. Glutamate/Leucine]
![Page 12: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/12.jpg)
Substitution Matrix
[Figure: substitution matrix with the Glutamate/Aspartate and Glutamate/Leucine entries labeled]
![Page 13: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/13.jpg)
Calculating A Substitution Matrix
• How are the rewards/penalties determined?
• Determined by log-odds scores:
$S_{i,j} = \log \dfrac{p_{i,j}}{q_i \, q_j}$
$p_{i,j}$ is the probability that amino acid $i$ transforms to amino acid $j$
$q_i$ and $q_j$ represent the frequencies of those amino acids
Why not just $p_{i,j}$?
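One way to see why the score divides by the background frequencies: the same substitution probability is unremarkable for two common amino acids but surprising for two rare ones. The sketch below uses made-up probabilities and assumes a base-2 logarithm, which the slide does not specify.

```python
import math

def log_odds(p_ij: float, q_i: float, q_j: float) -> float:
    """Log-odds score: observed pair probability vs. chance expectation q_i * q_j."""
    return math.log2(p_ij / (q_i * q_j))

# Hypothetical numbers: identical p_ij, different background frequencies.
common_pair = log_odds(0.01, 0.10, 0.10)  # pairing is about what chance predicts
rare_pair = log_odds(0.01, 0.02, 0.02)    # pairing is 25x more frequent than chance

print(round(common_pair, 3), round(rare_pair, 3))
```

Using $p_{i,j}$ alone would rank these two pairs equally; the log-odds form rewards only the pairing that occurs more often than chance.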
![Page 14: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/14.jpg)
Neither Are All Matches
[Figure: matched pairs Cysteine/Cysteine vs. Leucine/Leucine]
![Page 15: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/15.jpg)
BLOSUM62 (BLOcks of Amino Acid SUbstitution Matrix)
≥62% Identity
<62% Identity
STOP
How did you get an alignment? You’re talking about ‘How to Make an Alignment’!
Blocks used align well with 1/0 scoring function
![Page 16: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/16.jpg)
BLOSUM62 Matrix Calculation
$S_{i,j} = \log \dfrac{p_{i,j}}{q_i \, q_j}$
[Figure: sequence block split into ≥62% Identity and <62% Identity groups]
Pair counts:
G-G  G-A  A-A
  6    2    0
  5    2    0
  4    2    0
  0    4    1
  3    1    0
  2    1    0
  1    1    0
  0    1    0
 21   14    1   = 36
$p_{G,A}$ = 14/900 = 0.016
$q_G$ = 2 + 9 + 9 = 21/225 = 0.093
$q_A$ = 7 + 9 = 16/225 = 0.071
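The pair tally can be reproduced with a short sketch. It assumes a single alignment column holding 7 G and 2 A across 9 sequences, an arrangement chosen because it gives the 21 / 14 / 1 totals above; the actual column on the slide is not recoverable. The score uses the slide's printed values and, as an assumption, a base-2 logarithm.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical column: 7 G and 2 A across 9 aligned sequences.
column = "GGGAGGGGA"

# Count every unordered pair of residues in the column (9 choose 2 = 36 pairs).
pair_counts = Counter("".join(sorted(pair)) for pair in combinations(column, 2))
print(dict(pair_counts))

# Log-odds score for G/A from the values printed on the slide.
p_ga = 14 / 900
q_g = 21 / 225
q_a = 16 / 225
s_ga = math.log2(p_ga / (q_g * q_a))
print(round(s_ga, 2))
```

With these inputs the G/A pairing comes out more frequent than chance, so the score is positive.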
![Page 17: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/17.jpg)
Pairwise Alignment Examples
• No Gaps allowed:
Orangutan vs. Chimpanzee, per-position scores:
4 2 -2 0 6 -1 -3 -4 -2 -2 4 0 4 -1 7 1 1 = 14
• Gap Penalty of -8:
– Penalty heuristically determined
Orangutan vs. Chimpanzee, per-position scores:
4 -8 5 4 0 6 2 4 6 5 4 0 3 4 -8 7 1 1 = 40
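The slides do not say how the gapped alignment is found. A standard choice is Needleman-Wunsch global alignment, sketched here with a linear gap penalty. The match/mismatch scores and the DNA-like toy strings are assumptions for illustration; a real protein alignment would plug in a substitution matrix such as BLOSUM62 as the `score` function.

```python
def needleman_wunsch(a, b, score, gap):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # dp[i][j] = best score for aligning a[:i] with b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # align two residues
                dp[i - 1][j] + gap,                            # gap in b
                dp[i][j - 1] + gap,                            # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# Toy scoring (assumed): +2 match, -1 mismatch, -2 per gap position.
simple = lambda x, y: 2 if x == y else -1
total, top, bottom = needleman_wunsch("GATTACA", "GATACA", simple, gap=-2)
print(total)
print(top)
print(bottom)
```

Moving `gap` toward 0 lets the algorithm insert gaps freely to dodge mismatches, which is why the penalty has to be tuned.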
![Page 18: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/18.jpg)
Pairwise Alignment Examples (cont.)
• If gap penalty is too low…
[Figure: Orangutan and Chimpanzee alignment]
• Alignment of multiple sequences uses a similar method
![Page 19: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/19.jpg)
(& BLAST)
• Alignment can identify similar sequences
• BLAST (Basic Local Alignment Search Tool)
• How does alignment compare to alignment of random sequences?
– E-value of 1E-3 means roughly a 1:1000 chance that random sequences would align this well
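Strictly, an E-value is the expected number of chance hits scoring at least this well, not a probability. Under the usual Poisson assumption the probability of at least one such hit is 1 - exp(-E), and the two are nearly identical when E is small. A minimal sketch (the function name is hypothetical):

```python
import math

def p_at_least_one_chance_hit(e_value: float) -> float:
    """P(one or more chance hits) = 1 - exp(-E), assuming Poisson-distributed hits."""
    return -math.expm1(-e_value)

small = p_at_least_one_chance_hit(1e-3)
large = p_at_least_one_chance_hit(10.0)
print(small, large)
```

For E = 1E-3 the probability is essentially 1:1000, as on the slide; for E = 10 it is close to 1, so the hit says little about relatedness.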
![Page 20: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/20.jpg)
Homology vs. Identity
• Significant BLAST hits inform us about evolutionary relationships
• Homologous - share a common ancestor
– This is binary, not a percentage
– Identity is calculated, homology is a hypothesis
– Homology does not ensure common function
![Page 21: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/21.jpg)
Visual Depiction of Alignment Scores
• Suppose alignment of 3 sequences…
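[Figure: alignment of Orangutan (O), Chimpanzee (C), and Mouse (M)]
Pairwise alignment scores:
      M    C    O
O    19   40    -
C    18    -   40
M     -   18   19
[Figure: tree with tips M, O, C]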
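As a rough illustration of how pairwise scores suggest a grouping, the sketch below joins the highest-scoring pair and treats the remaining sequence as the outgroup. This is a crude heuristic for three taxa only, not one of the phylogenetic methods covered later; the Newick string is just a compact way to write the resulting tree.

```python
# Pairwise alignment scores from the slide (O = Orangutan, C = Chimpanzee, M = Mouse).
scores = {("O", "C"): 40, ("O", "M"): 19, ("C", "M"): 18}

# Join the most similar pair first; the remaining taxon becomes the outgroup.
closest = max(scores, key=scores.get)
taxa = {name for pair in scores for name in pair}
(outgroup,) = taxa - set(closest)

newick = f"(({closest[0]},{closest[1]}),{outgroup});"
print(newick)
```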
![Page 22: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/22.jpg)
Phylogenetics
• Relationships between organisms/sequences
• On the Origin of Species (1859) had 1 figure:
![Page 23: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/23.jpg)
Phylogenetics
• Prior to 1950s phylogenies based on morphology
• Sequence data/Analytical methods
– Qualitative → Quantitative
![Page 24: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/24.jpg)
Phylogeny
[Figure: tree of taxa A to G drawn along a TIME axis, with these labels:]
• Taxa (observed data)
• Node
• Internal Branch
• Peripheral Branch
• Branch lengths represent time/change
![Page 25: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/25.jpg)
A Tale of Two Proteins
• Significant sequence similarity & the same structure
Protein X - Binds Single-Stranded RNA
Protein Y - Binds Double-Stranded RNA
![Page 26: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/26.jpg)
“Gene”alogy
[Figure: tree of taxa A to G drawn along a TIME axis, split into Single-Stranded and Double-Stranded groups, with these nodes labeled:]
• Last Common Ancestor of All Single-Stranded
• Last Common Ancestor of All Double-Stranded
• Last Common Ancestor of All
![Page 27: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/27.jpg)
Back to the Future
• Resurrecting extinct proteins was 1st proposed by Pauling & Zuckerkandl in 1963
• In 1990, the 1st ancestral protein was reconstructed, expressed & assayed by the S.A. Benner Group
– RNaseA from ~5 Myr old extinct ruminant
![Page 28: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/28.jpg)
What Took So Long ?
![Page 29: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/29.jpg)
How to Resurrect a Protein
1) Acquire/Align Sequences
2) Construct Phylogeny
(figure from Chang et al. 2002)
3) Infer Ancestral Nodes
4) Synthesize Inferred Sequence
![Page 30: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/30.jpg)
So Really…What Took So Long?
• Advances in 3 areas were required:
– Sequence availability
– Phylogenetic reconstruction methods
– Improvements in DNA synthesis
![Page 31: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/31.jpg)
Sequence Availability
[Figure: GenBank Database Growth by Year. x-axis: Year (1982 to 2011); y-axis: Number of Sequences (0 to 150,000,000). Inset: 1982 to 1991 on a 0 to 60,000 scale, with the value 606 labeled]
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
![Page 32: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/32.jpg)
• Advances in 3 areas were required:
✓ Sequence availability
– Phylogenetic reconstruction methods
– Improvements in DNA synthesis
![Page 33: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/33.jpg)
Advances in Reconstruction Methods
Consensus
Parsimony
Maximum Likelihood
![Page 34: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/34.jpg)
Consensus
• Advantage: Easy & fast
• Disadvantage: Ignores phylogenetic relationships
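A minimal sketch of the consensus approach: take the most common residue in each alignment column. The three short sequences are toy data, and ties are not handled specially (the first residue seen wins).

```python
from collections import Counter

def consensus(alignment):
    """Most common residue per column of an equal-length alignment."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    # zip(*alignment) walks the alignment column by column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*alignment))

toy_alignment = ["MALV", "MALI", "MKLV"]
result = consensus(toy_alignment)
print(result)
```

Nothing in this calculation looks at how the sequences are related, which is the disadvantage noted above.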
![Page 35: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/35.jpg)
Parsimony
• Parsimony Principle
– Best-supported evolutionary inference requires fewest changes
– Assumes conservation as model
• Advantage:
– Takes phylogenetic relationships into account
• Disadvantage:
– Ignores evolutionary process & branch lengths
![Page 36: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/36.jpg)
Parsimony
[Figure: tree of taxa A to H]
![Page 37: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/37.jpg)
Parsimony
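[Figure: parsimony example on a tree whose tips carry the residues V, I, or L. Internal nodes are labeled with candidate state sets ({V}, {L}, {V, I}, {V, I, L}) and with single inferred residues. Changes = 4]
Example adapted from David Hillis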
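The set labels in the figure match the bottom-up pass of what is usually called Fitch's algorithm: a node takes the intersection of its children's sets when that is non-empty, otherwise their union plus one counted change. The sketch below assumes that is the method pictured and runs on a small made-up tree written as nested tuples, not the tree from the slide.

```python
def fitch(tree):
    """Return (candidate_states, n_changes) for a binary tree of nested tuples.

    Leaves are single-letter strings (observed residues).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_changes = fitch(left)
    right_states, right_changes = fitch(right)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_changes + right_changes
    # Children disagree: keep all candidates and count one change.
    return left_states | right_states, left_changes + right_changes + 1

# Hypothetical four-tip tree: ((V, V), (I, L))
root_states, changes = fitch((("V", "V"), ("I", "L")))
print(sorted(root_states), changes)
```

A root set with more than one residue, as here, is an ambiguous reconstruction of the kind the next slide discusses.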
![Page 38: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/38.jpg)
Parsimony - Alternate Reconstructions
• Is conservation the best model?
• Resolve ambiguous reconstructions
![Page 39: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/39.jpg)
Maximum Likelihood
• Likelihood:
– How surprised we should be by the data
– Maximizing the likelihood minimizes your surprise
• Example:
– Roll 20-sided die 9 times:
[Figure: nine rolls, each showing a 20]
Likelihood = Probability(Data|Model)
![Page 40: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/40.jpg)
Maximum Likelihood
Likelihood = Probability(Data|Model)
• Fair Die Model:
– 5% chance of rolling a 20
– Likelihood = (0.05)^9 ≈ 2E-12
• Trick Die Model:
– 100% chance of rolling a 20
– Likelihood = (1)^9 = 1
Assuming the trick die model maximizes the likelihood
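The dice comparison can be computed directly. This sketch assumes the data are nine rolls that all came up 20 and that, under either model, the remaining probability is spread evenly over the other 19 faces.

```python
def likelihood(rolls, p_twenty):
    """Probability of the observed rolls given the chance of rolling a 20."""
    p_other = (1.0 - p_twenty) / 19  # assumed: other faces share what is left
    result = 1.0
    for roll in rolls:
        result *= p_twenty if roll == 20 else p_other
    return result

rolls = [20] * 9                # assumed data: nine 20s in a row
fair = likelihood(rolls, 1 / 20)
trick = likelihood(rolls, 1.0)
print(f"fair die:  {fair:.3e}")
print(f"trick die: {trick:.0f}")
```

The fair-die likelihood is 0.05 multiplied by itself nine times, about 1.95E-12, so the trick-die model is the one that maximizes the likelihood of this data.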
![Page 41: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/41.jpg)
From Dice to Trees
• Likelihood = Probability(Data|Model)
– Data - Sequences/Alignment
– Model - Tree topology, Branch lengths & Model of evolution
[Figure: alternative candidate trees, separated by “or”]
• Choose model that maximizes the likelihood
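A minimal sketch of the same idea on the smallest possible tree: two sequences joined by one branch. It swaps in the Jukes-Cantor (JC69) nucleotide model because it is simple to write down; the talk itself concerns proteins and does not name a model. The site counts are made up.

```python
import math

def jc69_log_likelihood(n_sites: int, n_diff: int, d: float) -> float:
    """Log-likelihood of two aligned DNA sequences separated by distance d (JC69)."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # P(site unchanged over distance d)
    p_diff = 0.25 - 0.25 * e   # P(site changed to one specific other base)
    n_same = n_sites - n_diff
    # Each site also contributes the base frequency 1/4 of the starting state.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)

n_sites, n_diff = 100, 10      # hypothetical data: 10 differences in 100 sites

# "Choose the model that maximizes the likelihood": scan candidate branch lengths.
grid = [i / 1000 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc69_log_likelihood(n_sites, n_diff, d))

# Known closed-form JC69 estimate, for comparison.
closed_form = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)
print(best_d, round(closed_form, 4))
```

Real tree inference searches over topologies and many branch lengths at once, but each candidate is scored the same way: by the probability it assigns to the alignment.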
![Page 42: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/42.jpg)
Improvements Over Parsimony
• Includes evolutionary process & branch lengths
– Reduction in ambiguous sites
• Fit of model included in calculation
– Removes a priori choices
– Use more complex models (when applicable)
• Confidence in reconstruction
– Posterior probabilities
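To make "posterior probabilities" concrete, the sketch below computes the posterior for the ancestral state at a node with two observed descendants. It again assumes the JC69 nucleotide model with equal base frequencies as a stand-in; the states and branch lengths are hypothetical.

```python
import math

STATES = "ACGT"

def jc69_prob(x: str, y: str, d: float) -> float:
    """P(x becomes y) over branch length d under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def ancestral_posterior(leaf_a: str, leaf_b: str, d_a: float, d_b: float) -> dict:
    """Posterior over the ancestor's state given two descendants."""
    # Prior (1/4) times the probability of each descendant given the ancestor.
    joint = {
        x: 0.25 * jc69_prob(x, leaf_a, d_a) * jc69_prob(x, leaf_b, d_b)
        for x in STATES
    }
    total = sum(joint.values())
    return {x: value / total for x, value in joint.items()}

# Descendants show A (short branch, 0.1) and G (long branch, 0.3).
posterior = ancestral_posterior("A", "G", 0.1, 0.3)
for state, prob in posterior.items():
    print(state, round(prob, 3))
```

Parsimony would call this site ambiguous between A and G. Here the shorter branch pulls the posterior toward A, which is how branch lengths reduce ambiguous sites and attach a confidence to each inferred residue.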
![Page 43: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/43.jpg)
• Advances in 3 areas were required:
✓ Sequence availability
✓ Phylogenetic reconstruction methods
– Improvements in DNA synthesis
![Page 44: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/44.jpg)
Advances in DNA Synthesis
[Timeline: PAST → PRESENT]
• 1950s: DNA synthesis work starts
• late 1970s: Automated
• 1983: PCR
• 1990: 20 nts Fragments
• 2002: ~200 nts Fragments
Advances in Molecular Biology increased speed & fidelity
![Page 45: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/45.jpg)
How to Synthesize a Gene
[Figure: a 600 nts gene is built from four synthesized fragments (1 - 150, 151 - 300, 301 - 450, 451 - 600), joined by DNA Ligase into one 5’ to 3’ strand, then amplified by DNA Polymerase using FW and RV primers]
Schematic adapted from Fuhrmann et al 2002
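The fragment boundaries in the schematic follow from the gene length and the fragment size. A small sketch, using 1-based inclusive coordinates as on the slide (`fragment_coordinates` is a hypothetical helper name):

```python
def fragment_coordinates(gene_length: int, fragment_length: int):
    """1-based inclusive (start, end) coordinates of consecutive fragments."""
    return [
        (start, min(start + fragment_length - 1, gene_length))
        for start in range(1, gene_length + 1, fragment_length)
    ]

coords = fragment_coordinates(600, 150)
print(coords)
```

The `min` keeps the last fragment inside the gene when the length is not an exact multiple of the fragment size.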
![Page 46: *Less than 10% Dinosaur content](https://reader035.vdocument.in/reader035/viewer/2022070404/56813c5f550346895da5e5bd/html5/thumbnails/46.jpg)
On to the Easy Part…