origami with strings: protein folding by computer
TRANSCRIPT
Origami with strings:protein folding by computer
Kevin Karplus
Biomolecular Engineering Department
Undergraduate and Graduate Director, Bioinformatics
University of California, Santa Cruz
origami with strings – p.1/49
Outline of Talk
What is Biomolecular Engineering? Bioinformatics?
What is a protein?
The folding problem and variants on it:Local structure predictionFold recognitionComparative modeling“Ab initio” methodsContact prediction
origami with strings – p.2/49
What is Biomolecular Engineering?
Engineering with , of , or for biomolecules. For example,
with: using proteins as sensors or for self-assembly.
of: protein engineering—designing or artificially evolvingproteins to have particular functions
for: designing high-throughput experimental methods tofind out what molecules are present, how they arestructured, and how they interact.
origami with strings – p.3/49
What is Bioinformatics?
Bioinformatics: using computers and statistics to makesense out of the mountains of data produced byhigh-throughput experiments.
Genomics: finding important sequences in the genomeand annotating them.
Phylogenetics: “tree of life”.
Systems biology: piecing together various controlnetworks.
DNA microarrays: what genes are turned on underwhat conditions.
Proteomics: what proteins are present in a mixture.
Protein structure prediction.
origami with strings – p.4/49
What is a protein?
There are many abstractions of a protein: a band on agel, a string of letters, a mass spectrum, a set of 3Dcoordinates of atoms, a point in an interactiongraph, . . . .
For us, a protein is a long skinny molecule (like a stringof letter beads) that folds up consistently into aparticular intricate shape.
The individual “beads” are amino acids, which have 6atoms the same in each “bead” (the backbone atoms: N,H, CA, HA, C, O).
The final shape is different for different proteins and isessential to the function.
The protein shapes are important, but are expensive todetermine experimentally.
origami with strings – p.5/49
Folding Problem
The Folding Problem:If we are given a sequence of amino acids (the letters on astring of beads), can we predict how it folds up in 3-space?
MTMSRRNTDA ITIHSILDWI EDNLESPLSL EKVSERSGYS KWHLQRMFKK
ETGHSLGQYI RSRKMTEIAQ KLKESNEPIL YLAERYGFES QQTLTRTFKN
YFDVPPHKYR MTNMQGESRF LHPLNHYNS
↓
Too hard!
origami with strings – p.6/49
Fold-recognition problem
The Fold-recognition Problem:Given a sequence of amino acids A (the target sequence)and a library of proteins with known 3-D structures (thetemplate library),figure out which templates A match best, and align thetarget to the templates.
The backbone for the target sequence is predicted to bevery similar to the backbone of the chosen template.
Progress has been made on this problem, but we canusefully simplify further.
origami with strings – p.7/49
Remote-homology Problem
The Homology Problem:Given a target sequence of amino acidsand a library of protein sequences,figure out which sequences A is similar to and align them toA.
No structure information is used, just sequenceinformation. This makes the problem easier, but theresults aren’t as good.
This problem is fairly easy for recently diverged, verysimilar sequences, but difficult for more remoterelationships.
origami with strings – p.8/49
New-fold prediction
What if there is no template we can use?
We can try to generate many conformations of theprotein backbone and try to recognize the mostprotein-like of them.
Search space is huge, so we need a good conformationgenerator and a cheap cost function to evaluateconformations.
origami with strings – p.9/49
Hidden Markov Models
Hidden Markov Models (HMMs) are a very successful way tocapture the variability possible in a family of proteins.
An HMM is a stochastic model—that is, it assigns aprobability to every possible sequence.
An HMM is a finite-state machine with a probability foremitting each letter in each state, and with probabilitiesfor making each transition between states.
Probabilities of letters sum to one for each state.
Probabilities of transitions out of each state sum to onefor that state.
We also include null states that emit no letters, but havetransition probabilities on their out-edges.
origami with strings – p.10/49
Profile Hidden Markov Model
a1
a2 b4
a -
B1
A3
B2
A4
B3
A5
B5
EndStart
a1 a2 A3 - A4 . A5. . B1 B2 B3 b4 B5
Circles are null states.
Squares are match states, each of which is paired with anull delete state. We call the match-delete pair a fat state.
Each fat state is visited exactly once on every path fromStart to End.
Diamonds are insert states, and are used to representpossible extra amino acids that are not found in most ofthe sequences in the family being modeled.
origami with strings – p.11/49
What is single-track HMM looking for?
1
2
3
XKNESDG
G
101
XA
AXD
SXGD
G
E
XFY
YXMVIL
VXL
VXEQRKK
T
XAEGSDPPXF
F
110
XS
TXL
A
H
XE
AXE
TXMFVILLXL
EXA
EXR
KXVLI
LXR
N
120
XA
KXL
IXL
FXR
EXKR
K LXG
G M
1
2
3
XENGSD
G
51
XAIVL
F
E
XALIV
VXMFVIL
IXL
SXGSENDDXVIL
W
T
NXFVILM
MXEDGSAPP
60
XG
NXM
MXNSD
DXNSAGG
H
XL
LXDE
EXVL
LXL
LXR
KXR
T70
XVIL
IXR
RXE
AX
L
D
T
GXA
A MXP
SXD
AXL
L
80
XP
PXLIV
V
E
XMAVLI
LXILV
MXIVL
VXASTTXGA
A
T
X
L
EXA
AXD
K
90
XE
K
H
XE
EXD
NXV
IXV
IXR
AXA
AXL
AXE
QXA
A
100
1
2
3
A
1
X
L
D
T
XS
KXG
EXL
LXR
KXALIV
F
E
XVL
LXALIV
VXTLAIVV
10
XEDDXGSNEDDXSNED
FXP
S
H
XL
TXV
MXR
RXE
RXL
IXVIL
V
20
XA
RXL
NXL
LXMFVILLXE
KXK
EXL
LXSNDG
GXY
FXE
N
30
XV
NXIV
V
E
XL
EXE
EXVA
AXA
EXNSD
DXG
G
H
XE
VXQDE
D
40
XLA
AXL
LXE
NXL
KXL
LXE
QXE
A GXK
GXP
Y
50
nostruct-align/3chy.t2k w0.5
origami with strings – p.12/49
Secondary structure Prediction
Instead of predicting the entire structure, we can predictlocal properties of the structure.
What local properties do we choose?
We want properties that are well-conserved throughevolution, easily predicted, and useful for finding andaligning templates.
One popular choice is a 3-valued helix/strand/otheralphabet—we have investigated many others. Typically,predictors get about 80% accuracy on 3-stateprediction.
Many machine-learning methods have been applied tothis problem, but the most successful is neuralnetworks.
origami with strings – p.13/49
CASPCompetition Experiment
Everything published in literature “works”
CASP set up as true blind test of prediction methods.
Sequences of proteins about to be solved released toprediction community.
Predictions registered with organizers.
Experimental structures compared with solution byassessors.
“Winners” get papers in Proteins: Structure, Function, andBioinformatics.
origami with strings – p.14/49
Predicting Local Structure
Want to predict some local property at each residue.
Local property can be emergent property of chain (suchas being buried or being in a beta sheet).
Property should be conserved through evolution (atleast as well as amino acid identity).
Property should be somewhat predictable (we gaininformation by predicting it).
Predicted property should aid in fold-recognition andalignment.
For ease of prediction and comparison, we look only atdiscrete properties (alphabets of properties).
origami with strings – p.15/49
Using Neural Net
We use neural nets to predict local properties.
Input is profile with probabilities of amino acids at eachposition of target chain, plus insertion and deletionprobabilities.
Output is probability vector for local structure alphabetat each position.
Each layer takes as input windows of the chain in theprevious layer and provides a probability vector in eachposition for its output.
We train neural net to maximize∑log(P (correct output)).
origami with strings – p.16/49
Neural Net
Typical net has 4 layers and 6471 weight parameters:input/pos window output/pos weights
22 5 15 1665
15 7 15 1590
15 9 15 2040
15 13 6 1176
o oo
o oo
o o
o o o
o o
Hidden Layer 2
P(E) P(B) P(L)
Inputs
Output Layer
Hidden Layer 3
Hidden Layer 1
o
o
9
13
Hidden Layer 2 15 units/position
Hidden Layer 3 25 units/position
Output Layer 3−12 units/position
5
7Hidden Layer 1 15 units/position
Input layer 22 values/position
origami with strings – p.17/49
DSSP
DSSP is a popular program to define secondarystructure.
7-letter alphabet: EBGHSTLE = β strandB = β bridgeG = 310 helixH = α helixI = π helix (very rare, so we lump in with H)S = bendT = turnL = everything else (DSSP uses space for L)
origami with strings – p.18/49
STR: Extension to DSSP
Yael Mandel-Gutfreund noticed that parallel andanti-parallel strands had different hydrophobicitypatterns, implying that parallel/antiparallel can bepredicted from sequence.
We created a new alphabet, splitting DSSP’s E into 6letters:
A
M
P
E
Z
Q
origami with strings – p.19/49
HMMSTR φ-ψ alphabet
For HMMSTER, Bystroff did k-means classification ofφ-ψ angle pairs into 10 classes (plus one class for cispeptides).
We used just the 10 classes, ignoring the ω angle.
origami with strings – p.20/49
ALPHA11: α angle
Backbone geometry can be mostly summarized withone angle per residue:
CA(i−1)
CA(i)
CA(i+1)
CA(i+2)
We discretize into 11 classes:
0
0.002
0.004
0.006
0.008
0.01
0.012
0.014
8 31 58 85 140 165 190 224 257 292 343
G H I S T A B C D E F
origami with strings – p.21/49
de Brevern’s Protein Blocks
Clustered on 5-residue window of φ-ψ angles:
origami with strings – p.22/49
Burial alphabets
Our second set of investigations was for a sampling of themany burial alphabets, which are discretizations of variousaccessibility or burial measures:
solvent accessible surface area
relative solvent accessible surface area
neighborhood-count burial measures
origami with strings – p.23/49
Solvent Accessibility
Absolute SA: area in square Ångstroms accessible to awater molecule, computed by DSSP.
Relative SA: Absolute SA/ max SA for residue type(using Rost’s table for max SA).
1e-05
0.0001
0.001
0.01
0.1
17 24 46 71 106
Freq
uenc
y of
occ
urre
nce
solvent accessibility
AB C D E F G
origami with strings – p.24/49
Burial
Define a sphere for each residue.
Count the number of atoms or of residues within thatsphere.
Example: center= Cβ, radius=14Å, count= Cβ, quantizein 7 equi-probable bins.
1e-05
0.0001
0.001
0.01
0.1
27 34 40 47 55 66
Freq
uenc
y of
occ
urre
nce
burial
A B C D E F G
origami with strings – p.25/49
Mutual Information
Mutual information between two random variables(letters of alphabet):
MI(X,Y ) =∑
i,j
P (i, j) logP (i, j)
P (i)P (j),
We look at mutual information between differentalphabets at same position in protein. (redundancy)
We look at mutual information with one alphabetbetween corresponding positions on alignments ofsequences.
origami with strings – p.26/49
Information Gain
Information gain is how much more we know about avariable after making a prediction.
I(X) = average logP̂i(Xi)
P0(Xi)
P̂i is predicted probability vector for position i
Xi is actual observation at position i
P0 is background probability vector
origami with strings – p.27/49
Conservation and Predictability
conservation predictability
alphabet MI info gain
Name size entropy with AA mutual info per residue Q|A|str 13 2.842 0.103 1.107 1.009 0.561
protein blocks 16 3.233 0.162 0.980 1.259 0.579
stride 6 2.182 0.088 0.904 0.863 0.663
DSSP 7 2.397 0.092 0.893 0.913 0.633
stride-EHL 3 1.546 0.075 0.861 0.736 0.769
DSSP-EHL 3 1.545 0.079 0.831 0.717 0.763
CB-16 7 2.783 0.089 0.682 0.502
CB-14 7 2.786 0.106 0.667 0.525
CB-12 7 2.769 0.124 0.640 0.519
rel SA 7 2.806 0.183 0.402 0.461
abs SA 7 2.804 0.250 0.382 0.447
origami with strings – p.28/49
Multi-track HMMs
We can also use alignments to build a two- or three-tracktarget HMM:
Amino-acid track (created from the multiple alignment).
Local-structure track(s) with probabilities from neuralnet.
Can align template (AA+local) to target model.
AA
start stop
AA
2ry
AA AA AA
2ry
2ry2ry2ry
origami with strings – p.29/49
What is second track looking for?
1
2
3
XTCG
101
XTCAXTC
SXE
G
E
XTCEYXCEVXTCE
VXETC
K
T
XETC
PXETC
F
110
XTCTXTGHA
H
XGHAXHTXHLXHEXHEXHKXHLXHN
120
XHKXHIXHFXHEXTCHKXTCHLXTHCGXCM
1
2
3
XTCE
G
51
XEF
E
XEVXEIXESXTCEDXTEC
W
T
XETC
NXECTMXGCTP
60
XCTNXCTMXTC
DXCTHG
H
XHLXHEXHLXHLXHKXHT
70
XHIXHRXCTHAXTCHD
T
XTH
GXCHT
AXCGTMXGCTSXCTAXCET
L
80
XTCEPXCEV
E
XELXEMXEVXTCETXETC
A
T
XCT
EXCT
AXHTC
K
90
XHK
H
XHEXHNXHIXHIXHAXHAXHAXHQXCHA
100
1
2
3
XTCA
1
XTC
D
T
XCT
KXECT
EX
TEC
LX
TCEKXEF
E
XELXEVXCEV
10
X
TCE
DXETCDXHTCFXTGHS
H
XHTXHMXHRX
HRX
HIX
HV
20
XHRXHNXHLXHLXCTHKXTCHEXTCH
LXTCGXETCFXTCEN
30
XCENXEV
E
XCEEXCEEXCTEAX
ECT
EXTCDXCTHG
H
XHVXHD
40
XHAXHLXHNXHKXHLXGTHQXTCHAXHTC
GXTCGXTCY
50
nostruct-align/3chy.t2k EBGHTL
origami with strings – p.30/49
Target-model Fold Recognition
Find probable homologs of target sequence and makemultiple alignment.
Make secondary structure probability predictions basedon multiple alignment.
Build an HMM based on the multiple alignment andpredicted 2ry structure (or just on multiple alignment).
Score sequences and secondary structure sequencesfor proteins that have known structure (all sequences forAA-only, 8,000-11,000 representatives for multi-track).
Select the best-scoring sequence(s) to use astemplates.
origami with strings – p.31/49
Template-library Fold Recognition
Build an HMM for each protein in the template library,based on the template sequence (and any homologsyou can find).
The HMM library has over 12,000 templates from PDB.
For the fold-recognition problem, structure informationcan be used in building these models (though wecurrently don’t).
Score target sequence with all models in the library.
Select the best-scoring model(s) to use as templates.
origami with strings – p.32/49
Combined SAM-Txx method
template HMMs
combined scores
target model scores template model scores
template alignments
template sequencestarget sequence
target alignment
target HMM
local structureprediction
Combine the costs from the template library search andthe target library searches using different local structurealphabets.
Choose one of the many alignments of the target andtemplate (whatever method gets best results in testing).
http://www.soe.ucsc.edu/research/compbio/SAM_T06/T06-query.html
origami with strings – p.33/49
Fold recognition results
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.001 0.01 0.1 1 10
Tru
e po
sitiv
es /
poss
ible
fold
mat
ches
False positives per query
Fold Recognition for 1415 SAM-T05 HMMs with w(amino-acid)=1
average w(near-backbone-11)=0.6 w(n_sep)=0.1average w(near-backbone-11)=0.6 w(str2)=0.25
w(near-backbone-11)=0.4w(n_sep)=0.1
w(str2)=0.1aa-only
T2K w(str2)=0.2T2K aa-only
origami with strings – p.34/49
Comparative modeling: T0232
RMSD= 5.158Å all-atom, 4.463Å Cα
origami with strings – p.35/49
T0298 domain 2 (130–315)
RMSD= 2.468Å all-atom, 1.7567Å Cα, GDT=82.5%best model 1 submitted to CASP7 (red=real)
origami with strings – p.36/49
Comparative modeling: T0348
RMSD= 11.8 Å Cα, GDT=58.2% (cartoon=real)best model 1 by CASP7 GDT, Robetta1 slightly better.
origami with strings – p.37/49
Undertaker
Undertaker is UCSC’s attempt at a fragment-packingprogram.
Named because it optimizes burial.
Representation is 3D coordinates of all heavy atoms(not hydrogens).
Can replace backbone fragments (a la Rosetta) or fullalignments—chain need not remain contiguous.
Conformations can borrow heavily from fold-recognitionalignments, without having to lock in a particularalignment.
Use genetic algorithm with many conformation-changeoperators to do stochastic search.
origami with strings – p.38/49
Fragfinder
Fragments are provided to undertaker from 3 sources:
Generic fragments (2-4 residues, exact sequencematch) are obtained by reading in 500–1000 PDB files,and indexing all fragments.
Long specific fragments (and full alignments) areobtained from the various target and templatealignments generated during fold recognition.
Medium-length fragments (9–12 residues long) forevery position are generated from the HMMs withfragfinder , a new tool in the SAM suite.
origami with strings – p.39/49
Cost function
Cost function is modularly designed—easy to add orremove terms.
Cost function can include predictions of local propertiesby neural nets.
Clashes and hydrogen bonds are importantcomponents.
There are over 40 cost function components available:burial functions, disulfides, contact order, rotamerpreference, radius of gyration, constraints, ...
origami with strings – p.40/49
Target T0201 (NF)
We tried forcing various sheet topologies and selected4 by hand.
Model 1 has right topology (5.912Å all-atom, 5.219ÅCα).
Unconstrained cost function not good at choosingtopology (two strands curled into helices).
Helices were too short.
origami with strings – p.41/49
Target T0201 (NF)
origami with strings – p.42/49
Contact prediction
Use mutual information between columns.
Thin alignments aggressively (30%, 35%, 40%, 50%,62%).
Compute e-value for mutual info (correcting forsmall-sample effects).
Compute rank of log(e-value) within protein.
Feed log(e-values), log rank, contact potential, jointentropy, and separation along chain for pair, andamino-acid profile, predicted burial, and predictedsecondary structure for each residue of pair into aneural net.
origami with strings – p.43/49
Evaluating contact prediction
Two measures of contact prediction:
Accuracy: ∑χ(i, j)∑
1
(favors short-range predictions, where contactprobability is higher)
Weighted accuracy:
∑ χ(i,j)
Prob“contact|separation=|i−j|
”∑
1
(1 if predictions no better than chance based onseparation).
origami with strings – p.44/49
Contact prediction results
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0.4
0.45
0.5
0.01 0.1 1
true
pos
itive
s/pr
edic
ted
predictions/residue
Accuracy of contact prediction, by protein
Neural netthin62, e-valuethin62, raw mi
0
5
10
15
20
25
30
0.01 0.1 1
avg
is_c
onta
ct/p
rob(
cont
act|s
ep)
predictions/residue
Weighted-accuracy of contact prediction, by protein
Neural netthin62, e-valuethin62, raw mi
origami with strings – p.45/49
Target T0230 (FR/A)
Good except for C-terminal loop and helix floppedwrong way.
We have secondary structure right, including phase ofbeta strands.
Contact prediction helped, but we put too much weighton it—decoys fit predictions better than real structuredoes.
origami with strings – p.46/49
Target T0230 (FR/A)
origami with strings – p.47/49
Target T0230 (FR/A)
Real structure with contact predictions:
origami with strings – p.48/49
Web sites
These slides:
http://www.soe.ucsc.edu/˜karplus/papers/origami-with-strings-jan-2007.pdf
SAM-T06 prediction server:
http://www.soe.ucsc.edu/research/compbio/SAM_T06/T06-query.html
CASP6 all our results and working notes:
http://www.soe.ucsc.edu/˜karplus/casp6/
Predictions for all yeast proteins:
http://www.soe.ucsc.edu/˜karplus/yeast/
UCSC bioinformatics (research and degree programs) info:
http://www.soe.ucsc.edu/research/compbio/
SAM tool suite info:
http://www.soe.ucsc.edu/research/compbio/sam.htmlorigami with strings – p.49/49