Class 7: Protein Secondary Structure
Protein Structure
Amino-acid chains can fold to form 3-dimensional structures
Proteins are sequences that have a (more or less) stable 3-dimensional configuration
Why Is Structure Important?
The structure a protein takes is crucial for its function:
Forms “pockets” that can recognize an enzyme substrate
Situates side chains of specific groups to co-locate, forming areas with desired chemical/electrical properties
Creates firm structures such as collagen, keratins, and fibroins
Determining Structure
X-ray and NMR methods allow us to determine the structure of proteins and protein complexes
These methods are expensive and difficult; it can take several work-months to process one protein
A centralized database (PDB) contains all solved protein structures
XYZ coordinates of atoms, within a specified precision
~11,000 solved structures
Structure is Sequence Dependent
Experiments show that for many proteins, the 3-dimensional structure is a function of the sequence
Force the protein to lose its structure by introducing agents that change the environment
After the sequence is put back in water, the original conformation/activity is restored
However, for complex proteins, there are cellular processes that “help” in folding
Secondary Structure
α-helices and β-strands
α-Helix
Single protein chain; shape maintained by intramolecular H bonding between -C=O and H-N-
Hydrogen Bonds in α-Helices
β-Strands Form Sheets
Parallel or anti-parallel
These sheets are held together by hydrogen bonds across strands
Angular Coordinates
Secondary structures force specific backbone angles (φ, ψ) between residues
Ramachandran Plot
We can relate these angles to types of structures
Define "secondary structure"
3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE
DSSP EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
DSSP = Dictionary of Secondary Structure of Proteins
DSSP symbols:
H = α-helix: backbone angles near (-50, -60) and an i -> i+4 H-bonding pattern
E = extended strand: backbone angles near (-120, +120) with β-sheet H-bonds (parallel and anti-parallel are not distinguished)
S = bend; B = β-bridge (an isolated backbone H-bond)
T = turn (specific sets of angles and one i -> i+3 H-bond)
G = 3-10 helix or turn (i -> i+3 H-bonds)
I = π-helix (i -> i+5 H-bonds) (rare!)
_ = unclassified, none of the above: a generic loop, or a β-strand with no regular H-bonding
Prediction of Secondary Structure
Input: amino-acid sequence
Output: an annotation sequence over three classes:
alpha, beta, other (sometimes called coil/turn)
Measure of success: Percentage of residues that were correctly labeled
Accuracy of 3-state predictions
Q3-score = % of 3-state symbols that are correct
Measured on a "test set"
Test set == an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested.
Best methods:
PHD (Burkhard Rost) -- 72-74% Q3
Psi-pred (David T. Jones) -- 76-78% Q3
True SS:    EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT
Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL
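To make the Q3 computation concrete, here is a minimal scorer (an illustration, not the lecture's code). The 8-state-to-3-state mapping it uses (H/G/I -> H, E/B -> E, everything else -> coil) is one common convention.

```python
# Minimal Q3 scorer: collapse 8-state DSSP symbols to 3 states,
# then score a prediction against the true labels.
DSSP_TO_3STATE = {"H": "H", "G": "H", "I": "H",   # helical states -> H
                  "E": "E", "B": "E"}             # strand states  -> E
                                                  # everything else -> "L"

def to_three_state(dssp: str) -> str:
    return "".join(DSSP_TO_3STATE.get(s, "L") for s in dssp)

def q3(true_ss: str, pred_ss: str) -> float:
    """Percentage of residues whose 3-state label is predicted correctly."""
    assert len(true_ss) == len(pred_ss)
    return 100.0 * sum(t == p for t, p in zip(true_ss, pred_ss)) / len(true_ss)

true_dssp = "EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT"
pred      = "EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL"
print(f"Q3 = {q3(to_three_state(true_dssp), pred):.1f}%")
```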
What can you do with a secondary structure prediction?
(1) Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, e.g. a helix or a strand.
(2) Find out whether a helix or strand is extended/shortened in the homolog.
(3) Model a large insertion or terminal domain
(4) Aid tertiary structure prediction
Statistical Methods
From PDB database, calculate the propensity for a given amino acid to adopt a certain ss-type
P_i(\alpha) = \frac{P(\alpha \mid aa_i)}{p(\alpha)} = \frac{p(\alpha, aa_i)}{p(\alpha)\, p(aa_i)}
Example: #Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500
p(α, Ala) = 500/20,000 = 0.025; p(α) = 4,000/20,000 = 0.2; p(Ala) = 2,000/20,000 = 0.1
P_Ala(α) = 0.025 / (0.2 × 0.1) = 1.25
Used in Chou-Fasman algorithm (1974)
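As an illustration of how such a propensity table might be estimated from labeled data, here is a short sketch (not the original Chou-Fasman code); `training_pairs` is a hypothetical list of (sequence, 3-state string) pairs.

```python
# Illustrative sketch: estimating helix propensities from labeled data.
from collections import Counter

def helix_propensities(training_pairs):
    aa_counts, helix_counts = Counter(), Counter()
    total = helix_total = 0
    for seq, ss in training_pairs:
        for aa, state in zip(seq, ss):
            aa_counts[aa] += 1
            total += 1
            if state == "H":
                helix_counts[aa] += 1
                helix_total += 1
    # P_aa(H) = p(H, aa) / (p(H) * p(aa))
    p_helix = helix_total / total
    return {aa: (helix_counts[aa] / total) / (p_helix * (n / total))
            for aa, n in aa_counts.items()}

# With the slide's numbers (2,000 Ala of 20,000 residues, 4,000 helix
# residues, 500 Ala in helix) this formula gives P_Ala(H) = 1.25.
pairs = [("AAEL", "HHHL"), ("GAAG", "LHHL")]   # toy data
print(helix_propensities(pairs)["A"])
```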
Chou-Fasman: Initiation
Residue  T     S     P     T     A     E     L     M     R     S     T     G
P(H)     0.69  0.77  0.57  0.69  1.42  1.51  1.21  1.45  0.98  0.77  0.69  0.57
Identify regions where 4 out of 6 contiguous residues have P(H) > 1.00: an “alpha-helix nucleus”
Chou-Fasman: Propagation
(Using the same P(H) table as above:)
Extend the helix in both directions until a set of four contiguous residues has an average P(H) < 1.00.
Chou-Fasman Prediction
Predict as α-helix: a segment with E[P(H)] > 1.03 and E[P(H)] > E[P(E)], not including proline
Predict as β-strand: a segment with E[P(E)] > 1.05 and E[P(E)] > E[P(H)]
Others are labeled as turns.
(Various extensions appear in the literature)
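The initiation and propagation steps above can be sketched in a few lines. This toy version covers the helix track only, using the slide's P(H) values written as decimals; the full algorithm also scores strands and turns and resolves overlapping assignments.

```python
# Minimal sketch of Chou-Fasman helix initiation/propagation.
P_H = {"T": 0.69, "S": 0.77, "P": 0.57, "A": 1.42, "E": 1.51,
       "L": 1.21, "M": 1.45, "R": 0.98, "G": 0.57}

def nuclei(seq, win=6, need=4):
    """Initiation: windows where >= 4 of 6 residues have P(H) > 1.00."""
    return [i for i in range(len(seq) - win + 1)
            if sum(P_H[aa] > 1.00 for aa in seq[i:i + win]) >= need]

def extend(seq, start, win=6):
    """Propagation: grow the nucleus in both directions until a set of
    four contiguous residues has average P(H) < 1.00."""
    lo, hi = start, start + win                    # helix is seq[lo:hi]
    while lo > 0 and sum(P_H[a] for a in seq[lo - 1:lo + 3]) / 4 >= 1.00:
        lo -= 1
    while hi < len(seq) and sum(P_H[a] for a in seq[hi - 3:hi + 1]) / 4 >= 1.00:
        hi += 1
    return lo, hi

seq = "TSPTAELMRSTG"                               # the slide's example
print({extend(seq, i) for i in nuclei(seq)})       # -> {(2, 10)}
```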
Achieved accuracy: around 50%
Shortcoming of this method: it ignores the sequence context when predicting from individual amino acids
We would like to use the sequence context as an input to a classifier
There are many ways to address this; the most successful to date are based on neural networks
A Neuron
Artificial Neuron
[Diagram: inputs a_1, ..., a_k enter with weights W_1, ..., W_k; the unit produces a single output]
Output = f\left(b + \sum_i W_i a_i\right)
•A neuron is a multiple-input -> single output unit
•Wi = weights assigned to inputs; b = internal “bias”
•f = output function (linear, sigmoid)
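A minimal sketch of such a neuron, here with the sigmoid as the output function:

```python
import numpy as np

# Artificial neuron: output = f(b + sum_i W_i * a_i)
def neuron(a, w, b, f=lambda x: 1.0 / (1.0 + np.exp(-x))):
    """a: input vector, w: weight vector, b: bias, f: output function
    (sigmoid here; a linear neuron would use the identity)."""
    return f(b + np.dot(w, a))

print(neuron(np.array([1.0, 0.0, 1.0]), np.array([0.5, -0.3, 0.8]), b=-0.2))
```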
Artificial Neural Network
[Diagram: a feed-forward network; inputs a_1, a_2, a_3, ... pass through weights W_1, W_2, ..., W_k to a hidden layer, whose outputs feed output units o_1, ..., o_m]
•Neurons in hidden layers compute “features” from outputs of previous layers
•Output neurons can be interpreted as a classifier
Example: Fruit Classifier
        Shape    Texture  Weight  Color
Apple   ellipse  hard     heavy   red
Orange  round    soft     light   yellow
Qian-Sejnowski Architecture
[Diagram: a sliding window of residues s_{i-w}, ..., s_i, ..., s_{i+w} forms the input layer; a hidden layer feeds three output units o_α, o_β, o_coil]
The predicted label for residue i is \arg\max_s o_s, where
o_s = \lambda\left(b_s + \sum_k u_{k,s} h_k\right), \quad h_k = \lambda\left(c_k + \sum_{j,a} w_{j,a,k} x_{j,a}\right), \quad \lambda(x) = \frac{1}{1+e^{-x}}
Neural Network Prediction
A neural network defines a function from inputs to outputs
Inputs can be discrete or continuous valued
In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it
The structure element is determined by max(o_α, o_β, o_coil)
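A shape-only sketch of such a window-based predictor, with untrained random weights standing in for trained Qian-Sejnowski parameters; the one-hot window encoding over 20 amino acids plus a spacer is an assumption of this sketch.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY-"          # 20 amino acids + '-' spacer
W = 6                                  # half-window; window size 2W+1
N_IN, N_HID, N_OUT = (2 * W + 1) * len(AA), 40, 3

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (N_HID, N_IN)), np.zeros(N_HID)
W2, b2 = rng.normal(0, 0.1, (N_OUT, N_HID)), np.zeros(N_OUT)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def encode_window(seq, i):
    """One-hot encode the window of size 2W+1 centered on residue i;
    positions past the ends of the chain use the spacer '-'."""
    x = np.zeros(N_IN)
    for k, j in enumerate(range(i - W, i + W + 1)):
        aa = seq[j] if 0 <= j < len(seq) else "-"
        x[k * len(AA) + AA.index(aa)] = 1.0
    return x

def predict(seq, i):
    h = sigmoid(b1 + W1 @ encode_window(seq, i))
    o = sigmoid(b2 + W2 @ h)
    return "HEL"[int(np.argmax(o))]    # argmax over (o_H, o_E, o_coil)

print(predict("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", 10))
```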
Training Neural Networks
By modifying the network weights, we change the function
Training is performed by:
Defining an error score on training pairs <input, output>
Performing gradient-descent minimization of the error score
The back-propagation algorithm allows us to compute the gradient efficiently
We have to be careful not to overfit the training data
Smoothing Outputs
The Qian-Sejnowski network assigns each residue a secondary structure by taking max(o_α, o_β, o_coil)
Some sequences of secondary structure labels are impossible (e.g., a single-residue helix)
To smooth the output of the network, another layer is applied on top of the three output units for each residue:
a second neural network, or a Markov model
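As a simplified stand-in for that smoothing layer (a rule-based pass, rather than the network or Markov model the slide refers to), one can at least relabel helix runs too short to be physical, since an α-helix needs at least ~4 residues to complete an i -> i+4 hydrogen bond:

```python
import re

# Rule-based smoothing (illustrative stand-in, not the lecture's method):
# relabel helix runs shorter than `min_helix` residues as coil.
def smooth(pred: str, min_helix: int = 4) -> str:
    pattern = r"(?<!H)H{1,%d}(?!H)" % (min_helix - 1)
    return re.sub(pattern, lambda m: "L" * len(m.group()), pred)

print(smooth("LLHHLLLHHHHHHLLEEE"))  # -> "LLLLLLLHHHHHHLLEEE"
```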
Success Rate
Variants of the neural network architecture and other methods achieved an accuracy of about 65% on unseen proteins
Depending on the exact choice of training/test sets
Breaking the 70% Threshold
An innovation that made a crucial difference uses evolutionary information to improve prediction
Key idea: structure is preserved more than sequence, and surviving mutations are not random
Suppose we find homologues (same structure) of the query sequence
The type of replacements at position i during evolution provides us with information about the role of residue i in the secondary structure
Nearest Neighbor Approach
Select a window around the target residue
Perform local alignment against sequences with known structure
The alignment weight matrix is chosen to match remote homologies
The alignment weight takes into account the secondary structure of the aligned sequence
Use max(n_α, n_β, n_c) or max(s_α, s_β, s_c)
Key: a scoring measure of evolutionary similarity
PHD Approach
Multi-step procedure:
Perform a BLAST search to find local alignments
Remove alignments that are “too close”
Perform a multiple alignment of the sequences
Construct a profile (PSSM) of amino-acid frequencies at each residue (see the sketch after this list)
Use this profile as input to the neural network
A second network performs “smoothing”
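A sketch of the profile-construction step: per-column amino-acid frequencies from a multiple alignment, the simplest form of PSSM input. Real PHD/PSI-BLAST profiles additionally use sequence weighting and pseudocounts.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def profile(msa):
    """msa: list of equal-length aligned sequences ('-' = gap).
    Returns an (L x 20) array of amino-acid frequencies per column."""
    L = len(msa[0])
    counts = np.zeros((L, len(AA)))
    for seq in msa:
        for i, aa in enumerate(seq):
            if aa in AA:               # skip gaps
                counts[i, AA.index(aa)] += 1
    totals = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(totals, 1)

msa = ["MKTAYI", "MRTAYV", "MKSAFI"]   # toy alignment
print(profile(msa)[0])                  # column 0: all M -> freq 1.0 for M
```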
PHD Architecture
Psi-pred: same idea
(Step 1) Run PSI-BLAST --> output a sequence profile
(Step 2) A 15-residue sliding window over the profile gives 315 inputs (15 positions x 21 values), multiplied by the weights of the first neural network. The output is 3 values (one for each state H, E or L) per position.
(Step 3) A 15-position window over those outputs gives 45 inputs, multiplied by the weights of the second neural network and summed. The output is the final 3-state prediction.
Performs slightly better than PHD
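To make the window arithmetic concrete, here is a shape-only sketch of the two-stage pipeline with untrained random weights and a toy profile (not the real PSIPRED parameters, which also include hidden layers):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 100                                   # protein length
profile = rng.random((L, 21))             # step 1: PSI-BLAST profile (toy)

def window(mat, i, w, pad):
    """Rows i-w..i+w of `mat`, padded past the ends, flattened."""
    rows = [mat[j] if 0 <= j < len(mat) else pad
            for j in range(i - w, i + w + 1)]
    return np.concatenate(rows)

net1 = rng.normal(0, 0.1, (3, 15 * 21))   # step 2: 315 inputs -> 3 outputs
net2 = rng.normal(0, 0.1, (3, 15 * 3))    # step 3: 45 inputs  -> 3 outputs

stage1 = np.array([net1 @ window(profile, i, 7, np.zeros(21)) for i in range(L)])
stage2 = np.array([net2 @ window(stage1, i, 7, np.zeros(3)) for i in range(L)])
pred = "".join("HEL"[k] for k in stage2.argmax(axis=1))
print(pred[:30])
```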
Other Classification Methods
Neural networks were used as the classifier in the methods described above.
We can apply the same idea with other classifiers, e.g. SVMs.
Advantages: effectively avoids overfitting; supplies a prediction confidence
SVM-based approach
Suggested by S. Hua and Z. Sun (2001)
Multiple sequence alignment from the HSSP database (same as PHD)
Sliding window of w positions, 21 values each --> 21·w input dimensions
Apply an SVM with an RBF kernel
Multi-class problem:
Training: one-against-others (e.g. H/~H, E/~E, L/~L) and binary (e.g. H/E) classifiers
Combining: maximum output score, a decision tree method, or a jury decision method (a sketch of the one-against-others scheme follows below)
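A minimal sketch of the one-against-others scheme with RBF-kernel SVMs, using scikit-learn on toy data (random features standing in for the 21·w profile window; not the Hua-Sun code or data):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.random((300, 21 * 11))            # 300 residues, toy w = 11 window
y = rng.choice(list("HEL"), size=300)     # toy 3-state labels

# One binary RBF-kernel SVM per class (H/~H, E/~E, L/~L); predict by
# taking the maximum output score (decision_function value).
clfs = {c: SVC(kernel="rbf").fit(X, (y == c).astype(int)) for c in "HEL"}

def predict(x):
    scores = {c: clf.decision_function(x.reshape(1, -1))[0]
              for c, clf in clfs.items()}
    return max(scores, key=scores.get)

print(predict(X[0]))
```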
Decision tree
Tree 1: H/~H? Yes -> H; No -> E/C? Yes -> E; No -> C
Tree 2: E/~E? Yes -> E; No -> C/H? Yes -> C; No -> H
Tree 3: C/~C? Yes -> C; No -> H/E? Yes -> H; No -> E
Accuracy on CB513 set
Classifier Q3 QH QE QC SOV
Max 72.9 74.8 58.6 79.0 75.4
Tree1 68.9 73.5 54.0 73.1 72.1
Tree2 68.2 72.0 61.0 69.0 71.4
Tree3 67.5 69.5 46.6 77.0 70.8
NN 72.0 74.7 57.7 77.4 75.0
Vote 70.7 73.0 74.7 76.6 73.2
Jury 73.5 75.2 60.3 79.5 76.2
State of the Art
Both PHD and nearest-neighbor methods get about 72%-74% accuracy
Both predicted well in CASP2 (1996); PSI-Pred is slightly better (around 76%) and gave the best predictions in CASP3 (1998)
Recent trend: combining classification methods
Failures: long-range effects (S-S bonds, parallel strands), chemical patterns, and wrong predictions at the ends of helices/strands