Bayesian Classification of Protein Data
Thomas Huber [email protected]
Computational Biology and Bioinformatics Environment (ComBinE)
Department of Mathematics, The University of Queensland
Today’s talk
• Protein score functions from mining protein data
– Bayesian classification
• A toy example
• A protein scoring function for fold recognition
• Where are score/energy functions useful?
– A few examples
Why do we care about Protein Structures/Prediction?
• Academic curiosity?
– Understanding how nature works
• Urgency of prediction
– 10^4 structures are determined
• Insignificant compared to all proteins
– sequencing = fast & cheap
– structure determination = hard & expensive
[Figure: plot with curves labelled "Transistors in Intel processors", "TrEMBL sequences (computer annotated)", "SwissProt sequences (annotated)" and "structures in PDB"]
Three basic choices in (molecular) modelling
• Representation
– Which degrees of freedom are treated explicitly
• Scoring
– Which scoring function (force field)
• Searching
– Which method to search or sample conformational space
Protein Scoring Functions from Mining Protein Data
• Classification Theory
– Find a set of classes and their descriptors (a classification) for n data points
– Each data point has q attributes (shape, amino acid type, etc.): $X_i = \{x_{i1}, x_{i2}, \ldots, x_{iq}\}$
– $X_i \in$ one of $m$ classes from $\{c_1, c_2, \ldots, c_m\}$
• Theory of finite mixtures
– Class: attribute probability distribution of all members
Bayesian approach
$$P(X_i \in c_j \mid X_i) = \frac{P(X_i \mid X_i \in c_j)\, P(X_i \in c_j)}{P(X_i)}$$
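As a minimal sketch of the Bayes' rule step above, the posterior class membership of one data point can be computed from class priors and class-conditional likelihoods. The function name `class_posteriors` and all numbers are illustrative assumptions, not values from the talk.

```python
def class_posteriors(priors, likelihoods):
    """Return P(X_i in c_j | X_i) for every class j via Bayes' rule.

    priors[j]      = P(X_i in c_j)        (hypothetical values)
    likelihoods[j] = P(X_i | X_i in c_j)  (hypothetical values)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(X_i), the normalising constant
    return [value / evidence for value in joint]


# Three hypothetical classes: the second explains this data point best.
posteriors = class_posteriors(priors=[0.5, 0.3, 0.2],
                              likelihoods=[0.02, 0.10, 0.01])
print(posteriors)
```

The denominator is simply the sum of the numerators over all classes, so the memberships always add up to one.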
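• Simplifications
– Stating a simplified model $S$
$$P(X_i \in c_j \mid X_i, S) = \frac{P(X_i \mid X_i \in c_j, S)\, P(X_i \in c_j \mid S)}{P(X_i \mid S)}$$
– Assume attributes are independently distributed
$$P(X_i \mid X_i \in c_j, S) = \prod_{k=1}^{q} P(x_{ik} \mid X_i \in c_j, S)$$
• $P(X_i \in c_j \mid S)$ requires class description
– Expectation Maximization (EM)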
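The two simplifications above (independent attributes, class descriptions found by EM) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk: each data point has one real-valued and one discrete attribute, the starting values are chosen at quantiles, and the data are synthetic. The names `em_mixture` and `gaussian_pdf` are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_mixture(data, n_classes, n_iter=50):
    """Fit a finite mixture to data points of the form (real_value, symbol).

    Within a class the two attributes are treated as independent, so the
    class likelihood is a Gaussian term times a discrete-table term.
    """
    n = len(data)
    xs = sorted(x for x, _ in data)
    symbols = sorted({s for _, s in data})
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, 1e-3)
    # Deterministic start: means at evenly spaced quantiles, broad variances.
    means = [xs[(2 * j + 1) * n // (2 * n_classes)] for j in range(n_classes)]
    variances = [grand_var] * n_classes
    tables = [{s: 1.0 / len(symbols) for s in symbols} for _ in range(n_classes)]
    weights = [1.0 / n_classes] * n_classes

    for _ in range(n_iter):
        # E-step: posterior class membership of every point (Bayes' rule).
        resp = []
        for x, s in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) * tables[j][s]
                     for j in range(n_classes)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: re-estimate each class description from soft memberships.
        for j in range(n_classes):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, (x, _) in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, (x, _) in zip(resp, data)) / nj
            variances[j] = max(var, 1e-3)  # floor keeps a class from collapsing
            for s in symbols:
                count = sum(r[j] for r, (_, t) in zip(resp, data) if t == s)
                tables[j][s] = (count + 1e-3) / (nj + 1e-3 * len(symbols))
    return weights, means, variances, tables


# Synthetic data (hypothetical): two clusters with different symbol preferences.
rng = random.Random(0)
data = []
for _ in range(200):
    data.append((rng.gauss(-60.0, 10.0), "L" if rng.random() < 0.8 else "G"))
    data.append((rng.gauss(120.0, 10.0), "G" if rng.random() < 0.8 else "L"))

weights, means, variances, tables = em_mixture(data, n_classes=2)
```

The E-step is exactly the posterior membership formula, and the product over attributes appears as the Gaussian term multiplied by the table lookup.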
How many classes
• Again Bayes' rule
$$P(m \mid X) = \frac{P(m)\, P(X \mid m)}{P(X)}$$
• P(m) favours smaller number of classes
– No over-fitting of data (like with maximum likelihood methods)
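A small sketch of how this rule trades fit against the number of classes. The log-evidence values and the form of the prior are assumptions made purely for illustration; the talk does not state which prior P(m) was used.

```python
import math


def posterior_over_m(log_evidence, log_prior):
    """Return P(m | X) for each candidate m from log P(X | m) and log P(m)."""
    log_joint = {m: log_evidence[m] + log_prior[m] for m in log_evidence}
    peak = max(log_joint.values())
    # Subtract the peak before exponentiating (log-sum-exp) to avoid underflow.
    norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint.values()))
    return {m: math.exp(v - norm) for m, v in log_joint.items()}


# Hypothetical numbers: the fit improves with m, but ever more slowly.
log_evidence = {2: -5230.0, 3: -5105.0, 4: -5060.0, 5: -5041.0, 6: -5040.0, 7: -5039.5}
# Assumed (unnormalised) prior that penalises each extra class by exp(-3).
log_prior = {m: -3.0 * m for m in log_evidence}

posterior = posterior_over_m(log_evidence, log_prior)
best_m = max(posterior, key=posterior.get)
fit_only_m = max(log_evidence, key=log_evidence.get)
print(best_m, fit_only_m)
```

With these made-up numbers, ranking by fit alone would pick the largest m, while the prior shifts the choice to a smaller number of classes.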
A Toy Example: Dihedral preference of Valine
• Four interesting degrees of freedom
– φ, ψ dihedral angles
– Adjacent amino acid types (positions i-1 and i+1)
• Data: 893 non-redundant proteins
– 12074 four-dimensional data points
Valine Data Classification
• AutoClass classification
– Model: Gaussian distribution for φ/ψ, discrete probabilities for amino acids
– Total of 50 tries with #classes in [2:11]
– Each try refined until fully converged
• Best classification has 5 classes
Amino Acid Attribute Vectors of α-helix Classes
• Log-Preferences
AA  Class 1 (i-1)  Class 1 (i+1)  Class 2 (i-1)  Class 2 (i+1)
G -0.245 0.044 -0.332 -0.935
A 0.340 0.474 0.347 0.005
V 0.218 -1.100 -1.560 0.548
L 0.483 -0.636 -2.010 0.869
I 0.385 -1.580 -4.530 0.679
F 0.384 -0.575 -1.700 0.579
P -0.297 -0.916 0.003 -2.350
S -0.227 0.169 0.415 -0.759
T -0.127 -0.280 -0.131 -0.444
C 0.180 -0.076 -1.110 0.357
M 0.436 -0.175 -1.580 0.820
W 0.456 -0.123 -1.490 -0.370
Y -0.184 -0.389 -0.145 0.260
N -0.428 0.008 0.335 -0.050
Q -0.081 0.524 0.406 -0.581
D 0.020 0.266 0.440 -0.805
E -0.438 0.540 0.602 -1.020
K -0.981 0.300 0.634 -0.362
R -0.545 0.494 0.405 -0.860
H -0.337 0.017 0.313 -0.459
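To help read the table: the talk does not define "log-preference", so the sketch below assumes the common convention of the natural log of the ratio between a residue's frequency in a class and its background frequency. The frequencies are hypothetical.

```python
import math


def log_preference(freq_in_class, freq_background):
    """ln(P(aa | class) / P(aa)): > 0 means enriched, < 0 means depleted."""
    return math.log(freq_in_class / freq_background)


# Hypothetical frequencies, not taken from the talk's data.
enriched = log_preference(0.12, 0.08)  # residue over-represented in the class
depleted = log_preference(0.01, 0.05)  # residue strongly under-represented
print(round(enriched, 3), round(depleted, 3))
```

Under this assumed definition, a strongly negative entry such as I at position i-1 in Class 2 means the residue is rarely seen there.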
Re-invention of the Wheel
• Textbook secondary structure pattern
– Helices are likely on outside of proteins
– i, i+3 and i+4 hydrophobic interface
From C.-I. Branden and J. Tooze, Introduction to Protein Structure
Fragment-based Protein Scoring
• Find classification for fragments of size 7 residues
– 237566 fragments (1494 non-redundant protein chains)
– 28 descriptors
  • 7 amino acid types
  • 14 φ/ψ dihedral angles
  • 7 neighbour counts (one per amino acid)
– 200 CPU hours on National Facility computers
– Result: 325 classes (modelling the probability distribution of native fragments)
• Use this classification to evaluate likelihood of a fragment sequence-structure match
• Total score = sum of fragment scores
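The summation over fragments can be sketched as below. The use of overlapping windows and the toy `hydrophobic_count` scorer are assumptions for illustration only; the real fragment score in the talk comes from the 325-class model and also depends on the structure.

```python
def fragments(sequence, size=7):
    """Yield every overlapping window of `size` residues."""
    for start in range(len(sequence) - size + 1):
        yield sequence[start:start + size]


def total_score(sequence, fragment_score, size=7):
    """Sum a per-fragment score over all windows of the sequence."""
    return sum(fragment_score(frag) for frag in fragments(sequence, size))


def hydrophobic_count(fragment):
    """Placeholder scorer (hypothetical): number of hydrophobic residues."""
    return sum(1 for residue in fragment if residue in "AVLIFM")


windows = list(fragments("MKVLAAGIVGLLA"))
score = total_score("MKVLAAGIVGLLA", hydrophobic_count)
print(len(windows), score)
```

A sequence of length L yields L - 6 windows of seven residues, and a sequence shorter than seven yields none.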
Fold Recognition = Computer Matchmaking
• Structure Disco
Does it work?
• Discrimination (TIM 1amk_)
• Generalisation
Sequence-Structure Matching: The search problem
• Gapped alignment = combinatorial nightmare
Why is Fold Recognition better than Sequence Comparison?
• Comparison is done in structure space not in sequence space
Finding Remote Homologues with sausage
• 572 sequence-structure pairs
• Structures are similar (FSSP)
• > 70% structurally aligned
• < 20% sequence identity
[Figure: alignment quality (arb. units) versus sequence similarity weight (arb. units)]
A Real Case Example: RNA-dependent RNA polymerases
• Dengue virus
• Bacteriophage φ6
Is this Yet Another Profile Method?
• Yes, but a much more general profile method
– Profile is not residue based (like profile-like threading force fields)
– Profiles not for protein families (like in HMMs or Ψ-Blast)
– BUT local sequence profiles for optimally chosen classes of fragments
• Local profiles can be arbitrarily assembled
– Extreme flexibility
• Sequence-structure alignment (= assembling best profile matches)
– Deterministic, using dynamic programming
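For readers unfamiliar with the dynamic programming step, here is a generic Needleman-Wunsch style sketch of a gapped sequence-structure alignment. It is not sausage's actual algorithm: the score matrix, the linear gap penalty and the function name `align` are assumptions chosen for illustration.

```python
def align(score, gap=-1.0):
    """Global alignment maximising the summed match scores.

    score[i][j] is the (hypothetical) score for placing sequence residue i
    at structure position j; `gap` is a linear gap penalty (an assumption).
    Returns (best_total, pairs) where pairs lists the matched (i, j) indices.
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # match i with j
                best[i - 1][j] + gap,                      # residue i unmatched
                best[i][j - 1] + gap,                      # position j unmatched
            )
    # Trace back from the corner to recover one optimal set of matches.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best[n][m], pairs[::-1]


# Hypothetical 3-residue sequence against a 4-position structure.
score = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
]
total, pairs = align(score, gap=-0.5)
print(total, pairs)
```

The table has only (n+1) x (m+1) cells, which is why the search is deterministic and cheap even though the number of possible gapped alignments grows combinatorially.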
People
• sausage
– Andrew Torda (RSC)
– Oliver Martin (RSC)
• GlnB/GlnK, RdR polymerases
– Subhash Vasudevan (JCU)
Sausage and Cassandra freely available: http://rsc.anu.edu.au/~torda