Bayesian Classification of Protein Data

Thomas Huber

Computational Biology and Bioinformatics Environment

ComBinE

Department of Mathematics, The University of Queensland

Today’s talk

• Protein score functions from mining protein data
– Bayesian classification

• A toy example

• A protein scoring function for fold recognition

• Where are score/energy functions useful?
– A few examples

Why do we care about Protein Structures/Prediction?

• Academic curiosity?
– Understanding how nature works

• Urgency of prediction
– 10^4 structures are determined
• insignificant compared to all proteins
– sequencing = fast & cheap
– structure determination = hard & expensive

[Figure: transistors in Intel processors; TrEMBL sequences (computer annotated); SwissProt sequences (annotated); structures in PDB]

Three basic choices in (molecular) modelling

• Representation
– Which degrees of freedom are treated explicitly

• Scoring
– Which scoring function (force field)

• Searching
– Which method to search or sample conformational space

Protein Scoring Functions from Mining Protein Data

• Classification Theory
– Find a set of classes and their descriptors (a classification) for n data
– Each datum has q attributes (shape, amino acid type, etc.)

$$X_i = \{x_{i1}, x_{i2}, \ldots, x_{iq}\}$$

– $\{X_i\}$ from $m$ classes $c_1, c_2, \ldots, c_m$

• Theory of finite mixtures
– Class: attribute probability distribution of all members

Bayesian approach

$$P(X_i \in c_j \mid X_i) = \frac{P(X_i \mid X_i \in c_j)\, P(X_i \in c_j)}{P(X_i)}$$
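Bayes' rule for class membership can be sketched in a few lines. This is a minimal illustration, not the code used for the talk: the function name `class_posteriors` and the example numbers are made up, and the class likelihoods P(X_i | X_i in c_j) and priors P(X_i in c_j) are assumed to be given.

```python
def class_posteriors(likelihoods, priors):
    """Return P(X_i in c_j | X_i) for every class j via Bayes' rule.

    likelihoods[j] = P(X_i | X_i in c_j), priors[j] = P(X_i in c_j).
    """
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(X_i), the sum over all classes
    return [j / evidence for j in joint]


# Hypothetical values for three classes, for illustration only.
posteriors = class_posteriors([0.02, 0.10, 0.01], [0.5, 0.3, 0.2])
most_likely_class = posteriors.index(max(posteriors))
```

The denominator P(X_i) is only a normalisation constant, so the most probable class is decided by likelihood times prior.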

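• Simplifications
– Stating a simplified model S

$$P(X_i \in c_j \mid X_i, S) = \frac{P(X_i \mid X_i \in c_j, S)\, P(X_i \in c_j \mid S)}{P(X_i \mid S)}$$

– Assume attributes are independently distributed

$$P(X_i \mid X_i \in c_j, S) = \prod_{k=1}^{q} P(x_{ik} \mid X_i \in c_j, S)$$

• $P(X_i \in c_j \mid S)$ requires class description
– Expectation Maximization (EM)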
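How EM and the independence assumption fit together can be shown on a toy mixture. The sketch below assumes all q attributes are discrete and integer-coded. The function name `em_discrete_mixture`, the random initialisation, the pseudo-count and the fixed iteration count are illustrative choices, not a description of AutoClass.

```python
import math
import random


def em_discrete_mixture(data, n_classes, n_values, n_iter=50, seed=0):
    """EM for a finite mixture with q independent, discrete attributes.

    data     : list of tuples with q integer-coded attribute values
    n_values : number of possible values for each attribute
    Returns (class weights, per-class attribute tables, responsibilities).
    """
    rng = random.Random(seed)
    n, q = len(data), len(n_values)
    # Start from random soft assignments (an arbitrary initialisation).
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(n_classes)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: re-estimate class weights and attribute tables.
        weights = [max(sum(resp[i][j] for i in range(n)) / n, 1e-12)
                   for j in range(n_classes)]
        tables = []
        for j in range(n_classes):
            per_attr = []
            for k in range(q):
                counts = [1e-6] * n_values[k]  # pseudo-count avoids log(0)
                for i in range(n):
                    counts[data[i][k]] += resp[i][j]
                total = sum(counts)
                per_attr.append([c / total for c in counts])
            tables.append(per_attr)
        # E-step: Bayes' rule, with the product over attributes done in logs.
        for i in range(n):
            log_joint = []
            for j in range(n_classes):
                log_p = math.log(weights[j])
                for k in range(q):
                    log_p += math.log(tables[j][k][data[i][k]])
                log_joint.append(log_p)
            top = max(log_joint)
            unnorm = [math.exp(v - top) for v in log_joint]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]
    return weights, tables, resp


# Toy data: two obvious groups of two-attribute points.
data = [(0, 0)] * 20 + [(1, 1)] * 20
weights, tables, resp = em_discrete_mixture(data, n_classes=2, n_values=[2, 2])
labels = [row.index(max(row)) for row in resp]
```

Working in log space keeps the product over attributes from underflowing when q is large.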

How many classes

• Again Bayes’ rule

$$P(m \mid X) = \frac{P(m)\, P(X \mid m)}{P(X)}$$

• P(m) favours smaller number of classes
– No over-fitting of data (as happens with maximum likelihood methods)
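The effect of the prior P(m) can be illustrated numerically. Everything in this sketch is assumed for illustration: the log-evidence values are invented, and the exponentially decaying prior is just one simple way to favour fewer classes, not the prior used in the talk.

```python
import math


def posterior_over_m(log_evidence, log_prior):
    """P(m | X) from log P(X | m) and log P(m), normalised over candidate m."""
    log_joint = {m: log_evidence[m] + log_prior[m] for m in log_evidence}
    top = max(log_joint.values())
    unnorm = {m: math.exp(v - top) for m, v in log_joint.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}


# Hypothetical log P(X | m): the fit keeps improving slightly as m grows.
log_evidence = {2: -1050.0, 3: -1010.0, 4: -1008.5, 5: -1008.0}
# Assumed prior: log P(m) falls linearly with m (up to a constant).
log_prior = {m: -2.0 * m for m in log_evidence}

posterior = posterior_over_m(log_evidence, log_prior)
best_m = max(posterior, key=posterior.get)
best_without_prior = max(log_evidence, key=log_evidence.get)
```

With these numbers the likelihood alone prefers the largest m, while the posterior settles on a smaller one once the gain in fit no longer outweighs the prior.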

A Toy Example: Dihedral preference of Valine

• Four interesting degrees of freedom
– φ-,ψ-dihedral angles
– Adjacent amino acid types (i-1, i+1)

• Data: 893 non-redundant proteins
– 12074 four-dimensional data points

Valine Data Classification

• AutoClass classification
– Model: Gaussian distribution for φ/ψ, discrete probabilities for amino acids
– Total of 50 tries with #classes in [2:11]
– Each try refined until fully converged

• Best classification has 5 classes
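The model named above (Gaussians for the two dihedral angles, discrete probabilities for the two neighbouring amino acids) gives a class likelihood that is a simple product. The sketch below is illustrative only: the class names, means, standard deviations and neighbour preferences are invented, and a plain Gaussian ignores the periodicity of angles.

```python
import math

AMINO_ACIDS = "GAVLIFPSTCMWYNQDEKRH"


def gaussian_pdf(x, mean, sd):
    """Density of a one-dimensional normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def skewed(favoured):
    """Hypothetical discrete distribution: favoured residues get double weight."""
    raw = {a: (2.0 if a in favoured else 1.0) for a in AMINO_ACIDS}
    total = sum(raw.values())
    return {a: v / total for a, v in raw.items()}


def class_likelihood(point, cls):
    """P(X_i | X_i in c_j) for a (phi, psi, aa[i-1], aa[i+1]) data point."""
    phi, psi, prev_aa, next_aa = point
    return (gaussian_pdf(phi, *cls["phi"])
            * gaussian_pdf(psi, *cls["psi"])
            * cls["prev"][prev_aa]
            * cls["next"][next_aa])


# Invented class descriptions: (mean, sd) in degrees for each angle.
classes = {
    "helix-like": {"phi": (-60.0, 15.0), "psi": (-45.0, 15.0),
                   "prev": skewed("AL"), "next": skewed("AQ")},
    "strand-like": {"phi": (-120.0, 20.0), "psi": (130.0, 20.0),
                    "prev": skewed("VI"), "next": skewed("VI")},
}

point = (-62.0, -41.0, "A", "L")
likelihoods = {name: class_likelihood(point, c) for name, c in classes.items()}
```

These likelihoods are the inputs that Bayes' rule turns into class membership probabilities.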

Amino Acid Attribute vectors of α-helix Classes

• Log-Preferences

          Class 1            Class 2
       i-1      i+1       i-1      i+1
G    -0.245    0.044    -0.332   -0.935
A     0.340    0.474     0.347    0.005
V     0.218   -1.100    -1.560    0.548
L     0.483   -0.636    -2.010    0.869
I     0.385   -1.580    -4.530    0.679
F     0.384   -0.575    -1.700    0.579
P    -0.297   -0.916     0.003   -2.350
S    -0.227    0.169     0.415   -0.759
T    -0.127   -0.280    -0.131   -0.444
C     0.180   -0.076    -1.110    0.357
M     0.436   -0.175    -1.580    0.820
W     0.456   -0.123    -1.490   -0.370
Y    -0.184   -0.389    -0.145    0.260
N    -0.428    0.008     0.335   -0.050
Q    -0.081    0.524     0.406   -0.581
D     0.020    0.266     0.440   -0.805
E    -0.438    0.540     0.602   -1.020
K    -0.981    0.300     0.634   -0.362
R    -0.545    0.494     0.405   -0.860
H    -0.337    0.017     0.313   -0.459

Re-invention of the Wheel

• Textbook secondary structure pattern
– Helices are likely on outside of proteins
– i, i+3 and i+4 hydrophobic interface

From C.-I. Branden and J. Tooze, Introduction to Protein Structure

Fragment-based Protein Scoring

• Find classification for fragments of size 7 residues
– 237566 fragments (1494 non-redundant protein chains)
– 28 descriptors
• 7 amino acid types
• 14 φ/ψ-dihedral angles
• 7 numbers of neighbours (one per amino acid)

200 CPU hours on National Facility computers

325 classes (modelling the probability distribution of native fragments)

• Use this classification to evaluate likelihood of a fragment sequence-structure match

• Total score = sum of fragment scores
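Summing fragment scores over a whole sequence-structure match might look like the sketch below. It assumes overlapping windows of 7 residues and takes the per-fragment score as a caller-supplied function. `total_score` and `toy_fragment_score` are hypothetical names, and the toy score is a stand-in for the classification-based likelihood, not the real scoring function.

```python
def total_score(sequence, structure, fragment_score, size=7):
    """Sum fragment_score over every window of `size` consecutive residues.

    sequence and structure must describe the same number of residues.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    return sum(
        fragment_score(sequence[i:i + size], structure[i:i + size])
        for i in range(len(sequence) - size + 1)
    )


def toy_fragment_score(seq_frag, struct_frag):
    """Stand-in score: count hydrophobic residues at buried ('b') positions."""
    return sum(1.0 for aa, state in zip(seq_frag, struct_frag)
               if aa in "VLIFM" and state == "b")


# Toy input: 'b' = buried, 'e' = exposed (hypothetical one-letter states).
score = total_score("VLIFMAVKG", "bbebbeebe", toy_fragment_score)
```

Because the windows overlap, residues in the middle of a chain contribute to more fragments than residues near the termini.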

Fold Recognition = Computer Matchmaking

• Structure Disco

Does it work?

• Discrimination (TIM 1amk_)

• Generalisation


Sequence-Structure Matching: The search problem

• Gapped alignment = combinatorial nightmare

Why is Fold Recognition better than Sequence Comparison?

• Comparison is done in structure space not in sequence space

Finding Remote Homologues with sausage

• 572 sequence-structure pairs
• Structures are similar (FSSP)
• > 70% structurally aligned
• < 20% sequence identity

[Figure: axes are alignment quality (arb. units) and sequence similarity weight (arb. units)]

A Real Case Example: RNA-dependent RNA polymerases

• Dengue virus

• Bacteriophage φ6

Is this Yet Another Profile Method?

• Yes, but a much more general profile method
– Profile is not residue based (like profile-like threading force fields)
– Profiles not for protein families (like in HMMs or Ψ-Blast)
– BUT local sequence profiles for optimally chosen classes of fragments

• Local profiles can be arbitrarily assembled
– Extreme flexibility

• Sequence-structure alignment (= assembling best profile matches)
– Deterministic, using dynamic programming
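The dynamic programming step can be illustrated with a textbook global alignment recursion. This is a generic sketch, not sausage's algorithm: it assumes a precomputed score for every (sequence position, structure position) pair and a constant gap penalty. The function name `align_score` and the example matrix are hypothetical.

```python
def align_score(match, gap=-1.0):
    """Best global alignment score by dynamic programming.

    match[i][j] is the score for placing sequence position i on
    structure position j; `gap` is added for every skipped position.
    """
    n = len(match)
    m = len(match[0]) if n else 0
    # best[i][j] = best score aligning the first i sequence positions
    # with the first j structure positions.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + match[i - 1][j - 1],  # align i with j
                best[i - 1][j] + gap,                      # skip a sequence position
                best[i][j - 1] + gap,                      # skip a structure position
            )
    return best[n][m]


# Two sequence positions against three structure positions (made-up scores).
example = [[2.0, 0.0, 0.0],
           [0.0, 0.0, 2.0]]
result = align_score(example)
```

The table has (n+1) x (m+1) entries and each is filled in constant time, which is what makes the otherwise combinatorial gapped search tractable and deterministic.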

People

• sausage
– Andrew Torda (RSC)
– Oliver Martin (RSC)

• GlnB/GlnK, RdR polymerases
– Subhash Vasudevan (JCU)

Sausage and Cassandra freely available http://rsc.anu.edu.au/~torda
