Methods for Improving Protein Disorder Prediction
Slobodan Vucetic1, Predrag Radivojac3, Zoran Obradovic3, Celeste J. Brown2, Keith Dunker2
1 School of Electrical Engineering and Computer Science, 2 Department of Biochemistry and Biophysics, Washington State University, Pullman, WA 99164
3 Center for Information Science and Technology, Temple University, Philadelphia, PA 19122
ABSTRACT
Attribute construction, choice of classifier, and post-processing were explored for improving prediction of protein disorder. While ensembles of neural networks achieved the highest accuracy, the difference compared to logistic regression classifiers was smaller than 1%. Bagging of neural networks, where moving averages over windows of length 61 were used for attribute construction, combined with post-processing by averaging predictions over windows of length 81, resulted in 82.6% accuracy on a larger set of ordered and disordered proteins than used previously. This result was a significant improvement over the previous methodology, which gave an accuracy of 70.2%. Moreover, unlike the previous methodology, the modified attribute construction allowed prediction at protein ends.
Standard "Lock and Key" Paradigm for Protein Structure/Function Relationships
(Fischer, Ber. Dt. Chem. Ges., 1894)
Amino Acid Sequence
3-D Structure
Protein Function
Motivation
Protein Disorder - a part of a protein without a unique 3-D structure
Example: the calcineurin protein (Kissinger et al., Nature, 1995)
Overall Objective
Better Understand Protein Disorder

Hypothesis:
• Since amino acid sequence determines structure, sequence should determine lack of structure (disorder) as well.

Test:
• Construct a protein disorder predictor
• Check its accuracy
• Apply it to large protein sequence databases
Objective of this Study
• Previous results showed that disorder can be predicted from sequence with ~70% accuracy (based on 32 disordered proteins)
• Our goals are to increase accuracy by:
– Increasing the database of disordered proteins
– Improving knowledge representation and attribute selection
– Examining predictor types and post-processing
– Performing extensive cross-validation using different accuracy measures
Data Sets
• Searching for disordered proteins (DIFFICULT)
– Keyword search of PubMed (http://www.ncbi.nlm.nih.gov) for disorder identified by NMR, circular dichroism, or protease digestion
– Search of the Protein Data Bank (PDB) for disorder identified by X-ray crystallography
• Searching for ordered proteins (EASY)
– Most proteins in the Protein Data Bank (PDB) are ordered
• Set of disordered proteins (D_145)
– The search revealed 145 nonredundant proteins (<25% identity) with long disordered regions (>40 amino acids), totaling 16,705 disordered residues
• Set of ordered proteins (O_130)
– 130 nonredundant, completely ordered proteins with 32,506 residues were chosen as examples of protein order
Data Representation: Background
• Conformation is mostly influenced by locally surrounding amino acids
• Higher order statistics not very useful in proteins [Nevill-Manning, Witten, DCC 1999]
• Domain knowledge is a source of potentially discriminative features
[Diagram: a window of size Win slides along the sequence (e.g. ...W C Y L A A M A H Q F A G A G K L K C T S A L S C T...); the residue at the window center receives class 1 (disordered) or 0 (ordered).]

Attributes calculated over the window:
• 20 amino-acid compositions
• K2 entropy
• 14 Å contact number
• Hydropathy
• Flexibility
• Coordination number
• Bulkiness
• CFYW volume
• Net charge
Attribute Selection (including protein ends)
• Attribute construction resembles low-pass filtering. Consequently:
– the effective data size of D_145 is ~ 2×16,705/Win
– the effective data size of O_130 is ~ 2×32,506/Win
• K2 entropy - low complexity proteins are likely disordered
• Flexibility, Hydropathy, etc. - correlated with disorder
• 20 AA compositions - occurrence or lack of some AA from the window is correlated with disorder incidence
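As a concrete illustration of the windowed attribute construction, the sketch below computes the 20 amino-acid composition fractions and a Shannon-entropy complexity measure over a window (the entropy here is a stand-in for the poster's K2 entropy, and the scale-based attributes such as hydropathy and flexibility are omitted; `window_attributes` is a hypothetical helper name). Truncating the window at the sequence ends is what permits prediction at the protein termini.

```python
from collections import Counter
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def window_attributes(seq, center, win):
    """Attributes for the residue at position `center`, computed over a
    window of length `win`.  The window is truncated at the protein
    ends, which permits prediction at the termini."""
    half = win // 2
    window = seq[max(0, center - half): center + half + 1]
    n = len(window)
    counts = Counter(window)
    # 20 amino-acid composition fractions over the window
    composition = [counts.get(aa, 0) / n for aa in AMINO_ACIDS]
    # Shannon entropy of the window composition (stand-in for K2 entropy;
    # low-complexity windows give low entropy)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return composition + [entropy]

# Attributes for the residue at position 12 of a short example sequence
attrs = window_attributes("WCYLAAMAHQFAGAGKLKCTSALSCT", center=12, win=9)
```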
Disorder Predictor Models
We examine:
• Logistic Regression (LR): a stable, linear classification model
• Neural Networks: slow training, unstable, powerful, need much data
• Ensembles of Neural Networks (Bagging, Boosting): very slow, stable, powerful
Postprocessing
• We examine LONG disordered regions:
– neighboring residues likely belong to the same ordered/disordered region
• Predictions can therefore be improved:
– perform a moving average of the predictions over a window of length Wout

[Diagram: Data → Disorder Predictor → Wout Filter → Prediction]
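A minimal sketch of this post-processing step, assuming per-residue disorder scores in [0, 1]; as with attribute construction, the window is simply truncated at the sequence ends:

```python
def smooth_predictions(scores, wout):
    """Moving average of per-residue disorder scores over a window of
    length `wout`; the window is truncated at the sequence ends."""
    half = wout // 2
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# A lone low score inside a high-scoring (disordered) stretch is smoothed out
raw = [0.9, 0.8, 0.1, 0.9, 0.8]
smoothed = smooth_predictions(raw, wout=3)
```

The poster uses Wout = 81; the tiny Wout = 3 above is only to keep the example readable.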
Accuracy Measures
• Length of disordered regions in different proteins varies from 40 to 1,800 AA
• We measure two types of accuracy:
– per-residue (averaged over residues)
– per-protein (averaged over proteins)
• ROC curve - plots the True Positive (TP) rate against the False Positive (FP) rate
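The two accuracy measures can be made precise with a short sketch (hypothetical helper names; labels are 1 = disordered, 0 = ordered):

```python
def per_residue_accuracy(preds_by_protein, labels_by_protein):
    """Accuracy with all residues of all proteins pooled together."""
    correct = total = 0
    for preds, labels in zip(preds_by_protein, labels_by_protein):
        correct += sum(p == y for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total

def per_protein_accuracy(preds_by_protein, labels_by_protein):
    """Accuracy computed for each protein separately, then averaged
    over proteins, so every protein contributes equally."""
    accs = [sum(p == y for p, y in zip(preds, labels)) / len(labels)
            for preds, labels in zip(preds_by_protein, labels_by_protein)]
    return sum(accs) / len(accs)

# Two toy proteins of different lengths
preds = [[1, 1], [0, 0, 0, 0]]
labels = [[1, 0], [0, 0, 0, 0]]
residue_acc = per_residue_accuracy(preds, labels)
protein_acc = per_protein_accuracy(preds, labels)
```

Per-protein averaging keeps the 1,800-residue disordered regions from swamping the 40-residue ones.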
Experimental Methodology
• Balanced data sets of order/disorder examples
• Cross-validation:
– 145 disordered proteins divided into 15 subsets (15-fold cross-validation for TP accuracy)
– 130 ordered proteins divided into 13 subsets (13-fold CV for TN accuracy)
• To prevent collinearity and overfitting, 20 attributes are selected (18 AA compositions, flexibility, and K2 entropy)
• 2,000 examples randomly selected for training
• Feedforward neural networks with one hidden layer of 5 hidden nodes
• 100 epochs of resilient backpropagation
• Bagging and boosting ensembles of 30 neural networks
• Examined Win, Wout = {1, 9, 21, 41, 61, 81, 121}
• For each pair (Win, Wout), CV repeated 10 times for neural networks and once for logistic regression, bagging, and boosting
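A compact sketch of the bagging procedure. For brevity it uses scikit-learn logistic regression as the base learner rather than the poster's neural networks (the poster itself reports the accuracy gap between the two model families as under 1%), and the data is a random toy stand-in for the windowed attributes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_bagged(X, y, n_models=30, seed=0):
    """Bagging: train `n_models` base learners, each on a bootstrap
    resample (sampling with replacement) of the training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
    return models

def predict_bagged(models, X):
    """Average the per-model disorder probabilities over the ensemble."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)

# Toy stand-in for the windowed attributes: 2,000 examples, 20 attributes
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
ensemble = train_bagged(X, y, n_models=5)
probs = predict_bagged(ensemble, X)
```

Averaging over bootstrap-trained models reduces the variance of unstable base learners, which is why bagging stabilizes the neural networks in the model comparison below.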
Results – Model Comparison
Per-protein accuracy, (Win, Wout) = (41, 1)

Model                 TN        TP        Average
Logistic Regression   79.7      69.9      73.5
Neural Networks       79.2±1.3  72.5±1.4  75.8
Bagging               81.4      72.8      77.1
Boosting              81.5      73.1      77.3
• Neural networks are slightly more accurate than linear predictors
• Ensembles of NNs are slightly better than an individual NN
• Boosting and bagging result in similar accuracy
• The TN rate is significantly higher than the TP rate (~10%)
• Indication that the attribute-space coverage of disorder is larger than that of order: disorder is more diverse than order
Results – Influence of Filter Size
Per-protein accuracy with bagging

[Figure: per-protein accuracy (0.65-0.85) as a function of Wout (0-120), with one curve each for Win = 9, 21, and 61.]

• Different pairs of (Win, Wout) can result in similar accuracy
• Wout = 81 seems to be the optimal choice
Results – Optimal (Win, Wout)
Per-protein and per-residue accuracy of bagging

Win   Wout*   TN     TP     Average   Per-residue accuracy
9     81      93.5   65.2   79.3      81.1
21    81      93.5   71.5   82.5      84.3
41    81      90.3   73.7   82.0      84.5
61    81      88.8   76.5   82.6      85.3
81    61      86.1   77.9   82.0      85.3
121   61      85.3   76.8   81.0      85.4

(TN, TP, and Average are per-protein accuracies, in %.)
• Per-residue accuracy gives higher values
• For a wide range of Win, the optimal Wout = 81
• The best result, 82.6% average per-protein accuracy, was achieved with (Win, Wout) = (61, 81)
Results – ROC Curve
Comparison of (Win, Wout) = (21, 1) and (61, 81)

[Figure: ROC curves for the two predictors; the (61, 81) curve dominates the (21, 1) curve.]

• (Win, Wout) = (61, 81) is superior: ~10% improvement in per-protein accuracy
• (Win, Wout) = (21, 1) corresponds to our previous predictor
Results – Accuracy at Protein Ends
Comparison on O_130 proteins

• Comparison of accuracies at the first 20 (Region I) and last 20 (Region II) positions of O_130 proteins

[Figure: accuracy (0.4-0.9) over positions 20…1 of Region I and 1…20 of Region II; solid curve: (Win = 61, Wout = 81), dashed curve: (Win = 21, Wout = 1).]
Results – Accuracy at Protein Ends
Comparison on D_145 proteins

• Averaged accuracies of the first 20 positions of 91 disordered regions that start at the beginning of the protein sequence (Region I) and 54 disordered regions that do not start at the beginning of the protein sequence (Region II)
• Averaged accuracies of the last 20 positions of 76 disordered regions that do not end at the end of the protein sequence (Region III) and 69 disordered regions that end at the end of the protein sequence (Region IV)

[Figure: accuracy (0.5-0.9) over Regions I-IV; solid curve: (Win = 61, Wout = 81), dashed curve: (Win = 21, Wout = 1).]
Conclusions
• Modifications in data representation, attribute selection, and prediction post-processing were proposed
• Predictors of different complexity were examined
• Achieved a 10% accuracy improvement over our previous predictors
• The difference in accuracy between linear models and ensembles of neural networks is fairly small
Acknowledgements
Support from NSF-CSE-IIS-9711532 and NSF-IIS-0196237 to Z.O. and A.K.D., and from N.I.H. 1R01 LM06916 to A.K.D. and Z.O., is gratefully acknowledged.