improving the state of the art in machine learning...

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

Improving the State of the Art in

Machine Learning Methods for

Protein-RNA Interface Prediction

Rasna Walia

Cornelia Caragea, Drena Dobbs & Vasant Honavar

Bioinformatics & Computational Biology

Computer Science

Genetics Development & Cell Biology




Graduate Program

Importance of Protein-RNA Interactions

• Essential roles in many biological processes

– Transcription

– Translation

– Viral infectivity

WiML, December, 2010

Source: http://www.youtube.com/watch?v=NsVEx1nFHJc&feature=player_embedded



Graduate Program


Protein-RNA Interface Prediction • Experimental methods:

– Too time and labor intensive

• Computational methods

• Accurate prediction of protein-RNA interactions

can contribute to:

– New molecular tools for modifying gene expression

– Novel therapies for infectious & genetic diseases

– Detailed understanding of molecular mechanisms involved in protein-RNA recognition



Graduate Program

State of the Art in RNA Binding Site Prediction?


• Problem: difficult to compare results due to different

datasets, experimental design (cross-validation procedures)

& evaluation procedures

• Our goal: systematic comparison of methods and features

Sequence-based methods

• Primary amino acid sequence

• Interface propensities

Structure-based methods

• Properties extracted from structures, e.g., solvent accessible surface area, roughness

Evolution-based methods

• Sequence conservation

• MSAs, PSSMs

Combination/Ensemble methods

• Combinations of sequence, physicochemical properties, structural features and residue conservation



Graduate Program


Feature Representations

• Sequence

• Sequence PSSMs

• Smoothed Sequence

PSSMs

• Structure

• Structure PSSMs

MPVGSLEKQVIFARFPGD



Graduate Program


Interface Definition

• Interface residue = an amino acid whose atom is within 5Å of an atom in bound RNA

RNA atom

Protein atom

Sphere of 5Å radius



Graduate Program


Datasets

Description Sequence identity

Resolution

RB106 106 non-redundant protein chains from the PDB

≤ 30% ≤ 3.5 Å


≤ 30% ≤ 3.5 Å


≤ 30% ≤ 3.5 Å



Graduate Program


RB144 Evaluation

Naïve Bayes

Accuracy Sensitivity F-Measure AUC of

ROC

Sequence 83.06 0.26 0.37 0.74

Sequence

PSSM 71.50 0.62 0.45 0.75

Smoothed

Seq PSSM 69.76 0.66 0.45 0.76

Structure 83.70 0.34 0.44 0.77

Structure

PSSM 73.65 0.63 0.48 0.78

Support Vector Machines

Accuracy Sensitivity F-Measure AUC of

ROC

Sequence 81.94 0.18 0.29 0.72

Sequence

PSSM 84.05 0.36 0.46 0.79

Smoothed

Seq PSSM 83.75 0.31 0.43 0.78

Structure 83.76 0.31 0.42 0.77

Structure

PSSM 84.73 0.38 0.49 0.81

Naïve Bayes

• Best accuracy: Structure

• Best AUC of ROC:

Structure PSSM

SVMs

• Best accuracy: Structure PSSM

• Best AUC of ROC:

Structure PSSM

Unbalanced data



Graduate Program


Ensemble of Classifiers PDB ID: 1DI2 Chain: A

True Positives: Red

False Positives: Blue

Actual Interface Residues M P V G S L Q E L A V Q K G W R L P E Y T V A Q E S G P P H K R E F T I T C R V E T F V E T G S G T S K Q V A K R V A A E K L L T K F K T

Seq − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − + + + + + − − − + − − − − − − − − − − − −

Seq_PSSM + + + + − − + + − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − + + − − + + + + + + + + + + + + + + + + + + + − − − − − − −

Smoothed Seq_PSSM + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + − − + + + + + + + + + + + + + + + + + + + − − − − − − − − −

Structure − − − − − − − − − − − − − − − − − − − − − − − + + + + + + + + + + − − − − − − − − − − − − − − − + + + − − − − − − − − − − − − − − − − − −

Structure_PSSM − + + + + − − + + − − − − − − − − − − − − − − − − + + + + + + + + + − − − − − − − − − − − − − + + + + + + + + + + + + + + + − − − − − − −



Graduate Program

Conclusions

• Both evolutionary information (PSSMs) and structural

information improve prediction of RNA-binding residues in

proteins

• Need to evaluate SVM modifications that optimize metrics

other than accuracy

• Evaluate additional representations and new machine

learning algorithms




Graduate Program


Acknowledgments

• Honavar Lab

– Dr. Vasant Honavar

– Dr. Yasser El-Manzalawy

– Fadi Towfic

– Cornelia Caragea

http://bindr.gdcb.iastate.edu/PRIDB

http://bindr.gdcb.iastate.edu/RNABindR

• Dobbs Lab

– Dr. Drena Dobbs

– Ben Lewis








Graduate Program

New Directions

• So far, “protein-centric” approach used to predict interface residues

• Use homology-based methods & partner-specific predictions:

– Given a query protein-RNA complex, find homologs of interacting protein & RNA partners; transfer interface information to query

– Approach has been successful for protein-protein interface prediction

• (Li X., Dobbs, D. & Honavar, V., in press)

• Collect features from RNA side of complexes as input for interface residue predictors

– Challenge: RNA forms myriad tertiary structures




Graduate Program

PRIDB: Protein-RNA Interface DataBase


• Comprehensive

database: 926 protein-

RNA complexes from the

PDB (Nov 2010)

• Provides – atomic level information

regarding interfacial

contacts

– pre-calculated benchmark

datasets of protein-RNA

complexes for evaluating

the performance of

interface prediction

methods

• Lewis, Walia et al. (2010)

Nucleic Acids Research





Graduate Program

References

• Cheng CW, Emily Chia-Yu ES, Hwang JK,Sung TY, Hsu WL. (2008) Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics 9:S6.

• Terribilini M, Lee JH, Yan C, Jernigan RL, Honavar V, Dobbs D. (2006) Prediction of RNA binding sites in proteins from amino acid sequence. RNA12:1450-1462,

• Towfic F, Caragea C, Gemperline DC, Dobbs D, Honavar V. (2010) Struct-NB: predicting protein-RNA binding sites using structural features. IJDMB 4:21-43.

• Lewis B, Walia R, Terribilini M, Ferguson J, Zheng C, Honavar V, Dobbs D. (2010) PRIDB: a protein-RNA interface database. Nucleic Acids Res. doi: 10.1093/nar/gkq1108

• Maetschke SR, Yuan Z. (2009) Exploiting structural and topological information to improve prediction of RNA-protein binding sites. BMC Bioinformatics, 10:341

• Liu, ZP, Wu LY, Wang Y, Zhang XS, Chen L. (2010) Prediction of protein–RNA binding sites by a random forest method with combined features. Bioinformatics, 26:13.




Graduate Program

Evaluation Metrics


improving the state of the art in machine learning...

Documents