improving the state of the art in machine learning...

15
Iowa State University Bioinformatics and Computational Biology Graduate Program Improving the State of the Art in Machine Learning Methods for Protein-RNA Interface Prediction Rasna Walia Cornelia Caragea, Drena Dobbs & Vasant Honavar Bioinformatics & Computational Biology Computer Science Genetics Development & Cell Biology Iowa State University

Upload: others

Post on 21-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

Improving the State of the Art in

Machine Learning Methods for

Protein-RNA Interface Prediction

Rasna Walia

Cornelia Caragea, Drena Dobbs & Vasant Honavar

Bioinformatics & Computational Biology

Computer Science

Genetics Development & Cell Biology

Iowa State University

Page 2: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

Importance of Protein-RNA Interactions

• Essential roles in many biological processes

– Transcription

– Translation

– Viral infectivity

WiML, December, 2010

Source: http://www.youtube.com/watch?v=NsVEx1nFHJc&feature=player_embedded

Page 3: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

WiML, December, 2010

Protein-RNA Interface Prediction • Experimental methods:

– Too time and labor intensive

• Computational methods

• Accurate prediction of protein-RNA interactions

can contribute to:

– New molecular tools for modifying gene expression

– Novel therapies for infectious & genetic diseases

– Detailed understanding of molecular mechanisms involved in protein-RNA recognition

Page 4: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

State of the Art in RNA Binding Site Prediction?

WiML, December, 2010

• Problem: difficult to compare results due to different

datasets, experimental design (cross-validation procedures)

& evaluation procedures

• Our goal: systematic comparison of methods and features

Sequence-based methods

• Primary amino acid sequence

• Interface propensities

Structure-based methods

• Properties extracted from structures, e.g., solvent accessible surface area, roughness

Evolution-based methods

• Sequence conservation

• MSAs, PSSMs

Combination/Ensemble methods

• Combinations of sequence, physicochemical properties, structural features and residue conservation

Page 5: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

WiML, December, 2010

Feature Representations

• Sequence

• Sequence PSSMs

• Smoothed Sequence

PSSMs

• Structure

• Structure PSSMs

MPVGSLEKQVIFARFPGD

Page 6: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

WiML, December, 2010

Interface Definition

• Interface residue = an amino acid whose atom is within 5Å of an atom in bound RNA

RNA atom

Protein atom

Sphere of 5Å radius

Page 7: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

WiML, December, 2010

Datasets

Description Sequence identity

Resolution

RB106 106 non-redundant protein chains from the PDB

≤ 30% ≤ 3.5 Å

RB144 144 non-redundant protein chains from the PDB

≤ 30% ≤ 3.5 Å

RB199 199 non-redundant protein chains from the PDB

≤ 30% ≤ 3.5 Å

Page 8: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

WiML, December, 2010

RB144 Evaluation

Naïve Bayes

Accuracy Sensitivity F-Measure AUC of

ROC

Sequence 83.06 0.26 0.37 0.74

Sequence

PSSM 71.50 0.62 0.45 0.75

Smoothed

Seq PSSM 69.76 0.66 0.45 0.76

Structure 83.70 0.34 0.44 0.77

Structure

PSSM 73.65 0.63 0.48 0.78

Support Vector Machines

Accuracy Sensitivity F-Measure AUC of

ROC

Sequence 81.94 0.18 0.29 0.72

Sequence

PSSM 84.05 0.36 0.46 0.79

Smoothed

Seq PSSM 83.75 0.31 0.43 0.78

Structure 83.76 0.31 0.42 0.77

Structure

PSSM 84.73 0.38 0.49 0.81

Naïve Bayes

• Best accuracy: Structure

• Best AUC of ROC:

Structure PSSM

SVMs

• Best accuracy: Structure PSSM

• Best AUC of ROC:

Structure PSSM

Unbalanced data

Page 9: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

WiML, December, 2010

Ensemble of Classifiers PDB ID: 1DI2 Chain: A

True Positives: Red

False Positives: Blue

Actual Interface Residues M P V G S L Q E L A V Q K G W R L P E Y T V A Q E S G P P H K R E F T I T C R V E T F V E T G S G T S K Q V A K R V A A E K L L T K F K T

Seq − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − + + + + + − − − + − − − − − − − − − − − −

Seq_PSSM + + + + − − + + − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − + + − − + + + + + + + + + + + + + + + + + + + − − − − − − −

Smoothed Seq_PSSM + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + − − + + + + + + + + + + + + + + + + + + + − − − − − − − − −

Structure − − − − − − − − − − − − − − − − − − − − − − − + + + + + + + + + + − − − − − − − − − − − − − − − + + + − − − − − − − − − − − − − − − − − −

Structure_PSSM − + + + + − − + + − − − − − − − − − − − − − − − − + + + + + + + + + − − − − − − − − − − − − − + + + + + + + + + + + + + + + − − − − − − −

Page 10: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

Conclusions

• Both evolutionary information (PSSMs) and structural

information improve prediction of RNA-binding residues in

proteins

• Need to evaluate SVM modifications that optimize metrics

other than accuracy

• Evaluate additional representations and new machine

learning algorithms

WiML, December, 2010

Page 11: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

WiML, December, 2010

Acknowledgments

• Honavar Lab

– Dr. Vasant Honavar

– Dr. Yasser El-Manzalawy

– Fadi Towfic

– Cornelia Caragea

http://bindr.gdcb.iastate.edu/PRIDB

http://bindr.gdcb.iastate.edu/RNABindR

• Dobbs Lab

– Dr. Drena Dobbs

– Ben Lewis

Page 12: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

New Directions

• So far, “protein-centric” approach used to predict interface residues

• Use homology-based methods & partner-specific predictions:

– Given a query protein-RNA complex, find homologs of interacting protein & RNA partners; transfer interface information to query

– Approach has been successful for protein-protein interface prediction

• (Li X., Dobbs, D. & Honavar, V., in press)

• Collect features from RNA side of complexes as input for interface residue predictors

– Challenge: RNA forms myriad tertiary structures

WiML, December, 2010

Page 13: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

PRIDB: Protein-RNA Interface DataBase

WiML, December, 2010

• Comprehensive

database: 926 protein-

RNA complexes from the

PDB (Nov 2010)

• Provides – atomic level information

regarding interfacial

contacts

– pre-calculated benchmark

datasets of protein-RNA

complexes for evaluating

the performance of

interface prediction

methods

• Lewis, Walia et al. (2010)

Nucleic Acids Research

http://bindr.gdcb.iastate.edu/PRIDB

Page 14: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

References

• Cheng CW, Emily Chia-Yu ES, Hwang JK,Sung TY, Hsu WL. (2008) Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics 9:S6.

• Terribilini M, Lee JH, Yan C, Jernigan RL, Honavar V, Dobbs D. (2006) Prediction of RNA binding sites in proteins from amino acid sequence. RNA12:1450-1462,

• Towfic F, Caragea C, Gemperline DC, Dobbs D, Honavar V. (2010) Struct-NB: predicting protein-RNA binding sites using structural features. IJDMB 4:21-43.

• Lewis B, Walia R, Terribilini M, Ferguson J, Zheng C, Honavar V, Dobbs D. (2010) PRIDB: a protein-RNA interface database. Nucleic Acids Res. doi: 10.1093/nar/gkq1108

• Maetschke SR, Yuan Z. (2009) Exploiting structural and topological information to improve prediction of RNA-protein binding sites. BMC Bioinformatics, 10:341

• Liu, ZP, Wu LY, Wang Y, Zhang XS, Chen L. (2010) Prediction of protein–RNA binding sites by a random forest method with combined features. Bioinformatics, 26:13.

WiML, December, 2010

Page 15: Improving the State of the Art in Machine Learning …pami.uwaterloo.ca/.../2010/program/WiML2010_RasnaWalia.pdfIowa State University Bioinformatics and Computational Biology Graduate

Iowa State University

Bioinformatics and Computational Biology

Graduate Program

Evaluation Metrics

WiML, December, 2010