Iowa State University
Bioinformatics and Computational Biology
Graduate Program
Improving the State of the Art in
Machine Learning Methods for
Protein-RNA Interface Prediction
Rasna Walia
Cornelia Caragea, Drena Dobbs & Vasant Honavar
Bioinformatics & Computational Biology
Computer Science
Genetics Development & Cell Biology
Iowa State University
Importance of Protein-RNA Interactions
• Essential roles in many biological processes
– Transcription
– Translation
– Viral infectivity
WiML, December, 2010
Source: http://www.youtube.com/watch?v=NsVEx1nFHJc&feature=player_embedded
Protein-RNA Interface Prediction
• Experimental methods:
– Too time- and labor-intensive
• Computational methods
• Accurate prediction of protein-RNA interactions
can contribute to:
– New molecular tools for modifying gene expression
– Novel therapies for infectious & genetic diseases
– Detailed understanding of molecular mechanisms involved in protein-RNA recognition
State of the Art in RNA Binding Site Prediction?
• Problem: difficult to compare results due to different
datasets, experimental design (cross-validation procedures)
& evaluation procedures
• Our goal: systematic comparison of methods and features
Sequence-based methods
• Primary amino acid sequence
• Interface propensities
Structure-based methods
• Properties extracted from structures, e.g., solvent accessible surface area, roughness
Evolution-based methods
• Sequence conservation
• MSAs, PSSMs
Combination/Ensemble methods
• Combinations of sequence, physicochemical properties, structural features and residue conservation
Feature Representations
• Sequence
• Sequence PSSMs
• Smoothed Sequence PSSMs
• Structure
• Structure PSSMs
MPVGSLEKQVIFARFPGD
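The slide does not say how a residue is turned into a feature vector for the sequence-based representation. A common choice, assumed here, is a fixed-width window centered on each residue. The sketch below shows that idea on the example sequence above; the function name `residue_windows`, the window width of 5, and the `-` padding character are illustrative assumptions, not values from the talk.

```python
def residue_windows(sequence, width=5, pad="-"):
    """Return one window of `width` characters centered on each residue.

    Assumption: windows are padded at both termini with `pad` so that
    every residue gets a window of the same length.
    """
    if width < 1 or width % 2 == 0:
        raise ValueError("width must be a positive odd number")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


# Example sequence taken from the slide; width is a hypothetical setting.
windows = residue_windows("MPVGSLEKQVIFARFPGD", width=5)
```

Each window would then be encoded (for example as amino acid identities or as PSSM rows) and labeled with the interface status of its central residue.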
Interface Definition
• Interface residue = an amino acid with at least one atom within 5 Å of any atom in the bound RNA
RNA atom
Protein atom
Sphere of 5Å radius
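The distance-based definition can be sketched in a few lines. This is a minimal illustration, assuming atoms are already available as plain coordinate tuples; the function name `interface_residues`, the input layout, and treating a distance of exactly 5 Å as inside the sphere are assumptions made for the example.

```python
import math


def interface_residues(protein_atoms, rna_atoms, cutoff=5.0):
    """Return the set of residue ids that have an atom within `cutoff` of RNA.

    protein_atoms: iterable of (residue_id, (x, y, z)) pairs, one per atom.
    rna_atoms: iterable of (x, y, z) coordinates, one per RNA atom.
    Distances are in angstroms.
    """
    rna_atoms = list(rna_atoms)
    interface = set()
    for residue_id, coord in protein_atoms:
        if residue_id in interface:
            continue  # residue already labeled, skip its remaining atoms
        if any(math.dist(coord, rna) <= cutoff for rna in rna_atoms):
            interface.add(residue_id)
    return interface


# Toy coordinates (hypothetical, not from a real PDB entry).
protein = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.0, 0.0, 0.0)),
    (2, (10.0, 0.0, 0.0)),
    (3, (0.0, 8.0, 0.0)),
    (3, (0.0, 2.0, 0.0)),
]
rna = [(0.0, 0.0, 4.0), (20.0, 20.0, 20.0)]
labels = interface_residues(protein, rna)
```

The all-pairs scan is quadratic in the number of atoms; for large complexes a spatial index would be the usual replacement, but the labeling rule stays the same.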
Datasets
Dataset  Description                                    Sequence identity  Resolution
RB106    106 non-redundant protein chains from the PDB  ≤ 30%              ≤ 3.5 Å
RB144    144 non-redundant protein chains from the PDB  ≤ 30%              ≤ 3.5 Å
RB199    199 non-redundant protein chains from the PDB  ≤ 30%              ≤ 3.5 Å
RB144 Evaluation
Naïve Bayes
Features           Accuracy (%)  Sensitivity  F-Measure  AUC of ROC
Sequence           83.06         0.26         0.37       0.74
Sequence PSSM      71.50         0.62         0.45       0.75
Smoothed Seq PSSM  69.76         0.66         0.45       0.76
Structure          83.70         0.34         0.44       0.77
Structure PSSM     73.65         0.63         0.48       0.78

Support Vector Machines
Features           Accuracy (%)  Sensitivity  F-Measure  AUC of ROC
Sequence           81.94         0.18         0.29       0.72
Sequence PSSM      84.05         0.36         0.46       0.79
Smoothed Seq PSSM  83.75         0.31         0.43       0.78
Structure          83.76         0.31         0.42       0.77
Structure PSSM     84.73         0.38         0.49       0.81
Naïve Bayes
• Best accuracy: Structure
• Best AUC of ROC: Structure PSSM
SVMs
• Best accuracy: Structure PSSM
• Best AUC of ROC: Structure PSSM
Unbalanced data
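The tables report high accuracy alongside low sensitivity, which is the typical signature of unbalanced data. The sketch below uses the standard confusion-matrix definitions of accuracy, sensitivity, and F-measure (assumed here, since the slides do not spell out their formulas) and hypothetical counts to show how a classifier that never predicts an interface residue can still look accurate.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy (%), sensitivity, precision and F-measure.

    Standard definitions are assumed; ratios with an empty denominator
    are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts: 15 interface residues out of 100.
# A predictor that labels every residue as non-interface:
all_negative = binary_metrics(tp=0, fp=0, tn=85, fn=15)
# A predictor with the same accuracy that does find interface residues:
balanced = binary_metrics(tp=10, fp=10, tn=75, fn=5)
```

Both hypothetical predictors reach the same accuracy, but only sensitivity and F-measure separate them, which is why accuracy alone is a weak criterion for this task.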
Ensemble of Classifiers
PDB ID: 1DI2, Chain: A
True Positives: Red
False Positives: Blue
Actual Interface Residues M P V G S L Q E L A V Q K G W R L P E Y T V A Q E S G P P H K R E F T I T C R V E T F V E T G S G T S K Q V A K R V A A E K L L T K F K T
Seq − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − + + + + + − − − + − − − − − − − − − − − −
Seq_PSSM + + + + − − + + − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − − + + − − + + + + + + + + + + + + + + + + + + + − − − − − − −
Smoothed Seq_PSSM + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + − − + + + + + + + + + + + + + + + + + + + − − − − − − − − −
Structure − − − − − − − − − − − − − − − − − − − − − − − + + + + + + + + + + − − − − − − − − − − − − − − − + + + − − − − − − − − − − − − − − − − − −
Structure_PSSM − + + + + − − + + − − − − − − − − − − − − − − − − + + + + + + + + + − − − − − − − − − − − − − + + + + + + + + + + + + + + + − − − − − − −
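The slide shows per-residue predictions from each classifier but does not state how they are combined. One simple possibility, assumed here purely for illustration, is a per-residue majority vote. The function name `majority_vote` and the toy prediction strings are hypothetical; the strings use ASCII `+` and `-` with no spaces.

```python
def majority_vote(predictions):
    """Combine per-residue '+'/'-' strings by strict majority vote.

    predictions: list of equal-length strings, one per classifier.
    A residue is predicted '+' only if more than half of the
    classifiers predict '+' for it.
    """
    if not predictions:
        raise ValueError("at least one prediction string is required")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all prediction strings must have the same length")
    threshold = len(predictions) / 2
    length = len(predictions[0])
    return "".join(
        "+" if sum(p[i] == "+" for p in predictions) > threshold else "-"
        for i in range(length)
    )


# Toy predictions for a five-residue stretch from three classifiers.
combined = majority_vote(["--+++", "+-++-", "++-+-"])
```

Weighted voting or stacking a second classifier on the individual outputs are alternatives; which one the talk used cannot be recovered from the transcript.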
Conclusions
• Both evolutionary information (PSSMs) and structural
information improve prediction of RNA-binding residues in
proteins
• Need to evaluate SVM modifications that optimize metrics
other than accuracy
• Evaluate additional representations and new machine
learning algorithms
Acknowledgments
• Honavar Lab
– Dr. Vasant Honavar
– Dr. Yasser El-Manzalawy
– Fadi Towfic
– Cornelia Caragea
http://bindr.gdcb.iastate.edu/PRIDB
http://bindr.gdcb.iastate.edu/RNABindR
• Dobbs Lab
– Dr. Drena Dobbs
– Ben Lewis
New Directions
• So far, “protein-centric” approach used to predict interface residues
• Use homology-based methods & partner-specific predictions:
– Given a query protein-RNA complex, find homologs of interacting protein & RNA partners; transfer interface information to query
– Approach has been successful for protein-protein interface prediction
• (Li X., Dobbs, D. & Honavar, V., in press)
• Collect features from RNA side of complexes as input for interface residue predictors
– Challenge: RNA forms myriad tertiary structures
PRIDB: Protein-RNA Interface DataBase
• Comprehensive database: 926 protein-RNA complexes from the PDB (Nov 2010)
• Provides:
– atomic level information regarding interfacial contacts
– pre-calculated benchmark datasets of protein-RNA complexes for evaluating the performance of interface prediction methods
• Lewis, Walia et al. (2010) Nucleic Acids Research
http://bindr.gdcb.iastate.edu/PRIDB
References
• Cheng CW, Su EC, Hwang JK, Sung TY, Hsu WL. (2008) Predicting RNA-binding sites of proteins using support vector machines and evolutionary information. BMC Bioinformatics 9:S6.
• Terribilini M, Lee JH, Yan C, Jernigan RL, Honavar V, Dobbs D. (2006) Prediction of RNA binding sites in proteins from amino acid sequence. RNA 12:1450-1462.
• Towfic F, Caragea C, Gemperline DC, Dobbs D, Honavar V. (2010) Struct-NB: predicting protein-RNA binding sites using structural features. IJDMB 4:21-43.
• Lewis B, Walia R, Terribilini M, Ferguson J, Zheng C, Honavar V, Dobbs D. (2010) PRIDB: a protein-RNA interface database. Nucleic Acids Res. doi: 10.1093/nar/gkq1108.
• Maetschke SR, Yuan Z. (2009) Exploiting structural and topological information to improve prediction of RNA-protein binding sites. BMC Bioinformatics 10:341.
• Liu ZP, Wu LY, Wang Y, Zhang XS, Chen L. (2010) Prediction of protein–RNA binding sites by a random forest method with combined features. Bioinformatics 26:13.
Evaluation Metrics