
Funding

This work has been supported by the Sven och Lilly Lawskis fond för naturvetenskaplig forskning, the Swedish Research Council (VR-NT 2012-5046, VR-M 2010-3555) and the Swedish E-science Research Center.

Knowledge of the correct subcellular localization of a protein is necessary for understanding its function. Unfortunately, large-scale experimental studies are limited in their accuracy, so the development of prediction methods has been limited by the amount of accurate experimental data. Recently, however, large-scale experimental studies have provided new data that can be used to evaluate the accuracy of subcellular localization predictions in human cells. Using these data, we examined the performance of state-of-the-art methods and developed SubCons.

References

Salvatore, M. et al. SubCons: a new ensemble method for improved human subcellular localization predictions. Bioinformatics, 2017, 33(16), 2464-2470.

Salvatore, M., Shu, N., Elofsson, A. The SubCons web-server: A user friendly web interface for state-of-the-art subcellular localization prediction. Protein Science, 2017 Sep 13. doi: 10.1002/pro.3297

Workflow of SubCons. The figure shows the SubCons workflow. SubCons combines the predictions of four predictors using a Random Forest classifier. These tools accept either one or more FASTA sequences (CELLO2.5, MultiLoc2 and SherLoc2) or a FASTA sequence plus an MSA profile (LocTree2). The profile is constructed using PRODRES. The predicted localizations are first mapped to a standard three-letter code. Thereafter, a vector of 9 × 4 values (nine classes for each of the four predictors) is used as input for a Random Forest classifier that outputs 9 values, one for each class. The value for each class corresponds to the average score of that class over the forest.
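The mapping and feature-vector steps described in the workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SubCons implementation: the poster shows eight three-letter codes, so the ninth class ("EXC") is an assumption; the label-to-code table is hypothetical and deliberately incomplete; the ordering of the 36 values is arbitrary; and the per-tree scores are supplied by hand rather than produced by a trained forest.

```python
from statistics import mean

# Eight of these codes appear on the poster; "EXC" is an ASSUMED ninth class.
CLASSES = ["CYT", "ERE", "EXC", "GLG", "LYS", "MEM", "MIT", "NUC", "PEX"]
PREDICTORS = ["CELLO2.5", "LocTree2", "MultiLoc2", "SherLoc2"]

# HYPOTHETICAL mapping from tool-specific labels to three-letter codes.
LABEL_TO_CODE = {
    "nucleus": "NUC", "nuclear": "NUC",
    "cytoplasm": "CYT", "cytoplasmic": "CYT",
    "mitochondrion": "MIT", "mitochondrial": "MIT",
}


def to_code_scores(raw_scores):
    """Map one predictor's {label: score} output onto the nine classes."""
    scores = dict.fromkeys(CLASSES, 0.0)
    for label, score in raw_scores.items():
        code = LABEL_TO_CODE.get(label.lower())
        if code is not None:  # labels without a known mapping are ignored
            scores[code] += score
    return scores


def feature_vector(raw_by_predictor):
    """Concatenate 9 class scores from each of the 4 predictors (36 values)."""
    vector = []
    for name in PREDICTORS:
        mapped = to_code_scores(raw_by_predictor.get(name, {}))
        vector.extend(mapped[c] for c in CLASSES)
    return vector


def forest_scores(per_tree_scores):
    """Average the class scores over all trees (one score list per tree)."""
    return [mean(column) for column in zip(*per_tree_scores)]
```

A predictor that returns no score for a class simply contributes 0.0 at that position, which keeps the vector length fixed at 36 regardless of which tools report which compartments.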

Introduction

The SubCons web-server: A user friendly web interface for state-of-the-art subcellular localization prediction

1 Science for Life Laboratory, Stockholm University, 171 21, Solna, Sweden.

2 Department of Biochemistry and Biophysics, Stockholm University, 106 91, Stockholm, Sweden.

3 Sweden Bioinformatics Infrastructure for Life Sciences (BILS), Stockholm University, Stockholm, Sweden.

Conclusions

• SubCons is an ensemble method that combines four predictors using a Random Forest classifier. It is freely available as a web-server at http://subcons.bioinfo.se

• SubCons outperforms earlier methods on a dataset of proteins whose subcellular localization is confirmed by two independent methods.

• Given nine subcellular localizations, SubCons achieves an F1-score of 0.79, compared to 0.70 for the second-best method. Furthermore, at a false positive rate of 1%, the true positive rate is over 58% for SubCons, compared to less than 50% for the best individual predictor.
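The "true positive rate at a false positive rate of 1%" figure quoted above can be read off a ROC curve. The sketch below shows one way to compute such a value from ranked prediction scores. It assumes a binary setting in which each prediction is marked correct or incorrect; how the poster reduces the nine-class problem to a ROC curve is not stated, so the function name and inputs are illustrative only.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.01):
    """Return the highest true positive rate reachable with FPR <= max_fpr.

    scores: confidence value per prediction (higher means more confident).
    labels: True for a positive example, False for a negative one.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_tpr = 0.0
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score so the threshold is well defined.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        if fp / negatives > max_fpr:
            break  # lowering the threshold further only raises the FPR
        best_tpr = tp / positives
    return best_tpr
```

For example, `tpr_at_fpr(scores, labels, max_fpr=0.01)` gives the sensitivity at the strictest operating point that still admits at most 1% false positives.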

Materials

Venn diagrams showing the training and golden datasets. The figure shows the three verified experimental datasets used to train SubCons (left). Additionally, it shows the golden dataset used to test SubCons, in which at least two independent methods must confirm the subcellular localization (right).

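[Figure: Venn diagrams of the Mass-Spec, SLHPA and UniProt datasets. Counts shown in the diagrams, in extraction order: 4305, 2431, 1080, 212, 95, 127, 72.]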
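The golden-dataset criterion in the caption (at least two independent methods confirming the localization) can be expressed as a small filter over the three sources. The following is a sketch under assumptions: each source is represented as a dictionary from a protein identifier to a single three-letter code, "confirm" is taken to mean that the sources report the same code, and the function name and identifiers are hypothetical.

```python
from collections import Counter


def golden_dataset(*sources, min_support=2):
    """Keep proteins whose localization is reported by >= min_support sources.

    Each source is a dict {protein_id: localization_code}.
    """
    golden = {}
    all_ids = set().union(*(source.keys() for source in sources))
    for protein_id in sorted(all_ids):
        # Count how many sources report each localization for this protein.
        votes = Counter(s[protein_id] for s in sources if protein_id in s)
        code, count = votes.most_common(1)[0]
        if count >= min_support:
            golden[protein_id] = code
    return golden
```

With three sources and `min_support=2`, at most one localization can reach the threshold, so no tie-breaking rule is needed. With more sources, a tie would have to be resolved explicitly.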

Overall performance in the golden dataset. ROC curves showing the performance of the benchmarked tools on the golden dataset over the entire range of sensitivity and specificity (left). F1-score of each predictor on the golden dataset (right).

Results

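[Figure: bar chart "Overall performance"; x-axis: F1-score (0.0 to 0.8).]

Bar labels, in extraction order: MultiLoc2, SherLoc2, WoLF PSORT, Majority Vote, CELLO2.5, LocTree2, YLoc, SubCons

Bar values, in extraction order: 0.70, 0.70, 0.53, 0.66, 0.66, 0.70, 0.69, 0.79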
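The F1-score used throughout the results is the harmonic mean of precision and recall, F1 = 2TP / (2TP + FP + FN). The sketch below computes it per class and then averages over classes. The averaging scheme (an unweighted macro average) is an assumption, since the poster does not say how the per-class values are combined into a single score.

```python
from statistics import mean


def f1_per_class(true_labels, predicted_labels):
    """Return {class: F1} for every class seen in either label list."""
    pairs = list(zip(true_labels, predicted_labels))
    result = {}
    for c in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denominator = 2 * tp + fp + fn
        result[c] = 2 * tp / denominator if denominator else 0.0
    return result


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of the per-class F1 values (ASSUMED averaging scheme)."""
    return mean(f1_per_class(true_labels, predicted_labels).values())
```

A macro average treats rare compartments such as peroxisomes the same as abundant ones; a support-weighted average would instead be dominated by the largest classes.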

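[Figure: bar chart "Performance for different localizations"; x-axis: F1-score (0.0 to 0.8).]

Bar labels, in extraction order: PEX, CYT, GLG, MEM, ERE, LYS, NUC, MIT

Bar values, in extraction order: 0.85, 0.53, 0.85, 0.43, 0.67, 0.56, 0.67, 0.61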

[Figure: ROC curves. x-axis: Log10 False Positive Rate (1 − Specificity), ticks at 0.001, 0.010, 0.100 and 0.500; y-axis: True Positive Rate (Sensitivity), 0.0 to 1.0. Legend: CELLO2.5, LocTree2, Majority vote, MultiLoc2, SherLoc2, SubCons, WoLF PSORT, YLoc.]

Performance for different localizations in the golden dataset. Performance for different localizations of each predictor in the golden dataset, in terms of F1-Score (left). Performance of SubCons in the golden dataset in terms of F1-Score for every single localization (right).

[Figure: F1-score of each predictor per localization. x-axis: MIT, NUC, LYS, ERE, MEM, GLG, CYT, PEX; y-axis: F1-score (0.0 to 0.8). Legend: CELLO2.5, LocTree2, Majority Vote, MultiLoc2, SherLoc2, SubCons, WoLF PSORT, YLoc.]

Salvatore Marco 1,2, Warholm Per 1,2, Basile Walter 1,2, Shu Nanjiang 1,2,3 and Elofsson Arne 1,2.

Stockholm University and Science for Life Laboratory

Marco Salvatore, PhD student

E-mail: [email protected]

Website: http://bioinfo.se/members/marco-salvatore/

Venn diagram showing the three experimental datasets. The figure shows the three verified experimental datasets used to train and test SubCons: Mass-Spec (Localization of Organelle Proteins by Isotope Tagging (LOPIT)), Human Protein Atlas (SLHPA) and UniProt.

[Figure panels: Train dataset; Test (golden) dataset.]