subcellular localization development of a prediction tool · 11/21/2013 · protein9 vector9...
TRANSCRIPT
![Page 1: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/1.jpg)
Tatyana Goldberg November 21, 2013
Subcellular Localization Development of a Prediction Tool
@
![Page 2: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/2.jpg)
Eukaryotic Cell
2 K Rogers (2011) Britannica
Nucleus
Plasma membrane
Smooth Endoplasmic reticulum
Rough Endoplasmic reticulum
Mitochondrion
Cytoplasm
Peroxisome
Golgi apparatus
![Page 3: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/3.jpg)
Protein Function: Examples
G Protein: Signaling &Transport
Across Cell Membranes
© http://www.pdb.org
Collagen: Structural Support
Crystalline: Capturing Light
ATP Synthase: Molecular Motor
Nucleosome: DNA Maintenance
Ribosome: Translation
3
![Page 4: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/4.jpg)
Sequence to Function Gap
• UniProtKB: more than 48 Mio protein sequences (2013_11)
• UniProtKB/Swiss-Prot: 0.5 Mio manually annotated entries (2013_11)
Need for reliable automatic predictions of protein
function from amino acid sequence alone
UniProt Consortium (2011) Nucleic Acids Res A Bairoch and R Apweiler (1997) J. Mol. Med.
MTSHSYYKDRLGFDPNEQQPGSNNSMKRSSSRQTTHHHQSYHHATTSSSQSPARISVSPGGNNGTLEYQQVQRENNW…
Predictor Protein Function
4
![Page 5: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/5.jpg)
Describing Protein Function
Gene Ontology (GO)
Molecular Function
Cellular Component
Biological Process
3461 terms
26116 terms
10543 terms
The Gene Ontology Consortium (2008) Nucleic Acids Res
GO: Cellular Component for HIST1H2AI 5
Stats: Nov 2013
![Page 6: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/6.jpg)
Common compartment Common physiological function K Rogers (2011) Britannica
Cellular Compartmentalization
Prokaryotic Cell Eukaryotic Cell (bacillus type)
6
![Page 7: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/7.jpg)
TM Bakheet and AJ Doig (2008) Bioinformatics
Subcellular Location in Drug Targets
Pero 1%
Microsome 2%
Other 2% Mito
2%
Nucleus 3%
ER 7%
Extra-cellular
13%
Cytoplasm 13%
Membrane 57%
Drug targets tend to be found in membranes, cytoplasm or are extra-cellular!
7
![Page 8: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/8.jpg)
8 L Rajendran, K Simons et al. (2010) Nature Reviews
![Page 9: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/9.jpg)
Mislocalizations Associated with Human Diseases
MC Hung and W Link (2012) JCS 9
![Page 10: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/10.jpg)
Mislocalization of PKC
© http://www.jbc.org/content/279/6.cover-expansion D Chen, AC Newton, et al. (2004) JBC
=> Cytokinesis Failure
Control
10
Control
Microtubules Nuclei
![Page 11: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/11.jpg)
Why Subcellular Localization?
• Knowledge of subcellular localization of a protein provides information about its functional role
• Improved target identification during the drug discovery process
• Aberrant protein subcellular location has been observed in the cells of several diseases
• Can help validate or analyze protein protein interactions
11
![Page 12: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/12.jpg)
Localization Predictions Indispensable
High-throughput methods
• cost money
• not accurate
• not complete
Intracellular mitochondrial net-
work, microtubules, nuclei
© http://www.microscopyu.com
SWISS-PROT: Saccharomyces cerevisiae
• ~6,600 manually annotated entries
• ~4,500 have localization information (Nov 2013) 12
![Page 13: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/13.jpg)
Localization is ideal for Predictions
• Localization is an easily identifiable functional feature
• Process of protein trafficking is quite well understood
• Localization data available in public protein databases
• Most proteins active in only one compartment 13
![Page 14: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/14.jpg)
Trafficking in the Cell
Key:
gated transport
transmembrane transport
vesicular transport
B Alberts, et al. (1994) The Cell 14
![Page 15: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/15.jpg)
Sorting Signals
B Alberts et al. (1994) The Cell
• A majority of signal peptides are still unknown
• For signal patches the situation is even worse
15
![Page 16: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/16.jpg)
Protein Trafficking via Sorting Signals
1999 Nobel Prize in Physiology/ Medicine given to Günter Blobel
„for the discovery that proteins have intrinsic signals that govern their transport and localization in the cell“
=> „address tag“ or „zip code“
Nucleus Intracellular Space Eukaryotic Cell PKKKRKV
http://www.time.com
16
![Page 17: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/17.jpg)
17
![Page 18: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/18.jpg)
Predicting Localization
1) Sorting signals (signalP, chloroP)
2) Homology-based inference (R Nair and B Rost, 2002)
3) Text-based analysis (LOCkey)
4) De novo (CELLO v.2.5)
5) Hybrid approaches (LOCtree, MultiLoc2, WoLF PSORT)
18
![Page 19: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/19.jpg)
Problems to Consider
• Biological data is „noisy“ - at the biological, experimental and curation levels
• Prediction performance must be established properly • Sequence similarity between training and test data sets
• Sequence redundancy in public databases
• Prediction results must be interpretable by users • Benchmarking prediction methods can be a difficult task
19
![Page 20: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/20.jpg)
New Method: LocTree2
Archaea
Bacteria
Eukaryota
20
![Page 21: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/21.jpg)
Archaea
Bacteria
Eukaryota
21
New Method: LocTree2
• One framework for all domains of life
• Robustness to sequencing errors
• Predictions for the largest number of classes so far:
• 3 for Archaea • 6 for Bacteria • 18 for Eukaryota
• Membrane predictions
• Decision tree resembling cellular protein sorting
• Reliability score measuring the strength of a prediction
• High performance
![Page 22: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/22.jpg)
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
22
![Page 23: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/23.jpg)
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
23
![Page 24: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/24.jpg)
Data Set for Development
- Exclusion of proteins with non-experimental annotations and of proteins with unclear or multiple localizations
- Identification of transmembrane proteins by „Single-pass“ or „Multi-pass“ keywords in the CC lines
24
![Page 25: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/25.jpg)
Homology Reduction
Bias in protein databases towards certain protein families Construction of sequence unique sets in two steps
1. HSSP-value ≤ 0
2. BLAST E-value > 10-3
0 100 200 300 400 500
0
20
40
60
80
100
Length of alignment
% id
enti
cal r
esid
ues
Below this threshold no reliable annotation of subcellular localization based on sequence homology
C Sander and R Schneider (1991) Proteins; Rost B. (1999) Protein Eng B Rost (2002) J Mol Biol SF Altschul, DJ Lipman, et al. (1990) J Mol Biol
25
![Page 26: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/26.jpg)
Data Statistics
0
5000
10000
15000
20000
25000
30000
35000
Archaea
Bacteria
Eukaryota
SWISS-PROT annotated
(2011_04 release)
# p
rote
ins
HSSP-value ≤ 0 BLAST E-value > 10-3
Homology Reduction 26
![Page 27: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/27.jpg)
Stratified k-fold Cross-Validation
K-fold Cross Testing: a random partitioning into k equally sized subsets, use each subset for testing exactly once and the remaining k-1 subsets for training
Stratified: each subset contains about the same proportion of class labels as the original data set
Performance!
27
![Page 28: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/28.jpg)
Size Increase of the Training Sets
Test Set Training Set SWISS-PROT annotated
Test Set Training Set
HSSP-value ≤0
HSSP-value ≤60
HSSP-value ≤0 HSSP-value ≤60
Size increase by almost a factor of 4!
28
![Page 29: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/29.jpg)
Performance Estimates
29
![Page 30: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/30.jpg)
Performance Estimate: Accuracy
• Number of correctly predicted proteins that are observed to be in localization L divided by the total number of proteins predicted in L
• How often the predicted localization class is correct (also called „precision“ or „specificity“)
30
𝐴𝑐𝑐 𝐿 = 100 𝑇𝑃
𝑇𝑃 + 𝐹𝑃
![Page 31: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/31.jpg)
Performance Estimate: Coverage
• Number of correctly predicted proteins that are observed to be in localization L divided by the total number of proteins observed in L
• How often the observed localization class is predicted correctly (also called „recall“ or „sensitivity“)
31
𝐶𝑜𝑣 𝐿 = 100 𝑇𝑃
𝑇𝑃 + 𝐹𝑁
![Page 32: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/32.jpg)
Measure for the Precision of Estimates
• How reliable are the estimates?
• What is the standard error?
B Efron (1979) The Annals of Statistics
Bootstrapping
• “To pull oneself up by one’s bootstraps” (The Adventures of Baron Munchausen)
• Introduced by Bradley Efron in 1979
• Popularized in 1980s due to the introduction of computers in statistical practice
32
![Page 33: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/33.jpg)
Bootstrapping: the Algorithm
• Start with your set of predictions
• Randomly draw a subset of n predictions from the original set
• Compute an estimate x for this subset of predictions
• Repeat previous two steps m times
bootstrap estimates x1, …, xm
33
![Page 34: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/34.jpg)
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
34
![Page 35: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/35.jpg)
Linear SVM
- Linear separation by building a hyperplane
- Hyperplane with the maximum margin is the best
Support vector
Cortes C. and Vapnik V. Machine Learning (1995) 35
![Page 36: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/36.jpg)
What if data is linearly non-separable?
Map data to a higher-dimensional feature space for a linear separation
Feature Map
Φ: x → φ(x)
Protein1
Protein2
Protein3
Protein5
Protein6
Protein7
Vector1
Vector8
Protein8
Protein9
Vector9
Vector2
Vector3
Vector4
Vector5
Vector6
A Kernel function performs this mapping and computes the distance between two vectors in the feature space.
36
![Page 37: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/37.jpg)
String Kernel Selection
Idea: find similarity between two protein sequences by looking at the number of common k-mers
1. String Subsequence Kernel (H Lodhi, S Watkins, et al. 2002 JMLR)
• Text classification • 3 parameters
2. Mismatch Kernel (H Lodhi, S Watkins, et al. 2002 JMLR)
• Protein classification • 2 parameters
3. Profile Kernel (R Kuang, C Leslie, et al. 2004 Bioinformatics)
• Protein classification • 2 parameters and evolutionary profile
37
![Page 38: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/38.jpg)
Profile Kernel: Example
Prot1 CATLGLLGLVAL Prot2 CARRGLLWAVAL
38 C Leslie, W Noble, et al. (2004) Bioinformatics
![Page 39: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/39.jpg)
Profile Kernel: Example
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
C Leslie, W Noble, et al. (2004) Bioinformatics 39
![Page 40: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/40.jpg)
Profile Kernel: Example
all possible 3-mers = 203
Φ(Prot1)=(0,…,0,…,0,…,0,…,0,…,0)
AAT CAT CTQ CAW
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
C Leslie, W Noble, et al. (2004) Bioinformatics 40
![Page 41: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/41.jpg)
Profile Kernel: Example
Φ(Prot1)=(0,…,0,…,0,…,0,…,0,…,0)
AAT CAT CTQ CAW
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
CAT
AAT CAW
CAT CQT …
… …
(2+1+1<5)
(1+1+1<5) (1+1+1<5)
(1+1+3=5)
C Leslie, W Noble, et al. (2004) Bioinformatics 41
![Page 42: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/42.jpg)
Profile Kernel: Example
Φ(Prot1)=(0,…,1,…,1,…,1,…,0,…,0)
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
CAT
AAT CAW
CAT CQT …
… …
(2+1+1<5)
(1+1+1<5) (1+1+1<5)
(1+1+3=5)
AAT CAT CTQ CAW
C Leslie, W Noble, et al. (2004) Bioinformatics 42
![Page 43: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/43.jpg)
Profile Kernel: Example
Profile1
A R T Q …
C 2 3 1 1 …
A 1 4 3 1 …
T 4 2 1 4 …
…
Prot1 CATLGLLGLVAL
e.g. k-mer length=3, conservation threshold=5
Φ(Prot2)=(0,…,0,…,1,…,1,…,0,…,0)
K(Prot1, Prot2)= Φ(Prot1)· Φ(Prot2) = 32
CAT
AAT CAW
CAT CQT …
…
(2+1+1<5)
(1+1+1<5) (1+1+1<5)
(1+1+3=5)
…
Φ(Prot1)=(0,…,1,…,1,…,1,…,0,…,0)
AAT CAT CTQ CAW
C Leslie, W Noble, et al. (2004) Bioinformatics 43
![Page 44: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/44.jpg)
Kernel Selection
SSK Mismatch Kernel Profile Kernel
Archaea 2 classes
97 ± 3 97 ± 3 97 ± 3
Bacteria 6 classes
88 ± 2 88 ± 2 93 ± 2
Eukaryota 10 classes 69 ± 1 Not available 81 ± 1
• For each kernel a range of parameter combinations tested
• Averaged overall accuracies over five training sets are reported
• One-against-all multiclass classifier
The Profile Kernel is a clear winner 44
![Page 45: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/45.jpg)
MultiClass Classifier Selection
1. One – Against – All (V Vapnik 1995 The Nature of Statistical Learning Theory)
2. Ensemble on Nested Dichotomies (S Kramer and E Frank 2004 Proceedings of ICML)
3. Ensemble of Class Balanced Dichotomies (E Frank, S Dong, et al. 2005 PKDD)
4. Ensemble of Data Balanced Dichotomies (E Frank, S Dong, et al. 2005 PKDD)
5. Nested Dichotomy of a fixed structure
45
![Page 46: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/46.jpg)
MultiClass Classifier #1: One - Against -All
Classification of data with labels from >2 classes
• Train n binary classifiers, one for each class against all other classes
• Predicted class is the class of the classifier with the highest value
#1
1 2,…,n
#2
2 1,…,n
#n
n 1,…,n-1
…
V Vapnik (1995) The Nature of Statistical Learning Theory 46
![Page 47: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/47.jpg)
MultiClass Classifier #2: ENDs
Decompose classes into a system of nested dichotomies (NDs).
J Fox (1997) Applied regression analysis, linear models, and related methods. Sage. S Kramer and E Frank (2004) Proceedings of ICML
• Every ND is equally probable! • Take an Ensemble of Nested Dichotomies (ENDs) • The result of END is the average of estimates of individual NDs
… …
47
![Page 48: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/48.jpg)
MultiClass Classifier #3, 4: ECBNDs, EDBNDs
ECBNDs: Ensemble of Class Balanced NDs • Each internal node of a binary tree is class balanced
EDBNDs: Ensemble of Data Balanced NDs • Each internal node of a binary tree is data balanced
Limited number of possible NDs
E Frank, S Dong, et al. (2005) PKDD 48
![Page 49: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/49.jpg)
MultiClass Classifier #5: MyND
Nested Dichotomy of a fixed architecture
49
![Page 50: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/50.jpg)
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
50
![Page 51: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/51.jpg)
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
Plasma membrane
Cytosol
Plasma membrane
Extra-cellular
51
![Page 52: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/52.jpg)
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
Cytosol
Plasma membrane
Extra-cellular
52
![Page 53: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/53.jpg)
LocTree2: Prokaryota
SVM
CYT
PM EXT
Archaea (3 classes)
SVM
Cytosol
Plasma membrane
Extra-cellular
CYT: cytosol, EXT: extra-cellular, PM: plasma membrane
53
![Page 54: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/54.jpg)
LocTree2: Prokaryota
Archaea (3 classes) Bacteria (6 classes)
SVM
SVM SVM
PM
PERI
OM
FIM EXT
SVM
SVM
SVM
SVM
CYT
SVM
CYT: cytosol, EXT: extra-cellular, FIM: fimbrium, OM: outer membrane, PERI: periplasmic space, PM: plasma membrane
SVM
CYT
PM EXT
SVM
54
![Page 55: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/55.jpg)
LocTree2: Eukaryota (18 classes)
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
CHL: chloroplast, CYT: cytosol, ER: endoplasmic reticulum, EXT: extra-cellular, GOL: Golgi apparatus, MIT: mitochondria, NUC: nucleus, PER: peroxisome, PM: plasma membrane, PLAS: plastid, VAC: vacuole
s
s
s s
s s s
s
s s
s s s
s s s s
s
s
s s s s s s
s sz s s s
s s s s s s s s
55
![Page 56: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/56.jpg)
LocTree2: Eukaryota (18 classes)
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
56
![Page 57: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/57.jpg)
LocTree2: Eukaryota (18 classes)
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
SVM SVM
57
![Page 58: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/58.jpg)
Secreted Non-secreted SVM
Non-membrane Trans-membrane
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Secreted Non-secreted
SVM
LocTree2: Eukaryota (18 classes)
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
SVM SVM
58
![Page 59: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/59.jpg)
MultiClass Classifier Selection
• Averaged overall accuracies over five training sets are reported
No difference in accuracy between ND-based approaches
Selected ENDs (ensemble of random NDs) and MyND (ND of a fixed structure) for the next step
One-against-All
ENDs ECBNDs EDBNDs MyND
Archaea 2 classes
97 ± 3 97 ± 3 97 ± 3 97 ± 3 97 ± 3
Bacteria 6 classes
93 ± 2 96 ± 2 96 ± 2 96 ± 2 96 ± 2
Eukaryota 10 classes 81 ± 1 86 ± 1 86 ± 1 86 ± 1 86 ± 1
59
![Page 60: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/60.jpg)
MultiClass Classifier Selection: SVMs Parameter Optimization
• Increased performance by 1%
• No difference in accuracy between END and MyND
Runtime END MyND
Archaea 3 classes 6.8 · 103 40 · 103
Bacteria 6 classes 2 · 103 15 · 103
Eukaryota 10 classes 0.2 · 103 6.1 · 103
MyND significantly faster: up to 30 times more sequences classified per minute.
Precomputed PSI-BLAST profiles and kernel values!
60
![Page 61: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/61.jpg)
Novel Method (ab initio)
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
61
![Page 62: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/62.jpg)
Eukaryota: Cross-validated Q18=65±2%
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Q =#Correctly Predicted
#Proteins
s
s
s s
s s s
s
s s
s
s s
s s s s
s
s
s
s
s s s s s s
z s s s
s s s s s s s s
62
![Page 63: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/63.jpg)
Q =#Correctly Predicted
#Proteins
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
Eukaryota: Cross-validated Q2=94±2%
63
![Page 64: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/64.jpg)
L Käll, A Krogh and E Sonnhammer (2005) Bioinformatics
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
Non-membrane Trans-membrane
SVM SVM SVM SVM
s
s
s s
s
s s
s
s
s
s
s s
s s
s s
s
s
s
s
s s s s
s sz
s s s
s s s s s
s s s
PolyPhobius=95±2%
Eukaryota: Cross-validated Q2=94±2%
64
![Page 65: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/65.jpg)
Eukaryota: Cross-validation
Accuracy = observed predicted
TP Coverage =
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
80±4 91±3
42±33 29±24
44±13 29±9
38±48 27±30
45±8 44±8
44±15 42±14
45±10 46±10
60±15 44±13
67±6 77±6
68±22 48±19
50±50 21±28
65
![Page 66: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/66.jpg)
Eukaryota: Cross-validation
Accuracy = observed predicted
TP Coverage =
SVM
SVM
SVM SVM
SVM
ER
EXT GOL NUC CYT VAC
PER
MIT
CHL PLAS
PM GOL VAC
ER NUC
PER
MIT CHL
SVM SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
SVM
80±4 91±3
42±33 29±24
44±13 29±9
38±48 27±30
45±8 44±8
44±15 42±14
45±10 46±10
60±15 44±13
67±6 77±6
68±22 48±19
50±50 21±28
66
![Page 67: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/67.jpg)
LocTree2‘s Novelty
Archaea (3)
Bacteria (6)
Eukaryota (18)
67
![Page 68: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/68.jpg)
Novel Method
Data Set Preparation
Training the Prediction Method
Archaea: 3 classes, 59 sequences
Bacteria: 6 classes, 479 sequences
Eukaryota: 18 classes, 1682 sequences
Testing
Comparison to External Tools
Kernel Selection
Multiclass Classifier Selection
SVMs Parameter Optimization
SVMs
5-f
old
cro
ss-v
alid
atio
n
10
-fo
ld c
ross
-val
idat
ion
68
![Page 69: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/69.jpg)
CELLO V.2.5
• Two-level system of SVMs
• Predicts 5 localization classes for bacteria and 12 for eukaryota
• Bacteria: data set derived from Swiss-Prot release 40.29, not homology reduced Eukaryota: data set derived from Swiss-Prot release 39, homology reduced
• http://cello.life.nctu.edu.tw/
CS Yu, JK Hwang, et al. (2006) Proteins 69
![Page 70: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/70.jpg)
LOCtree
© http://rostlab.org/ • Hierarchical system of SVMs
• Predicts 3 localization classes for prokaryotes, 6 for plants and 5 for animals
• Data set derived from Swiss-Prot release 40, homology reduced
• http://www.predictprotein.org/
R Nair and B Rost (2005) J. Mol.Biol. 70
![Page 71: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/71.jpg)
MultiLoc2
• Two – level system of SVMs
• Predicts 10 localization classes for animals and fungi, 11 classes for plants
• Data set derived from Swiss-Prot release 42, homology reduced
• http://abi.inf.uni-tuebingen.de/Publications/MultiLoc2
T Blum, O Kohlbacher, et al. (2009) BMC Bioinformatics 71
![Page 72: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/72.jpg)
WoLF PSORT
• k-NN classifier, extension of PSORT II
• Predicts 12 localization classes for eukaryotic proteins
• Data set derived from Swiss-Prot release 45 and not homology reduced + Arabidopsis thaliana entries from the GO website
• http://wolfpsort.org/
P Horton, K Nakai, et al. (2007) NAR
©http://wolfpsort.org
72
![Page 73: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/73.jpg)
LocTree2 Competitive for New Proteins
* NY Yu et al. (2010) Bioinformatics ° R Nair and Rost (2005) JoMB ** P Horton et al. (2007) NAR
New Swiss-Prot : Bacteria
Q5:
LocTree2 PSORTb 3.0*
86 ± 16 71 ± 21
LocTree2 LocTree°
86 ± 18 72 ± 21 Q3:
New Swiss-Prot : Eukaryota
Q9:
LocTree2 WoLF PSORT**
65 ± 14 62 ± 14
LocTree2 LocTree
66 ± 15 62 ± 17 Q5:
73
![Page 74: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/74.jpg)
Best Even for Sequencing Mistakes
Sequence-unique eukaryotic proteins from Swiss-Prot and LocDB databases
Full length 30N removed 30C removed 1/3 randomly
removed
62 ± 2 54 ± 2 60 ± 2 53 ± 2
56 ± 2 35 ± 2 47 ± 2 48 ± 2
56 ± 2 40 ± 2 52 ± 2 49 ± 2
LocTree 2
CELLO v. 2.5
WoLF PSORT
S Rastogi and B Rost (2011) NAR 74
![Page 75: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/75.jpg)
Conclusion
LocTree2 is a method for subcellular localization of proteins that • is available at http://www.rostlab.org/services/loctree2 • has high performnace accurate in distinguishing membrane/water-soluble proteins
• is robust against sequencing errors suitable for large-scale experiments
• aids the prediction of protein function
• is fast
75
![Page 76: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/76.jpg)
Many Thanks to
Burkhard Rost
Tobias Hamp
RostLab@TUM 76
![Page 77: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/77.jpg)
One more thing…
![Page 78: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/78.jpg)
Experimental Annotation of Organisms: Related Species Differ in Localization?
Human (H. sapiens) Orangutan (P. abelii)
PM
CYT
PER EXT
VAC
Source: SWISS-PROT
Chicken (G. gallus)
Plant (A. thaliana) Fish (D. rerio) Yeast (S. pombe)
NUC
NUC
NUC NUC
NUC
ER GOL
MIT
CHL
NUC
78
![Page 79: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/79.jpg)
LocTree2 Annotation of Organisms: Related Species Are Similar in Localization
Human (H. sapiens) Orangutan (P. abelii) Chicken (G. gallus)
Plant (A. thaliana) Fish (D. rerio) Yeast (S. pombe)
Source: SWISS-PROT, LocTree2
PM
CYT
ER
GOL
MIT
NUC
PER
EXT
VAC
79
![Page 80: Subcellular Localization Development of a Prediction Tool · 11/21/2013 · Protein9 Vector9 Vector2 Vector3 Vector4 Vector5 Vector6 A Kernel function performs this mapping and computes](https://reader034.vdocument.in/reader034/viewer/2022050306/5f6ed2c92fb4007eb92cb2ec/html5/thumbnails/80.jpg)
Predicted Annotation of Organisms: Easy, Fast & High Coverage
Human (H. sapiens) Orangutan (P. abelii) Chicken (G. gallus)
Plant (A. thaliana) Fish (D. rerio) Yeast (S. pombe)
Source: SWISS-PROT, CELLO v.2.5, WoLF PSORT, Yloc, LocTree2
PM
CYT
ER GOL
MIT
NUC
EXT
PER
VAC
80