Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids
Y. Wang, O. Zaiane, R. Goebel

Upload: ruby-yang

Post on 15-Mar-2016

Introduction
Protein: linear sequence of amino acids
Protein subcellular localization
  Plant: nuclear, cytoplasmic, mitochondrial, extracellular, …
Intracellular vs. extracellular
  Sequence information alone
  Class imbalance
  Transparency

Related Work
N-terminal sorting signals
Amino acid composition
Lexical analysis
Integrative approach
Subsequence methods

Predicting Extracellular Proteins

Feature Extraction
Support Vector Machine
Boosting
Frequent Pattern Method

Feature Extraction
Frequent subsequences: subsequences that occur in more than a certain percentage of extracellular proteins
  Strong discriminative power
  Perform similar functions via related biochemical mechanisms
  Capture local similarity
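The definition of a frequent subsequence can be illustrated with a minimal sketch. It assumes that subsequences are contiguous (as the use of a suffix tree suggests) and replaces the generalized suffix tree with a naive enumeration that is only practical for small inputs. The function name and the `min_sup`, `min_len`, and `max_len` parameters are hypothetical; `max_len` in particular is a cap added only to keep the enumeration small.

```python
from collections import Counter


def frequent_subsequences(proteins, min_sup=0.05, min_len=3, max_len=6):
    """Return contiguous substrings present in at least min_sup of the proteins.

    Naive enumeration for illustration only; not the suffix-tree method.
    """
    counts = Counter()
    for seq in proteins:
        seen = set()  # each substring is counted once per protein
        for length in range(min_len, max_len + 1):
            for start in range(len(seq) - length + 1):
                seen.add(seq[start:start + length])
        counts.update(seen)
    threshold = min_sup * len(proteins)
    return {sub for sub, n in counts.items() if n >= threshold}


# Toy sequences, not real proteins.
toy = ["MKALLV", "GKALLT", "PPPPPP"]
features = frequent_subsequences(toy, min_sup=0.6, min_len=3, max_len=4)
```

In this toy run only substrings shared by the first two sequences reach the 60% support threshold.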

Generalized Suffix Tree

Support Vector Machine
Input data represented as feature vectors
Find a linear separator that separates the data and maximizes the margin
Kernel function: nonlinear separator
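For reference, the maximum-margin idea is usually written as the textbook hard-margin problem below, followed by the kernelized decision function. The symbols (w, b, x_i, y_i in {+1, -1}, alpha_i, K) are standard notation assumed here, not taken from the slides, and `\operatorname` assumes the amsmath package.

```latex
\[
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w\cdot x_i + b)\ \ge\ 1,\qquad i=1,\dots,n
\]
\[
f(x)=\operatorname{sign}\Bigl(\sum_{i=1}^{n}\alpha_i\,y_i\,K(x_i,x)+b\Bigr)
\]
```

The margin between the two classes is 2/||w||, so minimizing ||w|| maximizes it; replacing the dot product with a kernel K gives the nonlinear separator.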

SVM for extracellular protein prediction

Data Transformation (sequence → vector)
  Frequent subsequences as features
  Transform protein sequences into binary vectors
Kernel Functions
  Linear kernel
  Polynomial kernel
  RBF kernel
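The sequence-to-vector step and the three kernels can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and `degree`, `coef0`, and `gamma` are illustrative defaults.

```python
import math


def to_binary_vector(sequence, features):
    """Component i is 1 if the i-th frequent subsequence occurs in the sequence."""
    return [1 if f in sequence else 0 for f in features]


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Toy feature list and sequences, not real data.
feature_list = ["ALL", "KAL", "PPP"]
v1 = to_binary_vector("MKALLV", feature_list)
v2 = to_binary_vector("GALLPPP", feature_list)
```

With binary vectors, the linear kernel simply counts the frequent subsequences that two proteins share.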

Boosting
Iterative algorithm to improve a weak classifier
A different weight distribution over the examples in each iteration
Increase the weights of incorrectly classified examples and decrease the weights of correctly classified ones

AdaBoost
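A compact sketch of discrete AdaBoost follows, assuming binary feature vectors, labels in {+1, -1}, and single-feature decision stumps as the weak classifier. The choice of weak learner, the function names, and the number of rounds are assumptions made for illustration; the slides do not specify them.

```python
import math


def stump(x, j, polarity):
    """Predict polarity when feature j is present, otherwise the opposite label."""
    return polarity if x[j] else -polarity


def train_adaboost(X, y, rounds=10):
    n, d = len(X), len(X[0])
    weights = [1.0 / n] * n
    ensemble = []  # list of (alpha, feature index, polarity)
    for _ in range(rounds):
        # Pick the stump with the lowest weighted error.
        best = None
        for j in range(d):
            for polarity in (1, -1):
                err = sum(w for w, x, t in zip(weights, X, y)
                          if stump(x, j, polarity) != t)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, polarity))
        # Raise weights of misclassified examples, lower the others, renormalize.
        weights = [w * math.exp(-alpha * t * stump(x, j, polarity))
                   for w, x, t in zip(weights, X, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump(x, j, polarity) for alpha, j, polarity in ensemble)
    return 1 if score >= 0 else -1


# Toy data: feature 0 separates the two classes.
X_toy = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_toy = [1, 1, -1, -1]
model = train_adaboost(X_toy, y_toy, rounds=3)
predictions = [predict(model, x) for x in X_toy]
```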

Frequent Pattern Method
Frequent pattern: *X1*X2*…*Xn* → extracellular
  X1, X2, …, Xn are frequent subsequences
  “*” can be replaced by zero or up to MaxGap amino acids when matching a protein sequence
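Matching such a pattern against a sequence can be sketched with a regular expression. This reads the slide literally, so every "*", including the leading and trailing ones, stands for at most MaxGap residues; whether the original method bounds the outer gaps in the same way is an assumption. The function name is hypothetical.

```python
import re


def matches_pattern(sequence, subsequences, max_gap):
    """True if the sequence fits *X1*X2*...*Xn* with every gap <= max_gap residues."""
    gap = ".{0,%d}" % max_gap
    regex = gap + gap.join(re.escape(s) for s in subsequences) + gap
    return re.fullmatch(regex, sequence, flags=re.DOTALL) is not None


# Toy sequence: M + KAA + XX + LL + PP
hit = matches_pattern("MKAAXXLLPP", ["KAA", "LL"], max_gap=2)
miss = matches_pattern("MKAAXXLLPP", ["KAA", "LL"], max_gap=1)
```

The second call fails because the two residues between KAA and LL exceed a gap limit of 1.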

FOIL algorithm
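FOIL grows a rule one condition at a time and scores each candidate condition by its information gain. The sketch below uses the standard propositional form of FOIL gain; whether the slides use exactly this variant is an assumption, and the function and argument names are hypothetical.

```python
import math


def foil_gain(p0, n0, p1, n1):
    """FOIL gain of extending a rule.

    p0, n0: positive and negative examples covered before the extension.
    p1, n1: positive and negative examples covered after the extension.
    """
    if p1 == 0:
        return 0.0
    before = math.log2(p0 / (p0 + n0))
    after = math.log2(p1 / (p1 + n1))
    return p1 * (after - before)


# Toy counts: precision rises from 10/100 to 8/10 after adding a condition.
gain = foil_gain(p0=10, n0=90, p1=8, n1=2)
```

A minimum-gain threshold can then be used to stop extending a rule once no candidate condition improves it enough.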

Z-number

Defined in terms of the accuracy of rule R and the support of rule R.

Experiments
Dataset (PASub project at UofA)
  Plant: 3293 proteins, 171 extracellular
Five-fold cross-validation
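With only 171 extracellular proteins out of 3293, how the folds are built matters. The sketch below shows one common choice, a stratified five-fold split that keeps the class ratio similar in every fold. Stratification is an assumption here; the slides only state that five-fold cross-validation was used. The function name and seed are hypothetical.

```python
import random


def stratified_folds(labels, k=5, seed=0):
    """Split example indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Labels with the class sizes reported for the plant dataset (1 = extracellular).
labels = [1] * 171 + [0] * (3293 - 171)
folds = stratified_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
```

Each fold then serves once as the test set while the other four are used for training.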

Page 16: Identifying Extracellular Plant Proteins Based on Frequent Subsequences of Amino Acids

Evaluation Metrics
Overall accuracy is not good enough
F-measure
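A short calculation shows why overall accuracy is misleading on this dataset. It uses the class sizes reported for the plant data and the usual definition of the F-measure as the harmonic mean of precision and recall; the `beta` parameter and the function name are illustrative assumptions.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


total, positives = 3293, 171

# A trivial classifier that labels every protein as intracellular:
accuracy_all_negative = (total - positives) / total   # about 0.948
f_all_negative = f_measure(tp=0, fp=0, fn=positives)  # 0.0
```

The trivial classifier looks strong on accuracy yet finds no extracellular protein at all, which the F-measure exposes.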

Result (SVM with subsequence)

Result (Boosting with subsequence)

Result (Frequent Pattern)

Parameters: MinLen=3, Min_gain=0.1, MinSup=5%, MinConf=80%, MaxGap=300

Result (SVM with composition)

Result (Boosting with composition)

Cross Comparison

SVM with combined features

Boosting with combined features

Effects of MinLen on SVM

Effects of MinLen on boosting

Conclusion
Presented three methods for identifying extracellular proteins based on frequent subsequences of amino acids
SVM achieves the best result
FSP method provides easily interpretable rules

Future Work
Use additional information about proteins (e.g., structure, function, …)
Integrate amino acid composition into the FSP method
Incorporate more biological knowledge
