protein family classification using sparse markov transducers proceedings of eighth international...

Protein Family Classification using Sparse Markov Transducers

Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB 2000), pp. 134-145

E. Eskin, W.N. Grundy, and Y. Singer

Cho, Dong-Yeon

Abstract

Classifying proteins into families using sparse Markov transducers (SMTs)
  Estimation of a probability distribution conditioned on an input sequence
  Similar to probability suffix trees
  Allowing for wild-cards

Two models

Efficient data structures

Introduction

Protein Classification
  Pairwise similarity
  Creating profiles for protein families
  Consensus patterns using motifs
  HMM-based approaches
  Probability suffix trees (PSTs)

A PST is a model that predicts the next symbol in a sequence based on the previous symbols.

This approach relies on common short sequences (motifs) that recur throughout the protein family.

One drawback of PSTs is that they rely on exact matches to the conditional sequences (e.g., 3-hydroxyacyl-CoA dehydrogenase).

VAVIGSGT and VGVLGLGT differ at three positions but both fit the pattern V*V*G*GT, where * is a wild card.
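The wild-card example above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `matches` and the use of `*` as the wild-card character are assumptions made for the example, not part of the original method.

```python
def matches(pattern: str, sequence: str, wildcard: str = "*") -> bool:
    """Return True if `sequence` fits `pattern`; `wildcard` matches any residue."""
    if len(pattern) != len(sequence):
        return False
    # Every non-wild-card position must agree exactly.
    return all(p == wildcard or p == s for p, s in zip(pattern, sequence))


pattern = "V*V*G*GT"
for seq in ("VAVIGSGT", "VGVLGLGT"):
    print(seq, matches(pattern, seq))
```

An exact-match model would treat the two sequences as unrelated contexts, while the wild-card pattern lets them share statistics.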

Sparse Markov Transducers (SMTs)

A generalization of PSTs

It can condition the probability model over a sequence that contains wild-cards.

In a transducer, the input symbol alphabet and output symbol alphabet can be different.

Two methods
  Single amino acid
  Protein family

Efficient data structure

Experiments
  Pfam database of protein families

Sparse Markov Transducers

A Markov Transducer of Order L
  Conditional probability distribution

$$P(Y_t \mid X_t X_{t-1} X_{t-2} X_{t-3} \cdots X_{t-(L-1)})$$

  X_k are random variables over an input alphabet
  Y_k is a random variable over an output alphabet

Sparse Markov Transducer
  Conditional probability distribution

$$P(Y_t \mid \phi^{n_1} X_{t_1} \phi^{n_2} X_{t_2} \cdots \phi^{n_k} X_{t_k}), \qquad t_i = t - \Big(\sum_{j=1}^{i} n_j\Big) - (i-1)$$

  $\phi$: wild card

Two approaches for SMT-based protein classification
  A prediction model for each family: single amino acid
  A single model for the entire database: protein family
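As a rough illustration of the sparse conditioning sequence, the sketch below computes the positions t_i = t - (n_1 + ... + n_i) - (i - 1) for a given list of wild-card run lengths n_1, ..., n_k and renders the resulting context. The function names, the 0-based indexing of the input sequence, and the use of `*` for the wild card are assumptions made for this example.

```python
from typing import List, Sequence


def conditioning_positions(t: int, gaps: Sequence[int]) -> List[int]:
    """Positions t_i = t - (n_1 + ... + n_i) - (i - 1) for wild-card runs n_1..n_k."""
    positions = []
    skipped = 0
    for i, n in enumerate(gaps, start=1):
        skipped += n  # running sum n_1 + ... + n_i
        positions.append(t - skipped - (i - 1))
    return positions


def conditioning_context(x: Sequence[str], t: int, gaps: Sequence[int]) -> str:
    """Render the sparse context, writing '*' for each skipped (wild-card) input."""
    parts = []
    for n, pos in zip(gaps, conditioning_positions(t, gaps)):
        parts.append("*" * n + x[pos])
    return "".join(parts)


x = "ABCDEFGHIJK"  # toy input sequence, 0-indexed
print(conditioning_positions(10, [0, 0, 0]))  # no wild-cards: plain order-3 context
print(conditioning_context(x, 10, [0, 0, 0]))
print(conditioning_positions(10, [1, 2]))     # one wild-card, a symbol, two wild-cards, a symbol
print(conditioning_context(x, 10, [1, 2]))
```

With all run lengths set to zero the positions reduce to t, t-1, t-2, ..., which is the ordinary Markov transducer case.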

Sparse Markov Trees

Representationally equivalent to SMTs

The topology of a tree encodes the positions of the wild-cards in the conditioning sequence of the probability distribution.

[Figure: example sparse Markov trees with wild-cards (*) and leaf nodes u]

Training a Prediction Tree

A set of training examples

The input symbols are used to identify which leaf node is associated with that training example.

The output symbol is then used to update the count of the appropriate predictor.

The predictor keeps a count of each output symbol it has seen.

We smooth each count by adding a constant value to the count of each output symbol (cf. the Dirichlet distribution).

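Example at leaf u1: the training examples (DACDADDDCAA, C) and (CAAAACAD, D) reach u1, so for the input AACCAAA the predicted output is C with probability 0.5 and D with probability 0.5.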
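A leaf predictor of the kind described above can be sketched as a table of counts with additive smoothing. The class name `LeafPredictor`, the `pseudo_count` parameter, and the three-letter output alphabet are assumptions for illustration; the slides do not give the constant that is actually added.

```python
from collections import Counter


class LeafPredictor:
    """Counts output symbols seen at one leaf and turns them into probabilities."""

    def __init__(self, alphabet, pseudo_count=0.0):
        self.alphabet = list(alphabet)
        self.pseudo_count = pseudo_count  # constant added to every count (assumed value)
        self.counts = Counter()

    def update(self, symbol):
        self.counts[symbol] += 1

    def probability(self, symbol):
        total = sum(self.counts[s] for s in self.alphabet)
        total += self.pseudo_count * len(self.alphabet)
        if total == 0:
            return 1.0 / len(self.alphabet)  # nothing seen and no smoothing: uniform
        return (self.counts[symbol] + self.pseudo_count) / total


# Two training examples reach the leaf, with outputs C and D.
plain = LeafPredictor("ACD")
smoothed = LeafPredictor("ACD", pseudo_count=1.0)
for output in ("C", "D"):
    plain.update(output)
    smoothed.update(output)

print(plain.probability("C"), plain.probability("D"))        # 0.5 0.5
print(smoothed.probability("C"), smoothed.probability("A"))  # 0.4 0.2
```

Without smoothing the leaf reproduces the 0.5 / 0.5 split of the example; with a pseudo-count of 1, the unseen symbol A also receives some probability mass.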

Mixture of Sparse Prediction Trees

We do not know which tree topology can best estimate the distribution.

A mixture technique employs a weighted sum of trees as a predictor.

$$P^t(Y \mid X^t) = \frac{\sum_T w_T^t \, P_T(Y \mid X^t)}{\sum_T w_T^t}$$

The weight of each tree is updated for each input string in the data set, based on how well the tree performed in predicting the output.

$$w_T^{t+1} = w_T^t \, P_T(y^t \mid x^t) = w_T^1 \prod_{i=1}^{t} P_T(y^i \mid x^i)$$

The prior probability of a tree is defined by the topology of the tree.
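The mixture prediction (a weight-normalized sum over trees) and the multiplicative weight update can be sketched as follows. Each "tree" here is just a Python function returning a distribution over output symbols; the two toy trees and the equal initial weights are made up for the example and say nothing about the real priors.

```python
from typing import Callable, Dict, List

Tree = Callable[[str], Dict[str, float]]  # maps an input string to P_T(y | x)


def mixture_predict(trees: List[Tree], weights: List[float], x: str) -> Dict[str, float]:
    """Weighted average of the trees' predictions, normalized by the total weight."""
    total = sum(weights)
    mixed: Dict[str, float] = {}
    for tree, w in zip(trees, weights):
        for y, p in tree(x).items():
            mixed[y] = mixed.get(y, 0.0) + w * p / total
    return mixed


def update_weights(trees: List[Tree], weights: List[float], x: str, y: str) -> List[float]:
    """Multiply each tree's weight by the probability it gave the observed output."""
    return [w * tree(x).get(y, 0.0) for tree, w in zip(trees, weights)]


def confident_tree(x: str) -> Dict[str, float]:  # toy tree, hypothetical
    return {"C": 0.9, "D": 0.1}


def uniform_tree(x: str) -> Dict[str, float]:  # toy tree, hypothetical
    return {"C": 0.5, "D": 0.5}


trees = [confident_tree, uniform_tree]
weights = [0.5, 0.5]  # stand-in for the topology-based priors
before = mixture_predict(trees, weights, "AACCAAA")
weights = update_weights(trees, weights, "AACCAAA", "C")  # observe output C
after = mixture_predict(trees, weights, "AACCAAA")
print(before["C"], after["C"])
```

After observing C, the tree that assigned C the higher probability gains relative weight, so the mixture's own probability for C rises.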

Implementation of SMTs

Two important parameters
  MAX_DEPTH: the maximum depth of the tree
  MAX_PHI: the maximum number of wild-cards at every node

Ten trees in the mixture if MAX_DEPTH = 2 and MAX_PHI = 1

Template tree
  We only store those nodes which are reached during training.
  Example input sequences: AA, AC and CD
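The count of ten trees can be reproduced with a small recursion under one plausible counting convention. The assumptions, none of which are stated in the slides, are: the alphabet has three symbols (A, C and D, as in the examples), a node is either a leaf or skips n wild-cards (0 <= n <= MAX_PHI) and then branches on every symbol, and such a step uses n + 1 levels of depth.

```python
from functools import lru_cache


def count_trees(max_depth: int, max_phi: int, alphabet_size: int) -> int:
    """Count tree topologies under the convention described in the lead-in."""

    @lru_cache(maxsize=None)
    def count(depth_left: int) -> int:
        total = 1  # option 1: this node is a leaf
        for n in range(max_phi + 1):
            used = n + 1  # n wild-cards plus one branching symbol
            if used <= depth_left:
                # each of the alphabet_size children chooses its subtree independently
                total += count(depth_left - used) ** alphabet_size
        return total

    return count(max_depth)


print(count_trees(2, 1, 3))  # 10
```

The count grows very quickly with depth and alphabet size, which is why a single shared template tree is used rather than an explicit list of trees.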

Efficient Data Structures

Performance of the SMT typically improves with higher MAX_PHI and MAX_DEPTH.

Memory usage becomes the bottleneck because it restricts these parameters to values that allow the tree to fit in memory.

Lazy Evaluation
  We store the tails of the training sequences and recompute part of the tree on demand.
  EXPAND_SEQUENCE_COUNT = 4

ACDACAC(A), DACADAC(C), DACAAAC(D), ACACDAC(A), ADCADAC(D)

ACDACAC(D)
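One way to picture lazy evaluation is a node that keeps the unprocessed tails of its training sequences and only builds children once enough of them have accumulated. The sketch below is a simplification with several assumptions: there are no wild-cards, tails are read left to right, and a node expands when it holds more than EXPAND_SEQUENCE_COUNT tails. The class name `LazyNode` and the exact threshold rule are hypothetical.

```python
EXPAND_SEQUENCE_COUNT = 4  # value quoted on the slide


class LazyNode:
    """Tree node that defers building its children until enough tails are stored."""

    def __init__(self):
        self.pending = []   # (tail, output) pairs not yet pushed down
        self.children = {}  # symbol -> LazyNode, created only on expansion
        self.counts = {}    # output symbol -> count seen at this node

    def add(self, tail, output):
        self.counts[output] = self.counts.get(output, 0) + 1
        if self.children:
            self._push_down(tail, output)
            return
        self.pending.append((tail, output))
        if len(self.pending) > EXPAND_SEQUENCE_COUNT:  # assumed expansion rule
            self.expand()

    def expand(self):
        pending, self.pending = self.pending, []
        for tail, output in pending:
            self._push_down(tail, output)

    def _push_down(self, tail, output):
        if not tail:
            return  # nothing left to condition on
        child = self.children.setdefault(tail[0], LazyNode())
        child.add(tail[1:], output)


root = LazyNode()
examples = [("ACDACAC", "A"), ("DACADAC", "C"), ("DACAAAC", "D"),
            ("ACACDAC", "A"), ("ADCADAC", "D")]
for tail, output in examples:
    root.add(tail, output)

print(sorted(root.children))            # children exist only after the fifth sequence
print(len(root.children["A"].pending))  # tails still stored, unexpanded, one level down
```

Subtrees that few sequences reach therefore stay as short lists of tails, which trades some recomputation for memory.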

Methodology

Data
  Two versions of the Pfam database
    Version 1.0: for comparing results with previous ones
    Version 5.2: the latest version
  175 protein families
    A total of 15610 single domain protein sequences containing a total of 3560959 residues
  Training and test data with a ratio of 4:1 for each family
    Transmembrane receptor: 530 protein sequences (424 + 106)
    The 424 sequences of the training set give 108858 subsequences that are used to train the model.

Building SMT Prediction Models
  A prediction model for each protein family
  A sliding window of size 11
    Prediction of the middle symbol a6 using neighboring symbols
    The input symbols are a5 a7 a4 a8 a3 a9 a2 a10 a1 a11.
  MAX_DEPTH = 7 and MAX_PHI = 1

Classification of a Sequence using an SMT Prediction Model
  Computation of the likelihood for an unknown sequence
  A sequence is classified into a family by computing the likelihood of the fit for each of the 175 models.
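The window construction and the likelihood-based classification can be sketched as below. The interleaved input order (a5, a7, a4, a8, ..., a1, a11) follows the slide; everything else is an assumption for illustration, in particular the model interface (a function returning P(y | x) > 0), the summing of log-probabilities over windows, and the two toy models.

```python
import math
from typing import Callable, Dict, Iterator, Tuple

WINDOW = 11  # sliding window size from the slide


def window_examples(seq: str) -> Iterator[Tuple[str, str]]:
    """Yield (input symbols, output symbol) for every window of length WINDOW."""
    half = WINDOW // 2
    for start in range(len(seq) - WINDOW + 1):
        w = seq[start:start + WINDOW]
        inputs = []
        for k in range(1, half + 1):
            inputs.append(w[half - k])  # a5, a4, ..., a1
            inputs.append(w[half + k])  # a7, a8, ..., a11
        yield "".join(inputs), w[half]  # the middle symbol a6 is the output


def log_likelihood(seq: str, model: Callable[[str, str], float]) -> float:
    """Sum of log P(output | inputs) over all windows (assumed scoring rule)."""
    return sum(math.log(model(x, y)) for x, y in window_examples(seq))


def classify(seq: str, models: Dict[str, Callable[[str, str], float]]) -> str:
    """Pick the family whose model gives the sequence the highest likelihood."""
    return max(models, key=lambda family: log_likelihood(seq, models[family]))


print(list(window_examples("ABCDEFGHIJK")))

# Toy stand-ins for trained per-family models (hypothetical).
toy_models = {
    "family_a": lambda x, y: 0.9 if y == "A" else 0.05,
    "family_b": lambda x, y: 0.1,
}
print(classify("AAAAAAAAAAAA", toy_models))
```

Listing the nearest neighbors first means the symbols closest to the predicted position sit nearest the root of the tree.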

Building the SMT Classifier Model
  Estimation of the probability over protein families given a sequence of amino acids
    Input sequence: an amino acid sequence from a protein family
    Output symbol: the protein family name
  A sliding window of 10 amino acids: a1,…,a10
  MAX_DEPTH = 5 and MAX_PHI = 1

Classification of a Sequence using an SMT Classifier
  Each position of the sequence gives us a probability over the 175 families measuring how likely the substring originated from each family.

Results

Time-Space-Performance tradeoffs

Results of Protein Classification using SMTs
  The SMT models outperform the PST models.
  SMT Classifier > SMT Prediction > PST Prediction

Discussion

Sparse Markov Transducers (SMTs)
  We have presented two methods for protein classification using sparse Markov transducers (SMTs).

Future Work
  Incorporating biological information into the model, such as Dirichlet mixture priors
  Combining a generative and a discriminative model
  Using both positive and negative examples in training
