
Post on 20-Dec-2015


Authors: Jason Weston et al.

PNAS

Presented by Tie Wang

Protein Ranking: From Local to Global Structure in the Protein Similarity Network

Outline

Introduction; Background; Method; Experiment Analysis

Introduction

Subtle pairwise sequence similarities imply structural, functional, and evolutionary relations among DNA and protein sequences;

Searching for biosequences in an online database is analogous to searching the WWW (a search engine searches the database for the query and returns a ranked list);

A protein ranking algorithm is presented for biosequence queries;

Early algorithms focus only on pairwise sequence similarity (Smith-Waterman local alignment search);

Statistical models use multiple alignments for similarity search (profile-based, PSI-BLAST);

Global similarity search can be mapped onto a protein similarity network.

Background

How to perform protein ranking?

Underlying idea: Google ranking

Key feature: Exploiting global structure by inferring it from local hyperlink structure.

Construct a protein similarity network

Add query sequence

Weight diffusion

Rank proteins upon convergence
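The steps above can be sketched in Python as a weight diffusion over a small similarity matrix. This is a minimal sketch in the spirit of the PageRank-style idea on the slide, not the authors' exact update rule: the function name `diffuse_ranks`, the damping parameter `alpha`, the row normalization, and the toy matrix are all assumptions made for illustration, and the query is assumed to be already present as a node of the network.

```python
def diffuse_ranks(K, query, alpha=0.9, tol=1e-10, max_iter=1000):
    """Spread the query's weight over a similarity network until it stabilizes.

    K[i][j] is a non-negative similarity weight on the edge i -> j
    (hypothetical input format, not taken from the slides).
    """
    n = len(K)
    # Row-normalize so each node passes on a total weight of at most 1.
    Kn = []
    for row in K:
        s = sum(row)
        Kn.append([w / s if s > 0 else 0.0 for w in row])
    # Initial activation: the query's direct (local) similarity to each protein.
    y0 = list(Kn[query])
    y = list(y0)
    for _ in range(max_iter):
        # Each protein keeps its direct score and receives damped weight
        # from its neighbours, which is how non-local structure enters.
        y_new = [y0[i] + alpha * sum(Kn[j][i] * y[j] for j in range(n))
                 for i in range(n)]
        delta = max(abs(a - b) for a, b in zip(y_new, y))
        y = y_new
        if delta < tol:
            break
    # Rank every protein except the query by its converged activation.
    order = sorted((i for i in range(n) if i != query), key=lambda i: -y[i])
    return order, y


# Toy chain 0 - 1 - 2 with an isolated protein 3; protein 0 is the query.
K = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 0],
     [0, 0, 0, 0]]
order, y = diffuse_ranks(K, query=0)
print(order)  # [1, 2, 3]
```

Protein 2 has no direct edge to the query, yet it ranks above the isolated protein 3 because weight reaches it through protein 1. With `alpha` below 1 and normalized rows, each iteration shrinks the difference between successive activation vectors, so the loop terminates.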

Algorithm

Experiment

Use the protein 3-D structure database SCOP as the gold standard.

Sequences have no more than 95% similarity.

The 7329 proteins are split by superfamily, with 379 superfamilies used for training and 332 for testing.

Three networks are generated using BLAST and PSI-BLAST.

Experiment

Value

Compare with two other experiments: 1. only local structure is considered; 2. non-local edges are included but weak edges are excluded. The results show that the second is only slightly worse than our algorithm.

Sj(i) is the E-value assigned to protein i given query j.

Analysis

Bower et al, Science vol 306, 2004

Cluster structure

Authors: Rui Kuang et al.

Bioinformatics

Presented by Tie Wang

Motif-Based Protein Ranking by Network Propagation

Outline

Introduction; Background; Method; Experiment Analysis

Direct measures of pairwise sequence similarity have proved effective for classification.

Performance drops when detecting subtle, remotely homologous sequences.

Such sequences share a conserved structure in at least some components.

The problem is formulated based on this observation.

Background

Protein motif bipartite network

• Each protein contains a set of motifs.

• Each motif belongs to a set of proteins.

• Their relationships are mapped to a bipartite graph, as shown on the left.

• The edge weight indicates the probability that motif x is in protein y.

MotifProp Algorithm

Set P represents protein sequences and set F represents motifs. H is the connectivity matrix between them.

The algorithm uses a row-normalized version of H, a vector of initial values for F, and a vector of initial values for P.
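The quantities listed above (the connectivity matrix H, its row-normalized version, and the initial vectors for the motif and protein nodes) suggest an alternating propagation over the bipartite graph. The following Python sketch is one plausible reading, not the published update rule, which did not survive in this transcript: the names `row_normalize` and `motif_propagate`, the restart parameter `alpha`, and the choice to start with all weight on the query protein are assumptions.

```python
def row_normalize(M):
    """Scale each row to sum to 1 (all-zero rows stay zero)."""
    out = []
    for row in M:
        s = sum(row)
        out.append([v / s if s > 0 else 0.0 for v in row])
    return out


def motif_propagate(H, query, alpha=0.8, tol=1e-10, max_iter=1000):
    """Alternate activation between protein nodes P and motif nodes F.

    H[i][j] is the weight of the edge between protein i and motif j.
    """
    n_p, n_f = len(H), len(H[0])
    Hp = row_normalize(H)  # each protein averages over its motifs
    H_t = [[H[i][j] for i in range(n_p)] for j in range(n_f)]
    Hf = row_normalize(H_t)  # each motif averages over its proteins
    # Assumed initial values: all weight on the query protein, none on motifs.
    P0 = [1.0 if i == query else 0.0 for i in range(n_p)]
    P = list(P0)
    F = [0.0] * n_f
    for _ in range(max_iter):
        # Proteins push their activation to the motifs they contain.
        F = [sum(Hf[j][i] * P[i] for i in range(n_p)) for j in range(n_f)]
        # Motifs push it back; (1 - alpha) keeps the query anchored.
        P_new = [(1 - alpha) * P0[i]
                 + alpha * sum(Hp[i][j] * F[j] for j in range(n_f))
                 for i in range(n_p)]
        delta = max(abs(a - b) for a, b in zip(P_new, P))
        P = P_new
        if delta < tol:
            break
    return P, F


# Protein 0 has motif 0, protein 1 has both motifs, protein 2 has motif 1.
H = [[1, 0],
     [1, 1],
     [0, 1]]
P, F = motif_propagate(H, query=0)
print(P)
```

In this toy case protein 2 shares no motif with the query, but it still receives activation through protein 1, which shares one motif with each. Because both normalized matrices only average their inputs and `alpha` is below 1, each round shrinks the change in P, which is why this particular sketch converges.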

MotifProp Algorithm

The convergence of MotifProp is guaranteed. The problem is reformulated based on an update rule defined over the same quantities: the row-normalized version of H and the vectors of initial values.

Edge weighting scheme

PSI-BLAST E-values are assigned to the edges between pairs of protein nodes.

Gaussian edge weights are calculated from these E-values.

The Gaussian weights from the query to each protein are assigned as the initial values.
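As a rough illustration of turning E-values into edge weights, the Python sketch below maps each E-value to a weight in (0, 1]. The slides do not preserve the formula, so the exponential form `exp(-E / sigma)`, the function name `evalue_weights`, and the value of `sigma` are all assumptions rather than the authors' kernel.

```python
import math


def evalue_weights(evalues, sigma=100.0):
    """Map E-values to weights in (0, 1]; a smaller E-value gives a larger weight.

    The kernel exp(-E / sigma) and the default sigma are assumed for illustration.
    """
    return [math.exp(-e / sigma) for e in evalues]


# Hypothetical E-values from the query to three proteins.
initial = evalue_weights([0.0, 1.0, 500.0])
print(initial)
```

A significant hit (E-value near 0) gets a weight near 1, while a weak hit decays toward 0, so the resulting vector can serve as the initial activation for propagation.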

Value estimation

Sq(i) is the E-value between protein i and query q.

Eq(j) is the E-value between the jth motif and the ith protein.


Estimation on substitution score

The substitution score S between a k-mer f and a sequence x can be estimated.

sl is a log-valued threshold: an S score below it can count as a motif hit against sequence x.

Sequential MotifProp

Empirical experiments suggest that using a weighted linear combination of multiple motifs does not improve the results.

Instead, apply a simple scheme over multiple motif sets. The motif nodes F are partitioned into n sets, where F(i) contains the motifs from the ith motif set. Each F(i) stands for a whole set of motifs rather than an individual motif.

Motif-rich regions

Experiments

7329 protein domains with known 3D structure from SCOP.

They are divided into training (4246) and testing (3083) sets.

An additional 10602 sequences from the Swiss-Prot database are added. Evaluation uses ROC curves.
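To make the ROC-based evaluation concrete, the Python sketch below computes the area under the ROC curve for one ranked list. It uses the standard pairwise formulation (the fraction of positive/negative pairs that are ordered correctly, with ties counted as half); the function name `roc_auc` and the toy scores are illustrative, and the slides do not say which ROC variant the authors report.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve for ranking scores with binary labels.

    Equals the probability that a randomly chosen positive scores
    higher than a randomly chosen negative (ties count as half).
    """
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical ranking scores; label 1 marks a true homolog of the query.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```

A value of 1.0 means every true homolog is ranked above every non-homolog, and 0.5 corresponds to a random ordering.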

Results of classification

Results of classification (cont)

Results on Motif rich region

Conclusion

Two methods are presented for protein classification based on protein ranking.

A similarity matrix and a protein/motif propagation network are the underlying structures.

Simple methods but innovative formulation. Better results compared with current approaches. Analysis of the results plays an important role.
