Proteomics: Analyzing protein space

Protein families

Why proteins?

• Shift of interest from “Genomics” to “Proteomics”

Classification of proteins into groups/families: what is it good for?

• Explosion in biological sequence data => need to organize!

• Understanding the relations/hierarchy of groups is interesting in itself, e.g. in evolutionary research.

• For applied research:

– Annotation of new proteins: predicting their function, structure, cellular localization etc.

– Looking for new folds

Sequence-based classification

• By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc.

• InterPro – Synthesizes the data from Pfam, PROSITE, PRINTS, ProDom, and SMART. Considered the “best” domain-based classification available.

Other kinds of classification

• Global classification:

– Systers, ProtoMap, CluSTr

– MetaFam synthesizes global classification data

• By structure similarity: SCOP etc.

• By function: Albumin, RetNet, TumorGenes etc.

ProtoNet

• A long-term project at HUJI led by Michal & Nati Linial.

• Provides automatic global classification of the known proteins.

• Performs hierarchical clustering on a sequence-based metric space of proteins.

• Allows one to “place” an external protein into the hierarchy.

http://www.protonet.cs.huji.ac.il

Why clustering?

• We want to refine the “similarity” notion, compared to e.g. BLAST

• Exploit transitivity to improve grouping

• Can use a low threshold on similarity:

- uses vast information from low similarities

- allowable because clustering filters noise

Why hierarchical?

Vertical Perspective

Horizontal Perspective

ProtoNet: Pre-Computation

• All-against-all gapped BLAST using BLOSUM62

• SwissProt release 40.28 database (114,033 proteins)

• BLAST identified ~2×10^7 relations between these proteins with relatively high sequence similarity (E-score of 100 or less); each such relation defines a distance $d(p_1, p_2)$

• Don’t want to lose information => very permissive!

• But still far fewer than the ~6.5×10^9 possible pairs, which would be infeasible to handle
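The two magnitudes on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the ~6.5×10^9 figure refers to the number of unordered protein pairs in the database; the variable names are illustrative only.

```python
# Number of proteins in SwissProt release 40.28 (figure from the slide)
n_proteins = 114_033

# All unordered pairs of distinct proteins: n * (n - 1) / 2
all_pairs = n_proteins * (n_proteins - 1) // 2

# Approximate number of relations reported by BLAST (figure from the slide)
blast_relations = 2 * 10**7

# Share of all pairs for which BLAST reported a relation
fraction_kept = blast_relations / all_pairs

print(f"all pairs: {all_pairs:.2e}")
print(f"fraction of pairs with a BLAST relation: {fraction_kept:.2%}")
```

Even with a very permissive threshold, only a fraction of a percent of all pairs carries a recorded similarity.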

Clustering Method

• First, each protein is considered a singleton cluster

• Next, we iteratively merge pairs of clusters

• We choose to merge the ‘most similar’ pair of clusters.

• As we progress, the number of singletons drops

• The clustering process gradually generates a tree of clusters

• Stop whenever we like
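The merging loop described on these slides can be sketched as a naive agglomerative procedure. This is a minimal illustration, not the ProtoNet implementation: the function names, the `stop_at` parameter, the use of the arithmetic mean as the merging score, and the toy distances are all assumptions made for the example.

```python
from itertools import combinations

def average_score(c1, c2, dist):
    """Arithmetic mean of inter-cluster distances (one possible merging score)."""
    total = sum(dist[frozenset((p, q))] for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def agglomerate(proteins, dist, stop_at=1):
    """Merge the most similar pair of clusters until `stop_at` clusters remain.

    Returns the list of merges (which encodes the tree) and the final clusters.
    """
    clusters = [frozenset([p]) for p in proteins]  # start from singletons
    merges = []
    while len(clusters) > stop_at:
        # Lowest score = smallest distance = most similar pair
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_score(pair[0], pair[1], dist))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((a, b))
    return merges, clusters

# Toy distances (hypothetical values, not real E-scores)
dist = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 5.0,
    frozenset(("A", "D")): 9.0,
    frozenset(("B", "D")): 8.0,
    frozenset(("C", "D")): 2.0,
}
merges, clusters = agglomerate(["A", "B", "C", "D"], dist, stop_at=2)
print(merges)
print(clusters)
```

Recomputing every pairwise score at each step is far too slow at the scale of a real protein database; the sketch only shows the order of operations (singletons, pick the best pair, merge, repeat, stop at a chosen level).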

How to merge?

• The potential merging score is calculated for each pair of clusters relevant for merging at each level

• At the bottom level (singletons), it equals $d(p_1, p_2)$

• Higher up, it is designed to reflect the similarity of the clusters.

• Depends on the inter-cluster similarities of pairs of proteins, each from a different cluster (for clusters of sizes $m$ and $n$ there are $m \cdot n$ such pairs).

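Potential Merging Score of $(C_1, C_2)$

• Arithmetic Mean

$\frac{1}{|C_1||C_2|} \sum_{(p_1, p_2) \in C_1 \times C_2} d(p_1, p_2)$

≥

• Geometric Mean

$\left( \prod_{(p_1, p_2) \in C_1 \times C_2} d(p_1, p_2) \right)^{\frac{1}{|C_1||C_2|}}$

≥

• Harmonic Mean

$\left( \frac{1}{|C_1||C_2|} \sum_{(p_1, p_2) \in C_1 \times C_2} \frac{1}{d(p_1, p_2)} \right)^{-1}$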
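The three merging scores can be written directly from their definitions as means over all inter-cluster pairs. The sketch below assumes strictly positive distances (a zero distance would break the geometric and harmonic means); the function names and the toy values are illustrative.

```python
import math

def inter_cluster_distances(c1, c2, d):
    """All distances d(p1, p2) with p1 in c1 and p2 in c2."""
    return [d(p1, p2) for p1 in c1 for p2 in c2]

def arithmetic_score(c1, c2, d):
    vals = inter_cluster_distances(c1, c2, d)
    return sum(vals) / len(vals)

def geometric_score(c1, c2, d):
    vals = inter_cluster_distances(c1, c2, d)
    # Mean of logs avoids overflow/underflow of a long product
    return math.exp(sum(math.log(v) for v in vals) / len(vals))

def harmonic_score(c1, c2, d):
    vals = inter_cluster_distances(c1, c2, d)
    return len(vals) / sum(1.0 / v for v in vals)

# Toy distance table (hypothetical values)
table = {("a1", "b1"): 1.0, ("a2", "b1"): 4.0}
d = lambda p, q: table[(p, q)]

c1, c2 = ["a1", "a2"], ["b1"]
am = arithmetic_score(c1, c2, d)
gm = geometric_score(c1, c2, d)
hm = harmonic_score(c1, c2, d)
print(am, gm, hm)
```

For any set of positive distances the arithmetic score is at least the geometric score, which is at least the harmonic score, so the harmonic mean is the one most strongly pulled toward the few closest inter-cluster pairs.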

Missing Data Treatment

• For a very low similarity pair (outside of the ~2×10^7 BLAST relations), its length is defined as

$d(p_1, p_2) = \mathrm{const} \cdot \max_{p_1, p_2} d(p_1, p_2)$

• In practice, the merging process should stop when the “infinite” lengths carry a very large weight in the score computed between new clusters (the signal is lost)
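One way to picture the missing-data rule is a lookup that falls back to a large default length for pairs without a recorded BLAST relation. This is a sketch under assumptions: the default is taken to be a constant multiplied by the largest recorded distance, the value `const=2.0` is arbitrary, and `missing_fraction` is a hypothetical helper for monitoring how much of a score comes from “infinite” lengths.

```python
def make_distance(known, const=2.0):
    """Build d(p, q) from recorded relations.

    known: dict mapping frozenset({p, q}) -> distance for pairs BLAST reported.
    Pairs absent from `known` get the default length const * max(known distances).
    """
    default = const * max(known.values())

    def d(p, q):
        return known.get(frozenset((p, q)), default)

    return d

def missing_fraction(c1, c2, known):
    """Fraction of inter-cluster pairs that have no recorded relation."""
    pairs = [(p, q) for p in c1 for q in c2]
    missing = sum(1 for p, q in pairs if frozenset((p, q)) not in known)
    return missing / len(pairs)

# Toy relations (hypothetical values)
known = {frozenset(("A", "B")): 3.0, frozenset(("B", "C")): 100.0}
d = make_distance(known, const=2.0)

print(d("A", "B"))                                   # recorded relation
print(d("A", "C"))                                   # falls back to the default
print(missing_fraction(["A", "B"], ["C"], known))    # share of "infinite" pairs
```

A merging loop could stop once `missing_fraction` for the best candidate pair exceeds some chosen limit; the slides do not state what limit was used.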

Results: ProtoNet top 20

[Figure: 20 largest clusters in the ProtoNet (Arithmetic) tree at a preselected level]

Problem of result assessment: what is a “good” cluster?

• Contains all proteins in the family and no proteins outside the family

• But what is a family? Does any keyword define a family?

• Stable as the merging events occur (long lifetime)?

Problem of result assessment: what is a “good” tree?

• Should we trust the resulting forest?

– Which clustering technique is better? Combined?

– Bootstrap?

• Do the clusters correspond to meaningful families of proteins?

– Validation against InterPro, SCOP etc.

– But we do not want to merely reconstruct them automatically!

• What is the right level/cut at which to look at the forest?

InterPro Validation

• InterPro annotation allows systematic validation of the generated clustering

• The ‘geometric’ method exhibits high cluster purity

– Corresponds to a low false-positive (FP) rate
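Cluster purity against an external annotation can be illustrated with a short calculation. The slides do not define purity, so the sketch assumes a common definition: the share of annotated proteins in a cluster that carry the cluster's most frequent annotation. The labels `fam_A` and `fam_B` are placeholders, not real InterPro identifiers.

```python
from collections import Counter

def cluster_purity(annotations):
    """Share of proteins carrying the most common annotation in the cluster.

    annotations: one label per annotated protein in the cluster.
    Returns 0.0 for a cluster with no annotated proteins.
    """
    if not annotations:
        return 0.0
    most_common_count = Counter(annotations).most_common(1)[0][1]
    return most_common_count / len(annotations)

# Toy cluster: nine proteins from one family, one from another (hypothetical)
cluster = ["fam_A"] * 9 + ["fam_B"]
purity = cluster_purity(cluster)
false_positive_rate = 1.0 - purity

print(purity, false_positive_rate)
```

Under this definition, the proteins that do not carry the majority annotation are the false positives of the cluster, which is why high purity corresponds to few false positives.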

The Domain Problem

• Many proteins are composed of several domains

• The sequence similarity tools used are therefore local in nature:

• The score of comparing two sequences is the edit distance between their most similar subsequences

• This creates a false similarity problem:

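The Modular Nature of Proteins

[Figure: domain architecture of four proteins]

Proteins: CSKP_HUMAN, DLG3_MOUSE, K6A1_MOUSE, MPP3_HUMAN

Domain legend: Serine/Threonine protein kinase family active site; Protein kinase C-terminal domain; PDZ domain; SH3 domain; Guanylate kinase

E-values shown in the figure: 8e-78, 2e-47, 9e-41, 1e-42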

False Transitivity of Local Alignment

[Figure: CSKP_HUMAN, DLG3_MOUSE, MPP3_HUMAN, K6A1_MOUSE]

We ran BLAST using default parameters:

All these pairwise similarities have an E-score better than 1e-40

If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
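The effect of assuming transitivity can be reproduced by treating each significant BLAST hit as an edge and grouping proteins by connected component. The sketch below uses a small union-find structure. The edge list is an assumption made for illustration: the slides do not state which pairs were compared, so a simple chain is used here in which K6A1_MOUSE and MPP3_HUMAN are never linked directly.

```python
def connected_components(nodes, edges):
    """Group nodes that are linked directly or through a chain of edges."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

proteins = ["CSKP_HUMAN", "DLG3_MOUSE", "MPP3_HUMAN", "K6A1_MOUSE"]

# ASSUMED significant hits (illustrative chain, not taken from the slides)
edges = [
    ("K6A1_MOUSE", "CSKP_HUMAN"),
    ("CSKP_HUMAN", "DLG3_MOUSE"),
    ("DLG3_MOUSE", "MPP3_HUMAN"),
]

components = connected_components(proteins, edges)
print(components)
```

All four proteins end up in a single group even though no edge joins K6A1_MOUSE to MPP3_HUMAN, which is the false grouping this slide warns about.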

Alternative methods

• Different types of clustering

– Non-binary

– Goal-oriented => semi-guided

– Graph theory insights

• Non-clustering ways of exploring the space of proteins

• Why BLAST E-score???

• Enrichment of the metric using structure
