do not reproduce without permission 1 gerstein.info/talks (c) 2003 1 (c) mark gerstein, 2002, yale,...


Computational Proteomics of Protein Complexes

Mark B Gerstein, Yale U

Talk at NIH, 2003.04.07


The Interactome: the Next ‘omic Step

Interactome

Proteome

Transcriptome

Genome


The popularity of interactome information

[Figure: citations per year (y-axis, 0 to 450) for 1999 to 2003, comparing: Gavin et al. p-p int dataset; Ho et al. p-p int dataset; Uetz et al. p-p int dataset; Ribosome Structure; Spellman et al. Expression Expt.; deRisi et al. Expression Expt.]


Computational Proteomics of Complexes

1. Interactions provide a systematic way of defining protein function on a genomic scale

2. Known complexes provide a benchmark to validate and integrate genome-wide interaction experiments, providing a more accurate interactome

3. Known complexes provide a focus for the integration of (non-interaction) genomic information – e.g. expression data

4. Extrapolating from known complexes, one can predict protein complexes on a genome-scale via integrating experimental interactions and non-interaction information (combining #1 and #2)


Circumscribing Protein Function in terms of Interactions


Understanding Protein Function on a Genomic Scale

• 250 of 650 known on chr. 22 [Dunham et al.]

• >>30K+ Proteins in Entire Human Genome (alt. splicing)

.…… ~650


Issues in defining protein function on a genomic scale

• Multi-functionality: 2 functions/protein (also 2 proteins/function)

• Role Conflation: molecular, cellular, phenotypic

• Fun terms… but do they scale?
   • Starry night
   • Sarah (affects female fertility); Sonic; Darkener of apricot & suppressor of white apricot; Redtape, gridlock, roadblock (when mutated block transport along axons); ROP vs ROM ("Regulator of Copy Number" or RNA-I-II-complex-binding-protein)

• For now, definable aspects of function: interactions, location, enzymatic rxn. [Babbit]


Ontologies for function: Networks, Hierarchies, DAGs

All of SCOP entries

ENZYME
   1 Oxidoreductases
      1.1 Acting on CH-OH
         1.1.1 NAD and NADP acceptor
            1.1.1.1 Alcohol dehydrogenase
            1.1.1.3 Homoserine dehydrogenase
      1.5 Acting on CH-NH
   3 Hydrolases
      3.1 Acting on ester bonds
         3.1.1 Carboxylic ester hydrolases
            3.1.1.1 Carboxylesterase
            3.1.1.8 Cholinesterase
      3.4 Acting on peptide bonds

NON-ENZYME
   1 Metabolism
      1.1 Carb. metab.
         1.1.1 Polysach. metab.
            1.1.1.1 Glycogen metab.
            1.1.1.2 Starch metab.
      1.2 Nucleotide metab.
   3 Cell structure
      3.1 Nucleus
      3.8 Extracel. matrix
         3.8.2 Extracel. matrix glycoprotein
            3.8.2.1 Fibronectin
            3.8.2.2 Tenascin

Levels of similarity: General similarity; Functional class similarity; Precise functional similarity
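A hierarchical numbering like the one in this figure makes functional similarity easy to quantify: the more leading levels two class numbers share, the more precise the similarity. The sketch below is only an illustration of that idea; the function name `shared_depth` is hypothetical, and the class numbers are copied from the ENZYME branch of the figure.

```python
def shared_depth(code_a: str, code_b: str) -> int:
    """Count the leading hierarchy levels two dotted class numbers share."""
    depth = 0
    for level_a, level_b in zip(code_a.split("."), code_b.split(".")):
        if level_a != level_b:
            break
        depth += 1
    return depth


# Class numbers taken from the ENZYME branch of the figure.
alcohol_dh = "1.1.1.1"        # Alcohol dehydrogenase
homoserine_dh = "1.1.1.3"     # Homoserine dehydrogenase
carboxylesterase = "3.1.1.1"  # Carboxylesterase

print(shared_depth(alcohol_dh, homoserine_dh))     # same 1.1.1 class
print(shared_depth(alcohol_dh, carboxylesterase))  # different top-level class
```

The comparison is only meaningful within one hierarchy: the ENZYME and NON-ENZYME branches reuse the same numbers (1.1.1.1 is both Alcohol dehydrogenase and Glycogen metab.), so a real implementation would have to carry the branch along with the number.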


Ontologies for function: Interaction vectors

Lan et al. IEEE (2002) & COSB (2003)


Validating and Integrating Genomic Protein-Protein Interaction Datasets with Known Complexes


Protein interaction data

• Databases (BIND, DIP, MIPS etc.)
   literature

• High-throughput datasets
   in vivo pull-down
   yeast two-hybrid

• Computational predictions

Tangential genomic data

• Expression data
• Phenotypic data
• Localization data


Combining interaction data

• High-throughput data is less reliable than more careful, smaller-scale experiments
   Orthogonal datasets

• Combining data increases
   accuracy
   coverage

• How to do this in a quantitative way?
   How to weight the different data sources?
   General classification problem (machine learning)
   Bayesian networks: probabilistic


Example of data integration: RNA polymerase II

Which subunits interact? -> protein-protein interaction experiments

Kornberg et al., 2001

Compare with Gold Std. structure:

Edwards, Kus, Jansen, Greenbaum, Greenblatt, Gerstein, TIG (2002)


Data integration: RNA polymerase II

Subunit A 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 5 5 5 5 5 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 11

Subunit B 2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 12 12



Subunit A 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 5 5 5 5 5 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 11

Subunit B 2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 12 12

structural contact 1 0 1 1 1 1 0 1 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0



Interaction experiments before structure was known

Subunit A 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 5 5 5 5 5 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 11

Subunit B 2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 12 12


Far western 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0

Cross-linking 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1

Far western 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0

Pull-down 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 1 0 1 0 0 1 0

Far western 1 0 0 0 1 0



Subunit A 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 5 5 5 5 5 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 11

Subunit B 2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 12 12


Far western 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0

Cross-linking 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1

Far western 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0

Pull-down 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 1 0 1 0 0 1 0


= false

= true



Integrate using naive Bayes classifier

Subunit A 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 5 5 5 5 5 5 6 6 6 6 6 8 8 8 8 9 9 9 10 10 11

Subunit B 2 3 5 6 8 9 10 11 12 3 5 6 8 9 10 11 12 5 6 8 9 10 11 12 6 8 9 10 11 12 8 9 10 11 12 9 10 11 12 10 11 12 11 12 12


Far western 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0

Cross-linking 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 0 1

Far western 1 1 1 1 1 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 0 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0

Pull-down 1 1 1 1 0 1 0 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0

Pull-down 1 1 1 0 1 0 0 1 0


Combined (Bayesian) 0 1 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

= false

= true
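The slide names a naive Bayes classifier but does not give its parameters, so the following is only a minimal sketch of the general technique. It assumes each experiment is summarized by two rates estimated against a gold standard (probability of a positive call given a true contact, and given no contact), treats experiments as conditionally independent, and skips pairs an experiment did not cover. The function name, the rates, and the prior are all hypothetical.

```python
import math
from typing import List, Optional, Tuple


def naive_bayes_posterior(
    observations: List[Optional[int]],
    rates: List[Tuple[float, float]],
    prior: float = 0.5,
) -> float:
    """Posterior probability that a subunit pair interacts.

    observations: one entry per experiment, 1 (interaction seen),
                  0 (not seen) or None (pair not covered).
    rates:        one (P(call=1 | interacts), P(call=1 | does not)) per experiment.
    """
    log_odds = math.log(prior / (1.0 - prior))
    for call, (p_pos_true, p_pos_false) in zip(observations, rates):
        if call is None:
            continue  # an uncovered pair contributes no evidence
        if call == 1:
            log_odds += math.log(p_pos_true / p_pos_false)
        else:
            log_odds += math.log((1.0 - p_pos_true) / (1.0 - p_pos_false))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical error rates for two experiments (not taken from the talk).
rates = [(0.8, 0.2), (0.7, 0.3)]

print(naive_bayes_posterior([1, 1], rates))     # both experiments agree
print(naive_bayes_posterior([1, None], rates))  # second experiment missing
print(naive_bayes_posterior([0, 0], rates))     # both negative
```

A pair would then be called "true" when the posterior exceeds a chosen threshold. Unlike simple voting, each experiment is weighted by how informative it is, which is what lets a reliable experiment outvote two noisy ones.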



Integrate using naive Bayes classifier

Majority 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Intersection 1 1 1 0 1 0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Union 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0
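For contrast with the Bayesian combination, the three voting rules in these rows can be sketched as follows. The slide does not say how pairs missing from an experiment were handled, so this sketch assumes that only experiments covering a pair get a vote; the function name `combine` and the toy calls are hypothetical.

```python
from typing import List, Optional


def combine(calls: List[Optional[int]], rule: str) -> Optional[int]:
    """Combine per-experiment calls for one subunit pair.

    calls: 1 (interaction seen), 0 (not seen) or None (pair not tested).
    rule:  "union", "intersection" or "majority".
    Returns None when no experiment covered the pair.
    """
    tested = [c for c in calls if c is not None]
    if not tested:
        return None
    positives = sum(tested)
    if rule == "union":
        return int(positives > 0)             # any experiment says yes
    if rule == "intersection":
        return int(positives == len(tested))  # every experiment says yes
    if rule == "majority":
        return int(2 * positives > len(tested))  # strictly more than half
    raise ValueError(f"unknown rule: {rule}")


# Toy calls for one pair from three experiments, one of which did not test it.
calls = [1, 0, None]
for rule in ("union", "intersection", "majority"):
    print(rule, combine(calls, rule))
```

All three rules weight every experiment equally, which is the main difference from the naive Bayes combination.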


Data integration: RNA polymerase II

Data set                  Subunit pairs covered   Fraction true [%]
Far western               15                      53
Cross linking             20                      65
Far western               30                      77
Pull-down                 35                      57
Pull-down                 35                      66
Pull-down                 9                       44
Far western               6                       50

Combined (Naive Bayes)    45                      80
Union                     45                      60
Intersection              45                      76
Majority                  45                      73
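The "fraction true" figures for the four combined rows can be checked against the 45-entry rows shown on the earlier slides, assuming "true" means that a call agrees with the structural contact for that subunit pair (an interpretation, not something the slide states). The rows below are copied from those slides; the helper name `fraction_true` is hypothetical.

```python
def fraction_true(calls, gold):
    """Percentage of subunit pairs where the call matches the gold standard."""
    agree = sum(1 for c, g in zip(calls, gold) if c == g)
    return 100.0 * agree / len(gold)


# 45 subunit pairs, in the order of the "Subunit A / Subunit B" rows.
# Trailing runs of zeros are written compactly.
STRUCTURAL = [1,0,1,1,1,1,0,1,0,1,0,0,0,1,1,0,1,0,0,0,0,1,1,1] + [0] * 21
COMBINED = [0,1,1,1,1,0,0,0,1,1,1,0,0,0,1,0,1,0,0,0,0,0,1] + [0] * 22
MAJORITY = [1,1,1,1,1,0,1,0,1,1,1,0,1,0,1,0,1,1,0,0,0,0,1,0,0,1] + [0] * 19
INTERSECTION = [1,1,1,0,1,0,0,0,1,1,1,0,0,0,1,0,1,1] + [0] * 27
UNION = [1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,
         0,1,1,0,1,1,0,1,0,0,1,0,0,0,0,0,1] + [0] * 8

for name, row in [("Combined (Naive Bayes)", COMBINED), ("Union", UNION),
                  ("Intersection", INTERSECTION), ("Majority", MAJORITY)]:
    print(f"{name:24s}{round(fraction_true(row, STRUCTURAL))}")
```

Under that interpretation the rounded percentages reproduce the table's 80, 60, 76, and 73. The seven single-experiment rows cannot be checked the same way, because the extraction lost which pairs their shorter rows cover.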


Comparison of interaction data sets

[Table with columns: Data set, Method]


Comparison of experimental data with gold standards

Positives: 8250 interactions in MIPS complexes

Negatives: ~2.7 M pairs in diff. subcellular compartments

Set of experimental “interactions”: overlap with the positives = TP, overlap with the negatives = FP
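Scoring a dataset against the two gold standards amounts to two set intersections. The sketch below assumes interactions are unordered protein pairs; the protein names and the helper names `pair` and `score` are hypothetical placeholders, not data from the talk.

```python
def pair(a: str, b: str) -> frozenset:
    """Unordered protein pair, so that (a, b) and (b, a) compare equal."""
    return frozenset((a, b))


def score(predicted: set, positives: set, negatives: set) -> tuple:
    """Return (TP, FP): predicted pairs found in the positive / negative gold standard."""
    true_positives = len(predicted & positives)
    false_positives = len(predicted & negatives)
    return true_positives, false_positives


# Toy gold standards (placeholders).
positives = {pair("P1", "P2"), pair("P2", "P3")}  # same known complex
negatives = {pair("P1", "P9"), pair("P3", "P9")}  # different compartments

# Toy experimental dataset; the last pair is in neither gold standard.
experiment = {pair("P2", "P1"), pair("P9", "P1"), pair("P4", "P5")}

tp, fp = score(experiment, positives, negatives)
print(tp, fp)
```

Pairs that fall in neither gold standard are simply not scored, which is why TP and FP counts need not add up to the size of the dataset.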


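Combining experimental data

[Figure: Venn diagram of the Gavin, Uetz, and Ho datasets, each region labeled TP / FP. Region labels as extracted, with adjacent labels fused: 90/556711/135; 1357/6226; 6/6; 353/21218/6; 15/1]

Jansen et al. JSFG 2002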


Integrating Structural Complexes with Non-interaction Genomic Information: Using them to Interpret Gene Expression data


Format of Gene Expression Data

[Figure: expression matrix. Columns: Conditions (e.g. Cancers) or Timepoints, numbered 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ….. and labeled A B A A A B B B A B B B B B A ….. Rows (GENES): MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, …. A second panel is a gene-by-gene matrix labeled on both axes with the same genes followed by ORC2, ORC6, ORC5, ORC4, ORC3, ORC1.]


Expression Correlations Segment Replication Complex into Component Parts

[Figure: gene-by-gene expression correlation matrix, labeled on both axes with MCM3, MCM6, CDC47, MCM2, CDC46, CDC54, DPB3, CDC45, DPB2, CDC2, CDC7, POL2, HYS2, POL32, DBF4, ORC2, ORC6, ORC5, ORC4, ORC3, ORC1. Block labels as extracted: MCMs; Polym. & prots.; ORC]


Range of Expression Correlations within Complexes

Replication Cplx: Overall .05; ORC .19, MCMs .75; Pol. .45, .75,

Ribosome: Overall .80; Large .80; Small .81

Proteasome: Overall .43; 20S .50; 19S .51
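A per-complex number like those above can be obtained by averaging the expression correlation over all pairs of genes in the complex. The sketch below assumes Pearson correlation and a plain mean over pairs; the slide does not state the exact measure, and the gene names and profiles are made-up placeholders.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(profiles):
    """Average Pearson correlation over all gene pairs in one complex."""
    pairs = list(combinations(profiles.values(), 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Placeholder expression profiles over four timepoints.
toy_complex = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # rises with geneA
    "geneC": [4.0, 3.0, 2.0, 1.0],  # falls while geneA rises
}

print(mean_pairwise_correlation(toy_complex))
```

In this toy case one anti-correlated member pulls the overall average down even though two genes track each other perfectly, which is the same effect that gives the replication complex a low overall value despite high values within its subcomplexes.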


Protein-Protein Interactions & Expression

[Figure: comparison between selected expression timecourses, Cell Cycle CDC28 expt. (Davis). Sets of interactions: all pairs (control); pairwise interactions (Uetz et al.); strong interactions in permanent complexes (from MIPS), clearly diff.]


Permanent v. Transient Complexes

Jansen et al., Genome Research, 2002


Genome-wide prediction of protein complexes based on both high-throughput interaction data and non-interaction, genomic information


Global Network of 3 Different Types of Relationships

~313K significant relationships from ~18M possible


Global Network of 3 Different Types of Relationships

Simultaneous 188K
Inverted 63K
Shifted 67K

~313K significant relationships from ~18M possible


Globally, how well do expression relationships predict known interactions?

Coverage of the 8250 Known Interactions in Complexes Found [MIPS], with Enrichment Compared to Randomized Expression Relationships:

Random   ~2%   1x (313K/18M)
CC       42%   24x

CC: 313K relationships from ~18M possible from clustering cell-cycle expt.


Combining Expression Data Sets Increases Coverage & Decreases Noise

KO: 278K relationships from clustering knock-out profiles [Rosetta]

KO 34% 22x



Combining Expression Data Sets Increases Coverage & Decreases Noise

CC: 313K relationships from ~18M possible from clustering cell-cycle expt.
KO: 278K relationships from clustering knock-out profiles [Rosetta]

Data set   Coverage   Enrichment
CC         42%        24x
KO         34%        22x
KO v CC    55%        111x
KO ^ CC    21%        254x
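The enrichment figures for CC and KO follow from dividing the observed coverage by the coverage expected from a random set of the same size, which is the "1x (313K/18M)" baseline of about 2%. The sketch below reproduces that arithmetic; it assumes the KO relationships are drawn from the same ~18M possible pairs, which the slide does not state, and the function names are hypothetical.

```python
def random_coverage(n_relationships: int, n_possible: int) -> float:
    """Fraction of known interactions a random set of this size would cover."""
    return n_relationships / n_possible


def enrichment(coverage: float, n_relationships: int, n_possible: int) -> float:
    """Observed coverage relative to the random expectation."""
    return coverage / random_coverage(n_relationships, n_possible)


N_POSSIBLE = 18_000_000  # ~18M possible pairs (assumed to apply to KO as well)

print(round(enrichment(0.42, 313_000, N_POSSIBLE)))  # CC: 42% coverage
print(round(enrichment(0.34, 278_000, N_POSSIBLE)))  # KO: 34% coverage
```

The union and intersection rows cannot be recomputed this way, because the sizes of the combined relationship sets are not given.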



Computational Proteomics of Complexes

1. Interactions provide a systematic way of defining protein function on a genomic scale

2. Known complexes provide a benchmark to validate and integrate genome-wide interaction experiments, providing a more accurate interactome

3. Known complexes provide a focus for the integration of (non-interaction) genomic information – e.g. expression data

4. Extrapolating from known complexes, one can predict protein complexes on a genome-scale via integrating experimental interactions and non-interaction information (combining #1 and #2)


For the Future

• Developing an accurate interactome for the cell, from prediction and through integration of high-throughput information

• Development of statistical approaches to combine and integrate information

• Development of database technologies to store heterogeneous and noisy genome-wide interaction datasets

• A moderate number of structural complexes are very useful as gold standard data


Protein complexes & Structural Genomics

• A computational challenge following from the solution of the parts list
   Given many monomeric structures produced by structural genomics, predict (or rationalize) the interactome through docking

• Maybe many structures will only be solved as complexes….


Bottlenecks in analysis of all of TargetDB (Interologs)


Acknowledgements

J Qian, R Jansen, A Drawid, C Wilson, D Greenbaum, C Goh, N Lan, H Hegyi, R Das, S Douglas, B Stenger, J Lin, Y Kluger

Collaborators
M Snyder (A Kumar, H Zhu, …)
A Edwards, B Kus, J Greenblatt

NIH

GeneCensus.org
