Bioinformatics for Protein Interactions and
Biological Networks
Adrian Heilbut
Hogue Lab, Samuel Lunenfeld Research Institute, Mount Sinai Hospital
Department of Biochemistry, University of Toronto
http://individual.utoronto.ca/amh
CPI Protein Protein Interactions Tutorial 2004 Montreal
CPI PPI Tutorial 2004 - AMH
Bioinformatics for Protein Interactions and Networks
1. Laboratory Information Management and Relational Databases
2. Experimental Design and Statistics
3. MS/MS Protein Identification
4. Graph models of biological networks
   a) Graph theory
   b) Statistical mechanics of biological graphs
5. Accessing and Visualizing Data
6. Interaction Prediction & Confidence measures
[Diagram labels: experimentation; data integration and analysis]
Objectives
1. To appreciate the importance of effective laboratory information management systems.
2. To understand the usefulness of relational databases and to be able to use some basic SQL.
3. To think more critically about experimental design issues in large-scale interaction experiments.
4. To be able to integrate and visualize interaction data from public databases to help generate biological hypotheses.
5. To understand graph-theory approaches for local and global characterization and analysis of biological networks.
tools
1. relational databases
2. statistics and machine learning
3. graph theory
1. LIMS & Relational Databases
   1. HMS-PCI protocol review
   2. Laboratory Information Management Systems
   3. Relational databases
   4. An Example LIMS database
   5. Using SQL
   6. Summary
example: HMS-PCI workflow
• sample explosion and tracking
• grounding to biology
• process monitoring and quality control
Bait selection
Cloning
Transfection & Expression
Immuno-precipitation
Separation
Digestion
LC MS/MS
Protein Identification
LIMS
Interpretation
Data distribution
High-throughput mass spectrometry protein complex identification
Biological databases: Sequences, Localization, Quantitation, Genetics, Function, Expression, Interactions
Laboratory Information Management
• LIMS: laboratory information management system
• Research vs. Manufacturing vs. Clinical labs - very different requirements
• Homebrew vs. commercial
• Development time & resources
• it takes longer than you think...
• Relational databases and SQL
LIMS system architectures
[Diagram components: SQL database server; client software; application server; web application; web browser; instrument software; bioinformatics software; Cytoscape; rest of biological data in universe (Genbank, BIND, ...). Users shown: frustrated scientist; SQL-savvy scientist.]
Relational Databases for Biology
• Advantages of using a real database
• forces careful thought about data model
• centralized storage and security
• dissemination of data
• scalability
• ad-hoc queries with SQL
• ACID (Atomicity, Consistency, Isolation, Durability)
• Commercial: DB2, SQL Server, Oracle, Access
• Free: MySQL, PostgreSQL, Firebird
• Excel, XML: great for certain things, but not relational databases
Relational Model - 1970
• Developed to solve problems of data independence - how do you store structured data when the structure of your data is evolving?
• Users should be insulated from internal representations of data
• Everything in a relational database can be viewed in terms of tables and operations on tables that result in new tables
• Firmly grounded in math and logic
Codd, E.F. “A Relational Model of Data for Large Shared Data Banks” Communications of the ACM, Vol. 13, No. 6, June 1970, pp. 377-387
Relational Model
• Data is modeled as “relations”
• Attributes: S1, S2, S3 from a specific domain (fields, each of a specific data type)
• Relations: set of n-tuples (rows), where first element from S1, second from S2, etc.
• Relations can be represented as tables
• All rows are distinct
• Columns are labelled
• Rows are in no particular order
Normalization
• Basic idea:
• Unrelated facts should be stored separately
• makes updating and querying much easier
• First Normal Form: all records have the same number of fields, and every attribute is atomic (single-valued)
• Second Normal Form: 1NF + every nonkey attribute depends on the whole primary key, not just part of it
• Third Normal Form: 2NF + no nonkey attribute depends on another nonkey attribute
ref: Kent, W. “A Simple Guide to Five Normal Forms in Relational Database Theory” Communications of the ACM 26(2), Feb 1983, 120-125. available at: http://www.bkent.net/Doc/simple5.htm
Relational Algebra
• Language for manipulating sets of relations
• ∪ Union
• ∩ Intersection
• − Difference
• σ Selection
• π Projection
• × Cartesian product
• ρ Renaming
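The operators listed above can be tried on small in-memory relations. The following is a minimal sketch in Python, assuming a relation is modelled as a set of tuples with its column names kept in a separate tuple. The helper names (`select`, `project`, `product`) and the sample rows are illustrative and do not come from the tutorial.

```python
from itertools import product as cartesian

# A relation is a set of tuples (rows); column names are kept alongside.
orfs_cols = ("ORFName", "CommonName")
orfs = {("YDL102W", "Cdc2"), ("YBR160W", "Cdc28")}

domains_cols = ("ORFName", "PfamID")
domains = {("YBR160W", "PF00069")}

def select(rows, cols, predicate):
    """Selection (sigma): keep rows for which predicate(row_as_dict) is true."""
    return {r for r in rows if predicate(dict(zip(cols, r)))}

def project(rows, cols, keep):
    """Projection (pi): keep only the named columns; duplicate rows collapse."""
    idx = [cols.index(c) for c in keep]
    return {tuple(r[i] for i in idx) for r in rows}

def product(rows_a, rows_b):
    """Cartesian product: every row of A paired with every row of B."""
    return {a + b for a, b in cartesian(rows_a, rows_b)}

# Union, intersection and difference are the built-in set operators.
more_orfs = {("YBR160W", "Cdc28"), ("YFL039C", "Act1")}
union = orfs | more_orfs
intersection = orfs & more_orfs
difference = orfs - more_orfs

# An inner join is a product followed by a selection on the matching columns.
# Renaming (rho) is done here by giving the combined columns distinct names.
joined_cols = ("o.ORFName", "CommonName", "d.ORFName", "PfamID")
joined = select(product(orfs, domains), joined_cols,
                lambda r: r["o.ORFName"] == r["d.ORFName"])
kinase_names = project(joined, joined_cols, ["CommonName"])
print(kinase_names)
```

Because every operator takes relations and returns a relation, the results can be fed straight into further operators, which is the property SQL builds on.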
SQL
• Structured Query Language
• Based on relational algebra
• A (semi)standard way to define a database (DDL), store, and retrieve data
• Easy - just logic
• CREATE TABLE...
• SELECT ... INNER JOIN ... WHERE
• DELETE
• UPDATE
a very simple LIMS...
Constructs: ORF VARCHAR(15), TagPosition CHAR(1), CloneID INT
Pulldowns: PulldownID INT, CloneID INT
Bands: CloneID INT, PullDownID INT, BandID INT, PlateNum INT, WellNo INT
Hits: PlateNum INT, WellNo INT, GI INT, Score INT
GIORF: GI INT, ORF VARCHAR(15)
and a little bit of biological data...
orfs: ORFName VARCHAR(15), CommonName VARCHAR(15), SGDID VARCHAR(15), Description VARCHAR(250)
orfdomains: ORFName VARCHAR(15), PfamID VARCHAR(25), Evalue FLOAT (i.e. from running hmmer on all the yeast proteins)
domdesc: PfamID VARCHAR(15), DomDesc VARCHAR(250)
ORFGO: ORFName VARCHAR(15), GOID VARCHAR(20)
localizations: ORFName VARCHAR(15), Loc VARCHAR(20)
Relational Integrity and Foreign Keys
• Foreign keys allow you to specify that a field in one table must relate to a field in another table
• Database will not allow changes (deletions, updates) that violate integrity of links
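Both points can be seen in a small experiment. The sketch below uses Python's standard `sqlite3` module with an in-memory database. The two tables are cut-down versions of Constructs and Pulldowns, and the primary-key and `REFERENCES` declarations are assumptions added for illustration, since the slides list only column names and types. SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, and other database systems differ in such details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled

conn.execute("""CREATE TABLE Constructs (
    CloneID INTEGER PRIMARY KEY,
    ORF VARCHAR(15),
    TagPosition CHAR(1))""")
conn.execute("""CREATE TABLE Pulldowns (
    PulldownID INTEGER PRIMARY KEY,
    CloneID INTEGER NOT NULL REFERENCES Constructs(CloneID))""")

conn.execute("INSERT INTO Constructs VALUES (1, 'YDL102W', 'N')")
conn.execute("INSERT INTO Pulldowns VALUES (10, 1)")  # accepted: clone 1 exists

def rejected(sql):
    """Return True if the database refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A pulldown that points at a non-existent clone is refused.
orphan_insert_rejected = rejected("INSERT INTO Pulldowns VALUES (11, 99)")
# Deleting a construct that a pulldown still refers to is refused as well.
parent_delete_rejected = rejected("DELETE FROM Constructs WHERE CloneID = 1")
print(orphan_insert_rejected, parent_delete_rejected)
```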
Indexing & Performance
• Query performance depends on having appropriate indexes, to avoid doing linear searches
• O(N) → O(log N)
• with 10^8 rows, that makes a big difference
• once your data is in a database, easy as CREATE INDEX ON...
• Commercial database management systems do sophisticated cost-based query optimization to execute complicated queries efficiently
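The effect of an index can be checked by asking the database how it plans to run a query. The sketch below uses Python's `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN` statement. The table contents are synthetic, the index name `idx_orfdomains_pfam` is made up for the example, and the wording of the plan differs between SQLite versions and between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orfdomains (ORFName VARCHAR(15), PfamID VARCHAR(25), Evalue FLOAT)")
# Synthetic rows: 10,000 ORFs spread over 500 made-up domain identifiers.
conn.executemany(
    "INSERT INTO orfdomains VALUES (?, ?, ?)",
    [(f"ORF{i:05d}", f"PF{i % 500:05d}", 1e-5) for i in range(10000)],
)

query = "SELECT ORFName FROM orfdomains WHERE PfamID = 'PF00069'"

def plan(sql):
    """Return SQLite's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)

plan_before = plan(query)  # without an index: a scan over the whole table
conn.execute("CREATE INDEX idx_orfdomains_pfam ON orfdomains (PfamID)")
plan_after = plan(query)   # with the index: a search through the index
matches = conn.execute(query).fetchall()

print(plan_before)
print(plan_after)
print(len(matches))
```

The query text is identical before and after; only the access path chosen by the database changes.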
SQL: Data Definition
• To define a table:
• Give it a name
• Specify the columns, their data types, and references to other tables
• Specify a primary key
• all other columns should depend (only) on primary key
• primary key is unique
SQL example: SELECT
• Find out more about Cdc2
> SELECT * FROM orfs WHERE CommonName = 'Cdc2'
orfs: ORFName VARCHAR(15), CommonName VARCHAR(15), SGDID VARCHAR(15), Description VARCHAR(250)
Result:
ORFName   CommonName   SGDID      Description
YDL102W   Cdc2         S0002260   Catalytic subunit of DNA polymerase delta; required for chromosomal DNA replication during mitosis ...
...
SQL: SELECT with INNER JOIN
• Find all proteins that have a protein kinase domain
orfs: ORFName VARCHAR(15), CommonName VARCHAR(15), SGDID VARCHAR(15), Description VARCHAR(250)
  example row: YDL102W, Cdc2, S0002260, Catalytic subunit of DNA polymerase delta; required for chromosomal DNA replication during mitosis ...
orfdomains: ORFName VARCHAR(15), PfamID VARCHAR(25), Evalue FLOAT
> SELECT * FROM orfs
  INNER JOIN orfdomains ON orfs.ORFName = orfdomains.ORFName
  WHERE PfamID = 'PF00069' AND Evalue < 0.001
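To run the join end to end, the two tables can be created in an in-memory SQLite database from Python. This is a sketch under stated assumptions: the rows are hypothetical placeholders (`ORF_A`, `GeneA` and so on), not real yeast annotations, and the key declarations are added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orfs (ORFName VARCHAR(15) PRIMARY KEY, CommonName VARCHAR(15),
                   SGDID VARCHAR(15), Description VARCHAR(250));
CREATE TABLE orfdomains (ORFName VARCHAR(15) REFERENCES orfs(ORFName),
                         PfamID VARCHAR(25), Evalue FLOAT);
-- Hypothetical rows, for illustration only
INSERT INTO orfs VALUES ('ORF_A', 'GeneA', 'ID_A', 'made-up kinase');
INSERT INTO orfs VALUES ('ORF_B', 'GeneB', 'ID_B', 'made-up non-kinase');
INSERT INTO orfs VALUES ('ORF_C', 'GeneC', 'ID_C', 'made-up weak kinase match');
INSERT INTO orfdomains VALUES ('ORF_A', 'PF00069', 1e-30);
INSERT INTO orfdomains VALUES ('ORF_B', 'PF00400', 1e-12);
INSERT INTO orfdomains VALUES ('ORF_C', 'PF00069', 0.5);
""")

# Same query shape as on the slide, with the constants passed as parameters.
rows = conn.execute("""
    SELECT orfs.ORFName, orfs.CommonName, orfdomains.Evalue
    FROM orfs
    INNER JOIN orfdomains ON orfs.ORFName = orfdomains.ORFName
    WHERE orfdomains.PfamID = ? AND orfdomains.Evalue < ?
""", ("PF00069", 0.001)).fetchall()
print(rows)
```

Only `ORF_A` is returned: `ORF_B` carries a different domain and `ORF_C` fails the E-value cutoff.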
Views
• To get a list of interactions from our LIMS, we have to join 5 tables together
• A VIEW is like a canned select statement that acts like a read-only table
CREATE VIEW ORF_INTX AS
SELECT Constructs.ORF AS BaitORF, GIORF.ORF AS HitORF
FROM Constructs
INNER JOIN Pulldowns ON Pulldowns.CloneID = Constructs.CloneID
INNER JOIN Bands ON Bands.PulldownID = Pulldowns.PulldownID
INNER JOIN Hits ON Hits.PlateNum = Bands.PlateNum AND Hits.WellNo = Bands.WellNo
INNER JOIN GIORF ON Hits.GI = GIORF.GI
Tables joined: Constructs, Pulldowns, Bands, Hits, GIORF
ORF_INTX (view): BaitORF VARCHAR(15), HitORF VARCHAR(15)
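The view can be exercised on a toy LIMS. The sketch below builds the five tables in an in-memory SQLite database from Python, loads one hypothetical bait with two identified hits, and then queries the view as if it were a table. The join columns are assumed to be `PlateNum` and `WellNo`, following the schema slide, and all data values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Constructs (ORF VARCHAR(15), TagPosition CHAR(1), CloneID INT);
CREATE TABLE Pulldowns  (PulldownID INT, CloneID INT);
CREATE TABLE Bands      (PulldownID INT, BandID INT, PlateNum INT, WellNo INT);
CREATE TABLE Hits       (PlateNum INT, WellNo INT, GI INT, Score INT);
CREATE TABLE GIORF      (GI INT, ORF VARCHAR(15));

-- Hypothetical data: one bait, one pulldown, two bands, two identified hits
INSERT INTO Constructs VALUES ('BAIT1', 'N', 1);
INSERT INTO Pulldowns  VALUES (10, 1);
INSERT INTO Bands      VALUES (10, 100, 5, 1), (10, 101, 5, 2);
INSERT INTO Hits       VALUES (5, 1, 111, 80), (5, 2, 222, 65);
INSERT INTO GIORF      VALUES (111, 'PREY1'), (222, 'PREY2');

CREATE VIEW ORF_INTX AS
SELECT Constructs.ORF AS BaitORF, GIORF.ORF AS HitORF
FROM Constructs
INNER JOIN Pulldowns ON Pulldowns.CloneID = Constructs.CloneID
INNER JOIN Bands     ON Bands.PulldownID = Pulldowns.PulldownID
INNER JOIN Hits      ON Hits.PlateNum = Bands.PlateNum AND Hits.WellNo = Bands.WellNo
INNER JOIN GIORF     ON Hits.GI = GIORF.GI;
""")

# The five-table join is now hidden behind a single name.
interactions = conn.execute(
    "SELECT BaitORF, HitORF FROM ORF_INTX ORDER BY HitORF").fetchall()
print(interactions)
```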
slightly more interesting queries...
• Let's find all the kinases that pull down other kinases
> SELECT orf_intx.* FROM orf_intx
  INNER JOIN orfdomains db ON orf_intx.BaitORF = db.ORFName
  INNER JOIN orfdomains dh ON orf_intx.HitORF = dh.ORFName
  WHERE db.PfamID = 'PF00069' AND db.Evalue < 0.001
  AND dh.PfamID = 'PF00069' AND dh.Evalue < 0.001
• Find all the kinases that interact with known transcription factors or with nuclear-localized proteins
> SELECT orf_intx.* FROM orf_intx
  INNER JOIN orfdomains db ON orf_intx.BaitORF = db.ORFName
  WHERE db.PfamID = 'PF00069' AND db.Evalue < 0.001
  AND orf_intx.HitORF IN (SELECT ORFName FROM Localizations WHERE Loc = 'Nucleus'
  UNION SELECT ORFName FROM ORFGO WHERE GOID = 'GO:0000130')
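The kinase-to-kinase query joins the same table twice under two aliases, which is easy to get wrong. A minimal runnable sketch follows, assuming a plain table stands in for the `orf_intx` view and using invented ORF names (`K1`, `K2`, `X1`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orf_intx   (BaitORF VARCHAR(15), HitORF VARCHAR(15));
CREATE TABLE orfdomains (ORFName VARCHAR(15), PfamID VARCHAR(25), Evalue FLOAT);
-- Hypothetical data: K1 and K2 carry a kinase domain, X1 does not
INSERT INTO orfdomains VALUES ('K1', 'PF00069', 1e-20), ('K2', 'PF00069', 1e-15),
                              ('X1', 'PF00400', 1e-10);
INSERT INTO orf_intx VALUES ('K1', 'K2'), ('K1', 'X1'), ('X1', 'K2');
""")

# orfdomains is joined twice: alias db describes the bait, alias dh the hit.
# Each join condition must refer to its own alias, not to the bare table name.
kinase_pairs = conn.execute("""
    SELECT orf_intx.BaitORF, orf_intx.HitORF
    FROM orf_intx
    INNER JOIN orfdomains db ON orf_intx.BaitORF = db.ORFName
    INNER JOIN orfdomains dh ON orf_intx.HitORF  = dh.ORFName
    WHERE db.PfamID = 'PF00069' AND db.Evalue < 0.001
      AND dh.PfamID = 'PF00069' AND dh.Evalue < 0.001
""").fetchall()
print(kinase_pairs)
```

Of the three interactions, only the one in which both bait and hit carry the kinase domain survives the filter.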
LIMS & Database Summary
• Interaction proteomics experiments generate large amounts of data
• Large amounts of data are best stored in a relational database
• SQL is an invaluable basic tool
• easy to learn and use - worth learning
• allows surprisingly sophisticated queries
• More info:
• SQL for Web Nerds http://philip.greenspun.com/sql
• many tutorials on the web
2. Experimental Design & Statistics
   1. Issues with high throughput interaction data
   2. Measuring performance:
      • Confusion matrix: types of errors
      • ROC curves
   3. Example: Design of HMS-PCI Experiments
      • Reproducibility
      • False positive and negative rate estimates
      • Biochemical factors
      • Planning numbers of replicates
   4. Summary
High Throughput Interaction Data
• Prone to false positives and negatives
• Low overlap between data from different methods
• Impractical to validate every interaction by traditional methods
• High costs of false positive data
• Need to be able to prioritize results
Bader & Hogue 2002
Confusion matrix
                        true class
                        positive          negative
hypothesis    yes       True positive     False positive
              no        False negative    True negative
totals:                 P                 N

\[ \text{fp rate} = \frac{FP}{N} \qquad \text{tp rate} = \frac{TP}{P} \]
\[ \text{precision} = \frac{TP}{TP + FP} \qquad \text{accuracy} = \frac{TP + TN}{P + N} \]
\[ \text{sensitivity} = \text{recall} = \frac{TP}{P} \qquad \text{specificity} = 1 - \text{fp rate} \]
reference: Fawcett, 2004 http://www.hpl.hp.com/personal/Tom_Fawcett/papers/ROC101.pdf
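The definitions above translate directly into a few lines of Python. This is a small sketch; the function name `confusion_metrics` and the counts in the example are hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the rates defined on the slide from the four confusion-matrix counts."""
    p = tp + fn  # total actual positives (P)
    n = fp + tn  # total actual negatives (N)
    return {
        "tp_rate": tp / p,               # also called recall or sensitivity
        "fp_rate": fp / n,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (p + n),
        "specificity": tn / n,           # equals 1 - fp_rate
    }

# Hypothetical screen: 100 true interactions and 900 non-interactions were tested.
m = confusion_metrics(tp=60, fp=90, fn=40, tn=810)
print(m)
```

With these counts the accuracy is high while the precision is low, which shows why a single number rarely describes an interaction screen in which true interactions are rare.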
ROC curves
• Receiver Operating Characteristic
• originally used in signal detection theory and applied to medical diagnostic systems
• illustrates tradeoff between true-positive and false positive rates
• area under curve (AUC) provides a simple scalar value that can be used to compare performance
[Figure 7 (Fawcett, 2004): two ROC graphs plotting true positive rate against false positive rate, each with curves A and B. The graph on the left shows the area under two ROC curves. The graph on the right shows the area under the curves of a discrete classifier (A) and a probabilistic classifier (B).]
Every instance that is classified to this leaf node will be assigned the same score. The rectangle of figure 6 will be of size $\frac{nm}{PN}$, and if these instances are not averaged this one leaf may account for errors in ROC curve area as high as $\frac{nm}{2PN}$.
5. Area under an ROC Curve (AUC)
An ROC curve is a two-dimensional depiction of classifier performance. To compare classifiers we may want to reduce ROC performance to a single scalar value representing expected performance. A common method is to calculate the area under the ROC curve, abbreviated AUC (Bradley, 1997; Hanley and McNeil, 1982). Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1.0. However, because random guessing produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5.
The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. This is equivalent to the Wilcoxon test of ranks (Hanley and McNeil, 1982). The AUC is also closely related to the Gini coefficient (Breiman et al., 1984), which is twice the area between the diagonal and the ROC curve. Hand and Till (2001) point out that Gini + 1 = 2 × AUC.
Figure 7a shows the areas under two ROC curves, A and B. Classifier B has greater area and therefore better average performance. Figure 7b ...
Fawcett, 2004
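The ranking interpretation of the AUC quoted above can be checked numerically against the geometric definition. The sketch below computes the AUC in both ways for a handful of hypothetical classifier scores. The function names are illustrative, and neither function is taken from Fawcett's paper.

```python
def auc_by_ranking(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random negative
    (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def roc_points(pos_scores, neg_scores):
    """(fp rate, tp rate) at every score threshold, from strictest to most lenient."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

def auc_by_trapezoid(points):
    """Area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores assigned by a classifier to known positives and negatives
positives = [0.9, 0.8, 0.7, 0.55]
negatives = [0.6, 0.5, 0.4, 0.3]
auc_rank = auc_by_ranking(positives, negatives)
auc_area = auc_by_trapezoid(roc_points(positives, negatives))
print(auc_rank, auc_area)
```

Here 15 of the 16 positive-negative pairs are ranked correctly, so both calculations give 15/16.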
example: HMS-PCI workflow
• reproducibility
• fp, fn rates at given # of replicates
• cross-contamination
• sources of variability
Bait selection
Cloning
Transfection & Expression
Immuno-precipitation
Separation
Digestion
LC MS/MS
Protein Identification
Biological databases
LIMS
Interpretation
Distribution
HMS-PCI: Design of Large-Scale Interaction Proteomics Projects
• How many times should a bait be expressed?
• How many hits can be expected?
• Should both N & C termini be tagged?
• How many trials are required?
• What is the false negative rate?
• What is the false positive rate?
Needed for:
• Project design
• Data integrity and interpretation
• Cost estimation and control
HMS-PCI design: Reproducibility Study
• 49 baits from diverse protein families
• N- and C- tagged
• 4-10 replicates with each construct
• Controls
• Negative controls: FLAG-tag in empty vector
• Positive controls: VHL protein
• had been observed to reproducibly pull down high and low abundance interactors
• Replicates collected over different days and run side-by-side on gels
HMS-PCI design: Success Rate
• Expression success ≡ bait observed by MS
• N and C have roughly equivalent expression rates
• 5-6 attempts required for 4 successful replicates, on average
[Chart: Expression Success. x-axis: fraction of attempts successful (0 to 1); y-axis: % of total baits attempted (0 to 0.50); series: N, C]
HMS-PCI design: “Hit” definition
• Some operational definition of a hit required to filter through data
• Hit should be:
• Specific: < 5% background observation frequency
• Reproducible: 2+ observations
Filter                       # proteins
after control subtraction    3031
< 5% BOF                     1081
2+ observations              190
HMS-PCI design: Reproducibility Rates
• Reproducibility varies greatly between hits
[Chart: Observed Reproducibility Rate. x-axis: reproducibility rate (0 to 1); y-axis: fraction of hits (0 to 0.50); series: N, C, Average. Labelled values: 0.01, 0.02, 0.07, 0.39, 0.01, 0.04, 0.17, 0.31]
Reminder: Binomial Distribution
• Bernoulli process:
• repeated trials, each of which is either a "success" or "failure"
• probability p of success in each trial
• # of successes in n trials has binomial distribution:
\[ P[X = k] = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n - k)!} \]
• probability of k or more successes (shown here for 2 or more):
\[ P[X \ge 2] = \sum_{k=2}^{n} \binom{n}{k} p^k (1 - p)^{n-k} \]
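The two formulas can be evaluated directly with the standard library. This is a minimal sketch; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def binom_pmf(k, n, p):
    """P[X = k]: probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k, n, p):
    """P[X >= k]: probability of k or more successes in n trials."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

# A prey that appears in half of all pulldowns (p = 0.5), four replicate trials:
p_two_or_more = prob_at_least(2, 4, 0.5)
print(p_two_or_more)  # 11/16 = 0.6875
```

For p = 0.5 and n = 4 the result is 1 - 1/16 - 4/16 = 11/16, that is, one minus the chances of seeing the prey zero times or exactly once.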
HMS-PCI design: Minimizing False-Negative Risk
• How many trials are required to observe a hit twice?
• Depends on reproducibility rate for that hit
[Chart: Number of Trials Needed to Observe Prey 2+ Times. x-axis: reproducibility rate (0 to 1); y-axis: fraction of hit pool (0 to 1); one curve per # of trials (2, 3, 4, 5, 6)]

Theoretical probability of 2+ observations in X # of trials
Reproducibility rate      2      3      4      5      6
0                         0.00   0.00   0.00   0.00   0.00
0.1                       0.01   0.03   0.05   0.08   0.11
0.2                       0.04   0.10   0.18   0.26   0.34
0.3                       0.09   0.22   0.35   0.47   0.58
0.4                       0.16   0.35   0.52   0.66   0.77
0.5                       0.25   0.50   0.69   0.81   0.89
0.6                       0.36   0.65   0.82   0.91   0.96
0.7                       0.49   0.78   0.92   0.97   0.99
0.8                       0.64   0.90   0.97   0.99   1.00
0.9                       0.81   0.97   1.00   1.00   1.00
1                         1.00   1.00   1.00   1.00   1.00

Predicted fraction of observed prey pool found in X # of trials
Reproducibility rate   Fraction of prey pool   2      3      4      5      6
0                      0.00                    0.00   0.00   0.00   0.00   0.00
0.1                    0.00                    0.00   0.00   0.00   0.00   0.00
0.2                    0.01                    0.00   0.00   0.00   0.00   0.00
0.3                    0.02                    0.00   0.00   0.01   0.01   0.01
0.4                    0.07                    0.01   0.02   0.03   0.04   0.05
0.5                    0.39                    0.10   0.19   0.27   0.31   0.34
0.6                    0.01                    0.00   0.01   0.01   0.01   0.01
0.7                    0.04                    0.02   0.03   0.04   0.04   0.04
0.8                    0.17                    0.11   0.15   0.16   0.17   0.17
0.9                    0.00                    0.00   0.00   0.00   0.00   0.00
1                      0.31                    0.31   0.31   0.31   0.31   0.31
Total                  1.00                    0.55   0.72   0.83   0.89   0.93
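The totals row of the second table is a weighted sum: each reproducibility rate contributes its share of the prey pool multiplied by the binomial probability of being seen at least twice. The sketch below recomputes that sum from the rounded fractions printed in the table. The function names are illustrative, and the results match the slide only to within about 0.02, presumably because the tabulated fractions are themselves rounded.

```python
def prob_at_least_two(n, p):
    """P[X >= 2] for n trials with per-trial observation probability p."""
    return 1.0 - (1 - p)**n - n * p * (1 - p)**(n - 1)

# Reproducibility rate -> fraction of prey pool, copied from the second table
# (rates whose fraction is 0.00 are omitted).
prey_pool = {0.2: 0.01, 0.3: 0.02, 0.4: 0.07, 0.5: 0.39,
             0.6: 0.01, 0.7: 0.04, 0.8: 0.17, 1.0: 0.31}

def fraction_found(n):
    """Expected fraction of the prey pool observed 2+ times in n trials."""
    return sum(frac * prob_at_least_two(n, rate) for rate, frac in prey_pool.items())

predicted = {n: fraction_found(n) for n in range(2, 7)}
for n, value in predicted.items():
    print(n, round(value, 3))
```

One minus each value gives the corresponding predicted false-negative fraction for that number of trials.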
HMS-PCI design: Controlling False-Positive Risk
• Assume each protein is observed randomly, iid across all experiments
• Estimate frequency from observed background observation frequency (normalized for number of repeats of specific baits)
• Choose prey frequency cutoff based on number of trials and false positive rate considered acceptable
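Under the iid assumption stated on this slide, the chance that a pure background protein passes the "2+ observations" criterion follows from the binomial distribution. The sketch below computes that per-protein probability. It is an illustration of the reasoning, not necessarily the exact calculation behind the figures in the poster, and the function name and parameters are assumptions.

```python
from math import comb

def false_positive_probability(freq, n_trials, k_required=2):
    """Chance that a protein observed at random with background frequency `freq`
    turns up k_required or more times in n_trials independent experiments."""
    return sum(comb(n_trials, j) * freq**j * (1 - freq)**(n_trials - j)
               for j in range(k_required, n_trials + 1))

# A protein sitting exactly at a 5% background frequency cutoff:
fp_4_trials = false_positive_probability(0.05, 4)
fp_6_trials = false_positive_probability(0.05, 6)
print(round(fp_4_trials, 4), round(fp_6_trials, 4))
```

Because this probability grows with the number of trials, a cutoff that is acceptable for four trials may need to be tightened, or more observations demanded, when more replicates are run.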
Poster: Optimizing Experimental Design in High-Throughput Interaction Proteomics
Adrian Heilbut*, Paul Taylor, Lynda Moore, Mike F. Moran, Daniel Figeys, Thodoros Topaloglou, Gregg B. Morin*
MDS Proteomics Inc., Toronto, Canada
* current address: Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto, ON

1. Motivation
High-throughput mass spectrometry protein complex identification (HMS-PCI) has emerged as a key technology for functional genomics.
Based on lack of concordance among datasets and the literature, HMS-PCI data is thought to be non-saturating and prone to both false-positive and false-negative interactions. Verifying every putative experimental interaction by traditional methods is not practical.
Intuitively, experiments need to be done more than once, but how many repetitions are necessary and sufficient?
Protein biochemistry varies enormously. When HMS-PCI is applied to large numbers of proteins, biochemistry protocols cannot be optimized for each individual bait.

2. Objectives
A: Determine the prey reproducibility rate distribution, and hence the number of trials that must be performed in order to observe all reproducible interactions.
B: Develop a criterion for accepting an observed prey as 'real'.
C: Compare the effectiveness of N-terminal vs. C-terminal FLAG tags.
D: Estimate the false-negative rate for a given number of trials.
E: Estimate the false-positive rate.

3. Experimental System
49 different bait cDNAs, in both N- and C-terminally FLAG-tagged constructs.
Calcium phosphate transfection into HEK293 cells.
Empty vectors lacking a cDNA insert were used as negative controls with each batch of samples. Proteins identified in negative control lanes were automatically subtracted.
Each construct was expressed and immunoprecipitated 4-6 times.
[Workflow diagram: Bait Selection & Cloning; Ectopic expression; Lysis and Immunoprecipitation; Gel Separation; Band Excision; LC-MS/MS; Informatics]

4. Prey Frequency Distribution
Prey Frequency is defined as the percentage of baits with which any given prey protein is observed.
We have accumulated a database containing the results of hundreds of immunoprecipitation experiments. Over the entire dataset, the prey frequency distribution is extremely skewed, due to variations in background binding and protein abundance.
The global observation frequency of each protein can be used to help assess the significance of a new observation of that protein.

5. Defining True Interactions
Real interactions must be reproducible.
Worst-case assumption: Preys are observed randomly, and distributed uniformly throughout the dataset.
If individual experiments are assumed independent and identically distributed, then experiments can be modelled as Bernoulli trials, and an expectation for each protein can be calculated based on its observed frequency.
Low frequency preys are those more likely to be true, bait-specific interactions. Frequently observed preys can easily be reproduced, but are meaningless.

6. Bait Biochemistry, Success, and Productivity
Bait 'success' is defined by detection of the expressed bait protein by mass spectrometry.
Bait proteins were of varying sizes and functional classes. The size distribution of the proteins is shown in Fig. 6A.
A subset of baits were chosen from a common pathway, which could potentially invalidate our assumptions of independence between experiments.
Overall, the N-terminal tag expression success rate is slightly better than C-terminal expression, and the difference is probably statistically significant (chi squared = 3.65, df=1, p=0.056). (6B)
Each bait was attempted in 4-6 replicate experiments. Fig. 6C shows the expression success rate over those trials, and compares the distributions of success rates for N-tagged vs. C-tagged constructs. Once a construct expresses, the expression rate is comparable for both tag positions.
[Figure 6A: Bait size histogram. x-axis: protein size (# aa), 200 to 2000; y-axis: number of baits (0 to 30)]
tag        successful   attempted   success rate
N          40           49          0.82
C          33           49          0.67
combined   42           49          0.86
[Figure 6C: Expression Success. x-axis: fraction of attempts successful (0 to 1); y-axis: % of total baits attempted; series: N, C]

7. Prey Reproducibility Rate Distribution
If the presence of a prey is a result of a biologically meaningful interaction, then prey observation should be reproducible between replicate experiments with the same bait.
Reproducibility varies between preys for many possible reasons, such as protein abundance, different interaction affinities, transient interactions, and peptide chemistry affecting mass spectrometry identification.
[Figure: Hit Reproducibility (frequency). x-axis: # times observed (1 to 12); y-axis: % of total hits (n=190)]

8. Prey Reproducibility: N-tags vs. C-tags
Reproducibility varies greatly between preys. As shown in figure 8A, many reproducible proteins are only observed in 50% of experiments.
Possible reasons for variability include protein abundance, different interaction affinities, transient interactions, peptide chemistry.
N- and C-tagged constructs have no significant difference in prey reproducibility, once they are successfully expressed.
[Figure 8A: Observed Reproducibility Rate. x-axis: reproducibility rate (0 to 1); y-axis: fraction of hits (0 to 0.50); series: N, C, Average. Labelled values: 0.01, 0.02, 0.07, 0.39, 0.01, 0.04, 0.17, 0.31]

9. Planning Trial Size: Minimizing false-negative risk
Experiments can be improved either by improving reproducibility (by improving biochemistry or analytical sensitivity) or by increasing the number of trials.
We will experience diminishing returns if the number of trials is set too high, unless the success criterion is also tightened. For instance, we may demand 3 or more observations, or include semi-quantitative information such as the number or intensity of peptides.
How many trials are required to observe a prey twice?
The number of trials required will depend on the reproducibility rate for that prey. We estimate the actual reproducibility rates using the observed reproducibility rates.
Figure 9A illustrates the number of trials required to observe a prey at least twice.
1 or 2 trials will provide a highly incomplete dataset, from which it will be difficult to distinguish real preys from background.
Table 9C shows the effect of the non-uniform distribution of reproducibility rates on the expected number of false negatives, as the number of repetitions is varied.
[Charts: "Number of Trials needed to Observe Prey 2+ Times" and "Predicted False Negative Rate". x-axis: reproducibility rate (0 to 1); y-axis: fraction of hit pool (0 to 1); one curve per # of trials (2, 3, 4, 5, 6)]

Theoretical probability of NOT observing 2+ in X # of trials
Reproducibility rate      2      3      4      5      6
0                         1.00   1.00   1.00   1.00   1.00
0.1                       0.99   0.97   0.95   0.92   0.89
0.2                       0.96   0.90   0.82   0.74   0.66
0.3                       0.91   0.78   0.65   0.53   0.42
0.4                       0.84   0.65   0.48   0.34   0.23
0.5                       0.75   0.50   0.31   0.19   0.11
0.6                       0.64   0.35   0.18   0.09   0.04
0.7                       0.51   0.22   0.08   0.03   0.01
0.8                       0.36   0.10   0.03   0.01   0.00
0.9                       0.19   0.03   0.00   0.00   0.00
1                         0.00   0.00   0.00   0.00   0.00

Predicted fraction of prey pool NOT found in X # of trials
Reproducibility rate   Fraction of prey pool   2      3      4      5      6
0                      0.00                    0.00   0.00   0.00   0.00   0.00
0.1                    0.00                    0.00   0.00   0.00   0.00   0.00
0.2                    0.01                    0.00   0.00   0.00   0.00   0.00
0.3                    0.02                    0.01   0.01   0.01   0.01   0.01
0.4                    0.07                    0.05   0.04   0.03   0.02   0.01
0.5                    0.39                    0.29   0.19   0.12   0.07   0.04
0.6                    0.01                    0.01   0.00   0.00   0.00   0.00
0.7                    0.04                    0.02   0.01   0.00   0.00   0.00
0.8                    0.17                    0.06   0.02   0.01   0.00   0.00
0.9                    0.00                    0.00   0.00   0.00   0.00   0.00
1                      0.31                    0.00   0.00   0.00   0.00   0.00
Total                  1.00                    0.45   0.28   0.17   0.11   0.07

10. Estimating false-positives
Using the global prey frequencies, we can also estimate the rate of false-positives due to background.
Assume background proteins have a uniform random distribution, and that background does not vary over time or experimental conditions.
Choose a prey frequency cutoff based on the number of trials performed and the false positive rate considered acceptable.
Symbols:
p: prey observation frequency
n: number of trials
k: number of observations required (2)
E: expected number of false positives
cutoff: frequency cutoff
Numhits(p): number of hits at each prey observation frequency
[Chart: Predicted False Positive Rate vs. Database Frequency. Both axes 0 to 1; one curve per # of trials (2, 3, 4, 5, 6). Annotations: "5%", "< 0.05", "safe" region]

11. N-tag vs. C-tag Hit Overlap
Is it necessary to tag both the amino- and carboxy-termini to detect all of a protein's interactions?
Table 11A indicates that proteins must be tagged at both positions in order to obtain all of the interactions that can be found with the HMS-PCI protocol.
Table 11B shows that the overlap between N- and C-tagged constructs is extremely small.
The N-terminal tags are more productive overall. It may be possible to improve the design of the C-tags.

                         Fraction of total hits observed
N-tag only experiment    0.77
C-tag only experiment    0.27

                              # hits   % of total hits
seen with N only              110      0.68
seen with C only              29       0.18
seen in both N&C              15       0.09
seen when N+C are combined    8        0.05
total                         162

12. Summary of Results
To maximize recovery of complexes in an HMS-PCI experiment, both N- and C-terminally tagged baits should be attempted.
An experiment using only one tag position will miss 25% - 75% of all observable interactions.
5-6 repetitions for each bait should be attempted to obtain the 4 successful trials that are required for acceptable false-positive and false-negative rates.
With 4 trials, the false-negative rate is approximately 15%.
With 4 trials and a 5% frequency cut-off, the false-positive rate is less than 5%.
Approximately 5 hits can be expected per bait, on average.

Conclusions
High throughput capability can be leveraged to significantly improve the quality of protein interaction data, in addition to expanding coverage.
Even a very simple statistical model can be helpful to rationally guide experimental designs.
A statistical approach allows for prioritization of preys based on experimental quality, and permits potentially questionable data to be flagged.
Experimental confidence estimates and higher quality underlying data will facilitate the integration of data from orthogonal experimental methods, and will make interaction data more useful for generating hypotheses and constructing models.
HMS-PCI design: N- vs. C- tags
• need to tag both positions to get all interactions that can be found by this method on a large set of proteins
• overlap between N and C tagged constructs is small
                    # hits   % total hits
seen with N only    151      79%
seen with C only    47       25%
seen in N and C     18       9%
seen N ∪ C          10       5%
total               190
HMS-PCI example: Conclusions
• For the protocol employed, 5-6 attempts required for acceptable false-positive and false-negative rates
• Both N and C terminally tagged baits should be attempted for maximal coverage
• Lots of opportunities to improve process:
• constructs, tags, replicates, controls, sensitivity...
• With more data, protocols can be optimized (given available resources and requirements)
• Lots of work still to do...
Experimental Design Summary
• Careful experimental design and error estimation are essential for high-throughput biology
• Need to be able to filter and prioritize results
• Pilot studies will be required to establish optimal designs for novel protocols
• LIMS facilitates analysis of process issues
• Don’t forget about basic statistics and basic design
• Reproducibility should be just as important for large-scale studies as in regular biology experiments
3. MS/MS Protein ID
• Protein mass spectrometry review
• Peptide fragmentation & nomenclature
• ESI: Multiply-charged ions and deconvolution
• Spectra
• MS/MS database search
• Example: X! Tandem
• Search statistics and scoring
• De novo sequencing
MS/MS Protein Identification
Proteins → (trypsin) digestion → Peptides → MALDI or electrospray ionization → ionized peptides → CID or PSD → fragment ion spectrum
Example peptides:
DLTDYLMK
VAPEEHPVLLTEAPLNPK
PEEHPVLLTEAPLNPK
QEYDESGPSIVHR
Tandem Mass Spectrometry
• many different possible configurations
• common types used in proteomics:
• Quadrupole Time-of-flight (e.g. QStar)
• Quadrupole Ion trap (e.g. LCQ)
• Linear Ion Trap (LTQ, QTRAP)
• accuracy & resolution of instrument impact search
• m/z - mass to charge ratio
• mass accuracy (ppm) = (measured m/z - actual m/z) / actual m/z × 10^6
• mass resolution = m / peak width
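As a worked example of the two definitions, the sketch below converts a measured m/z into a mass error in parts per million and computes a resolution from a peak width. The function names and the numbers are illustrative and do not describe any particular instrument.

```python
def mass_error_ppm(measured_mz, actual_mz):
    """Mass accuracy in parts per million: relative error scaled by 10^6."""
    return (measured_mz - actual_mz) / actual_mz * 1e6

def mass_resolution(mz, peak_width):
    """Resolution as m divided by the peak width (same units as m)."""
    return mz / peak_width

# Hypothetical peak: true m/z 1000.000, measured at 1000.010, 0.1 wide
error_ppm = mass_error_ppm(1000.010, 1000.000)
resolution = mass_resolution(1000.0, 0.1)
print(round(error_ppm, 3), round(resolution))
```

A deviation of 0.01 at m/z 1000 is about 10 ppm, which shows why the precursor mass tolerance given to a database search has to match the instrument.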
QSTAR XL
The API QSTAR® XL Hybrid LC/MS/MS
System is the premier quadrupole time-of-
flight LC/MS/MS system, setting a new
dimension of flexibility and performance.
The enhanced ion optics and new
detector provide answers to the most
challenging analytical questions at the
highest sensitivity. This high sensitivity,
coupled with excellent mass accuracy,
yield unequivocal molecular weights and
high-quality structural information for both
protein and small molecule analysis.
Novel scan functions offer a high degree of
selectivity for low-level protein analysis,
together with the most sensitive precursor
ion scanning capability for accurate
analysis of post-translational modifications
(PTMs) and target compound analysis for
drug metabolites. The QSTAR XL system
is the most flexible MS/MS platform,
offering fast, easy switching between
the broadest range of ionisation, including
NanoSpray™ source, oMALDI™ source,
APCI, PhotoSpray™ source, and
TurboIonSpray® source.
Key Features of the QSTAR XL System:
• Most flexible MS/MS platform for both electrospray and MALDI analysis
• New oMALDI 2 ion source with enhanced collisional cooling for better sensitivity
• New NanoSpray source for increased productivity with capillary and nanoflow HPLC
• New ion optics and detector to improve ruggedness in 24/7 working environment
• Increased quadrupole mass selection of up to 6,000 amu and time-of-flight mass range of up to 40,000 amu
• Increased efficiency of high mass transmission
• Improved low mass fragment ion transmission
• Unique trapping pulsing capability for maximum duty cycle and ultimate sensitivity
• Unique scan functions for enhanced selectivity and sensitivity for low level compound analysis
New Increased Mass Selection Capability
The high-mass ion transmission properties
have been significantly improved in the new
generation QSTAR XL system. Ions of up to
40,000 amu can now be analysed by the
time-of-flight detector, and their efficiency
of transmission has been increased.
The CID capabilities have been enhanced
so that ions of up to 6,000 amu can be
isolated and fragmented for sequence
analysis, with improved transmission of low
mass fragments (see figure 2).
Novel Multiple Charge Separation
A unique, proprietary charge separation
method is applied to the QSTAR XL system
to improve detection limits of peptides and
proteins when analysed from complex
mixtures. Multiple charge separation (MCS)
eliminates singly charged ions in the
spectra, thereby enhancing the signal-to-
noise ratio of multiply charged ions at very
low levels. This high degree of separation
offers significant gains in signal-to-noise for
species that have a charge state higher
than 1. The benefits of charge separation
are particularly apparent at low femtomole
concentration levels, where in regular
TOF MS spectra peptide ions are often lost
in a sea of chemical noise. The suppression
of chemical noise reduces the need for
chromatography and makes peptide mass
fingerprinting using an electrospray source
equivalent to peptide mass fingerprinting
by MALDI.
Source: "The API QSTAR® XL Hybrid LC/MS/MS System: A new dimension in flexibility and performance", New Product Review, europe.appliedbiosystems.com
Figure 1. Schematic of the new QSTAR XL system featuring the oMALDI 2 source. Other enhancements include a DC quad and new detector for improved ruggedness.
[Figure 1 labels: Laser; New Detector; Carrier Plate; High Efficiency LINAC Collision Cell; Ultra Stable Quadrupole Mass Filter]
• DC Quad: the quadrupole lens provides a marked improvement in the ability to optimise the ion beam profile
• Q0: patented collisional focusing maximises ion transmission for superior sensitivity
• Q2: patented LINAC™ high pressure collision cell provides increased sensitivity and unique trapping pulsing capability for maximum duty cycle
Figure 2. Fragmentation of the synthetic peptide Bovine Corticotropin Releasing Factor (CRF) at 4695.5 Da, showing the high mass fragment ion sequence information obtainable from large parent peptides.
44
LCQ and LTQ
[Figure: Schematic Diagram of the Finnigan LCQ Ion Trap MS. Legible labels: ESI needle; heated capillary; tube lens; RF-only multipoles; ion trap end caps; ion trap ring electrode; electron multiplier; dynode; sheath liquid; vacuum pumps. Stages shown: ion formation and desolvation; ion transport into vacuum; analysis]
strengths: compact, MSn, excellent full scan MS/MS sensitivity, excellent data dependency, the ultimate peptide mapping system
limitations: limited storage of ions, 1/3 of low mass range lost in MS/MS modes
• “tandem-in-time” mass spectrometer: different stages of mass analysis performed in the same physical space at discrete points in time
Thermo LTQ www.thermo.com
Thermo LCQ www.thermo.com
fig from www.enovatia.com
45
ESI: Multiply-charged peptides
• Electrospray ionization results in multiply charged peptides - spectra require deconvolution
Fenn J, 1989
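Deconvolution here means recovering the neutral mass from peaks observed at several charge states. The sketch below is a simplified illustration, assuming positive-mode protonation, two adjacent charge states of the same molecule, and noise-free peak positions; the function names are hypothetical.

```python
PROTON = 1.007276  # mass of a proton in Da


def mz_of(mass, z):
    """m/z of a molecule of neutral mass `mass` carrying z protons."""
    return (mass + z * PROTON) / z


def infer_charge(mz_a, mz_b):
    """Charge of the peak at mz_a, assuming the peak at mz_b (smaller m/z)
    is the same molecule carrying exactly one more proton."""
    return round((mz_b - PROTON) / (mz_a - mz_b))


def neutral_mass(mz, z):
    """Neutral mass recovered from one peak once its charge is known."""
    return z * (mz - PROTON)


# Simulate two adjacent charge states of a hypothetical 10,000 Da protein.
peak_a, peak_b = mz_of(10000.0, 10), mz_of(10000.0, 11)
z = infer_charge(peak_a, peak_b)
print(z, round(neutral_mass(peak_a, z), 3))
```

Real deconvolution software fits the whole charge-state envelope rather than a single pair of peaks, which makes it more robust to noise.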
46
Mass Spec Protein Identification
• MS: Peptide mass mapping
• MS/MS: Sequence Tags
• MS/MS: De novo sequencing
• MS/MS: Database search / correlation
47
Peptide Fragmentation
http://www.matrixscience.com/help_index.html
a,b,c series: charge stays on N terminal fragment
x,y,z series: charge stays on C terminal fragment
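As an illustration of one N-terminal and one C-terminal series, the sketch below computes singly charged b- and y-ion m/z values for a peptide. It assumes monoisotopic masses, unmodified residues, and the usual convention that a b ion is the sum of its residues plus a proton while a y ion additionally carries one water; `fragment_ions` is an illustrative name, not part of any search engine.

```python
# Monoisotopic residue masses in Da (standard values, unmodified residues).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        # b_i: first i residues, charge retained on the N-terminal fragment
        b_ions.append(sum(masses[:i]) + PROTON)
        # y_i: last i residues plus water, charge on the C-terminal fragment
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b, y = fragment_ions("GASK")  # hypothetical example peptide
for i, (bi, yi) in enumerate(zip(b, y), start=1):
    print(f"b{i} = {bi:.4f}   y{i} = {yi:.4f}")
```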
48
MS/MS Database Searching
• Given an experimental MS/MS spectrum, which sequences in the protein database are most likely to have generated that spectrum?
• Algorithm
• For each protein in the database
• in silico digestion to generate peptides with appropriate precursor mass for queries
• theoretically fragment peptides
• include possibilities of mutations/post-translational modification?
• compare query spectra to predicted spectra of each peptide, and assign a score and statistical significance
• collect peptide matches for each protein
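The first steps of the algorithm above can be sketched in a few functions. This is a toy illustration, not how Mascot, Sequest or X! Tandem work internally: it assumes fully tryptic cleavage (after K or R, not before P) with no missed cleavages or modifications, filters candidates by precursor mass, and scores with a plain shared-peak count. The protein names and sequences are made up.

```python
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565


def tryptic_digest(protein):
    """In silico digestion: cleave after K or R unless followed by P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def candidates(precursor_mass, database, tol_ppm=50.0):
    """(protein, peptide) pairs whose mass matches the precursor."""
    hits = []
    for name, seq in database.items():
        for pep in tryptic_digest(seq):
            m = peptide_mass(pep)
            if abs(m - precursor_mass) / m * 1e6 <= tol_ppm:
                hits.append((name, pep))
    return hits


def shared_peak_count(observed, predicted, tol=0.5):
    """Number of observed peaks within tol of some predicted peak."""
    return sum(1 for o in observed
               if any(abs(o - p) <= tol for p in predicted))


database = {"protA": "AAKGGRPLLKCC", "protB": "MMKGASK"}  # hypothetical
print(tryptic_digest(database["protA"]))
print(candidates(peptide_mass("GASK"), database))
```

Production search engines replace the shared-peak count with a probabilistic or correlation score and report a statistical significance for each match.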
49
Database Search Engines
• Mascot
• Sequest
• X! Tandem
• Protein or translated genomic/EST sequences can be searched
50
Sequence Databases
• NCBI NR - All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
• IPI
• UniProt - successor to SwissProt
• Issues
• Databases need to be updated regularly
• Databases contain significant redundancy
• Some identifiers (GIs) are unstable and can disappear from one release to the next; accessions should be stable
• Identifications are often ambiguous - clustering required
51
X! Tandem
• Open Source software for matching MS/MS spectra to sequences
• Uses a new 2-step algorithm to speed up searches
• Fast and free
http://www.thegpm.org
Craig R & Beavis RC. A method for reducing the time required to match protein sequences to tandem mass spectra. Rapid Commun Mass Spectrom 2003.
52
4. Modelling biological networks as graphs
1. Basic elements of graph theory & definitions
2. Statistical measures of graph topology
3. Small world networks
4. Scale-free networks
5. Biological implications of topology
   1. Hubs & lethality
   2. Date hubs vs. party hubs
53
Graph Theory
• Graph
• set of Nodes (Vertices) V = {v1,v2,v3,v4}
• set of Edges E = {(v1,v2),(v1,v3),(v1,v4)}
• an edge is a symmetric relation on V
• Directed graph (digraph)
• directed edges, or arcs - asymmetric
• Graphs can be used to model many, many different kinds of real-world objects and relationships
• Standard algorithms can be applied
[Figure: two drawings of the example graph, with nodes v1, v2, v3, v4 and edges (v1,v2), (v1,v3), (v1,v4)]
54
Degree
• Degree of a node ≡ # of edge ends connected to it
[Figure: example nodes labelled with their degrees: 0, 1, 1, 2]
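Counting edge ends translates directly into code. The sketch below is a minimal illustration using the example graph from the Graph Theory slide plus one isolated node, v5, added here purely for illustration; under the "edge ends" definition a self-loop adds 2 to its node's degree.

```python
from collections import Counter


def degrees(nodes, edges):
    """Degree of every node = number of edge ends attached to it."""
    deg = Counter({v: 0 for v in nodes})
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1  # for a self-loop (u == v) both ends land on one node
    return dict(deg)


nodes = ["v1", "v2", "v3", "v4", "v5"]  # v5 is an isolated node
edges = [("v1", "v2"), ("v1", "v3"), ("v1", "v4")]
print(degrees(nodes, edges))
```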
55
Clustering Coefficient
• In a cluster, neighbours of a node tend to be connected to each other
• Clustering can be measured by counting number of “triangles” around a node, vs. how many there might be
[Figure: two example graphs on nodes A to E]
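The triangle-counting idea can be written as a short function. This is a sketch under the assumption of an undirected simple graph stored as a dictionary of neighbour sets; the example graph is made up and is not the one drawn on the slide.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves linked."""
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no pair of neighbours, so no triangle is possible
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical graph: triangle A-B-C with a pendant node D attached to A.
adj = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
for node in sorted(adj):
    print(node, round(clustering_coefficient(adj, node), 3))
```

Node A has three neighbours, so three triangles are possible but only one (A-B-C) exists; B and C sit in a complete triangle; D has a single neighbour.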
56
Graph Data Structures: Adjacency List vs. Matrix
2 major ways to represent a graph in the computer
• Adjacency list
• Store a list of adjacent edges of each node
• Efficient use of memory for sparse graphs
• Adjacency matrix
• Matrix of all possible edges
• Mark those edges that are actually present
     v1  v2  v3  v4
v1    0   1   1   1
v2    1   0   0   0
v3    1   0   0   0
v4    1   0   0   0
[Figure: example graph with nodes v1 to v4 and edges (v1,v2), (v1,v3), (v1,v4)]
E = {(v1,v2),(v1,v3),(v1,v4)}
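Both representations can be built from the same edge set E. The sketch below assumes an undirected graph, so every edge is recorded in both directions; the function names are illustrative.

```python
def adjacency_list(nodes, edges):
    """Map each node to the list of its neighbours."""
    adj = {v: [] for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj


def adjacency_matrix(nodes, edges):
    """Square 0/1 matrix; rows and columns follow the order of `nodes`."""
    index = {v: i for i, v in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for u, v in edges:
        matrix[index[u]][index[v]] = 1
        matrix[index[v]][index[u]] = 1  # symmetric for undirected graphs
    return matrix


nodes = ["v1", "v2", "v3", "v4"]
edges = [("v1", "v2"), ("v1", "v3"), ("v1", "v4")]
print(adjacency_list(nodes, edges))
for row in adjacency_matrix(nodes, edges):
    print(row)
```

The list stores one entry per edge end, while the matrix always stores n × n cells, which is why the list is preferred for sparse graphs such as protein interaction networks.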
57
Statistical Mechanics of Networks
• Many natural graphs are characterized by distinctive statistical properties.
• Characteristic distance
• Average Clustering coefficient
• Degree distribution
58
Small-World Networks
• Graphs traditionally modeled as regular lattices or random graphs
• L(p) = characteristic path length
• average of shortest path lengths between each pair of vertices
• C(p) = average clustering coefficient
• Cv = clustering coefficient = number of triangles around node v / total possible number of triangles
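The characteristic path length can be computed with one breadth-first search per node. The sketch below assumes an unweighted, undirected, connected graph stored as a dictionary of neighbour lists; pairs that cannot reach each other are simply skipped, which is an assumption of this sketch rather than part of the definition.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def characteristic_path_length(adj):
    """Average shortest path length over all reachable pairs of nodes."""
    total, pairs = 0, 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    return total / pairs


# Star graph from the earlier slides: v1 in the centre.
star = {"v1": ["v2", "v3", "v4"], "v2": ["v1"], "v3": ["v1"], "v4": ["v1"]}
print(characteristic_path_length(star))
```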
[Excerpt from Nature, vol. 393, 4 June 1998, p. 441 (letters to nature). © Macmillan Publishers Ltd 1998]
removed from a clustered neighbourhood to make a short cut has, at most, a linear effect on C; hence C(p) remains practically unchanged for small p even though L(p) drops rapidly. The important implication here is that at the local level (as reflected by C(p)), the transition to a small world is almost undetectable. To check the robustness of these results, we have tested many different types of initial regular graphs, as well as different algorithms for random rewiring, and all give qualitatively similar results. The only requirement is that the rewired edges must typically connect vertices that would otherwise be much farther apart than $L_{\mathrm{random}}$.
The idealized construction above reveals the key role of short cuts. It suggests that the small-world phenomenon might be common in sparse networks with many vertices, as even a tiny fraction of short cuts would suffice. To test this idea, we have computed L and C for the collaboration graph of actors in feature films (generated from data available at http://us.imdb.com), the electrical power grid of the western United States, and the neural network of the nematode worm C. elegans [17]. All three graphs are of scientific interest. The graph of film actors is a surrogate for a social network [18], with the advantage of being much more easily specified. It is also akin to the graph of mathematical collaborations centred, traditionally, on P. Erdős (partial data available at http://www.acs.oakland.edu/~grossman/erdoshp.html). The graph of the power grid is relevant to the efficiency and robustness of power networks [19]. And C. elegans is the sole example of a completely mapped neural network.
Table 1 shows that all three graphs are small-world networks. These examples were not hand-picked; they were chosen because of their inherent interest and because complete wiring diagrams were available. Thus the small-world phenomenon is not merely a curiosity of social networks [13,14] nor an artefact of an idealized model - it is probably generic for many large, sparse networks found in nature.
We now investigate the functional significance of small-world connectivity for dynamical systems. Our test case is a deliberately simplified model for the spread of an infectious disease. The population structure is modelled by the family of graphs described in Fig. 1. At time t = 0, a single infective individual is introduced into an otherwise healthy population. Infective individuals are removed permanently (by immunity or death) after a period of sickness that lasts one unit of dimensionless time. During this time, each infective individual can infect each of its healthy neighbours with probability r. On subsequent time steps, the disease spreads along the edges of the graph until it either infects the entire population, or it dies out, having infected some fraction of the population in the process.
[Figure 1 panels, in order of increasing randomness: Regular (p = 0), Small-world, Random (p = 1)]
Figure 1 Random rewiring procedure for interpolating between a regular ring lattice and a random network, without altering the number of vertices or edges in the graph. We start with a ring of n vertices, each connected to its k nearest neighbours by undirected edges. (For clarity, n = 20 and k = 4 in the schematic examples shown here, but much larger n and k are used in the rest of this Letter.) We choose a vertex and the edge that connects it to its nearest neighbour in a clockwise sense. With probability p, we reconnect this edge to a vertex chosen uniformly at random over the entire ring, with duplicate edges forbidden; otherwise we leave the edge in place. We repeat this process by moving clockwise around the ring, considering each vertex in turn until one lap is completed. Next, we consider the edges that connect vertices to their second-nearest neighbours clockwise. As before, we randomly rewire each of these edges with probability p, and continue this process, circulating around the ring and proceeding outward to more distant neighbours after each lap, until each edge in the original lattice has been considered once. (As there are nk/2 edges in the entire graph, the rewiring process stops after k/2 laps.) Three realizations of this process are shown, for different values of p. For p = 0, the original ring is unchanged; as p increases, the graph becomes increasingly disordered until for p = 1, all edges are rewired randomly. One of our main results is that for intermediate values of p, the graph is a small-world network: highly clustered like a regular graph, yet with small characteristic path length, like a random graph. (See Fig. 2.)
Table 1 Empirical examples of small-world networks
              L_actual  L_random  C_actual  C_random
Film actors   3.65      2.99      0.79      0.00027
Power grid    18.7      12.4      0.080     0.005
C. elegans    2.65      2.25      0.28      0.05
Characteristic path length L and clustering coefficient C for three real networks, compared to random graphs with the same number of vertices (n) and average number of edges per vertex (k). (Actors: n = 225,226, k = 61. Power grid: n = 4,941, k = 2.67. C. elegans: n = 282, k = 14.) The graphs are defined as follows. Two actors are joined by an edge if they have acted in a film together. We restrict attention to the giant connected component [16] of this graph, which includes ~90% of all actors listed in the Internet Movie Database (available at http://us.imdb.com), as of April 1997. For the power grid, vertices represent generators, transformers and substations, and edges represent high-voltage transmission lines between them. For C. elegans, an edge joins two neurons if they are connected by either a synapse or a gap junction. We treat all edges as undirected and unweighted, and all vertices as identical, recognizing that these are crude approximations. All three networks show the small-world phenomenon: $L \gtrsim L_{\mathrm{random}}$ but $C \gg C_{\mathrm{random}}$.
[Figure 2 plot: L(p) / L(0) and C(p) / C(0) on the vertical axis (0 to 1) against p on a logarithmic horizontal axis (0.0001 to 1)]
Figure 2 Characteristic path length L(p) and clustering coefficient C(p) for the family of randomly rewired graphs described in Fig. 1. Here L is defined as the number of edges in the shortest path between two vertices, averaged over all pairs of vertices. The clustering coefficient C(p) is defined as follows. Suppose that a vertex v has $k_v$ neighbours; then at most $k_v(k_v - 1)/2$ edges can exist between them (this occurs when every neighbour of v is connected to every other neighbour of v). Let $C_v$ denote the fraction of these allowable edges that actually exist. Define C as the average of $C_v$ over all v. For friendship networks, these statistics have intuitive meanings: L is the average number of friendships in the shortest chain connecting two people; $C_v$ reflects the extent to which friends of v are also friends of each other; and thus C measures the cliquishness of a typical friendship circle. The data shown in the figure are averages over 20 random realizations of the rewiring process described in Fig. 1, and have been normalized by the values L(0), C(0) for a regular lattice. All the graphs have n = 1,000 vertices and an average degree of k = 10 edges per vertex. We note that a logarithmic horizontal scale has been used to resolve the rapid drop in L(p), corresponding to the onset of the small-world phenomenon. During this drop, C(p) remains almost constant at its value for the regular lattice, indicating that the transition to a small world is almost undetectable at the local level.
Watts & Strogatz, Nature 393 440-442 (1998)
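The rewiring procedure described in the Figure 1 caption can be sketched as follows. This is a simplified reading of that caption, not the authors' code: it assumes k is even, n is comfortably larger than k, and that a rewired edge keeps its starting vertex and may not create a self-loop or a duplicate edge. The function names are illustrative.

```python
import random


def ring_lattice(n, k):
    """Ring of n vertices, each joined to its k nearest neighbours."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(1, k // 2 + 1):
            adj[i].add((i + j) % n)
            adj[(i + j) % n].add(i)
    return adj


def rewire(n, k, p, rng):
    """Rewire each lattice edge with probability p, one lap per distance."""
    adj = ring_lattice(n, k)
    for j in range(1, k // 2 + 1):        # lap j: j-th nearest neighbours
        for i in range(n):                # move clockwise around the ring
            old = (i + j) % n
            if old in adj[i] and rng.random() < p:
                # candidate targets: no self-loops, no duplicate edges
                choices = [v for v in range(n)
                           if v != i and v not in adj[i]]
                if choices:
                    new = rng.choice(choices)
                    adj[i].discard(old)
                    adj[old].discard(i)
                    adj[i].add(new)
                    adj[new].add(i)
    return adj


rng = random.Random(42)  # fixed seed so the run is repeatable
graph = rewire(20, 4, 0.1, rng)
print(sum(len(nbrs) for nbrs in graph.values()) // 2, "edges")
```

Because each rewiring removes one edge and adds one, the number of vertices and edges stays fixed for every p, as the caption requires.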
59
Scale-Free Networks
• Degree distribution is described by a power law:
$P(k) \sim k^{-\gamma}$
- only a few nodes with lots of links
- but more nodes with an intermediate number of links than you’d expect in an ER random network
‘scale-free’: no representative average type of node, and no characteristic scale in the local connectivity distribution
Barabási A-L & Albert R. “Emergence of Scaling in Random Networks” Science 286 (509-512) 1999
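To see how differently the two kinds of degree distribution treat highly connected nodes, the sketch below compares the probability of a degree-50 node under a Poisson distribution (as in an ER random graph) and under a power law. The mean degree of 5, the exponent of 2.5 and the cut-off `k_max` are illustrative assumptions, not values fitted to any data set.

```python
import math


def poisson_pmf(k, lam):
    """P(k) for a Poisson degree distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def power_law_pmf(k, gamma, k_max=10000):
    """P(k) proportional to k^(-gamma), normalised over 1..k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm


for k in (5, 20, 50):
    print(k, poisson_pmf(k, 5.0), power_law_pmf(k, 2.5))
```

Under the Poisson model a node with 50 links is effectively impossible, whereas under the power law such hubs remain rare but are expected in any network of a few thousand nodes.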
60
Example Scale-Free Networks
[...]ing systems form a huge genetic network whose vertices are proteins and genes, the chemical interactions between them representing edges (2). At a different organizational level, a large network is formed by the nervous system, whose vertices are the nerve cells, connected by axons (3). But equally complex networks occur in social science, where vertices are individuals or organizations and the edges are the social interactions between them (4), or in the World Wide Web (WWW), whose vertices are HTML documents connected by links pointing from one page to another (5, 6). Because of their large size and the complexity of their interactions, the topology of these networks is largely unknown.
Traditionally, networks of complex topology have been described with the random graph theory of Erdős and Rényi (ER) (7), but in the absence of data on large networks, the predictions of the ER theory were rarely tested in the real world. However, driven by the computerization of data acquisition, such topological information is increasingly available, raising the possibility of understanding the dynamical and topological stability of large networks.
Here we report on the existence of a high degree of self-organization characterizing the large-scale properties of complex networks. Exploring several large databases describing the topology of large networks that span fields as diverse as the WWW or citation patterns in science, we show that, independent of the system and the identity of its constituents, the probability P(k) that a vertex in the network interacts with k other vertices decays as a power law, following $P(k) \sim k^{-\gamma}$. This result indicates that large networks self-organize into a scale-free state, a feature unpredicted by all existing random network models. To explain the origin of this scale invariance, we show that existing network models fail to incorporate growth and preferential attachment, two key features of real networks. Using a model incorporating these two ingredients, we show that they are responsible for the power-law scaling observed in real networks. Finally, we argue that these ingredients play an easily identifiable and important role in the formation of many complex systems, which implies that our results are relevant to a large class of networks observed in nature.
Although there are many systems that form complex networks, detailed topological data is available for only a few. The collaboration graph of movie actors represents a well-documented example of a social network. Each actor is represented by a vertex, two actors being connected if they were cast together in the same movie. The probability that an actor has k links (characterizing his or her popularity) has a power-law tail for large k, following $P(k) \sim k^{-\gamma_{\mathrm{actor}}}$, where $\gamma_{\mathrm{actor}} = 2.3 \pm 0.1$ (Fig. 1A). A more complex network with over 800 million vertices (8) is the WWW, where a vertex is a document and the edges are the links pointing from one document to another. The topology of this graph determines the Web’s connectivity and, consequently, our effectiveness in locating information on the WWW (5). Information about P(k) can be obtained using robots (6), indicating that the probability that k documents point to a certain Web page follows a power law, with $\gamma_{\mathrm{www}} = 2.1 \pm 0.1$ (Fig. 1B) (9). A network whose topology reflects the historical patterns of urban and industrial development is the electrical power grid of the western United States, the vertices being generators, transformers, and substations and the edges being the high-voltage transmission lines between them (10). Because of the relatively modest size of the network, containing only 4941 vertices, the scaling region is less prominent but is nevertheless approximated by a power law with an exponent $\gamma_{\mathrm{power}} \simeq 4$ (Fig. 1C). Finally, a rather large complex network is formed by the citation patterns of the scientific publications, the vertices being papers published in refereed journals and the edges being links to the articles cited in a paper. Recently Redner (11) has shown that the probability that a paper is cited k times (representing the connectivity of a paper within the network) follows a power law with exponent $\gamma_{\mathrm{cite}} = 3$.
The above examples (12) demonstrate that many large random networks share the common feature that the distribution of their local connectivity is free of scale, following a power law for large k with an exponent $\gamma$ between 2.1 and 4, which is unexpected within the framework of the existing network models. The random graph model of ER (7) assumes that we start with N vertices and connect each pair of vertices with probability p. In the model, the probability that a vertex has k edges follows a Poisson distribution $P(k) = e^{-\lambda}\lambda^{k}/k!$, where
$$\lambda = N \binom{N-1}{k} p^{k} (1-p)^{N-1-k}$$
In the small-world model recently introduced by Watts and Strogatz (WS) (10), N vertices form a one-dimensional lattice, each vertex being connected to its two nearest and next-nearest neighbors. With probability p, each edge is reconnected to a vertex chosen at random. The long-range connections generated by this process decrease the distance between the vertices, leading to a small-world phenomenon (13), often referred to as six degrees of separation (14). For p = 0, the probability distribution of the connectivities is $P(k) = \delta(k - z)$, where z is the coordination number in the lattice; whereas for finite p, P(k) still peaks around z, but it gets broader (15). A common feature of the ER and WS models is that the probability of finding a highly connected vertex (that is, a large k) decreases exponentially with k; thus, vertices with large connectivity are practically absent. In contrast, the power-law tail characterizing P(k) for the networks studied indicates that highly connected (large k) vertices have a large chance of occurring, dominating the connectivity.
There are two generic aspects of real networks that are not incorporated in these models. First, both models assume that we start with a fixed number (N) of vertices that are then randomly connected (ER model), or reconnected (WS model), without modifying N. In contrast, most real world networks are open and they form by the continuous addition of new vertices to the system, thus the number of vertices N increases throughout the lifetime of the network. For example, the actor network grows by the addition of new actors to the system, the WWW grows exponentially over time by the addition of new Web pages (8), and the research literature constantly grows by the publication of new papers. Consequently, a common feature of [...]
Fig. 1. The distribution function of connectivities for various large networks. (A) Actor collaboration graph with N = 212,250 vertices and average connectivity $\langle k \rangle = 28.78$. (B) WWW, N = 325,729, $\langle k \rangle = 5.46$ (6). (C) Power grid data, N = 4941, $\langle k \rangle = 2.67$. The dashed lines have slopes (A) $\gamma_{\mathrm{actor}} = 2.3$, (B) $\gamma_{\mathrm{www}} = 2.1$ and (C) $\gamma_{\mathrm{power}} = 4$.
[Fig. 1 panels: Actor collaboration network; WWW; Power grid]
Science, vol. 286, 15 October 1999, p. 510 (Reports), www.sciencemag.org
Barabási & Albert, 1999
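The two ingredients named in the excerpt, growth and preferential attachment, can be illustrated with a small simulation. This is a sketch in the spirit of that model rather than the authors' implementation: it assumes a fully connected seed of m + 1 nodes, that each new node attaches to m distinct existing nodes, and that targets are drawn in proportion to current degree by sampling from a list of edge ends.

```python
import random
from collections import Counter


def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes; return its list of undirected edges."""
    # Seed: complete graph on nodes 0..m
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    # Every node appears in `ends` once per edge end, so a uniform draw
    # from this list picks a node with probability proportional to degree.
    ends = [x for edge in edges for x in edge]
    for new in range(m + 1, n):           # growth: one node per step
        targets = set()
        while len(targets) < m:           # m distinct, degree-biased targets
            targets.add(rng.choice(ends))
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges


rng = random.Random(0)  # fixed seed so the run is repeatable
edges = preferential_attachment(2000, 2, rng)
degree = Counter(x for edge in edges for x in edge)
print("max degree:", max(degree.values()))
print("median degree:", sorted(degree.values())[len(degree) // 2])
```

Early nodes accumulate links fastest, so a few hubs end up with degrees far above the median; plotting the degree histogram on log-log axes is the usual way to inspect the tail.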
61
Scale-free Protein Interaction Networks
Proteomics 2004, 4, 928–942 Protein interaction networks 931
Figure 2. Large-scale characteristics of the protein interaction databases. (a) Degree distribution of the four databases,shown on a log-log plot. Note that all datasets have a power law tail, indicating that the underlying network has a scale-freetopology. The solid line is obtained from the fitting to the function P(k) , kg to the DIP data, the best fit indicating g< 2.5 forDIP data set. (b) Distribution of the clustering coefficient for the four studied databases shown on a log-log plot. The straightline has slope 22. (c) Cluster size distribution for the four databases shown on a log-log plot. Apart from the points corre-sponding to the giant component (for right) the P(n) curves follow a power law. The solid line is obtained from the leastsquare fitting to P(n) , n2a for the MIPS dataset, providing a = 3.4.
lar biology [28], a series of studies have focused on identifying the biological modules in various cellular networks, ranging from the metabolism [15, 29–31] to genetic networks [2, 32]. Modularity assumes the existence of groups of proteins that work together to achieve some well-defined biological function. For example, it is experimentally well established that protein complexes that act as functional modules carry out many biological functions. From the network perspective these modules should appear as distinct groups of nodes that are highly interconnected with each other but have only a few links to nodes outside of the module. Yet, the scale-free topology apparently forbids the existence of independent modules in the network, as the hub proteins’ ability to interact with a high fraction of each module’s components makes a module’s relative isolation all but impossible. Recently, we proposed that the network’s scale-free topology can be reconciled with its potential modularity within the framework of a hierarchical modularity [15, 30, 33]. The most important test of such hierarchical modularity is the scaling of the clustering coefficient, C, defined as $C_i = 2n_i / [k_i(k_i - 1)]$ for each node i that has $k_i$ links, where $n_i$ denotes the number of direct links between the $k_i$ neighbors of node i. For the random and the scale-free model the clustering coefficient of a node with k links is independent of k, that is, on average hubs have the same clustering coefficient as small nodes do. In contrast, for a hierarchical network the clustering coefficient C(k) depends on the node’s degree as [15, 33–35]
$C(k) \sim k^{-\beta}$ (1)
where $\beta$ is the modularity exponent characterizing the network’s hierarchical modularity. Therefore, the C(k) function, which can be measured for arbitrary networks [30], can provide direct evidence if the network has a hierarchical modularity. To test the organization of modularity in protein interaction networks we measured the C(k) function for each of the four studied protein network databases. As Fig. 2b shows, we find that C(k) is not independent of k, but can be well approximated by a power law with exponent $\beta \approx 2$, giving direct evidence of hierarchical modularity in protein interaction networks.
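The definition of $C_i$ and the degree-averaged function C(k) quoted above can be sketched in a few lines of Python. This is a minimal illustration on a made-up toy network, not the code used by Yook et al.; the function names and the toy edge list are assumptions introduced here.

```python
from collections import defaultdict
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from (a, b) pairs, ignoring self-loops."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def clustering_coefficient(adj, node):
    """C_i = 2 n_i / (k_i (k_i - 1)); taken as 0 when k_i < 2."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # n_i: number of direct links among the k_i neighbours of the node
    n_links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * n_links / (k * (k - 1))


def clustering_by_degree(adj):
    """Average C_i over all nodes that share the same degree k, giving C(k)."""
    by_k = defaultdict(list)
    for node in list(adj):
        by_k[len(adj[node])].append(clustering_coefficient(adj, node))
    return {k: sum(cs) / len(cs) for k, cs in sorted(by_k.items())}


# Hypothetical toy network: a triangle A-B-C with a pendant node D attached to A.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]
toy_adj = build_adjacency(toy_edges)
c_of_k = clustering_by_degree(toy_adj)
print(c_of_k)
```

Plotting `c_of_k` on log-log axes for a real interaction list is how one would look for the $k^{-\beta}$ scaling; the toy network is far too small to show it.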
Another important property of the currently available protein interaction networks is that they are fragmented into many distinct clusters [11, 22, 26]. Indeed, we find that each of the four databases are dominated by a giant cluster that contains a significant fraction of all connected proteins, such that one can find a path of protein interactions between any two proteins belonging to this giant component. A small fraction of proteins, however, are either completely isolated (i.e., do not have any known interactions to other proteins) or form small islands of isolated groups of interconnected proteins. To characterize the fragmented nature of the protein interaction network we determined the size n of each isolated cluster, and prepared a normalized histogram of the results, obtaining the cluster size distribution. As Fig. 2c shows, each of the datasets have a giant component of approximately $10^3$ proteins. However, the giant component coexists with many isolated proteins, somewhat fewer two protein clus-
© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.de
Yook, Oltvai, Barabasi, 2004
Panels: Degree distribution, Clustering coefficient, Cluster sizes
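The degree distribution and cluster size distribution named on this slide can both be computed directly from an edge list. The sketch below assumes an undirected list of protein pairs plus an optional list of proteins with no known interactions; the toy data and function names are illustrative only.

```python
from collections import Counter, defaultdict, deque


def degree_distribution(edges):
    """Return {k: fraction of nodes with degree k} for an undirected edge list."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    total = len(adj)
    return {k: c / total for k, c in sorted(counts.items())}


def cluster_sizes(edges, isolated=()):
    """Sizes of connected clusters, largest first, found by breadth-first search."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    for node in isolated:      # proteins with no known interactions
        adj[node]              # touching the key registers the node
    seen, sizes = set(), []
    for start in list(adj):
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            node = queue.popleft()
            size += 1
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        sizes.append(size)
    return sorted(sizes, reverse=True)


# Hypothetical toy data: one four-protein cluster, one pair, one isolated protein.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("E", "F")]
p_of_k = degree_distribution(toy_edges)
sizes = cluster_sizes(toy_edges, isolated=["G"])
print(p_of_k, sizes)
```

On a real database the first entry of `sizes` would be the giant component, and a histogram of the remaining entries gives the P(n) curve.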
Hubs and Lethality
Barabasi, 2001
Proteins are traditionally identified on the basis of their individual actions as catalysts, signalling molecules, or building blocks in cells and microorganisms. But our post-genomic view is expanding the protein’s role into an element in a network of protein–protein interactions as well, in which it has a contextual or cellular function within functional modules [1, 2]. Here we provide quantitative support for this idea by demonstrating that the phenotypic consequence of a single gene deletion in the yeast Saccharomyces cerevisiae is affected to a large extent by the topological position of its protein product in the complex hierarchical web of molecular interactions.
The S. cerevisiae protein–protein interaction network we investigate has 1,870 proteins as nodes, connected by 2,240 identified direct physical interactions, and is derived from combined, non-overlapping data [3, 4], obtained mostly by systematic two-hybrid analyses [3]. Owing to its size, a complete map of the network (Fig. 1a), although informative, in itself offers little insight into its large-scale characteristics. Our first goal was therefore to identify the architecture of this network, determining whether it is best described by an inherently uniform exponential topology, with proteins on average possessing the same number of links, or by a highly heterogeneous scale-free topology, in which proteins have widely different connectivities [5].
As we show in Fig. 1b, the probability that a given yeast protein interacts with k other yeast proteins follows a power law [5] with an exponential cut-off [6] at $k_c \approx 20$, a topology that is also shared by the protein–protein interaction network of the bacterium Helicobacter pylori [7]. This indicates that the network of protein interactions in two separate organisms forms a highly inhomogeneous scale-free network in which a few highly connected proteins play a central role in mediating interactions among numerous, less connected proteins.
An important known consequence of the inhomogeneous structure is the network’s simultaneous tolerance to random errors, coupled with fragility against the removal of the most connected nodes [8]. We find that random mutations in the genome of S. cerevisiae, modelled by the removal of randomly selected yeast proteins, do not affect the overall topology of the network. By contrast, when the most connected proteins are computationally eliminated, the network diameter increases rapidly. This simulated tolerance against random mutation is in agreement with results from systematic mutagenesis experiments, which identified a striking capacity of yeast to tolerate the deletion of a substantial number of individual proteins from its proteome [9, 10]. However, if this is indeed due to a topological component to error tolerance, then, on average, less connected proteins should prove to be less essential than highly connected ones.
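The error-versus-attack comparison described above can be mimicked on a toy graph. The sketch below uses the mean shortest-path length between connected pairs as a stand-in for "network diameter" (an assumption; the excerpt does not define the measure), and the hub-plus-ring network is invented for illustration.

```python
import random
from collections import defaultdict, deque


def build_adjacency(edges):
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def remove_nodes(adj, victims):
    """Return a copy of the network with the given nodes (and their links) deleted."""
    victims = set(victims)
    return {n: nbrs - victims for n, nbrs in adj.items() if n not in victims}


def mean_path_length(adj):
    """Average shortest-path length over all connected pairs (BFS from every node)."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def most_connected(adj, count):
    """The `count` nodes with the highest degree."""
    return sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:count]


# Hypothetical toy network: a ring of six proteins, all also bound by one hub H.
ring = [f"r{i}" for i in range(6)]
toy_edges = [(ring[i], ring[(i + 1) % 6]) for i in range(6)] + [("H", r) for r in ring]
adj = build_adjacency(toy_edges)

baseline = mean_path_length(adj)
after_attack = mean_path_length(remove_nodes(adj, most_connected(adj, 1)))
rng = random.Random(0)
after_error = mean_path_length(remove_nodes(adj, [rng.choice(sorted(adj))]))
print(baseline, after_attack, after_error)
```

Removing the hub lengthens paths between the remaining proteins, whereas removing a ring protein barely changes them; on a real network one would repeat the random removal many times and remove hubs progressively.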
To test this, we rank-ordered all interacting proteins based on the number of links they have, and correlated this with the phenotypic effect of their individual removal from the yeast proteome. As shown in Fig. 1c, the likelihood that removal of a protein will prove lethal correlates with the number of interactions the protein has. For example, although proteins with five or fewer links constitute about 93% of the total number of proteins, we find that only about 21% of them are essential. By contrast, only some 0.7% of the yeast proteins with known phenotypic profiles have more than 15 links, but single deletion of 62% or so of these proves lethal. This implies that highly connected proteins with a central role in the network’s architecture are three times more likely to be essential than proteins with only a small number of links to other proteins.
The simultaneous emergence of an inhomogeneous structure in both metabolic [5, 11] and protein interaction networks suggests that there has been evolutionary selection of a common large-scale structure of biological networks and indicates that future systematic protein–protein interaction studies in other organisms will uncover an essentially identical protein-network topology. The correlation between the connectivity and indispensability of a given protein confirms that, despite the importance of individual biochemical function and genetic redundancy, the robustness against mutations in yeast is also derived from the organization of interactions and the topological positions of individual proteins [12]. A better understanding of cell dynamics and robustness will be obtained from an integrated approach that simultaneously incorporates the individual and contextual properties of all constituents in complex cellular networks.
H. Jeong*, S. P. Mason†, A.-L. Barabási*, Z. N. Oltvai†
*Department of Physics, University of Notre Dame,
Notre Dame, Indiana 46556, USA
e-mail: [email protected], [email protected]
brief communications
NATURE | VOL 411 | 3 MAY 2001 | www.nature.com 41
Lethality and centrality in protein networks
The most highly connected proteins in the cell are the most important for its survival.
Figure 1 Characteristics of the yeast proteome. a, Map of protein–protein interactions. The largest cluster, which contains ~78% of all proteins, is shown. The colour of a node signifies the phenotypic effect of removing the corresponding protein (red, lethal; green, non-lethal; orange, slow growth; yellow, unknown). b, Connectivity distribution P(k) of interacting yeast proteins, giving the probability that a given protein interacts with k other proteins. The exponential cut-off [6] indicates that the number of proteins with more than 20 interactions is slightly less than expected for pure scale-free networks. In the absence of data on the link directions, all interactions have been considered as bidirectional. The parameter controlling the short-length scale correction has value $k_0 \approx 1$. c, The fraction of essential proteins with exactly k links versus their connectivity, k, in the yeast proteome. The list of 1,572 mutants with known phenotypic profile was obtained from the Proteome database [13]. Detailed statistical analysis, including $r = 0.75$ for Pearson’s linear correlation coefficient, demonstrates a positive correlation between lethality and connectivity. For additional details, see http://www.nd.edu/~networks/cell.
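The quantity plotted in panel c, the fraction of essential proteins among those with exactly k links, can be tabulated as sketched below, together with a plain Pearson correlation. The protein names, degrees and phenotypes are invented placeholders, and correlating k against the per-degree fraction is only one plausible reading of the caption, not necessarily the authors' exact analysis.

```python
from collections import defaultdict
from math import sqrt


def essential_fraction_by_degree(degree, essential):
    """Map each degree k to the fraction of proteins with exactly k links that are essential."""
    tally = defaultdict(lambda: [0, 0])   # k -> [essential count, total count]
    for protein, k in degree.items():
        if protein not in essential:       # skip proteins without a known phenotype
            continue
        tally[k][0] += 1 if essential[protein] else 0
        tally[k][1] += 1
    return {k: e / n for k, (e, n) in sorted(tally.items())}


def pearson(xs, ys):
    """Pearson's linear correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: degree per protein, and True where deletion is lethal.
degree = {"p1": 1, "p2": 1, "p3": 1, "p4": 1, "p5": 2, "p6": 2, "p7": 3, "p8": 3, "p9": 5}
essential = {"p1": False, "p2": False, "p3": False, "p4": True,
             "p5": True, "p6": False, "p7": True, "p8": True}   # p9: phenotype unknown

fractions = essential_fraction_by_degree(degree, essential)
r = pearson(list(fractions.keys()), list(fractions.values()))
print(fractions, r)
```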
© 2001 Macmillan Magazines Ltd
Date Hubs & Party Hubs
Different proteins are expressed at different times and places - not all interactions need occur at the same state/time/place
Network Motifs
• Motifs: small patterns of interactions that occur more frequently in biological networks than expected at random
• Different motifs may play different information-processing roles
• Count the number of occurrences of each possible subgraph containing 2, 3, 4, ... nodes, and compare to randomized networks
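As a concrete instance of the counting step, the sketch below counts one three-node pattern, the feedforward loop (X→Y, X→Z, Y→Z), in a directed edge list. It is a simplified illustration: it counts every occurrence of the pattern rather than distinguishing induced subgraph types, and the toy edges are made up.

```python
from collections import defaultdict


def count_feedforward_loops(edges):
    """Count node triples (x, y, z) with edges x->y, x->z and y->z."""
    out = defaultdict(set)
    for a, b in edges:
        if a != b:                      # ignore self-loops
            out[a].add(b)
    count = 0
    for x in list(out):
        for y in out[x]:
            for z in out.get(y, ()):
                if z != x and z in out[x]:
                    count += 1
    return count


# Hypothetical toy network with two feedforward loops: (X, Y, Z) and (X, Z, W).
toy_edges = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "W"), ("X", "W")]
n_ffl = count_feedforward_loops(toy_edges)
print(n_ffl)
```

The same count is then repeated on randomized versions of the network to judge whether the observed number is unusually high.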
Searching for Network Motifs
Milo R. et al “Network Motifs: Simple Building Blocks of Complex Networks” Science 298 p824-827 (2002)
Network Motifs: Simple Building Blocks of Complex Networks
R. Milo,1 S. Shen-Orr,1 S. Itzkovitz,1 N. Kashtan,1 D. Chklovskii,2 U. Alon1*
Complex networks are studied across many fields of science. To uncover their structural design principles, we defined “network motifs,” patterns of interconnections occurring in complex networks at numbers that are significantly higher than those in randomized networks. We found such motifs in networks from biochemistry, neurobiology, ecology, and engineering. The motifs shared by ecological food webs were distinct from the motifs shared by the genetic networks of Escherichia coli and Saccharomyces cerevisiae or from those found in the World Wide Web. Similar motifs were found in networks that perform information processing, even though they describe elements as different as biomolecules within a cell and synaptic connections between neurons in Caenorhabditis elegans. Motifs may thus define universal classes of networks. This approach may uncover the basic building blocks of most networks.
Many of the complex networks that occur in nature have been shown to share global statistical features (1–10). These include the “small world” property (1–9) of short paths between any two nodes and highly clustered connections. In addition, in many natural networks, there are a few nodes with many more connections than the average node has. In these types of networks, termed “scale-free networks” (4, 6), the fraction of nodes having k edges, p(k), decays as a power law $p(k) \sim k^{-\gamma}$ (where $\gamma$ is often between 2 and 3). To go beyond these global features would require an understanding of the basic structural elements particular to each class of networks (9). To do this, we developed an algorithm for detecting network motifs: recurring, significant patterns of interconnections. A detailed application to a gene regulation network has been presented (11). Related methods were used to test hypotheses on social networks (12, 13). Here we generalize this approach to virtually any type of connectivity graph and find the striking appearance of
1Departments of Physics of Complex Systems and Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel 76100. 2Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA.
*To whom correspondence should be addressed. E-mail: [email protected]
Fig. 1. (A) Examples of interactions represented by directed edges between nodes in some of the networks used for the present study. These networks go from the scale of biomolecules (transcription factor protein X binds regulatory DNA regions of a gene to regulate the production rate of protein Y), through cells (neuron X is synaptically connected to neuron Y), to organisms (X feeds on Y). (B) All 13 types of three-node connected subgraphs.
Reports. 25 October 2002, Vol 298, Science, www.sciencemag.org, p. 824
Searching For Network Motifs
motifs in networks representing a broad range of natural phenomena.
We started with networks where the interactions between nodes are represented by directed edges (Fig. 1A). Each network was scanned for all possible n-node subgraphs (in the present study, n = 3 and 4), and the number of occurrences of each subgraph was recorded. Each network contains numerous types of n-node subgraphs (Fig. 1B). To focus on those that are likely to be important, we compared the real network to suitably randomized networks (12–16) and only selected patterns appearing in the real network at numbers significantly higher than those in the randomized networks (Fig. 2). For a stringent comparison, we used randomized networks that have the same single-node characteristics as does the real network: Each node in the randomized networks has the same number of incoming and outgoing edges as the corresponding node has in the real network. The comparison to this randomized ensemble accounts for patterns that appear only because of the single-node characteristics of the network (e.g., the presence of nodes with a large number of edges). Furthermore, the randomized networks used to calculate the significance of n-node subgraphs were generated to preserve the same number of appearances of all (n – 1)-node subgraphs as in the real network (17, 18). This ensures that a high significance was not assigned to a pattern only because it has a highly significant subpattern. The “network motifs” are those patterns for which the probability P of appearing in a randomized network an equal or greater number of times than in the real network is lower than a cutoff value (here P = 0.01). Patterns that are functionally important but not statistically significant could exist, which would be missed by our approach.
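The comparison against degree-preserving randomized networks can be sketched as follows. Swapping the targets of two randomly chosen edges is a common way to keep every node's in-degree and out-degree fixed; it is used here as an assumed stand-in, since the excerpt does not spell out the authors' randomization procedure, and this sketch does not preserve (n – 1)-node subgraph counts. All function names and the toy edges are illustrative.

```python
import random
from statistics import mean, pstdev


def count_ffl(edges):
    """Count feedforward-loop patterns x->y, x->z, y->z (input assumed free of self-loops)."""
    out = {}
    for a, b in edges:
        out.setdefault(a, set()).add(b)
    return sum(1 for x in out for y in out[x] for z in out.get(y, ()) if z in out[x])


def randomize(edges, rng, swaps_per_edge=10):
    """Shuffle a directed edge list by target swaps: (a->b, c->d) becomes (a->d, c->b).

    Every node keeps its in-degree and out-degree. Swaps that would create a
    self-loop or a duplicate edge are skipped.
    """
    edges = list(edges)
    present = set(edges)
    for _ in range(swaps_per_edge * len(edges)):
        i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if a == d or c == b or (a, d) in present or (c, b) in present:
            continue
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


def motif_significance(edges, n_random=200, seed=0):
    """Return (N_real, mean N_rand, SD, Z score, empirical P) for the feedforward loop."""
    rng = random.Random(seed)
    n_real = count_ffl(edges)
    counts = [count_ffl(randomize(edges, rng)) for _ in range(n_random)]
    mu, sd = mean(counts), pstdev(counts)
    z = (n_real - mu) / sd if sd > 0 else float("nan")
    # P: fraction of randomized networks with at least as many appearances
    p = sum(c >= n_real for c in counts) / n_random
    return n_real, mu, sd, z, p


# Hypothetical toy network; far too small for a meaningful significance estimate.
toy = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"),
       ("d", "e"), ("e", "f"), ("b", "f"), ("f", "a")]
print(motif_significance(toy, n_random=50))
```

With only a few hundred randomized networks, the empirical P cannot resolve values much below 1/`n_random`, which is one reason a Z score is often reported alongside it.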
We applied the algorithm to several networks from biochemistry (transcriptional gene regulation), ecology (food webs), neurobiology (neuron connectivity), and engineering (electronic circuits, World Wide Web). The network motifs found are shown in Table 1. Transcription networks are biochemical networks responsible for regulating the expression of genes in cells (11, 19). These are directed graphs, in which the nodes represent genes (Fig. 1A). Edges are directed from a gene that encodes for a transcription factor protein to a gene transcriptionally regulated by that transcription factor. We analyzed the two best characterized transcriptional regulation networks, corresponding to organisms from different kingdoms: a eukaryote (the yeast Saccharomyces cerevisiae) (20) and a bacterium (Escherichia coli) (11, 19). The two transcription networks show the same motifs: a three-node motif termed “feedforward loop” (11) and a four-node motif termed “bi-fan.” These motifs appear numerous times in each network (Table 1), in nonhomologous gene systems that perform diverse biological functions. The number of times they appear is more than 10 standard deviations greater than their mean number of appearances in randomized networks. Only these subgraphs, of the 13 possible different three-node subgraphs (Fig. 1B) and 199 different four-node subgraphs, are significant and are therefore considered network motifs. Many other three- and four-node subgraphs recur throughout the networks, but at numbers that are less than the mean plus 2 standard deviations of their appearance in randomized networks.
We next applied the algorithm to ecosystem food webs (21, 22), in which nodes represent groups of species. Edges are directed from a node representing a predator to the node representing its prey. We analyzed data collected by different groups at seven distinct ecosystems (22), including both aquatic and terrestrial habitats. Each of the food webs displayed one or two three-node network motifs and one to five four-node network motifs. One can define the “consensus motifs” as the motifs shared by networks of a given type. Five of the seven food webs shared one three-node motif, and all seven shared one four-node motif (Table 1). In contrast to the three-node motif (termed “three chain”), the three-node feedforward loop was underrepresented in the food webs. This suggests that direct interactions between species at a separation of two layers [as in the case of omnivores (23)] are selected against. The bi-parallel motif indicates that two species that are prey of the same predator both tend to share the same prey. Both network motifs may thus represent general tendencies of food webs (21, 22).
We next studied the neuronal connectivity network of the nematode Caenorhabditis elegans (24). Nodes represent neurons (or neuron
Fig. 2. Schematic view of network motif detection. Network motifs are patterns that recur much more frequently (A) in the real network than (B) in an ensemble of randomized networks. Each node in the randomized networks has the same number of incoming and outgoing edges as does the corresponding node in the real network. Red dashed lines indicate edges that participate in the feedforward loop motif, which occurs five times in the real network.
[Fig. 3 plot: concentration of feedforward loop (y-axis, 0 to 0.015) versus subnetwork size (x-axis, 150 to 400); two curves, Real and Random]
Fig. 3. Concentration C of the feedforward loop motif in real and randomized subnetworks of the E. coli transcription network (11). C is the number of appearances of the motif divided by the total number of appearances of all connected three-node subgraphs (Fig. 1B). Subnetworks of size S were generated by choosing a node at random and adding to it nodes connected by an incoming or outgoing edge, until S nodes were obtained, and then including all of the edges between these S nodes present in the full network. Each of the subnetworks was randomized (17, 18) (shown are mean and SD of 400 subnetworks of each size).
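The subnetwork procedure and the concentration C in the Fig. 3 caption can be sketched as below. The caption leaves the order in which neighbouring nodes are added open, so picking one frontier node uniformly at random per step is an assumption; the brute-force triple enumeration is only practical for small networks, and the toy edges are invented.

```python
import random
from itertools import combinations


def grow_subnetwork(edges, size, rng):
    """Start from a random node, add neighbours (either edge direction) until `size` nodes."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    chosen = {rng.choice(sorted(nbrs))}
    while len(chosen) < size:
        frontier = sorted({n for c in chosen for n in nbrs[c]} - chosen)
        if not frontier:
            break                       # component exhausted before reaching `size`
        chosen.add(rng.choice(frontier))
    # keep every edge of the full network whose two ends were both chosen
    return [(a, b) for a, b in edges if a in chosen and b in chosen]


def ffl_concentration(edges):
    """Feedforward-loop triples divided by all connected three-node subgraphs."""
    out, und = {}, {}
    for a, b in edges:
        out.setdefault(a, set()).add(b)
        und.setdefault(a, set()).add(b)
        und.setdefault(b, set()).add(a)
    connected = ffl = 0
    for triple in combinations(sorted(und), 3):
        x, y, z = triple
        links = (y in und[x]) + (z in und[x]) + (z in und[y])
        if links < 2:
            continue                    # the three nodes are not connected
        connected += 1
        out_deg = sorted(sum(v in out.get(u, ()) for v in triple if v != u) for u in triple)
        # three linked pairs with out-degrees 0, 1, 2 is exactly a feedforward loop
        if links == 3 and out_deg == [0, 1, 2]:
            ffl += 1
    return ffl / connected if connected else 0.0


# Hypothetical toy network: one feedforward loop (X, Y, Z) plus a tail Z -> W.
toy = [("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "W")]
sub = grow_subnetwork(toy, 3, random.Random(0))
print(sub, ffl_concentration(toy))
```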
Milo R. et al “Network Motifs: Simple Building Blocks of Complex Networks” Science 298 p824-827 (2002)
Motifs in Networks
classes), and edges represent synaptic connections between the neurons. We found the feedforward loop motif in agreement with anatomical observations of triangular connectivity structures (24). The four-node motifs include the bi-fan and the bi-parallel (Table 1). Two of these motifs (feedforward loop and bi-fan) were also found in the transcriptional gene regulation networks. This similarity in motifs may point to a fundamental similarity in the design constraints of the two types of networks. Both networks function to carry information from sensory components (sensory neurons/transcription factors regulated by biochemical signals) to effectors (motor neurons/structural genes). The feedforward loop motif common to both types of networks may play a functional role in information processing. One possible function of this circuit is to activate output only if the input signal is persistent and to allow a rapid deactivation when the input goes off (11). Indeed, many of the input nodes in the neural feedforward loops are sensory neurons, which may require this type of information processing to reject transient input fluctuations that are inherent in a variable or noisy environment.
We also studied several technological networks. We analyzed the ISCAS89 benchmark set of sequential logic electronic circuits (7, 25). The nodes in these circuits represent logic gates and flip-flops. These nodes are linked by directed edges. We found that the motifs separate the circuits into classes that correspond to the circuit’s functional description. In Table 1, we present two classes, consisting of five forward-logic chips and three digital fractional multipliers. The digital fractional multipliers share three motifs, including three- and four-node feedback loops. The forward logic chips share the feedforward loop, bi-fan, and bi-parallel motifs, which are similar to the motifs found in the genetic and neuronal information-processing networks. We found a different set of motifs in a network of directed hyperlinks between World Wide Web pages within a single domain (4). The World Wide Web motifs may reflect a design aimed at short paths between related pages. Application of our approach to nondirected networks shows distinct sets of motifs in networks of protein interactions and Internet router connections (18).
None of the network motifs shared by the food webs matched the motifs found in the gene regulation networks or the World Wide Web. Only one of the food web consensus motifs also appeared in the neuronal network. Different motif sets were found in electronic circuits with different functions. This suggests that motifs can define broad classes of networks, each with specific types of elementary structures. The motifs reflect the underlying processes that generated each type of network; for example, food webs evolve to allow a flow of energy from the bottom to the top of food chains, whereas gene regulation and neuron networks evolve to process information. Information processing seems to give rise to significantly different structures than does energy flow.
We further characterized the statistical significance of the motifs as a function of network size, by considering pieces of various sizes (subnetworks) of the full network. The concentration of motifs in the subnetworks is about the same as that in the full network (Fig. 3). In contrast, the concentration of the corresponding subgraphs in the randomized versions of the subnetworks decreases sharply with size. In analogy with statistical physics, the number of appearances of each motif in the real networks
Table 1. Network motifs found in biological and technological networks. The numbers of nodes and edges for each network are shown. For each motif, the numbers of appearances in the real network (Nreal) and in the randomized networks (Nrand ± SD, all values rounded) (17, 18) are shown. The P value of all motifs is P < 0.01, as determined by comparison to 1000 randomized networks (100 in the case of the World Wide Web). As a qualitative measure of statistical significance, the Z score = (Nreal – Nrand)/SD is shown. NS, not significant. Shown are motifs that occur at least U = 4 times with completely different sets of nodes. The networks are as follows (18): transcription interactions between regulatory proteins and genes in the bacterium E. coli (11) and the yeast S. cerevisiae (20); synaptic connections between neurons in C. elegans, including neurons connected by at least five synapses (24); trophic interactions in ecological food webs (22), representing pelagic and benthic species (Little Rock Lake), birds, fishes, invertebrates (Ythan Estuary), primarily larger fishes (Chesapeake Bay), lizards (St. Martin Island), primarily invertebrates (Skipwith Pond), pelagic lake species (Bridge Brook Lake), and diverse desert taxa (Coachella Valley); electronic sequential logic circuits parsed from the ISCAS89 benchmark set (7, 25), where nodes represent logic gates and flip-flops (presented are all five partial scans of forward-logic chips and three digital fractional multipliers in the benchmark set); and World Wide Web hyperlinks between Web pages in a single domain (4) (only three-node motifs are shown). e, multiplied by the power of 10 (e.g., 1.46e6 = 1.46 × 10^6).
*Has additional four-node motif: (X→Z, W; Y→Z, W; Z→W), Nreal = 150, Nrand = 85 ± 15, Z = 4.
†Has additional four-node motif: (X→Y, Z; Y→Z; Z→W), Nreal = 204, Nrand = 80 ± 20, Z = 6. The three-node pattern (X→Y, Z; Y→Z; Z→Y) also occurs significantly more than at random. It is not a motif by the present definition because it does not appear with completely distinct sets of nodes more than U = 4 times.
‡Has additional four-node motif: (X→Y; Y→Z, W; Z→X; W→X), Nreal = 914, Nrand = 500 ± 70, Z = 6.
§Has two additional three-node motifs: (X→Y, Z; Y→Z; Z→Y), Nreal = 3e5, Nrand = 1.4e3 ± 6e1, Z = 6000, and (X→Y, Z; Y→Z), Nreal = 5e5, Nrand = 9e4 ± 1.5e3, Z = 250.
For each motif the three values are: Nreal; Nrand ± SD; Z score. Motif diagrams (nodes X, Y, Z, W) are not reproduced.

Gene regulation (transcription)
Network | Nodes | Edges | Feedforward loop | Bi-fan
E. coli | 424 | 519 | 40; 7 ± 3; 10 | 203; 47 ± 12; 13
S. cerevisiae* | 685 | 1,052 | 70; 11 ± 4; 14 | 1812; 300 ± 40; 41

Neurons
Network | Nodes | Edges | Feedforward loop | Bi-fan | Bi-parallel
C. elegans† | 252 | 509 | 125; 90 ± 10; 3.7 | 127; 55 ± 13; 5.3 | 227; 35 ± 10; 20

Food webs
Network | Nodes | Edges | Three chain | Bi-parallel
Little Rock | 92 | 984 | 3219; 3120 ± 50; 2.1 | 7295; 2220 ± 210; 25
Ythan | 83 | 391 | 1182; 1020 ± 20; 7.2 | 1357; 230 ± 50; 23
St. Martin | 42 | 205 | 469; 450 ± 10; NS | 382; 130 ± 20; 12
Chesapeake | 31 | 67 | 80; 82 ± 4; NS | 26; 5 ± 2; 8
Coachella | 29 | 243 | 279; 235 ± 12; 3.6 | 181; 80 ± 20; 5
Skipwith | 25 | 189 | 184; 150 ± 7; 5.5 | 397; 80 ± 25; 13
B. Brook | 25 | 104 | 181; 130 ± 7; 7.4 | 267; 30 ± 7; 32

Electronic circuits (forward logic chips)
Network | Nodes | Edges | Feedforward loop | Bi-fan | Bi-parallel
s15850 | 10,383 | 14,240 | 424; 2 ± 2; 285 | 1040; 1 ± 1; 1200 | 480; 2 ± 1; 335
s38584 | 20,717 | 34,204 | 413; 10 ± 3; 120 | 1739; 6 ± 2; 800 | 711; 9 ± 2; 320
s38417 | 23,843 | 33,661 | 612; 3 ± 2; 400 | 2404; 1 ± 1; 2550 | 531; 2 ± 2; 340
s9234 | 5,844 | 8,197 | 211; 2 ± 1; 140 | 754; 1 ± 1; 1050 | 209; 1 ± 1; 200
s13207 | 8,651 | 11,831 | 403; 2 ± 1; 225 | 4445; 1 ± 1; 4950 | 264; 2 ± 1; 200

Electronic circuits (digital fractional multipliers)
Network | Nodes | Edges | Three-node feedback loop | Bi-fan | Four-node feedback loop
s208 | 122 | 189 | 10; 1 ± 1; 9 | 4; 1 ± 1; 3.8 | 5; 1 ± 1; 5
s420 | 252 | 399 | 20; 1 ± 1; 18 | 10; 1 ± 1; 10 | 11; 1 ± 1; 11
s838‡ | 512 | 819 | 40; 1 ± 1; 38 | 22; 1 ± 1; 20 | 23; 1 ± 1; 25

World Wide Web
Network | Nodes | Edges | Feedback with two mutual dyads | Fully connected triad | Uplinked mutual dyad
nd.edu§ | 325,729 | 1.46e6 | 1.1e5; 2e3 ± 1e2; 800 | 6.8e6; 5e4 ± 4e2; 15,000 | 1.2e6; 1e4 ± 2e2; 5000
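As a worked check of the Z score defined in the Table 1 caption, take the E. coli bi-fan entry (Nreal = 203, Nrand = 47 ± 12):

```latex
Z = \frac{N_{\mathrm{real}} - N_{\mathrm{rand}}}{\mathrm{SD}}
  = \frac{203 - 47}{12}
  = 13
```

This matches the tabulated value. Because all table entries are rounded, recomputing other rows from the printed numbers can differ slightly from the printed Z (for example, the E. coli feedforward loop gives (40 − 7)/3 = 11 against a printed 10).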
Milo R. et al
Graph Theory & Networks References
Gary Chartrand “Introductory Graph Theory” Dover 1977
Albert-László Barabási & Zoltán N. Oltvai. “Network Biology: Understanding the cell’s functional organization” Nature Reviews Genetics 5. 101-114 (February 2004)
Albert-László Barabási. “Linked” (Nice easy read.)
Reka Albert & Albert-László Barabási. “Statistical mechanics of complex networks” Reviews of Modern Physics 74. 47-97 (January 2002) (Nice, not so easy read.)
5. Accessing & Visualizing Interaction Data
1. Types of interaction data
2. Interaction Databases
3. Integrating other biological data
4. Cytoscape for integrating biological data
5. Pajek for graph analysis
more than just protein interactions...
• Genetic interactions (epistatic, synthetic lethal)
• Coexpression
• Small-molecule/protein interactions
• Chemical-genetic interactions
• Chemical reactions
• Transcription factor binding (ChIP)
• Localization
• etc...
• Quantitative data:
• transcripts, proteins, metabolites...
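One simple way to hold several of these interaction types side by side is an edge list in which every pair carries a type label. The sketch below assumes a hypothetical tab-separated "source, type, target" layout (loosely in the spirit of simple three-column network files); the protein names and type labels are placeholders, not data from any database.

```python
from collections import defaultdict


def parse_typed_edges(lines):
    """Parse 'source<TAB>type<TAB>target' lines into {type: set of (source, target)}."""
    by_type = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comments
            continue
        source, kind, target = line.split("\t")
        by_type[kind].add((source, target))
    return by_type


def shared_pairs(by_type, kind_a, kind_b):
    """Unordered pairs that are supported by both interaction types."""
    def unordered(pairs):
        return {frozenset(p) for p in pairs}
    return unordered(by_type[kind_a]) & unordered(by_type[kind_b])


# Hypothetical example records.
records = [
    "# source\ttype\ttarget",
    "P1\tphysical\tP2",
    "P2\tcoexpression\tP1",
    "P1\tsynthetic_lethal\tP3",
    "",
]
typed = parse_typed_edges(records)
both = shared_pairs(typed, "physical", "coexpression")
print(both)
```

Pairs supported by more than one independent data type are natural candidates for follow-up, which is the kind of integration the later slides turn to.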
Major Public Interaction Databases
Database | Host (lead) | Access | Notes | URL
BIND | Blueprint (Hogue) | Free | complex data model; extensive curation | http://www.blueprint.org
DIP | UCLA (Eisenberg) | Free only for academics | curated | http://dip.doe-mbi.ucla.edu/
GRID | MSHRI (Tyers) | Free | simple, easy-to-use format; high throughput only | http://biodata.mshri.on.ca/grid/servlet/Index
INTACT | EBI (Apweiler) | Free | | http://www.ebi.ac.uk/intact/index.html
MINT | Rome | Free | | http://mint.bio.uniroma2.it/mint/index.php
HPRD | Hopkins/IOB (Pandey) | Free only for academics | human data | http://www.hprd.org
many databases are limited to protein-protein interaction data
Commercial Sources
• Ingenuity
• Database of mainly mammalian functional interactions.
• Analysis tools are useful, but very focussed on interpreting microarray data.
• high cost (but discounted for academics)
Ingenuity Pathways Analysis, Application Note 0104, Page 5
The functional analysis and the molecular relationships provided in Network 12 suggest that circadian phased expression of nuclear hormone receptors affects circadian regulation of lipid metabolism. This hypothesis is further explored by examining the well-characterized glycerolipid biosynthesis pathway in its entirety, and seeing which of the enzymes in that pathway have an expression pattern that correlates with circadian phasing.
Clicking on the Pathway Tag icon in the Network Explorer bar for Network 12 provides a direct link to a graphical representation of the canonical metabolic pathway “Glycerolipid Metabolism”. As seen in Figure 6, this graphic shows all of the genes in the user-defined input list that play a role in Glycerolipid Metabolism, their corresponding Enzyme Class (EC) numbers, and the Ingenuity Pathways Analysis networks that gene is involved in.
By providing seamless navigation between the glycerolipid metabolism pathway and Network 12, Ingenuity Pathways Analysis enables users to quickly answer additional questions about the network. Specifically, users can address the question of therapeutic relevance of this network by activating the Drug View icon in the Network Explorer toolbar. As displayed in Figure 7, Network 12 contains several targets of FDA approved drugs used in the treatment of cholesterol disorders, adding additional relevance to this circadianly regulated network.
Figure 5: Coordinate regulation of metabolic enzymes. Network 12 identifies the functional relationship between circadianly regulated metabolic enzymes MGLL and LPL (diamond shape) and the nuclear hormone receptor PPARA (rectangle shape). The arrowhead reflects the directionality of the relationships (PPARA acts on MGLL and LPL).
Figure 6: Circadian regulation of glycerolipid metabolism enzymes. The coloring scheme is identical to that of Network 12 (circadianly regulated Focus Genes are green). Membership of individual genes in enzyme classes was established using the LIGAND database [5].
BIND Data Policy
• Source Code
• BIND source code is available at SourceForge.net under the terms and agreements of the GNU General Public License (GPL).
• Data
• BIND data is free for both commercial and academic use. If you use BIND data, please cite:
• Bader GD, Betel D, Hogue CW. (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 31(1):248-50 PMID: 12519993
• This data is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
BIND + NCBI RefSeq: Biomolecular Assembly – the “edge”
Bader, Hogue 2002
Modeling Protein complex data as interactions
possible models:
• matrix
• spoke
[Diagrams: proteins A, B, C, D and E drawn as interaction graphs under each model]
Topology and number of complexes remain unknown
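The two models named on this slide can be sketched in a few lines. This is a minimal illustration using the usual definitions (spoke: the bait is paired with every co-purified protein; matrix: every member of the purification is paired with every other member). Treating A as the bait is an assumption made only for the example.

```python
from itertools import combinations


def spoke_model(bait, preys):
    """Spoke model: the bait is linked to each prey, and preys are not linked to each other."""
    return {frozenset((bait, prey)) for prey in preys if prey != bait}


def matrix_model(bait, preys):
    """Matrix model: every pair of proteins in the purification is linked."""
    members = sorted({bait, *preys})
    return {frozenset(pair) for pair in combinations(members, 2)}


# Hypothetical purification: bait A pulls down B, C, D and E.
bait, preys = "A", ["B", "C", "D", "E"]
spoke = spoke_model(bait, preys)
matrix = matrix_model(bait, preys)
print(len(spoke), len(matrix))  # 4 10
```

The spoke edges are always a subset of the matrix edges, so the choice of model changes how many pairwise interactions one purification contributes to a network.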
Data
• At present, different databases contain non-overlapping data - need to collect data from multiple sources
• Emerging standards and consortia: PSI (psidev.sourceforge.net) and BioPAX (www.biopax.org) will eventually facilitate synchronization
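Collecting data from several sources amounts to taking the union of their interaction lists while remembering where each interaction came from. The sketch below assumes that identifiers have already been mapped to one common namespace, which in practice is the hard part. The protein names and the contents of each source are invented for illustration.

```python
from collections import defaultdict


def merge_sources(sources):
    """Union of undirected interactions, keeping the set of sources that report each one."""
    merged = defaultdict(set)
    for source_name, pairs in sources.items():
        for a, b in pairs:
            # frozenset makes (a, b) and (b, a) the same interaction
            merged[frozenset((a, b))].add(source_name)
    return dict(merged)


# Hypothetical contents; real databases would be parsed from their download files.
sources = {
    "BIND": [("P1", "P2"), ("P2", "P3")],
    "DIP": [("P2", "P1"), ("P3", "P4")],
}
merged = merge_sources(sources)
for pair, reported_by in merged.items():
    print(sorted(pair), sorted(reported_by))
```

Keeping the provenance set makes it easy to filter later for interactions supported by more than one source.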
Visualization
• Informative layouts
• Integration of many interaction types
• Integration of state, function data
• Exploration
• Filtering biologically interesting subgraphs
• Network vs. matrix
Visualization Tools
• Pajek: http://vlado.fmf.uni-lj.si/pub/networks/pajek/
• Cytoscape: www.cytoscape.org
• Osprey: biodata.mshri.on.ca/osprey
Cytoscape demo
6. Assessing and Predicting Interactions
1. Supervised Classification
2. Rating confidence in interactions
   • Statistical methods
   • Graph-theoretic methods
3. Predicting interactions
   • literature mining
   • interologs
   • integrating functional & genomic data: phylogenetic profiles, gene fusion, coexpression, GO, localization
   • protein-protein docking
Classification
• Rating confidence in interactions and predicting novel interactions both pose a classification problem
• Classification:
• multiple inputs, x - ‘feature vectors’
• single discrete output y
• predict output from future inputs
• Supervised learning:
• Train classifier using known positive and negative examples
• Apply standard methods: naive Bayes, support vector machine, logistic regression
Naive Bayes Classifier
• Assumes that all features are independent, given class labels
• Calculate the probability (i.e., of two proteins interacting) based on each feature separately, then multiply these together to get the overall probability
$$p(x \mid y) = \prod_{i} p(x_i \mid y)$$
[Diagram: class label y connected to the feature nodes x1, x2, ..., xn]
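The factorization above can be turned into a small classifier. The sketch below is a minimal from-scratch version for binary features, with Laplace smoothing added as an assumption so that no probability is exactly zero. The three features (coexpressed, same localization, shared GO term) and the training rows are invented for illustration.

```python
import math


def train_naive_bayes(X, y, alpha=1.0):
    """Estimate p(y) and p(x_i = 1 | y) from binary feature vectors."""
    model = {}
    n_features = len(X[0])
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        prior = len(rows) / len(X)
        # Laplace-smoothed estimate of p(x_i = 1 | y = label)
        probs = [
            (sum(row[i] for row in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
        model[label] = (prior, probs)
    return model


def posterior(model, x):
    """p(y | x), proportional to p(y) * prod_i p(x_i | y); computed in log space."""
    log_scores = {}
    for label, (prior, probs) in model.items():
        score = math.log(prior)
        for xi, p in zip(x, probs):
            score += math.log(p if xi else 1.0 - p)
        log_scores[label] = score
    top = max(log_scores.values())
    total = sum(math.exp(s - top) for s in log_scores.values())
    return {label: math.exp(s - top) / total for label, s in log_scores.items()}


# Hypothetical training data: columns are (coexpressed, same localization, shared GO term).
X = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1),   # known interacting pairs
     (0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]   # known non-interacting pairs
y = [1, 1, 1, 1, 0, 0, 0, 0]

model = train_naive_bayes(X, y)
post = posterior(model, (1, 1, 1))
print(post[1])
```

Working in log space avoids underflow when many features are multiplied together.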
Support Vector Machine
Literature Mining
• Search for abstracts containing two protein names, and a set of interaction words
• The number of papers that mention two proteins together is strong evidence of an interaction
• Apply machine learning and natural language processing methods to identify likely interactions
• PreBIND (Donaldson et al., 2003) (data available at ftp.bind.org)
• Support Vector Machine to classify “interaction” abstracts
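The first two bullets describe a simple co-occurrence count, which can be sketched as follows. This is not PreBIND's method (which uses a trained Support Vector Machine); it only counts abstracts that mention both proteins together with at least one interaction word. The word list, the protein names and the abstracts are assumptions chosen for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical list of interaction words; a real list would be much longer.
INTERACTION_WORDS = {"binds", "interacts", "associates", "phosphorylates", "complex"}


def cooccurrence_counts(abstracts, protein_names):
    """Count abstracts in which two protein names appear with an interaction word."""
    counts = Counter()
    for text in abstracts:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.lower()))
        if not tokens & INTERACTION_WORDS:
            continue  # no interaction word, so skip this abstract
        present = sorted(name for name in protein_names if name.lower() in tokens)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


abstracts = [
    "Cdc28 binds Cln2 during G1.",
    "Cln2 interacts with Cdc28 in vivo.",
    "Cdc28 and Sic1 were both expressed.",
    "Sic1 associates with Cdc28.",
]
counts = cooccurrence_counts(abstracts, ["Cdc28", "Cln2", "Sic1"])
print(counts.most_common())
```

Exact token matching misses synonyms and spelling variants of protein names, which is one reason the slide points to machine learning and natural language processing methods.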
‘Interologs’
• Two proteins are more likely to interact if they both have homologs in another species that are known to interact
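[Diagram: proteins A and B and their homologs a' and b'. Each protein is linked to its homolog by homology; one pair is joined by an experimental interaction and the other pair by an inferred interaction.]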
[Excerpt: Matthews et al., Genome Research, page 2122 (www.genome.org)]
interacting proteins may have coevolved such that only discrete interacting domains were conserved.
The data described above suggest that the approach of sequence-based searches for candidate interologs can be used globally to identify potential networks of interactions. However, such networks only can be considered as biological hypotheses. Hence, we investigated methods to generate reagents that can be used to study potential interaction networks identified by interolog searches. The reverse two-hybrid system provides a genetic selection that allows the rapid identification of cis-acting mutations or trans-acting molecules that dissociate potential interactions (Vidal 1997). The two-hybrid SPAL10::URA3 inducible reporter gene (Vidal et al. 1996) confers sensitivity to 5-Fluoroorotic acid (5-FOA). The dissociation of the yeast two-hybrid interaction confers a selective advantage allowing screens for dissociating compounds or for mutations that prevent the normal association of a protein pair using positive selection. Such reagents can be used back in vivo to characterize the role of the corresponding protein-protein interactions (Endoh et al. 2001).
To test the degree to which the reverse two-hybrid system can be applied to our network of identified interologs, we determined the percentage of the interactions described above that could be counter-selected on media containing 5-FOA (Vidal 1997). Starting from the 35 true worm interologs described above, 77% (27/35) of C. elegans interactions were detected as 5-FOA sensitive (Fig. 3). Because the reverse two-hybrid system can be automated (Endoh et al. 2001), it is possible that relatively large numbers of yeast two-hybrid interactions that emerge from interolog searches could indeed be tested back in the relevant biological settings.
This work suggests that interaction maps from one species may be useful in predicting interactions in another species and may provide insight into the function of otherwise uncharacterized proteins. In addition, the identification of an interolog provides additional support for the validity of the initial interaction found in the “reference” species. This may be most meaningful if the only evidence for the original interaction comes, itself, from a high-throughput experiment. When the function of one of the proteins in the starting spe-
Figure 1 Experimentally verified interactions between Saccharomyces cerevisiae and Caenorhabditis elegans. (A–D) Yeast diploid cells expressing each of 35 C. elegans potential interologs. Pairs are arranged in the order described in Table 2. The five patches at the bottom are controls (negative control on the left side and controls of increasing interaction strength towards the right side). See Vidal (1997) for a detailed description of these controls. (B) β-Galactosidase assay to detect the expression of GAL1::lacZ. (C) Growth assay on SC-Leu-Trp-His, +20 mM 3AT plates to detect the expression of GAL1::HIS3. (D) Growth assay on SC-Leu-Trp-Ura plates to detect the expression of SPAL10::URA3. (E) Conservation of interactions. Each C. elegans protein pair tested was plotted according to two E-values. The first E-value corresponds to the conservation between the X (from yeast) and X′ (from C. elegans) proteins while the second E-value corresponds to the conservation between the Y (from yeast) and Y′ (from C. elegans) proteins. The smaller of the two E-values was plotted on the X-axis and the greater on the Y-axis. The C. elegans protein pairs that tested positive in the two-hybrid system are labeled in black.
Matthews, 2001
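The interolog idea on this slide can be sketched as a lookup: take known interactions in a reference species and transfer each one to every pair of homologs in the target species. The homolog table here is a plain dictionary with invented identifiers. In practice it would come from sequence searches with an E-value cutoff, and, as the excerpt stresses, the output is a set of hypotheses to test rather than confirmed interactions.

```python
def predict_interologs(known_interactions, homologs):
    """Transfer each reference-species interaction to all pairs of target-species homologs."""
    predicted = set()
    for a, b in known_interactions:
        for a_prime in homologs.get(a, ()):
            for b_prime in homologs.get(b, ()):
                predicted.add(frozenset((a_prime, b_prime)))
    return predicted


# Hypothetical yeast interactions and yeast-to-worm homolog assignments.
yeast_interactions = [("Y1", "Y2"), ("Y2", "Y3")]
homologs = {"Y1": ["W1"], "Y2": ["W2a", "W2b"]}  # Y3 has no detectable homolog

predicted = predict_interologs(yeast_interactions, homologs)
for pair in sorted(sorted(p) for p in predicted):
    print(pair)
```

Proteins with several homologs multiply the number of predictions, which is one reason experimental follow-up is needed.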
Domain Fusion
• Two proteins A and B with homologs in another organism that are fused into a single protein chain are likely to functionally or physically interact
[Excerpt: Science, Reports, Vol. 285, 30 July 1999, page 751 (www.sciencemag.org)]
Detecting Protein Function and Protein-Protein Interactions from Genome Sequences
Edward M. Marcotte, Matteo Pellegrini, Ho-Leung Ng, Danny W. Rice, Todd O. Yeates, David Eisenberg*
A computational method is proposed for inferring protein interactions from genome sequences on the basis of the observation that some pairs of interacting proteins have homologs in another organism fused into a single protein chain. Searching sequences from many genomes revealed 6809 such putative protein-protein interactions in Escherichia coli and 45,502 in yeast. Many members of these pairs were confirmed as functionally related; computational filtering further enriches for interactions. Some proteins have links to several other proteins; these coupled links appear to represent functional interactions such as complexes or pathways. Experimentally confirmed interacting pairs are documented in a Database of Interacting Proteins.
The lives of biological cells are controlled by interacting proteins in metabolic and signaling pathways and in complexes such as the molecular machines that synthesize and use adenosine triphosphate (ATP), replicate and translate genes, or build up the cytoskeletal infrastructure (1). Our knowledge of protein-protein interactions has been accumulated from biochemical and genetic experiments, including the widely used yeast two-hybrid test (2). Here we ask if protein-protein interactions can be recognized from genome sequences by purely computational means.
Some interacting proteins such as the Gyr A and Gyr B subunits of Escherichia coli DNA gyrase are fused into a single chain in another organism, in this case the topoisomerase II of yeast (3). Thus, the sequence similarities of Gyr A (804 amino acid residues) and Gyr B (875 residues) to different segments of the topoisomerase II (1429 residues) might be used to predict that Gyr A and Gyr B interact in E. coli.
To find other such putative protein interactions in E. coli, we searched the 4290 protein sequences of the E. coli genome (4) for these patterns of sequence homology (5). We found 6809 pairs of nonhomologous sequences, both members of the pair having significant similarity (6) to a single protein in some other genome that we term a Rosetta Stone sequence because it deciphers the interaction between the protein pairs. The 4290 proteins could form at most (4290)^2/2 ≈ 9 × 10^6 pair interactions, but we would expect many fewer interactions in a functioning cell; roughly 2 to 10 interactions for each protein does not seem unreasonably many.
Each of these 6809 pairs is a candidate for a pair of interacting proteins in E. coli. Five such candidates are shown in Fig. 1. The first three pairs of E. coli proteins were among those easily determined from the biochemical literature in fact to interact. The final two pairs of proteins are not known to interact. They are representatives of many such pairs whose putative interactions at this time must be taken as testable hypotheses.
We devised three independent tests of interactions predicted by the method we term domain fusion analysis, each showing that a reasonable fraction may in fact interact. The first method uses the annotation of proteins given in the SWISS-PROT database (7). For cases where the interacting proteins have both been annotated, we compare their annotations, looking for a similar function for both members of the pair. Similar function would imply at least a functional interaction. Of the 3950 E. coli pairs of known function, 2682 (68%) share at least one keyword in their SWISS-PROT annotations (ignoring the keyword “hypothetical protein”), suggesting related functional roles. When pairs of annotated E. coli proteins are selected at random, only 15% share a keyword. In short, of the E. coli pairs that the domain fusion analysis turns up as candidates for protein-protein interactions, more than half have both members with a similar function; the method therefore seems to be a robust predictor of protein function. Where the function of one member of a protein pair is known, the function of the other member can be predicted. Performing a similar analysis in yeast turns up 45,502 protein pairs. Of the 9857 pairs of known function, 32% share at least one keyword in their annotations compared with 14% when proteins are selected at random.
The second test of the interactions predicted by the domain fusion analysis uses as confirmation the Database of Interacting Proteins (8). This database is a compilation of protein pairs that have been found to interact in some published experiment. As of December 1998, the database contained 939 entries, 724 of which have both members of the pair listed in the ProDom database. Of these 724 pairs, we found 46 or 6.4% linked by Rosetta Stone sequences. We expect this percentage to rise as more genomes are sequenced.
The third test of domain fusion predictions is by another computational method for predicting interactions (9), the method of phylogenetic profiles, which detects functional interactions by analyzing correlated evolution of proteins. This method was applied to the 6809 interactions predicted by the domain fusion analysis for E. coli proteins. Some 321 of these predictions (~5%) were suggested by the phylogenetic profile method to interact, more than eight times as many interactions in common as for randomly cho-
UCLA–Department of Energy Laboratory of Structural Biology and Molecular Medicine, Departments of Chemistry and Biochemistry and Biological Chemistry, Box 951570, University of California at Los Angeles, Los Angeles, CA 90095–1570, USA.
*To whom correspondence should be addressed: E-mail: [email protected]
Fig. 1. Five examples of pairs of E. coli proteins predicted to interact by the domain fusion analysis. Each protein is shown schematically with boxes representing domains [as defined in the ProDom domain database (17)]. For each example, a triplet of proteins is pictured: The second and third proteins are predicted to interact because their homologs are fused in the first protein (called the Rosetta Stone protein in the text). The first three predictions are known to interact from experiments (18). The final two examples show pairs of proteins from the same pathway (two nonsequential enzymes from the histidine biosynthesis pathway and the first two steps of the proline biosynthesis pathway) that are not known to interact directly.
Marcotte et al, Science 285 751-753 1999
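The domain fusion search described on this slide can be sketched as follows: two query proteins become a candidate pair when both have significant similarity to the same third protein in non-overlapping regions of it. This is a simplified illustration, not the published pipeline. The input is assumed to be a list of already-filtered sequence hits, non-overlap on the target stands in for the requirement that the pair be nonhomologous, and the alignment coordinates are invented for the example (only the protein names and the 1429-residue length come from the text).

```python
def rosetta_stone_pairs(hits, max_overlap=0):
    """Find query pairs that align to non-overlapping segments of the same target protein.

    hits: iterable of (query, target, target_start, target_end), 1-based inclusive.
    """
    by_target = {}
    for query, target, start, end in hits:
        by_target.setdefault(target, []).append((query, start, end))

    pairs = set()
    for target, segments in by_target.items():
        for i in range(len(segments)):
            for j in range(i + 1, len(segments)):
                (q1, s1, e1), (q2, s2, e2) = segments[i], segments[j]
                if q1 == q2:
                    continue
                # number of target residues covered by both alignments
                overlap = min(e1, e2) - max(s1, s2) + 1
                if overlap <= max_overlap:
                    pairs.add((frozenset((q1, q2)), target))
    return pairs


# Illustrative coordinates only; they are not real alignment results.
hits = [
    ("GyrB", "TopoII", 1, 650),
    ("GyrA", "TopoII", 700, 1429),
    ("X", "TopoII", 600, 800),  # overlaps both segments, so it forms no pair
]
pairs = rosetta_stone_pairs(hits)
print(pairs)
```

Requiring distinct target segments keeps two homologous proteins that hit the same region from being reported as a fusion pair.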
Phylogenetic Profiling
• Proteins that interact tend to be evolutionarily correlated - they are either both conserved or both lost.
[Excerpt: Pellegrini et al., Proc. Natl. Acad. Sci. USA 96 (1999), page 4286]
the clustering of phylogenetic profiles is that these as yet uncharacterized proteins have functions associated with the ribosome.
The comparisons of the phylogenetic profiles of flagellar proteins (Fig. 2B) further support the idea that proteins with similar profiles are likely to be functionally linked; 10 flagellar proteins share a common profile. Their homologs are found in a subset of five bacterial genomes: those of Aquifex aeolicus, Borrelia burgdorferi, B. subtilis, Helicobacter pylori, and Mycobacterium tuberculosis. Other proteins that appear in neighboring clusters (groups of proteins that share a common profile) include various flagellar proteins and cell-wall maintenance proteins. Flagellar and cell-wall maintenance proteins may be biochemically linked, because flagella are inserted through the cell wall. For example, the lytic murein transglycosylase MltD has a phylogenetic profile that differs by only one bit from that of the flagellar structural protein FlgL. This transglycosylase cuts the cell wall for unknown reasons. Therefore, another prediction is that this enzyme may participate in flagellar assembly.
Fig. 2 A and B includes proteins in structural complexes, whereas Fig. 2C shows proteins involved in amino acid metabolism. We find that more than half of the proteins with phylogenetic profiles similar (within one bit) to that of the histidine synthesis protein His5 are involved in amino acid metabolism. With the 16 currently available fully sequenced genomes, however, phylogenetic profiles are not able to separate the metabolic pathways of specific amino acids. Instead, because of the limitations of currently available data, a histidine biosynthesis protein seems to have the same profile as a tryptophan, arginine, and cysteine synthesis protein. It is probable that, as more genomes are fully sequenced and the number of entries in phylogenetic profiles is increased, similar but distinct amino acid metabolic pathways will cluster separately in phylogenetic-profile spaces.
The examples included in Fig. 2 show that proteins with phylogenetic profiles similar to a query protein are likely to be functionally linked with it. We next show the converse: that groups of proteins known to be functionally linked often have similar phylogenetic profiles. As shown in Table 1, we chose groups of E. coli proteins that share a common keyword in their SwissProt (7) annotation, reflecting well known families of functionally linked proteins. Because homologous proteins coded by the same genome necessarily have similar profiles, they were eliminated from the groups. For each group, we computed the number of protein pairs that are neighbors;
FIG. 1. Our method of analyzing protein phylogenetic profiles is illustrated schematically for the hypothetical case of four fully sequenced genomes (from E. coli, Saccharomyces cerevisiae, Haemophilus influenzae, and Bacillus subtilis) in which we focus on seven proteins (P1–P7). For each E. coli protein, we construct a profile, indicating which genomes code for homologs of the protein. We next cluster the profiles to determine which proteins share the same profiles. Proteins with identical (or similar) profiles are boxed to indicate that they are likely to be functionally linked. Boxes connected by lines have phylogenetic profiles that differ by one bit and are termed neighbors.
Table 1. Phylogenetic profiles link protein with similar keywords
Keyword | No. proteins | No. neighbors in keyword group | No. neighbors in random group
Ribosome | 60 | 197 | 27
Transcription | 36 | 17 | 10
tRNA synthase and ligase | 26 | 11 | 5
Membrane proteins* | 25 | 89 | 5
Flagellar | 21 | 89 | 3
Iron, ferric, and ferritin | 19 | 31 | 2
Galactose metabolism | 18 | 31 | 2
Molybdoterin and Molybdenum, and molybdoterin | 12 | 6 | 1
Hypothetical† | 1,084 | 108,226 | 8,440
Proteins grouped on the basis of similar keywords in SwissProt have more similar phylogenetic profiles than random proteins. Column 2 gives the number of nonhomologous proteins in the keyword group. Column 3 gives the number of protein pairs in the keyword group with profiles that differ by less than 3 bits. These pairs are called neighbors. Column 4 lists the number of neighbors found on average for a random group of proteins of the same size as the keyword group.
*Only membrane proteins without uniformly zero phylogenetic profiles were included.
†Unlike the other rows of the table, the hypothetical proteins do contain homologous pairs.
Table 2. Phylogenetic profiles link proteins in EcoCyc classes
EcoCyc class | No. proteins | No. neighbors in EcoCyc class | No. neighbors in random group
Carbon compounds | 88 | 798 | 60
Anaerobic respiration | 66 | 275 | 30
Aerobic respiration | 28 | 39 | 6
Electron transport | 26 | 91 | 5
Purine biosynthesis | 21 | 11 | 3
Salvage nucleosides | 15 | 10 | 1
Fermentation | 19 | 17 | 3
Tricarboxylic acid cycle | 16 | 6 | 1
Glycolysis | 14 | 5 | 1
Peptidoglycan biosynthesis | 12 | 10 | 1
Proteins grouped according to metabolic function on the basis of EcoCyc classes have more similar phylogenetic profiles than random proteins. Column 2 gives the number of proteins in the EcoCyc class. Column 3 gives the number of protein pairs in the EcoCyc class with profiles that differ by less than 3 bits. These pairs are called neighbors. Column 4 lists the number of neighbors found on average for a random group of proteins of the same size as the keyword group.
4286 Biochemistry: Pellegrini et al. Proc. Natl. Acad. Sci. USA 96 (1999)
Pellegrini et al. PNAS 96(4285-4288) 1999
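A phylogenetic profile is a bit vector recording in which genomes a protein has a homolog, and two proteins are neighbors when their profiles differ in few bits. The sketch below counts differing bits (Hamming distance) and lists neighbor pairs. The protein names and profiles are invented for a hypothetical set of four genomes, and the default threshold of one bit follows the Fig. 1 caption (the tables in the excerpt use "less than 3 bits", i.e. `max_bits=2`).

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of genomes in which exactly one of the two proteins has a homolog."""
    return sum(a != b for a, b in zip(profile_a, profile_b))


def neighbor_pairs(profiles, max_bits=1):
    """All protein pairs whose profiles differ by at most max_bits bits."""
    return [
        (p, q)
        for p, q in combinations(sorted(profiles), 2)
        if hamming(profiles[p], profiles[q]) <= max_bits
    ]


# Hypothetical profiles over four genomes (1 = homolog present, 0 = absent).
profiles = {
    "P1": (1, 0, 1, 1),
    "P2": (1, 0, 1, 1),
    "P3": (1, 1, 1, 1),
    "P4": (0, 1, 0, 0),
}
print(neighbor_pairs(profiles))
```

With only four genomes many unrelated proteins share a profile by chance, which matches the excerpt's remark that resolution improves as more genomes are sequenced.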
Integrating Genomic Data to Predict Interactions
Jansen et al. “A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data” Science 302 449-452 2003
[Excerpt: Science, Reports, Vol. 302, 17 October 2003, page 450 (www.sciencemag.org)]
in our filtered version (19)]. A negatives gold-standard is harder to define, but essential for successful training. Thus, we synthesized negatives from lists of proteins in separate subcellular compartments (9). Our positive and negative gold-standards satisfy the first two criteria and provide a good practical solution for the third. Hence, our goal, precisely defined, was to predict whether two proteins are in the same complex, not whether they necessarily had direct physical contact.
As a measure of reliability, the overlap of information sources (i.e., “interaction data sets,” which could either be noisy experimental data or sets of genomic features) with the gold-standards can be expressed in terms of a “likelihood ratio.” For example, consider a genomic feature f expressed in binary terms (i.e., “present” or “absent”). The likelihood ratio L(f) is then defined as the fraction of gold-standard positives having feature f divided by the fraction of negatives having f. For two features f1 and f2 with uncorrelated evidence, the likelihood ratio of the combined evidence is simply the product L(f1, f2) = L(f1)L(f2). For correlated evidence, L(f1, f2) cannot be factorized in this way. Bayesian networks are a formal representation of such relationships between features. The combined likelihood ratio is proportional to the estimated odds that two proteins are in the same complex, given multiple sources of information.
We predict a protein pair as positive if its combined likelihood ratio exceeds a particular cutoff (L > Lcut) (negative otherwise). To get an overall assessment of how the prediction performs, we segmented the gold-standard into separate training and testing sets (using a sevenfold cross-validation protocol). Then we evaluated the number of true- (TP) and false-positive (FP) predictions in the testing set. Finally, we applied the Bayesian network beyond the testing set, computing likelihood ratios for all possible protein pairs in the genome.
Figure 1 schematically shows the information sources and results of our calculations. We term the results “probabilistic interactomes” (PIs), in which each protein pair is associated with a probability measure for being in the same complex (i.e., likelihood ratio L). Our procedure not only allows combining existing experimental interaction data sets (resulting in a PI-experimental or “PIE”), but also the de novo prediction of protein complexes from genomic data sets (when the input data are not interaction data sets per se, resulting in a PI-predicted or “PIP”).
We combined four interaction data sets from high-throughput experiments into the PIE (1–4) (Fig. 1B). The PIE represents a transformation of the individual binary-valued interaction sets into a data set where every protein pair is weighted according to the likelihood that it exists within a complex.
We computed the PIP from several genomic data sources: the correlation of mRNA amounts in two expression data sets (one with temporal profiles during the cell cycle, one of expression levels under 300 cellular conditions), two sets of information on biological function, and information about whether proteins are essential for survival (6, 20–22). Although none of these information sources are interaction data per se, they contain information weakly associated with interaction: Two subunits of the same protein complex often have coregulated mRNA expression and similar biological functions and are more likely to be both essential or nonessential (8).
For computing the PIE and the PIP, we used two different types of Bayesian networks: a “naïve” network for the PIP and a fully connected one for the PIE (19). The naïve network is simpler to compute but requires information sources with essentially uncorrelated evidence. In contrast, the fully connected Bayesian network accommodates correlated evidence, which is the case for the four experimental interaction data sets.
Finally, we combined the PIP, PIE, and gold-standard into a total PI (PIT), which represents our most comprehensive view of the known and putative protein complexes in yeast (23). Because the PIP and PIE data provide essentially uncorrelated evidence for protein-protein interactions, we chose a naïve network to construct the PIT.
Figure 1C gives an overview of how we compared the PIP, PIE, gold-standard, and our new experiments. In particular, Fig. 2 shows the performance of the integration resulting in the PIP and PIE. When tested against the gold-standard, we observed that the ratio of true to false positives (TP/FP) increases monotonically with Lcut, confirming L as an appropriate measure of the odds of a real interaction. Conservatively estimated, protein pairs with L > 600 have a better than 50% chance of being in the same complex, suggesting Lcut = 600 as a useful threshold (19). Unless otherwise noted, we use this throughout our analysis. It gives 9897 predicted interactions from the PIP and
Fig. 1. The information sources integrated in our analysis and their comparison with each other. (A) The three different types of data used: (i) Interaction data from high-throughput experiments. These comprise large-scale two-hybrid screens (Y2H) (1, 2) and in vivo pull-down experiments (3, 4). (ii) Other genomic features. We considered expression data, biological function of proteins (from Gene Ontology biological process and the MIPS functional catalog), and data about whether proteins are essential (6, 19–22). (iii) Gold-standards of known interactions and noninteracting protein pairs. (The MIPS functional catalog differs from the MIPS complexes catalog used for the gold-standard.) (B) Combination of data sets into probabilistic interactomes. (C) Comparison of the probabilistic interactomes with the gold-standards and our new experimental data. Numbers next to the arrows indicate which figures refer to these various comparisons.
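The likelihood ratio defined in the excerpt, and its product rule for uncorrelated features, can be worked through on a toy example. This is a sketch of the definition only, not of the Bayesian networks used in the paper. The gold-standard pairs, the feature assignments and the cutoff of 10 are all invented (the paper's cutoff is Lcut = 600).

```python
def likelihood_ratio(pairs_with_feature, positives, negatives):
    """L(f) = fraction of gold-standard positives with f / fraction of negatives with f."""
    pos_fraction = len(pairs_with_feature & positives) / len(positives)
    neg_fraction = len(pairs_with_feature & negatives) / len(negatives)
    return pos_fraction / neg_fraction  # undefined if no negative has the feature


def combined_likelihood_ratio(ratios):
    """Product rule, valid only when the features give uncorrelated evidence."""
    product = 1.0
    for ratio in ratios:
        product *= ratio
    return product


# Hypothetical gold-standards: 4 positive pairs and 8 negative pairs.
positives = {"p1", "p2", "p3", "p4"}
negatives = {"n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"}

# Hypothetical binary features, given as the set of pairs that have them.
coexpressed = {"p1", "p2", "p3", "n1"}
same_function = {"p1", "p2", "n1", "n2"}

L_coexp = likelihood_ratio(coexpressed, positives, negatives)    # 0.75 / 0.125
L_func = likelihood_ratio(same_function, positives, negatives)   # 0.5 / 0.25
L_both = combined_likelihood_ratio([L_coexp, L_func])

L_cut = 10.0  # toy cutoff chosen for this example only
print(L_coexp, L_func, L_both, L_both > L_cut)
```

Here neither feature alone exceeds the toy cutoff but their product does, which mirrors the excerpt's point that weak evidence sources become useful when combined.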
in our filtered version (19)]. A negatives gold-standard is harder to define, but essential forsuccessful training. Thus, we synthesized neg-atives from lists of proteins in separate subcel-lular compartments (9). Our positive and nega-tive gold-standards satisfy the first two criteriaand provide a good practical solution for thethird. Hence, our goal, precisely defined, was topredict whether two proteins are in the samecomplex, not whether they necessarily had di-rect physical contact.
As a measure of reliability, the overlap ofinformation sources (i.e., “interaction datasets,” which could either be noisy experimentaldata or sets of genomic features) with the gold-standards can be expressed in terms of a “like-lihood ratio.” For example, consider a genomicfeature f expressed in binary terms (i.e.,“present” or “absent”). The likelihood ratioL( f ) is then defined as the fraction of gold-standard positives having feature f divided bythe fraction of negatives having f. For twofeatures f1 and f2 with uncorrelated evidence,the likelihood ratio of the combined evidence issimply the product L( f1, f2) ! L( f1)L( f2). Forcorrelated evidence, L( f1, f2) cannot be factor-ized in this way. Bayesian networks are a for-mal representation of such relationships be-tween features. The combined likelihood ratiois proportional to the estimated odds that twoproteins are in the same complex, given multi-ple sources of information.
We predict a protein pair as positive if itscombined likelihood ratio exceeds a particularcutoff (L " Lcut) (negative otherwise). To getan overall assessment of how the predictionperforms, we segmented the gold-standard into
separate training and testing sets (using a sev-enfold cross-validation protocol). Then weevaluated the number of true- (TP) and false-positive (FP) predictions in the testing set. Fi-nally, we applied the Bayesian network beyondthe testing set, computing likelihood ratios forall possible protein pairs in the genome.
Figure 1 schematically shows the infor-mation sources and results of our calcula-tions. We term the results “probabilistic in-teractomes” (PIs), in which each protein pairis associated with a probability measure forbeing in the same complex (i.e., likelihoodratio L). Our procedure not only allows com-bining existing experimental interaction datasets (resulting in a PI-experimental or “PIE”),but also the de novo prediction of proteincomplexes from genomic data sets (when theinput data are not interaction data sets per se,resulting in a PI-predicted or “PIP”).
We combined four interaction data setsfrom high-throughput experiments into thePIE (1–4) (Fig. 1B). The PIE represents atransformation of the individual binary-valued interaction sets into a data set whereevery protein pair is weighted according tothe likelihood that it exists within a complex.
We computed the PIP from several genomicdata sources: the correlation of mRNA amountsin two expression data sets (one with temporalprofiles during the cell cycle, one of expressionlevels under 300 cellular conditions), two sets ofinformation on biological function, and informa-tion about whether proteins are essential for sur-vival (6, 20–22). Although none of these infor-mation sources are interaction data per se, theycontain information weakly associated with in-
teraction: Two subunits of the same protein com-plex often have coregulated mRNA expressionand similar biological functions and are morelikely to be both essential or nonessential (8).
For computing the PIE and the PIP, we usedtwo different types of Bayesian networks: a“naıve” network for the PIP and a fully con-nected one for the PIE (19). The naıve networkis simpler to compute but requires informationsources with essentially uncorrelated evidence.In contrast, the fully connected Bayesian net-work accommodates correlated evidence,which is the case for the four experimentalinteraction data sets.
Finally, we combined the PIP, PIE, andgold-standard into a total PI (PIT), whichrepresents our most comprehensive view ofthe known and putative protein complexes inyeast (23). Because the PIP and PIE dataprovide essentially uncorrelated evidence forprotein-protein interactions, we chose a naıvenetwork to construct the PIT.
Figure 1C gives an overview of how wecompared the PIP, PIE, gold-standard, and ournew experiments. In particular, Fig. 2 shows theperformance of the integration resulting in thePIP and PIE. When tested against the gold-standard, we observed that the ratio of true tofalse positives (TP/FP) increases monotonicallywith Lcut, confirming L as an appropriate mea-sure of the odds of a real interaction. Conser-vatively estimated, protein pairs with L " 600have a better than 50% chance of being in thesame complex, suggesting Lcut ! 600 as auseful threshold (19). Unless otherwise noted,we use this throughout our analysis. It gives9897 predicted interactions from the PIP and
Fig. 1. The information sources integrated in our analysis and theircomparison with each other. (A) The three different types of data used:(i) Interaction data from high-throughput experiments. These compriselarge-scale two-hybrid screens (Y2H) (1, 2) and in vivo pull-down exper-iments (3, 4). (ii) Other genomic features. We considered expressiondata, biological function of proteins (from Gene Ontology biological process and the MIPS functionalcatalog), and data about whether proteins are essential (6, 19–22). (iii) Gold-standards of known interac-tions and noninteracting protein pairs. (The MIPS functional catalog differs from the MIPS complexescatalog used for the gold-standard.) (B) Combination of data sets into probabilistic interactomes. (C)Comparison of the probabilistic interactomes with the gold-standards and our new experimental data.Numbers next to the arrows indicate which figures refer to these various comparisons.
88
CPI PPI Tutorial 2004 - AMH
Prediction Performance
163 from the PIE. In contrast, likelihood ratios derived from single genomic features (e.g., mRNA coexpression) or from individual interaction experiments (e.g., the Ho data set) did not exceed the cutoff when used alone, with TP/FP values far below 1. This demonstrates that information sources that, taken alone, are only weak predictors of interactions can yield reliable predictions when combined.
The PIP had a higher sensitivity than the PIE for comparable TP/FP ratios (Fig. 2C). (“Sensitivity” measures coverage and is defined as TP/P, where P is the number of gold-standard positives.) Specifically, the sensitivity of the PIP is ~27% at our cutoff. This may seem low, but compares favorably with the PIE, which had a sensitivity of less than 1%. This means that we can predict, at comparable error levels, more complex interactions de novo than are present in the high-throughput experimental interaction data sets.
One might ask whether simpler voting procedures can match the performance of more complicated machine-learning methods such as Bayesian networks. To test this hypothesis, we compared the PIP with a voting procedure where each of the four genomic features contributes an additive vote toward positive classification. We found that the Bayesian network achieved greater sensitivity for comparable TP/FP ratios (Fig. 2C) (19).
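For comparison, an additive voting procedure of the kind described here might look like the sketch below. It is an assumption-laden illustration: the excerpt does not say how each feature's vote is derived, so the boolean flags, the function names, and the threshold sweep from 1 to 4 votes are hypothetical.

```python
def vote_count(feature_flags):
    """Count how many features cast a positive vote for a protein pair."""
    return sum(1 for flag in feature_flags if flag)

def classify_by_votes(feature_flags, min_votes):
    """Call the pair an interaction if at least min_votes features agree."""
    return vote_count(feature_flags) >= min_votes

# Hypothetical pair supported by three of the four genomic features.
pair_flags = (True, False, True, True)

# Sweeping the vote threshold gives one operating point per threshold.
calls = {k: classify_by_votes(pair_flags, k) for k in range(1, 5)}
```

Unlike a likelihood ratio, a vote count treats every feature as equally informative, which is one plausible reason a weighted Bayesian combination can reach higher sensitivity at the same TP/FP ratio.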
Figure 3 shows parts of the PIP and PIE graphs and how these compare with the gold-standard and our new experiments. First, to test whether the thresholded PIP was biased toward certain complexes, we looked at the distribution of predictions among gold-standard positives (Fig. 3A); they were roughly equally apportioned among the different complexes, suggesting a lack of bias.
We have thus far treated all interactions as independent. However, the joint distribution of interactions in the PIs can help identify large complexes: An ideal complex should be a “clique” in an interaction graph (i.e., a subgraph with N(N − 1)/2 links between N proteins). Although this rarely happens in practice, because of incorrect or missing links, large complexes tend to have many interconnections within them, whereas false-positive links to outside proteins tend to occur randomly, without a coherent pattern (Fig. 4).
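The clique criterion can be made concrete with a small Python sketch that measures how close a candidate complex comes to the ideal N(N − 1)/2 links. The helper names and the toy protein labels are assumptions made for illustration only.

```python
from itertools import combinations

def max_links(n):
    """Number of links in a clique of n proteins: n(n - 1)/2."""
    return n * (n - 1) // 2

def link_density(proteins, links):
    """Fraction of possible within-group links that are actually present."""
    members = sorted(set(proteins))
    link_set = {frozenset(link) for link in links}
    present = sum(
        1 for a, b in combinations(members, 2) if frozenset((a, b)) in link_set
    )
    possible = max_links(len(members))
    return present / possible if possible else 0.0

# Toy graph: four proteins with one missing internal link (C-D),
# plus one stray link to an outside protein X.
toy_links = [("A", "B"), ("A", "C"), ("A", "D"),
             ("B", "C"), ("B", "D"), ("A", "X")]

density = link_density(["A", "B", "C", "D"], toy_links)
is_clique = density == 1.0
```

A density close to 1 indicates a near-clique; the stray link to X does not count toward the group because X is not among its members.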
Figure 3B shows parts of the thresholded PIP that are restricted to proteins with ≥20 links (23), highlighting large complexes. Some predicted complexes overlap with the gold-standard positives (cytoplasmic ribosome) or the PIE (exosome, RNA polymerase I, 26S proteasome). Comparison with the gold-standard negatives showed where the PIP likely produced false complexes. Many protein associations only appear in the PIP and thus potentially represent new interactions and complexes. An interesting example is the mitochondrial ribosome; it has appreciable overlap with both gold-standard positives and the PIE and contains plausible, newly predicted interactions with three proteins (19).
To further test the predictions in the PIP, we conducted TAP-tagging experiments, in which a protein expressed at its normal intracellular concentration (“bait”) is tagged and used to “pull down” endogenous protein complexes. We picked 98 proteins as TAP-tagging baits. These produced 424 experimental interactions overlapping with the PIP thresholded at Lcut = 300. (Of these, 185, in turn, overlapped with gold-standard positives, and 16 with negatives, highlighting the reliability of our experiments.)
Figure 3C shows three examples of the overlap between the PIP and TAP-tagging. We predicted that the putative DEAD-box RNA helicase Dbp3 interacts with three other RNA helicases (Hca4, Mak5, and Dbp7), with proteins implicated in ribosomal RNA (rRNA) metabolism (e.g., Nop2, Rrp5, Mak5, and components of RNA polymerase I), and with Nsr1, the yeast homolog of mammalian Nucleolin and a GAR domain–containing protein (24). When Dbp3 was TAP-tagged and purified, we found previously unknown interactions with Nsr1, Hca4, and Nop1, connecting Dbp3 with known rRNA-processing proteins. Further purifications with TAP-tagged versions of Mak5, Rrp5, Dbp7, Dbp3, Nsr1, Hca4, and Nop2 verified the physical association.
The nucleosome, a fundamental unit within chromatin, provides a second example of overlap. It is composed of eight histones (two H2A, two H2B, two H3, and two H4), which can block RNA polymerase II progression. This blockage is relieved upon interaction with the FACT complex (also known as SPN or yFACT), which consists of Spt16 and Pob3 in yeast. Mammalian Pob3 has a high mobility group (HMG) domain for interaction with histones; however, yeast Pob3 lacks this domain. Instead, the HMG protein Nhp6 (with two virtually identical isoforms, Nhp6A and Nhp6B) binds histones (25–27).
Fig. 2. Comparison of PIP and PIE with each other and with the individual information sources. (A) The TP/FP ratio as a function of Lcut for the PIP and the individual data from which it was computed. The ratio is computed as follows:
$$\frac{TP(L_{cut})}{FP(L_{cut})} = \frac{\sum_{L > L_{cut}} pos(L)}{\sum_{L > L_{cut}} neg(L)}$$
where pos(L) and neg(L) are the number of positives and negatives in the gold-standard with a given likelihood ratio L. The vertical line indicates our standard threshold Lcut = 600. (B) The same plot as in (A), but for the PIE. (C) Comparison of TP/FP ratios between the PIP and PIE. The abscissa represents the sensitivity of the probabilistic interactomes. The gray area indicates the gain of sensitivity of the PIP over the PIE for equal TP/FP ratios. The arrow shows the difference in sensitivity at TP/FP = 0.3. At this level, the PIP contains 183,295 protein pairs, of which 6179 are gold-standard positives (75% sensitivity), whereas the PIE contains 31,511 protein pairs and 1758 gold-standard positives among these (21% sensitivity). This difference in sensitivity between PIE and PIP illustrates the value of the de novo prediction. It also reflects, to some degree, that the experiments were done only on subsets of the genome and may have been measuring different types of interactions than the complexes’ gold-standard, which we used to parameterize the PIP. The white circles show the performance of a voting procedure in which each of the four genomic features (from which we computed the PIP) contributed an additive vote. There are four possible outcomes in the additive voting procedure, depending on how many data sets contribute a positive vote (19).
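The two quantities plotted in this figure, the TP/FP ratio at a cutoff and the sensitivity TP/P, can be sketched as follows. The function names and the toy likelihood ratios are invented, and the strict comparison `lr > l_cut` is an assumption about how ties at the cutoff are handled. The last two lines are a rough consistency check using only the rounded figures quoted in the caption.

```python
def tp_fp_ratio(pos_lrs, neg_lrs, l_cut):
    """TP/FP among gold-standard pairs whose likelihood ratio exceeds l_cut."""
    tp = sum(1 for lr in pos_lrs if lr > l_cut)
    fp = sum(1 for lr in neg_lrs if lr > l_cut)
    return tp / fp if fp else float("inf")

def sensitivity(pos_lrs, l_cut):
    """Sensitivity = TP / P, where P is the number of gold-standard positives."""
    tp = sum(1 for lr in pos_lrs if lr > l_cut)
    return tp / len(pos_lrs)

# Invented likelihood ratios for gold-standard positives and negatives.
toy_pos = [1000.0, 700.0, 650.0, 50.0, 2.0]
toy_neg = [900.0, 40.0, 10.0, 1.0, 0.5, 0.1]

ratio_at_600 = tp_fp_ratio(toy_pos, toy_neg, 600)  # 3 positives vs 1 negative
sens_at_600 = sensitivity(toy_pos, 600)            # 3 of 5 positives recovered

# Caption figures: 6179 positives at 75% sensitivity imply roughly 8239
# gold-standard positives, so 1758 positives correspond to about 21%.
implied_p = 6179 / 0.75
pie_sensitivity = 1758 / implied_p
```

Raising the cutoff usually trades sensitivity for a higher TP/FP ratio, which is the trade-off shown in panel C.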
Jansen et al., “A Bayesian Networks Approach for Predicting Protein-Protein Interactions from Genomic Data,” Science 302:449-452 (2003)
• Experimental validation
• Relies on
89
CPI PPI Tutorial 2004 - AMH
Protein-Protein Docking
• Predict interactions and/or binding orientations of complexes based on molecular structure of constituent proteins
• CAPRI: Critical Assessment of Prediction of Interactions
• Challenges
• Conformational flexibility
• Transient complexes
Table 1. Targets and predictions in the CAPRI experiment.

Target | Groups* | Models* | High† | Medium† | Acceptable† | Remarks | Refs
Round 1 (July–September 2001)
T01 HPr–HPr kinase | 16 | 69 | 0 | 0 | 8 | Helix movement in kinase | [35•]
T02 Rotavirus VP6–Fab | 15 | 70 | 0 | 1 | 6 | Published electron micrograph of Fab bound to the virus | [33]
T03 Flu virus hemagglutinin–Fab | 13 | 62 | 0 | 2 | 0 | | [32•]
Round 2 (January–March 2002)
T04 α-amylase–VHH domain AM-D10 | 13 | 65 | 0 | 0 | 0 | Camelid single-chain antibody | [31•]
T05 α-amylase–VHH domain AM-B07 | 13 | 64 | 0 | 0 | 0 | Camelid single-chain antibody | [31•]
T06 α-amylase–VHH domain AM-D09 | 13 | 65 | 4 | 4 | 0 | Camelid single-chain antibody | [31•]
T07 Streptococcal superantigen–TCRβ | 14 | 70 | 5 | 7 | 8 | Homolog complex in PDB | [34•]

*Number of predictions: number of groups submitting models and number of models submitted for each target.
†Quality of models: two criteria were used to judge a docking model: Irms, the root mean square distance between the Cα of interface residues in the X-ray structure and the model; and fNC, the fraction of native contacts, defined as the number of correctly predicted pairs of contact residues divided by the number of pairs present in the X-ray structure. High-quality models: Irms <1 Å, fNC >0.5; medium-quality models: Irms <2 Å, fNC >0.3; acceptable models: Irms <4 Å, fNC >0.1 [28••].
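The quality categories in the table footnote can be expressed as a short Python sketch. The thresholds are the ones quoted above; the function names, the contact representation as residue-pair tuples, and the toy values are assumptions for illustration, and this is not the official CAPRI assessment code.

```python
def fraction_native_contacts(predicted_contacts, native_contacts):
    """fNC: correctly predicted contact pairs divided by the number of
    contact pairs present in the X-ray structure."""
    native = set(native_contacts)
    if not native:
        raise ValueError("native structure has no contact pairs")
    return len(set(predicted_contacts) & native) / len(native)

def classify_model(irms, fnc):
    """Assign a quality label from Irms (in angstroms) and fNC,
    using the thresholds quoted in the table footnote."""
    if irms < 1.0 and fnc > 0.5:
        return "high"
    if irms < 2.0 and fnc > 0.3:
        return "medium"
    if irms < 4.0 and fnc > 0.1:
        return "acceptable"
    return "incorrect"

# Toy contacts as (residue in protein 1, residue in protein 2) pairs.
native = {(1, 10), (2, 11), (3, 12), (4, 13)}
predicted = {(1, 10), (2, 11), (9, 9)}

fnc = fraction_native_contacts(predicted, native)  # 2 of 4 native contacts
label = classify_model(1.5, fnc)                   # hypothetical Irms of 1.5
```

Checking the categories from strictest to loosest means each model receives the best label it qualifies for.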
Figure 2
A CAPRI target and its prediction. The ribbon drawing shows the X-ray structure [31•] of T06, a complex between pig α-amylase (green) and the VHH domain of a camelid antibody (purple), which binds at the enzyme active site. Spheres mark the geometric centers of the VHH domain in high-quality (green), medium-quality (blue) and incorrect (yellow) models of the complex derived by docking the VHH domain on the α-amylase. Not only the green and blue spheres, but also about one-third of the yellow spheres cluster in the active site region. Figure courtesy of R Leplae and SJ Wodak (Brussels).
Current Opinion in Structural Biology 2003, 13:383–388 www.current-opinion.com
Janin and Seraphin, 2003
90
CPI PPI Tutorial 2004 - AMH
Summary
• Integration of prediction methods with experiments needed for more efficient experimentation and better interpretation of experimental data
• Standard supervised classification algorithms can be applied, if suitable positive and negative training data can be assembled
• Prediction methods already showing significant success in combination with experimental validation
91