detecting topological patterns in protein...
TRANSCRIPT
Detecting topological patterns in complex networks
Sergei MaslovBrookhaven National Laboratory
Networks in complex systemsComplex systems
Large number of componentsThey interact with each otherAll components and/or interactions are differentfrom each other (unlike in traditional physics)104 types of proteins in an organism,106 routers, 109 web pages, 1011 neurons
The simplest property of a complex system: who interacts with whom? can be visualized as a networkBackbone for real dynamical processes
Why study the topology of complex networks?
Lots of easily available data: that’s where the state of the art information is (at least in biology)Large networks may contain information about basic design principles and/or evolutionary history of the complex systemThis is similar to paleontology: learning about an animal from its backbone
Plan of my lecturesPart 1: Bio-molecular (protein) networks
What is a protein network?Degree distributions in real and random networksBeyond degree distributions: How it all is wired together? Correlations in degreesHow do protein networks evolve?
Part 2: Interplay between an auxiliary diffusion process (random walk) and modules and communities on the Internet and WWW
Collaborators:Biology:
Kim Sneppen – U. of CopenhagenKasper Eriksen – U. of LundKoon-Kiu Yan – Stony Brook U.Ilya Mazo – Ariadne GenomicsIaroslav Ispolatov – Ariadne Genomics
Internet and WWW:Kasper Eriksen – U. of LundKim Sneppen – U. of CopenhagenIngve Simonsen – NORDITAKoon-Kiu Yan – Stony Brook U.Huafeng Xie –City University of New York
Postdoc positionLooking for a postdoc to work in my group at Brookhaven National Laboratory in New Yorkstarting Fall 2005Topic - large-scale properties of (mostly) bionetworks (partially supported by a NIH/NSF grant with Ariadne Genomics)E-mail CV and 3 letters of recommendation to: [email protected] www.cmth.bnl.gov/~maslov
Protein networksNodes – proteins Edges – interactions between proteins
Metabolic (all metabolic pathways of an organism)Bindings (physical interactions)Regulations (transcriptional regulation, protein modifications)Disruptions in expression of genes (remove or overexpress one gene and see which other genes are affected) Co-expression networks from microarray data (connect genes with similar expression patterns under many conditions)Synthetic lethals (removal of A doesn’t kill, removal of B doesn’t kill but removal of both does)Etc, etc, etc.
Sources of data on protein networks
Genome-wide experimentsBinding – two-hybrid high throughput experiments Transcriptional regulation – ChIP-on-chip, or ChIP-then-SAGEExpression, disruption networks – microarraysLethality of genes (including synthetic lethals):
Gene knockout – yeast RNAi –worm, fly
Many small or intermediate-scale experimentsPublic databases: DIP, BIND, YPD (no longer public), SGD, Flybase, Ecocyc, etc.
MAPK signalingInhibition of apoptosis
Images from ResNet3.0 by Ariadne Genomics
Pathway network paradigm shift
Pathway Network
Transcription regulatory networks
Bacterium: E. coli3:2 ratio
Single-celled eukaryote:S. cerevisiae; 3:1 ratio
Data from Ariadne Genomics
Homo sapiens
Total: 120,000 interacting protein pairs extracted from PubMed as of 8/2004
Degree (or connectivity) of a node – the # of neighbors
DegreeK=4
DegreeK=2
Directed networks havein- and out-degrees
Out-degreeKout=5
In-degreeKin=2
How many transcriptional regulators are out there?
Fraction of transcriptional regulators in bacteria
from Stover et al., Nature (2000)
Complexity of regulation grows with complexity of organism
NR<Kout>=N<Kin>=number of edgesNR/N= <Kin>/<Kout> increases with N
<Kin> grows with NIn bacteria NR~N2 (Stover, et al. 2000) In eucaryots NR~N1.3 (van Nimwengen, 2002)
Networks in more complex organisms are more interconnected then in simpler ones
Complexity is manifested in Kin distribution
E. coli vs H. sapiens
Figure from Erik van Nimwegen, TIG 2003
Table from Erik van Nimwegen, TIG 2003
Protein-protein binding (physical interaction)
networks
Two-hybrid experiment
To test if A interacts with B create two hybridsA* (with Gal4p DNA-binding domain) and B* (with Gal4p activation domain)
High-throughput: all pairs among 6300 yeast proteins are tested: Yeast: Uetz, et al. Nature (2000), Ito, et al., PNAS (2001); Fly: L. Giot et al. Science (2003)
Gal4-activated reporter gene, say GAL2::ADE2Gal4-binding domain
A* B*
Protein binding network in yeast
Protein binding networks
Two-hybrid nuclear Database (DIP) core set
S. cerevisiae
Degree distributionsin random and real networks
Degree distribution in a random network
Randomly throw E edges among N nodesSolomonoff, Rapaport, Bull. Math. Biophysics (1951)Erdos-Renyi (1960)Degree distribution –PoissonK~λ±√λ with no hubs(fast decay of N(K))
( ) exp( )!
2 /
K
N K NK
K E N
λ λ
λ
= −
= =
Degree distribution in real protein networks
Histogram N(K) is broad: most nodes have low degree ~ 1, few nodes – high degree ~100Can be approximately fitted with N(K)~K-γ
functional formwith γ~=2.5
H. Jeong et al. (2001)100 101 102 10310-4
10-2
100
102
104
x-2.5
Network properties of self-binding proteins AKA homodimers
I. Ispolatov, A. Yuryev, I. Mazo, SMq-bio.GN/0501004.
There are just TOO MANY homodimers
• Null-model • Pself ~<k>/N• Ndimer=N • Pself= <k>• Not surprising ashomodimers have many functional roles
Network properties around homodimers
Fly: two-hybrid data Human: database data
Likelihood to self-interact vs. K
Pself~0.05, Pothers~0.0002 Pself~0.003, Pothers~0.0002
What we think it means?In random networks pdimer(K)~K2 not ~K like our empirical observationK is proportional to the “stickiness” of the protein which in its turn scales with
the area of hydrophobic residues on the surface# copies/cellits popularity (in datasets taken from a database)Etc.
Real interacting pairs consists of an “active” and “passive” protein and binding probability scales only with the “stickiness” of the active protein“Stickiness” fully accounts for higher than average connectivity of homodimers
Propagation of signals and perturbations in networks
Naïve argumentAn average node has <K> first neighbors, <K><K-1> second neighbors, <K><K-1><K-1> third neighborsTotal number of neighbors affected is1+<K> + <K><K-1> +<K><K-1><K-1>+…If <K-1> ≥ 1 a perturbation starting at a single node MAY affect the whole network
Where is it wrong?
Probability to arrive to a node with K neighbors is proportional to K!The right answer: <K(K-1)>/<K> ≥ 1a perturbation would spreadCorrelations between degrees of neighbors would affect the answer
How many clusters?
If <K(K-1)>/<K> >> 1 there is one “giant” cluster and few small ones. If <K(K-1)>/<K> ∼ 1 small clusters have a broad (power-law) distributionPerturbation which affects neighbors with probability p propagates if p<K(K-1)>/<K> ≥ 1For scale-free networks P(K)~K-γ with γ<3, <K2>=∞ ⇒ perturbation always spreads in a large enough network
Problem 2: For a given p and γ how large should be the network for the perturbation to spread
Amplification ratios
• A(dir): 1.08 - E. Coli, 0.58 - Yeast• A(undir): 10.5 - E. Coli, 13.4 – Yeast• A(PPI): ? - E. Coli, 26.3 - Yeast
Problem 3: derive the above formula for directed networks
Beyond degree distributions:How is it all wired together?
Central vs peripheral network architecture
central(hierarchical)
peripheral(anti-hierarchical)
From A. Trusina, P. Minnhagen, SM, K. Sneppen, Phys. Rev. Lett. (2004)
random
Correlation profile
Count N(k0,k1) – the number of links between nodes with connectivities k0 and k1
Compare it to Nr(k0,k1) – the same property in a random networkQualitative features are very noise-tolerant with respect to both false positives and false negatives
PRP45
SYF1
RFA1RFA3
SMT3
HTB2
ALPHA1
FUS3DIG2
DIG1
APN2
GCN4
RRN10
RPN10
LSM2PAT1
RVB1
SKI6
MED6
RFC5RFC2
DCP1
SIF2RNA14
MGA1
NPL4CDC48
YAP6
STD1
NTC20
MSI1
NEM1CDC73
TAF90PIF1
NUP100
YAP1
YBR223C
YBR239CYPL133C
NUP192
NIP29
BBP1
DPB3ADA2
HAS1
KAR4
IME4
ALPHA2TOP1
POL4GZF3
CIN8
POP3
RSC6RSC8
UME6
PRP11
PRP21
PRP9
UBC9
RRP42
MTR3
NIP100
NUP84PDS1
NUP145
DPB11RRS1
SAS10
DEG1PRP43
NET1MPP10
RAP1
KSS1
PBP1
CDC13
STN1
HSP26YDR026C
SRP40
LYS14YDR449C
GTS1
DAM1
CHA4
SEN2
CTF13CBC2
RPO26
RAD55RAD51
TRM1
GBP2
NGG1
TID3
RVB2
NUP42
SPC19
SPC34
HMT1
FKH1
RRP45
NUP53
TFB1PTA1
JNM1MCM21
SNU23SNM1 SMB1
SPO12
SKP1SGT1
MSN5
LSM6LSM5
RPT3
NCB2
BUR6
REC107
PRP3
PHO85
PRP4
CCL1
NUP82
SLT2
SRB2
RPC40
SRB4
DOA4HTA1YFR038W
CRM1
IME1CBF5
TSP1MCM1
MED11
RIM11
MER1POP2
CIN5
CKB2
TAF47
MAF1
PRE10
PUP3
GLE2
DMC1
GCD14
MSH4
MSH5
UPF3
GAT1SCL1
YFL052W
NIC96
SFH1
NUP57
RNA15NUP2RRN9
RAD6
TOA1DUO1
RPB9
POP7CDC34 ATC1
SWM1FLO8
RPS22A
TAF17
NAB2GAR1
FZF1
CTK3
SPT2
DOT5SFL1
RDH54
HAP2HAP3
SNU71
PRP40
REC102
NUP49
NSP1
SYF2
ULP1
ZPR1HRT1
RIM101
SMD3
POP8
PRE8
TFC7
ARP1
CDC23
LEU3
STB5
CDC9
MEK1
RPT5
SIK1
FPR4
SNP1
HDA1
LSM1
HYS2
POL32
SRP1
LSM8
ISY1
RPA12
RPA14
TAF25
SPC97
RAD27
PHD1 NUP120
HMO1
DBR1
DAL80SAP1
NUP133
STU2
CDC45
DOT6
MSL5
SME1
RED1
SMD2
SMX3YSH1
RPC11
KAP95
RIF2 SIR2
PRI1
SWI3
CNA1
YOX1
RNH35RRP6
SLK19YOR206W
WTM1
YPR144C
NBP1
NDC1
OGG1
RCL1
MFT1PHO23
CUP2SNF6
RAD10
SPT4
TAF19
CAC2
RLF2CTK2
NUP116SPC72
SWH1CAD1
YAP3SPS18
YNG1
PUP1
SAS5
TYE7
SEN54
MEX67
STO1
ASM4RGM1
CEF1
KRI1
CUS1
CAT8
TFC5
IMP4
TOP2
TFG2
FIP1
LSM7
DAL82
NUF2
RFC4
RFC3
TRF4 NVJ1
HRP1
SWI6
BUB1
WHI2
PRE9
RGT1
CKA2
RPT4
RIS1
CDC31
KAR1
YKR017C
HSH49
RTF1
NOP4
KAP120
PRP46
RAD53
HRR25
TFB3
NOT5
SUA7
RNR3
ESS1
GCD10
SPT20CDC21
PRE2
FHL1
YTH1
RPN7 RPN3
DBF20
Pajek
Correlation profile of the protein interaction network
R(k0,k1)=N(k0,k1)/Nr(k0,k1) Z(k0,k1) =(N(k0,k1)-Nr(k0,k1))/∆Nr(k0,k1)
Similar profile is seen in the yeast regulatory network
Pajek
Correlation profile of the yeast regulatory network
R(kout, kin)=N(kout, kin)/Nr(kout,kin) Z(kout,kin)=(N(kout,kin)-Nr(kout,kin))/ ∆Nr(kout,kin)
Some scale-free networks may appear similar
In both networks the degree distribution is scale-free P(k)~ k-γ with γ~2.2-2.5
But: correlation profiles give them unique identities
InternetProtein interactions
How to construct a proper random network?
Randomization of a network
given complex network random
Basic rewiring algorithm
Randomly select and rewire two edges Repeat many times
• R. Kannan, P. Tetali, and S. Vempala, Random Structures and Algorithms (1999)
• SM, K. Sneppen, Science (2002)
Metropolis rewiring algorithm
Randomly select two edgesCalculate change ∆E in “energy function” E=(Nactual-Ndesired)2/Ndesired
Rewire with probability p=exp(-∆E/T)
“energy” E “energy” E+∆E
SM, K. Sneppen:cond-mat preprint (2002),Physica A (2004)
How do protein networks evolve?
Gene duplication
Pair of duplicated proteins
2 Shared interactionsOverlap=2/5=0.4
Pair of duplicated proteins
5 Shared interactionsOverlap=5/5=1.0
Right after duplication After some time
Yeast regulatory network
SM, K. Sneppen, K.Eriksen, K-K. Yan, BMC Evolutionary Biology (2003)
100 million years ago
Protein interaction networksSM, K. Sneppen, K.Eriksen, K-K. Yan, BMC Evolutionary Biology (2003)
Why so many genes could be deleted without any
consequences?
Possible sources of robustness to gene deletionsBackup via the network (e.g. metabolic network could have several pathways for the production of the necessary metabolite)Not all genes are needed under a given condition (say rich growth medium)Affects fitness but not enough to killProtection by closely related homologs in the genome
Protective effect of duplicates
Gu, et al 2003Maslov, Sneppen, Eriksen, Yan 2003
Yeast Worm
Maslov, Sneppen, Eriksen, Yan 2003
Z. Gu, … W.-H. Li, Nature (2003)
Maslov, Sneppen, Eriksen, Yan BMC Evol. Biol.(2003)
SummaryThere are many kinds of protein networksNetworks in more complex organisms are more interconnectedMost have hubs – highly connected proteinsHubs often avoid each other (networks are anti-hierarchical)There are many self-interacting proteins. Probability to self-interact linearly scales with the degree K.Networks evolve by gene duplicationsRobustness is often (but not always) provided by gene duplicates
Part 2
Modules in networks and how to detect them using the Random walks/diffusion
K. Eriksen, I. Simonsen, SM, K. Sneppen, PRL (2003)
What is a module?Nodes in a given module (or community group or functional unit) tend to connect with other nodes in the same module
Biology: proteins of the same function or sub-cellular localizationWWW – websites on a common topicInternet – geography or organization (e.g. military)
Do you see any modules here?
Random walkers on a networkStudy the behavior of many VIRTUAL random walkers on a networkAt each time step each random walker steps on a randomly selected neighborThey equilibrate to a steady state ni ~ ki (solid state physics: ni = const)Slow modes allow to detect modules and extreme edges
Matrix formalism
ˆ( 1) ( )
1/ if j iˆ0 otherwise
i ij jj
jij
n t T n t
KT
+ =
↔⎧= ⎨⎩
∑
Eigenvectors of the transfer matrix Tij
( )
( ) ( ) ( )
( ) ( )
( )
ˆ
( )
1 1
i ij jj
t
i i
v T v
n t v
α α α
α α
α
λ
λ
λ
=
=
− ≤ ≤
∑
US Military
Russia
2 0.9626 RU RU RU RU CA RU RU ?? ?? US US US US ??
(US Department of Defence)
3 0.9561 ?? FR FR FR ?? FR ??
RU RU RU ?? ?? RU ??
4 0.9523 US ?? US ?? ?? ?? ?? (US Navy)
NZ NZ NZ NZ NZ NZ NZ
5. 0.9474 KR KR KR KR KR ?? KR
UA UA UA UA UA UA UA
Hacked site
How Modules in the WWW influence Google ranking
H. Xie, K.-K. Yan, SM, cond-mat/0409087
How Google works?Too many hits to a typical query: need to rank One could rank the importance of webpages by Kin, but:
Too democratic: It doesn’t take into account importance of webpages sending links it’s easy to trick and artificially boost the rank
Google’s solution: simulate the behavior of many “random surfers” and then count the number of hits for every webpage
Popular pages send more surfers your way the Google weight is proportional to Kin weighted by popularity Surfer get bored following links use α=0.15 for random jumps without any linksLast rule also solves the problem that some pages have no out-links
Mathematics of the Google
Google solves a self-consistent Eq.:
Gi ~ Σj Tij Gj - find principal eigenvector Tij= Aij/Kout (j)Model with random jumps:
Gi ~ (1-α)Σj Tij Gj + α Σj Gj
How do WWW communities (modules) influence the Gi?
Naïve argument: communities tend to “trap” random surfers they should increase the Google ranking of nodes in the community
log 1
0(G
c/G
w)
Ecc
Test of a naïve argument
Naïve argument is completely wrong!
EccEww
Gc – average Google rank of pages in the community; Gw – in the outside world
Ecw Gc/<Kout>c – current from C to Wmust be equal to: Ewc Gw/<Kout>w – current from W to C
Gc /Gw depends on the ratio between Ecwand Ewc – the number of edges (hyperlinks) between the community and the world
Networks with artificial communities
To test let’s generate a scale-free network with an artificial community of Nc pre-selected nodes Use metropolis algorithm with H=-(# of intra-community nodes) and some inverse temperature βDetailed balance:
Collaborators:
Kim Sneppen – Nordita + NBIKasper Eriksen –Lund U.Ingve Simonsen – U. of Trondheim
Huafen Xie – City University of NYKoon-Kiu Yan - Stony Brook U.
Summary
Diffusion process and modules (communities) in a network influence each otherIn the “hardware” part of the Internet – the network of routers (Autonomous systems) --diffusion allows one to detect modules and extreme edgesIn the “software” part – WWW communities affect Google ranking in a non-trivial way
THE END