FunCoup: data integration and networks of functional coupling in eukaryotes (Andrey Alexeyenko)
Post on 19-Dec-2015
FunCoup: data integration and networks of functional coupling in eukaryotes
Andrey Alexeyenko
FunCoup is a data integration framework to discover
functional coupling in eukaryotic proteomes with
data from model organisms
Figure: are human proteins A and B coupled (?) Find orthologs* in mouse, worm, fly, and yeast, and collect high-throughput evidence.
* Remm M, Storm CE, Sonnhammer ELL (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314:1041-1052.
FunCoup is a naïve Bayesian network (NBN)
Bayesian inference:
C: genes A and B are functionally coupled (A<->B)
E: genes A and B are co-expressed
P(C|E) = (P(C) * P(E|C)) / P(E)
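A minimal sketch of the Bayes' rule above, with P(E) expanded over the coupled and not-coupled cases. The function name and all numbers are hypothetical, chosen only for illustration, and do not come from FunCoup.

```python
def posterior(prior, p_e_given_c, p_e_given_not_c):
    """P(C|E) by Bayes' rule, with P(E) summed over C and not-C."""
    p_e = prior * p_e_given_c + (1.0 - prior) * p_e_given_not_c
    return prior * p_e_given_c / p_e


# Hypothetical inputs: a rare coupling event and a tenfold enrichment
# of co-expression among coupled pairs.
p = posterior(prior=0.001, p_e_given_c=0.30, p_e_given_not_c=0.03)
print(f"P(C|E) = {p:.4f}")
```

Even with a tenfold enrichment, the posterior stays small because the prior is small, which is one reason to track the change in belief rather than the absolute probability.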
Problem: Absolute probabilities of FC are intractable. The full Bayesian network is impossible.
Solution: Naïve Bayesian network. Calculate a belief change instead (likelihood ratios, LR).
Figure: the full network for A<->B would need every pairwise conditional probability (P(B|A), P(A|B), P(B|C), P(C|B), P(B|D), P(D|B), P(A|C), P(C|A), P(D|C), P(C|D), P(A|D), P(D|A)); the naïve network attaches one independent ratio P(E|+) / P(E|-) per evidence type to A<->B.
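A minimal sketch of how independent likelihood ratios combine in a naïve Bayesian network: the logs of the ratios add up, and the posterior odds are the prior odds times their product. The natural logarithm, the function names, and the example values are assumptions for illustration; the slides do not state which log base FunCoup uses.

```python
import math


def final_score(likelihood_ratios):
    """Sum of log likelihood ratios; evidence types are assumed independent."""
    return sum(math.log(lr) for lr in likelihood_ratios)


def posterior_odds(prior_odds, likelihood_ratios):
    """Posterior odds = prior odds times the product of the LRs."""
    return prior_odds * math.exp(final_score(likelihood_ratios))


# Hypothetical LRs from three evidence types for one pair A<->B.
lrs = [4.0, 0.5, 10.0]
score = final_score(lrs)
odds = posterior_odds(0.001, lrs)
print(score, odds)
```

An LR below 1 (here 0.5) lowers the score, so evidence can count against a link as well as for it.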
Problem: How to establish optimal bridges between species?
Solution: Via groups of orthologs that emerged through speciation.
Figure legend: gene evolution; functional link.
Problem: In situations with multiple inparalogs, how to deal with alternative evidence?
Solution: Treat ALL inparalogs equally, and choose the BEST value.
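A minimal sketch of the "best value" rule, assuming that "best" means the highest evidence value among all inparalog pairs. The gene identifiers, the dictionary layout, and the function name are hypothetical.

```python
from itertools import product


def best_evidence(inparalogs_a, inparalogs_b, evidence):
    """Return the highest evidence value over all inparalog pairs, or None."""
    values = []
    for a, b in product(inparalogs_a, inparalogs_b):
        key = frozenset((a, b))  # unordered pair
        if key in evidence:
            values.append(evidence[key])
    return max(values) if values else None


# Hypothetical mouse inparalogs of human genes A and B, with co-expression values.
evidence = {
    frozenset(("mA1", "mB1")): 0.21,
    frozenset(("mA2", "mB1")): 0.67,
}
best = best_evidence(["mA1", "mA2"], ["mB1", "mB2"], evidence)
print(best)
```

Pairs without data are skipped, so a missing measurement for one inparalog does not hide the evidence carried by another.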
Problem: Collected features often tell the same story, which is poorly compatible with NBN.
Solution: Render data uncorrelated with principal components analysis (PCA).
Figure: scatter plot in which each point is a pair of proteins, with axes X (feature A) and Y (feature B), shown before and after rotation onto the principal components:
PC1 = α11 X + α12 Y
PC2 = α21 X + α22 Y
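A minimal sketch of the decorrelation step for two features, using the closed-form rotation angle of a 2x2 covariance matrix instead of a general PCA routine. Here cos θ and sin θ play the role of the α coefficients; the data and function names are hypothetical.

```python
import math


def covariance(us, vs):
    """Sample covariance of two equally long sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs)) / (n - 1)


def pca_2d(xs, ys):
    """Rotate centred 2-D data onto its principal axes; return (pc1, pc2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    sxx, syy, sxy = covariance(xs, xs), covariance(ys, ys), covariance(xs, ys)
    # Angle of the first principal axis of the covariance matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [c * x + s * y for x, y in zip(cx, cy)]
    pc2 = [-s * x + c * y for x, y in zip(cx, cy)]
    return pc1, pc2


# Hypothetical, strongly correlated feature values for five protein pairs.
feature_a = [0.1, 0.4, 0.5, 0.8, 0.9]
feature_b = [0.2, 0.3, 0.6, 0.7, 1.0]
pc1, pc2 = pca_2d(feature_a, feature_b)
print(covariance(feature_a, feature_b), covariance(pc1, pc2))
```

The original features covary, while the two components do not, which is what the naïve independence assumption needs.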
Problem: A feature distribution shape may be unpredictable: hard to learn the “feature -> evidence” mapping.
Solution: Render features discrete.
Problem: Distribution areas informative of FC may vary.
Solution: Find them individually for each data set and FC class, accounting for the joint “feature – class” distribution.
Figure: positive (+) and negative (-) examples distributed along the Pearson r axis, from -1 through 0 to 1.
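A minimal sketch of discretizing a feature and learning one likelihood ratio per bin. The bin edges, the pseudocount smoothing, and the data are assumptions made for illustration, not FunCoup's actual binning procedure.

```python
from bisect import bisect_right


def bin_index(value, edges):
    """Index of the bin that holds value, given sorted inner bin edges."""
    return bisect_right(edges, value)


def likelihood_ratios(positives, negatives, edges, pseudocount=1.0):
    """Per-bin LR = P(bin | +) / P(bin | -), smoothed with a pseudocount."""
    n_bins = len(edges) + 1
    pos = [pseudocount] * n_bins
    neg = [pseudocount] * n_bins
    for v in positives:
        pos[bin_index(v, edges)] += 1
    for v in negatives:
        neg[bin_index(v, edges)] += 1
    pos_total, neg_total = sum(pos), sum(neg)
    return [(p / pos_total) / (n / neg_total) for p, n in zip(pos, neg)]


# Hypothetical Pearson r values for coupled (+) and background (-) pairs.
positives = [0.9, 0.8, 0.7, 0.6, -0.8, 0.1]
negatives = [0.1, 0.0, -0.1, 0.2, -0.2, 0.05, 0.7, -0.3]
edges = [-0.5, 0.5]  # three bins: r < -0.5, -0.5 <= r < 0.5, r >= 0.5
lrs = likelihood_ratios(positives, negatives, edges)
print(lrs)
```

In this toy data both tails of the r axis carry an LR above 1 while the middle bin falls below 1, so informative regions need not be contiguous.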
Problem: Impossible to guarantee absence of FC in negative training sets.
Solution: Replace negative sets with randomly picked ones.
Figure: the pairing “negative set (not coupled proteins) / positive set (coupled proteins)” is replaced by “random set / positive set”.
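A minimal sketch of drawing a random background set of protein pairs. Excluding known positives is an optional assumption here (the slide only says the pairs are randomly picked), and enumerating all pairs is only practical for small examples. All names are hypothetical.

```python
import random
from itertools import combinations


def random_pairs(proteins, n_pairs, exclude=frozenset(), seed=0):
    """Draw n_pairs distinct unordered protein pairs that are not in exclude."""
    rng = random.Random(seed)
    candidates = [frozenset(pair) for pair in combinations(proteins, 2)
                  if frozenset(pair) not in exclude]
    return rng.sample(candidates, n_pairs)


proteins = ["P1", "P2", "P3", "P4", "P5", "P6"]
positive_set = {frozenset(("P1", "P2"))}
random_set = random_pairs(proteins, 5, exclude=positive_set)
print(random_set)
```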
Problem: Some LR are weak and arise due to non-representative sampling.
Solution: Enforce confidence check and remove insignificant nodes.
Figure: each ratio P(E|+) / P(E|-) attached to A<->B is put to a test.
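A minimal sketch of one possible confidence check: a Pearson chi-square test on the 2x2 table of "in bin / not in bin" counts for the positive and background sets. The slides do not say which test FunCoup applies, so the choice of test, the 0.05 level, and all names are assumptions.

```python
CHI2_CRITICAL_DF1_P05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


def node_is_significant(pos_in_bin, pos_total, neg_in_bin, neg_total):
    """True if the bin's share differs between the + and - sets at alpha = 0.05."""
    stat = chi_square_2x2(pos_in_bin, pos_total - pos_in_bin,
                          neg_in_bin, neg_total - neg_in_bin)
    return stat > CHI2_CRITICAL_DF1_P05


# Same LR of 3 in both cases, but only the larger sample supports it.
print(node_is_significant(30, 100, 10, 100))
print(node_is_significant(3, 10, 1, 10))
```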
Problem: Definitions and notions of FC vary.
Solution: Multinet. Decide which types of FC are needed (provide as positive training sets) and perform the previous steps customized.
Figure: three link types (A<>B, A||B, A|B), each with its own set of ratios P(E|+) / P(E|-).
FunCoup’s web interface (New!)
Hooper S., Bork P. Medusa: a simple tool for interaction graph analysis. Bioinformatics. 2005 Dec 15;21(24):4432-3. Epub 2005 Sep 27.
http://www.sbc.su.se/~andale/funcoup.html
Proteins of the Parkinson’s disease pathway (KEGG #05020)
Physical protein-protein interaction
“Signaling” link
Metabolic “non-signaling” link
Multinet presents several link types in parallel
Multilateral data transfer
Figure: data exchanged among human, mouse, rat, Ciona, worm, fly, yeast, and Arabidopsis, processed with PCA and the NBN.
Data from the same species is an important but not indispensable component of the framework. Hence, a network can be constructed for an organism with no experimental datasets at all.
FunCoup builds a network for an uncharacterized organism (C. intestinalis)
1. Build multi-species clusters of orthologs (e.g. human + C. intestinalis + D. melanogaster + C. elegans) [*]
2. Extend known metabolic pathway assignments to the novel organism (e.g. Ciona)
3. Collect data from well-studied organisms
4. Using this data, train FunCoup on the set created in (2)
5. Test each pair of proteins in the novel organism for being coupled
*Alexeyenko A, Tamas I, Liu G, Sonnhammer ELL Automatic clustering of orthologs and inparalogs shared by multiple proteomes. Bioinformatics. 2006 15;22(14):e9-e15.
Reconstructing the “regulatory blueprint”* in C. intestinalis
*Imai KS, Levine M, Satoh N, Satou Y (2006) Regulatory blueprint for a chordate embryo. Science, 26:1183-7.
Proteins of the “Regulatory Blueprint for a Chordate Embryo” [*]
18 links mentioned in [*] AND found by FunCoup
Links found by FunCoup (about 140)
The remaining 202 links from [*], which FunCoup did not find, are not shown
Set of outgoing links of the “regulatory blueprint”
…and a tight cluster from it:
The Ciona genes were not described, but may receive this annotation via orthology:
Inferred KEGG pathways:
00562 Inositol phosphate metabolism
00632 Benzoate degradation via CoA ligation
00760 Nicotinate and nicotinamide metabolism
04310 Wnt signaling pathway
04330 Notch signaling pathway
04350 TGF-beta signaling pathway
04360 Axon guidance
04510 Focal adhesion
04512 ECM-receptor interaction
04514 Cell adhesion molecules
04520 Adherens junction
04530 Tight junction
04630 Jak-STAT signaling pathway
04640 Hematopoietic cell lineage
04670 Leukocyte transendothelial migration
04810 Regulation of actin cytoskeleton
…and annotations of human orthologs:
ADAM10
Myosin light chain 2
Cadherin EGF LAG seven-pass G-type receptor 2
Neurotrophic tyrosine kinase, receptor-related 3
The limits of data integration
Figure (two panels): area under ROC at specificity > 96% (y-axis, 0.004 to 0.013) versus number of species (1 to 5) and versus number of features (4 to 44), each for PCA-processed and raw data.
Confidence estimation
Sensitivity (from “gold standard” set of FC):
Sens = TP / (TP + FN)
Specificity (from a set of “No / not known FC”):
Spec = TN / (TN + FP)
Positive Predictive Value (from everything predicted by FunCoup):
PPV = TP / (TP + FP)
PPV answers the question:
“How much should we trust the FunCoup predictions?”
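A minimal sketch of the three measures above, computed from confusion-matrix counts. The counts are hypothetical and the function name is illustrative.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity and PPV from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Hypothetical counts: few positives among many background pairs.
m = confusion_metrics(tp=80, fp=20, tn=9980, fn=120)
print(m)
```

With these counts, specificity is close to 1 while PPV is 0.8, which shows why PPV, not specificity, is the number that says how far a prediction can be trusted.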
Figure: PPV estimate (0.0 to 1.0) versus final Bayesian score (2 to 12) for signaling, metabolic, and PPI links.
Figure: confidence score (0.1 to 0.8) versus number of links / individual proteins in the human network (320,000 / 12,750; 94,000 / 11,100; 37,000 / 8,500; 17,000 / 5,450), for members of the same signaling pathway and for physical protein-protein interactions.
Correction of confidence by amount of evidence
1. Record the amount of evidence (AOE ~ number of non-empty values) that describes each pair of proteins A<->B
2. Correct each final Bayesian score:
FBS’(A<->B) = FBS(A<->B) + beta * (M(AOE) – AOE(A<->B));
beta is the linear regression coefficient of: FBS = alpha + beta * AOE
Confidence saturated at FBS = 12.5
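A minimal sketch of the correction above, assuming that M(AOE) denotes the mean AOE over all pairs and that beta comes from an ordinary least-squares fit. The data and function names are hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit ys = alpha + beta * xs; returns (alpha, beta)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    beta = sxy / sxx
    return my - beta * mx, beta


def correct_scores(fbs, aoe):
    """FBS' = FBS + beta * (mean(AOE) - AOE); mean is an assumed reading of M."""
    _, beta = linear_fit(aoe, fbs)
    mean_aoe = sum(aoe) / len(aoe)
    return [s + beta * (mean_aoe - a) for s, a in zip(fbs, aoe)]


# Hypothetical pairs: scores grow with the number of non-empty values.
aoe = [2, 4, 6, 8]
fbs = [1.0, 3.0, 4.0, 8.0]
corrected = correct_scores(fbs, aoe)
print(corrected)
```

After the correction, a regression of the corrected scores on AOE has zero slope, so pairs are no longer favoured merely for having more data, and the mean score is unchanged.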
Figure: PPV estimate (~ confidence, 0% to 100%) versus FunCoup final Bayesian score (2 to 12). Series: no correction -> O prior = 0.0008; correction = 0.007 -> O prior = 0.0008; correction = 0.015 -> O prior = 0.0007.
How are the yeast complex entities conserved?
Figure: log overlap between KEGG and Gavin et al., 2006 (y-axis, 1 to 7) for yeast, worm, fly, mouse, human, and thaliana, across the pair categories Core-Core, Core-Modu, Core-Attr, Modu-Modu, Modu-Attr, DiffModules, and Attr-Attr.
Conclusions
http://FunCoup.sbc.su.se
• After the optimization, the naïve Bayesian network is well suited for collection/evaluation of sparse, diverse, and noisy features, and is, in itself, efficient at discovering novel cases of FC
• Orthologs are optimal to transfer information across species
• The multiple class training enabled specific prediction of different types of functional coupling
• Across-species information flow is not symmetrical but reversible, hence the networks of uncharacterized proteomes in FunCoup
• In the Bayesian output, no missing values exist, thus a multivariate classification technique may be applied as a post-processor
Acknowledgements:
• Erik Sonnhammer
• Tomas Ohlson
• Mats Lindskog
• Kristoffer Forslund
• Gabriel Östlund
• Kevin O’Brien
• Carsten Daub
Validation
Jack-knife procedure:
• Take “positive” and “negative” sets
• Split each randomly as 50:50
• Use the first parts to train the algorithm, the second to test the performance
• Repeat a number of times
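A minimal sketch of the repeated 50:50 split described above. The toy midpoint classifier merely stands in for the training step and is not FunCoup; all names and data are hypothetical.

```python
import random


def split_half(items, rng):
    """Shuffle a copy of items and split it into two halves."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def jackknife(positives, negatives, train_and_test, repeats=10, seed=0):
    """Repeat: split both sets 50:50, train on the first halves, test on the second."""
    rng = random.Random(seed)
    results = []
    for _ in range(repeats):
        pos_train, pos_test = split_half(positives, rng)
        neg_train, neg_test = split_half(negatives, rng)
        results.append(train_and_test(pos_train, neg_train, pos_test, neg_test))
    return results


def midpoint_classifier(pos_train, neg_train, pos_test, neg_test):
    """Toy stand-in: threshold halfway between the class means; returns accuracy."""
    cut = (sum(pos_train) / len(pos_train) + sum(neg_train) / len(neg_train)) / 2
    correct = sum(v > cut for v in pos_test) + sum(v <= cut for v in neg_test)
    return correct / (len(pos_test) + len(neg_test))


positives = [10.0, 11.0, 12.0, 13.0]  # hypothetical scores of coupled pairs
negatives = [0.0, 1.0, 2.0, 3.0]      # hypothetical scores of background pairs
accuracies = jackknife(positives, negatives, midpoint_classifier, repeats=5)
print(accuracies)
```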
Analysis Of VAriance:
• Introduce features A, B, C in the workflow of FunCoup (e.g., using PCA, selecting nodes of BN by relevance, ways of using ortholog data etc.)
• Run FunCoup with all possible combinations of absence/presence of A, B, C to produce a balanced and orthogonal ANOVA design with replicates
• Study effects of A, B, C or their combinations AxB, BxC, ... AxBxC to see if they influence the performance significantly (whereas all other effects did not exist)
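A minimal sketch of enumerating the balanced design: every presence/absence combination of the factors, each run with the same number of replicates. The function name and the run layout are hypothetical; the ANOVA itself is not shown.

```python
from itertools import product


def factorial_design(factors, replicates):
    """All on/off combinations of the factors, each repeated `replicates` times."""
    runs = []
    for levels in product([False, True], repeat=len(factors)):
        for rep in range(replicates):
            run = dict(zip(factors, levels))
            run["replicate"] = rep
            runs.append(run)
    return runs


design = factorial_design(["A", "B", "C"], replicates=2)
print(len(design))
```

Three two-level factors with two replicates give 16 runs, and each factor is switched on in exactly half of them, which is what makes the design balanced and orthogonal.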
Estimating quality of prediction
Sensitivity: TP / (TP + FN)
1 - Specificity: FP / (FP + TN)
Individual points represent varying cut-offs
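A minimal sketch of how varying the cut-off traces out such a curve: each cut-off yields one (1 - specificity, sensitivity) point, and the trapezoidal rule gives the area under the points. The scores and function names are hypothetical.

```python
def roc_points(scores_pos, scores_neg):
    """(1 - specificity, sensitivity) for every cut-off, highest score first."""
    cutoffs = sorted(set(scores_pos) | set(scores_neg), reverse=True)
    points = [(0.0, 0.0)]
    for cut in cutoffs:
        tp = sum(s >= cut for s in scores_pos)
        fp = sum(s >= cut for s in scores_neg)
        points.append((fp / len(scores_neg), tp / len(scores_pos)))
    return points


def area_under_curve(points):
    """Trapezoidal area under consecutive ROC points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical final scores for positive and background pairs.
pos = [9.0, 7.0, 6.0, 3.0]
neg = [8.0, 4.0, 2.0, 1.0]
points = roc_points(pos, neg)
auc = area_under_curve(points)
print(points, auc)
```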