Page 1: The Identification of  Scale-Free  Gene-Protein Networks

1

The Identification of Scale-Free

Gene-Protein Networks

Ronald Westra

Department of Mathematics

Maastricht University

Page 2: The Identification of  Scale-Free  Gene-Protein Networks

2

1. Biological background and problem formulation

2. Modeling of dynamic gene/protein interactions

3. Scale-free network structures

4. Reconstruction of scale-free gene/protein networks

5. Conclusions

Items in this Presentation

Page 3: The Identification of  Scale-Free  Gene-Protein Networks

3

1. Biological background

Do gene-protein networks exhibit characteristic architectural and structural properties that may act as a format for reconstruction?

Some observations ...

Page 4: The Identification of  Scale-Free  Gene-Protein Networks

4

Mycoplasma genitalium: 500 nm, 580 kbp, 477 genes, 74% coding DNA, obligate parasitic endosymbiont

Mycoplasma genitalium Metabolic Network

Metabolic network: nodes are genes, edges are gene co-expressions

Degree distribution: horizontal axis, log of degree (= number of connections); vertical axis, log of the number of genes with this degree

Page 5: The Identification of  Scale-Free  Gene-Protein Networks

5

Protein complex network and connected complexes in yeast S. cerevisiae, Gavin et al., Nature 2002.

Cumulative degree distributions of Saccharomyces cerevisiae, Jeong et al, Nature 2001

Yeast

Page 6: The Identification of  Scale-Free  Gene-Protein Networks

6

Functional modules of the kinome network [Hee, Hak, 2004]

Page 7: The Identification of  Scale-Free  Gene-Protein Networks

7

Degree distributions in the human gene coexpression network. Coexpressed genes are linked for different values of the correlation r. Jordan et al., Molecular Biology and Evolution, 2004

Page 8: The Identification of  Scale-Free  Gene-Protein Networks

8

Statistical properties of the human gene coexpression network.

(a) Node degree distribution.

(b) Clustering coefficient plotted against the node degree.

Jordan et al., Molecular Biology and Evolution, 2004

Page 9: The Identification of  Scale-Free  Gene-Protein Networks

9

Objective:

* Are there distinctive architectural properties in gene-protein networks that facilitate their reconstruction from experimental data? (It helps if you know what the network looks like.)

Example: sparsity (Yeung et al. 2003, etc.)

* Are there other special network properties that work similarly? Or even better?

Problem formulation

Page 10: The Identification of  Scale-Free  Gene-Protein Networks

10

2. Modeling Interactions between Genes and Proteins

A prerequisite for the successful reconstruction of gene-protein networks is a suitable model of the dynamics of their interactions.

Page 11: The Identification of  Scale-Free  Gene-Protein Networks

11

Components in Gene-Protein networks

Genes: ON/OFF-switches (→ continuous)

RNA & proteins: vectors of information exchange between genes

External inputs: interact with higher-order proteins

Page 12: The Identification of  Scale-Free  Gene-Protein Networks

12

General state space dynamics

The evolution of the n-dimensional state vector x (gene expressions/protein densities) depends on the p-dimensional inputs u, the system parameters θ and Gaussian white noise ξ:

$\dot{x}_i = \frac{dx_i(t)}{dt} = f_i(x_1, x_2, \ldots, x_n, u_1, \ldots, u_p, \theta_1, \ldots, \theta_s, \xi_i)$

Page 13: The Identification of  Scale-Free  Gene-Protein Networks

13

external inputs

input-coupling

genes/proteins

interaction-coupling

Example of a general dynamic network topology

Page 14: The Identification of  Scale-Free  Gene-Protein Networks

14

The general case is too complex

Strongly dependent on unknown microscopic details

Relevant parameters are unidentified and thus unknown

Therefore approximate interaction potentials and qualitative methods seem appropriate

Here are some (of the many, many) practical approaches …

Problems with modeling the general network dynamics

Page 15: The Identification of  Scale-Free  Gene-Protein Networks

15

1. Linear stochastic state-space models

Following P. D'Haeseleer, M. B. Eisen, S. Yeung, P. T. Spellman, and many others

x: the vector (x1, x2, ..., xn), where xi is the relative gene expression of gene i

u: the vector (u1, u2, ..., up), where ui is the value of external input i (e.g. a toxic agent)

νξ(t): white Gaussian noise

$\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u} + \nu\,\boldsymbol{\xi}(t)$
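To make the linear stochastic model concrete, the following is a minimal simulation sketch using the Euler-Maruyama scheme. It is an illustration under stated assumptions, not part of the original slides: the function name `simulate_linear_sde`, the scalar `noise_scale` (standing in for ν), the three-gene matrices and the step size are all hypothetical choices.

```python
import numpy as np

def simulate_linear_sde(A, B, u, x0, noise_scale, dt, n_steps, seed=0):
    """Euler-Maruyama simulation of dx/dt = A x + B u + noise_scale * xi(t).

    noise_scale is assumed to be a scalar; u is held constant over time.
    Returns an array of shape (n_steps + 1, n) with the state at each step.
    """
    rng = np.random.default_rng(seed)
    n = len(x0)
    x = np.empty((n_steps + 1, n))
    x[0] = x0
    for k in range(n_steps):
        drift = A @ x[k] + B @ u
        # White noise integrates to a Gaussian increment with std sqrt(dt).
        noise = noise_scale * np.sqrt(dt) * rng.standard_normal(n)
        x[k + 1] = x[k] + dt * drift + noise
    return x

# Hypothetical 3-gene cascade driven by one constant external input.
A = np.array([[-1.0, 0.0, 0.0],
              [0.5, -1.0, 0.0],
              [0.0, 0.8, -1.0]])
B = np.array([[1.0], [0.0], [0.0]])
u = np.array([1.0])

traj = simulate_linear_sde(A, B, u, x0=np.zeros(3),
                           noise_scale=0.05, dt=0.01, n_steps=2000)
print(traj[-1])  # noisy state near the deterministic steady state -A^{-1} B u
```

Setting `noise_scale=0.0` gives the deterministic trajectory, which is convenient for checking the steady state of a candidate A and B.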

Page 16: The Identification of  Scale-Free  Gene-Protein Networks

16

2. Piecewise Linear Models

Following Mestl, Plahte, Omholt 1995 and others

$b_{il}$: sum of step functions $s^{+}$, $s^{-}$

Page 17: The Identification of  Scale-Free  Gene-Protein Networks

17

3. More complex non-linear interaction models

Example: rational functions = quotient of polynomials:

$\frac{d\mathbf{x}}{dt} = \frac{P_n(\mathbf{x},\mathbf{u})}{P_m(\mathbf{x},\mathbf{u})}$

Example: Michaelis-Menten

Page 18: The Identification of  Scale-Free  Gene-Protein Networks

18

Objectives in reconstruction of (linear) networks

Mathematical model M:

Experimental data D:

Objective: Find the model parameters A and B such that the model M matches the data D.

Page 19: The Identification of  Scale-Free  Gene-Protein Networks

19

Reconstruction of SPARSE LINEAR networks

In most cases the mathematical complexities in finding a realistic network structure are too severe

Therefore, some researchers have introduced new constraints that facilitate the computation

The best example is SPARSITY in a LINEAR network :

Page 20: The Identification of  Scale-Free  Gene-Protein Networks

20

Major Problem in reconstruction of sparse networks

The system is severely under-constrained as there are typically far more model parameters A and B than there is experimental data D.

A useful trick is to assume that the system is heavily sparse and linear [Yeung et al, Guthke et al, …]

In that case the system can be: (i) decomposed row-for-row, and (ii) L1-regression can be employed

Page 21: The Identification of  Scale-Free  Gene-Protein Networks

21

M:

D:

Decoupling: → $D z = p$

Sparsity:

L1-regression: → $\min_{z} \|z\|_1$ subject to: $D z = p$
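The L1-regression step can be sketched as a linear program. The code below is a hedged illustration rather than the method used for the results in these slides: it assumes that D holds the measurements, z is one row of the unknown coefficients and p the corresponding observations, and it uses the standard split z = z_plus − z_minus with SciPy's `linprog`. The problem sizes and the random data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def l1_min_norm(D, p):
    """Solve min ||z||_1 subject to D z = p as a linear program."""
    m, n = D.shape
    c = np.ones(2 * n)             # objective: sum(z_plus) + sum(z_minus)
    A_eq = np.hstack([D, -D])      # D (z_plus - z_minus) = p
    res = linprog(c, A_eq=A_eq, b_eq=p, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

# Hypothetical under-determined problem: 25 measurements, 60 unknowns,
# of which only 3 are non-zero.
rng = np.random.default_rng(0)
n, m, k = 60, 25, 3
z_true = np.zeros(n)
z_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
D = rng.normal(size=(m, n))
p = D @ z_true

z_l1 = l1_min_norm(D, p)           # sparse candidate
z_l2 = np.linalg.pinv(D) @ p       # minimum L2-norm solution, for contrast

print("max error, L1 solution:", np.max(np.abs(z_l1 - z_true)))
print("non-zeros in L2 solution:", np.count_nonzero(np.abs(z_l2) > 1e-6))
```

Because each row of the system is solved separately, the same function can be applied row by row, which is the decoupling referred to on this slide.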

Page 22: The Identification of  Scale-Free  Gene-Protein Networks

22

Result: Above a minimum number Mmin of measurements, and with at most kC non-zeros, the reconstruction is perfect. Mmin is much smaller than in L2-regression. Both Mmin and kC depend on N.

Page 23: The Identification of  Scale-Free  Gene-Protein Networks

23

Critical number Mmin versus the problem size N

Page 24: The Identification of  Scale-Free  Gene-Protein Networks

24

3. Using special architectures of gene-protein networks

So far we have used the fact that biological information-processing networks mostly exhibit only a few connections (= sparse) and that only a few genes and proteins control a considerable number of all others (= hierarchic)

Other interesting network properties are also observed: regular, small world, scale free, exponential, apollonian, …

Page 25: The Identification of  Scale-Free  Gene-Protein Networks

25

Network Architectures

There is more internal structure in a gene-protein network which we can use to derive more powerful constraints, and the most interesting is the Scale-Free (SF) property

Page 26: The Identification of  Scale-Free  Gene-Protein Networks

26

What is the Scale-free property?

In a scale-free network the degree distribution follows a power law.

The degree distribution is the fraction nSF(k) of nodes in the network having k connections to other nodes.

In SF networks this goes (for large values of k) as:

$n_{SF}(k) \sim k^{-\gamma}$

where γ is a constant whose value is typically in the range 1<γ<3, although occasionally it may lie outside these bounds.

Page 27: The Identification of  Scale-Free  Gene-Protein Networks

27

Special Network Architectures

Page 28: The Identification of  Scale-Free  Gene-Protein Networks

28

Special Network Architectures

Page 29: The Identification of  Scale-Free  Gene-Protein Networks

29

Why Scale-free?

Scale-free networks are noteworthy because many empirically observed networks appear to be scale-free, including the world wide web, protein networks, citation networks, and social networks.

Page 30: The Identification of  Scale-Free  Gene-Protein Networks

30

Cumulative degree distributions for six different networks.

Page 31: The Identification of  Scale-Free  Gene-Protein Networks

31

Cumulative degree distributions in the interaction network of genes and proteins in the metabolism of Saccharomyces cerevisiae [Jeong et al, Nature 2001]

Page 32: The Identification of  Scale-Free  Gene-Protein Networks

32

Clustering of co-expression profiles using K-nearest neighbor algorithm

For each node (gene/protein) determine the K closest (= most similar) nodes

Two nodes are joined in the graph if they are in each other's K-nearest neighbor set

Examine the resulting network graph – especially for SF-ness
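The three steps above can be sketched as follows. This is an illustrative sketch only: it assumes Euclidean distance between expression profiles as the similarity measure (the slides do not specify one), and the function name `mutual_knn_graph` and the random test data are hypothetical.

```python
import numpy as np

def mutual_knn_graph(profiles, K):
    """Adjacency matrix joining nodes that are in each other's K-nearest-neighbour set.

    profiles: array of shape (n_nodes, n_samples), one expression profile per row.
    """
    n = profiles.shape[0]
    # Pairwise Euclidean distances between profiles (assumed similarity measure).
    diff = profiles[:, None, :] - profiles[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)             # a node is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :K]  # indices of the K closest nodes
    in_knn = np.zeros((n, n), dtype=bool)
    in_knn[np.repeat(np.arange(n), K), nearest.ravel()] = True
    return in_knn & in_knn.T                   # keep only mutual pairs

rng = np.random.default_rng(1)
profiles = rng.normal(size=(200, 20))          # hypothetical: 200 genes, 20 samples
adj = mutual_knn_graph(profiles, K=5)

degree = adj.sum(axis=1)
counts = np.bincount(degree)                   # counts[k] = number of nodes with degree k
print(counts)
```

Plotting `counts` against the degree on log-log axes is one way to examine the resulting graph for scale-free behaviour.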

Page 33: The Identification of  Scale-Free  Gene-Protein Networks

33

Clustering of co-expression profiles using K-nearest neighbor algorithm

Cumulative distribution F of degree distribution P:

Page 34: The Identification of  Scale-Free  Gene-Protein Networks

34

Colon cancer data of Alon et al. PNAS 1999, Breast cancer data of Perou et al. PNAS 1999

Page 35: The Identification of  Scale-Free  Gene-Protein Networks

35

Order parameter Λ:

Clustering of co-expression profiles using K-nearest neighbor algorithm

H. Agrawal, Physical Review Letters, 2002

Page 36: The Identification of  Scale-Free  Gene-Protein Networks

36

Colon cancer data of Alon et al. PNAS 1999,

Breast cancer data of Perou et al. PNAS 1999

This closes the case for the biological relevance of scale-free networks …

Page 37: The Identification of  Scale-Free  Gene-Protein Networks

37

CENTRAL THOUGHT

Conjecture:

Scalefree-ness in a (gene regulatory) network implies sparsity.

SF is much stronger than sparsity: it also requires a specific distribution of connections in the network, and hence in the connectivity matrix, namely the SF power law.

Not only is a large number of zeros required; the zeros are also grouped in a special manner.

Page 38: The Identification of  Scale-Free  Gene-Protein Networks

38

Relation between Scalefree and Sparse

Define: sparsity = (number of connections) / (n(n-1)/2)

For n = 10,000:

gamma   log(sparsity)
1       -1.5888
2       -6.9176
3       -9.5158
4       -10.7187
5       -11.6457
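The trend in this table, sparser networks for larger γ, can be illustrated with a short calculation. The sketch below makes its own assumptions: degrees follow P(k) ~ k^(-γ) exactly for k = 1, ..., n-1, and the natural logarithm is used. The slides do not state how the tabulated values were obtained, so this sketch need not reproduce them; it only shows the qualitative relation.

```python
import numpy as np

def expected_sparsity(n, gamma):
    """Expected edge density when node degrees follow P(k) ~ k^-gamma, k = 1..n-1."""
    k = np.arange(1, n, dtype=float)
    pk = k ** (-gamma)
    pk /= pk.sum()                        # normalise to a probability distribution
    mean_degree = (k * pk).sum()
    n_edges = n * mean_degree / 2.0       # each edge contributes to two degrees
    return n_edges / (n * (n - 1) / 2.0)  # sparsity as defined on this slide

for gamma in (1, 2, 3, 4, 5):
    print(gamma, np.log(expected_sparsity(10_000, gamma)))
```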

Page 39: The Identification of  Scale-Free  Gene-Protein Networks

39

Page 40: The Identification of  Scale-Free  Gene-Protein Networks

40

Reconstruction of scalefree networks

For these reasons, the reconstruction of networks using the SF-property should be much more effective than reconstruction using sparsity alone

Page 41: The Identification of  Scale-Free  Gene-Protein Networks

41

Requirements

For the reconstruction of a scalefree (gene-protein) interaction system we need:

1. a suitable parametrised formal model

2. a method for optimising the scalefreeness of the system with respect to the model parameters for a given set of measurements (e.g. microarrays)

We will visit these items in the following slides ...

Page 42: The Identification of  Scale-Free  Gene-Protein Networks

42

Philosophy:

The experimental data bound the feasible set of parameters A and B, and the scalefree-ness (SF) of A and B should be as high as possible while remaining consistent with the data D

4. Reconstruction of scalefree gene-protein networks

Page 43: The Identification of  Scale-Free  Gene-Protein Networks

43

For simplicity we assume a non-symmetric SF gene/protein network with linear state space dynamics

Suppose we have a set of M observations of genome-wide expression profiles (e.g. microarrays)

Linear Model of gene-protein networks

Page 44: The Identification of  Scale-Free  Gene-Protein Networks

44

Linearized form of a subsystem

A first-order linear approximation of the system separates the state vector x and the inputs u:

$\frac{d\mathbf{x}}{dt} = A\mathbf{x} + B\mathbf{u}$

Page 45: The Identification of  Scale-Free  Gene-Protein Networks

45

Experimental Data:

Now, suppose that we have M data items (e.g. microarray measurements) that we want to map to the network:

Page 46: The Identification of  Scale-Free  Gene-Protein Networks

46

The relation between the desired patterns (state derivatives, states and inputs) defines constraints on the system matrices A and B, which have to be computed.

Data Match

Page 47: The Identification of  Scale-Free  Gene-Protein Networks

47

If you don’t like a continuous model just use a discrete model:

$X[k+1] = A X[k] + B U[k]$

Data Match

Page 48: The Identification of  Scale-Free  Gene-Protein Networks

48

Now compute the observed degree-distribution in the system matrix M :

DegDist(k,M) : the number of nodes with degree k

As we are now dealing with a directed graph, there is a difference between in-coming and out-going connections. We will here consider only the out-degree.

Note that the hierarchy of the network relates to the in-degree.

Scalefree-ness

Page 49: The Identification of  Scale-Free  Gene-Protein Networks

49

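The out-degree Degree(k,C) of node k in a connectivity matrix C is the sum of the k-th column:

$\mathrm{Degree}(k,C) = \sum_m c_{mk} = (\mathbf{1}^T C)_k$

Scalefree-ness

Example:

$C = \begin{pmatrix} 0&1&1&0 \\ 0&1&1&1 \\ 0&0&1&0 \\ 1&0&0&1 \end{pmatrix}, \qquad \mathrm{Degree}(\cdot,C) = \begin{pmatrix} 1&1&1&1 \end{pmatrix} C = \begin{pmatrix} 1&2&3&2 \end{pmatrix}$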
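The column-sum computation in this example can be checked with a few lines of NumPy. The matrix is the 4×4 example from the slide; the variable names are illustrative.

```python
import numpy as np

# Connectivity matrix from the example on this slide.
C = np.array([[0, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 0],
              [1, 0, 0, 1]])

out_degree = np.ones(4, dtype=int) @ C            # 1^T C, same as C.sum(axis=0)
deg_dist = np.bincount(out_degree, minlength=5)   # deg_dist[k] = number of nodes with out-degree k

print(out_degree)   # [1 2 3 2]
print(deg_dist)     # [0 1 2 1 0]
```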

Page 50: The Identification of  Scale-Free  Gene-Protein Networks

50

The degree is the basis for computing the degree distribution DegDist(k,C) of a connectivity matrix C. How can we determine the connectivity matrix for an arbitrary interaction matrix M like the matrices A and B in our linear model?

Answer: we approximate the connectivity matrix of M to an accuracy ε as Cε(M), similar to the approximation δε(x) of the δ-function δ(x) in measure theory …

Scalefree-ness

Page 51: The Identification of  Scale-Free  Gene-Protein Networks

51

Approximation to accuracy ε of the Kronecker delta function δ(x – 0.5) (=1 if x=0.5 and 0 elsewhere) for various values of ε …

Page 52: The Identification of  Scale-Free  Gene-Protein Networks

52

The out-degree Degree(k,M) is approximated by the k-th column sum of Cε(M), an approximation to accuracy ε of the connectivity matrix of M:

$\mathrm{Degree}(k,M) \approx \sum_i C_\varepsilon(m_{ik})$
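One possible smooth approximation can be sketched as follows. The slides do not give the functional form of Cε, so the choice m²/(m² + ε²) below is an assumption made for illustration; it is close to 0 for |m| much smaller than ε and close to 1 for |m| much larger than ε. The function names and the example matrix are hypothetical.

```python
import numpy as np

def soft_connectivity(M, eps):
    """Smooth 0/1 indicator of non-zero entries (assumed form, not from the slides)."""
    M = np.asarray(M, dtype=float)
    return M**2 / (M**2 + eps**2)

def soft_out_degree(M, eps):
    """Approximate out-degree of node k as the k-th column sum of C_eps(M)."""
    return soft_connectivity(M, eps).sum(axis=0)

# Hypothetical interaction matrix with the same zero pattern as the 4x4 example above.
M = np.array([[0.0, 2.0, -1.5, 0.0],
              [0.0, 0.7, 3.0, -2.0],
              [0.0, 0.0, 1.2, 0.0],
              [-0.9, 0.0, 0.0, 1.1]])

for eps in (0.5, 0.1, 0.01):
    print(eps, np.round(soft_out_degree(M, eps), 3))
```

As ε shrinks, the soft out-degrees approach the exact column counts of non-zero entries. Unlike those counts, the approximation is differentiable in the entries of M, which is what a continuous optimiser needs.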

Scalefree-ness

Page 53: The Identification of  Scale-Free  Gene-Protein Networks

53

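Next, compare the observed degree distribution in matrix M, DegDistε(k,M), with the degree distribution of a ‘perfect’ scalefree (SF) network: $P_N(k,\gamma) \sim k^{-\gamma}$

Scalefree-ness

$SF(M,\gamma) = -\tfrac{1}{2} \sum_k \left( \mathrm{DegDist}_\varepsilon(k,M) - P_N(k,\gamma) \right)^2$

With this sign, SF is largest (zero) when the observed distribution matches the perfect SF distribution, so a larger SF means a better fit.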
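The comparison between the observed and the perfect scale-free degree distribution can be sketched as half a sum of squared differences. Everything below is a sketch under assumptions not stated in the slides: the smooth connectivity m²/(m² + ε²), the Gaussian bump of width `width` used as a smooth stand-in for the Kronecker delta, the normalisation of P_N so that its counts sum to N, and all function names.

```python
import numpy as np

def soft_out_degree(M, eps):
    """Column sums of an assumed smooth connectivity m^2 / (m^2 + eps^2)."""
    M = np.asarray(M, dtype=float)
    return (M**2 / (M**2 + eps**2)).sum(axis=0)

def soft_degree_distribution(M, eps, width=0.25):
    """DegDist_eps(k, M) for k = 1..N: soft count of nodes with out-degree k."""
    N = M.shape[1]
    k = np.arange(1, N + 1)
    d = soft_out_degree(M, eps)
    # Gaussian bump centred on each integer k, summed over all nodes.
    return np.exp(-((d[None, :] - k[:, None]) / width) ** 2).sum(axis=1)

def perfect_sf_distribution(N, gamma):
    """P_N(k, gamma) ~ k^-gamma for k = 1..N, scaled so the counts sum to N."""
    k = np.arange(1, N + 1, dtype=float)
    p = k ** (-gamma)
    return N * p / p.sum()

def sf_misfit(M, gamma, eps=1e-3):
    """Half the summed squared difference between observed and perfect SF distributions."""
    N = M.shape[1]
    diff = soft_degree_distribution(M, eps) - perfect_sf_distribution(N, gamma)
    return 0.5 * np.sum(diff ** 2)

# Hypothetical 4x4 interaction matrix with out-degrees 1, 2, 3, 2.
M = np.array([[0.0, 2.0, -1.5, 0.0],
              [0.0, 0.7, 3.0, -2.0],
              [0.0, 0.0, 1.2, 0.0],
              [-0.9, 0.0, 0.0, 1.1]])

for gamma in (1.0, 2.0, 3.0):
    print(gamma, sf_misfit(M, gamma))
```

A smaller misfit means the network is closer to the perfect scale-free distribution for that γ.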

Page 54: The Identification of  Scale-Free  Gene-Protein Networks

54

Computing the optimal A and B for a Scale Free network

Suppose the matrices A and B are Scale Free with fixed parameter γ. Let SF(A, γ) measure the fit between a perfect SF network (of the same size) and the network A.

Using continuous optimization techniques this problem can be defined as:

SF Reconstruction: STEP 1

$\max_{A,B} \; SF(A,\gamma_A) + SF(B,\gamma_B) \quad \text{subject to} \quad \dot{X} = AX + BU$

Page 55: The Identification of  Scale-Free  Gene-Protein Networks

55

Now fix the matrices A and B and determine the optimal Scale Free parameters γA and γB.

Again using continuous optimization techniques this problem can be defined as:

SF Reconstruction: STEP 2

$\max_{\gamma_A,\gamma_B} \; SF(A,\gamma_A) + SF(B,\gamma_B)$
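Step 2 is a low-dimensional problem once A and B are fixed, since only the exponents remain free. The sketch below illustrates it for a single matrix, represented by its degree distribution. It is an assumption-laden illustration, not the author's implementation: it uses a squared-error misfit, a perfect SF distribution normalised to the number of nodes, and a bounded scalar search from SciPy; `best_gamma` and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def best_gamma(deg_dist, bounds=(0.5, 5.0)):
    """Find the exponent gamma whose perfect SF distribution best matches deg_dist.

    deg_dist[j] is the (possibly soft) number of nodes with degree j + 1.
    """
    deg_dist = np.asarray(deg_dist, dtype=float)
    k = np.arange(1, len(deg_dist) + 1, dtype=float)
    n_nodes = deg_dist.sum()

    def misfit(gamma):
        p = k ** (-gamma)
        p = n_nodes * p / p.sum()          # perfect SF counts for this gamma
        return 0.5 * np.sum((deg_dist - p) ** 2)

    res = minimize_scalar(misfit, bounds=bounds, method="bounded")
    return res.x

# Synthetic check: a degree distribution generated with gamma = 2.2.
k = np.arange(1, 51, dtype=float)
target = 1000 * k ** (-2.2) / (k ** (-2.2)).sum()
print(best_gamma(target))
```

In a tandem scheme this scalar search would be run once for A and once for B after every update of the matrices.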

Page 56: The Identification of  Scale-Free  Gene-Protein Networks

56

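SF Reconstruction: Tandem Approach

1. fix parameter γ and optimise over A and B:

$\max_{A,B} \; SF(A,\gamma_A) + SF(B,\gamma_B) \quad \text{subject to} \quad \dot{X} = AX + BU$

2. fix A and B and optimise over γA and γB:

$\max_{\gamma_A,\gamma_B} \; SF(A,\gamma_A) + SF(B,\gamma_B)$

Repeat steps 1 and 2 until a certain convergence criterion is met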

Page 57: The Identification of  Scale-Free  Gene-Protein Networks

57

Some results of comparing Sparse and SF reconstruction

Page 58: The Identification of  Scale-Free  Gene-Protein Networks

58

Number of reconstruction errors as a function of the number of nonzero entries k, with M = 150 patterns and N = 50000 genes.

kC, kSF

Reconstruction using only sparsity

Same data but now scalefree

Page 59: The Identification of  Scale-Free  Gene-Protein Networks

59

kC

Number of reconstruction errors versus M with fixed N = 50000, k = 10.

Same data but now scalefree

kSF

Reconstruction using only sparsity

Page 60: The Identification of  Scale-Free  Gene-Protein Networks

60

Critical number of patterns Mcrit versus the problem size N

sparse

SF

Page 61: The Identification of  Scale-Free  Gene-Protein Networks

61

The system cannot be decoupled as in sparse estimation (at least with the L2-norm)

This means that the entire network has to be considered, resulting in long computation times.

If the underlying network is not scalefree, this approach of course does not work

SF Reconstruction: Disadvantages

Page 62: The Identification of  Scale-Free  Gene-Protein Networks

62

5. Conclusions

Assuming sparsity in linear time-invariant state space models for gene-protein networks allows effective network reconstruction from a small amount of data.

The scale-free property in networks implies network sparsity, and moreover requires a specific degree distribution. It is therefore a much stronger constraint than sparsity, and it is biologically plausible.

A mathematical tandem approach is able to fit an SF network architecture to observed data. The attractive computational properties of sparse identification, however, seem to be lost.

Page 63: The Identification of  Scale-Free  Gene-Protein Networks

63

* Jeong, H., Mason, S., Barabasi, A.-L., and Oltvai, Z. N., Lethality and centrality in protein networks, Nature 411, 41–42 (2001).

* Hee YK, Hak YK, Functional modules from protein networks of kinome and cell cycle in Saccharomyces cerevisiae, Proc. of IEEE Computational Systems Bioinformatics Conference (CSB), 2004.

* Jordan IK, Mariño-Ramírez L, Wolf YI, Koonin EV. Conservation and coevolution in the scale-free human gene coexpression network, Mol Biol Evol. 2004 Nov; 21(11):pp. 2058-70.

Some key references

Page 64: The Identification of  Scale-Free  Gene-Protein Networks

64

Other members of the Computational Lifesciences Team

• Jordi Heijman (PhD student) 1,3
• Stef Zeemering (PhD student) 1
• Ralf Peeters 1
• Goele Hollanders (PhD student) 1,2
• Geert Jan Bex 2
• Marc Gyssens 2
• Yoram Rudy 3,1 (visiting professor)

1: University of Maastricht, Dep. Mathematics (Netherlands)
2: University of Hasselt, Dep. Computer Science (Belgium)
3: Rudy Lab at Washington University (St. Louis, USA)

Page 65: The Identification of  Scale-Free  Gene-Protein Networks

65

Page 66: The Identification of  Scale-Free  Gene-Protein Networks

66

Discussion …