Predicting protein function from heterogeneous data Prof. William Stafford Noble GENOME 541 Intro to Computational Molecular Biology


Predicting protein function from heterogeneous data

Prof. William Stafford Noble

GENOME 541: Intro to Computational Molecular Biology

We can frame functional annotation as a classification task

• Many possible types of labels:
– Biological process
– Molecular function
– Subcellular localization

• Many possible inputs:
– Gene or protein sequence
– Expression profile
– Protein-protein interactions
– Genetic associations

Classifier

Is gene X a penicillin amidase?

Yes

Outline

• Bayesian networks
• Support vector machines
• Network diffusion / message passing

Annotation transfer

• Rule: If two proteins are linked with high confidence, and one protein’s function is unknown, then transfer the annotation.

Protein of known function

Protein of unknown function
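The transfer rule can be sketched in a few lines of Python. This is only an illustration: the link list, the confidence threshold of 0.9, and the protein names are hypothetical and do not come from the lecture.

```python
def transfer_annotations(links, annotations, threshold=0.9):
    """Apply the annotation transfer rule.

    links: iterable of (protein_a, protein_b, confidence) tuples.
    annotations: dict mapping proteins of known function to a set of labels.
    threshold: hypothetical cutoff for "high confidence".
    """
    predicted = {}
    for a, b, confidence in links:
        if confidence < threshold:
            continue  # link is not confident enough to transfer anything
        # Try the transfer in both directions along the link.
        for known, unknown in ((a, b), (b, a)):
            if known in annotations and unknown not in annotations:
                predicted.setdefault(unknown, set()).update(annotations[known])
    return predicted


# Hypothetical toy data.
links = [("P1", "P2", 0.95), ("P2", "P3", 0.40), ("P4", "P1", 0.92)]
annotations = {"P1": {"penicillin amidase"}}
predicted = transfer_annotations(links, annotations)
print(predicted)
```

Here P2 and P4 inherit the label from P1, while P3 stays unannotated because its only link falls below the threshold.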

Bayesian networks (Troyanskaya PNAS 2003)

Burglary    Earthquake

Alarm

John calls    Mary calls

P(B) = 0.001
P(E) = 0.002

P(A|B,E) = 0.95
P(A|B,¬E) = 0.94
P(A|¬B,E) = 0.29
P(A|¬B,¬E) = 0.001

P(J|A) = 0.90
P(J|¬A) = 0.05

P(M|A) = 0.70
P(M|¬A) = 0.01
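As a worked example of inference in this network, the sketch below computes the probability of a burglary given that both John and Mary call, by summing the joint distribution over the hidden variables. The probabilities are the ones on the slide. The query and the helper names are chosen here for illustration, and the code assumes the usual structure in which Alarm depends on Burglary and Earthquake, and each call depends only on Alarm.

```python
from itertools import product

# Conditional probability tables from the slide.
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # keyed by (burglary, earthquake)
P_J = {True: 0.90, False: 0.05}  # keyed by alarm
P_M = {True: 0.70, False: 0.01}  # keyed by alarm


def bernoulli(p_true, value):
    """Probability that a binary variable takes `value`, given P(True)."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """Joint probability of one full assignment, using the network factorization."""
    return (bernoulli(P_B, b) * bernoulli(P_E, e) * bernoulli(P_A[(b, e)], a)
            * bernoulli(P_J[a], j) * bernoulli(P_M[a], m))


def posterior_burglary(j=True, m=True):
    """P(Burglary | John calls = j, Mary calls = m) by enumeration."""
    weights = {True: 0.0, False: 0.0}
    for b, e, a in product([True, False], repeat=3):
        weights[b] += joint(b, e, a, j, m)
    return weights[True] / (weights[True] + weights[False])


p = posterior_burglary()
print(round(p, 3))  # approximately 0.284
```

Even with both neighbors calling, a burglary remains less likely than not, because the prior P(B) is so small.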

Create one network per gene pair

Probability that genes A and B are functionally linked

A

B

Data type 1

Data type 2

Data type 3


Bayesian Network

Conditional probability tables

• A pair of yeast proteins that have a physical association will have a positive affinity precipitation result 75% of the time and a negative result in the remaining 25%.

• Two proteins that do not physically interact in vivo will have a positive affinity precipitation result in 5% of the experiments, and a negative one in 95%.
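To see how such a table is used, the sketch below applies Bayes' rule to update the belief that two proteins interact after one affinity precipitation result. The 75% and 5% rates are from the slide; the prior of 0.01 is a hypothetical value picked only to make the arithmetic concrete.

```python
def posterior_interaction(prior, positive_result):
    """P(interaction | one affinity precipitation result) via Bayes' rule."""
    p_pos_given_interaction = 0.75      # from the conditional probability table
    p_pos_given_no_interaction = 0.05   # from the conditional probability table
    if positive_result:
        like_int = p_pos_given_interaction
        like_not = p_pos_given_no_interaction
    else:
        like_int = 1.0 - p_pos_given_interaction
        like_not = 1.0 - p_pos_given_no_interaction
    numerator = like_int * prior
    return numerator / (numerator + like_not * (1.0 - prior))


prior = 0.01  # hypothetical prior probability that a random pair interacts
post_pos = posterior_interaction(prior, True)
post_neg = posterior_interaction(prior, False)
print(post_pos, post_neg)
```

A positive result multiplies the odds of interaction by 0.75 / 0.05 = 15, while a negative result shrinks them by a factor of 0.25 / 0.95.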


Inputs

• Protein-protein interaction data from GRID.

• Transcription factor binding sites data from SGD.

• Stress-response microarray data set.


ROC analysis

Using Gene Ontology biological process annotation as the gold standard.

Pros and cons

+ Bayesian network framework is rigorous.
+ Exploits expert knowledge.
- Does not (yet) learn from data.
- Treats each gene pair independently.

The SVM is a hyperplane classifier

[Figure: scatter plot of positive (+) and negative (-) examples]

Locate a plane that separates positive from negative examples.

Focus on the examples closest to the boundary.

Four key concepts

1. Separating hyperplane

2. Maximum margin hyperplane

3. Soft margin

4. Kernel function (input space → feature space)

Input space

            gene1   gene2
patient1    -1.7     2.1
patient2     0.3     0.5
patient3    -0.4     1.9
patient4    -1.3     0.2
patient5     0.9    -1.2

[Figure: patients 1 to 5 plotted as points, with gene1 and gene2 as the axes]


• Each subject may be thought of as a point in an m-dimensional space.


Separating hyperplane

• Construct a hyperplane separating ALL from AML subjects.


Choosing a hyperplane

• For a given set of data, many possible separating hyperplanes exist.


Maximum margin hyperplane

• Choose the separating hyperplane that is farthest from any training example.


Support vectors

• The location of the hyperplane is specified via a weight associated with each training example.

• Examples near the hyperplane receive non-zero weights and are called support vectors.

Soft margin

• When no separating hyperplane exists, the SVM uses a soft margin hyperplane with minimal cost.

• A parameter C specifies the relative cost of a misclassification versus the size of the margin.
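The trade-off controlled by C can be made concrete with a small sketch. It assumes the hinge loss max(0, 1 - y f(x)) as the misclassification cost and a linear decision function f(x) = w · x + b, which is the usual soft-margin formulation. The data points and the candidate hyperplane are hypothetical.

```python
def decision(w, b, x):
    """Linear decision function f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of hinge losses over the training set."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    hinge_total = sum(max(0.0, 1.0 - yi * decision(w, b, xi))
                      for xi, yi in zip(X, y))
    return margin_term + C * hinge_total


# Hypothetical toy data: the second point lies inside the margin.
X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0)]
y = [1, 1, -1]
w, b = (1.0, 0.0), 0.0

print(soft_margin_objective(w, b, X, y, C=1.0))   # 0.5 + 1 * 0.5 = 1.0
print(soft_margin_objective(w, b, X, y, C=10.0))  # 0.5 + 10 * 0.5 = 5.5
```

With a large C the margin violation dominates the objective, so the optimizer would prefer a narrower margin that classifies the second point with more room; with a small C it tolerates the violation in exchange for a wider margin.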

Incorrectly measured or labeled data

No separating hyperplane exists

The separating hyperplane does not generalize well


Soft margin


The kernel function

• “The introduction of SVMs was very good for the most part, but I got confused when you began to talk about kernels.”

• “I found the discussion of kernel functions to be slightly tough to follow.”

• “I understood most of the lecture. The part that was more challenging was the kernel functions.”

• “Still a little unclear on how the kernel is used in the SVM.”


Why kernels?

Separating previously inseparable data


Input space to feature space

• SVMs first map the data from the input space to a higher-dimensional feature space.

Kernel function as dot product

• Consider two training examples A = (a1, a2) and B = (b1, b2).

• Define a mapping from input space to feature space: φ(X) = (x1x1, x1x2, x2x1, x2x2)

• Let K(X,Y) = (X • Y)^2

• Write φ(A) • φ(B) in terms of K.

φ(A) • φ(B)

= (a1a1, a1a2, a2a1, a2a2) • (b1b1, b1b2, b2b1, b2b2)

= a1a1b1b1 + a1a2b1b2 + a2a1b2b1 + a2a2b2b2

= a1b1a1b1 + a1b1a2b2 + a2b2a1b1 + a2b2a2b2

= (a1b1 + a2b2) (a1b1 + a2b2)

= [(a1, a2) • (b1, b2)]^2

= (A • B)^2

= K(A, B)
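The identity can be checked numerically. The sketch below builds the four-dimensional feature vectors explicitly and compares their dot product with the squared dot product computed in the two-dimensional input space. The example vectors are arbitrary.

```python
def phi(x):
    """Explicit feature map from 2D input space to 4D feature space."""
    x1, x2 = x
    return (x1 * x1, x1 * x2, x2 * x1, x2 * x2)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def K(x, y):
    """Kernel computed directly in input space: (X . Y)^2."""
    return dot(x, y) ** 2


A = (1.0, 2.0)
B = (3.0, -1.0)

lhs = dot(phi(A), phi(B))  # dot product in feature space
rhs = K(A, B)              # kernel evaluated in input space
print(lhs, rhs)            # both are 1.0
```

The kernel gives the same value without ever constructing the feature-space vectors, which is the point of the derivation.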


Separating in 2D with a 4D kernel


“Kernelizing” Euclidean distance


Kernel function

• The kernel function plays the role of the dot product operation in the feature space.

• The mapping from input to feature space is implicit.

• Using a kernel function avoids representing the feature space vectors explicitly.

• Any continuous, positive semi-definite function can act as a kernel function.

Need for “positive semidefinite” for kernel function unclear.

Proof of Mercer’s Theorem: Intro to SVMs by Cristianini and Shawe-Taylor, 2000, pp. 33-35.

$$K(X, Y) = X \cdot Y$$

$$K(X, Y) = (X \cdot Y + 1)^3$$

$$K(X, Y) = \exp\left(-\frac{\|X - Y\|^2}{2\sigma^2}\right)$$
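For reference, here is a minimal sketch of three commonly used kernels, which appear to be the ones shown on the preceding slides: the linear kernel, a degree-3 polynomial kernel with offset 1, and the Gaussian (radial basis function) kernel. The parameter names and default values are assumptions made for this example.

```python
import math


def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def linear_kernel(x, y):
    """K(X, Y) = X . Y"""
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, offset=1.0):
    """K(X, Y) = (X . Y + offset)^degree"""
    return (dot(x, y) + offset) ** degree


def gaussian_kernel(x, y, sigma=1.0):
    """K(X, Y) = exp(-||X - Y||^2 / (2 sigma^2))"""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


X = (1.0, 2.0)
Y = (3.0, -1.0)
print(linear_kernel(X, Y))       # 1.0
print(polynomial_kernel(X, Y))   # (1 + 1)^3 = 8.0
print(gaussian_kernel(X, Y))     # exp(-13 / 2)
```

A small sigma makes the Gaussian kernel very local, which is one way to end up overfitting the training set.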


Overfitting with a Gaussian kernel

The SVM learning problem

• Input: training vectors x1 … xn and labels y1 … yn.
• Output: bias b plus one weight wi per training example.
• The weights specify the location of the separating hyperplane.
• The optimization problem is a convex, quadratic optimization.
• It can be solved using standard packages such as MATLAB.

$$f(x) = w^T x + b$$

$$c(f(x), y) = \max(0,\ 1 - y f(x))$$

$$\arg\min_{w}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} c(f(x_i), y_i)$$

SVM prediction architecture

Query = x

x1   x2   x3   ...   xn

k    k    k    ...   k

w1   w2   w3   ...   wn
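The architecture corresponds to scoring a query by a weighted sum of kernel values against the training examples, plus the bias. In the sketch below the weights are assumed to already carry the sign of each training label, and the training points, weights, and bias are hypothetical rather than the result of actual training.

```python
def linear_kernel(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))


def svm_predict(query, training_examples, weights, bias, kernel):
    """Score = sum_i w_i * k(x_i, query) + b; the sign gives the predicted class."""
    score = sum(w * kernel(x_i, query)
                for x_i, w in zip(training_examples, weights)) + bias
    label = 1 if score >= 0 else -1
    return score, label


# Hypothetical trained model with two support vectors.
training_examples = [(1.0, 1.0), (-1.0, -1.0)]
weights = [0.5, -0.5]   # assumed to include the label sign
bias = 0.0

score, label = svm_predict((2.0, 0.5), training_examples, weights, bias, linear_kernel)
print(score, label)  # 2.5 1
```

Training examples with zero weight contribute nothing to the sum, so only the support vectors need to be stored for prediction.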


Learning gene classes

[Figure: Learner, Model, and Predictor pipeline. Labels: Class (MYGD); Training set and Test set, each with 79 experiments (Eisen et al.); 3500 genes; 2465 genes.]

Class prediction

Class                          FP   FN    TP    TN
TCA                             4    9     8  2446
Respiration chain complexes     6    8    22  2431
Ribosome                        7    3   118  2339
Proteasome                      3    8    27  2429
Histone                         0    2     9  2456
Helix-turn-helix                0   16     0  2451
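Raw counts like these are easier to compare once converted to rates. The sketch below computes sensitivity and precision for each class from the FP, FN, TP, and TN counts in the table. Choosing these two metrics is a suggestion made here, not something stated on the slide.

```python
# (FP, FN, TP, TN) per class, copied from the table.
counts = {
    "TCA": (4, 9, 8, 2446),
    "Respiration chain complexes": (6, 8, 22, 2431),
    "Ribosome": (7, 3, 118, 2339),
    "Proteasome": (3, 8, 27, 2429),
    "Histone": (0, 2, 9, 2456),
    "Helix-turn-helix": (0, 16, 0, 2451),
}


def sensitivity(fp, fn, tp, tn):
    """Fraction of true class members that were recovered."""
    return tp / (tp + fn) if (tp + fn) else None


def precision(fp, fn, tp, tn):
    """Fraction of positive predictions that were correct (None if no predictions)."""
    return tp / (tp + fp) if (tp + fp) else None


for name, c in counts.items():
    print(name, sensitivity(*c), precision(*c))
```

The helix-turn-helix class has no positive predictions at all, so its precision is undefined and its sensitivity is zero, in contrast to the ribosome class.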


SVM outperforms other methods

Predictions of gene function

Fleischer et al. “Systematic identification and functional screens of uncharacterized proteins associated with eukaryotic ribosomal complexes” Genes Dev, 2006.

Overview

• 218 human tumor samples spanning 14 common tumor types
• 90 normal samples
• 16,063 “genes” measured per sample
• Overall SVM classification accuracy: 78%.
• Random classification accuracy: 1/14 ≈ 7%.


Summary: Support vector machine learning

• The SVM learning algorithm finds a linear decision boundary.

• The hyperplane maximizes the margin, i.e., the distance to the nearest training example.

• The optimization is convex; the solution is sparse.

• A soft margin allows for noise in the training set.

• A complex decision surface can be learned by using a non-linear kernel function.

Cost/Benefits of SVMs

+ SVMs perform well in high-dimensional data sets with few examples.

+ Convex optimization implies that you get the same answer every time.

+ Kernel functions allow encoding of prior knowledge.

+ Kernel functions handle arbitrary data types.

– The hyperplane does not provide a good explanation, especially with a non-linear kernel function.


Vector representation

• Each matrix entry is an mRNA expression measurement.

• Each column is an experiment.

• Each row corresponds to a gene.

Similarity measurement

• Normalized scalar product

• Similar vectors receive high values, and vice versa.

$$K(X, Y) = \frac{\sum_i X_i Y_i}{\sqrt{\sum_i X_i X_i}\ \sqrt{\sum_i Y_i Y_i}}$$

Similar

Dissimilar
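A minimal sketch of the normalized scalar product follows. It assumes both expression vectors have the same length and are not all zeros; the example vectors are made up.

```python
import math


def normalized_scalar_product(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    numerator = sum(xi * yi for xi, yi in zip(x, y))
    denominator = (math.sqrt(sum(xi * xi for xi in x))
                   * math.sqrt(sum(yi * yi for yi in y)))
    return numerator / denominator


print(normalized_scalar_product((1.0, 2.0, 3.0), (2.0, 4.0, 6.0)))  # close to 1: same shape
print(normalized_scalar_product((1.0, 0.0), (0.0, 1.0)))            # 0: unrelated
print(normalized_scalar_product((1.0, 2.0), (-1.0, -2.0)))          # close to -1: opposite
```

Because of the normalization, two genes whose profiles differ only in overall magnitude are treated as maximally similar.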


Kernel matrix

Sequence kernels

>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH

>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI

• We cannot compute a scalar product on a pair of variable-length, discrete strings.


Pairwise comparison kernel


Protein-protein interactions

• Pairwise interactions can be represented as a graph or a matrix.

1 0 0 1 0 1 0 1
1 0 1 0 1 1 0 1
0 0 0 0 1 1 0 0
0 0 1 0 1 1 0 1
0 0 1 0 1 0 0 1
1 0 0 0 0 0 0 1
0 0 1 0 1 0 0 0

[Axis labels: protein (rows), protein (columns)]

Linear interaction kernel

• The simplest kernel counts the number of interaction partners shared by each pair of proteins.

1 0 0 1 0 1 0 1
1 0 1 0 1 1 0 1
0 0 0 0 1 1 0 0
0 0 1 0 1 1 0 1
0 0 1 0 1 0 0 1
1 0 0 0 0 0 0 1
0 0 1 0 1 0 0 0

For example, the first two rows have a 1 in the same column three times, so their kernel value is 3.
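The linear interaction kernel is just the dot product between rows of the interaction matrix. The sketch below computes the full kernel matrix for the rows shown on the slide; the function name is chosen for this example.

```python
# Interaction matrix from the slide: one row per protein, 1 = interaction observed.
interactions = [
    [1, 0, 0, 1, 0, 1, 0, 1],
    [1, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 0, 0, 0],
]


def linear_kernel_matrix(rows):
    """Kernel value for rows r and s = number of columns where both have a 1."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]


K = linear_kernel_matrix(interactions)
print(K[0][1])  # 3: the first two proteins share three interaction partners
print(K[0][0])  # 4: the diagonal counts each protein's own interactions
```

The resulting matrix is symmetric, as any kernel matrix must be.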


Diffusion kernel

• A general method for establishing similarities between nodes of a graph.

• Based upon a random walk.

• Efficiently accounts for all paths connecting two nodes, weighted by path lengths.


Hydrophobicity profile

• Transmembrane regions are typically hydrophobic, and vice versa.

• The hydrophobicity profile of a membrane protein is evolutionarily conserved.

Membrane protein

Non-membrane protein

Hydrophobicity kernel

• Generate hydropathy profile from amino acid sequence using Kyte-Doolittle index.

• Prefilter the profiles.

• Compare two profiles by
– Computing fast Fourier transform (FFT), and
– Applying Gaussian kernel function.

• This kernel detects periodicities in the hydrophobicity profile.
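The steps above can be sketched roughly as follows. Several details are assumptions, since the slide does not specify them: the prefilter is taken to be a moving average with a window of 7, the profiles are zero-padded or truncated to a fixed length of 32, a naive discrete Fourier transform stands in for the FFT, and the Gaussian kernel is applied to the magnitude spectra with sigma = 10.

```python
import cmath
import math

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}


def hydropathy_profile(seq, window=7):
    """Per-residue hydropathy, smoothed by a moving average (assumed prefilter)."""
    values = [KD[aa] for aa in seq]
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def dft_magnitudes(signal, n):
    """Magnitude spectrum of the signal, zero-padded or truncated to length n."""
    padded = (list(signal) + [0.0] * n)[:n]
    return [abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(padded)))
            for k in range(n)]


def hydrophobicity_kernel(seq_a, seq_b, n=32, sigma=10.0):
    """Gaussian kernel between the magnitude spectra of two hydropathy profiles."""
    fa = dft_magnitudes(hydropathy_profile(seq_a), n)
    fb = dft_magnitudes(hydropathy_profile(seq_b), n)
    squared_distance = sum((a - b) ** 2 for a, b in zip(fa, fb))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


seq_a = "GDIFYPGYCPDVKPVNDFDL"  # first 20 residues of ICYA_MANSE
seq_b = "MKCLLLALALTCGAQALIVT"  # first 20 residues of LACB_BOVIN
print(hydrophobicity_kernel(seq_a, seq_a))  # 1.0 for identical sequences
print(hydrophobicity_kernel(seq_a, seq_b))
```

Working with magnitude spectra makes the comparison insensitive to where a periodic pattern starts in the sequence, which fits the goal of detecting periodicities.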

Combining kernels

[Figure: two routes for combining data sets A and B. Either compute K(A) and K(B) separately and sum them, K(A)+K(B), or concatenate the data into A:B and compute K(A:B). The two results are labeled "Identical".]
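For a linear kernel, the equivalence between summing kernels and concatenating the data can be verified directly. The two small data sets below are hypothetical, with one row per gene, and the check is only claimed for the linear kernel.

```python
def linear_kernel_matrix(data):
    """Kernel matrix of pairwise dot products between rows."""
    return [[sum(a * b for a, b in zip(r, s)) for s in data] for r in data]


# Hypothetical data sets describing the same three genes.
A = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
B = [[2.0], [0.0], [1.0]]

K_A = linear_kernel_matrix(A)
K_B = linear_kernel_matrix(B)

# Route 1: sum the two kernel matrices element by element.
K_sum = [[ka + kb for ka, kb in zip(row_a, row_b)] for row_a, row_b in zip(K_A, K_B)]

# Route 2: concatenate the feature vectors, then compute one kernel.
AB = [a + b for a, b in zip(A, B)]
K_concat = linear_kernel_matrix(AB)

print(K_sum)
print(K_concat)
```

Summing has the practical advantage that each data type can use its own kernel, including kernels for data that cannot be written as fixed-length vectors.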


Semidefinite programming

• Define a convex cost function to assess the quality of a kernel matrix.

• Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.

Semidefinite programming

• Learn K from the convex cone of positive-semidefinite matrices or a convex subset of it: integrate constructed kernels by learning a linear mix,

$$K = \sum_i \mu_i K_i$$

• According to a convex quality measure: large margin classifier (SVM), i.e., maximize the margin.

• Together these define a semidefinite program (SDP).

Markov Random Field

• General Bayesian method, applied by Deng et al. to yeast functional classification.

• Used five different types of data.

• For their model, the input data must be binary.

• Reported improved accuracy compared to using any single data type.

Yeast functional classes

Category                        Size
Metabolism                      1048
Energy                           242
Cell cycle & DNA processing      600
Transcription                    753
Protein synthesis                335
Protein fate                     578
Cellular transport               479
Cell rescue, defense             264
Interaction w/ environment       193
Cell fate                        411
Cellular organization            192
Transport facilitation           306
Other classes                     81

Six types of data

• Presence of Pfam domains.
• Genetic interactions from CYGD.
• Physical interactions from CYGD.
• Protein-protein interaction by TAP.
• mRNA expression profiles.
• (Smith-Waterman scores).

Results

MRF

SDP/SVM (binary)

SDP/SVM (enriched)

Pros and cons

+ Learns relevance of data sets with respect to the problem at hand.

+ Accounts for redundancy among data sets, as well as noise and relevance.

+ Discriminative approach yields good performance.

- Kernel-by-kernel weighting is simplistic.

- In most cases, unweighted kernel combination works fine.

- Does not provide a good explanation.

Network diffusion

GeneMANIA

A rose by any other name …

• Network diffusion
• Random walk with restart
• Personalized PageRank
• Diffusion kernel
• Gaussian random field
• GeneMANIA


Top performing methods

GeneMANIA

• Normalize each network (divide each element by the square root of the product of its row sum and its column sum).

• Learn a weight for each network via ridge regression. Essentially, learn how informative the network is with respect to the task at hand.

• Sum the weighted networks.

• Assign labels to the nodes. Use (n+ + n-)/n for unlabeled genes.

• Perform label propagation in the combined network.

Mostafavi et al. Genome Biology. 9:S4, 2008.
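The normalization and weighted-sum steps can be sketched as follows. The two toy networks are hypothetical, and the network weights are fixed by hand here purely for illustration; in the method described above they would be learned by ridge regression, which this sketch does not attempt.

```python
import math


def normalize_network(W):
    """Divide each element W[i][j] by sqrt(row_sum[i] * col_sum[j])."""
    n = len(W)
    row_sums = [sum(row) for row in W]
    col_sums = [sum(W[i][j] for i in range(n)) for j in range(n)]
    normalized = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            scale = math.sqrt(row_sums[i] * col_sums[j])
            normalized[i][j] = W[i][j] / scale if scale > 0 else 0.0
    return normalized


def combine_networks(networks, weights):
    """Weighted element-wise sum of equally sized networks."""
    n = len(networks[0])
    return [[sum(w * net[i][j] for net, w in zip(networks, weights))
             for j in range(n)] for i in range(n)]


# Two hypothetical association networks over the same three genes.
W1 = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
W2 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

N1 = normalize_network(W1)
N2 = normalize_network(W2)
combined = combine_networks([N1, N2], weights=[0.7, 0.3])  # hand-picked weights
print(combined)
```

The normalization keeps hub genes with many edges from dominating the combined network.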

Random walk with restart

Positive examples

Restart

Size indicates frequency of visit

Final node scores

Label propagation is random walk with restart except:

(a) You restart less often from nodes with many neighbors (i.e., the restart probability of a node is inversely related to its degree)

(b) Nodes with many neighbors have their final node scores scaled up
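For comparison, here is a minimal sketch of plain random walk with restart, without the two degree-dependent modifications listed above. It assumes an undirected graph in which every node has at least one neighbor. The restart probability of 0.3 and the four-node path graph are hypothetical choices.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * (one random-walk step from p)."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    # The walker restarts uniformly at one of the positive examples (seeds).
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [restart * p0[i] for i in range(n)]
        for j in range(n):
            # Node j spreads its probability mass evenly over its edges.
            share = (1.0 - restart) * p[j] / degree[j]
            for i in range(n):
                if adjacency[j][i]:
                    nxt[i] += share * adjacency[j][i]
        change = sum(abs(a - b) for a, b in zip(nxt, p))
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical path graph 0 - 1 - 2 - 3 with node 0 as the only positive example.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = random_walk_with_restart(adjacency, seeds={0})
print([round(s, 3) for s in scores])
```

The scores approximate how often the walker visits each node, so nodes closer to the positive example receive higher scores.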

Label propagation vs SVM

Label propagation

SVM

Performance averaged across 992 yeast Gene Ontology Biological Process categories.