Predicting protein function from
heterogeneous data
Prof. William Stafford Noble
GENOME 541
Intro to Computational Molecular Biology
We can frame functional annotation as a classification task
• Many possible types of labels:
– Biological process
– Molecular function
– Subcellular localization
• Many possible inputs:
– Gene or protein sequence
– Expression profile
– Protein-protein interactions
– Genetic associations
[Figure: a classifier is asked "Is gene X a penicillin amidase?" and outputs "Yes."]
Outline
• Bayesian networks
• Support vector machines
• Network diffusion / message passing
Annotation transfer
• Rule: If two proteins are linked with high confidence, and one protein’s function is unknown, then transfer the annotation.
[Figure: a high-confidence link connects a protein of known function to a protein of unknown function.]
Bayesian networks (Troyanskaya PNAS 2003)
[Figure: the classic alarm network. Burglary and Earthquake are parents of Alarm; Alarm is the parent of John calls and Mary calls.]
P(B) = 0.001, P(E) = 0.002
P(A|B,E) = 0.95, P(A|B,¬E) = 0.94, P(A|¬B,E) = 0.29, P(A|¬B,¬E) = 0.001
P(J|A) = 0.90, P(J|¬A) = 0.05
P(M|A) = 0.70, P(M|¬A) = 0.01
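Inference in this network can be carried out by enumeration: sum the joint probability over the unobserved variables. A minimal Python sketch using the CPT values above:

```python
# Inference by enumeration in the classic alarm network.
# CPT values are taken from the slide above.
from itertools import product

P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    """Probability of one full assignment of the five variables."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

# P(Burglary | JohnCalls=true, MaryCalls=true), summing out E and A.
num = sum(joint(True, e, a, True, True)
          for e, a in product([True, False], repeat=2))
den = num + sum(joint(False, e, a, True, True)
                for e, a in product([True, False], repeat=2))
print(round(num / den, 3))  # 0.284
```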
Create one network per gene pair
[Figure: for each gene pair (A, B), a Bayesian network combines data types 1-3 to compute the probability that genes A and B are functionally linked.]
Conditional probability tables
• A pair of yeast proteins that have a physical association will have a positive affinity precipitation result 75% of the time and a negative result in the remaining 25%.
• Two proteins that do not physically interact in vivo will have a positive affinity precipitation result in 5% of the experiments, and a negative one in 95%.
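These conditional probabilities feed directly into Bayes' rule. In the sketch below, the 1% prior probability that two proteins interact is an assumed value for illustration, not a number from the lecture:

```python
# Applying Bayes' rule to the affinity precipitation CPT above.
# The 1% prior probability of a physical association is an assumed
# value for illustration; it is not from the lecture.
p_pos_given_int = 0.75      # P(positive result | interaction)
p_pos_given_no = 0.05       # P(positive result | no interaction)
prior = 0.01                # assumed P(interaction)

posterior = (p_pos_given_int * prior) / (
    p_pos_given_int * prior + p_pos_given_no * (1 - prior))
print(round(posterior, 3))  # 0.132
```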
Inputs
• Protein-protein interaction data from GRID.
• Transcription factor binding sites data from SGD.
• Stress-response microarray data set.
ROC analysis
Using Gene Ontology biological process annotation as the gold standard.
Pros and cons
+ Bayesian network framework is rigorous.
+ Exploits expert knowledge.
- Does not (yet) learn from data.
- Treats each gene pair independently.
The SVM is a hyperplane classifier
[Figure: positive (+) and negative (-) examples scattered in the plane, with a hyperplane separating the two classes.]
Locate a plane that separates positive from negative examples.
Focus on the examples closest to the boundary.
Four key concepts
1. Separating hyperplane
2. Maximum margin hyperplane
3. Soft margin
4. Kernel function (input space → feature space)
Input space
          gene1  gene2
patient1   -1.7    2.1
patient2    0.3    0.5
patient3   -0.4    1.9
patient4   -1.3    0.2
patient5    0.9   -1.2
• Each subject may be thought of as a point in an m-dimensional space.
Separating hyperplane
• Construct a hyperplane separating ALL from AML subjects.
Choosing a hyperplane
• For a given set of data, many possible separating hyperplanes exist.
Maximum margin hyperplane
• Choose the separating hyperplane that is farthest from any training example.
Support vectors
• The location of the hyperplane is specified via a weight associated with each training example.
• Examples near the hyperplane receive non-zero weights and are called support vectors.
Soft margin
• When no separating hyperplane exists, the SVM uses a soft margin hyperplane with minimal cost.
• A parameter C specifies the relative cost of a misclassification versus the size of the margin.
Incorrectly measured or labeled data
No separating hyperplane exists
The separating hyperplane does not generalize well
Soft margin
The kernel function
• “The introduction of SVMs was very good for the most part, but I got confused when you began to talk about kernels.”
• “I found the discussion of kernel functions to be slightly tough to follow.”
• “I understood most of the lecture. The part that was more challenging was the kernel functions.”
• “Still a little unclear on how the kernel is used in the SVM.”
Why kernels?
Separating previously non-separable data
Input space to feature space
• SVMs first map the data from the input space to a higher-dimensional feature space.
Kernel function as dot product
• Consider two training examples A = (a1, a2) and B = (b1, b2).
• Define a mapping from input space to feature space: Φ(X) = (x1x1, x1x2, x2x1, x2x2)
• Let K(X,Y) = (X · Y)². Write Φ(A) · Φ(B) in terms of K.

Φ(A) · Φ(B)
= (a1a1, a1a2, a2a1, a2a2) · (b1b1, b1b2, b2b1, b2b2)
= a1a1b1b1 + a1a2b1b2 + a2a1b2b1 + a2a2b2b2
= a1b1a1b1 + a1b1a2b2 + a2b2a1b1 + a2b2a2b2
= (a1b1 + a2b2)(a1b1 + a2b2)
= [(a1, a2) · (b1, b2)]²
= (A · B)²
= K(A, B)
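The derivation can be checked numerically; this sketch compares the explicit feature-space dot product against the kernel value for two arbitrary points:

```python
import numpy as np

# Numerical check of the derivation above: the explicit feature map
# phi(X) = (x1x1, x1x2, x2x1, x2x2) has dot products equal to (X . Y)^2.
def phi(x):
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

A = np.array([1.5, -0.5])
B = np.array([2.0, 3.0])

explicit = phi(A) @ phi(B)    # dot product computed in feature space
kernel = (A @ B) ** 2         # K(A, B) = (A . B)^2, computed in input space
assert np.isclose(explicit, kernel)
```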
Separating in 2D with a 4D kernel
“Kernelizing” Euclidean distance
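The same trick extends to distances: the squared Euclidean distance in feature space can be written purely in terms of kernel evaluations, ‖Φ(X) − Φ(Y)‖² = K(X,X) − 2K(X,Y) + K(Y,Y). A sketch using the quadratic kernel from the preceding slides:

```python
import numpy as np

# Squared Euclidean distance in feature space, using only kernel calls:
# ||phi(X) - phi(Y)||^2 = K(X,X) - 2 K(X,Y) + K(Y,Y).
def K(x, y):
    return (x @ y) ** 2      # the quadratic kernel from the previous slides

def kernel_dist_sq(x, y):
    return K(x, x) - 2 * K(x, y) + K(y, y)

def phi(x):                   # explicit feature map, for checking only
    return np.array([x[0]*x[0], x[0]*x[1], x[1]*x[0], x[1]*x[1]])

X = np.array([1.0, 2.0])
Y = np.array([0.5, -1.0])
assert np.isclose(kernel_dist_sq(X, Y), np.sum((phi(X) - phi(Y)) ** 2))
```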
Kernel function
• The kernel function plays the role of the dot product operation in the feature space.
• The mapping from input to feature space is implicit.
• Using a kernel function avoids representing the feature space vectors explicitly.
• Any continuous, positive semi-definite function can act as a kernel function.
• "Need for 'positive semidefinite' for kernel function unclear."
Proof of Mercer’s Theorem: Intro to SVMs by Cristianini and Shawe-Taylor, 2000, pp. 33-35.
Example kernel functions:
K(X, Y) = X · Y
K(X, Y) = (X · Y + 1)³
K(X, Y) = exp(−‖X − Y‖² / 2σ²)
Overfitting with a Gaussian kernel
The SVM learning problem
• Input: training vectors x1 … xn and labels y1 … yn.
• Output: bias b plus one weight wi per training example.
• The weights specify the location of the separating hyperplane.
• The optimization problem is a convex, quadratic optimization.
• It can be solved using standard packages such as MATLAB.
argmin_w (1/2) ‖w‖² + C Σi c(f(xi), yi)
where f(x) = wᵀx + b, and c(f(x), y) = max(0, 1 − y·f(x)) is the hinge loss.
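As a rough illustration of this objective (not the quadratic-programming solver mentioned above), the hinge-loss formulation can be minimized by subgradient descent on toy data:

```python
import numpy as np

# A minimal sketch of the optimization above: minimize
# (1/2)||w||^2 + C * sum_i max(0, 1 - y_i f(x_i)),  f(x) = w.x + b,
# by subgradient descent. Toy Gaussian data; illustrative only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

w, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                     # examples inside the margin
    grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean(np.sign(X @ w + b) == y)
print(accuracy)  # should be high on this well-separated toy data
```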
SVM prediction architecture
[Figure: a query x is compared to each training example x1 … xn via the kernel function k; the kernel values are combined with the weights w1 … wn to produce the prediction.]
Learning gene classes
[Figure: expression profiles (Eisen et al., 79 experiments) and MYGD class labels for a 2,465-gene training set are fed to the learner, which produces a model; the predictor applies the model to a 3,500-gene test set (79 experiments) to assign classes.]
Class prediction
Class                         FP  FN   TP    TN
TCA                            4   9    8  2446
Respiration chain complexes    6   8   22  2431
Ribosome                       7   3  118  2339
Proteasome                     3   8   27  2429
Histone                        0   2    9  2456
Helix-turn-helix               0  16    0  2451
SVM outperforms other methods
Predictions of gene function
Fleischer et al. "Systematic identification and functional screens of uncharacterized proteins associated with eukaryotic ribosomal complexes." Genes Dev, 2006.
Overview
• 218 human tumor samples spanning 14 common tumor types
• 90 normal samples
• 16,063 "genes" measured per sample
• Overall SVM classification accuracy: 78%.
• Random classification accuracy: 1/14 ≈ 7%.
Summary: Support vector machine learning
• The SVM learning algorithm finds a linear decision boundary.
• The hyperplane maximizes the margin, i.e., the distance to the closest training examples.
• The optimization is convex; the solution is sparse.
• A soft margin allows for noise in the training set.
• A complex decision surface can be learned by using a non-linear kernel function.
Cost/Benefits of SVMs
+ SVMs perform well in high-dimensional data sets with few examples.
+ Convex optimization implies that you get the same answer every time.
+ Kernel functions allow encoding of prior knowledge.
+ Kernel functions handle arbitrary data types.
- The hyperplane does not provide a good explanation, especially with a non-linear kernel function.
Vector representation
• Each matrix entry is an mRNA expression measurement.
• Each column is an experiment.
• Each row corresponds to a gene.
Similarity measurement
• Normalized scalar product
• Similar vectors receive high values, and vice versa.
K(X, Y) = Σi XiYi / √(Σi Xi² · Σi Yi²)
[Figure: similar expression vectors receive kernel values near 1; dissimilar vectors receive low values.]
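The normalized scalar product is cosine similarity; a minimal sketch:

```python
import numpy as np

# The normalized scalar product described above:
# K(X, Y) = sum_i X_i Y_i / sqrt(sum_i X_i^2 * sum_i Y_i^2).
def normalized_scalar_product(x, y):
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

a = np.array([1.0, 2.0, 3.0])
assert np.isclose(normalized_scalar_product(a, 2 * a), 1.0)  # parallel profiles
assert np.isclose(normalized_scalar_product(a, -a), -1.0)    # anti-correlated
```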
Kernel matrix
>ICYA_MANSE
GDIFYPGYCPDVKPVNDFDLSAFAGAWHEIAKLPLENENQGKCTIAEYKYDGKKASVYNSFVSNGVKEYMEGDLEIAPDAKYTKQGKYVMTFKFGQRVVNLVPWVLATDYKNYAINYNCDYHPDKKAHSIHAWILSKSKVLEGNTKEVVDNVLKTFSHLIDASKFISNDFSEAACQYSTTYSLTGPDRH
>LACB_BOVIN
MKCLLLALALTCGAQALIVTQTMKGLDIQKVAGTWYSLAMAASDISLLDAQSAPLRVYVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTKIPAVFKIDALNENKVLVLDTDYKKYLLFCMENSAEPEQSLACQCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI
Sequence kernels
• We cannot compute a scalar product on a pair of variable-length, discrete strings.
Pairwise comparison kernel
Protein-protein interactions
• Pairwise interactions can be represented as a graph or as a binary protein-by-protein matrix.
[Figure: an interaction graph and the corresponding 0/1 matrix, with rows and columns indexed by protein.]
Linear interaction kernel
• The simplest kernel counts the number of interactions between each pair.
[Figure: the dot product of two rows of the binary interaction matrix; in the example, the kernel value is 3.]
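In code, the linear interaction kernel is the dot product of two proteins' rows of the interaction matrix, which counts their shared interaction partners. The two rows below are illustrative, not taken from the slide:

```python
import numpy as np

# The linear interaction kernel: dot product of two proteins' rows in
# the binary interaction matrix. Each matching 1 is a shared partner.
# These two rows are hypothetical, for illustration only.
p1 = np.array([1, 0, 0, 1, 0, 1, 0, 1])
p2 = np.array([1, 0, 1, 0, 1, 1, 0, 1])
k = int(p1 @ p2)
print(k)  # 3 shared interaction partners
```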
Diffusion kernel
• A general method for establishing similarities between nodes of a graph.
• Based upon a random walk.
• Efficiently accounts for all paths connecting two nodes, weighted by path lengths.
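One standard construction (following Kondor and Lafferty) sets K = exp(βH), where H = A − D is the negative graph Laplacian; since H is symmetric, the matrix exponential can be computed from its eigendecomposition. The 4-node path graph and β = 1 below are illustrative choices:

```python
import numpy as np

# Diffusion kernel on a graph: K = expm(beta * H), H = A - D,
# computed via the eigendecomposition of the symmetric matrix H.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # a 4-node path graph
H = A - np.diag(A.sum(axis=1))              # negative graph Laplacian
beta = 1.0

vals, vecs = np.linalg.eigh(H)              # H is symmetric
K = vecs @ np.diag(np.exp(beta * vals)) @ vecs.T

# K is a valid kernel (symmetric, positive definite), and similarity
# decays with path length: adjacent nodes are more similar than distant ones.
assert np.allclose(K, K.T)
assert np.all(np.linalg.eigvalsh(K) > 0)
assert K[0, 1] > K[0, 3]
```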
Hydrophobicity profile
• Transmembrane regions are typically hydrophobic, and vice versa.
• The hydrophobicity profile of a membrane protein is evolutionarily conserved.
[Figure: hydrophobicity profiles of a membrane protein and a non-membrane protein.]
Hydrophobicity kernel
• Generate hydropathy profile from amino acid sequence using Kyte-Doolittle index.
• Prefilter the profiles.
• Compare two profiles by
– computing the fast Fourier transform (FFT), and
– applying a Gaussian kernel function.
• This kernel detects periodicities in the hydrophobicity profile.
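A rough sketch of this pipeline, with the prefiltering step omitted and the Gaussian width σ = 10 chosen arbitrarily:

```python
import numpy as np

# Sketch of the hydrophobicity kernel: build Kyte-Doolittle profiles,
# take FFT magnitudes, then apply a Gaussian kernel. The prefiltering
# step from the slide is omitted; sigma and the padding length n are
# assumed values, not from the lecture.
KD = {'I': 4.5, 'V': 4.2, 'L': 3.8, 'F': 2.8, 'C': 2.5, 'M': 1.9,
      'A': 1.8, 'G': -0.4, 'T': -0.7, 'S': -0.8, 'W': -0.9, 'Y': -1.3,
      'P': -1.6, 'H': -3.2, 'E': -3.5, 'Q': -3.5, 'D': -3.5, 'N': -3.5,
      'K': -3.9, 'R': -4.5}

def fft_profile(seq, n=64):
    """Hydropathy profile, zero-padded to length n, as FFT magnitudes."""
    prof = np.zeros(n)
    prof[:len(seq)] = [KD[aa] for aa in seq]
    return np.abs(np.fft.rfft(prof))

def hydrophobicity_kernel(s1, s2, sigma=10.0):
    d = fft_profile(s1) - fft_profile(s2)
    return np.exp(-(d @ d) / (2 * sigma ** 2))

k = hydrophobicity_kernel("MKLVFFAEDVGSNK", "MKLVFFAEDVGSNK")
assert np.isclose(k, 1.0)   # identical sequences give kernel value 1
```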
Combining kernels
[Figure: computing kernels K(A) and K(B) on two data sets and summing them, K(A) + K(B), is identical to computing a single kernel K(A:B) on the concatenated data A:B.]
Semidefinite programming
• Define a convex cost function to assess the quality of a kernel matrix.
• Semidefinite programming (SDP) optimizes convex cost functions over the convex cone of positive semidefinite matrices.
Learn K from the convex cone of positive-semidefinite matrices or a convex subset of it :
According to a convex quality measure:
Integrate constructed kernels
• Learn a linear mix: K = Σi μi Ki
• Large margin classifier (SVM): maximize the margin.
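A fixed (hand-picked, not SDP-optimized) linear mixture illustrates why the search space is the PSD cone: any nonnegative combination of PSD base kernels is itself PSD:

```python
import numpy as np

# A linear mixture K = sum_i mu_i K_i, as in the formulation above.
# The weights here are hand-picked for illustration, not learned by SDP.
rng = np.random.default_rng(1)
X1, X2 = rng.normal(size=(5, 3)), rng.normal(size=(5, 4))
K1, K2 = X1 @ X1.T, X2 @ X2.T        # two PSD base kernels (Gram matrices)
mu = np.array([0.7, 0.3])            # nonnegative mixture weights

K = mu[0] * K1 + mu[1] * K2
# A nonnegative combination of PSD matrices stays PSD,
# so the mixture is still a valid kernel.
assert np.all(np.linalg.eigvalsh(K) > -1e-10)
```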
Markov Random Field
• General Bayesian method, applied by Deng et al. to yeast functional classification.
• Used five different types of data.
• For their model, the input data must be binary.
• Reported improved accuracy compared to using any single data type.
Yeast functional classes
Category                       Size
Metabolism                     1048
Energy                          242
Cell cycle & DNA processing     600
Transcription                   753
Protein synthesis               335
Protein fate                    578
Cellular transport              479
Cell rescue, defense            264
Interaction w/ environment      193
Cell fate                       411
Cellular organization           192
Transport facilitation          306
Other classes                    81
Six types of data
• Presence of Pfam domains.
• Genetic interactions from CYGD.
• Physical interactions from CYGD.
• Protein-protein interaction by TAP.
• mRNA expression profiles.
• (Smith-Waterman scores).
Results
[Figure: performance comparison of MRF, SDP/SVM (binary), and SDP/SVM (enriched).]
Pros and cons
+ Learns relevance of data sets with respect to the problem at hand.
+ Accounts for redundancy among data sets, as well as noise and relevance.
+ Discriminative approach yields good performance.
- Kernel-by-kernel weighting is simplistic.
- In most cases, unweighted kernel combination works fine.
- Does not provide a good explanation.
Network diffusion
GeneMANIA
A rose by any other name …
• Network diffusion
• Random walk with restart
• Personalized PageRank
• Diffusion kernel
• Gaussian random field
• GeneMANIA
Top performing methods
GeneMANIA
• Normalize each network: divide each element by the square root of the product of its row sum and column sum.
• Learn a weight for each network via ridge regression. Essentially, learn how informative the network is with respect to the task at hand.
• Sum the weighted networks.
• Assign labels to the nodes. Use (n+ + n-)/n for unlabeled genes.
• Perform label propagation in the combined network.
Mostafavi et al. Genome Biology. 9:S4, 2008.
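The normalization and label-assignment steps above can be sketched as follows; the small network and label vector are illustrative:

```python
import numpy as np

# Sketch of the GeneMANIA preprocessing described above: normalize
# each entry W_ij by sqrt(rowsum_i * colsum_j), then build the label
# vector, using (n+ + n-)/n for unlabeled genes as given on the slide.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
r, c = W.sum(axis=1), W.sum(axis=0)
W_norm = W / np.sqrt(np.outer(r, c))

labels = np.array([1.0, -1.0, np.nan, np.nan])   # nan = unlabeled
n_pos, n_neg = np.sum(labels == 1), np.sum(labels == -1)
n = len(labels)
y = np.where(np.isnan(labels), (n_pos + n_neg) / n, labels)
# Unlabeled genes receive the bias (n+ + n-)/n = 0.5 in this example.
```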
Random walk with restart
[Figure sequence: the walk begins at the positive examples, repeatedly steps to a random neighbor, and occasionally restarts from a positive example. Final node scores: node size indicates frequency of visit.]
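The walk illustrated above can be written as the iteration p ← (1 − α) W p + α p0, where W is the column-normalized adjacency matrix and p0 restarts the walk at the positive examples. The graph and α = 0.3 below are illustrative:

```python
import numpy as np

# Random walk with restart: iterate p <- (1 - alpha) * W @ p + alpha * p0.
# The 5-node graph and alpha = 0.3 are illustrative choices.
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=0)               # column-normalized transition matrix
p0 = np.array([1.0, 0, 0, 0, 0])    # restart at the single positive example
alpha = 0.3

p = p0.copy()
for _ in range(200):
    p = (1 - alpha) * W @ p + alpha * p0

assert np.isclose(p.sum(), 1.0)     # scores remain a probability distribution
assert p[1] > p[4]                  # nodes near the positive score higher
```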
Label propagation is random walk with restart except:
(a) You restart less often from nodes with many neighbors (i.e., the restart probability of a node is inversely related to its degree)
(b) Nodes with many neighbors have their final node scores scaled up
Label propagation vs. SVM
[Figure: performance of label propagation compared with the SVM.]
Performance averaged across 992 yeast Gene Ontology Biological Process categories.