Do not reproduce without permission - lectures.gersteinlab.org (c) '09
BIOINFORMATICS: Databases & Datamining II
Mark Gerstein, Yale University
gersteinlab.org/courses/452 (last edit in spring '11)
Datamining
• Building the data matrix
  - Generic
  - Gene centric
  - Genomic region
  - Network centric
• Databases
• Supervised Mining
• Unsupervised Mining
  - Simple overlaps
  - Clustering rows & columns (networks)
  - PCA
  - SVD (theory + appl.)
  - Weighted Gene Co-Expression Network
  - Biplot
  - CCA
Spectral Methods Outline & Papers
• Simple background on PCA (emphasizing lingo)
• More abstract run-through of SVD
• Applications:
- O Alter et al. (2000). "Singular value decomposition for genome-wide expression data processing and modeling." PNAS 97: 10101
- J Qian et al. (2001). "Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions." J Mol Biol. 314: 1053.
- Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
- Z Zhang et al. (2007) "Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions." Genome Res 17: 787
- TA Gianoulis et al. (2009) "Quantifying environmental adaptation of metabolic pathways in metagenomics." PNAS 106: 1374.
Unsupervised Mining
Simple "Overlaps"
Structure of Genomic Features Matrix
Example Overlap Analysis: Biochemically Active Regions Don't All Appear to Be Under Constraint
• Careful Randomization (GSC statistic)
[ENCODE Consortium, Nature 447, 2007]
Genomic Features Matrix: Deserts & Forests
Non-random distribution of TREs
• TREs are not evenly distributed throughout the ENCODE regions (P < 2.2×10^-16).
• The actual TRE distribution is power-law.
• The null distribution is ‘Poissonesque.’
• Many genomic subregions with extreme numbers of TREs.
Zhang et al. (2007) Gen. Res.
Aggregation & Saturation
[Nat. Rev. Genet. (2010) 11: 559]
Aggregation Analysis
[ENCODE Consortium, Nature 447, 2007]
Unsupervised Mining
Clustering Columns & Rows of the Data Matrix
Correlating Rows & Columns
[Nat. Rev. Genet. (2010) 11: 559]
“cluster” predictors
Use clusters to predict Response
Agglomerative Clustering
• Bottom-up vs. top-down (top-down, e.g. K-means, requires knowing how many centers)
• Single- or multi-link: threshold for connection?
http://commons.wikimedia.org/wiki/File:Hierarchical_clustering_diagram.png
K-means
1) Pick k random points as putative cluster centers.
2) Group the points to be clustered by the center to which they are closest.
3) Take the mean of each group and repeat, with the means now at the cluster centers.
4) Stop when the centers stop moving.
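The four steps above can be sketched in a few lines of NumPy (a minimal illustration, not the lecture's own code; the stopping tolerance and initialization are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: random centers, assign to nearest, re-average, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: k random points
    for _ in range(n_iter):
        # step 2: group points by the center to which they are closest
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: take the mean of each group as the new center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # step 4: stop when centers stop moving
            break
        centers = new_centers
    return centers, labels
```

On two well-separated blobs, this converges to one center per blob from almost any initialization.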
Using Genes to Cluster Cancer Samples
Cheng et al., Genome Biology, 2009
(1) RE-score profile for different miRNAs in one cancer sample.
(2) Tabulate over many different breast cancer samples.
(3) Clustering based on RE score divides samples into 2 main types of cancer.
(4) Clustering is better than that based on individual gene expression levels.
(c) Mark Gerstein 1999, Yale, bioinfo.mbb.yale.edu

Clustering the yeast cell cycle to uncover interacting proteins

[Figure: microarray timecourse of one ribosomal protein (RPL19B) vs. TFIIIC; y-axis: mRNA expression level (ratio), x-axis: time]

[Brown, Davis]
Clustering the yeast cell cycle to uncover interacting proteins

[Figure: a random relationship from the ~18M possible pairs (RPL19B vs. TFIIIC); y-axis: mRNA expression level (ratio), x-axis: time]
Clustering the yeast cell cycle to uncover interacting proteins

[Figure: a close relationship from the ~18M (RPL19B vs. RPS6B, 2 interacting ribosomal proteins); y-axis: mRNA expression level (ratio), x-axis: time]

[Botstein; Church, Vidal]
Clustering the yeast cell cycle to uncover interacting proteins

[Figure: RPL19B, RPS6B, RPP1A, RPL15A and an unknown profile; predict functional interaction of the unknown member of the cluster; y-axis: mRNA expression level (ratio), x-axis: time]
Global Network of Relationships

~470K significant relationships from ~18M possible
Network = Adjacency Matrix
• Adjacency matrix A = [a_ij] encodes whether/how a pair of nodes is connected.
• For unweighted networks: entries are 1 (connected) or 0 (disconnected).
• For weighted networks: the adjacency matrix reports the connection strength between gene pairs.
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
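As a concrete sketch of the definition above (a hypothetical toy graph, not from the slides):

```python
import numpy as np

# Unweighted network: a_ij = 1 if nodes i and j are connected, 0 otherwise.
# (In a weighted network the entry would instead hold a connection strength.)
edges = [(0, 1), (1, 2)]          # toy path graph: 0 - 1 - 2
A = np.zeros((3, 3))
for i, j in edges:
    A[i, j] = A[j, i] = 1         # undirected, so the matrix is symmetric
```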
Steps for constructing a co-expression network
A) Get microarray gene expression data
B) Do preliminary filtering
C) Measure concordance of gene expression profiles by Pearson correlation
D) The Pearson correlation matrix is either dichotomized to arrive at an adjacency matrix (unweighted network)...
   ...or transformed continuously with the power adjacency function (weighted network)
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
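The two alternatives in the last step can be sketched as follows (a minimal illustration on made-up data; the 0.8 cutoff and β = 6 are merely example parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 12))          # toy data: 30 genes x 12 arrays

cor = np.corrcoef(expr)                   # gene-gene Pearson correlation matrix

hard = (np.abs(cor) > 0.8).astype(int)    # dichotomize -> unweighted network
soft = np.abs(cor) ** 6                   # power adjacency -> weighted network
np.fill_diagonal(hard, 0)
np.fill_diagonal(soft, 0)
```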
Power adjacency function to transform correlation into adjacency

a_ij = |cor(x_i, x_j)|^β

To determine β: in general use the "scale-free topology criterion" described in Zhang and Horvath 2005.
Typical value: β = 6. Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
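A quick numeric illustration of why this soft threshold works: raising |cor| to the power β preserves strong correlations but strongly suppresses weak ones (values below are simple arithmetic, not from the slides):

```python
# Soft thresholding with the typical beta = 6:
beta = 6
for c in (0.9, 0.7, 0.5, 0.3):
    print(f"|cor|={c:.1f}  ->  a_ij={c ** beta:.4f}")
# strong pairs keep substantial weight; weak pairs fall to nearly zero
```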
Comparing adjacency functions
Power adjacency (soft threshold) vs. step function (hard threshold)
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Why weighted?
• A continuous spectrum between perfect co-expression and no co-expression at all
• Could threshold, but will lose information
• Instead, assign a weight to each link that represents the extent of gene co-expression
• Natural range of weights: 0 = no connection, 1 = perfect agreement
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Local Clustering algorithm identifies further (reasonable) types of expression relationships
[Panels: Simultaneous (traditional global correlation), Inverted, Time-Shifted]
Local Alignment
Qian J. et al. "Beyond Synexpression Relationships: Local Clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions." J. Mol. Biol. (2001) 314, 1053-1066
Simultaneous
Local Alignment
[Qian et al. 2001]
Time-Shifted
Local Alignment
[Qian et al. 2001]
Inverted
Unsupervised Mining
PCA
Relation of Clustering to PCA
• Correlation matrix C
  - Similarity of all pairs, either in terms of rows (genes) or columns (conditions)
  - C(genes) = AA^T; C(conditions) = A^T A
• Clustering operates on C, thresholding it to give a network
• PCA gives another view of C: it diagonalizes C
• Two different correlation matrices (for rows & columns); SVD operates on both
Covariance matrix
• N random variables
• N×N symmetric matrix
• Diagonal elements are variances
PCA section will be a "mash up" of a number of PPTs on the web
• pca-1 (black) -> www.astro.princeton.edu/~gk/A542/PCA.ppt, by Professor Gillian R. Knapp (gk@astro.princeton.edu)
• pca-2 (yellow) -> myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt, by Hal Whitehead. Main course URL: http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm
• pca.ppt (what is a covariance matrix) -> hebb.mit.edu/courses/9.641/lectures/pca.ppt, by Sebastian Seung. Main course page: http://hebb.mit.edu/courses/9.641/index.html
• BIIS_05lecture7.ppt -> www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt, by Professor R. S. Gaborski
Abstract
Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the sample's information.
By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt
Geometric picture of principal components (PCs)
A sample of n observations in the 2-D space
Goal: to account for the variation in a sample in as few variables as possible, to some accuracy
Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt
Geometric picture of principal components (PCs)
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous:
• the 1st PC is a minimum-distance fit to a line in space
• the 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt
PCA: General methodology
From k original variables x1, x2, ..., xk, produce k new variables y1, y2, ..., yk:
y1 = a11·x1 + a12·x2 + ... + a1k·xk
y2 = a21·x1 + a22·x2 + ... + a2k·xk
...
yk = ak1·x1 + ak2·x2 + ... + akk·xk
such that:
• the yk's are uncorrelated (orthogonal)
• y1 explains as much as possible of the original variance in the data set
• y2 explains as much as possible of the remaining variance
• etc.
The yk's are the Principal Components.
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
[Scatter plot with rotated axes: 1st Principal Component, y1; 2nd Principal Component, y2]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
• Rotates multivariate dataset into a new configuration which is easier to interpret
• Purposes:
  - simplify data
  - look at relationships between variables
  - look at patterns of units
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
{a11, a12, ..., a1k} is the 1st eigenvector of the correlation/covariance matrix, and the coefficients of the 1st principal component
{a21, a22, ..., a2k} is the 2nd eigenvector of the correlation/covariance matrix, and the coefficients of the 2nd principal component
...
{ak1, ak2, ..., akk} is the kth eigenvector of the correlation/covariance matrix, and the coefficients of the kth principal component
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
eigenvalue problem
• The eigenvalue problem is any problem having the following form:
  A·v = λ·v
  A: n×n matrix; v: n×1 non-zero vector; λ: scalar
Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.
[from BIIS_05lecture7.ppt] Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
eigenvalue problem

A·v = λ·v

[2 3] [3]   [12]       [3]
[2 1] [2] = [ 8] = 4 × [2]

Therefore, (3, 2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A.

Given matrix A, how can we calculate the eigenvectors and eigenvalues for A?
[from BIIS_05lecture7.ppt] Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
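The worked 2×2 example above can be checked numerically, and the answer to the question is that a library routine computes all eigenpairs at once (a minimal sketch using NumPy):

```python
import numpy as np

A = np.array([[2, 3],
              [2, 1]])
v = np.array([3, 2])

# A.v = (12, 8) = 4 * (3, 2), so v is an eigenvector with eigenvalue 4
assert np.array_equal(A @ v, 4 * v)

# the general answer: compute the full eigendecomposition
evals, evecs = np.linalg.eig(A)
```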
Principal Components Analysis
So, principal components are given by:
y1 = a11x1 + a12x2 + ... + a1kxk
y2 = a21x1 + a22x2 + ... + a2kxk
...
yk = ak1x1 + ak2x2 + ... + akkxk
xj’s are standardized if correlation matrix is used (mean 0.0, SD 1.0)
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
Score of ith unit on jth principal component
yi,j = aj1xi1 + aj2xi2 + ... + ajkxik
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
PCA Scores
[Scatter plot: original coordinates (xi1, xi2) and scores on the rotated axes (yi,1, yi,2)]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
Amount of variance accounted for by:
1st principal component: λ1, the 1st eigenvalue
2nd principal component: λ2, the 2nd eigenvalue
...
λ1 > λ2 > λ3 > λ4 > ...
Average λj = 1 (correlation matrix)
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis: Eigenvalues
[Scatter plot with eigenvalues λ1, λ2 as the variances along the principal axes]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
PCA: Terminology
• jth principal component is the jth eigenvector of the correlation/covariance matrix
• coefficients, ajk, are elements of the eigenvectors and relate the original variables (standardized if using the correlation matrix) to the components
• scores are values of units on components (produced using the coefficients)
• amount of variance accounted for by a component is given by its eigenvalue, λj
• proportion of variance accounted for by a component is given by λj / Σ λj
• loading of the kth original variable on the jth component is given by ajk·√λj, the correlation between variable and component
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
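All of the terminology above (eigenvectors, eigenvalues, scores, proportion of variance) can be demonstrated on made-up data (a minimal sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 1] += 2 * X[:, 0]                   # make two variables correlated

C = np.cov(X, rowvar=False)              # 3x3 covariance matrix
evals, evecs = np.linalg.eigh(C)         # eigh returns ascending order for symmetric C
order = evals.argsort()[::-1]            # sort components by eigenvalue, descending
evals, evecs = evals[order], evecs[:, order]

scores = (X - X.mean(axis=0)) @ evecs    # score: y_ij = sum_k a_jk * x_ik
explained = evals / evals.sum()          # proportion of variance per component
```

The covariance matrix of the scores is diagonal, confirming that the components are uncorrelated.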
How many components to use?
• If λj < 1 then the component explains less variance than an original variable (correlation matrix)
• Use 2 components (or 3) for visual ease
• Scree diagram: [plot of eigenvalue vs. component number]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis on:
• Covariance Matrix:
  - Variables must be in same units
  - Emphasizes variables with most variance
  - Mean eigenvalue ≠ 1.0
  - Useful in morphometrics, a few other cases
• Correlation Matrix:
  - Variables are standardized (mean 0.0, SD 1.0)
  - Variables can be in different units
  - All variables have same impact on analysis
  - Mean eigenvalue = 1.0
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
PCA: Potential Problems
• Lack of Independence: no problem
• Lack of Normality: normality desirable but not essential
• Lack of Precision: precision desirable but not essential
• Many Zeroes in Data Matrix: problem (use Correspondence Analysis)
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
(c) M Gerstein '06, gerstein.info/talks
PCA applications: Eigenfaces
• the principal eigenface looks like a bland androgynous average human face
http://en.wikipedia.org/wiki/Image:Eigenfaces.png
Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
Eigenfaces – Face Recognition
• When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.
• Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces
• Hence eigenfaces provide a means of applying data compression to faces for identification purposes.
Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
Unsupervised Mining
SVD
Puts together slides prepared by Brandon Xia with images from Alter et al. and Kluger et al. papers
SVD for microarray data(Alter et al, PNAS 2000)
A = USV^T
• A is any rectangular matrix (m ≥ n)
• Row space: vector subspace generated by the row vectors of A
• Column space: vector subspace generated by the column vectors of A
  - The dimension of the row & column space is the rank of the matrix A: r (≤ n)
• A is a linear transformation that maps vector x in row space into vector Ax in column space
A = USV^T
• U is an "orthogonal" matrix (m ≥ n)
• Column vectors of U form an orthonormal basis for the column space of A: U^T U = I
• u1, ..., un in U are eigenvectors of AA^T
  - AA^T = USV^T VSU^T = US^2 U^T
  - "Left singular vectors"
U = [u1 u2 ... un] (as columns)
A = USV^T
• V is an orthogonal matrix (n by n)
• Column vectors of V form an orthonormal basis for the row space of A: V^T V = V V^T = I
• v1, ..., vn in V are eigenvectors of A^T A
  - A^T A = VSU^T USV^T = VS^2 V^T
  - "Right singular vectors"
V = [v1 v2 ... vn] (as columns)
A = USV^T
• S is a diagonal matrix (n by n) of non-negative singular values
• Typically sorted from largest to smallest
• Singular values are the non-negative square roots of the corresponding eigenvalues of A^T A and AA^T
AV = US
• Means each Avi = siui
• Remember A is a linear map from row space to column space
• Here, A maps an orthonormal basis {vi} in row space into an orthonormal basis {ui} in column space
• Each component of ui is the projection of a row onto the vector vi
SVD of A (m by n): recap
• A = USV^T = (big "orthogonal")(diagonal)(square orthogonal)
• u1, ..., um in U are eigenvectors of AA^T
• v1, ..., vn in V are eigenvectors of A^T A
• s1, ..., sn in S are the non-negative singular values of A
• AV = US means each A·vi = si·ui
• "Every A is diagonalized by 2 orthogonal matrices"
SVD as a sum of rank-1 matrices
• A = USV^T
• A = s1·u1·v1^T + s2·u2·v2^T + ... + sn·un·vn^T
• s1 ≥ s2 ≥ ... ≥ sn ≥ 0
• What is the rank-r matrix Â that best approximates A?
  - Minimize the squared Frobenius error: Σ_i Σ_j (A_ij − Â_ij)^2
  - Answer: Â = s1·u1·v1^T + s2·u2·v2^T + ... + sr·ur·vr^T
• Very useful for matrix approximation
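This truncation is easy to verify numerically (a minimal sketch on a random matrix, not from the slides): the squared Frobenius error of the rank-r truncation equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
A_r = (U[:, :r] * s[:r]) @ Vt[:r, :]     # s1*u1*v1^T + s2*u2*v2^T

err = ((A - A_r) ** 2).sum()             # squared Frobenius norm of the residual
```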
Examples of (almost) rank-1 matrices
• Steady states with fluctuations
• Array artifacts?
• Signals?

[101 103 102]    [1 2 1]    [101 303 202]
[302 300 301]    [2 4 2]    [102 300 201]
[203 204 203]    [1 2 1]    [103 304 203]
[401 402 404]    [0 0 0]    [101 302 204]
Geometry of SVD in row space
• A as a collection of m row vectors (points) in the row space of A
• s1·u1·v1^T is the best rank-1 matrix approximation for A
• Geometrically: v1 is the direction of the best approximating rank-1 subspace that goes through the origin
• s1·u1 gives coordinates for the row vectors in the rank-1 subspace
• v1 gives coordinates for the row-space basis vectors in the rank-1 subspace
[Figure: 2-D row space (x, y) with best-fit direction v1]
Geometry of SVD in row space
[Figure: the line segment along v1 through the origin approximates the original data set; the projected data set s1·u1·v1^T approximates the original A]
Geometry of SVD in row space
• A as a collection of m row vectors (points) in the row space of A
• s1·u1·v1^T + s2·u2·v2^T is the best rank-2 matrix approximation for A
• Geometrically: v1 and v2 are the directions of the best approximating rank-2 subspace that goes through the origin
• s1·u1 and s2·u2 give coordinates for the row vectors in the rank-2 subspace
• v1 and v2 give coordinates for the row-space basis vectors in the rank-2 subspace
What about the geometry of SVD in column space?
• A = USV^T
• A^T = VSU^T
• The column space of A becomes the row space of A^T
• The same as before, except that U and V are switched
Geometry of SVD in row and column spaces
• Row space
  - si·ui gives coordinates for row vectors along unit vector vi
  - vi gives coordinates for row-space basis vectors along unit vector vi
• Column space
  - si·vi gives coordinates for column vectors along unit vector ui
  - ui gives coordinates for column-space basis vectors along unit vector ui
• Along the directions vi and ui, these two spaces look pretty much the same!
  - Up to scale factors si
  - Switch row/column vectors and row/column space basis vectors
  - Biplot...
When is SVD = PCA?
• Centered data
[Figure: centered data cloud; the SVD axes (x', y') coincide with the PCA axes]
When is SVD different from PCA?
[Figure: uncentered data; the SVD axes (x'', y'') through the origin differ from the PCA axes (x', y') through the centroid]
Translation is not a linear operation, as it moves the origin!
Additional Points
• Time complexity (cubic)
• Application to text mining
  - Latent semantic indexing
  - sparsity
Conclusion
• SVD is the "absolute high point of linear algebra"
• SVD is difficult to compute; but once we have it, we have many things
• SVD finds the best approximating subspace, using a linear transformation
• Simple SVD cannot handle translation, non-linear transformation, separation of labeled data, etc.
• Good for exploratory analysis; but once we know what we are looking for, use appropriate tools and model the structure of the data explicitly!
Unsupervised Mining
Intuition on interpretation of SVD in terms of genes and conditions
SVD for microarray data(Alter et al, PNAS 2000)
Notation
• m = 1000 genes
  - row vectors
  - 10 eigengenes (vi) of dimension 10 conditions
• n = 10 conditions (assays)
  - column vectors
  - 10 eigenconditions (ui) of dimension 1000 genes
Understanding Eigengenes (vi) in terms of PCA on the (large) gene-gene correlation matrix

Understanding Eigenconditions (ui) in terms of PCA on the (small) condition-condition correlation matrix
(bra-ket notation)
Close up on Eigengenes
Copyright ©2000 by the National Academy of Sciences
Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106
Genes sorted by correlation with top 2 eigengenes
Normalized elutriation expression in the subspace associated with the cell cycle
Plotting Experiments in Low Dimension Subspace
Unsupervised Mining
Weighted Gene Co-Expression Network
Weighted Gene Co-Expression Network Analysis
Bin Zhang and Steve Horvath (2005)
"A General Framework for Weighted Gene Co-Expression Network Analysis",
Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Art. 17. Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Network Modules
Central concept in network methodology:
• Modules: groups of densely interconnected genes (not the same as closely related genes)
  - a class of over-represented patterns
• Empirical fact: gene co-expression networks exhibit modular structure
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Module Detection
• Numerous methods exist
• Many methods define a suitable gene-gene dissimilarity measure and use clustering.
• In our case: dissimilarity based on topological overlap
• Clustering method: average-linkage hierarchical clustering
  - branches of the dendrogram are modules
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Topological overlap measure, TOM
• Pairwise measure by Ravasz et al, 2002
• TOM[i,j] measures the overlap of the set of nearest neighbors of nodes i, j
• Closely related to twinness
• Easily generalized to weighted networks
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Calculating TOM
• Normalized to [0,1] with 0 = no overlap, 1 = perfect overlap
• Generalized in Zhang and Horvath (2005) to the case of weighted networks

TOM_ij = (Σ_u a_iu·a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij)

DistTOM_ij = 1 − TOM_ij
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
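The TOM formula above translates directly into matrix operations (a minimal sketch, not the WGCNA package's implementation):

```python
import numpy as np

def tom(A):
    """Topological overlap matrix for a (weighted) adjacency A with zero diagonal."""
    k = A.sum(axis=1)                        # connectivity of each node
    shared = A @ A                           # sum_u a_iu * a_uj (shared neighbors)
    num = shared + A
    den = np.minimum.outer(k, k) + 1 - A
    T = num / den
    np.fill_diagonal(T, 1)                   # a node overlaps itself perfectly
    return T
```

For an unweighted triangle (every pair shares the third node as a neighbor), every entry comes out as perfect overlap.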
Example of module detection via hierarchical clustering
• Expression data from human brains, 18 samples
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Module eigengenes
• Often we would like to treat modules as single units
  - biologically motivated data reduction
• Construct a representative
• Our choice: module eigengene = 1st principal component of the module expression matrix
• Intuitively: a kind of average expression profile
• Genes of each module must be highly correlated for a representative to really represent
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
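A module eigengene can be sketched as the leading right singular vector of the centered genes-by-samples matrix (a minimal illustration, not the WGCNA package's `moduleEigengenes` routine):

```python
import numpy as np

def module_eigengene(expr):
    """1st principal component of a module's genes-x-samples expression matrix."""
    X = expr - expr.mean(axis=1, keepdims=True)   # center each gene's profile
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]                                   # representative profile across samples
```

For a module of genes that share one underlying profile plus noise, the eigengene tracks that shared profile closely.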
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
Example: human brain expression data, 18 samples; module consisting of 50 genes
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
Module eigengenes are very useful!
• Summarize each module in one synthetic expression profile
• Suitable representation in situations where modules are considered the basic building blocks of a system
  - Allow relating modules to external information (phenotypes, genotypes such as SNPs, clinical traits) via simple measures (correlation, mutual information, etc.)
  - Can quantify co-expression relationships of various modules by standard measures
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
Human and chimpanzee brain expression data
• Construct gene expression networks in both sets, find modules
• Construct consensus modules
• Characterize each module by the brain region where it is most differentially expressed
• Represent each module by its eigengene
• Characterize relationships among modules by correlation of the respective eigengenes (heatmap or dendrogram)
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Set modules
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Biological information?
Assign modules to brain regions with highest (positive) differential expression.
Red means the module genes are over-expressed in the brain region; green means under-expression.
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Visualizing consensus eigengene networks
Heatmap comparisons of module relationships
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
For more information
Weighted Gene Co-expression Networks website:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
A short methodological summary of the publications:
• How to construct a gene co-expression network using the scale-free topology criterion? Robustness of network results. Relating a gene significance measure and the clustering coefficient to intramodular connectivity:
  – Zhang B, Horvath S (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17
• Theory of module networks (both co-expression and protein-protein interaction modules):
  – Dong J, Horvath S (2007) Understanding Network Concepts in Modules, BMC Systems Biology 2007, 1:24
• What is the topological overlap measure? Empirical studies of the robustness of the topological overlap measure:
  – Yip A, Horvath S (2007) Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 2007, 8:22
• Software for carrying out neighborhood analysis based on topological overlap. The paper shows that an initial seed neighborhood comprised of 2 or more highly interconnected genes (high TOM, high connectivity) yields superior results. It also shows that topological overlap is superior to correlation when dealing with expression data.
– Li A, Horvath S (2006) Network Neighborhood Analysis with the multi-node topological overlap measure. Bioinformatics. doi:10.1093/bioinformatics/btl581
• Gene screening based on intramodular connectivity identifies brain cancer genes that validate. This paper shows that WGCNA greatly alleviates the multiple comparison problem and leads to reproducible findings.
– Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu, Q, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS (2006) "Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target", PNAS | November 14, 2006 | vol. 103 | no. 46 | 17402-17407
• The relationship between connectivity and knock-out essentiality is dependent on the module under consideration. Hub genes in some modules may be non-essential. This study shows that intramodular connectivity is much more meaningful than whole network connectivity:
– "Gene Connectivity, Function, and Sequence Conservation: Predictions from Modular Yeast Co-Expression Networks" (2006) by Carlson MRJ, Zhang B, Fang Z, Mischel PS, Horvath S, and Nelson SF, BMC Genomics 2006, 7:40
• How to integrate SNP markers into weighted gene co-expression network analysis? The following 2 papers outline how SNP markers and co-expression networks can be used to screen for gene expressions underlying a complex trait. They also illustrate the use of the module eigengene based connectivity measure kME.
– Single network analysis: Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S (2006) "Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight". PLoS Genetics. Volume 2 | Issue 8 | AUGUST 2006
– Differential network analysis: Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S (2007) "Weighted Gene Co-expression Network Analysis Strategies Applied to Mouse Weight", Mammalian Genome. In Press
• The following application presents a `supervised’ gene co-expression network analysis. In general, we prefer to construct a co-expression network and associated modules without regard to an external microarray sample trait (unsupervised WGCNA). But if thousands of genes are differentially expressed, one can construct a network on the basis of differentially expressed genes (supervised WGCNA):
– Gargalovic PS, Imura M, Zhang B, Gharavi NM, Clark MJ, Pagnon J, Yang W, He A, Truong A, Patel S, Nelson SF, Horvath S, Berliner J, Kirchgessner T, Lusis AJ (2006) "Identification of Inflammatory Gene Modules based on Variations of Human Endothelial Cell Responses to Oxidized Lipids", PNAS 103(34): 12741-12746
• The following paper presents a differential co-expression network analysis. It studies module preservation between two networks. By screening for genes with differential topological overlap, we identify biologically interesting genes. The paper also shows the value of summarizing a module by its module eigengene.
– Oldham M, Horvath S, Geschwind D (2006) "Conservation and Evolution of Gene Co-expression Networks in Human and Chimpanzee Brains", PNAS 103(47): 17973-17978
Adapted from: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Unsupervised Mining
Biplot
Biplot
• A biplot is a two-dimensional representation of a data matrix, showing a point for each of the n observation vectors (rows of the data matrix) along with a point for each of the p variables (columns of the data matrix).
– The prefix 'bi' refers to the two kinds of points, not to the dimensionality of the plot. The method presented here could, in fact, be generalized to a three-dimensional (or higher-order) biplot. Biplots were introduced by Gabriel (1971) and have been discussed at length by Gower and Hand (1996). We applied the biplot procedure to the following toy data matrix to illustrate how a biplot can be generated and interpreted. See the figure on the next page.
• Here we have three variables (transcription factors) and ten observations (genomic bins). We can obtain a two-dimensional plot of the observations by plotting the first two principal components of the TF-TF correlation matrix R1.
– We can then add a representation of the three variables to the plot of principal components to obtain a biplot. This shows each of the genomic bins as a point and each variable as an axis given by a linear combination of the factors.
• The great advantage of a biplot is that its components can be interpreted very easily. First, correlations among the variables are related to the angles between the lines, or more specifically, to the cosines of these angles. An acute angle between two lines (representing two TFs) indicates a positive correlation between the corresponding variables, while an obtuse angle indicates a negative correlation.
– An angle of 0 or 180 degrees indicates perfect positive or negative correlation, respectively. A pair of orthogonal lines represents a correlation of zero. The distances between the points (representing genomic bins) correspond to the similarities between the observation profiles: two observations that are relatively similar across all the variables will fall relatively close to each other within the two-dimensional space of the biplot. The value or score of any observation on any variable is related to the perpendicular projection from the point to the line.
• Refs
– Gabriel, K. R. (1971), "The Biplot Graphical Display of Matrices with Application to Principal Component Analysis," Biometrika, 58, 453-467.
– Gower, J. C., and Hand, D. J. (1996), Biplots, London: Chapman & Hall.
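The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration on a random stand-in matrix (the data, the scaling convention scores = US / loadings = V, and the helper name `cos_angle` are assumptions for the example, not taken from the slides):

```python
import numpy as np

# Toy data matrix A: 10 observations (genomic bins) x 3 variables (TFs).
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
A = A - A.mean(axis=0)          # column-center before PCA/SVD

# Thin SVD of the centered matrix: A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Biplot coordinates (a common convention: scores = U*S, loadings = V).
scores = U[:, :2] * s[:2]       # one point per observation (row)
loadings = Vt[:2, :].T          # one arrow per variable (column)

# The cosine of the angle between two variable arrows approximates the
# correlation between the corresponding variables.
def cos_angle(i, j):
    a, b = loadings[i], loadings[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Plotting `scores` as points and `loadings` as arrows on the same axes yields the biplot; the angle/distance/projection interpretations above then apply directly.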
Introduction
• A biplot is a low-dimensional (usually 2D) representation of a data matrix A.
– A point for each of the m observation vectors (rows of A)
– A line (or arrow) for each of the n variables (columns of A)
PCA
• TFs: a, b, c...; genomic sites: 1, 2, 3...
• From the data matrix A and its transpose A^T one can form AA^T (site-site correlation) and A^TA (TF-TF correlation).
Biplot to Show Overall Relationship of TFs & Sites
• TFs: a, b, c...; genomic sites: 1, 2, 3...
• Decompose A = USV^T; both AA^T (site-site correlation) and A^TA (TF-TF correlation) share this decomposition.
Biplot Ex [figure: toy data matrix A and the cross-product A^TA]
Biplot Ex #2
• A^T A = V S^2 V^T
• A A^T = U S^2 U^T
• A = (U S^r) (V S^(1-r))^T, for a split exponent r between 0 and 1
• A v_j = s_j u_j and A^T u_j = s_j v_j
Biplot Ex #3
• A v_i = s_i u_i and A^T u_i = s_i v_i
• Assuming s = 1: A v = u and A^T u = v
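These SVD identities can be checked numerically; a minimal sketch on a random toy matrix (the matrix itself is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U S V^T
V = Vt.T

# A^T A = V S^2 V^T  and  A A^T = U S^2 U^T
assert np.allclose(A.T @ A, V @ np.diag(s**2) @ V.T)
assert np.allclose(A @ A.T, U @ np.diag(s**2) @ U.T)

# The coupled singular-vector relations: A v_j = s_j u_j, A^T u_j = s_j v_j
j = 0
assert np.allclose(A @ V[:, j], s[j] * U[:, j])
assert np.allclose(A.T @ U[:, j], s[j] * V[:, j])
```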
Results of Biplot [Zhang et al. (2007) Genome Res.]
• Pilot ENCODE (1% of the genome): 5996 10-kb genomic bins (adding all hits) + 105 TF experiments → biplot
• Angle between TF vectors shows the relation between factors
• Closeness of points gives clustering of "sites"
• Projection of a site onto a vector gives the degree to which that site is associated with a particular factor
Results of Biplot [Zhang et al. (2007) Genome Res.]
• The biplot groups TFs into sequence-specific and sequence-nonspecific clusters.
– c-Myc may behave more like a sequence-nonspecific TF.
– H3K27me3 functions in a transcriptional regulatory process in a rather sequence-specific manner.
• Genomic bins are associated with different TFs; in this fashion each bin is "annotated" by its closest TF cluster.
Unsupervised Mining
CCA
Sorcerer II Global Ocean Survey
• Sorcerer II journey: August 2003 - January 2006
• Sampled approximately every 200 miles
Rusch et al., PLoS Biology 2007
Sorcerer II Global Ocean Survey
• Metagenomic sequence: 6.25 GB of data
– 0.1-0.8 μm size fraction (bacteria)
– 6.3 billion base pairs (7.7 million reads)
– Reads were assembled and genes annotated; ~1 million CPU hours to process
• Metadata: GPS coordinates, sample depth, water depth, salinity, temperature, chlorophyll content
Rusch et al., PLoS Biology 2007
• Derived data: metabolic pathways and membrane protein families; additional metadata obtained via GPS coordinates
Extracting Environmental Data from Other Sources (World Ocean Atlas 2005, NOAA/NODC)
• Nutrient features extracted: phosphate, silicate, nitrate, apparent oxygen utilization, dissolved oxygen
• Example site [figure: annual phosphate [μmol/l] at the surface]: sample depth 1 m; water depth 32 m; chlorophyll 4.0 μg/kg; salinity 31 psu; temperature 11 °C; location 41°5'28"N, 71°36'8"W
Mapping Raw Metagenomic Reads to a Matrix of Families or Pathways for Each Site
[figure: families matrix]
Patel et al., Genome Research 2010
Expressing data as matrices indexed by site, environmental variable, and pathway usage
• Pathway sequences (community function) vs. environmental features
[Rusch et al. (2007) PLoS Biology; Gianoulis et al., PNAS 2009]
Simple Relationships: Pairwise Correlations
[figure: matrix of pairwise correlations between pathways and environmental features across sites]
[Gianoulis et al., PNAS 2009]
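A minimal sketch of how such a pathway-by-environment correlation matrix could be computed, assuming random stand-in data (the matrix sizes, names, and the helper `colcorr` are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_sites = 30
# Hypothetical matrices: pathway usage per site and env. features per site.
pathways = rng.normal(size=(n_sites, 5))   # e.g. 5 metabolic pathways
env = rng.normal(size=(n_sites, 3))        # e.g. temperature, salinity, depth

# Pearson correlation of every pathway column with every env. column.
def colcorr(X, Y):
    Xc = (X - X.mean(0)) / X.std(0)
    Yc = (Y - Y.mean(0)) / Y.std(0)
    return Xc.T @ Yc / len(X)              # 5 x 3 correlation matrix

R = colcorr(pathways, env)
```

Each entry R[i, j] is the pairwise correlation of pathway i with environmental feature j; CCA (below) instead weights all variables on each side simultaneously.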
Canonical Correlation Analysis: Simultaneous Weighting
• Toy example: a "Lifestyle Index" (a weighted combination a + b + c of variables such as weight and # km run/week) is matched against a "Fit Index" (another weighted combination a + b + c), with the weights on both sides chosen simultaneously so that the two indices correlate maximally.
Canonical Correlation Analysis: Simultaneous Weighting
• As in the toy example (Lifestyle Index vs. Fit Index), CCA weights two sets of variables simultaneously: environmental features (temperature, chlorophyll, etc.) on one side and metabolic pathways/protein families (photosynthesis, lipid metabolism, etc.) on the other.
CCA: Finding Variables with Large Projections in the "Correlation Circle"
• The goal of this technique is to interpret cross-covariance matrices. We do this by defining a change of basis.
Gianoulis et al., PNAS 2009
Formal CCA def'n
• More intuitively, canonical correlation analysis can be defined as the problem of finding two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are mutually maximized.
[From M Borga, CCA Tutorial, http://www.imt.liu.se/~magnus/cca ]
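Borga's definition translates directly into code: whiten each block of variables, then take the SVD of the whitened cross-covariance. The sketch below is a minimal NumPy illustration on synthetic data; the synthetic data and the eigendecomposition route to the inverse square roots are assumptions for the example, not the authors' implementation:

```python
import numpy as np

def cca(X, Y):
    """Minimal CCA sketch: returns the canonical correlations r and weight
    matrices Wx, Wy such that the projections X @ Wx and Y @ Wy are
    maximally correlated, dimension by dimension."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    n = len(X)
    Sxx = X.T @ X / n; Syy = Y.T @ Y / n; Sxy = X.T @ Y / n
    def inv_sqrt(S):                      # inverse matrix square root (PD S)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, r, Vt = np.linalg.svd(Kx @ Sxy @ Ky)  # singular values = canonical corrs
    return r, Kx @ U, Ky @ Vt.T

# Synthetic data sharing one latent signal z between the two blocks.
rng = np.random.default_rng(3)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 2))])
Y = np.hstack([-z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 2))])
r, Wx, Wy = cca(X, Y)   # first canonical correlation should be near 1
```

By construction, the in-sample correlation of the first pair of projections equals the first singular value, and the squared singular values are the squared canonical correlations mentioned on the next slide.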
CCA results
• We are defining a change of basis of the cross-covariance matrix. We want the correlations between the projections of the variables, X and Y, onto the basis vectors to be mutually maximized.
– Eigenvalues → squared canonical correlations
– Eigenvectors → normalized canonical correlation basis vectors
• [figure: environment and family variables plotted in the first and second CCA dimensions, with correlations ranging from 0.3 to 1]
• Correlation circle: the closer a point is to the outer circle, the higher the correlation; variables projected in the same direction are correlated.
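The correlation-circle coordinates themselves are easy to compute: each variable is placed at its correlation with the two displayed dimensions. A hedged sketch, assuming random stand-in data and two orthogonalized score vectors in place of actual canonical variates:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))            # hypothetical original variables
t1 = X @ rng.normal(size=4)             # stand-ins for the first two
t2 = X @ rng.normal(size=4)             # canonical variates (scores)

# Orthogonalize t2 against t1: the displayed axes should be uncorrelated.
t1 = t1 - t1.mean(); t2 = t2 - t2.mean()
t2 = t2 - (t1 @ t2 / (t1 @ t1)) * t1

# Each variable's coordinates = its correlations with the two dimensions.
coords = np.array([[np.corrcoef(X[:, j], t)[0, 1] for t in (t1, t2)]
                   for j in range(X.shape[1])])
radii = np.linalg.norm(coords, axis=1)  # <= 1 when the axes are orthogonal
```

A radius near 1 means the variable is well represented in the plotted plane; two variables pointing in the same direction are positively correlated, matching the reading rules above.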
Circuit Map
• Environmentally invariant vs. environmentally variant pathways
• Edge shading: strength of pathway co-variation with the environment
Gianoulis et al., PNAS 2009
Conclusion #1: energy-conversion strategy varies with temperature and depth
Gianoulis et al., PNAS 2009
Conclusion #2: Outer Membrane Components Vary with the Environment
• Membrane proteins interact with the environment, transporting available nutrients, sensing environmental signals, and responding to changes.
Gianoulis et al., PNAS 2009; Patel et al., Genome Research 2010
Membrane Proteins: Sensing and Responding to the Environment
[figure: CCA plot (Dimension 1 vs. Dimension 2) of environmental features — water depth, acidity, apparent O2 utilization, salinity, pollution, climate change, shipping, phosphate, nitrate, silicate, temperature, chlorophyll, dissolved O2, UV, sample depth — with variant and invariant membrane protein families]
• 44 invariant and 107 variant membrane protein families
• 2.3 million predicted membrane proteins; 1.2 million unique; 850,000 mapped to 151 membrane protein COGs
• 20% have NO KNOWN FUNCTION
Patel and Gianoulis et al., Genome Research 2010
Membrane Proteins Co-vary More than Metabolic Pathways
• Median absolute structural correlation coefficient: metabolic pathways = 0.17 vs. membrane proteins = 0.3
[figure: side-by-side CCA plots (Dimension 1 vs. Dimension 2) for metabolic pathways and membrane proteins, each showing variant and invariant groups against the environmental features]
Patel and Gianoulis et al., Genome Research 2010
CCA Limitations
• Both the strength and the directionality of relationships between nodes are difficult to decipher in this format.
[figure: the CCA plot of environmental features and variant/invariant membrane protein families]
Patel and Gianoulis et al., (in press) Genome Research
Protein Families and Environmental Features Network (PEN)
• Edge distance: cos θ_ab = (a · b) / (|a||b|), the dot product between node positions in the 1st and 2nd dimensions of CCA
Patel et al., Genome Research 2010
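A minimal sketch of this edge weight, assuming hypothetical 2-D CCA coordinates (the node names appear on these slides, but the positions here are invented for illustration):

```python
import numpy as np

# Each node (protein family or env. feature) gets a 2-D position from the
# first two CCA dimensions; an edge weight is the cosine of the angle
# between the two position vectors (normalized dot product).
pos = {"COG0598": np.array([0.8, 0.1]),      # hypothetical coordinates
       "Phosphate": np.array([0.7, 0.2]),
       "Temperature": np.array([-0.6, 0.5])}

def edge_weight(a, b):
    va, vb = pos[a], pos[b]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

w_close = edge_weight("COG0598", "Phosphate")    # near +1: associated
w_far = edge_weight("COG0598", "Temperature")    # negative: anti-associated
```

Nodes projected in similar directions in the correlation circle thus end up strongly connected in the network, which is how the bi-modules below emerge.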
Protein Families and Environmental Features Network (PEN)
• "Bi-modules": groups of environmental features and membrane protein families that are associated
• UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the network
• Examples: COG0598 (magnesium transporter), COG1176 (polyamine transporter)
Bi-module 1: Phosphate/Phosphate Transporters
• Low phosphate: high-affinity phosphate transporters, which are induced during phosphate limitation
• High phosphate: low-affinity inorganic phosphate ion transporters, which are constitutively expressed
Patel et al., Genome Research 2010
Bi-module 2: Iron Transporters/Pollution/Shipping
• Negative relationship between areas of high ocean-based pollution and shipping and transporters involved in the uptake of iron
• Pollution and shipping may be a proxy for iron concentrations
Patel et al., Genome Research 2010