Do not reproduce without permission - lectures.gersteinlab.org (c) '09
BIOINFORMATICS: Databases & Datamining II
Mark Gerstein, Yale University
gersteinlab.org/courses/452 (last edit in spring '11)
Datamining
• Building the data matrix
  - Generic
  - Gene centric
  - Genomic region
  - Network centric
• Databases
• Supervised Mining
• Unsupervised Mining
  - Simple overlaps
  - Clustering rows & columns (networks)
  - PCA
  - SVD (theory + appl.)
  - Weighted Gene Co-Expression Network
  - Biplot
  - CCA
Spectral Methods Outline & Papers
• Simple background on PCA (emphasizing lingo)
• More abstract run-through of SVD
• Applications:
- O Alter et al. (2000). "Singular value decomposition for genome-wide expression data processing and modeling." PNAS 97: 10101
- J Qian et al. (2001). "Beyond synexpression relationships: local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions." J Mol Biol. 314: 1053.
- Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
- Z Zhang et al. (2007) "Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions." Genome Res 17: 787
- TA Gianoulis et al. (2009) "Quantifying environmental adaptation of metabolic pathways in metagenomics." PNAS 106: 1374.
Unsupervised Mining
Simple "Overlaps"
Structure of Genomic Features Matrix
Example Overlap Analysis: Biochemically Active Regions Don't All Appear to Be Under Constraint
• Careful Randomization (GSC statistic)
[ENCODE Consortium, Nature 447, 2007]
Genomic Features Matrix: Deserts & Forests
Non-random distribution of TREs
• TREs are not evenly distributed throughout the ENCODE regions (P < 2.2×10^-16).
• The actual TRE distribution is power-law.
• The null distribution is ‘Poissonesque.’
• Many genomic subregions with extreme numbers of TREs.
Zhang et al. (2007) Gen. Res.
Aggregation & Saturation
[Nat. Rev. Genet. (2010) 11: 559]
Aggregation Analysis
[ENCODE Consortium, Nature 447, 2007]
Unsupervised Mining
Clustering Columns & Rows of the Data Matrix
Correlating Rows & Columns
[Nat. Rev. Genet. (2010) 11: 559]
“cluster” predictors
Use clusters to predict Response
Agglomerative Clustering
• Bottom-up vs. top-down (top-down, e.g. K-means, requires knowing how many centers)
• Single- or multi-link: threshold for connection?
http://commons.wikimedia.org/wiki/File:Hierarchical_clustering_diagram.png
K-means
1) Pick k random points as putative cluster centers.
2) Group the points to be clustered by the center to which they are closest.
3) Take the mean of each group and repeat, with the means now at the cluster centers.
4) Stop when the centers stop moving.
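The four steps above can be sketched in a few lines of NumPy (a minimal illustration, not the lecture's own code; the stopping tolerance and initialization are assumptions):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: random centers, assign to nearest, re-average, repeat."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1: k random points
    for _ in range(n_iter):
        # step 2: group points by the center to which they are closest
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # step 3: take the mean of each group as the new center
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # step 4: stop when centers stop moving
            break
        centers = new_centers
    return centers, labels
```

On two well-separated blobs, this converges to one center per blob from almost any initialization.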
Using Genes to Cluster Cancer Samples
Cheng et al., Genome Biology, 2009
(1) RE-score profile for different miRNAs in one cancer sample.
(2) Tabulate over many different breast cancer samples.
(3) Clustering based on RE score divides samples into 2 main types of cancer.
(4) Clustering is better than that based on individual gene expression levels.
(c) Mark Gerstein 1999, Yale, bioinfo.mbb.yale.edu

Clustering the yeast cell cycle to uncover interacting proteins

[Figure: microarray timecourse of one ribosomal protein (RPL19B) vs. TFIIIC; y-axis: mRNA expression level (ratio), x-axis: time]

[Brown, Davis]
Clustering the yeast cell cycle to uncover interacting proteins

[Figure: a random relationship from the ~18M possible pairs (RPL19B vs. TFIIIC); y-axis: mRNA expression level (ratio), x-axis: time]
Clustering the yeast cell cycle to uncover interacting proteins

[Figure: a close relationship from the ~18M (RPL19B vs. RPS6B, 2 interacting ribosomal proteins); y-axis: mRNA expression level (ratio), x-axis: time]

[Botstein; Church, Vidal]
Clustering the yeast cell cycle to uncover interacting proteins

[Figure: RPL19B, RPS6B, RPP1A, RPL15A and an unknown profile; predict functional interaction of the unknown member of the cluster; y-axis: mRNA expression level (ratio), x-axis: time]
Global Network of Relationships

~470K significant relationships from ~18M possible
Network = Adjacency Matrix
• Adjacency matrix A = [a_ij] encodes whether/how a pair of nodes is connected.
• For unweighted networks: entries are 1 (connected) or 0 (disconnected).
• For weighted networks: the adjacency matrix reports the connection strength between gene pairs.
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
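As a concrete sketch of the definition above (a hypothetical toy graph, not from the slides):

```python
import numpy as np

# Unweighted network: a_ij = 1 if nodes i and j are connected, 0 otherwise.
# (In a weighted network the entry would instead hold a connection strength.)
edges = [(0, 1), (1, 2)]          # toy path graph: 0 - 1 - 2
A = np.zeros((3, 3))
for i, j in edges:
    A[i, j] = A[j, i] = 1         # undirected, so the matrix is symmetric
```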
Steps for constructing a co-expression network
A) Get microarray gene expression data
B) Do preliminary filtering
C) Measure concordance of gene expression profiles by Pearson correlation
D) The Pearson correlation matrix is either dichotomized to arrive at an adjacency matrix (unweighted network)...
   ...or transformed continuously with the power adjacency function (weighted network)
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
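The two alternatives in the last step can be sketched as follows (a minimal illustration on made-up data; the 0.8 cutoff and β = 6 are merely example parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(30, 12))          # toy data: 30 genes x 12 arrays

cor = np.corrcoef(expr)                   # gene-gene Pearson correlation matrix

hard = (np.abs(cor) > 0.8).astype(int)    # dichotomize -> unweighted network
soft = np.abs(cor) ** 6                   # power adjacency -> weighted network
np.fill_diagonal(hard, 0)
np.fill_diagonal(soft, 0)
```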
Power adjacency function to transform correlation into adjacency

a_ij = |cor(x_i, x_j)|^β

To determine β: in general use the "scale-free topology criterion" described in Zhang and Horvath 2005.
Typical value: β = 6. Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
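A quick numeric illustration of why this soft threshold works: raising |cor| to the power β preserves strong correlations but strongly suppresses weak ones (values below are simple arithmetic, not from the slides):

```python
# Soft thresholding with the typical beta = 6:
beta = 6
for c in (0.9, 0.7, 0.5, 0.3):
    print(f"|cor|={c:.1f}  ->  a_ij={c ** beta:.4f}")
# strong pairs keep substantial weight; weak pairs fall to nearly zero
```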
Comparing adjacency functions
Power adjacency (soft threshold) vs. step function (hard threshold)
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Why weighted?
• A continuous spectrum between perfect co-expression and no co-expression at all
• Could threshold, but will lose information
• Instead, assign a weight to each link that represents the extent of gene co-expression
• Natural range of weights: 0 = no connection, 1 = perfect agreement
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Local Clustering algorithm identifies further (reasonable) types of expression relationships
[Panels: Simultaneous (traditional global correlation), Inverted, Time-Shifted]
Local Alignment
Qian J. et al. "Beyond Synexpression Relationships: Local Clustering of Time-shifted and Inverted Gene Expression Profiles Identifies New, Biologically Relevant Interactions." J. Mol. Biol. (2001) 314, 1053-1066
Simultaneous
Local Alignment
[Qian et al. 2001]
Time-Shifted
Local Alignment
[Qian et al. 2001]
Inverted
Unsupervised Mining
PCA
Relation of Clustering to PCA
• Correlation matrix C
  - Similarity of all pairs, either in terms of rows (genes) or columns (conditions)
  - C(genes) = AA^T; C(conditions) = A^T A
• Clustering operates on C, thresholding it to give a network
• PCA gives another view of C: it diagonalizes C
• Two different correlation matrices (for rows & columns); SVD operates on both
Covariance matrix
• N random variables
• N×N symmetric matrix
• Diagonal elements are variances
PCA section will be a "mash up" of a number of PPTs on the web
• pca-1 (black) -> www.astro.princeton.edu/~gk/A542/PCA.ppt, by Professor Gillian R. Knapp (gk@astro.princeton.edu)
• pca-2 (yellow) -> myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt, by Hal Whitehead. Main course URL: http://myweb.dal.ca/~hwhitehe/BIOL4062/handout4062.htm
• pca.ppt (what is a covariance matrix) -> hebb.mit.edu/courses/9.641/lectures/pca.ppt, by Sebastian Seung. Main course page: http://hebb.mit.edu/courses/9.641/index.html
• BIIS_05lecture7.ppt -> www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt, by Professor R. S. Gaborski
Abstract
Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the sample's information.
By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt
Geometric picture of principal components (PCs)
A sample of n observations in the 2-D space
Goal: to account for the variation in a sample in as few variables as possible, to some accuracy
Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt
Geometric picture of principal components (PCs)
PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous:
• the 1st PC is a minimum-distance fit to a line in space
• the 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
Adapted from http://www.astro.princeton.edu/~gk/A542/PCA.ppt
PCA: General methodology
From k original variables x1, x2, ..., xk, produce k new variables y1, y2, ..., yk:
y1 = a11·x1 + a12·x2 + ... + a1k·xk
y2 = a21·x1 + a22·x2 + ... + a2k·xk
...
yk = ak1·x1 + ak2·x2 + ... + akk·xk
such that:
• the yk's are uncorrelated (orthogonal)
• y1 explains as much as possible of the original variance in the data set
• y2 explains as much as possible of the remaining variance
• etc.
The yk's are the Principal Components.
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
[Scatter plot with rotated axes: 1st Principal Component, y1; 2nd Principal Component, y2]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
• Rotates multivariate dataset into a new configuration which is easier to interpret
• Purposes:
  - simplify data
  - look at relationships between variables
  - look at patterns of units
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
{a11, a12, ..., a1k} is the 1st eigenvector of the correlation/covariance matrix, and the coefficients of the 1st principal component
{a21, a22, ..., a2k} is the 2nd eigenvector of the correlation/covariance matrix, and the coefficients of the 2nd principal component
...
{ak1, ak2, ..., akk} is the kth eigenvector of the correlation/covariance matrix, and the coefficients of the kth principal component
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
eigenvalue problem
• The eigenvalue problem is any problem having the following form:
  A·v = λ·v
  A: n×n matrix; v: n×1 non-zero vector; λ: scalar
Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.
[from BIIS_05lecture7.ppt] Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
eigenvalue problem

A·v = λ·v

[2 3] [3]   [12]       [3]
[2 1] [2] = [ 8] = 4 × [2]

Therefore, (3, 2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A.

Given matrix A, how can we calculate the eigenvectors and eigenvalues for A?
[from BIIS_05lecture7.ppt] Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
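The worked 2×2 example above can be checked numerically, and the answer to the question is that a library routine computes all eigenpairs at once (a minimal sketch using NumPy):

```python
import numpy as np

A = np.array([[2, 3],
              [2, 1]])
v = np.array([3, 2])

# A.v = (12, 8) = 4 * (3, 2), so v is an eigenvector with eigenvalue 4
assert np.array_equal(A @ v, 4 * v)

# the general answer: compute the full eigendecomposition
evals, evecs = np.linalg.eig(A)
```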
Principal Components Analysis
So, principal components are given by:
y1 = a11x1 + a12x2 + ... + a1kxk
y2 = a21x1 + a22x2 + ... + a2kxk
...
yk = ak1x1 + ak2x2 + ... + akkxk
xj’s are standardized if correlation matrix is used (mean 0.0, SD 1.0)
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
Score of ith unit on jth principal component
yi,j = aj1xi1 + aj2xi2 + ... + ajkxik
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
PCA Scores
[Scatter plot: original coordinates (xi1, xi2) and scores on the rotated axes (yi,1, yi,2)]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis
Amount of variance accounted for by:
1st principal component: λ1, the 1st eigenvalue
2nd principal component: λ2, the 2nd eigenvalue
...
λ1 > λ2 > λ3 > λ4 > ...
Average λj = 1 (correlation matrix)
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis: Eigenvalues
[Scatter plot with eigenvalues λ1, λ2 as the variances along the principal axes]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
PCA: Terminology
• jth principal component is the jth eigenvector of the correlation/covariance matrix
• coefficients, ajk, are elements of the eigenvectors and relate the original variables (standardized if using the correlation matrix) to the components
• scores are values of units on components (produced using the coefficients)
• amount of variance accounted for by a component is given by its eigenvalue, λj
• proportion of variance accounted for by a component is given by λj / Σ λj
• loading of the kth original variable on the jth component is given by ajk·√λj, the correlation between variable and component
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
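All of the terminology above (eigenvectors, eigenvalues, scores, proportion of variance) can be demonstrated on made-up data (a minimal sketch, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
X[:, 1] += 2 * X[:, 0]                   # make two variables correlated

C = np.cov(X, rowvar=False)              # 3x3 covariance matrix
evals, evecs = np.linalg.eigh(C)         # eigh returns ascending order for symmetric C
order = evals.argsort()[::-1]            # sort components by eigenvalue, descending
evals, evecs = evals[order], evecs[:, order]

scores = (X - X.mean(axis=0)) @ evecs    # score: y_ij = sum_k a_jk * x_ik
explained = evals / evals.sum()          # proportion of variance per component
```

The covariance matrix of the scores is diagonal, confirming that the components are uncorrelated.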
How many components to use?
• If λj < 1 then the component explains less variance than an original variable (correlation matrix)
• Use 2 components (or 3) for visual ease
• Scree diagram: [plot of eigenvalue vs. component number]
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
Principal Components Analysis on:
• Covariance Matrix:
  - Variables must be in same units
  - Emphasizes variables with most variance
  - Mean eigenvalue ≠ 1.0
  - Useful in morphometrics, a few other cases
• Correlation Matrix:
  - Variables are standardized (mean 0.0, SD 1.0)
  - Variables can be in different units
  - All variables have same impact on analysis
  - Mean eigenvalue = 1.0
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
PCA: Potential Problems
• Lack of Independence: no problem
• Lack of Normality: normality desirable but not essential
• Lack of Precision: precision desirable but not essential
• Many Zeroes in Data Matrix: problem (use Correspondence Analysis)
Adapted from http://myweb.dal.ca/~hwhitehe/BIOL4062/pca.ppt
(c) M Gerstein '06, gerstein.info/talks
PCA applications: Eigenfaces
• the principal eigenface looks like a bland androgynous average human face
http://en.wikipedia.org/wiki/Image:Eigenfaces.png
Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
Eigenfaces – Face Recognition
• When properly weighted, eigenfaces can be summed together to create an approximate gray-scale rendering of a human face.
• Remarkably few eigenvector terms are needed to give a fair likeness of most people's faces
• Hence eigenfaces provide a means of applying data compression to faces for identification purposes.
Adapted from http://www.cs.rit.edu/~rsg/BIIS_05lecture7.ppt
Unsupervised Mining
SVD
Puts together slides prepared by Brandon Xia with images from Alter et al. and Kluger et al. papers
SVD for microarray data(Alter et al, PNAS 2000)
A = USV^T
• A is any rectangular matrix (m ≥ n)
• Row space: vector subspace generated by the row vectors of A
• Column space: vector subspace generated by the column vectors of A
  - The dimension of the row & column space is the rank of the matrix A: r (≤ n)
• A is a linear transformation that maps vector x in row space into vector Ax in column space
A = USV^T
• U is an "orthogonal" matrix (m ≥ n)
• Column vectors of U form an orthonormal basis for the column space of A: U^T U = I
• u1, ..., un in U are eigenvectors of AA^T
  - AA^T = USV^T VSU^T = US^2 U^T
  - "Left singular vectors"
U = [u1 u2 ... un] (as columns)
A = USV^T
• V is an orthogonal matrix (n by n)
• Column vectors of V form an orthonormal basis for the row space of A: V^T V = V V^T = I
• v1, ..., vn in V are eigenvectors of A^T A
  - A^T A = VSU^T USV^T = VS^2 V^T
  - "Right singular vectors"
V = [v1 v2 ... vn] (as columns)
A = USV^T
• S is a diagonal matrix (n by n) of non-negative singular values
• Typically sorted from largest to smallest
• Singular values are the non-negative square roots of the corresponding eigenvalues of A^T A and AA^T
AV = US
• Means each Avi = siui
• Remember A is a linear map from row space to column space
• Here, A maps an orthonormal basis {vi} in row space into an orthonormal basis {ui} in column space
• Each component of ui is the projection of a row onto the vector vi
SVD of A (m by n): recap
• A = USV^T = (big "orthogonal")(diagonal)(square orthogonal)
• u1, ..., um in U are eigenvectors of AA^T
• v1, ..., vn in V are eigenvectors of A^T A
• s1, ..., sn in S are the non-negative singular values of A
• AV = US means each A·vi = si·ui
• "Every A is diagonalized by 2 orthogonal matrices"
SVD as a sum of rank-1 matrices
• A = USV^T
• A = s1·u1·v1^T + s2·u2·v2^T + ... + sn·un·vn^T
• s1 ≥ s2 ≥ ... ≥ sn ≥ 0
• What is the rank-r matrix Â that best approximates A?
  - Minimize the squared Frobenius error: Σ_i Σ_j (A_ij − Â_ij)^2
  - Answer: Â = s1·u1·v1^T + s2·u2·v2^T + ... + sr·ur·vr^T
• Very useful for matrix approximation
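This truncation is easy to verify numerically (a minimal sketch on a random matrix, not from the slides): the squared Frobenius error of the rank-r truncation equals the sum of the squared discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2
A_r = (U[:, :r] * s[:r]) @ Vt[:r, :]     # s1*u1*v1^T + s2*u2*v2^T

err = ((A - A_r) ** 2).sum()             # squared Frobenius norm of the residual
```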
Examples of (almost) rank-1 matrices
• Steady states with fluctuations
• Array artifacts?
• Signals?

[101 103 102]    [1 2 1]    [101 303 202]
[302 300 301]    [2 4 2]    [102 300 201]
[203 204 203]    [1 2 1]    [103 304 203]
[401 402 404]    [0 0 0]    [101 302 204]
Geometry of SVD in row space
• A as a collection of m row vectors (points) in the row space of A
• s1·u1·v1^T is the best rank-1 matrix approximation for A
• Geometrically: v1 is the direction of the best approximating rank-1 subspace that goes through the origin
• s1·u1 gives coordinates for the row vectors in the rank-1 subspace
• v1 gives coordinates for the row-space basis vectors in the rank-1 subspace
[Figure: 2-D row space (x, y) with best-fit direction v1]
Geometry of SVD in row space
[Figure: the line segment along v1 through the origin approximates the original data set; the projected data set s1·u1·v1^T approximates the original A]
Geometry of SVD in row space
• A as a collection of m row vectors (points) in the row space of A
• s1·u1·v1^T + s2·u2·v2^T is the best rank-2 matrix approximation for A
• Geometrically: v1 and v2 are the directions of the best approximating rank-2 subspace that goes through the origin
• s1·u1 and s2·u2 give coordinates for the row vectors in the rank-2 subspace
• v1 and v2 give coordinates for the row-space basis vectors in the rank-2 subspace
What about the geometry of SVD in column space?
• A = USV^T
• A^T = VSU^T
• The column space of A becomes the row space of A^T
• The same as before, except that U and V are switched
Geometry of SVD in row and column spaces
• Row space
  - si·ui gives coordinates for row vectors along unit vector vi
  - vi gives coordinates for row-space basis vectors along unit vector vi
• Column space
  - si·vi gives coordinates for column vectors along unit vector ui
  - ui gives coordinates for column-space basis vectors along unit vector ui
• Along the directions vi and ui, these two spaces look pretty much the same!
  - Up to scale factors si
  - Switch row/column vectors and row/column space basis vectors
  - Biplot...
When is SVD = PCA?
• Centered data
[Figure: centered data cloud; the SVD axes (x', y') coincide with the PCA axes]
When is SVD different from PCA?
[Figure: uncentered data; the SVD axes (x'', y'') through the origin differ from the PCA axes (x', y') through the centroid]
Translation is not a linear operation, as it moves the origin!
Additional Points
• Time complexity (cubic)
• Application to text mining
  - Latent semantic indexing
  - sparsity
Conclusion
• SVD is the "absolute high point of linear algebra"
• SVD is difficult to compute; but once we have it, we have many things
• SVD finds the best approximating subspace, using a linear transformation
• Simple SVD cannot handle translation, non-linear transformation, separation of labeled data, etc.
• Good for exploratory analysis; but once we know what we are looking for, use appropriate tools and model the structure of the data explicitly!
Unsupervised Mining
Intuition on interpretation of SVD in terms of genes and conditions
SVD for microarray data(Alter et al, PNAS 2000)
Notation
• m = 1000 genes
  - row vectors
  - 10 eigengenes (vi) of dimension 10 conditions
• n = 10 conditions (assays)
  - column vectors
  - 10 eigenconditions (ui) of dimension 1000 genes
Understanding Eigengenes (vi) in terms of PCA on the (large) gene-gene correlation matrix

Understanding Eigenconditions (ui) in terms of PCA on the (small) condition-condition correlation matrix
(bra-ket notation)
Close up on Eigengenes
Copyright ©2000 by the National Academy of Sciences
Alter, Orly et al. (2000) Proc. Natl. Acad. Sci. USA 97, 10101-10106
Genes sorted by correlation with top 2 eigengenes
Normalized elutriation expression in the subspace associated with the cell cycle
Plotting Experiments in Low Dimension Subspace
Unsupervised Mining
Weighted Gene Co-Expression Network
Weighted Gene Co-Expression Network Analysis
Bin Zhang and Steve Horvath (2005)
"A General Framework for Weighted Gene Co-Expression Network Analysis",
Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Art. 17. Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Network Modules
Central concept in network methodology:
• Modules: groups of densely interconnected genes (not the same as closely related genes)
  - a class of over-represented patterns
• Empirical fact: gene co-expression networks exhibit modular structure
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Module Detection
• Numerous methods exist
• Many methods define a suitable gene-gene dissimilarity measure and use clustering.
• In our case: dissimilarity based on topological overlap
• Clustering method: average-linkage hierarchical clustering
  - branches of the dendrogram are modules
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Topological overlap measure, TOM
• Pairwise measure by Ravasz et al, 2002
• TOM[i,j] measures the overlap of the set of nearest neighbors of nodes i, j
• Closely related to twinness
• Easily generalized to weighted networks
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Calculating TOM
• Normalized to [0,1] with 0 = no overlap, 1 = perfect overlap
• Generalized in Zhang and Horvath (2005) to the case of weighted networks

TOM_ij = (Σ_u a_iu·a_uj + a_ij) / (min(k_i, k_j) + 1 − a_ij)

DistTOM_ij = 1 − TOM_ij
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
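The TOM formula above translates directly into matrix operations (a minimal sketch, not the WGCNA package's implementation):

```python
import numpy as np

def tom(A):
    """Topological overlap matrix for a (weighted) adjacency A with zero diagonal."""
    k = A.sum(axis=1)                        # connectivity of each node
    shared = A @ A                           # sum_u a_iu * a_uj (shared neighbors)
    num = shared + A
    den = np.minimum.outer(k, k) + 1 - A
    T = num / den
    np.fill_diagonal(T, 1)                   # a node overlaps itself perfectly
    return T
```

For an unweighted triangle (every pair shares the third node as a neighbor), every entry comes out as perfect overlap.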
Example of module detection via hierarchical clustering
• Expression data from human brains, 18 samples
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Module eigengenes
• Often we would like to treat modules as single units
  - biologically motivated data reduction
• Construct a representative
• Our choice: module eigengene = 1st principal component of the module expression matrix
• Intuitively: a kind of average expression profile
• Genes of each module must be highly correlated for a representative to really represent
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
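A module eigengene can be sketched as the leading right singular vector of the centered genes-by-samples matrix (a minimal illustration, not the WGCNA package's `moduleEigengenes` routine):

```python
import numpy as np

def module_eigengene(expr):
    """1st principal component of a module's genes-x-samples expression matrix."""
    X = expr - expr.mean(axis=1, keepdims=True)   # center each gene's profile
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]                                   # representative profile across samples
```

For a module of genes that share one underlying profile plus noise, the eigengene tracks that shared profile closely.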
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
Example: human brain expression data, 18 samples; module consisting of 50 genes
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
Module eigengenes are very useful!
• Summarize each module in one synthetic expression profile
• Suitable representation in situations where modules are considered the basic building blocks of a system
  - Allow relating modules to external information (phenotypes, genotypes such as SNPs, clinical traits) via simple measures (correlation, mutual information, etc.)
  - Can quantify co-expression relationships of various modules by standard measures
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Langfelder P, Horvath S (2007) Eigengene networks for studying the relationships between co-expression modules. BMC Systems Biology 2007, 1:54
Human and chimpanzee brain expression data
• Construct gene expression networks in both sets, find modules
• Construct consensus modules
• Characterize each module by the brain region where it is most differentially expressed
• Represent each module by its eigengene
• Characterize relationships among modules by correlation of the respective eigengenes (heatmap or dendrogram)
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Set modules
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Biological information?
Assign modules to brain regions with highest (positive) differential expression.
Red means the module genes are over-expressed in the brain region; green means under-expression.
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Visualizing consensus eigengene networks
Heatmap comparisons of module relationships
Adapted from : http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
For more information
Weighted Gene Co-expression Networks website:
http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/
A short methodological summary of the publications:
• How to construct a gene co-expression network using the scale-free topology criterion? Robustness of network results. Relating a gene significance measure and the clustering coefficient to intramodular connectivity:
  – Zhang B, Horvath S (2005) "A General Framework for Weighted Gene Co-Expression Network Analysis", Statistical Applications in Genetics and Molecular Biology: Vol. 4: No. 1, Article 17
• Theory of module networks (both co-expression and protein-protein interaction modules):
  – Dong J, Horvath S (2007) Understanding Network Concepts in Modules, BMC Systems Biology 2007, 1:24
• What is the topological overlap measure? Empirical studies of the robustness of the topological overlap measure:
  – Yip A, Horvath S (2007) Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics 2007, 8:22
• Software for carrying out neighborhood analysis based on topological overlap. The paper shows that an initial seed neighborhood comprised of 2 or more highly interconnected genes (high TOM, high connectivity) yields superior results. It also shows that topological overlap is superior to correlation when dealing with expression data.
– Li A, Horvath S (2006) Network Neighborhood Analysis with the multi-node topological overlap measure. Bioinformatics. doi:10.1093/bioinformatics/btl581
• Gene screening based on intramodular connectivity identifies brain cancer genes that validate. This paper shows that WGCNA greatly alleviates the multiple comparison problem and leads to reproducible findings.
– Horvath S, Zhang B, Carlson M, Lu KV, Zhu S, Felciano RM, Laurance MF, Zhao W, Shu, Q, Lee Y, Scheck AC, Liau LM, Wu H, Geschwind DH, Febbo PG, Kornblum HI, Cloughesy TF, Nelson SF, Mischel PS (2006) "Analysis of Oncogenic Signaling Networks in Glioblastoma Identifies ASPM as a Novel Molecular Target", PNAS | November 14, 2006 | vol. 103 | no. 46 | 17402-17407
• The relationship between connectivity and knock-out essentiality is dependent on the module under consideration. Hub genes in some modules may be non-essential. This study shows that intramodular connectivity is much more meaningful than whole network connectivity:
– "Gene Connectivity, Function, and Sequence Conservation: Predictions from Modular Yeast Co-Expression Networks" (2006) by Carlson MRJ, Zhang B, Fang Z, Mischel PS, Horvath S, and Nelson SF, BMC Genomics 2006, 7:40
• How to integrate SNP markers into weighted gene co-expression network analysis? The following 2 papers outline how SNP markers and co-expression networks can be used to screen for gene expressions underlying a complex trait. They also illustrate the use of the module eigengene based connectivity measure kME.
– Single network analysis: Ghazalpour A, Doss S, Zhang B, Wang S, Plaisier C, Castellanos R, Brozell A, Schadt EE, Drake TA, Lusis AJ, Horvath S (2006) "Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight". PLoS Genetics. Volume 2 | Issue 8 | AUGUST 2006
– Differential network analysis: Fuller TF, Ghazalpour A, Aten JE, Drake TA, Lusis AJ, Horvath S (2007) "Weighted Gene Co-expression Network Analysis Strategies Applied to Mouse Weight", Mammalian Genome. In Press
• The following application presents a `supervised’ gene co-expression network analysis. In general, we prefer to construct a co-expression network and associated modules without regard to an external microarray sample trait (unsupervised WGCNA). But if thousands of genes are differentially expressed, one can construct a network on the basis of differentially expressed genes (supervised WGCNA):
– Gargalovic PS, Imura M, Zhang B, Gharavi NM, Clark MJ, Pagnon J, Yang W, He A, Truong A, Patel S, Nelson SF, Horvath S, Berliner J, Kirchgessner T, Lusis AJ (2006) "Identification of Inflammatory Gene Modules based on Variations of Human Endothelial Cell Responses to Oxidized Lipids", PNAS 103(34): 12741-12746
• The following paper presents a differential co-expression network analysis. It studies module preservation between two networks. By screening for genes with differential topological overlap, we identify biologically interesting genes. The paper also shows the value of summarizing a module by its module eigengene.
– Oldham M, Horvath S, Geschwind D (2006) "Conservation and Evolution of Gene Co-expression Networks in Human and Chimpanzee Brains", PNAS 103(47): 17973-17978
Adapted from: http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork
Unsupervised Mining
Biplot
Biplot
• A biplot is a two-dimensional representation of a data matrix, showing a point for each of the n observation vectors (rows of the data matrix) along with a point for each of the p variables (columns of the data matrix).
– The prefix 'bi' refers to the two kinds of points, not to the dimensionality of the plot. The method presented here could, in fact, be generalized to a three-dimensional (or higher-order) biplot. Biplots were introduced by Gabriel (1971) and have been discussed at length by Gower and Hand (1996). We applied the biplot procedure to the following toy data matrix to illustrate how a biplot can be generated and interpreted. See the figure on the next page.
• Here we have three variables (transcription factors) and ten observations (genomic bins). We can obtain a two-dimensional plot of the observations by plotting the first two principal components of the TF-TF correlation matrix R1.
– We can then add a representation of the three variables to the plot of principal components to obtain a biplot. This shows each of the genomic bins as a point and each variable as an axis given by a linear combination of the factors.
• The great advantage of a biplot is that its components can be interpreted very easily. First, correlations among the variables are related to the angles between the lines, or more specifically, to the cosines of these angles. An acute angle between two lines (representing two TFs) indicates a positive correlation between the corresponding variables, while an obtuse angle indicates a negative correlation.
– An angle of 0 or 180 degrees indicates perfect positive or negative correlation, respectively. A pair of orthogonal lines represents a correlation of zero. The distances between the points (representing genomic bins) correspond to the similarities between the observation profiles: two observations that are relatively similar across all the variables will fall relatively close to each other within the two-dimensional space of the biplot. The value or score of any observation on any variable is related to the perpendicular projection from the point to the line.
• Refs
– Gabriel, K. R. (1971), "The Biplot Graphical Display of Matrices with Application to Principal Component Analysis," Biometrika, 58, 453-467.
– Gower, J. C., and Hand, D. J. (1996), Biplots, London: Chapman & Hall.
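The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration on a random stand-in matrix (the data, the scaling convention scores = US / loadings = V, and the helper name `cos_angle` are assumptions for the example, not taken from the slides):

```python
import numpy as np

# Toy data matrix A: 10 observations (genomic bins) x 3 variables (TFs).
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
A = A - A.mean(axis=0)          # column-center before PCA/SVD

# Thin SVD of the centered matrix: A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Biplot coordinates (a common convention: scores = U*S, loadings = V).
scores = U[:, :2] * s[:2]       # one point per observation (row)
loadings = Vt[:2, :].T          # one arrow per variable (column)

# The cosine of the angle between two variable arrows approximates the
# correlation between the corresponding variables.
def cos_angle(i, j):
    a, b = loadings[i], loadings[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

Plotting `scores` as points and `loadings` as arrows on the same axes yields the biplot; the angle/distance/projection interpretations above then apply directly.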
Introduction
• A biplot is a low-dimensional (usually 2D) representation of a data matrix A.
– A point for each of the m observation vectors (rows of A)
– A line (or arrow) for each of the n variables (columns of A)
PCA
• TFs: a, b, c...; genomic sites: 1, 2, 3...
• From the data matrix A and its transpose A^T one can form AA^T (site-site correlation) and A^TA (TF-TF correlation).
Biplot to Show Overall Relationship of TFs & Sites
• TFs: a, b, c...; genomic sites: 1, 2, 3...
• Decompose A = USV^T; both AA^T (site-site correlation) and A^TA (TF-TF correlation) share this decomposition.
Biplot Ex [figure: toy data matrix A and the cross-product A^TA]
Biplot Ex #2
• A^T A = V S^2 V^T
• A A^T = U S^2 U^T
• A = (U S^r) (V S^(1-r))^T, for a split exponent r between 0 and 1
• A v_j = s_j u_j and A^T u_j = s_j v_j
Biplot Ex #3
• A v_i = s_i u_i and A^T u_i = s_i v_i
• Assuming s = 1: A v = u and A^T u = v
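These SVD identities can be checked numerically; a minimal sketch on a random toy matrix (the matrix itself is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U S V^T
V = Vt.T

# A^T A = V S^2 V^T  and  A A^T = U S^2 U^T
assert np.allclose(A.T @ A, V @ np.diag(s**2) @ V.T)
assert np.allclose(A @ A.T, U @ np.diag(s**2) @ U.T)

# The coupled singular-vector relations: A v_j = s_j u_j, A^T u_j = s_j v_j
j = 0
assert np.allclose(A @ V[:, j], s[j] * U[:, j])
assert np.allclose(A.T @ U[:, j], s[j] * V[:, j])
```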
Results of Biplot [Zhang et al. (2007) Genome Res.]
• Pilot ENCODE (1% of the genome): 5996 10-kb genomic bins (adding all hits) + 105 TF experiments → biplot
• Angle between TF vectors shows the relation between factors
• Closeness of points gives clustering of "sites"
• Projection of a site onto a vector gives the degree to which that site is associated with a particular factor
Results of Biplot [Zhang et al. (2007) Genome Res.]
• The biplot groups TFs into sequence-specific and sequence-nonspecific clusters.
– c-Myc may behave more like a sequence-nonspecific TF.
– H3K27me3 functions in a transcriptional regulatory process in a rather sequence-specific manner.
• Genomic bins are associated with different TFs; in this fashion each bin is "annotated" by its closest TF cluster.
Unsupervised Mining
CCA
Sorcerer II Global Ocean Survey
• Sorcerer II journey: August 2003 - January 2006
• Sampled approximately every 200 miles
Rusch et al., PLoS Biology 2007
Sorcerer II Global Ocean Survey
• Metagenomic sequence: 6.25 GB of data
– 0.1-0.8 μm size fraction (bacteria)
– 6.3 billion base pairs (7.7 million reads)
– Reads were assembled and genes annotated; ~1 million CPU hours to process
• Metadata: GPS coordinates, sample depth, water depth, salinity, temperature, chlorophyll content
Rusch et al., PLoS Biology 2007
• Derived data: metabolic pathways and membrane protein families; additional metadata obtained via GPS coordinates
Extracting Environmental Data from Other Sources (World Ocean Atlas 2005, NOAA/NODC)
• Nutrient features extracted: phosphate, silicate, nitrate, apparent oxygen utilization, dissolved oxygen
• Example site [figure: annual phosphate [μmol/l] at the surface]: sample depth 1 m; water depth 32 m; chlorophyll 4.0 μg/kg; salinity 31 psu; temperature 11 °C; location 41°5'28"N, 71°36'8"W
Mapping Raw Metagenomic Reads to a Matrix of Families or Pathways for Each Site
[figure: families matrix]
Patel et al., Genome Research 2010
Expressing data as matrices indexed by site, environmental variable, and pathway usage
• Pathway sequences (community function) vs. environmental features
[Rusch et al. (2007) PLoS Biology; Gianoulis et al., PNAS 2009]
Simple Relationships: Pairwise Correlations
[figure: matrix of pairwise correlations between pathways and environmental features across sites]
[Gianoulis et al., PNAS 2009]
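A minimal sketch of how such a pathway-by-environment correlation matrix could be computed, assuming random stand-in data (the matrix sizes, names, and the helper `colcorr` are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n_sites = 30
# Hypothetical matrices: pathway usage per site and env. features per site.
pathways = rng.normal(size=(n_sites, 5))   # e.g. 5 metabolic pathways
env = rng.normal(size=(n_sites, 3))        # e.g. temperature, salinity, depth

# Pearson correlation of every pathway column with every env. column.
def colcorr(X, Y):
    Xc = (X - X.mean(0)) / X.std(0)
    Yc = (Y - Y.mean(0)) / Y.std(0)
    return Xc.T @ Yc / len(X)              # 5 x 3 correlation matrix

R = colcorr(pathways, env)
```

Each entry R[i, j] is the pairwise correlation of pathway i with environmental feature j; CCA (below) instead weights all variables on each side simultaneously.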
Canonical Correlation Analysis: Simultaneous Weighting
• Toy example: a "Lifestyle Index" (a weighted combination a + b + c of variables such as weight and # km run/week) is matched against a "Fit Index" (another weighted combination a + b + c), with the weights on both sides chosen simultaneously so that the two indices correlate maximally.
Canonical Correlation Analysis: Simultaneous Weighting
• As in the toy example (Lifestyle Index vs. Fit Index), CCA weights two sets of variables simultaneously: environmental features (temperature, chlorophyll, etc.) on one side and metabolic pathways/protein families (photosynthesis, lipid metabolism, etc.) on the other.
CCA: Finding Variables with Large Projections in the "Correlation Circle"
• The goal of this technique is to interpret cross-covariance matrices. We do this by defining a change of basis.
Gianoulis et al., PNAS 2009
Formal CCA def'n
• More intuitively, canonical correlation analysis can be defined as the problem of finding two sets of basis vectors, one for x and the other for y, such that the correlations between the projections of the variables onto these basis vectors are mutually maximized.
[From M Borga, CCA Tutorial, http://www.imt.liu.se/~magnus/cca ]
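Borga's definition translates directly into code: whiten each block of variables, then take the SVD of the whitened cross-covariance. The sketch below is a minimal NumPy illustration on synthetic data; the synthetic data and the eigendecomposition route to the inverse square roots are assumptions for the example, not the authors' implementation:

```python
import numpy as np

def cca(X, Y):
    """Minimal CCA sketch: returns the canonical correlations r and weight
    matrices Wx, Wy such that the projections X @ Wx and Y @ Wy are
    maximally correlated, dimension by dimension."""
    X = X - X.mean(0); Y = Y - Y.mean(0)
    n = len(X)
    Sxx = X.T @ X / n; Syy = Y.T @ Y / n; Sxy = X.T @ Y / n
    def inv_sqrt(S):                      # inverse matrix square root (PD S)
        w, V = np.linalg.eigh(S)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, r, Vt = np.linalg.svd(Kx @ Sxy @ Ky)  # singular values = canonical corrs
    return r, Kx @ U, Ky @ Vt.T

# Synthetic data sharing one latent signal z between the two blocks.
rng = np.random.default_rng(3)
z = rng.normal(size=(200, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 2))])
Y = np.hstack([-z + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 2))])
r, Wx, Wy = cca(X, Y)   # first canonical correlation should be near 1
```

By construction, the in-sample correlation of the first pair of projections equals the first singular value, and the squared singular values are the squared canonical correlations mentioned on the next slide.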
CCA results
• We are defining a change of basis of the cross-covariance matrix. We want the correlations between the projections of the variables, X and Y, onto the basis vectors to be mutually maximized.
– Eigenvalues → squared canonical correlations
– Eigenvectors → normalized canonical correlation basis vectors
• [figure: environment and family variables plotted in the first and second CCA dimensions, with correlations ranging from 0.3 to 1]
• Correlation circle: the closer a point is to the outer circle, the higher the correlation; variables projected in the same direction are correlated.
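The correlation-circle coordinates themselves are easy to compute: each variable is placed at its correlation with the two displayed dimensions. A hedged sketch, assuming random stand-in data and two orthogonalized score vectors in place of actual canonical variates:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 4))            # hypothetical original variables
t1 = X @ rng.normal(size=4)             # stand-ins for the first two
t2 = X @ rng.normal(size=4)             # canonical variates (scores)

# Orthogonalize t2 against t1: the displayed axes should be uncorrelated.
t1 = t1 - t1.mean(); t2 = t2 - t2.mean()
t2 = t2 - (t1 @ t2 / (t1 @ t1)) * t1

# Each variable's coordinates = its correlations with the two dimensions.
coords = np.array([[np.corrcoef(X[:, j], t)[0, 1] for t in (t1, t2)]
                   for j in range(X.shape[1])])
radii = np.linalg.norm(coords, axis=1)  # <= 1 when the axes are orthogonal
```

A radius near 1 means the variable is well represented in the plotted plane; two variables pointing in the same direction are positively correlated, matching the reading rules above.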
Circuit Map
• Environmentally invariant vs. environmentally variant pathways
• Edge shading: strength of pathway co-variation with the environment
Gianoulis et al., PNAS 2009
Conclusion #1: energy-conversion strategy varies with temperature and depth
Gianoulis et al., PNAS 2009
Conclusion #2: Outer Membrane Components Vary with the Environment
• Membrane proteins interact with the environment, transporting available nutrients, sensing environmental signals, and responding to changes.
Gianoulis et al., PNAS 2009; Patel et al., Genome Research 2010
Membrane Proteins: Sensing and Responding to the Environment
[figure: CCA plot (Dimension 1 vs. Dimension 2) of environmental features — water depth, acidity, apparent O2 utilization, salinity, pollution, climate change, shipping, phosphate, nitrate, silicate, temperature, chlorophyll, dissolved O2, UV, sample depth — with variant and invariant membrane protein families]
• 44 invariant and 107 variant membrane protein families
• 2.3 million predicted membrane proteins; 1.2 million unique; 850,000 mapped to 151 membrane protein COGs
• 20% have NO KNOWN FUNCTION
Patel and Gianoulis et al., Genome Research 2010
Membrane Proteins Co-vary More than Metabolic Pathways
• Median absolute structural correlation coefficient: metabolic pathways = 0.17 vs. membrane proteins = 0.3
[figure: side-by-side CCA plots (Dimension 1 vs. Dimension 2) for metabolic pathways and membrane proteins, each showing variant and invariant groups against the environmental features]
Patel and Gianoulis et al., Genome Research 2010
CCA Limitations
• Both the strength and the directionality of relationships between nodes are difficult to decipher in this format.
[figure: the CCA plot of environmental features and variant/invariant membrane protein families]
Patel and Gianoulis et al., (in press) Genome Research
Protein Families and Environmental Features Network (PEN)
• Edge distance: cos θ_ab = (a · b) / (|a||b|), the dot product between node positions in the 1st and 2nd dimensions of CCA
Patel et al., Genome Research 2010
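A minimal sketch of this edge weight, assuming hypothetical 2-D CCA coordinates (the node names appear on these slides, but the positions here are invented for illustration):

```python
import numpy as np

# Each node (protein family or env. feature) gets a 2-D position from the
# first two CCA dimensions; an edge weight is the cosine of the angle
# between the two position vectors (normalized dot product).
pos = {"COG0598": np.array([0.8, 0.1]),      # hypothetical coordinates
       "Phosphate": np.array([0.7, 0.2]),
       "Temperature": np.array([-0.6, 0.5])}

def edge_weight(a, b):
    va, vb = pos[a], pos[b]
    return va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))

w_close = edge_weight("COG0598", "Phosphate")    # near +1: associated
w_far = edge_weight("COG0598", "Temperature")    # negative: anti-associated
```

Nodes projected in similar directions in the correlation circle thus end up strongly connected in the network, which is how the bi-modules below emerge.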
Protein Families and Environmental Features Network (PEN)
• "Bi-modules": groups of environmental features and membrane protein families that are associated
• UV, dissolved oxygen, apparent oxygen utilization, sample depth, and water depth are not in the network
• Examples: COG0598 (magnesium transporter), COG1176 (polyamine transporter)
Bi-module 1: Phosphate/Phosphate Transporters
• Low phosphate: high-affinity phosphate transporters, which are induced during phosphate limitation
• High phosphate: low-affinity inorganic phosphate ion transporters, which are constitutively expressed
Patel et al., Genome Research 2010
Bi-module 2: Iron Transporters/Pollution/Shipping
• Negative relationship between areas of high ocean-based pollution and shipping and transporters involved in the uptake of iron
• Pollution and shipping may be a proxy for iron concentrations
Patel et al., Genome Research 2010