Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or are co-regulated.

Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.

Unsupervised Analysis

Clustering Methods

K-means: The Algorithm

Given a set of numeric points in d-dimensional space and an integer k

The algorithm generates k (or fewer) clusters as follows:

1. Assign all points to a cluster at random
2. Compute the centroid of each cluster
3. Reassign each point to the nearest centroid
4. If the centroids changed, go back to step 2
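The four steps can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names `kmeans` and `sq_dist`, the `seed` and `max_iter` parameters, and the use of squared Euclidean distance are assumptions made for the example. Clusters that end up empty are dropped, which is why the result may have fewer than k clusters.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (list of tuples) into k or fewer groups.

    Returns (labels, centroids); centroids maps cluster id -> centroid tuple.
    """
    rng = random.Random(seed)
    # Step 1: assign every point to a cluster at random.
    labels = [rng.randrange(k) for _ in points]
    centroids = {}
    for _ in range(max_iter):
        # Step 2: centroid = coordinate-wise mean of each non-empty cluster.
        groups = {}
        for p, c in zip(points, labels):
            groups.setdefault(c, []).append(p)
        new_centroids = {
            c: tuple(sum(col) / len(g) for col in zip(*g))
            for c, g in groups.items()
        }
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
        # Step 3: reassign each point to its nearest centroid.
        labels = [
            min(centroids, key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
    return labels, centroids


# Hypothetical toy data: two loose groups in the plane.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centroids = kmeans(toy, k=2)
```

Because the initial assignment is random, different seeds can lead to different final clusterings.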

K-means: Example, k = 3

Step 1: Make random assignments and compute centroids (big dots)

Step 2: Assign points to nearest centroids

Step 3: Re-compute centroids (in this example, solution is now stable)

Fuzzy K means

The clusters produced by the k-means procedure are sometimes called "hard" or "crisp" clusters, since any feature vector x either is or is not a member of a particular cluster. This is in contrast to "soft" or "fuzzy" clusters, in which a feature vector x can have a degree of membership in each cluster.

The fuzzy-k-means procedure allows each feature vector x to have a degree of membership in Cluster i:

Fuzzy K means Algorithm

Make initial guesses for the means m1, m2, ..., mk

Until there are no changes in any mean:
    Use the estimated means to find the degree of membership u(j,i) of xj in Cluster i; for example, if $\mathrm{dist}(j,i) = \exp(-\|x_j - m_i\|^2)$, one might use $u(j,i) = \mathrm{dist}(j,i) / \sum_j \mathrm{dist}(j,i)$
    For i from 1 to k
        Replace mi with the fuzzy mean of all of the examples for Cluster i:
        $m_i = \frac{\sum_j u(j,i)^2 \, x_j}{\sum_j u(j,i)^2}$
    end_for
end_until
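The loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions: the function name `fuzzy_kmeans`, the `tol` and `max_iter` stopping parameters, and the guard against numerical underflow are not from the slides. The membership uses the example form exp(-||xj - mi||^2) normalized over the points j, exactly as written above.

```python
import math


def fuzzy_kmeans(points, initial_means, max_iter=100, tol=1e-9):
    """Iterate the fuzzy mean update until no mean moves by more than tol."""
    means = [tuple(m) for m in initial_means]
    for _ in range(max_iter):
        new_means = []
        for m in means:
            # dist(j,i) = exp(-||xj - mi||^2) for every point xj.
            d = [math.exp(-sum((x - y) ** 2 for x, y in zip(p, m)))
                 for p in points]
            total = sum(d)
            if total == 0.0:
                # Assumption: if every term underflows, leave the mean as is.
                new_means.append(m)
                continue
            # u(j,i) = dist(j,i) / sum_j dist(j,i)
            u = [dj / total for dj in d]
            # Fuzzy mean: sum_j u^2 xj / sum_j u^2
            w = [uj ** 2 for uj in u]
            w_sum = sum(w)
            new_means.append(tuple(
                sum(wj * p[c] for wj, p in zip(w, points)) / w_sum
                for c in range(len(m))
            ))
        shift = max(abs(a - b)
                    for m, n in zip(means, new_means)
                    for a, b in zip(m, n))
        means = new_means
        if shift < tol:
            break
    return means
```

Since the fuzzy mean divides by the sum of the squared memberships, any constant factor applied to all u(j,i) of one cluster cancels out of the update. The exponential also decays quickly, so this sketch is only sensible for data scaled so that neighbouring points are within a few units of each other.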

Time course experiment

K-means: Sample Application

Gene clustering. Given a series of microarray experiments measuring the expression of a set of genes at regular time intervals in a common cell line.

Normalization allows comparisons across microarrays.

Produce clusters of genes which vary in similar ways over time.

Hypothesis: genes which vary in the same way may be co-regulated and/or participate in the same pathway.

Sample Array. Rows are genes and columns are time points.

A cluster of co-regulated genes.

Centroid Methods - K-means

•Start with random positions of the K centroids.

•Iterate until the centroids are stable:

•Assign points to centroids

•Move centroids to the center of the assigned points

[Figure: state of the clustering at iteration 3]

Application of K-means to time course experiments

Agglomerative Hierarchical Clustering

•Results depend on the distance update method
•Single linkage: elongated clusters
•Complete linkage: sphere-like clusters

•Greedy iterative process
•Not robust against noise
•No inherent measure to choose the clusters
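The dependence on the distance update method can be seen in a small sketch. The function below is a naive illustration, not an efficient implementation: the name `agglomerative`, the `n_clusters` stopping rule, and the Euclidean distance are assumptions made for the example. It repeatedly merges the two closest clusters, where "closest" is the minimum pairwise distance (single linkage) or the maximum pairwise distance (complete linkage).

```python
import math


def agglomerative(points, n_clusters, linkage="single"):
    """Greedily merge clusters until n_clusters remain.

    Returns a list of clusters, each a list of point indices.
    """
    # Single linkage uses the closest pair, complete linkage the farthest pair.
    combine = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine(
                    math.dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a is unaffected
        clusters[a].extend(merged)
    return clusters


# Points on a line with slowly growing gaps (hypothetical data).
chain = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,)]
single = agglomerative(chain, 2, linkage="single")
complete = agglomerative(chain, 2, linkage="complete")
```

On this chain, single linkage keeps absorbing the next neighbour into one long cluster, while complete linkage splits the chain into two more compact groups.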

Gene Expression Data

Cluster genes and conditions

Two independent clusterings:

•Genes are represented as vectors of expression in all conditions

•Conditions are represented as vectors of expression of all genes

[Figure: Colon cancer data (normalized genes). Axes: Experiments (ticks 10 to 60) and Genes (ticks 200 to 2000); color scale from -0.4 to 0.8.]

Two-way Clustering

1. Identify tissue classes (tumor/normal)
First clustering - Experiments

2. Find differentiating and correlated genes
Second clustering - Genes

[Figure labels: Ribosomal proteins, Cytochrome C, HLA2, metabolism]

Coupled Two-way Clustering (CTWC)

Motivation:

•Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players.

•Only a subset of the samples exhibits the expression patterns of interest.

New Goal: Use subsets of genes to study subsets of samples (and vice versa).

This is a non-trivial task: the number of subsets is exponential. CTWC is a heuristic for solving this problem.

CTWC of Colon Cancer Data

[Figure: CTWC of the colon cancer data, panels (A) and (B); axis tick labels omitted.]

Multiple Testing Problem

Simultaneously test m null hypotheses, one for each gene j

Hj: no association between expression measure of gene j and the response

Because microarray experiments simultaneously monitor expression levels of thousands of genes, there is a large multiplicity issue

Increased chance of false positives
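A rough calculation shows how fast the chance of a false positive grows with the number of tests. The sketch below assumes that all m null hypotheses are true and that the m tests are independent, each run at level alpha; both are simplifying assumptions for illustration (expression measurements of different genes are generally correlated), and the function name is hypothetical.

```python
def prob_any_false_positive(m, alpha=0.05):
    """P(at least one rejection) for m independent true nulls tested at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


# Chance of at least one false positive for a few values of m.
for m in (1, 10, 100, 1000):
    print(m, round(prob_any_false_positive(m), 4))
```

Under these assumptions, ten tests at level 0.05 already give roughly a 40% chance of at least one false positive.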

Hypothesis Truth vs. Decision

Truth \ Decision    # not rejected       # rejected          totals
# true H            U                    V (Type I error)    m0
# non-true H        T (Type II error)    S                   m1
totals              m - R                R                   m

Strong Vs. Weak Control

All probabilities are conditional on which hypotheses are true

Strong control refers to control of the Type I error rate under any combination of true and false nulls

Weak control refers to control of the Type I error rate only under the complete null hypothesis (i.e. all nulls true)

In general, weak control without other safeguards is unsatisfactory

Adjusted p-values (p*)

Test level (e.g. 0.05) does not need to be determined in advance

Some procedures are most easily described in terms of their adjusted p-values

Usually easily estimated using resampling

Procedures can be readily compared based on the corresponding adjusted p-values

A Little Notation

For hypothesis Hj, j = 1, …, m

observed test statistic: tj

observed unadjusted p-value: pj

Ordering of observed (absolute) tj: {rj} such that $|t_{r_1}| \ge |t_{r_2}| \ge \dots \ge |t_{r_m}|$

Ordering of observed pj: {rj} such that $p_{r_1} \le p_{r_2} \le \dots \le p_{r_m}$

Denote corresponding RVs by upper case letters (T, P)

Control of the type I errors

Bonferroni single-step adjusted p-values

$p_j^* = \min(m \, p_j, 1)$

Sidak single-step (SS) adjusted p-values

$p_j^* = 1 - (1 - p_j)^m$

Sidak free step-down (SD) adjusted p-values

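$p_{(j)}^* = 1 - (1 - p_{(j)})^{m - j + 1}$, where $p_{(j)}$ is the j-th smallest unadjusted p-value. Monotonicity of the adjusted values is enforced by taking the running maximum over the ordered p-values.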
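The three adjustments above are simple enough to compute directly. The following sketch takes a list of unadjusted p-values and returns the adjusted values in the same order; the function names are hypothetical, and the step-down version applies the running maximum over the sorted p-values so that the adjusted values are monotone.

```python
def bonferroni(p):
    """Single-step Bonferroni: min(m * pj, 1)."""
    m = len(p)
    return [min(m * pj, 1.0) for pj in p]


def sidak_single_step(p):
    """Single-step Sidak: 1 - (1 - pj)^m."""
    m = len(p)
    return [1.0 - (1.0 - pj) ** m for pj in p]


def sidak_step_down(p):
    """Step-down Sidak: 1 - (1 - p(k))^(m - k + 1), with a running maximum."""
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])  # indices by increasing p
    adjusted = [0.0] * m
    running_max = 0.0
    for k, j in enumerate(order, start=1):
        running_max = max(running_max, 1.0 - (1.0 - p[j]) ** (m - k + 1))
        adjusted[j] = running_max
    return adjusted


p_values = [0.01, 0.04, 0.03]  # hypothetical unadjusted p-values
print(bonferroni(p_values))
print(sidak_single_step(p_values))
print(sidak_step_down(p_values))
```

For the same input, the single-step Sidak values are never larger than the Bonferroni values, and the step-down values are never larger than the single-step ones.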


Holm (1979) step-down adjusted p-values

$p_{r_j}^* = \max_{k = 1, \dots, j} \{ \min((m - k + 1) \, p_{r_k}, 1) \}$

Intuitive explanation: once H(1) is rejected by Bonferroni, there are only m-1 remaining hypotheses that might still be true (then another Bonferroni, etc.)
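The Holm formula can be computed in one pass over the sorted p-values. This is a minimal sketch with a hypothetical function name; it returns the adjusted p-values in the original order of the input.

```python
def holm(p):
    """Holm step-down adjusted p-values.

    For the k-th smallest p-value the multiplier is (m - k + 1); the running
    maximum over k = 1..j implements the max in the formula.
    """
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])  # r_1, ..., r_m
    adjusted = [0.0] * m
    running_max = 0.0
    for k, j in enumerate(order, start=1):
        running_max = max(running_max, min((m - k + 1) * p[j], 1.0))
        adjusted[j] = running_max
    return adjusted


print(holm([0.01, 0.04, 0.03]))
```

The smallest p-value gets the full Bonferroni multiplier m, and each later one gets a smaller multiplier.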

Hochberg (1988) step-up adjusted p-values (Simes inequality)

$p_{r_j}^* = \min_{k = j, \dots, m} \{ \min((m - k + 1) \, p_{r_k}, 1) \}$
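The step-up formula uses the same multipliers as Holm but works from the largest p-value downwards, taking a running minimum instead of a running maximum. A minimal sketch, with a hypothetical function name:

```python
def hochberg(p):
    """Hochberg step-up adjusted p-values.

    Walk from the largest p-value (k = m) to the smallest (k = 1) and keep
    the running minimum of min((m - k + 1) * p, 1).
    """
    m = len(p)
    order = sorted(range(m), key=lambda j: p[j])  # r_1, ..., r_m
    adjusted = [0.0] * m
    running_min = 1.0
    for k in range(m, 0, -1):
        j = order[k - 1]
        running_min = min(running_min, (m - k + 1) * p[j])
        adjusted[j] = running_min
    return adjusted


print(hochberg([0.01, 0.04, 0.03]))
```

On the p-values 0.01, 0.04, 0.03, the Holm formula gives 0.03, 0.06, 0.06, so the Hochberg values here are no larger than the Holm values.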


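Westfall & Young (1993) step-down minP adjusted p-values

$p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \min_{l \in \{r_k, \dots, r_m\}} P_l \le p_{r_k} \mid H_0^C ) \}$

Westfall & Young (1993) step-down maxT adjusted p-values

$p_{r_j}^* = \max_{k = 1, \dots, j} \{ \Pr( \max_{l \in \{r_k, \dots, r_m\}} |T_l| \ge |t_{r_k}| \mid H_0^C ) \}$

Here $H_0^C$ denotes the complete null hypothesis (all nulls true).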

Westfall & Young (1993) Adjusted p-values

Step-down procedures: successively smaller adjustments at each step

Take into account the joint distribution of the test statistics

Less conservative than Bonferroni, Sidak, Holm, or Hochberg adjusted p-values

Can be estimated by resampling but computer-intensive (especially for minP)
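A resampling estimate of the step-down maxT adjusted p-values can be sketched as follows. The sketch assumes that the test statistics have already been recomputed on B resampled (for example, label-permuted) data sets; how those statistics are obtained is left open. The function name `maxt_adjusted` and the argument names `t_obs` and `t_perm` are hypothetical.

```python
def maxt_adjusted(t_obs, t_perm):
    """Estimate step-down maxT adjusted p-values.

    t_obs:  observed statistics, one per hypothesis.
    t_perm: list of B lists, the statistics from each resampled data set.
    """
    m = len(t_obs)
    n_resamples = len(t_perm)
    # r_1, ..., r_m: hypotheses ordered by decreasing |t|.
    order = sorted(range(m), key=lambda j: -abs(t_obs[j]))
    counts = [0] * m
    for tb in t_perm:
        running = 0.0
        # Successive maxima over {r_k, ..., r_m}, built from k = m down to 1.
        for k in range(m - 1, -1, -1):
            running = max(running, abs(tb[order[k]]))
            if running >= abs(t_obs[order[k]]):
                counts[k] += 1
    adjusted = [0.0] * m
    running_max = 0.0
    for k in range(m):
        # Outer max over k = 1..j enforces monotonicity.
        running_max = max(running_max, counts[k] / n_resamples)
        adjusted[order[k]] = running_max
    return adjusted


# Hypothetical observed statistics and five resampled statistic vectors.
t_obs = [3.0, 0.5, 2.0]
t_perm = [
    [1.0, 0.2, 0.1],
    [0.4, 2.5, 0.3],
    [0.1, 0.1, 0.6],
    [3.5, 0.2, 0.2],
    [0.2, 0.7, 0.1],
]
print(maxt_adjusted(t_obs, t_perm))
```

With only B resamples, the estimated p-values are multiples of 1/B, so in practice B needs to be large. The minP variant additionally requires an unadjusted p-value for every resampled statistic, which is where most of its extra computing cost comes from.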
