TRANSCRIPT
Clustering and machine learning for gene expression data
Stefan Enroth
Original slides by Torgeir R. Hvidsten, The Linnaeus Centre for Bioinformatics
S. Enroth, 2009.02.18
Machine learning: to learn general concepts from examples
[Diagram: the real world becomes data in a feature space through data collection, and knowledge (classes) through abstraction; machine learning maps data to knowledge. The assumed functional relationship is only partially described by the examples.]
Gene Ontology
A controlled vocabulary, organized in a taxonomy, for describing the molecular role of gene products:
• Molecular function: the tasks performed by individual gene products
• Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations, and macromolecular complexes
Protein structure classification (CATH)
Microarray
Numerical data
Gene/Expr    E1     E2     E3     E4     E5     E6     E7     E8     E9    E10   …     EM
G1        -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  …  -0.94
G2         0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  …  -0.42
G3         0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  …  -1.12
G4        -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  …  -0.62
G5         0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  …  -0.74
G6         0.54   0.53   0.16   0.14   0.20  -0.34  -0.38  -0.36  -0.49  -0.58  …  -1.47
G7         0.20   0.14   0.00   0.11  -0.34  -0.03   0.04  -0.76  -0.81  -1.12  …  -1.36
G8         0.40   0.43   0.18   0.00  -0.14   0.29   0.07  -0.79  -0.81  -0.92  …  -1.22
G9         0.01   0.46   0.28  -0.34  -0.23  -0.36  -0.45  -0.64  -0.79  -1.22  …  -1.09
…           …      …      …      …      …      …      …      …      …      …    …     …
GN        -0.23   0.04   0.00  -0.30  -0.29  -0.45  -0.97  -2.06  -0.89  -1.22  …  -0.97

-0.04 = log(2.3/2.4) = log("Red"/"Green")
M (experiments) < 100
N (genes) ≈ 10k–100k
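Each entry of the table is a log ratio of the two channel intensities. A short check of the slide's example (the slide does not state the log base; the natural log reproduces the −0.04 shown, while log2 is the other common choice in practice):

```python
import math

# Expression values are log ratios of the red and green channel
# intensities.  A natural log reproduces the -0.04 shown on the slide
# for intensities 2.3 and 2.4; log2 is also commonly used.
red, green = 2.3, 2.4
m_natural = math.log(red / green)
m_log2 = math.log2(red / green)
print(round(m_natural, 2))  # -0.04
print(round(m_log2, 2))     # -0.06
```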
Next Generation RNA-Sequencing
Nature Reviews Genetics 10, 57-63 (January 2009)
Numerical data
• Ideally, counts of the actual number of transcripts in the cell
• Also gives information on isoforms, splice variants, etc.
• Ongoing research!
Wang & Sandberg et al, Nature 456, 470-476 (27 November 2008)
Data analysis goals: what to study?
• Classes of experiments: changes in expression levels in tissue samples with, e.g., different diseases, treatments, or environmental effects
• Classes of genes: expression profiles of genes with similar biological function
• Both of the above
Data analysis methods
• Unsupervised learning (clustering, class discovery): used to "discover" natural groups of genes/experiments, e.g.
  – discover subclasses of a form of cancer that is clinically homogeneous
• Supervised learning: used to "learn" a model of a set of predefined classes of genes/experiments, e.g.
  – diagnosis of cancer/subclasses of cancer
Clustering analysis
Need to define:
• a measure of similarity
• an algorithm for using the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters: k^n / k!
Example: 10^500 / 10! ≈ 2.756 × 10^493
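The size of that number can be checked directly, since Python integers have arbitrary precision (k^n / k! is the slide's approximation; the exact count is a Stirling number of the second kind):

```python
from math import factorial

# Approximate number of ways to divide n = 500 items into k = 10
# clusters: k**n / k!.  Python's arbitrary-precision integers make the
# huge value exact.
n, k = 500, 10
approx = k**n // factorial(k)
digits = str(approx)
print(len(digits))  # 494 digits, i.e. about 2.756e493
print(digits[:4])   # 2755
```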
Measure of similarity
[Figure: two expression profiles E1 and E2 separated by a distance d]
What is similar? Euclidean distance? Application dependent.
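As a concrete instance of the Euclidean option, a minimal sketch of the distance between two expression profiles (vectors of log ratios across experiments); the profile values are illustrative:

```python
import math

# Euclidean distance between two expression profiles.  Other measures,
# e.g. Pearson correlation, may be preferred; the choice is
# application dependent.
def euclidean(e1, e2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))

g1 = [-0.47, -3.32, -0.81, 0.11]
g2 = [0.66, 0.07, 0.20, 0.29]
d = euclidean(g1, g2)
```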
Hierarchical clustering
• INPUT: n genes/experiments
• Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
• Repeat:
  – identify the two most similar clusters in d (i.e. the smallest number in d)
  – merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
• OUTPUT: a tree of merged genes/experiments (called a dendrogram)
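The loop above can be sketched in a few lines; this is a minimal illustration using single linkage as the inter-cluster measure, not a reference implementation (names and the tiny distance dictionary are made up for the example):

```python
# Agglomerative clustering sketch: every item starts as its own cluster;
# repeatedly merge the closest pair (single linkage = smallest pairwise
# distance) and record the merges.
def single_linkage(items, dist):
    clusters = [frozenset([i]) for i in items]
    merges = []                        # (cluster_a, cluster_b, distance)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Tiny illustrative distance matrix: a-b = 1, b-c = 2, a-c = 4
dd = {'a': {'b': 1, 'c': 4}, 'b': {'a': 1, 'c': 2}, 'c': {'a': 4, 'b': 2}}
merges = single_linkage(['a', 'b', 'c'], dd)
# first merge joins 'a' and 'b' at distance 1; then 'c' joins at 2
```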
Hierarchical clustering (cont'd)
Popular inter-cluster similarity measures:
(a) single linkage (smallest pairwise distance), (b) complete linkage (largest), and (c) average linkage
Hierarchical clustering (cont’d)
[Figure: single-linkage vs. average-linkage dendrograms on exactly the same data]
Example of hierarchical clustering: languages of Europe
Distance: the number of numerals with a different first letter in two languages, e.g. d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1
(E = English, N = Norwegian, Da = Danish, Du = Dutch, G = German, Fr = French, Sp = Spanish, I = Italian, P = Portuguese, H = Hungarian, Fi = Finnish)
Inter-cluster strategy: single linkage
Iteration 1 (initial 11 × 11 distance matrix; columns in row order):

  E     0
  N     2  0
  Da    2  1  0
  Du    7  5  6  0
  G     6  4  5  5  0
  Fr    6  6  6  9  7  0
  Sp    6  6  5  9  7  2  0
  I     6  6  5  9  7  1  1  0
  P     7  7  6 10  8  5  3  4  0
  H     9  8  8  8  9 10 10 10 10  0
  Fi    9  9  9  9  9  9  9  9  9  8  0

→ merge Fr and I (distance 1)
Iteration 2 (clusters: I Fr, E, N, Da, Du, G, Sp, P, H, Fi; columns in row order):

  I Fr   0
  E      6  0
  N      6  2  0
  Da     5  2  1  0
  Du     9  7  5  6  0
  G      7  6  4  5  5  0
  Sp     1  6  6  5  9  7  0
  P      4  7  7  6 10  8  3  0
  H     10  9  8  8  8  9 10 10  0
  Fi     9  9  9  9  9  9  9  9  8  0

→ merge Da and N (distance 1)
Iteration 3 (clusters: Da N, I Fr, E, Du, G, Sp, P, H, Fi; columns in row order):

  Da N   0
  I Fr   5  0
  E      2  6  0
  Du     5  9  7  0
  G      4  7  6  5  0
  Sp     5  1  6  9  7  0
  P      6  4  7 10  8  3  0
  H      8 10  9  8  9 10 10  0
  Fi     9  9  9  9  9  9  9  8  0

→ merge Sp with I Fr (distance 1)
Iteration 4 (clusters: Sp I Fr, Da N, E, Du, G, P, H, Fi; columns in row order):

  Sp I Fr   0
  Da N      5  0
  E         6  2  0
  Du        9  5  7  0
  G         7  4  6  5  0
  P         3  6  7 10  8  0
  H        10  8  9  8  9 10  0
  Fi        9  9  9  9  9  9  8  0

→ merge E with Da N (distance 2)
Iteration 5 (clusters: E Da N, Sp I Fr, Du, G, P, H, Fi; columns in row order):

  E Da N    0
  Sp I Fr   5  0
  Du        5  9  0
  G         4  7  5  0
  P         6  3 10  8  0
  H         8 10  8  9 10  0
  Fi        9  9  9  9  9  8  0

→ merge P with Sp I Fr (distance 3)
Iteration 6 (clusters: P Sp I Fr, E Da N, Du, G, H, Fi; columns in row order):

  P Sp I Fr   0
  E Da N      5  0
  Du          9  5  0
  G           7  4  5  0
  H          10  8  8  9  0
  Fi          9  9  9  9  8  0

→ merge G with E Da N (distance 4)
Iteration 7 (clusters: G E Da N, P Sp I Fr, Du, H, Fi; columns in row order):

  G E Da N    0
  P Sp I Fr   5  0
  Du          5  9  0
  H           8 10  8  0
  Fi          9  9  9  8  0

→ merge Du with G E Da N (distance 5)
Iteration 8 (clusters: Du G E Da N, P Sp I Fr, H, Fi; columns in row order):

  Du G E Da N   0
  P Sp I Fr     5  0
  H             8 10  0
  Fi            9  9  8  0

→ merge P Sp I Fr with Du G E Da N (distance 5)
Iteration 9 (clusters: P Sp I Fr Du G E Da N, H, Fi; columns in row order):

  P Sp I Fr Du G E Da N   0
  H                       8  0
  Fi                      9  8  0

→ merge Fi and H (distance 8)
Iteration 10 (clusters: Fi H, P Sp I Fr Du G E Da N; columns in row order):

  Fi H                    0
  P Sp I Fr Du G E Da N   8  0

→ final merge (distance 8): all eleven languages joined
[Dendrogram: all eleven languages; distance axis 1–8]
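The ten merge distances of the worked example can be verified without tracing every iteration: for single linkage, the dendrogram merge heights are exactly the sorted edge weights of a minimum spanning tree over the distance matrix. A sketch using Prim's algorithm (matrix copied from iteration 1):

```python
# Single-linkage merge heights = sorted MST edge weights over the
# language distance matrix (order: E N Da Du G Fr Sp I P H Fi).
D = [
    [0, 2, 2, 7, 6, 6, 6, 6, 7, 9, 9],     # E
    [2, 0, 1, 5, 4, 6, 6, 6, 7, 8, 9],     # N
    [2, 1, 0, 6, 5, 6, 5, 5, 6, 8, 9],     # Da
    [7, 5, 6, 0, 5, 9, 9, 9, 10, 8, 9],    # Du
    [6, 4, 5, 5, 0, 7, 7, 7, 8, 9, 9],     # G
    [6, 6, 6, 9, 7, 0, 2, 1, 5, 10, 9],    # Fr
    [6, 6, 5, 9, 7, 2, 0, 1, 3, 10, 9],    # Sp
    [6, 6, 5, 9, 7, 1, 1, 0, 4, 10, 9],    # I
    [7, 7, 6, 10, 8, 5, 3, 4, 0, 10, 9],   # P
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0, 8], # H
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],     # Fi
]

def mst_edge_weights(dist):
    """Prim's algorithm: grow a tree, always adding the cheapest edge."""
    n = len(dist)
    in_tree = {0}
    weights = []
    while len(in_tree) < n:
        w, v = min((dist[i][j], j) for i in in_tree
                   for j in range(n) if j not in in_tree)
        weights.append(w)
        in_tree.add(v)
    return sorted(weights)

print(mst_edge_weights(D))  # [1, 1, 1, 2, 3, 4, 5, 5, 8, 8]
```

These match the merge distances seen across iterations 1–10.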
Any data mining result needs to be consistent BOTH with the data and current knowledge!
Evaluation of clusters
[Dendrogram of the language example; distance axis 1–8]
Clusters may be evaluated according to how well they describe current knowledge, e.g. the known language families: Roman, Slavic, Germanic, Ugro-Finnish
Hierarchical clustering: properties
• Huge memory requirement: stores the n × n matrix
• Running time: O(n³)
• Deterministic: produces the same clustering each time
• Nice visualization: the dendrogram
• The number of clusters can be selected using the dendrogram
• Different interpretations depending on the distance method used
K-means clustering
• Split the data into k random clusters
• Repeat:
  – calculate the centroid of each cluster
  – (re-)assign each gene/experiment to the closest centroid
  – stop if no new assignments are made
Example of K-means in two dimensions (K = 2):
[Figure sequence: initial random clusters; iteration 1: calculate centroids, then (re-)assign; iteration 2: calculate centroids, then (re-)assign; iteration 3: calculate centroids, then (re-)assign — no new assignments, so stop]
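The two-dimensional example can be sketched directly. For reproducibility this sketch initializes the centroids to the first k points rather than a random split; the data points are made up for the illustration:

```python
# Minimal 2-D K-means sketch: alternate between calculating centroids
# and re-assigning points until no assignment changes.
def kmeans(points, k):
    centroids = [points[i] for i in range(k)]   # deterministic start
    assign = None
    while True:
        # (re-)assign each point to the closest centroid
        def closest(p):
            return min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
        new_assign = [closest(p) for p in points]
        if new_assign == assign:                # no new assignments: stop
            return assign, centroids
        assign = new_assign
        # calculate the centroid (mean) of each cluster
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members)
                                           for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep empty cluster put
        centroids = new_centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(pts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```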
K-means: properties
• Low memory usage
• Running time: O(n)
• Improves iteratively: not trapped by previous mistakes
• Non-deterministic: will in general produce different clusters with different initializations
• The number of clusters must be decided in advance
  – algorithms exist that "grow" the number of clusters if the intra-cluster variance is too high (Growing k-means, 2002)
Hierarchical vs. k-means
• Hierarchical clustering:
  – computationally expensive -> relatively small data sets
  – nice visualization; the number of clusters can be selected
  – deterministic
  – cannot correct early "mistakes"
• K-means:
  – computationally efficient -> large data sets
  – predefined number of clusters
  – non-deterministic -> should be run several times
  – iterative improvement
• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k = 2 -> the best of both worlds!
Supervised learning
• Uses examples of known classes to learn a model
• Examples are, for instance, expression profiles of genes with known classes (clinical state or function)
• The model can be, e.g.:
  – hyperplanes separating classes in n dimensions (SVM)
  – artificial neural networks
  – decision trees (Random Forest, C4.5)
  – IF-THEN rules (Rough Sets)
• Can be used for, e.g.:
  – diagnostics
  – predicting gene function for unknown genes
Support Vector Machines
[Figure: maximum-margin separating "hyperplane", with support vectors, a soft margin, and the decision boundary]
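For a linear decision function f(x) = sign(w·x + b), the margin an SVM maximizes is 2/‖w‖. A small illustrative computation (the weights below are made-up values, not a trained model):

```python
import math

# Linear decision function and the margin 2/||w|| that a maximum-margin
# SVM optimizes.  Weights are illustrative only.
def decide(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def margin(w):
    return 2 / math.sqrt(sum(wi * wi for wi in w))

w, b = (3.0, 4.0), -1.0
print(decide(w, b, (1.0, 1.0)))  # 1   (3 + 4 - 1 = 6 > 0)
print(margin(w))                 # 0.4 (2 / 5)
```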
Artificial Neural Networks (ANN)
[Figure: input layer x1 … x4 feeding an output layer that produces f(x)]
A single unit with inputs x1 … xn and weights w1 … wn computes:

  f(x) = 1 if Σ_{i=1}^{n} w_i x_i − w_0 > 0, and 0 otherwise
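The threshold unit on this slide can be sketched directly in code; the weights below are hand-picked to implement logical AND, purely as an illustration:

```python
# A single perceptron unit: output 1 when the weighted sum of the
# inputs exceeds the threshold w0, and 0 otherwise.
def perceptron(x, w, w0):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - w0 > 0 else 0

# AND of two binary inputs: fires only when both are 1
w, w0 = [1.0, 1.0], 1.5
outputs = [perceptron([a, b], w, w0) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```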
Decision tree learning
Class knowledge:
  Group 1: Nordic countries
  Group 2: UK, France, Greece, Spain, Portugal
  Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

  Christian Democrats > 16?
    Yes -> Group 3
    No  -> Agrarians > 4?
             Yes -> Group 1
             No  -> Group 2

Rule learning: Rough sets (same class knowledge)
  Agrarians([4, *)) AND Christian Democrats([*, 16)) => Class(1)
  Agrarians([*, 4)) AND Christian Democrats([*, 16)) => Class(2)
  Christian Democrats([16, *)) => Class(3)
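The decision tree and the rough-set rules encode the same classification; a direct transcription as code (following the rules' interval notation, where [16, *) means ≥ 16):

```python
# Classify a country's party profile into the three groups using the
# rough-set rules' intervals.
def classify(agrarians, christian_democrats):
    if christian_democrats >= 16:  # Christian Democrats([16, *)) => Class(3)
        return 3
    if agrarians >= 4:             # Agrarians([4, *)) AND CD([*, 16)) => Class(1)
        return 1
    return 2                       # Agrarians([*, 4)) AND CD([*, 16)) => Class(2)

print(classify(5, 10))   # 1
print(classify(0, 10))   # 2
print(classify(0, 20))   # 3
```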
Supervised vs. clustering
Clustering:
  + class discovery
  + robust towards incorrect knowledge
Supervised:
  + evaluation
  + predictive/descriptive model
  + based on actual knowledge rather than idealized hypotheses