Clustering and machine learning for gene expression data Stefan Enroth Original slides by Torgeir R. Hvidsten The Linnaeus Centre for Bioinformatics


Source: xray.bmc.uu.se/kurs/BioinfX3/2009/l13_clustering.pdf

Clustering and machine learning for gene expression data

Stefan Enroth

Original slides by Torgeir R. Hvidsten

The Linnaeus Centre for Bioinformatics


S. Enroth, 2009.02.18

Machine learning: to learn general concepts from examples

Real world Data (Feature space)

Knowledge (classes)

Assumed functional relationship partially described by the examples

Data collection

Abstraction

Machine learning


Gene Ontology

Ordered controlled vocabulary organized in a taxonomy for describing the molecular role of gene products

• Molecular function: the tasks performed by individual gene products

• Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions

• Cellular component: subcellular structures, locations, and macromolecular complexes


Protein structure classification (CATH)


Microarray


Numerical data

Gene/Expr     E1     E2     E3     E4     E5     E6     E7     E8     E9    E10   …     EM
G1         -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60   …  -0.94
G2          0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45   …  -0.42
G3          0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81   …  -1.12
G4         -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76   …  -0.62
G5          0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29   …  -0.74
G6          0.54   0.53   0.16   0.14   0.20  -0.34  -0.38  -0.36  -0.49  -0.58   …  -1.47
G7          0.20   0.14   0.00   0.11  -0.34  -0.03   0.04  -0.76  -0.81  -1.12   …  -1.36
G8          0.40   0.43   0.18   0.00  -0.14   0.29   0.07  -0.79  -0.81  -0.92   …  -1.22
G9          0.01   0.46   0.28  -0.34  -0.23  -0.36  -0.45  -0.64  -0.79  -1.22   …  -1.09
…              …      …      …      …      …      …      …      …      …      …   …      …
GN         -0.23   0.04   0.00  -0.30  -0.29  -0.45  -0.97  -2.06  -0.89  -1.22   …  -0.97

-0.04 = log(2.3/2.4) = log(“Red/Green”)

M < 100

N ≈ 10k-100k
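The log-ratio entry above can be checked numerically. The following is a minimal sketch that uses the slide's example intensities (red 2.3, green 2.4); the function name `log_ratio` is illustrative. The slide does not state the base of the logarithm: the natural logarithm reproduces -0.04, while base 2, which is common for microarray ratios, would give a slightly different value.

```python
import math

def log_ratio(red, green, base=math.e):
    """Log of the red/green intensity ratio for one spot (illustrative name)."""
    return math.log(red / green, base)

# Natural logarithm: matches the -0.04 shown on the slide.
print(round(log_ratio(2.3, 2.4), 2))

# Base 2 for comparison; the slide does not say which base was used.
print(round(log_ratio(2.3, 2.4, base=2), 2))
```

Equal intensities give a log ratio of 0, and a negative value means the green channel is the stronger one.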


Next Generation RNA-Sequencing

Nature Reviews Genetics 10, 57-63 (January 2009)


Numerical data

• Ideally, counts of the actual number of transcripts in the cell

• Also, information on isoforms, splice variants etc.

• Ongoing research!

Wang & Sandberg et al, Nature 456, 470-476 (27 November 2008)


Data analysis goals

What to study?

• Classes of experiments: changes in expression levels in tissue samples that differ in, e.g., disease, treatment or environmental effects

• Classes of genes: expression profiles of genes with similar biological function

• Both of the above


Data analysis methods

• Unsupervised learning (clustering, class discovery): used to “discover” natural groups of genes/experiments, e.g.
  – discover subclasses of a form of cancer that is clinically homogeneous

• Supervised learning: used to “learn” a model of a set of predefined classes of genes/experiments, e.g.
  – diagnosis of cancer/subclasses of cancer


Clustering analysis

Need to define:

• measure of similarity

• algorithm for using the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters: approximately $k^n/k!$

Example: $10^{500}/10! \approx 2.756 \times 10^{493}$
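The size of this search space can be checked with exact integer arithmetic. The sketch below assumes the slide's example means n = 500 items and k = 10 clusters. It also computes the exact count of partitions into k non-empty clusters (the Stirling number of the second kind), for which $k^n/k!$ is the leading term when n is much larger than k. The function names are illustrative.

```python
from math import comb, factorial

def approx_partitions(n, k):
    """The slide's estimate k^n / k!, using exact big integers."""
    return k**n // factorial(k)

def exact_partitions(n, k):
    """Stirling number of the second kind via inclusion-exclusion."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)

approx = approx_partitions(500, 10)
digits = str(approx)
mantissa = int(digits[:6]) / 10**5
print(f"{mantissa:.3f} x 10^{len(digits) - 1}")

# The exact count agrees with the estimate in its leading digits.
print(str(exact_partitions(500, 10))[:10] == digits[:10])
```

Either way the number is far too large to enumerate, which is why clustering algorithms rely on heuristics rather than exhaustive search.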


Measure of similarity

What is similar? Application dependent.

Euclidean distance

[Figure: two points in the plane spanned by E1 and E2, separated by the distance d]
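As a concrete instance of one similarity measure, here is a minimal sketch of the Euclidean distance between two expression profiles. It uses the first ten values of G1 and G2 from the numerical data table; the function name `euclidean` is illustrative, and other measures may suit a given application better.

```python
import math

def euclidean(profile_a, profile_b):
    """Euclidean distance between two equally long expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))

# First ten values (E1..E10) of G1 and G2 from the expression table.
g1 = [-0.47, -3.32, -0.81, 0.11, -0.60, -1.36, -1.03, -1.84, -1.00, -0.60]
g2 = [0.66, 0.07, 0.20, 0.29, -0.89, -0.45, -0.29, -0.29, -0.15, -0.45]

print(round(euclidean(g1, g2), 2))
```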


Hierarchical clustering

INPUT: n genes/experiments

• Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d

• Repeat
  – identify the two most similar clusters in d (i.e. smallest number in d)
  – merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)

OUTPUT: A tree of merged genes/experiments (called a dendrogram)
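The steps above can be sketched in a few lines. This is a naive illustration, not an optimized implementation: it assumes a full symmetric distance matrix as input and uses the single-linkage rule (the smaller of the two old distances) when updating the matrix. The name `agglomerate` and the toy data are made up for the example.

```python
def agglomerate(labels, matrix):
    """Naive agglomerative clustering with a single-linkage matrix update.

    labels: item names; matrix[i][j]: distance between labels[i] and labels[j].
    Returns the merges as (cluster_a, cluster_b, distance) in merge order.
    """
    # Each item starts as its own cluster; clusters are tuples of labels.
    clusters = [(label,) for label in labels]
    dist = {a: {b: matrix[i][j] for j, b in enumerate(clusters) if i != j}
            for i, a in enumerate(clusters)}
    merges = []
    while len(dist) > 1:
        # Identify the two most similar clusters (smallest entry in the matrix).
        a, b = min(((x, y) for x in dist for y in dist[x]),
                   key=lambda pair: dist[pair[0]][pair[1]])
        merged = a + b
        merges.append((a, b, dist[a][b]))
        # Distances from the merged cluster to all remaining clusters.
        row = {c: min(dist[a][c], dist[b][c]) for c in dist if c not in (a, b)}
        # Substitute the two old clusters with the new one.
        for c in (a, b):
            del dist[c]
        for c in dist:
            del dist[c][a], dist[c][b]
            dist[c][merged] = row[c]
        dist[merged] = row
    return merges

# Toy data: four items placed on a line.
labels = ["A", "B", "C", "D"]
positions = [0, 1, 5, 7]
matrix = [[abs(p - q) for q in positions] for p in positions]
for a, b, distance in agglomerate(labels, matrix):
    print(a, b, distance)
```

The returned list of merges, read from first to last, is the dendrogram from the leaves to the root.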


Hierarchical clustering (cont’d)

Popular inter-cluster similarity measures:

(a) single linkage (smallest pairwise distance), (b) complete linkage (largest pairwise distance) and (c) average linkage (mean pairwise distance)
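The three measures differ only in how the pairwise distances between the members of two clusters are summarized. A minimal sketch, with an illustrative function name and a one-dimensional toy distance:

```python
from statistics import mean

def inter_cluster_distance(cluster_a, cluster_b, d, linkage="single"):
    """Distance between two clusters, computed from all pairwise item distances."""
    pairwise = [d(a, b) for a in cluster_a for b in cluster_b]
    if linkage == "single":      # smallest pairwise distance
        return min(pairwise)
    if linkage == "complete":    # largest pairwise distance
        return max(pairwise)
    if linkage == "average":     # mean of all pairwise distances
        return mean(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

def absolute_difference(a, b):
    """Toy item distance for points on a line."""
    return abs(a - b)

left, right = [0, 1], [5, 7]
for linkage in ("single", "complete", "average"):
    print(linkage, inter_cluster_distance(left, right, absolute_difference, linkage))
```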


Hierarchical clustering (cont’d)

Single linkage Average linkage

Exactly the same data!



Example of hierarchical clustering: languages of Europe

Distance: the number of number words that begin with a different letter in the two languages, e.g.

d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1

Inter-cluster strategy: SINGLE LINKAGE
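The three example distances can be reproduced by counting first-letter mismatches. The sketch below rests on assumptions that are not in the slides: that the words counted are the numbers one to ten, and that E, N, Du, Sp and I stand for English, Norwegian, Dutch, Spanish and Italian. The word lists are supplied here for illustration only.

```python
# Number words one to ten. These lists are an assumption made for this
# example; the slides give only the resulting distances.
NUMBER_WORDS = {
    "E": "one two three four five six seven eight nine ten",
    "N": "en to tre fire fem seks sju åtte ni ti",
    "Du": "een twee drie vier vijf zes zeven acht negen tien",
    "Sp": "uno dos tres cuatro cinco seis siete ocho nueve diez",
    "I": "uno due tre quattro cinque sei sette otto nove dieci",
}

def first_letter_distance(lang_a, lang_b):
    """Count the number words whose first letters differ between two languages."""
    words_a = NUMBER_WORDS[lang_a].split()
    words_b = NUMBER_WORDS[lang_b].split()
    return sum(a[0] != b[0] for a, b in zip(words_a, words_b))

for pair in (("E", "N"), ("E", "Du"), ("Sp", "I")):
    print(pair, first_letter_distance(*pair))
```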


Iteration 1

      E   N  Da  Du   G  Fr  Sp   I   P   H  Fi
E     0
N     2   0
Da    2   1   0
Du    7   5   6   0
G     6   4   5   5   0
Fr    6   6   6   9   7   0
Sp    6   6   5   9   7   2   0
I     6   6   5   9   7   1   1   0
P     7   7   6  10   8   5   3   4   0
H     9   8   8   8   9  10  10  10  10   0
Fi    9   9   9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr]


Iteration 2

Columns: (I Fr), E, N, Da, Du, G, Sp, P, H, Fi

(I Fr)   0
E        6   0
N        6   2   0
Da       5   2   1   0
Du       9   7   5   6   0
G        7   6   4   5   5   0
Sp       1   6   6   5   9   7   0
P        4   7   7   6  10   8   3   0
H       10   9   8   8   8   9  10  10   0
Fi       9   9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N]


Iteration 3

Columns: (Da N), (I Fr), E, Du, G, Sp, P, H, Fi

(Da N)   0
(I Fr)   5   0
E        2   6   0
Du       5   9   7   0
G        4   7   6   5   0
Sp       5   1   6   9   7   0
P        6   4   7  10   8   3   0
H        8  10   9   8   9  10  10   0
Fi       9   9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N, Sp]


Iteration 4

Columns: (Sp I Fr), (Da N), E, Du, G, P, H, Fi

(Sp I Fr)   0
(Da N)      5   0
E           6   2   0
Du          9   5   7   0
G           7   4   6   5   0
P           3   6   7  10   8   0
H          10   8   9   8   9  10   0
Fi          9   9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N, Sp, E]


Iteration 5

Columns: (E Da N), (Sp I Fr), Du, G, P, H, Fi

(E Da N)    0
(Sp I Fr)   5   0
Du          5   9   0
G           4   7   5   0
P           6   3  10   8   0
H           8  10   8   9  10   0
Fi          9   9   9   9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N, Sp, E, P]


Iteration 6

Columns: (P Sp I Fr), (E Da N), Du, G, H, Fi

(P Sp I Fr)   0
(E Da N)      5   0
Du            9   5   0
G             7   4   5   0
H            10   8   8   9   0
Fi            9   9   9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N, Sp, E, P, G]


Iteration 7

Columns: (G E Da N), (P Sp I Fr), Du, H, Fi

(G E Da N)    0
(P Sp I Fr)   5   0
Du            5   9   0
H             8  10   8   0
Fi            9   9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N, Sp, E, P, G, Du]


Iteration 8

Columns: (Du G E Da N), (P Sp I Fr), H, Fi

(Du G E Da N)   0
(P Sp I Fr)     5   0
H               8  10   0
Fi              9   9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N, Sp, E, P, G, Du]


Iteration 9

Columns: (P Sp I Fr Du G E Da N), H, Fi

(P Sp I Fr Du G E Da N)   0
H                         8   0
Fi                        9   8   0

[Dendrogram, height axis 1 to 8; leaves so far: I, Fr, Da, N, Sp, E, P, G, Du, H]


Iteration 10

Columns: (Fi H), (P Sp I Fr Du G E Da N)

(Fi H)                    0
(P Sp I Fr Du G E Da N)   8   0

[Dendrogram, height axis 1 to 8; leaves: I, Fr, Da, N, Sp, E, P, G, Du, H, Fi]
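The ten iterations can be replayed from the Iteration 1 matrix. The sketch below is an illustrative re-implementation, not the code used for the slides. It recomputes each inter-cluster distance from the original item distances instead of updating the matrix, and it breaks ties by position in the label list. Clusters at equal distance may therefore merge in a different order than on the slides, although the sequence of merge heights under single linkage is the same.

```python
LABELS = ["E", "N", "Da", "Du", "G", "Fr", "Sp", "I", "P", "H", "Fi"]

# Lower triangle of the Iteration 1 matrix (diagonal omitted).
LOWER = [
    [],
    [2],
    [2, 1],
    [7, 5, 6],
    [6, 4, 5, 5],
    [6, 6, 6, 9, 7],
    [6, 6, 5, 9, 7, 2],
    [6, 6, 5, 9, 7, 1, 1],
    [7, 7, 6, 10, 8, 5, 3, 4],
    [9, 8, 8, 8, 9, 10, 10, 10, 10],
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8],
]

def item_distance(i, j):
    """Distance between items i and j (i != j), read from the lower triangle."""
    return LOWER[max(i, j)][min(i, j)]

def cluster(linkage=min):
    """Agglomerative clustering; linkage=min is single, linkage=max is complete."""
    clusters = [[i] for i in range(len(LABELS))]
    merges = []
    while len(clusters) > 1:
        candidates = [
            (linkage(item_distance(i, j) for i in clusters[a] for j in clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        ]
        height, a, b = min(candidates)  # ties resolved by cluster position
        merges.append((height, [LABELS[i] for i in clusters[a] + clusters[b]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

for height, members in cluster():
    print(height, " ".join(members))
```

Passing `linkage=max` runs the same procedure with complete linkage, which makes it easy to see how the choice of inter-cluster strategy changes the tree.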


Any data mining result needs to be consistent BOTH with the data and current knowledge!


Evaluation of clusters

Clusters may be evaluated according to how well they describe current knowledge

[Dendrogram, height axis 1 to 8; leaves: I, Fr, Da, N, Sp, E, P, G, Du, H, Fi. Legend: Romance, Slavic, Germanic, Ugro-Finnish]


Hierarchical clustering: properties

• Huge memory requirements: stores the n × n matrix

• Running time: $O(n^3)$

• Deterministic: produces the same clustering each time

• Nice visualization: dendrogram

• Number of clusters can be selected using the dendrogram

• Different interpretations depending on distance method used


K-means clustering

• Split the data into k random clusters

• Repeat:
  – calculate the centroid of each cluster
  – (re-)assign each gene/experiment to the closest centroid
  – stop if no new assignments are made
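The loop above can be sketched as follows. This is a minimal illustration under stated assumptions: squared Euclidean distance, a balanced random initial partition, and an empty cluster being re-seeded with a randomly chosen data point (the slides do not say how to handle that case). All names and the sample points are made up for the example.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(members) for coords in zip(*members))

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Split the data into k random clusters (balanced, then shuffled).
    assignment = [i % k for i in range(len(points))]
    rng.shuffle(assignment)
    for _ in range(max_iter):
        # Calculate the centroid of each cluster.
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(centroid(members) if members else rng.choice(points))
        # (Re-)assign each point to the closest centroid.
        new_assignment = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:  # no new assignments: stop
            break
        assignment = new_assignment
    return assignment, centroids

points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
print(centroids)
```

Changing `seed` changes the initial partition and may change the final clusters.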


Example of K-means: two dimensions

Initial clusters, K=2


Iteration 1

Calculate centroids


Iteration 1

(Re-)assign


Iteration 2

Calculate centroids


Iteration 2

(Re-)assign


Iteration 3

Calculate centroids


Iteration 3

(Re-)assign

No new assignments! STOP


K-means: properties

• Low memory usage

• Running time: O(n) per iteration for a fixed number of clusters

• Improves iteratively: not trapped in previous mistakes

• Non-deterministic: will in general produce different clusters with different initializations

• Number of clusters must be decided in advance
  – Algorithms that “grow” the number of clusters if inter-cluster variance is too high (Growing k-means, 2002)
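Because the result depends on the initialization, a common remedy is to run k-means several times and keep the best run. The sketch below assumes "best" means the lowest within-cluster sum of squared distances, which the slides do not specify. The compact `kmeans` helper, the `best_of` wrapper and the toy data are illustrative.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, max_iter=100):
    """One k-means run from a random initial partition; returns (cost, labels)."""
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    for _ in range(max_iter):
        centroids = []
        for c in range(k):
            # Assumption: an empty cluster is re-seeded with a random point.
            members = [p for p, l in zip(points, labels) if l == c] or [rng.choice(points)]
            centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    # Within-cluster sum of squares: lower means tighter clusters.
    cost = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return cost, labels

def best_of(points, k, restarts=20, seed=0):
    """Run k-means `restarts` times and keep the run with the lowest cost."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(restarts)),
               key=lambda run: run[0])

# Two well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
cost, labels = best_of(points, k=2)
print(cost, labels)
```

Even on data this simple, some initial partitions converge to a poor local optimum, so a single run is not guaranteed to separate the two groups.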


Hierarchical vs. k-means

• Hierarchical clustering:
  – computationally expensive -> relatively small data sets
  – nice visualization, no. of clusters can be selected
  – deterministic
  – cannot correct early “mistakes”

• K-means:
  – computationally efficient -> large data sets
  – predefined no. of clusters
  – non-deterministic -> should be run several times
  – iterative improvement

• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k=2 -> best of both worlds!



Supervised learning

• Uses examples of known classes to learn a model

• Examples are, for instance, expression profiles of genes with known classes (clinical state or function)

• The model can be e.g.
  – hyperplanes separating classes in n dimensions (SVM)
  – artificial neural networks
  – decision trees (Random Forest, C4.5)
  – IF-THEN rules (Rough Sets)

• Can be used for e.g.
  – diagnostics
  – predicting gene function for unknown genes


Support Vector Machines

Maximum margin separating “hyperplane”

Support vectors

Soft margin

Decision boundary


Artificial Neural Networks (ANN)

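[Figure: network with an input layer (x1, x2, x3, x4) and an output layer (A, B, C)]

A single unit with inputs x1 … xn and weights w1 … wn computes

$$
f(x) =
\begin{cases}
1 & \text{if } \sum_{i=1}^{n} w_i x_i > 0 \\
0 & \text{otherwise}
\end{cases}
$$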
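A single threshold unit of this kind can be written in a few lines. In this sketch the threshold value of 0 is an assumption, since the slide's formula is only partly legible, so it is exposed as a parameter. The weights and the extra constant input that acts as a bias are made up for the example.

```python
def threshold_unit(x, w, threshold=0.0):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0.

    The default threshold of 0 is an assumption, not taken from the slide.
    """
    if len(x) != len(w):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if weighted_sum > threshold else 0

# Illustrative weights: the last input is fixed to 1 and serves as a bias,
# so the unit fires only when both of the first two inputs are 1.
weights = [1.0, 1.0, -1.5]
for a in (0, 1):
    for b in (0, 1):
        print(a, b, threshold_unit([a, b, 1], weights))
```

A network combines many such units, with the outputs of one layer feeding the inputs of the next.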


Decision tree learning

Class knowledge:
Group 1: Nordic countries
Group 2: UK, France, Greece, Spain, Portugal
Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

Christian Democrats > 16?
  Yes: Group 3
  No: Agrarians > 4?
    Yes: Group 1
    No: Group 2
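Once learned, the tree above is just a pair of nested tests. A minimal sketch, assuming the two attributes are numeric scores per country (presumably a measure of party support, which the slide does not spell out); the function and parameter names are illustrative.

```python
def classify(christian_democrats, agrarians):
    """Walk the decision tree from the root and return the group number."""
    if christian_democrats > 16:
        return 3
    if agrarians > 4:
        return 1
    return 2

# Hypothetical attribute values, chosen only to exercise each branch.
print(classify(christian_democrats=20, agrarians=0))   # Christian Democrats > 16
print(classify(christian_democrats=5, agrarians=10))   # Agrarians > 4
print(classify(christian_democrats=5, agrarians=1))    # neither test passes
```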


Rule learning: Rough sets

Class knowledge:
Group 1: Nordic countries
Group 2: UK, France, Greece, Spain, Portugal
Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

Agrarians([4, *)) AND Christian Democrats([*, 16)) => Class(1)
Agrarians([*, 4)) AND Christian Democrats([*, 16)) => Class(2)
Christian Democrats([16, *)) => Class(3)
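The same three rules can be stored as data and applied by a small matcher. This sketch assumes that `[a, b)` denotes a half-open interval and that `*` means unbounded; the dictionary keys and the function name are illustrative.

```python
import math

# Each rule: ({attribute: (low, high)}, class), with half-open intervals [low, high).
RULES = [
    ({"agrarians": (4, math.inf), "christian_democrats": (-math.inf, 16)}, 1),
    ({"agrarians": (-math.inf, 4), "christian_democrats": (-math.inf, 16)}, 2),
    ({"christian_democrats": (16, math.inf)}, 3),
]

def apply_rules(sample):
    """Return the class of the first rule whose conditions all hold, else None."""
    for conditions, label in RULES:
        if all(low <= sample[attr] < high
               for attr, (low, high) in conditions.items()):
            return label
    return None

# Hypothetical samples, chosen to sit on the interval boundaries.
print(apply_rules({"agrarians": 4, "christian_democrats": 10}))
print(apply_rules({"agrarians": 3.9, "christian_democrats": 15.9}))
print(apply_rules({"agrarians": 0, "christian_democrats": 16}))
```

Keeping the rules as data rather than code makes them easy to inspect, which is the main appeal of a descriptive model.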


Supervised vs. clustering

Clustering
+ class discovery
+ robust towards incorrect knowledge

Supervised
+ evaluation
+ predictive/descriptive model
+ based on actual knowledge rather than idealized hypotheses