TRANSCRIPT
Clustering and machine learning for gene expression data
Stefan Enroth
Original slides by Torgeir R. Hvidsten, The Linnaeus Centre for Bioinformatics
S. Enroth, 2009.02.18
Machine learning: to learn general concepts from examples
[Diagram: the real world becomes data in a feature space through data collection, and knowledge (classes) through abstraction; machine learning maps data to knowledge. The assumed functional relationship is only partially described by the examples.]
Gene Ontology
A controlled vocabulary, organized in a taxonomy, for describing the molecular role of gene products:
• Molecular function: the tasks performed by individual gene products
• Biological process: broad biological goals that are accomplished by ordered assemblies of molecular functions
• Cellular component: subcellular structures, locations, and macromolecular complexes
Protein structure classification (CATH)
Microarray
Numerical data
Gene/Expr    E1     E2     E3     E4     E5     E6     E7     E8     E9    E10   …     EM
G1        -0.47  -3.32  -0.81   0.11  -0.60  -1.36  -1.03  -1.84  -1.00  -0.60  …  -0.94
G2         0.66   0.07   0.20   0.29  -0.89  -0.45  -0.29  -0.29  -0.15  -0.45  …  -0.42
G3         0.14  -0.04   0.00  -0.15  -0.58  -0.30  -0.18  -0.38  -0.49  -0.81  …  -1.12
G4        -0.04   0.00  -0.23  -0.25  -0.47  -0.60  -0.56  -1.09  -0.71  -0.76  …  -0.62
G5         0.28   0.37   0.11  -0.17  -0.18  -0.60  -0.23  -0.58  -0.79  -0.29  …  -0.74
G6         0.54   0.53   0.16   0.14   0.20  -0.34  -0.38  -0.36  -0.49  -0.58  …  -1.47
G7         0.20   0.14   0.00   0.11  -0.34  -0.03   0.04  -0.76  -0.81  -1.12  …  -1.36
G8         0.40   0.43   0.18   0.00  -0.14   0.29   0.07  -0.79  -0.81  -0.92  …  -1.22
G9         0.01   0.46   0.28  -0.34  -0.23  -0.36  -0.45  -0.64  -0.79  -1.22  …  -1.09
…           …      …      …      …      …      …      …      …      …      …    …     …
GN        -0.23   0.04   0.00  -0.30  -0.29  -0.45  -0.97  -2.06  -0.89  -1.22  …  -0.97

-0.04 = log(2.3/2.4) = log("Red"/"Green")
M (experiments) < 100
N (genes) ≈ 10k–100k
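Each entry of the table is a log ratio of the two channel intensities. A short check of the slide's example (the slide does not state the log base; the natural log reproduces the −0.04 shown, while log2 is the other common choice in practice):

```python
import math

# Expression values are log ratios of the red and green channel
# intensities.  A natural log reproduces the -0.04 shown on the slide
# for intensities 2.3 and 2.4; log2 is also commonly used.
red, green = 2.3, 2.4
m_natural = math.log(red / green)
m_log2 = math.log2(red / green)
print(round(m_natural, 2))  # -0.04
print(round(m_log2, 2))     # -0.06
```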
Next Generation RNA-Sequencing
Nature Reviews Genetics 10, 57-63 (January 2009)
Numerical data
• Ideally, counts of the actual number of transcripts in the cell
• Also gives information on isoforms, splice variants, etc.
• Ongoing research!
Wang & Sandberg et al, Nature 456, 470-476 (27 November 2008)
Data analysis goals: what to study?
• Classes of experiments: changes in expression levels in tissue samples with, e.g., different diseases, treatments, or environmental effects
• Classes of genes: expression profiles of genes with similar biological function
• Both of the above
Data analysis methods
• Unsupervised learning (clustering, class discovery): used to "discover" natural groups of genes/experiments, e.g.
  – discover subclasses of a form of cancer that is clinically homogeneous
• Supervised learning: used to "learn" a model of a set of predefined classes of genes/experiments, e.g.
  – diagnosis of cancer/subclasses of cancer
Clustering analysis
Need to define:
• a measure of similarity
• an algorithm for using the measure of similarity to discover natural groups in the data

The number of ways to divide n items into k clusters: k^n / k!
Example: 10^500 / 10! ≈ 2.756 × 10^493
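The size of that number can be checked directly, since Python integers have arbitrary precision (k^n / k! is the slide's approximation; the exact count is a Stirling number of the second kind):

```python
from math import factorial

# Approximate number of ways to divide n = 500 items into k = 10
# clusters: k**n / k!.  Python's arbitrary-precision integers make the
# huge value exact.
n, k = 500, 10
approx = k**n // factorial(k)
digits = str(approx)
print(len(digits))  # 494 digits, i.e. about 2.756e493
print(digits[:4])   # 2755
```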
Measure of similarity
[Figure: two expression profiles E1 and E2 separated by a distance d]
What is similar? Euclidean distance? Application dependent.
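As a concrete instance of the Euclidean option, a minimal sketch of the distance between two expression profiles (vectors of log ratios across experiments); the profile values are illustrative:

```python
import math

# Euclidean distance between two expression profiles.  Other measures,
# e.g. Pearson correlation, may be preferred; the choice is
# application dependent.
def euclidean(e1, e2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e1, e2)))

g1 = [-0.47, -3.32, -0.81, 0.11]
g2 = [0.66, 0.07, 0.20, 0.29]
d = euclidean(g1, g2)
```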
Hierarchical clustering
• INPUT: n genes/experiments
• Consider each gene/experiment as an individual cluster and initiate an n × n distance matrix d
• Repeat:
  – identify the two most similar clusters in d (i.e. the smallest number in d)
  – merge the two most similar clusters and update the matrix (i.e. substitute the two clusters with the new cluster)
• OUTPUT: a tree of merged genes/experiments (called a dendrogram)
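The loop above can be sketched in a few lines; this is a minimal illustration using single linkage as the inter-cluster measure, not a reference implementation (names and the tiny distance dictionary are made up for the example):

```python
# Agglomerative clustering sketch: every item starts as its own cluster;
# repeatedly merge the closest pair (single linkage = smallest pairwise
# distance) and record the merges.
def single_linkage(items, dist):
    clusters = [frozenset([i]) for i in items]
    merges = []                        # (cluster_a, cluster_b, distance)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Tiny illustrative distance matrix: a-b = 1, b-c = 2, a-c = 4
dd = {'a': {'b': 1, 'c': 4}, 'b': {'a': 1, 'c': 2}, 'c': {'a': 4, 'b': 2}}
merges = single_linkage(['a', 'b', 'c'], dd)
# first merge joins 'a' and 'b' at distance 1; then 'c' joins at 2
```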
Hierarchical clustering (cont'd)
Popular inter-cluster similarity measures:
(a) single linkage (smallest pairwise distance), (b) complete linkage (largest), and (c) average linkage
Hierarchical clustering (cont’d)
[Figure: single-linkage vs. average-linkage dendrograms on exactly the same data]
Example of hierarchical clustering: languages of Europe
Distance: the number of numerals with a different first letter in two languages, e.g. d(E, N) = 2, d(E, Du) = 7, d(Sp, I) = 1
(E = English, N = Norwegian, Da = Danish, Du = Dutch, G = German, Fr = French, Sp = Spanish, I = Italian, P = Portuguese, H = Hungarian, Fi = Finnish)
Inter-cluster strategy: single linkage
Iteration 1 (initial 11 × 11 distance matrix; columns in row order):

  E     0
  N     2  0
  Da    2  1  0
  Du    7  5  6  0
  G     6  4  5  5  0
  Fr    6  6  6  9  7  0
  Sp    6  6  5  9  7  2  0
  I     6  6  5  9  7  1  1  0
  P     7  7  6 10  8  5  3  4  0
  H     9  8  8  8  9 10 10 10 10  0
  Fi    9  9  9  9  9  9  9  9  9  8  0

→ merge Fr and I (distance 1)
Iteration 2 (clusters: I Fr, E, N, Da, Du, G, Sp, P, H, Fi; columns in row order):

  I Fr   0
  E      6  0
  N      6  2  0
  Da     5  2  1  0
  Du     9  7  5  6  0
  G      7  6  4  5  5  0
  Sp     1  6  6  5  9  7  0
  P      4  7  7  6 10  8  3  0
  H     10  9  8  8  8  9 10 10  0
  Fi     9  9  9  9  9  9  9  9  8  0

→ merge Da and N (distance 1)
Iteration 3 (clusters: Da N, I Fr, E, Du, G, Sp, P, H, Fi; columns in row order):

  Da N   0
  I Fr   5  0
  E      2  6  0
  Du     5  9  7  0
  G      4  7  6  5  0
  Sp     5  1  6  9  7  0
  P      6  4  7 10  8  3  0
  H      8 10  9  8  9 10 10  0
  Fi     9  9  9  9  9  9  9  8  0

→ merge Sp with I Fr (distance 1)
Iteration 4 (clusters: Sp I Fr, Da N, E, Du, G, P, H, Fi; columns in row order):

  Sp I Fr   0
  Da N      5  0
  E         6  2  0
  Du        9  5  7  0
  G         7  4  6  5  0
  P         3  6  7 10  8  0
  H        10  8  9  8  9 10  0
  Fi        9  9  9  9  9  9  8  0

→ merge E with Da N (distance 2)
Iteration 5 (clusters: E Da N, Sp I Fr, Du, G, P, H, Fi; columns in row order):

  E Da N    0
  Sp I Fr   5  0
  Du        5  9  0
  G         4  7  5  0
  P         6  3 10  8  0
  H         8 10  8  9 10  0
  Fi        9  9  9  9  9  8  0

→ merge P with Sp I Fr (distance 3)
Iteration 6 (clusters: P Sp I Fr, E Da N, Du, G, H, Fi; columns in row order):

  P Sp I Fr   0
  E Da N      5  0
  Du          9  5  0
  G           7  4  5  0
  H          10  8  8  9  0
  Fi          9  9  9  9  8  0

→ merge G with E Da N (distance 4)
Iteration 7 (clusters: G E Da N, P Sp I Fr, Du, H, Fi; columns in row order):

  G E Da N    0
  P Sp I Fr   5  0
  Du          5  9  0
  H           8 10  8  0
  Fi          9  9  9  8  0

→ merge Du with G E Da N (distance 5)
Iteration 8 (clusters: Du G E Da N, P Sp I Fr, H, Fi; columns in row order):

  Du G E Da N   0
  P Sp I Fr     5  0
  H             8 10  0
  Fi            9  9  8  0

→ merge P Sp I Fr with Du G E Da N (distance 5)
Iteration 9 (clusters: P Sp I Fr Du G E Da N, H, Fi; columns in row order):

  P Sp I Fr Du G E Da N   0
  H                       8  0
  Fi                      9  8  0

→ merge Fi and H (distance 8)
Iteration 10 (clusters: Fi H, P Sp I Fr Du G E Da N; columns in row order):

  Fi H                    0
  P Sp I Fr Du G E Da N   8  0

→ final merge (distance 8): all eleven languages joined
[Dendrogram: all eleven languages; distance axis 1–8]
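The ten merge distances of the worked example can be verified without tracing every iteration: for single linkage, the dendrogram merge heights are exactly the sorted edge weights of a minimum spanning tree over the distance matrix. A sketch using Prim's algorithm (matrix copied from iteration 1):

```python
# Single-linkage merge heights = sorted MST edge weights over the
# language distance matrix (order: E N Da Du G Fr Sp I P H Fi).
D = [
    [0, 2, 2, 7, 6, 6, 6, 6, 7, 9, 9],     # E
    [2, 0, 1, 5, 4, 6, 6, 6, 7, 8, 9],     # N
    [2, 1, 0, 6, 5, 6, 5, 5, 6, 8, 9],     # Da
    [7, 5, 6, 0, 5, 9, 9, 9, 10, 8, 9],    # Du
    [6, 4, 5, 5, 0, 7, 7, 7, 8, 9, 9],     # G
    [6, 6, 6, 9, 7, 0, 2, 1, 5, 10, 9],    # Fr
    [6, 6, 5, 9, 7, 2, 0, 1, 3, 10, 9],    # Sp
    [6, 6, 5, 9, 7, 1, 1, 0, 4, 10, 9],    # I
    [7, 7, 6, 10, 8, 5, 3, 4, 0, 10, 9],   # P
    [9, 8, 8, 8, 9, 10, 10, 10, 10, 0, 8], # H
    [9, 9, 9, 9, 9, 9, 9, 9, 9, 8, 0],     # Fi
]

def mst_edge_weights(dist):
    """Prim's algorithm: grow a tree, always adding the cheapest edge."""
    n = len(dist)
    in_tree = {0}
    weights = []
    while len(in_tree) < n:
        w, v = min((dist[i][j], j) for i in in_tree
                   for j in range(n) if j not in in_tree)
        weights.append(w)
        in_tree.add(v)
    return sorted(weights)

print(mst_edge_weights(D))  # [1, 1, 1, 2, 3, 4, 5, 5, 8, 8]
```

These match the merge distances seen across iterations 1–10.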
Any data mining result needs to be consistent BOTH with the data and current knowledge!
Evaluation of clusters
[Dendrogram of the language example; distance axis 1–8]
Clusters may be evaluated according to how well they describe current knowledge, e.g. the known language families: Roman, Slavic, Germanic, Ugro-Finnish
Hierarchical clustering: properties
• Huge memory requirement: stores the n × n matrix
• Running time: O(n³)
• Deterministic: produces the same clustering each time
• Nice visualization: the dendrogram
• The number of clusters can be selected using the dendrogram
• Different interpretations depending on the distance method used
K-means clustering
• Split the data into k random clusters
• Repeat:
  – calculate the centroid of each cluster
  – (re-)assign each gene/experiment to the closest centroid
  – stop if no new assignments are made
Example of K-means in two dimensions (K = 2):
[Figure sequence: initial random clusters; iteration 1: calculate centroids, then (re-)assign; iteration 2: calculate centroids, then (re-)assign; iteration 3: calculate centroids, then (re-)assign — no new assignments, so stop]
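The two-dimensional example can be sketched directly. For reproducibility this sketch initializes the centroids to the first k points rather than a random split; the data points are made up for the illustration:

```python
# Minimal 2-D K-means sketch: alternate between calculating centroids
# and re-assigning points until no assignment changes.
def kmeans(points, k):
    centroids = [points[i] for i in range(k)]   # deterministic start
    assign = None
    while True:
        # (re-)assign each point to the closest centroid
        def closest(p):
            return min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[c])))
        new_assign = [closest(p) for p in points]
        if new_assign == assign:                # no new assignments: stop
            return assign, centroids
        assign = new_assign
        # calculate the centroid (mean) of each cluster
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                new_centroids.append(tuple(sum(x) / len(members)
                                           for x in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep empty cluster put
        centroids = new_centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(pts, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```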
K-means: properties
• Low memory usage
• Running time: O(n)
• Improves iteratively: not trapped by previous mistakes
• Non-deterministic: will in general produce different clusters with different initializations
• The number of clusters must be decided in advance
  – algorithms exist that "grow" the number of clusters if the intra-cluster variance is too high (Growing k-means, 2002)
Hierarchical vs. k-means
• Hierarchical clustering:
  – computationally expensive -> relatively small data sets
  – nice visualization; the number of clusters can be selected
  – deterministic
  – cannot correct early "mistakes"
• K-means:
  – computationally efficient -> large data sets
  – predefined number of clusters
  – non-deterministic -> should be run several times
  – iterative improvement
• Hierarchical k-means: top-down hierarchical clustering using k-means iteratively with k = 2 -> the best of both worlds!
Supervised learning
• Uses examples of known classes to learn a model
• Examples are, for instance, expression profiles of genes with known classes (clinical state or function)
• The model can be, e.g.:
  – hyperplanes separating classes in n dimensions (SVM)
  – artificial neural networks
  – decision trees (Random Forest, C4.5)
  – IF-THEN rules (Rough Sets)
• Can be used for, e.g.:
  – diagnostics
  – predicting gene function for unknown genes
Support Vector Machines
[Figure: maximum-margin separating "hyperplane", with support vectors, a soft margin, and the decision boundary]
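For a linear decision function f(x) = sign(w·x + b), the margin an SVM maximizes is 2/‖w‖. A small illustrative computation (the weights below are made-up values, not a trained model):

```python
import math

# Linear decision function and the margin 2/||w|| that a maximum-margin
# SVM optimizes.  Weights are illustrative only.
def decide(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

def margin(w):
    return 2 / math.sqrt(sum(wi * wi for wi in w))

w, b = (3.0, 4.0), -1.0
print(decide(w, b, (1.0, 1.0)))  # 1   (3 + 4 - 1 = 6 > 0)
print(margin(w))                 # 0.4 (2 / 5)
```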
Artificial Neural Networks (ANN)
[Figure: input layer x1 … x4 feeding an output layer that produces f(x)]
A single unit with inputs x1 … xn and weights w1 … wn computes:

  f(x) = 1 if Σ_{i=1}^{n} w_i x_i − w_0 > 0, and 0 otherwise
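The threshold unit on this slide can be sketched directly in code; the weights below are hand-picked to implement logical AND, purely as an illustration:

```python
# A single perceptron unit: output 1 when the weighted sum of the
# inputs exceeds the threshold w0, and 0 otherwise.
def perceptron(x, w, w0):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - w0 > 0 else 0

# AND of two binary inputs: fires only when both are 1
w, w0 = [1.0, 1.0], 1.5
outputs = [perceptron([a, b], w, w0) for a in (0, 1) for b in (0, 1)]
print(outputs)  # [0, 0, 0, 1]
```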
Decision tree learning
Class knowledge:
  Group 1: Nordic countries
  Group 2: UK, France, Greece, Spain, Portugal
  Group 3: Benelux countries, Switzerland, Austria, Italy, Germany

  Christian Democrats > 16?
    Yes -> Group 3
    No  -> Agrarians > 4?
             Yes -> Group 1
             No  -> Group 2

Rule learning: Rough sets (same class knowledge)
  Agrarians([4, *)) AND Christian Democrats([*, 16)) => Class(1)
  Agrarians([*, 4)) AND Christian Democrats([*, 16)) => Class(2)
  Christian Democrats([16, *)) => Class(3)
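The decision tree and the rough-set rules encode the same classification; a direct transcription as code (following the rules' interval notation, where [16, *) means ≥ 16):

```python
# Classify a country's party profile into the three groups using the
# rough-set rules' intervals.
def classify(agrarians, christian_democrats):
    if christian_democrats >= 16:  # Christian Democrats([16, *)) => Class(3)
        return 3
    if agrarians >= 4:             # Agrarians([4, *)) AND CD([*, 16)) => Class(1)
        return 1
    return 2                       # Agrarians([*, 4)) AND CD([*, 16)) => Class(2)

print(classify(5, 10))   # 1
print(classify(0, 10))   # 2
print(classify(0, 20))   # 3
```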
Supervised vs. clustering
Clustering:
  + class discovery
  + robust towards incorrect knowledge
Supervised:
  + evaluation
  + predictive/descriptive model
  + based on actual knowledge rather than idealized hypotheses