Clustering Jarno Tuimala

Post on 18-Dec-2015

Page 1: Clustering Jarno Tuimala. Clustering Aim –Grouping objects (genes or chips) into clusters so that the objects inside one cluster are more closely related

Clustering

Jarno Tuimala

Page 2

Clustering

• Aim
  – Grouping objects (genes or chips) into clusters so that the objects inside one cluster are more closely related to each other than to objects in other clusters

• Exploratory data analysis
  – View all data simultaneously
  – Identify clusters and patterns in data

• Uses:
  – Time series analysis
  – Visualization of known classes

Page 3

Unsupervised vs. Supervised

Page 4

Clustering methods

• Hierarchical clustering
  – Single, average (UPGMA) and complete linkage

• Non-hierarchical clustering
  – K-means

Page 5

Hierarchical clustering

• Two phases
  – Pick a distance method
    • Euclidean
    • Pearson / Spearman correlation
  – Pick the dendrogram drawing method
    • Single linkage
    • Average linkage
    • Complete linkage
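The two phases can be sketched in a few lines of plain Python. This is a naive illustration, not the implementation used by any particular microarray package: the function name `agglomerate`, the toy profiles and the choice of Euclidean distance for phase 1 are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def agglomerate(profiles, linkage="average"):
    """Naive agglomerative clustering; returns the merge history.

    profiles: dict mapping a name to a list of expression values.
    linkage:  "single", "average" (UPGMA) or "complete".
    """
    # Phase 1: pairwise distances between the original objects.
    names = list(profiles)
    d = {frozenset(p): dist(profiles[p[0]], profiles[p[1]])
         for p in combinations(names, 2)}

    def cluster_distance(a, b):
        # All distances between a member of cluster a and a member of b.
        pairs = [d[frozenset((x, y))] for x in a for y in b]
        if linkage == "single":
            return min(pairs)          # closest pair of members
        if linkage == "complete":
            return max(pairs)          # farthest pair of members
        return sum(pairs) / len(pairs)  # average linkage (UPGMA)

    # Phase 2: repeatedly merge the two closest clusters.
    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda ab: cluster_distance(*ab))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Hypothetical two-chip expression profiles for four genes.
toy = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 0.0], "g4": [5.0, 1.5]}
for left, right, height in agglomerate(toy, "average"):
    print(left, right, round(height, 3))
```

Each entry of the merge history corresponds to one join in the dendrogram, and the third value is the height at which the join is drawn. Only the height of joins between multi-member clusters depends on the linkage choice.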

Page 6

Distances

• Euclidean
  – Overall difference in magnitude between gene or chip expression profiles (square root of the sum of squared differences)
  – Similar values are clustered together

• Correlation
  – Difference in trends
  – Similar trends are clustered together
  – Typically: Pearson or Spearman correlation
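A small numeric example may make the contrast concrete. The profiles below are invented for illustration, and `pearson_distance` is a hypothetical helper that uses 1 minus the Pearson correlation as the distance, which is one common convention rather than something the slides specify.

```python
from math import dist, sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation: 0 for identical trends, 2 for opposite ones.

    Assumes neither profile is constant (otherwise the correlation is undefined).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


# Hypothetical expression profiles over four time points.
low = [1.0, 2.0, 3.0, 4.0]
high = [11.0, 12.0, 13.0, 14.0]   # same trend as `low`, shifted up by 10
flat = [2.5, 2.4, 2.6, 2.5]       # similar level to `low`, different trend

print("Euclidean:  ", dist(low, high), dist(low, flat))
print("Correlation:", pearson_distance(low, high), pearson_distance(low, flat))
```

With Euclidean distance `low` is closer to `flat` (similar values), whereas with correlation distance `low` is closer to `high` (similar trend).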

Page 7

Single, average, and complete linkage

Dendrogram drawing

Page 8

UPGMA example

Page 9

Hierarchical Clustering

[Figure: step-by-step construction of a gene tree, shown as successive panels up to "10 gene tree" and "10 gene tree non-binary". Leaf labels: X55123 Gata3, Kcnd2, Api6, Y18280 Dyrk1b, U16297 Cyb561, U39827 Gpcr25, Y13090 Casp12, Gria4, M33760 Fgfr1, L06443 Gdf3. Conditions: Time 0, Time 4 and Time 24 for strains chocolate_addict and normal.]

Silicon Genetics, 2003

Page 10

Heatmap

Page 11

K-means clustering

• Partitioning method
  – The dataset is divided into K clusters
  – The user needs to decide on K before the run

• K-means is a heuristic algorithm, so different runs can give dissimilar results
  – Make several runs, and select the one giving the minimum sum of within-cluster variances
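The "several runs, keep the best" advice can be sketched as follows. This is a minimal version of the standard Lloyd iteration, not the code of any specific tool; the names `kmeans` and `best_of`, the toy points, and the use of the within-cluster sum of squared distances (WCSS) as the selection criterion are assumptions made for the illustration.

```python
import random
from math import dist


def kmeans(points, k, rng, iterations=100):
    """One K-means run (Lloyd's algorithm); returns (wcss, labels)."""
    centers = rng.sample(points, k)          # random initial centres
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centre.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break                            # assignments stable: converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                      # keep old centre if cluster is empty
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    wcss = sum(dist(p, centers[l]) ** 2 for p, l in zip(points, labels))
    return wcss, labels


def best_of(points, k, runs=20, seed=0):
    """Repeat K-means from different random starts; keep the lowest WCSS."""
    rng = random.Random(seed)
    return min((kmeans(points, k, rng) for _ in range(runs)),
               key=lambda result: result[0])


# Hypothetical data: two well-separated groups of three points each.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
wcss, labels = best_of(points, k=2)
print(round(wcss, 3), labels)
```

Fixing the seed makes the example reproducible. In practice the number of runs is a trade-off between computation time and the chance of missing a better partition.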

Page 12

K-means Clustering

Silicon Genetics, 2003

Page 13

K-means Clustering

Silicon Genetics, 2003

Page 14

K-means Clustering

Silicon Genetics, 2003

Page 15

K-means Clustering

Silicon Genetics, 2003

Page 16

Visualization

Page 17

Gene selection

• Genes are usually filtered before clustering.
  – This decreases calculation time.

• Typically a few hundred genes with the highest variance (or standard deviation) are selected.

• If you have, e.g., two types of cancer, do not use a t-test for selecting genes. The genes would then be chosen precisely because they differ between the cancer types, so the clustering will always separate the cancer types, whether or not that grouping is the dominant structure in the data.
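A variance filter of this kind can be sketched with the standard library alone. The function name `top_variance_genes` and the small expression table are hypothetical; a real data set would have thousands of genes and the cutoff would be a few hundred.

```python
from statistics import variance


def top_variance_genes(expression, n):
    """Return the names of the n genes whose values vary most across chips."""
    ranked = sorted(expression, key=lambda g: variance(expression[g]),
                    reverse=True)
    return ranked[:n]


# Hypothetical expression table: gene name -> values on four chips.
expression = {
    "geneA": [5.0, 5.1, 4.9, 5.0],   # nearly constant
    "geneB": [1.0, 6.0, 2.0, 7.0],   # strongly varying
    "geneC": [3.0, 3.5, 2.5, 3.0],
    "geneD": [0.0, 4.0, 8.0, 4.0],   # strongly varying
}
selected = top_variance_genes(expression, 2)
print(selected)
```

The filter never looks at sample labels such as cancer type, so it does not build the known grouping into the gene list the way a t-test based selection would.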