Poggi analytics - clustering - 1
![Page 1: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/1.jpg)
Buenos Aires, June 2016
Eduardo Poggi
![Page 2: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/2.jpg)
Clustering
- Supervised vs. Unsupervised Learning
- Clustering Concepts
- Non-Hierarchical Clustering
  - K-means
  - EM-Algorithm
- Hierarchical Clustering
  - Hierarchical Agglomerative Clustering (HAC)
![Page 3: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/3.jpg)
Supervised vs. Unsupervised Learning
Supervised Learning
- Classification: partition examples into groups according to predefined categories
- Regression: assign a value to feature vectors
- Requires labeled data for training
Unsupervised Learning
- Clustering: partition examples into groups when no predefined categories/classes are available
- Novelty detection: find changes in data
- Outlier detection: find unusual events (e.g. hackers)
- Only instances required, but no labels
![Page 4: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/4.jpg)
Clustering Concepts
The basic goal of cluster analysis is to discover groups in the data such that objects in the same group are similar, while objects in different groups are as dissimilar as possible.
Partition unlabeled examples into disjoint subsets (clusters), such that:
- Examples within a cluster are similar
- Examples in different clusters are different
Discover new categories in an unsupervised manner (no sample category labels provided).
![Page 5: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/5.jpg)
Clustering Concepts (2)
Applications are very numerous: for example, the classification of plants and animals; in the social sciences, the classification of people according to their habits and preferences; in marketing, the identification of groups of consumers with similar needs; etc.
- Cluster retrieved documents (e.g. Teoma) to present more organized and understandable results to the user
- Detecting near duplicates
- Entity resolution, e.g. “Thorsten Joachims” == “Thorsten B Joachims”
- Cheating detection
- Exploratory data analysis
- Automated (or semi-automated) creation of taxonomies, e.g. Yahoo-style
![Page 6: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/6.jpg)
Clustering Concepts (3)
We will consider two types of clustering algorithms:
- Partitioning methods: classify the data into k groups that must satisfy the requirements of a partition:
  - Each group must contain at least one object.
  - Each object must belong to exactly one group.
- Hierarchical methods:
  - Agglomerative: start with n clusters of one observation each; at each step two groups are merged, until a single cluster with n observations remains.
  - Divisive: start with a single cluster of n observations; at each step one group is split in two, until there are n clusters with one observation each.
![Page 7: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/7.jpg)
K-Means Clustering Method
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. For each datapoint find out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
4. For each Center find the centroid of the points it owns
5. …and jumps there
6. …Repeat until terminated! (Are we sure it will terminate?)
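The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the author's implementation: the function names, the choice of k distinct datapoints as the random initial guess, the `max_iter` safety cap and the toy data are all assumptions made for the example.

```python
import math
import random


def closest_center(point, centers):
    """Index of the center nearest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: guess k centers (assumption: pick k distinct datapoints).
    centers = rng.sample(points, k)
    owners = None
    for _ in range(max_iter):
        # Step 3: each datapoint is owned by its closest center.
        new_owners = [closest_center(p, centers) for p in points]
        # Step 6: stop once ownership no longer changes.
        if new_owners == owners:
            break
        owners = new_owners
        # Steps 4 and 5: each center jumps to the centroid of the points it owns.
        for j in range(k):
            owned = [p for p, o in zip(points, owners) if o == j]
            if owned:  # a center that owns nothing stays where it is
                centers[j] = centroid(owned)
    return centers, owners


# Toy data (made up for the example): two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, owners = k_means(points, k=2)
print(centers, owners)
```

On this toy data the first three points end up sharing one center and the last three the other, whichever two datapoints are drawn as the initial guess.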
![Page 8: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/8.jpg)
K-Means Step by step (1 & 2)
1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
![Page 9: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/9.jpg)
K-Means Step by step (3)
1. Ask…
2. Randomly guess k cluster Center locations
3. For each datapoint find out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
![Page 10: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/10.jpg)
K-Means Step by step (4)
1. Ask…
2. Randomly guess…
3. For each datapoint find out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
4. For each Center find the centroid of the points it owns
![Page 11: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/11.jpg)
K-Means Step by step (5 & 6)
1. Ask…
2. Randomly guess…
3. For each datapoint…
4. For each Center find the centroid of the points it owns
5. …and jumps there
6. …Repeat until terminated!
![Page 12: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/12.jpg)
K-Means Q&A
- What is it trying to optimize?
- Are we sure it will terminate?
- Are we sure it will find an optimal clustering?
- How should we start it?
- How could we automatically choose the number of centers?
![Page 13: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/13.jpg)
K-Means Q&A (2)
This clustering method is simple and reasonably effective.
The final cluster centers are not guaranteed to represent a global minimum, only a local one.
Completely different final clusters can arise from differences in the initial randomly chosen cluster centers.
![Page 14: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/14.jpg)
K-Means Q&A (3)
Are we sure it will terminate?
- There are only a finite number of ways of partitioning R records into k groups.
- So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own.
- If the configuration changes on an iteration, it must have improved the distortion.
- So each time the configuration changes it must go to a configuration it’s never been to before.
- So if it tried to go on forever, it would eventually run out of configurations.
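The argument relies on the "distortion", which is not written out here. A common definition, assumed in this sketch, is the sum of squared distances from each record to the Center that owns it. The symbols x_i (record i), c_j (Center j), owner(i) and S (the set of records owned by one Center) are notation introduced for the example.

```latex
\text{Distortion} \;=\; \sum_{i=1}^{R} \bigl\lVert x_i - c_{\mathrm{owner}(i)} \bigr\rVert^{2}

% For a fixed set S of owned records with centroid \bar{x}_S, and any center c:
\sum_{i \in S} \lVert x_i - c \rVert^{2}
  \;=\; \sum_{i \in S} \lVert x_i - \bar{x}_S \rVert^{2}
  \;+\; |S|\,\lVert \bar{x}_S - c \rVert^{2}
```

Under this definition neither half of an iteration can increase the distortion. Reassigning a record to its closest Center can only shrink that record's term. By the second identity, moving a Center to the centroid of the records it owns removes the non-negative term on the right.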
![Page 15: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/15.jpg)
K-Means Q&A (4)
Will we find the optimal configuration?
- Can you invent a configuration that has converged, but does not have the minimum distortion?
![Page 16: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/16.jpg)
K-Means Q&A (5)
Will we find the optimal configuration?
- Can you invent a configuration that has converged, but does not have the minimum distortion?
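One possible answer, as a hedged sketch: four points on the corners of a wide rectangle with k=2. The data, the helper names and the definition of distortion as a sum of squared distances are assumptions made for the illustration.

```python
import math


def distortion(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(min(math.dist(p, c) ** 2 for c in centers) for p in points)


def is_converged(points, centers):
    """True if every center already sits at the centroid of the points it owns."""
    for j, c in enumerate(centers):
        owned = [
            p for p in points
            if min(range(len(centers)), key=lambda m: math.dist(p, centers[m])) == j
        ]
        if not owned:
            return False
        mean = tuple(sum(x) / len(owned) for x in zip(*owned))
        if any(abs(a - b) > 1e-12 for a, b in zip(mean, c)):
            return False
    return True


# Four corners of a wide rectangle, k = 2.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
stuck = [(2.0, 0.0), (2.0, 1.0)]   # splits the bottom row from the top row
better = [(0.0, 0.5), (4.0, 0.5)]  # splits the left side from the right side

print(is_converged(points, stuck), distortion(points, stuck))
print(is_converged(points, better), distortion(points, better))
```

Both configurations are fixed points of the k-means iteration, but the first has distortion 16 and the second has distortion 1, so converging does not imply reaching the minimum.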
![Page 17: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/17.jpg)
K-Means Q&A (6)
Trying to find good optima
- Idea 1: Be careful about where you start
- Neat trick:
  - Place first center on top of randomly chosen datapoint.
  - Place second center on datapoint that’s as far away as possible from first center.
  - Place j’th center on datapoint that’s as far away as possible from the closest of Centers 1 through j-1.
- Idea 2: Do many runs of k-means, each from a different random start configuration
- Many other ideas floating around.
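The "neat trick" from Idea 1 might look like the following sketch. The function name `farthest_first_centers`, the `seed` parameter and the toy data are assumptions for the example; only the placement rule comes from the text.

```python
import math
import random


def farthest_first_centers(points, k, seed=0):
    rng = random.Random(seed)
    # First center: a randomly chosen datapoint.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # j'th center: the datapoint whose distance to the closest
        # already-placed center is as large as possible.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers


# Toy data (made up): three small groups around x = 0, x = 5 and x = 10.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.3)]
print(farthest_first_centers(points, k=3))
```

On this data the three starting centers land in three different groups regardless of which datapoint is drawn first. The returned list would then replace the random guess in step 2 of k-means.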
![Page 18: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/18.jpg)
K-Means Q&A (7)
Choosing the number of Centers
- A difficult problem
- Most common approach is to try to find the solution that minimizes the Schwarz Criterion
- Trying k from 2 to n !!
- Incrementally (k=2, then do 2-Means for each cluster, and so on…)
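The criterion itself is not spelled out in the text. One commonly quoted form, given here as an assumption rather than as the author's formula, penalizes the distortion by the number of free parameters. In this sketch m is the number of dimensions, k the number of Centers, R the number of records and λ a weighting constant; m and λ are symbols introduced for the example.

```latex
\text{Distortion} + \lambda \,(\#\text{parameters})\, \log R
  \;=\; \text{Distortion} + \lambda\, m\, k \,\log R
```

Raising k always lowers the distortion, so the penalty term is what keeps the criterion from simply choosing k = n.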
![Page 19: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/19.jpg)
Common uses of K-means
- Often used as an exploratory data analysis tool
- In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
- Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
- Also used for choosing color palettes on old-fashioned graphical display devices!
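As a small illustration of the one-dimensional use, the sketch below quantizes a handful of real values into k buckets of unequal width. The function name, the starting centers (evenly spaced positions in the sorted values), the fixed iteration count and the data are all assumptions for the example.

```python
def quantize_1d(values, k, iterations=50):
    values = sorted(values)
    # Assumption: start the k centers at evenly spaced positions in the sorted data.
    centers = [values[(2 * j + 1) * len(values) // (2 * k)] for j in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            buckets[nearest].append(v)
        # Each center moves to the mean of its bucket (empty buckets stay put).
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers, buckets


values = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 20.0, 21.0]
centers, buckets = quantize_1d(values, k=3)
print(centers)
print(buckets)
```

The buckets follow the gaps in the data instead of cutting the range 0.9 to 21.0 into three equal intervals.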
![Page 20: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/20.jpg)
Single Linkage Hierarchical Clustering
1. Say “Every point is its own cluster”
![Page 21: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/21.jpg)
Single Linkage Hierarchical Clustering (2)
1. Say “Every point is its own cluster”
2. Find “Most similar” pair of clusters
![Page 22: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/22.jpg)
Single Linkage Hierarchical Clustering (3)
1. Say “Every point is its own cluster”
2. Find “Most similar” pair of clusters
3. Merge it into a parent cluster
![Page 23: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/23.jpg)
Single Linkage Hierarchical Clustering (4)
1. Say “Every point is its own cluster”
2. Find “Most similar” pair of clusters
3. Merge it into a parent cluster
4. Repeat... until you’ve merged the whole dataset into one cluster
![Page 24: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/24.jpg)
Single Linkage Hierarchical Clustering (5)
1. Say “Every point is its own cluster”
2. Find “Most similar” pair of clusters
3. Merge it into a parent cluster
4. Repeat... until you’ve merged the whole dataset into one cluster
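The four steps can be sketched naively in plain Python. This is an illustrative version, not an efficient one: it recomputes all cluster-to-cluster distances on every pass. The function names and the toy data are assumptions, and "most similar" is taken to mean the smallest distance between any two points of the two clusters, as the slide title suggests.

```python
import math


def single_link_distance(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)


def single_linkage(points):
    """Return the merges as (cluster_a, cluster_b, distance) tuples, in order."""
    # Step 1: every point is its own cluster.
    clusters = [[p] for p in points]
    merges = []
    # Step 4: repeat until the whole dataset is one cluster.
    while len(clusters) > 1:
        # Step 2: find the most similar (closest) pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = single_link_distance(clusters[i], clusters[j])
        merges.append((list(clusters[i]), list(clusters[j]), d))
        # Step 3: merge the pair into a parent cluster.
        parent = clusters[i] + clusters[j]
        clusters = [c for m, c in enumerate(clusters) if m not in (i, j)] + [parent]
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (12.0, 0.0)]
for a, b, d in single_linkage(points):
    print(a, "+", b, "at distance", round(d, 3))
```

The list of merges is the hierarchy: n points always produce n-1 merges, and with single linkage the merge distances never decrease.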
![Page 25: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/25.jpg)
Hierarchical Clustering Q&A
How do we define similarity between clusters?
- Minimum distance between points in clusters (in which case we’re simply doing Euclidean Minimum Spanning Trees)
- Maximum distance between points in clusters
- Average distance between points in clusters
- And more…
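The three definitions listed above differ only in how the point-to-point distances are aggregated, as this short sketch shows. The function names follow the usual terms (single, complete and average linkage), which are not used in the text for the last two; the data is made up.

```python
import math
from itertools import product


def pair_distances(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_link(a, b):
    return min(pair_distances(a, b))       # minimum distance


def complete_link(a, b):
    return max(pair_distances(a, b))       # maximum distance


def average_link(a, b):
    d = pair_distances(a, b)
    return sum(d) / len(d)                 # average distance


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(a, b), complete_link(a, b), average_link(a, b))
```

For these two clusters the pairwise distances are 3, 5, 2 and 4, so the three definitions give 2, 5 and 3.5 respectively.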
![Page 26: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/26.jpg)
Hierarchical Clustering Q&A (bis)
![Page 27: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/27.jpg)
Hierarchical Clustering Q&A (2)
Single Linkage Comments
- Also known in the trade as Hierarchical Agglomerative Clustering (note the acronym)
- It’s nice that you get a hierarchy instead of an amorphous collection of groups
- If you want k groups, just cut the (k-1) longest links
- There’s no real statistical or information-theoretic foundation to this. Makes your lecturer feel a bit queasy.
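"Cutting the (k-1) longest links" can be illustrated through the minimum spanning tree connection mentioned earlier. In this sketch the links are taken to be the edges of the Euclidean minimum spanning tree, built here with Prim's algorithm. The function names, that choice of algorithm and the data are assumptions for the example.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (length, i, j) edges."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        d, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree for j in range(n) if j not in in_tree
        )
        edges.append((d, i, j))
        in_tree.add(j)
    return edges


def cut_into_groups(points, k):
    """Drop the (k-1) longest links and return one group label per point."""
    edges = sorted(minimum_spanning_tree(points))
    kept = edges[: len(edges) - (k - 1)]
    label = list(range(len(points)))  # each point starts as its own group

    def find(i):
        while label[i] != i:
            i = label[i]
        return i

    # Points joined by a kept link end up with the same label.
    for _, i, j in kept:
        label[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (12.0, 0.0)]
print(cut_into_groups(points, k=3))
```

With k=3 the two longest links (lengths 7 and 5) are removed, leaving the first two points in one group, the next two in another, and the last point on its own.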
![Page 28: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/28.jpg)
Cluster Silhouettes
For each example i we define a(i), the average dissimilarity between i and the objects of its own cluster A, the cluster to which i is assigned.
Then we compute d(i, C) for every cluster C other than A.
We keep b(i), the smallest of these distances to another cluster. The cluster B for which this minimum is attained, i.e. d(i, B) = b(i), is called the neighbor of object i. (It is the second-best choice of cluster for i.)
![Page 29: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/29.jpg)
Cluster Silhouettes (2)
Now we define s(i) as:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} $$

To understand the meaning of s(i), consider the extreme situations:
- When s(i) is close to 1, a(i), the average dissimilarity between i and the objects of its own cluster, is much smaller than b(i), the dissimilarity between i and the neighboring cluster. We can therefore say that i is well classified.
- When s(i) is close to 0, b(i) and a(i) are approximately equal, so it is not clear whether i should be assigned to A or to the neighboring cluster. Object i is about as far from one as from the other.
- The worst situation arises when s(i) is close to -1: a(i) is much larger than b(i), so on average i is closer to the neighboring cluster than to A.
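A minimal sketch of the silhouette computation, assuming the usual definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), Euclidean distance as the dissimilarity, and d(i, C) taken as the average dissimilarity between i and the objects of C. The function names, the value 0 returned for a single-object cluster and the toy data are assumptions; the code also expects at least two clusters.

```python
import math


def mean_distance(p, others):
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette(i, points, labels):
    p, own = points[i], labels[i]
    # a(i): average dissimilarity between i and the other objects of its cluster A.
    same = [q for j, q in enumerate(points) if labels[j] == own and j != i]
    if not same:
        return 0.0  # assumed convention for a cluster with a single object
    a = mean_distance(p, same)
    # b(i): smallest d(i, C) over the clusters C other than A.
    b = min(
        mean_distance(p, [q for q, lab in zip(points, labels) if lab == other])
        for other in set(labels) if other != own
    )
    if max(a, b) == 0:
        return 0.0  # all relevant objects coincide with i
    return (b - a) / max(a, b)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = ["C1", "C1", "C2", "C2"]
widths = [silhouette(i, points, labels) for i in range(len(points))]
print(widths, sum(widths) / len(widths))
```

In this toy example every object is far closer to its own cluster than to the other one, so each s(i) comes out at about 0.9.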
![Page 30: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/30.jpg)
Cluster Silhouettes (3)
[Figure: silhouette plot of the objects Li, J, Le, P, Ti, I, K and Ta, grouped into clusters C1 and C2. Horizontal axis: silhouette width, from 0.0 to 1.0. Average silhouette width: 0.8]
| SC | Interpretation |
| --- | --- |
| 0.71-1 | Strong structure |
| 0.51-0.7 | Reasonable structure |
| 0.26-0.5 | The structure is weak and could be artificial |
| ≤ 0.25 | No structure has been found |
![Page 31: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/31.jpg)
eduardo-poggi
http://ar.linkedin.com/in/eduardoapoggi
https://www.facebook.com/eduardo.poggi
@eduardoapoggi
![Page 32: Poggi analytics - clustering - 1](https://reader034.vdocument.in/reader034/viewer/2022042706/587beb481a28ab765a8b5c37/html5/thumbnails/32.jpg)
Bibliography