Hierarchical clustering in Python and beyond
![Page 1: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/1.jpg)
Hierarchical clustering
in Python & elsewhere
For @PyDataConf London, June 2015, by Frank Kelly
Data Scientist, Engineer @analyticsseo
@norhustla
![Page 2: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/2.jpg)
Hierarchical Clustering
Theory Practice Visualisation
Origins & definitions
Methods & considerations
Hierarchical theory
Metrics & performance
My use case
Python libraries
Example
Static
Interactive
Further ideas
All opinions expressed are my own
![Page 3: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/3.jpg)
Who am I?
![Page 5: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/5.jpg)
Clustering is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity.
"SLINK-Gaussian-data" by Chire - Own work. Licensed under CC BY-SA 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:SLINK-Gaussian-data.svg#/media/File:SLINK-Gaussian-data.svg
![Page 6: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/6.jpg)
Origins
1930s:
Anthropology & Psychology
http://dienekes.blogspot.co.uk/2013/12/europeans-neolithic-farmers-mesolithic.html
![Page 7: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/7.jpg)
Diverse applications
Attribution: Stack Overflow, Wikipedia, scikit-learn.org, http://www.poolparty.biz/
![Page 8: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/8.jpg)
Two main purposes
Exploratory analysis – standalone tool
(Data mining)
As a component of a supervised learning pipeline (in which distinct classifiers or regression models are trained for each cluster).
(Machine Learning)
![Page 9: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/9.jpg)
Clustering considerations
Partitioning criteria (single / multi level)
Separation (exclusive / non-exclusive)
Clustering space (full-space / sub-space)
Similarity measure (distance / connectivity)
![Page 10: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/10.jpg)
Use case: search keywords
[Diagram relating RD, P, KW, CP and CD nodes, with the labels "You", "The competition!" and "Opportunity!"]
CD = Competing domains
CP = Competitor’s pages
RD = Ranking domain
P = Your page
KW = Keyword
![Page 11: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/11.jpg)
…x 100,000!!
![Page 12: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/12.jpg)
Use case: search keywords
…so we have found 100,000 new KWs. Now what?
How do we summarise and present these to a client?
![Page 13: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/13.jpg)
Clients’ questions…
• Do search categories in general align with my website structure?
• Which categories of opportunity keywords have the highest search volume, bring the most visitors, revenue etc.?
• Which keywords are not relevant?
![Page 14: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/14.jpg)
Website-like structure
![Page 15: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/15.jpg)
Requirements
• Need: visual insights; structure
• Allow targeting of problem in hand
• May develop into a semi-supervised solution
![Page 16: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/16.jpg)
Options for text clustering?
• High-dimensional and sparse data set
• Values correspond to word frequencies
• Recommended methods include: hierarchical clustering, k-means with an appropriate distance measure, topic modelling (LDA, LSI), co-clustering
![Page 17: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/17.jpg)
Hierarchical Clustering
bringing structure
![Page 18: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/18.jpg)
2 types
Agglomerative
Divisive
Deterministic algorithms!
Attribution: Wikipedia
![Page 19: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/19.jpg)
Agglomerative
Start with many “singleton” clusters
…Merge 2 at a time
continuously…
Build a hierarchy
Divisive
Start with a huge “macro” cluster
…Iteratively split into 2
groups…
Build a hierarchy
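A minimal sketch of the agglomerative loop described above, in plain Python on 1-D points. The function names are hypothetical, the cluster distance used is the closest pair (single link), and the brute-force search is for illustration only, not how a library would implement it.

```python
from itertools import combinations

def single_link(a, b, points):
    """Distance between clusters a and b: distance of their closest pair."""
    return min(abs(points[i] - points[j]) for i in a for j in b)

def agglomerate(points):
    """Naive bottom-up clustering of 1-D points; returns the merges in order."""
    clusters = [(i,) for i in range(len(points))]   # start with singleton clusters
    merges = []
    while len(clusters) > 1:
        # find the two closest clusters and merge them
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_link(pair[0], pair[1], points))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))                       # the hierarchy, bottom to top
    return merges

merges = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0])
```

Each entry in `merges` records which two clusters (as tuples of point indices) were joined at that step; the last entry joins everything into one cluster.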
![Page 20: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/20.jpg)
Agglomerative method: Linkage types
• Single (similarity between the two most similar elements, i.e. nearest neighbours)
• Complete (similarity between the two most dissimilar elements)
Attribution: https://www.coursera.org/course/clusteranalysis
![Page 21: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/21.jpg)
![Page 22: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/22.jpg)
![Page 23: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/23.jpg)
Agglomerative method: Linkage types
Average link (avg. of similarity between all inter-cluster pairs)
Computationally expensive (Na*Nb)
Trick: Centroid link (similarity between centroids of two clusters)
Attribution: https://www.coursera.org/course/clusteranalysis
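The linkage types above can be tried side by side with SciPy. This is a sketch on synthetic data (two made-up Gaussian blobs), not the talk's keyword data; on well-separated blobs all four methods should agree, and differences only show up on messier data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# two well-separated blobs of 10 points each (synthetic, for illustration)
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])

labels_by_method = {}
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method, metric="euclidean")   # (n-1) x 4 merge table
    # cut the hierarchy so that at most 2 flat clusters remain
    labels_by_method[method] = fcluster(Z, t=2, criterion="maxclust")
```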
![Page 24: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/24.jpg)
Ward’s criterion
• Minimise a function: total in-cluster variance
• As defined by, e.g.:
• Once two clusters are merged, the SSE increases (the merged cluster is bigger) by:
https://en.wikipedia.org/wiki/Ward's_method
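The equations on this slide were images and did not survive. As an assumption, the standard textbook form is: SSE = sum over clusters k of sum over x in C_k of ||x - mu_k||^2, and merging clusters A and B raises the SSE by (n_A * n_B) / (n_A + n_B) * ||mu_A - mu_B||^2. The sketch below checks that identity numerically on made-up points; the helper names are hypothetical.

```python
import numpy as np

def sse(points):
    """Sum of squared distances of the points to their own centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def ward_cost(a, b):
    """Increase in SSE caused by merging clusters a and b (Ward's criterion)."""
    gap = a.mean(axis=0) - b.mean(axis=0)
    return len(a) * len(b) / (len(a) + len(b)) * float(gap @ gap)

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[4.0, 0.0], [5.0, 0.0], [6.0, 0.0]])

# SSE of the merged cluster minus the SSE of the two parts
increase = sse(np.vstack([a, b])) - sse(a) - sse(b)
```

Here `increase` and `ward_cost(a, b)` agree, so the merge cost can be computed from cluster sizes and centroids alone, without revisiting every point.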
![Page 25: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/25.jpg)
Divisive clustering
• Top-down approach
• Criterion to split: Ward’s criterion
• Handling noise: Use a threshold to determine the termination criteria
Attribution: https://www.coursera.org/course/clusteranalysis
![Page 26: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/26.jpg)
Similarity measures
This will certainly influence the shape of the clusters!
• Numerical: Use a variation of the Minkowski distance (e.g. Manhattan / city block, Euclidean)
• Binary: Manhattan, Jaccard coefficient, Hamming
• Text: Cosine similarity.
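A short sketch of the numerical and binary measures using `scipy.spatial.distance`; the vectors are made up for illustration.

```python
import numpy as np
from scipy.spatial import distance

# numerical vectors
u, v = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 2.0])
manhattan = distance.cityblock(u, v)   # sum of absolute differences
euclidean = distance.euclidean(u, v)   # straight-line distance

# binary vectors
p = np.array([1, 0, 1, 1], dtype=bool)
q = np.array([1, 1, 0, 1], dtype=bool)
hamming = distance.hamming(p, q)       # fraction of positions that differ
jaccard = distance.jaccard(p, q)       # differing positions / positions where either is set
```

Note that SciPy returns distances (dissimilarities), so a Jaccard coefficient in the similarity sense would be `1 - jaccard`.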
![Page 27: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/27.jpg)
Cosine similarity
Represent a document by a bag of terms
Record the frequency of a particular term (word/topic/phrase)
If d1 and d2 are two term vectors,
…we can thus calculate the similarity between them
Attribution: https://www.coursera.org/course/clusteranalysis
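The formula itself was an image on the slide; the usual definition is cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||). A minimal sketch in plain Python, assuming whitespace tokenisation and raw term counts (the example phrases are invented):

```python
import math
from collections import Counter

def cosine_similarity(doc1, doc2):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    d1, d2 = Counter(doc1.lower().split()), Counter(doc2.lower().split())
    dot = sum(d1[t] * d2[t] for t in d1.keys() & d2.keys())   # shared terms only
    norm1 = math.sqrt(sum(c * c for c in d1.values()))
    norm2 = math.sqrt(sum(c * c for c in d2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

sim = cosine_similarity("cheap car insurance", "car insurance quotes")
```

Two of the three terms are shared, so `sim` is 2/3; identical phrases score 1 and phrases with no terms in common score 0.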
![Page 28: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/28.jpg)
![Page 29: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/29.jpg)
Gather word documents = keyword phrases
![Page 30: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/30.jpg)
Aggregate search words with URL “words”
![Page 31: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/31.jpg)
Text clustering: preparations
• Add features where possible
  o I added URL words to my word set
• Stem words
  o Choose the right stemmer – too severe can be bad
• Stop words
  o NLTK tokeniser
  o Scikit learn TF-IDF tokeniser
• Low frequency cut-off
  o 2 => words appearing less than twice in whole corpus
• High frequency cut-off
  o 0.5 => words that appear in more than 50% of documents
• N-grams
  o Single words, bi-grams, tri-grams
• Beware of foreign languages
  o Separate datasets if possible
![Page 32: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/32.jpg)
Text preparation
![Page 33: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/33.jpg)
Dimensionality
• Get a sparse matrix
  o Mostly zeros
• Reduce the number of dimensions
  o PCA
  o Spectral clustering
• The “curse” of dimensionality
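A sketch of reducing a sparse document-term matrix. Note the swap: this uses scikit-learn's `TruncatedSVD` rather than the PCA named on the slide, because it accepts sparse input directly without densifying it. The random matrix is a stand-in for real TF-IDF output, and 20 components is an arbitrary choice.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# stand-in for a TF-IDF matrix: 200 documents x 1,000 terms, about 1% non-zero
X = sparse_random(200, 1000, density=0.01, format="csr", random_state=0)

svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)   # dense array with 20 columns per document
```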
![Page 34: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/34.jpg)
![Page 35: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/35.jpg)
Results: reduced dimensions
![Page 36: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/36.jpg)
Results: reduced dimensions
![Page 37: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/37.jpg)
The dendrogram
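A minimal sketch of building a dendrogram with SciPy. The points and labels are made up; `no_plot=True` returns only the layout so the snippet runs without a display, and removing it (with matplotlib installed) draws the tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
labels = ["a", "b", "c", "d", "e"]   # hypothetical keyword names

Z = linkage(X, method="ward")        # merge table, one row per merge
tree = dendrogram(Z, labels=labels, no_plot=True)
leaf_order = tree["ivl"]             # leaf labels, left to right
```

Points merged early (here "a" with "b", and "c" with "d") end up next to each other in `leaf_order`.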
![Page 38: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/38.jpg)
Assess the quality of your clusters
• External (compared against known labels): purity, completeness & homogeneity, adjusted Rand index, normalised mutual information
• Internal (no labels needed): e.g. silhouette coefficient
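A sketch of scoring a clustering against known labels with `sklearn.metrics`. Both label lists are invented for illustration; in practice the reference labels would come from a hand-labelled sample.

```python
from sklearn import metrics

truth     = [0, 0, 0, 1, 1, 1]    # hypothetical hand-labelled categories
predicted = [1, 1, 1, 0, 0, 2]    # hypothetical cluster assignments

scores = {
    "adjusted_rand": metrics.adjusted_rand_score(truth, predicted),
    "nmi": metrics.normalized_mutual_info_score(truth, predicted),
    "homogeneity": metrics.homogeneity_score(truth, predicted),
    "completeness": metrics.completeness_score(truth, predicted),
}
```

Every cluster here contains a single category, so homogeneity is perfect, but one category is split over two clusters, which lowers completeness and the adjusted Rand index.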
![Page 39: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/39.jpg)
Topic labelling
![Page 40: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/40.jpg)
Hierarchical Clustering
Beyond Python (!?)
![Page 41: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/41.jpg)
Life on the inside: Elasticsearch
• Why not perform pre-processing and clustering inside Elasticsearch?
• Document store
• TF-IDF and other
• Stop words
• Language-specific analysers
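As a rough illustration of pre-processing inside Elasticsearch, index settings can declare a custom analyser that lowercases, removes stop words and stems. This is a sketch only: the analyser name `keyword_text` is made up, the settings are built as a Python dict and serialised to JSON, and no client call is shown because the exact API varies by Elasticsearch version.

```python
import json

# hypothetical index settings; "keyword_text" is an invented analyser name
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "keyword_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    # built-in token filters: lowercasing, stop words, Porter stemming
                    "filter": ["lowercase", "stop", "porter_stem"],
                }
            }
        }
    }
}

body = json.dumps(index_settings, indent=2)   # JSON body to send when creating the index
```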
![Page 42: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/42.jpg)
Elasticsearch: try it!
• https://www.elastic.co/
• NoSQL document store
• Aggregations and stats
• Fast, distributed
• Quick to set up
![Page 43: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/43.jpg)
Document storage in ES
![Page 44: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/44.jpg)
Lingo 3G algorithm
• Lingo 3G: Hierarchical clustering off-the-shelf
• Built-in part of speech (POS)
• User-defined word/synonym/label dictionaries
• Built-in stemmer / word inflection database
• Multi-lingual support, advanced tuning
• Commercial: costs attached
http://download.carrotsearch.com/lingo3g/manual/#section.es
http://project.carrot2.org/algorithms.html
![Page 45: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/45.jpg)
Elasticsearch with clustering – Utopia?
Carrot2’s Lingo3G in action : http://search.carrot2.org/stable/search
Foamtree visualisation example
Visualisation of hierarchical structure possible for large datasets via “lazy loading”
http://get.carrotsearch.com/foamtree/demo/demos/large.html
![Page 46: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/46.jpg)
Limitations of hierarchical clustering
• Can’t undo what’s done: the divisive method works on sub-clusters and cannot re-merge them. The same is true for the agglomerative method (once merged, clusters are never split again)
• Every split or merge must be refined
• Methods may not scale well: checking all possible pairs drives the complexity up
There are extensions: BIRCH, CURE and CHAMELEON
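To put the scaling point in numbers, here is a back-of-the-envelope sketch of what "checking all possible pairs" means for 100,000 keywords, assuming one 8-byte float per pairwise distance (the helper name is hypothetical).

```python
def pairwise_cost(n, bytes_per_distance=8):
    """Number of distinct pairs among n items and the bytes needed to store their distances."""
    pairs = n * (n - 1) // 2
    return pairs, pairs * bytes_per_distance

pairs, size_bytes = pairwise_cost(100_000)
size_gb = size_bytes / 1e9   # roughly 40 GB just for the distance matrix
```

That is about five billion pairs, which is why a full distance matrix for a dataset of this size does not fit comfortably in memory.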
![Page 47: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/47.jpg)
Thank you!
A decent introductory course to clustering: https://www.coursera.org/course/clusteranalysis
Hierarchical (agglomerative) clustering in Python: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
Recent (ish) relevant Kaggle challenge: https://www.kaggle.com/c/lshtc
Visualisation: http://carrotsearch.com/foamtree-overview
Clustering elsewhere (Lingo, Lingo3G) with Carrot2: http://download.carrotsearch.com/
Elasticsearch: https://www.elastic.co/
Analytics SEO: http://www.analyticsseo.com/
Me: @norhustla / [email protected]
Attribution: http://wynway.com/
![Page 48: Hierarchical clustering in Python and beyond](https://reader031.vdocument.in/reader031/viewer/2022013115/55c575d8bb61eb6e358b47dc/html5/thumbnails/48.jpg)
Extra slide: Why work inside the database?
1. Sharing data (management of): support concurrent access by multiple readers and writers
2. Data model enforcement: make sure all applications see clean, organised data
3. Scale: work with datasets too large to fit in memory (over a certain size, you need specialised algorithms to deal with the data -> bottleneck). The database organises and exposes algorithms for you conveniently
4. Flexibility: use the data in new, unanticipated ways -> anticipate a broad set of ways of accessing the data