may 17 2001casi 2001 interactive clustering overview and tools wayne oldford university of waterloo
TRANSCRIPT
![Page 1: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/1.jpg)
May 17 2001 CASI 2001
Interactive ClusteringOverview and Tools
Wayne Oldford
University of Waterloo
![Page 2: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/2.jpg)
May 17 2001 CASI 2001
Overview
1. Finding groups in data
2. Interactive data analysis
3. Enlarging the problem
4. Putting it together
5. Software modelling (illustration)
6. Summary
![Page 3: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/3.jpg)
May 17 2001 CASI 2001
1.Finding groups in data
– Web documents as objects to be grouped
• Objects to be grouped together– locations– pairwise (dis)similarities
• Applications:
– Building groups to understand/model
– Building groups to use later as classification– Building groups to serve as templates
![Page 4: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/4.jpg)
May 17 2001 CASI 2001
group definition is a problem
Group definition (like with like)• homogeneous vs heterogeneous• part of pattern
![Page 5: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/5.jpg)
May 17 2001 CASI 2001
Clustering approaches
• Agglomerative (near points/clusters are joined)
– Single linkage– Complete linkage– Average linkage
• Recursive splitting– e.g. minimal spanning tree
![Page 6: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/6.jpg)
May 17 2001 CASI 2001
Cluster hierarchies
• Clusters are nested• Often represented as a tree (dendrogram) • Join/split history and ‘strength’ preserved
![Page 7: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/7.jpg)
May 17 2001 CASI 2001
Other approaches• k-means
– assign points to k groups– re-assign to improve objective function
• model-based– likelihood/Bayesian; model search/averaging
• density estimation– groups = high-density regions
• classification to cluster
• visually motivated methods
![Page 8: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/8.jpg)
May 17 2001 CASI 2001
Visual Empirical Regions of Influence (VERI)
![Page 9: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/9.jpg)
May 17 2001 CASI 2001
Notes
• many choices– between and within methods
• built-in biases for shapes
• computationally costly– O(n2) ...
Conceptual model: algorithmic, run to completion
![Page 10: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/10.jpg)
May 17 2001 CASI 2001
typical software
• resources dedicated to numerical computation
– teletype interaction
– runs to completion
– graphical “output”
Compare to interactive data analysis
![Page 11: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/11.jpg)
May 17 2001 CASI 2001
2. Interactive data analysis
QuickTime™ and aGraphics decompressor
are needed to see this picture.
![Page 12: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/12.jpg)
May 17 2001 CASI 2001
Interactive data analysis
• exploratory, tentative
• graphical
• non-algorithmic– varied granularity
• integrated
• deep interaction
![Page 13: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/13.jpg)
May 17 2001 CASI 2001
3. Enlarging the problem
Mutually exclusive and exhaustive groups
g1, g2, …, gk
form a partition
P = {g1, g2, …, gk}
of the set of data objects.
Goal: Explore the space of possible partitions.
![Page 14: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/14.jpg)
May 17 2001 CASI 2001
Structuring the partition space
PA={g1, g2, …, ga} and PB={h1, h2, …, hb}
• When a > b, PA call a finer partition than PB.
• When a = b, PA is called a reassignment of PB
• PA is nested in PB only if a > b and every gi is a
subset of a single hj - write PA } PB or PB { PA
– PA is called a refinement of PB (or PB a reduction of PA)
![Page 15: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/15.jpg)
May 17 2001 CASI 2001
Reduction
P1={g1, ..., g6} -> P2={h1, …, h4} -> P3={m1, m2, m3}
• hi = gi i = 1, 2 ; h3= join (g3, g4) ; h4= join (g5, g6)
– nesting: P1 } P2
– split (h4) = {h1*, h2
*, h3*}; mi = join (hi
*, hi )
– P2 } P3 is false
• disperse elements of h4 over hi i = 1, 2, 3 to give mi for i = 1, 2, 3.
![Page 16: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/16.jpg)
May 17 2001 CASI 2001
Reduction decisions/options• join operations: which groups?
– e.g. inner, outer, centres, …– distance measures to use …
• dispersal operations:– selecting group(s)
• Max volume, eigen-value, MST…
– determining partitional method• random, VERI, MST, …
– choosing join …
![Page 17: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/17.jpg)
May 17 2001 CASI 2001
Refinement
P2={h1, …, h4} ---> P1={g1, ..., g6}
• gi = hi i = 1, 2 ; split (h3) -> g3, g4
split (h4) -> g5, g6
nesting: P2 { P1
![Page 18: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/18.jpg)
May 17 2001 CASI 2001
Refinement decisions/options
• which groups to split?– e.g. inner, outer, directions, …– distance measures to use …
• how to split? – MST, outlying points,
reassignment, ...
![Page 19: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/19.jpg)
May 17 2001 CASI 2001
Reassignment
• objective function d(P) to be minimized. P <- P1
• for each object o in gi, assign it to one of gj (j != i) forming a new partition Pij and find largest
ij(o) = d(P) - d(Pij)
• repeat for all i, j. If max ij > 0 move o from gi, to gj
• Repeat until max <=0
P1={g1, ..., gk} -> P2={h1, …, hk}
![Page 20: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/20.jpg)
May 17 2001 CASI 2001
Reassignment decisions/options
• Objective function– distances, centres, …– within vs between/within, ...– variates/directions
• Iteration strategy – single-pass, k-means, complete
looping (greedy), start, …
![Page 21: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/21.jpg)
May 17 2001 CASI 2001
4. Putting it together
Series of moves in partition space:
1. Refine (P) -- > Pnew
2. Reduce (P) -- > Pnew
3. Reassign (P) -- > Pnew
![Page 22: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/22.jpg)
May 17 2001 CASI 2001
Additional ops on partitions
• Unary:– Subset (P)– Operate any of R (subset (P)) – Manual (P) … change P according to manual
intervention (e.g. colouring)
![Page 23: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/23.jpg)
May 17 2001 CASI 2001
n-ary operators• resolve (P1, ..., Pm) --> Pnew
• dissimilarity (Pi, Pj) --> di,j
• display (P1, ..., Pm)– dendrogram if P1 { …{ Pm
– mds plot of all clusters in P1, …, Pm
– mds plot of all partitions P1, …, Pm
![Page 24: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/24.jpg)
May 17 2001 CASI 2001
5. Software modelling
• Principal control panel:– current partition and list of saved partitions– refine, reduce, re-assign, re-start buttons– cluster plot button (mds plot)– random select button– subset focus and join toggle– operation on partitions button– manual button (form partition from point colours)
![Page 25: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/25.jpg)
May 17 2001 CASI 2001
Secondary panels• Refine:
– performs refine, offers access to arguments
• Reduce – performs reduce, offers access to arguments
• Reassign – performs reassign, offers access to arguments
• Each will operate on only those points highlighted or on all if none selected.
![Page 26: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/26.jpg)
May 17 2001 CASI 2001
Secondary panels (continued)
• Operate on partitions– saved partitions list– resolve selected partition– plot selected partitions using selected
dissimilarity– dendrogram of selected partitions (if nested)– cluster-plot for clusters of selected parttitions
(esp. for non-nested)
![Page 27: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/27.jpg)
May 17 2001 CASI 2001
Software modelling (details)
• Objects:– Point-symbols, case-objects (existing in Quail)– Cluster-points – Clusters– Partitions
• Methods– Reduce, refine, reassign, ...
![Page 28: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/28.jpg)
May 17 2001 CASI 2001
Software illustration
• Two prototype displays (buggy)– Single-window– Separate windows
• Integration with existing Quail graphics
• Manual, dendrogram, cluster plots, …
• VERI clustering
![Page 29: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/29.jpg)
May 17 2001 CASI 2001
6. Summary
• Cluster analysis is naturally exploratory and needs integration with modern interactive data analysis.
• Enlarging the problem to partitions:– simplifies and gives structure– encourages exploratory approach– integrates naturally– introduces new possibilities (analysis and
research)
![Page 30: May 17 2001CASI 2001 Interactive Clustering Overview and Tools Wayne Oldford University of Waterloo](https://reader035.vdocument.in/reader035/viewer/2022070414/5697c01e1a28abf838cd11c7/html5/thumbnails/30.jpg)
May 17 2001 CASI 2001
Acknowledgements
• Erin McLeish, several undergraduates and graduate students in statistical computing course at Waterloo
• Quail: Quantitative Analysis in Lisp
http:/www.stats.uwaterloo.ca/Quail