![Page 1: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/1.jpg)
Clustering Categorical Data: An Approach Based on Dynamical Systems (1998)
David Gibson, Jon Kleinberg, Prabhakar Raghavan VLDB Journal: Very Large Data Bases
Aaron Sherman
![Page 2: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/2.jpg)
Presentation
What is this presentation about?
Definitions and Algorithms
Evaluations with Generated Data
Real World Test
Conclusions + Q&A
![Page 3: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/3.jpg)
Categorize this!
Categorizing integers is easy, but what about words like “red,” “blue,” “august,” and “Moorthy”?
STIRR – Sieving Through Iterated Relational Reinforcement
![Page 4: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/4.jpg)
Why is STIRR Better?
No a Priori Quantization
Correlation vs. Categorical Similarity
New Methods for Hypergraph Clustering
![Page 5: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/5.jpg)
Definitions
Table of Relational Data
– Set T of tuples
– Set of k fields (columns), each with many possible values
– Abstract node – one for each possible value in each field
– A tuple $\tau \in T$ consists of one node from each field
Configuration – assignment of a weight $w_v$ to each node $v$; the whole configuration is written $w$
N(w) – Normalization function – rescales all weights so their squares add up to 1
Dynamical system – repeated application of a function f
Fixed point – point u where f(u) = u
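As a rough illustration of a configuration and the normalization function N(w), the sketch below stores one weight per node in a dictionary and rescales the weights so their squares sum to 1. The node naming as `(field, value)` pairs and the toy values are assumptions made for the example, and the rescaling is done over all nodes at once, as the slide words it.

```python
import math

def normalize(weights):
    """Rescale weights so that their squares sum to 1 (the role of N(w))."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        raise ValueError("cannot normalize an all-zero configuration")
    return {node: w / norm for node, w in weights.items()}

# A configuration assigns one weight to each node.
# Nodes are written here as hypothetical (field, value) pairs.
config = {
    ("color", "red"): 1.0,
    ("color", "blue"): 1.0,
    ("month", "august"): 1.0,
    ("month", "may"): 1.0,
}
normalized = normalize(config)
```

With four equal weights, each normalized weight is 0.5, since 4 × 0.5² = 1.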
![Page 6: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/6.jpg)
Where is all this going?
![Page 7: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/7.jpg)
Weighting Scheme
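To update the weight $w_v$:
– For each tuple $\tau = \{v, u_1, \dots, u_{k-1}\}$ containing $v$
• $x_\tau \leftarrow \oplus(w_{u_1}, \dots, w_{u_{k-1}})$, where $\oplus$ is the combining operator
– $w_v \leftarrow \sum_\tau x_\tau$
Normalize the updated weights with N to obtain f(w)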
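The update rule on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the toy table, the `(field, value)` node naming, and the choice to compute all new weights from the previous configuration (a synchronous update) are assumptions.

```python
import math

def stirr_step(tuples, weights, combine):
    """One application of f: update every node weight, then normalize."""
    new_weights = {node: 0.0 for node in weights}
    for tup in tuples:
        for i, v in enumerate(tup):
            # Weights of the other nodes that share this tuple with v.
            others = [weights[u] for j, u in enumerate(tup) if j != i]
            # x_tau = combine(others); w_v accumulates x_tau over all tuples.
            new_weights[v] += combine(others)
    norm = math.sqrt(sum(w * w for w in new_weights.values()))
    return {node: w / norm for node, w in new_weights.items()}

# Hypothetical toy table with two fields.
tuples = [
    (("color", "red"), ("month", "august")),
    (("color", "red"), ("month", "august")),
    (("color", "blue"), ("month", "may")),
]
weights = {node: 1.0 for tup in tuples for node in tup}
for _ in range(5):
    weights = stirr_step(tuples, weights, sum)
```

Because "red" and "august" co-occur twice, their weights grow relative to "blue" and "may" with each iteration.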
![Page 8: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/8.jpg)
Combining Operator Π
Product Operator Π: $\oplus(w_1, \dots, w_k) = w_1 w_2 \cdots w_k$
Non-linear term – encodes co-occurrence strongly
Does not converge
Relatively small number of large basins
Very useful data in early iterations
![Page 9: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/9.jpg)
Combining Operator +
Addition Operator +: $\oplus(w_1, \dots, w_k) = w_1 + w_2 + \dots + w_k$
Linear
Does a good job converging
![Page 10: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/10.jpg)
Combining Operator Sp
$S_p$ – Combining Rule: $\oplus(w_1, \dots, w_k) = \left(w_1^p + w_2^p + \dots + w_k^p\right)^{1/p}$
Non-linear term – encodes co-occurrence strongly
Does a good job converging
![Page 11: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/11.jpg)
Combining Operator S∞
$S_\infty$ – Limiting version of $S_p$
Takes the largest value among the weights
Easy to compute, sum-like properties
Converges the best of all options shown
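The four combining operators can be compared side by side in a short sketch. It assumes non-negative weights and takes the Sp rule to be the p-th root of the sum of p-th powers, which is consistent with the max rule being its limiting version. The function names are illustrative only.

```python
import math

def combine_product(ws):
    """Product rule: w1 * w2 * ... * wk."""
    return math.prod(ws)

def combine_sum(ws):
    """Addition rule: w1 + w2 + ... + wk."""
    return sum(ws)

def combine_sp(ws, p=2):
    """S_p rule (assumed form): (w1^p + ... + wk^p)^(1/p), for non-negative weights."""
    return sum(w ** p for w in ws) ** (1.0 / p)

def combine_max(ws):
    """Limiting rule: the largest weight."""
    return max(ws)

ws = [0.2, 0.5, 0.1]
product_value = combine_product(ws)
sum_value = combine_sum(ws)
s2_value = combine_sp(ws, p=2)
s50_value = combine_sp(ws, p=50)   # a large p approaches the max rule
max_value = combine_max(ws)
```

As p grows, the largest weight dominates the sum of powers, so `combine_sp` approaches `combine_max`.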
![Page 12: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/12.jpg)
Initial Configuration
Uniform Initialization – all weights = 1
Random Initialization – independently choose a value in [0, 1] for each weight, then normalize
– Some operators are more sensitive to initial configurations than others
Masking / Modification – a specific rule sets certain nodes to a higher or lower value
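The three ways of choosing a starting configuration might look like the following sketch. The helper names are hypothetical, and the masking rule shown (overwrite one node's weight with a chosen value) is only one possible reading of the slide, which does not give the exact rule.

```python
import math
import random

def normalize(weights):
    """Rescale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {node: w / norm for node, w in weights.items()}

def uniform_init(nodes):
    """Uniform initialization: every weight starts at 1."""
    return {node: 1.0 for node in nodes}

def random_init(nodes, rng):
    """Random initialization: independent values in [0, 1), then normalize."""
    return normalize({node: rng.random() for node in nodes})

def mask(weights, node, value):
    """Masking (assumed rule): force one node to a chosen higher or lower weight."""
    masked = dict(weights)
    masked[node] = value
    return masked

nodes = ["red", "blue", "august", "may"]
uniform = uniform_init(nodes)
rand = random_init(nodes, random.Random(0))  # seeded for repeatability
masked = mask(uniform, "red", 0.0)           # suppress the node "red"
```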
![Page 13: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/13.jpg)
Run Time - Linear
![Page 14: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/14.jpg)
Quasi-Random Input
Create semi-random data, then add tuples to the data to create artificial clusters
– Use this to test whether STIRR works
Questions
• Number of iterations
• Density of cluster relative to background
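One hypothetical way to build such quasi-random input is sketched below: background tuples draw every field uniformly, and extra planted tuples draw only from a small set of values per field. The generator, its parameter names, and the choice of values `0 .. cluster_size - 1` as the cluster are all assumptions; the slide does not give the exact procedure.

```python
import random

def make_planted_table(n_random, n_planted, n_fields, values_per_field,
                       cluster_size, rng):
    """Random background tuples followed by tuples confined to a planted cluster."""
    def draw(limit):
        # One node per field; a node is a (field index, value index) pair.
        return tuple((f, rng.randrange(limit)) for f in range(n_fields))

    background = [draw(values_per_field) for _ in range(n_random)]
    planted = [draw(cluster_size) for _ in range(n_planted)]
    return background + planted

rng = random.Random(42)  # seeded for repeatability
table = make_planted_table(n_random=200, n_planted=50, n_fields=3,
                           values_per_field=100, cluster_size=5, rng=rng)
```

Varying `n_planted` relative to `n_random` changes the density of the cluster against the background, which is one of the questions the slide raises.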
![Page 15: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/15.jpg)
How well does STIRR distil a cluster of nodes with above-average co-occurrence?
Number of iterations
Purity
![Page 16: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/16.jpg)
How well does STIRR separate distinct planted clusters?
Will the data partition?
How long to partition?
S(A,B) = (|a0 – b0| + |a1 – b1|) / total nodes
For clusters A and B: a0 and b0 are the numbers of nodes from A and B at one end of the weight ordering; a1 and b1 are the numbers at the other end
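The separation measure can be computed as in the sketch below. Two details are assumptions, since the slide does not fix them: the "ends" are taken to be the `end_size` highest-weight and `end_size` lowest-weight nodes, and "total nodes" is taken to be the number of nodes in A and B combined, so that a perfect split scores 1.

```python
def separation(weights, cluster_a, cluster_b, end_size):
    """S(A,B) = (|a0 - b0| + |a1 - b1|) / total nodes; end_size must be positive."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    top, bottom = set(ranked[:end_size]), set(ranked[-end_size:])
    a0, b0 = len(top & cluster_a), len(top & cluster_b)        # one end
    a1, b1 = len(bottom & cluster_a), len(bottom & cluster_b)  # other end
    return (abs(a0 - b0) + abs(a1 - b1)) / (len(cluster_a) + len(cluster_b))

cluster_a, cluster_b = {"a1", "a2"}, {"b1", "b2"}

# Clusters sit at opposite ends of the weight ordering.
split = separation({"a1": 0.9, "a2": 0.8, "b1": -0.7, "b2": -0.9},
                   cluster_a, cluster_b, end_size=2)

# Clusters are interleaved, so neither end favors one cluster.
mixed = separation({"a1": 0.9, "b1": 0.8, "a2": -0.7, "b2": -0.9},
                   cluster_a, cluster_b, end_size=2)
```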
![Page 17: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/17.jpg)
How well does STIRR cope with clusters in a few columns with the rest random?
Want to mask irrelevant factors (columns)
![Page 18: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/18.jpg)
Effect of Convergence Operator
Max function is the best
Product rule does not converge
Sum rule is good, but slow
![Page 19: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/19.jpg)
Real World Data
Papers on theory and database systems
– Tuples of the form (Author 1, Author 2, Journal, Year)
– The two sets of papers were clearly separated in the STIRR representation
– Done using Sp
– Grouped most theoretical papers around 1976
![Page 20: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/20.jpg)
Login Data from IBM Servers
Masked one user who logged in / out very frequently
4 highest-weight (similar) users – root, help, and two administrators' names
8pm-12am very similar
![Page 21: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/21.jpg)
Conclusion
Powerful technique to categorize data
Relatively fast algorithm: O(n)
Questions?
![Page 22: Aaron Sherman](https://reader036.vdocument.in/reader036/viewer/2022062408/56813650550346895d9dd3d4/html5/thumbnails/22.jpg)
Additional References
Data Clustering Techniques - Qualifying Oral Examination Paper - Periklis Andritsos
– http://www.cs.toronto.edu/~periklis/pubs/depth.pdf