Download - CS206 --- Electronic Commerce
![Page 1: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/1.jpg)
1
More Clustering
CURE AlgorithmNon-Euclidean Approaches
![Page 2: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/2.jpg)
2
The CURE Algorithm
Problem with BFR/k -means:Assumes clusters are normally distributed in each dimension.And axes are fixed --- ellipses at an angle are not OK.
CURE:Assumes a Euclidean distance.Allows clusters to assume any shape.
![Page 3: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/3.jpg)
3
Example: Stanford Faculty Salaries
e e
ee
e e
e
e ee
e
h
h
h
h
h
h h
h
hh h
hh
salary
age
![Page 4: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/4.jpg)
4
Starting CURE
1. Pick a random sample of points that fit in main memory.
2. Cluster these points hierarchically ---group nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.
![Page 5: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/5.jpg)
5
Example: Initial Clusters
e e
ee
e e
e
e ee
e
h
h
h
h
h
h
h h
h
h
h
h h
salary
age
![Page 6: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/6.jpg)
6
Example: Pick Dispersed Points
e e
ee
e e
e
e ee
e
h
h
h
h
h
h
h h
h
h
h
h h
salary Pick (say) 4remote pointsfor eachcluster.
age
![Page 7: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/7.jpg)
7
Example: Pick Dispersed Points
e e
ee
e e
e
e ee
e
h
h
h
h
h
h
h h
h
h
h
h h
salary Move points(say) 20%toward thecentroid.
age
![Page 8: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/8.jpg)
8
Finishing CURE
Now, visit each point p in the data set.Place it in the “closest cluster.”
Normal definition of “closest”: that cluster with the closest (to p ) among all the sample points of all the clusters.
![Page 9: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/9.jpg)
9
Curse of Dimensionality
One way to look at it: in large-dimension spaces, random vectors are perpendicular. Why?
Argument #1: Lots of 2-dim subspaces. There must be one where the vectors’ projections are almost perpendicular.Argument #2: Expected value of cosine of angle is 0.
![Page 10: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/10.jpg)
10
Cosine of Angle Between Random Vectors
Assume vectors emanate from the origin (0,0,…,0).Components are random in range [-1,1].(a1,a2,…,an).(b1,b2,…,bn) has expected value 0 and a standard deviation that grows as √n.But lengths of both vectors grow as √n.So dot product around √n/ (√n * √n) = 1/√n.
![Page 11: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/11.jpg)
11
Random Vectors --- Continued
Thus, a typical pair of vectors has an angle whose cosine is on the order of 1/√n.As n -> ∞, that’s 0; i.e., the angle is about 90°.
![Page 12: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/12.jpg)
12
Interesting Consequence
Suppose “random vectors are perpendicular,” even in non-Euclidean spaces.Suppose we know the distance from A to B, say d (A,B ), and we also know d (B,C ), but we don’t know d (A,C ).Suppose B and C are fairly close, say in the same cluster.What is d (A,C )?
![Page 13: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/13.jpg)
13
Diagram of Situation
A
C
B
Approximatelyperpendicular
Assuming points lie in a plane:d (A,B )2 + d (B,C )2 = d (A,C )2
![Page 14: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/14.jpg)
14
Important Point
Why do we assume AB is perpendicular to AC, and not that either of the other two angles are right-angles?
1. AB and AC are not “random vectors”; they each go to points that are far away from A and close to each other.
2. If AB is longer than AC, then it is angle ACB that is right, but both ACB and ABC are approximately right-angles.
![Page 15: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/15.jpg)
15
Dealing With a Non-Euclidean Space
Problem: clusters cannot be represented by centroids.Why? Because the “average” of “points” might not be a point in the space.Best substitute: the clustroid = point in the cluster that minimizes the sum of the squares of distances to the points in the cluster.
![Page 16: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/16.jpg)
16
Representing Clusters in Non-Euclidean Spaces
Recall BFR represents a Euclidean cluster by N, SUM, and SUMSQ.A non-Euclidean cluster is represented by:
N.The clustroid.Sum of the squares of the distances from clustroid to all points in the cluster.
![Page 17: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/17.jpg)
17
Example of CoD Use
Problem: in non-Euclidean space, we want to decide whether to merge two clusters.
Each cluster represented by N, clustroid, and “SUMSQ.”Also, SUMSQ for each point in the cluster, even if it is not the clustroid.
Merge if SUMSQ for new cluster is “low.”
![Page 18: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/18.jpg)
18
Estimating SUMSQ
p
other clust-roid, b
p ’s clustroid, c
![Page 19: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/19.jpg)
19
Suppose p Were the Clustroidof Combined Cluster
It’s SUMSQ would be the sum of:1. Old SUMSQ(p) [for old cluster containing p].2. SUMSQ(b) plus d (p,b)2 times number of
points in b ’s cluster.
Critical point: vector p ->b assumed perpendicular to vectors from b to all other points in its cluster --- justifies (2).
![Page 20: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/20.jpg)
20
Combining Clusters --- Continued
We can thus estimate SUMSQ for each point in the combined cluster. Take the point with the least SUMSQ as the clustroid of the new cluster --- provided that SUMSQ is small enough.
![Page 21: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/21.jpg)
21
The GRGPF Algorithm
From Ganti et al. --- see reading list.Works for non-Euclidean distances.Works for massive (disk-resident) data.Hierarchical clustering.Clusters are grouped into a tree of disk blocks (like a B-tree or R-tree).
![Page 22: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/22.jpg)
22
Information Retained About a Cluster
1. N, clustroid, SUMSQ.2. The p points closest to the clustroid,
and their values of SUMSQ.3. The p points of the cluster that are
furthest away from the clustroid, and their SUMSQ’s.
![Page 23: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/23.jpg)
23
At Interior Nodes of the Tree
Interior nodes have samples of the clustroids of the clusters found at descendant leaves of this node.Try to keep clusters on one leaf block close, descendants of a level-1 node close, etc.Interior part of tree kept in main memory.
![Page 24: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/24.jpg)
24
Picture of the Tree
cluster data cluster data
samples
on disk
mainmemory
![Page 25: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/25.jpg)
25
Initialization
Take a main-memory sample of points.Organize them into clusters hierarchically.Build the initial tree, with level-1 interior nodes representing clusters of clusters, and so on.All other points are inserted into this tree.
![Page 26: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/26.jpg)
26
Inserting Points
Start at the root.At each interior node, visit one or more children that have sample clustroidsnear the inserted point.At the leaves, insert the point into the cluster with the nearest clustroid.
![Page 27: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/27.jpg)
27
Updating Cluster Data
Suppose we add point X to a cluster.Increase count N by 1.For each of the 2p + 1 points Y whose SUMSQ is stored, add d (X,Y )2.Estimate SUMSQ for X.
![Page 28: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/28.jpg)
28
Estimating SUMSQ(X )
If C is the clustroid, SUMSQ(X ) is, by the CoD assumption: Nd (X,C )2 + SUMSQ(C )
Based on assumption that vector from Xto C is perpendicular to vectors from C to all the other nodes of the cluster.
This value may allow X to replace one of the closest or furthest nodes.
![Page 29: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/29.jpg)
29
Possible Modification to Cluster Data
There may be a new clustroid --- one of the p closest points --- because of the addition of X.Eventually, the clustroid may migrate out of the p closest points, and the entire representation of the cluster needs to be recomputed.
![Page 30: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/30.jpg)
30
Splitting and Merging Clusters
Maintain a threshold for the radius of a cluster = √(SUMSQ/N ).Split a cluster whose radius is too large.Adding clusters may overflow leaf blocks, and require splits of blocks up the tree.
Splitting is similar to a B-tree.But try to keep locality of clusters.
![Page 31: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/31.jpg)
31
Splitting and Merging --- (2)
The problem case is when we have split so much that the tree no longer fits in main memory.Raise the threshold on radius and merge clusters that are sufficiently close.
![Page 32: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/32.jpg)
32
Merging Clusters
Suppose there are nearby clusters with clustroids C and D, and we want to consider merging them.Assume that the clustroid of the combined cluster will be one of the pfurthest points from the clustroid of one of those clusters.
![Page 33: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/33.jpg)
33
Merging --- (2)
Compute SUMSQ(X ) [from the cluster of C ] for the combined cluster by summing:
1. SUMSQ(X ) from its own cluster.2. SUMSQ(D ) + N [d (X,C )2 + d (C,D )2].
Uses the CoD to reason that the distance from X to each point in the other cluster goes to C, makes a right angle to D, and another right angle to the point.
![Page 34: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/34.jpg)
34
Merging --- Concluded
Pick as the clustroid for the combined cluster that point with the least SUMSQ.But if this SUMSQ is too large, do not merge clusters.Hope you get enough mergers to fit the tree in main memory.
![Page 35: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/35.jpg)
35
Fastmap
Not a clustering algorithm --- rather, a method for applying multidimensional scaling.
That is, mapping the points onto a small-dimension space, so the CoD does not apply.
![Page 36: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/36.jpg)
36
Fastmap --- (2)
Assumes non-Euclidean space.But like GRGFP pretends it is working in 2-dimensional Euclidean space when it is convenient to do so.
Goal: map n points in much less than O(n 2) time.
I.e., you cannot compute distances between each pair of points and place points in k-dim. space to minimize error.
![Page 37: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/37.jpg)
37
Fastmap --- Key Idea
Create a “dimension” in non-Euclidean space by:
1. Pick a pair of points A and B that are far apart.
Start with random A; pick most distant B.
2. Treat AB as an “axis” and project all points onto AB, using the law of cosines.
![Page 38: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/38.jpg)
38
Projecting Point C Onto ABC
d(B,C)
d(A,B)
d(A,C)
BAx
x = [d 2(A,C) + d 2(A,B) – d 2(B,C)]/2d (A,B)
![Page 39: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/39.jpg)
39
Revising Distances
Having computed the position of every point along the pseudo-axis AB, we need to lower the distances between points in the “other dimensions.”
![Page 40: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/40.jpg)
40
Picture
D
dold(C,D)dnew(C,D) =
√dold(C,D)2 – (x-y)2C
xA B
y
![Page 41: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/41.jpg)
41
But …
We can’t afford to compute new distances for each pseudo-dimension.
It would take O(n 2) time.
Rather, for each pseudo-dimension, store the position along the pseudo-axis for each point, and adjust the distance between points by square-subtract-sqrtonly when needed.
I.e., one of the points is an axis-end.
![Page 42: CS206 --- Electronic Commerce](https://reader030.vdocument.in/reader030/viewer/2022012408/616a3e7411a7b741a3506183/html5/thumbnails/42.jpg)
42
Fastmap --- Summary
Pick a number of dimensions k.FOR i = 1 TO k DO BEGIN
Pick a pseudo-axis AiBi;
Compute projection of each
point onto this pseudo-axis;
END;
Each step is O(ni ); total O(nk 2).