L13. Cluster Analysis
Cluster Analysis
Poul Petersen
BigML
Trees vs Clusters
Trees (Supervised Learning)
• Provide: labeled data
• Learning task: be able to predict the label

Clusters (Unsupervised Learning)
• Provide: unlabeled data
• Learning task: group data by similarity
Trees vs Clusters
| sepal length | sepal width | petal length | petal width | species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
| 6.7 | 2.5 | 5.8 | 1.8 | virginica |
| … | … | … | … | … |

| sepal length | sepal width | petal length | petal width |
|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 |
| 5.7 | 2.6 | 3.5 | 1.0 |
| 6.7 | 2.5 | 5.8 | 1.8 |
| … | … | … | … |
Learning task (trees): find a function "f" such that f(X) ≈ Y, where the feature columns are the inputs "X" and the species column is the label "Y".

Learning task (clusters): find "k" clusters such that the data in each cluster is self-similar.
Use Cases
• Customer segmentation
• Item discovery
• Association
• Recommender
• Active learning
Customer Segmentation
GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high-LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for upsell.
• Dataset of mobile game users.
• Data for each user consists of usage statistics and an LTV based on in-game purchases.
• Assumption: usage correlates to LTV.

[Chart: user clusters containing 0%, 3%, 7%, and 1% high-LTV users]
Item Discovery
• Dataset of 86 whiskies.
• Each whisky is scored on a scale from 0 to 4 for each of 12 possible flavor characteristics.

GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste.

[Plot: whiskies in flavor space, with axes such as Smoky, Fruity, Body]
Association
• Dataset of Lending Club loans.
• Mark any loan that is currently or has ever been late as "trouble".

GOAL: Cluster the loans by application profile to rank loan quality by the percentage of trouble loans in each cluster.

[Chart: loan clusters containing 0%, 3%, 7%, and 1% trouble loans]
Active Learning
GOAL: Rather than sampling randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data.
• Dataset of diagnostic measurements of 768 patients.
• Want to test each patient for diabetes and label the dataset to build a model, but the test is expensive.*

*For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labeled as fraud/not-fraud. Or a million images which need to be labeled as cat/not-cat.
Human Example
K=3
The human's three groups: "Round", "Skinny", "Corners"
• The human used prior knowledge to select possible features that separated the objects:
  • "round", "skinny", "edges", "hard", etc.
• Items were then separated based on the chosen features.
• Separation quality was then tested to ensure:
  • the criterion of K=3 was met
  • groups were sufficiently "distant"
  • there was no crossover
Learning from Humans
Create features that capture these object differences:
• Length/Width
  • greater than 1 => "skinny"
  • equal to 1 => "round"
  • less than 1 => invert the ratio
• Number of Surfaces
  • distinct surfaces require "edges", which have corners
  • easier to count
Feature Engineering

| Object | Length / Width | Num Surfaces |
|---|---|---|
| penny | 1 | 3 |
| dime | 1 | 3 |
| knob | 1 | 4 |
| eraser | 2.75 | 6 |
| box | 1 | 6 |
| block | 1.6 | 6 |
| screw | 8 | 3 |
| battery | 5 | 3 |
| key | 4.25 | 3 |
| bead | 1 | 2 |
Now we can Plot

[Scatter plot (K=3): Num Surfaces vs. Length / Width, with points labeled box, block, eraser, knob, penny, dime, bead, key, battery, screw]

K-Means key insight: we can find clusters using distances in n-dimensional feature space.
[Same scatter plot: Num Surfaces vs. Length / Width]

K-Means: find the "best" (minimum distance) circles that include all points.
K-Means Algorithm
K=3
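The loop the next slides animate can be sketched in a few lines of NumPy. This is a minimal illustration under simple assumptions (random-instance initialization, Euclidean distance), not BigML's implementation; the data rows come from the Feature Engineering table above.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, move
    each centroid to the mean of its assigned points, and repeat until
    the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # start from k randomly chosen instances
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distances from every point to every centroid: shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged
        centroids = new_centroids
    return centroids, labels

# (Length/Width, Num Surfaces) rows from the feature table
objects = np.array([
    [1.0, 3], [1.0, 3], [1.0, 4], [2.75, 6], [1.0, 6],
    [1.6, 6], [8.0, 3], [5.0, 3], [4.25, 3], [1.0, 2],
])
centers, labels = k_means(objects, k=3)
```

With K=3 this groups the ten objects by their position in the two-dimensional feature space, exactly as in the plot.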
Features Matter!
[With different features, the same objects cluster into: Metal, Wood, Other]
Convergence
Convergence is guaranteed, but not necessarily unique.
Starting points are important (k-means++).
Starting Points
• Random points or instances in n-dimensional space
• Choose points "farthest" away from each other
  • but this is sensitive to outliers
• k-means++
  • the first center is chosen randomly from the instances
  • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the point's closest existing cluster center
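The k-means++ seeding rule described above can be sketched with NumPy. This is a minimal illustration of the probabilistic rule, not a specific library's implementation:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ style seeding: the first center is a uniformly random
    instance; each later center is drawn with probability proportional
    to its squared distance to the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its closest existing center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```

Because already-chosen points have squared distance zero, they are never picked twice, and far-away regions are favored without the outlier sensitivity of always taking the farthest point.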
Scaling
[Plot: houses plotted by price vs. number of bedrooms, with d = 160,000 along the price axis and d = 1 along the bedrooms axis]

Without scaling, the feature with the largest range (price) dominates the distance.
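A common fix, and an assumption here since the slide does not prescribe one, is z-score scaling so each feature contributes comparably to the distance. A sketch on made-up housing numbers:

```python
import numpy as np

# toy housing data: [price, number of bedrooms]
# raw Euclidean distance is dominated entirely by price
X = np.array([
    [250_000.0, 3.0],
    [410_000.0, 4.0],
    [251_000.0, 2.0],
])

# z-score each column: subtract the mean, divide by the std deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, both columns have mean 0 and standard deviation 1, so a one-bedroom difference and a one-standard-deviation price difference carry similar weight.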
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
Distance to Missing Value?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values
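Most of these replacement strategies are one line of NumPy. A sketch of the median option (swap in nanmean, nanmin, nanmax, or 0 for the others); the small matrix is made up for illustration:

```python
import numpy as np

# toy feature matrix with missing values
X = np.array([
    [5.1, 3.5],
    [np.nan, 2.6],
    [6.7, np.nan],
    [5.7, 3.0],
])

# replace each NaN with its column's median, ignoring the NaNs themselves
medians = np.nanmedian(X, axis=0)
X_filled = np.where(np.isnan(X), medians, X)
```

Ignoring instances with missing values is instead `X[~np.isnan(X).any(axis=1)]`.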
Distance to Categorical?
One approach, similar to "k-prototypes":
• Special distance function:
  • if valA == valB then distance = 0 (or the scaling value), else distance = 1
• Assign the centroid the most common category of the member instances
• Compute Euclidean distance as normal
| | feature_1 | feature_2 | feature_3 |
|---|---|---|---|
| instance_1 | red | cat | ball |
| instance_2 | red | cat | ball |
| instance_3 | red | cat | box |
| instance_4 | blue | dog | fridge |

D(instance_1, instance_2) = 0
D(instance_1, instance_3) = 1
D(instance_1, instance_4) = sqrt(3)
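The slide's distances follow from a per-feature 0/1 mismatch combined Euclidean-style. A minimal sketch, using the instances from the table:

```python
import math

def categorical_distance(a, b):
    """0 per matching categorical feature, 1 per mismatch, combined
    like Euclidean distance: sqrt of the number of mismatches."""
    return math.sqrt(sum(0 if x == y else 1 for x, y in zip(a, b)))

instance_1 = ("red", "cat", "ball")
instance_2 = ("red", "cat", "ball")
instance_3 = ("red", "cat", "box")
instance_4 = ("blue", "dog", "fridge")
```

Identical instances are at distance 0, one mismatch gives 1, and three mismatches give sqrt(3), matching the table.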
Only Euclidean?
• Cosine similarity, which ranges from 1 (same direction) through 0 (orthogonal) to -1 (opposite)
• Mahalanobis distance, which accounts for the covariance of the features
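Cosine similarity is a one-liner; a minimal sketch showing the 1 / 0 / -1 scale from the slide:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 for the same
    direction, 0 for orthogonal, -1 for opposite directions."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Unlike Euclidean distance, it ignores vector magnitude, which is often what you want for text features.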
Finding K: G-means
Let K=2
Keep 1, split 1. New K=3.
Let K=3
Keep 1, split 2. New K=5.
Let K=5
K=5
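The keep-or-split loop in the slides can be sketched as follows. G-means proper decides "keep" by statistically testing whether a cluster's points look Gaussian (projecting onto the principal axis and applying an Anderson-Darling test); to keep this sketch self-contained, the test is a caller-supplied `looks_gaussian` predicate, which is an assumption, and the `two_means` splitter is a bare-bones 2-means.

```python
import numpy as np

def two_means(X, seed=0):
    """Split one cluster into two with a few k-means iterations."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), size=2, replace=False)]
    for _ in range(20):
        labels = np.linalg.norm(X[:, None] - c[None], axis=2).argmin(axis=1)
        c = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                      else c[j] for j in range(2)])
    return labels

def g_means(X, looks_gaussian, max_rounds=5, seed=0):
    """Keep clusters that pass the Gaussian test; split the rest in
    two and repeat, so K grows until every cluster looks Gaussian."""
    clusters = [X]
    for _ in range(max_rounds):
        next_clusters, split_any = [], False
        for C in clusters:
            if len(C) < 2 or looks_gaussian(C):
                next_clusters.append(C)       # keep this cluster
            else:
                labels = two_means(C, seed)   # split it into two
                next_clusters += [C[labels == 0], C[labels == 1]]
                split_any = True
        clusters = next_clusters
        if not split_any:
            break  # every cluster passed the test
    return clusters
```

On the slides' example, each round keeps the clusters that pass and splits the rest, growing K from 2 to 3 to 5, where it stops.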