acm 2013-02-25

Fast Single-pass k-means Clustering

Upload: ted-dunning

Post on 10-May-2015

DESCRIPTION

A talk about super fast clustering to the ACM in the bay area.

TRANSCRIPT

Page 1: ACM 2013-02-25

Fast Single-pass k-means Clustering

Page 2: ACM 2013-02-25

whoami – Ted Dunning

• Chief Application Architect, MapR Technologies
• Committer, member, Apache Software Foundation
  – particularly Mahout, Zookeeper and Drill

• Contact me at [email protected] or @ted_dunning

Page 3: ACM 2013-02-25

Agenda

• Rationale
• Theory
  – clusterable data, k-means failure modes, sketches
• Algorithms
  – ball k-means, surrogate methods
• Implementation
  – searchers, vectors, clusterers
• Results
• Application

Page 4: ACM 2013-02-25

RATIONALE

Page 5: ACM 2013-02-25

Why k-means?

• Clustering allows fast search
  – k-nn models allow agile modeling
  – lots of data points, 10^8 typical
  – lots of clusters, 10^4 typical
• Model features
  – Distance to nearest centroids
  – Poor man’s manifold discovery

Page 6: ACM 2013-02-25

What is Quality?

• Robust clustering not a goal
  – we don’t care if the same clustering is replicated
• Generalization to unseen data critical
  – number of points per cluster
  – distance distributions
  – target function distributions
  – model performance stability

• Agreement to “gold standard” is a non-issue

Page 7: ACM 2013-02-25

An Example

Page 8: ACM 2013-02-25

The Problem

• Spirals are a classic “counter” example for k-means

• Classic low dimensional manifold with added noise

• But clustering still makes modeling work well

Page 9: ACM 2013-02-25

An Example

Page 10: ACM 2013-02-25

An Example

Page 11: ACM 2013-02-25

The Cluster Proximity Features

• Every point can be described by the nearest cluster
  – 4.3 bits per point in this case
  – Significant error that can be decreased (to a point) by increasing number of clusters
• Or by the proximity to the 2 nearest clusters (2 x 4.3 bits + 1 sign bit + 2 proximities)
  – Error is negligible
  – Unwinds the data into a simple representation
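A minimal sketch of the two-nearest-cluster description, assuming plain Euclidean distance and a brute-force scan. The function name `nearest_two_features` and the toy centroids are illustrative and not from the talk, and the sign bit mentioned on the slide is not reproduced here. Each cluster id costs about log2(number of clusters) bits.

```python
import math

def nearest_two_features(point, centroids):
    """Return (id1, id2, dist1, dist2) for the two centroids closest to point."""
    # Rank all centroids by Euclidean distance (brute force, fine for a sketch).
    ranked = sorted((math.dist(point, c), i) for i, c in enumerate(centroids))
    (d1, i1), (d2, i2) = ranked[0], ranked[1]
    return i1, i2, d1, d2

# Hypothetical toy data: three centroids and one query point.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
features = nearest_two_features((1.0, 0.0), centroids)
print(features)  # two cluster ids followed by the two proximities
```

The pair of ids says which part of the manifold the point is on, and the two proximities say where it sits between those two centroids.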

Page 12: ACM 2013-02-25

Diagonalized Cluster Proximity

Page 13: ACM 2013-02-25

Lots of Clusters Are Fine

Page 14: ACM 2013-02-25

The Limiting Case

• Too many clusters lead to over-fitting
• Which we mitigate by averaging over several nearby clusters
• In the limit we get k-nn modeling
  – and probably use k-means to speed up search

Page 15: ACM 2013-02-25

THEORY

Page 16: ACM 2013-02-25

Intuitive Theory

• Traditionally, minimize over all distributions
  – optimization is NP-complete
  – that isn’t like real data

• Recently, assume well-clusterable data

• Interesting approximation bounds provable

Page 17: ACM 2013-02-25

For Example

Grouping these two clusters seriously hurts squared distance

Page 18: ACM 2013-02-25

ALGORITHMS

Page 19: ACM 2013-02-25

Lloyd’s Algorithm

• Part of CS folk-lore
• Developed in the late 50’s for signal quantization, published in 80’s

initialize k cluster centroids somehow
for each of many iterations:
    for each data point:
        assign point to nearest cluster
    recompute cluster centroids from points assigned to clusters

• Highly variable quality, several restarts recommended
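The pseudocode above can be sketched in plain Python as follows. This is an illustrative version, not the Mahout implementation: "somehow" is taken to mean sampling k data points at random, and the names `lloyd_kmeans`, `iterations` and `seed` are assumptions made for the example.

```python
import math
import random

def lloyd_kmeans(points, k, iterations=10, seed=0):
    """Plain Lloyd iteration: assign each point, then recompute the means."""
    rng = random.Random(seed)
    # Assumed initialization: k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: every point goes to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[nearest].append(p)
        # Update step: each centroid becomes the mean of its members.
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = [sum(x) / len(members) for x in zip(*members)]
    return centroids

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = sorted(lloyd_kmeans(points, k=2))
print(centers)
```

On data this small every choice of seeds recovers the two groups. On realistic data the result depends heavily on the initial centroids, which is why restarts are recommended.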

Page 20: ACM 2013-02-25

Typical k-means Failure

Selecting two seeds here cannot be fixed with Lloyd’s

Result is that these two clusters get glued together

Page 21: ACM 2013-02-25

Ball k-means

• Provably better for highly clusterable data
• Tries to find initial centroids in the “core” of each real cluster
• Avoids outliers in centroid computation

initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than closest cluster
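A rough sketch of the ball k-means steps above, under two stated assumptions: the "distance maximizing tendency" is implemented as k-means++-style sampling proportional to squared distance, and "much closer" is read as "distance to the own centroid is at most `trim` times the distance to the second-nearest centroid". Both choices, and all function and parameter names, are illustrative and not taken from the talk or from Mahout. The sketch needs k >= 2 and at least k distinct points.

```python
import math
import random

def seed_far_apart(points, k, rng):
    """Random seeding that prefers points far from the seeds picked so far."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(math.dist(p, s) for s in seeds) ** 2 for p in points]
        seeds.append(rng.choices(points, weights=weights)[0])
    return [list(s) for s in seeds]

def trimmed_update(points, centroids, trim=0.5):
    """One ball k-means step: average only points well inside their cluster."""
    cores = [[] for _ in centroids]
    for p in points:
        (d1, j1), (d2, _) = sorted(
            (math.dist(p, c), j) for j, c in enumerate(centroids)
        )[:2]
        if d1 <= trim * d2:  # clearly closer to its own centroid than to any other
            cores[j1].append(p)
    # A centroid with an empty core is left where it was.
    return [
        [sum(x) / len(core) for x in zip(*core)] if core else list(c)
        for core, c in zip(cores, centroids)
    ]

def ball_kmeans(points, k, iterations=3, trim=0.5, seed=0):
    rng = random.Random(seed)
    centroids = seed_far_apart(points, k, rng)
    for _ in range(iterations):  # "a very few" iterations
        centroids = trimmed_update(points, centroids, trim)
    return centroids

# Two tight groups plus one straggler halfway between them.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (5, 5)]
centers = trimmed_update(points, [[0.5, 0.5], [10.5, 10.5]])
print(centers)
```

In the example the straggler at (5, 5) is nearest to the first centroid but falls outside its core, so it does not drag that centroid away from the tight group.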

Page 22: ACM 2013-02-25

Still Not a Win

• Ball k-means is nearly guaranteed to succeed with k = 2
• Probability of successful seeding drops exponentially with k
• Alternative strategy has high probability of success, but takes O(nkd + k^3 d) time

Page 23: ACM 2013-02-25

Not good enough

Page 24: ACM 2013-02-25

Surrogate Method

• Start with sloppy clustering into κ = k log n clusters
• Use this sketch as a weighted surrogate for the data
• Cluster surrogate data using ball k-means
• Results are provably good for highly clusterable data
• Sloppy clustering is on-line
• Surrogate can be kept in memory
• Ball k-means pass can be done at any time

Page 25: ACM 2013-02-25

Algorithm Costs

• O(k d) per point per iteration for Lloyd’s algorithm
• Number of iterations not well known
• Iterations > log n is a reasonable assumption, giving O(k d log n) per point overall

Page 26: ACM 2013-02-25

Algorithm Costs

• Surrogate methods
  – fast, sloppy single pass clustering with κ = k log n
  – fast sloppy search for nearest cluster, O(d log κ) = O(d (log k + log log n)) per point
  – fast, in-memory, high-quality clustering of κ weighted centroids
      O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
      O(κ d log k) or O(d log κ log k) for larger k, looser quality
  – result is k high-quality centroids
• Even the sloppy surrogate may suffice

Page 27: ACM 2013-02-25

Algorithm Costs

• How much faster for the sketch phase?
  – take k = 2000, d = 10, n = 100,000,000
  – k d log n = 2000 x 10 x 26 ≈ 500,000
  – d (log k + log log n) = 10(11 + 5) = 160
  – roughly 3,000 times faster is a bona fide big deal
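The arithmetic can be checked with a few lines of Python. This assumes base-2 logarithms and n = 10^8, the value for which log n ≈ 26 as used in the product above. The variable names are illustrative.

```python
import math

k, d, n = 2000, 10, 100_000_000   # n = 10^8 assumed, so that log2(n) is about 26

# Per-point cost of Lloyd's algorithm with about log n iterations.
lloyd_cost = k * d * math.log2(n)

# Per-point cost of the sketch phase: d * (log k + log log n).
sketch_cost = d * (math.log2(k) + math.log2(math.log2(n)))

speedup = lloyd_cost / sketch_cost
print(round(lloyd_cost), round(sketch_cost), round(speedup))
```

Without rounding the logarithms the costs come out at about 530,000 and about 157, for a ratio a little above 3,000.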

Page 28: ACM 2013-02-25

Pragmatics

• But this requires a fast search internally
• Have to cluster on the fly for sketch
• Have to guarantee sketch quality
• Previous methods had very high complexity

Page 29: ACM 2013-02-25

How It Works

• For each point
  – Find approximately nearest centroid (distance = d)
  – If (d > threshold) new centroid
  – Else if (u < d/threshold) new centroid, where u is a uniform random draw in [0, 1)
  – Else add to nearest centroid
• If centroids > κ ≈ C log N
  – Recursively cluster centroids with higher threshold
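The single-pass procedure above can be sketched as follows. This is a simplified illustration, not the Mahout StreamingKmeans code: the nearest centroid is found by brute force rather than by approximate search, a new centroid is started with probability distance/threshold (the assumed reading of the random draw u), and the names `add_point`, `streaming_sketch`, `max_centroids` and `growth` are invented for the example. The threshold must be positive.

```python
import math
import random

def add_point(sketch, point, weight, threshold, rng):
    """Merge a weighted point into its nearest centroid or start a new one."""
    if sketch:
        dist, j = min((math.dist(point, c), j) for j, (c, _) in enumerate(sketch))
        # Far points (dist > threshold) always fail this test and start a centroid.
        if rng.random() >= dist / threshold:
            c, w = sketch[j]
            total = w + weight
            merged = [(w * a + weight * b) / total for a, b in zip(c, point)]
            sketch[j] = [merged, total]
            return
    sketch.append([list(point), weight])

def streaming_sketch(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over the data; returns a list of [centroid, weight] pairs."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        add_point(sketch, p, 1.0, threshold, rng)
        while len(sketch) > max_centroids:
            # Too many centroids: raise the threshold and re-cluster the sketch.
            threshold *= growth
            old, sketch = sketch, []
            for c, w in old:
                add_point(sketch, c, w, threshold, rng)
    return sketch

data_rng = random.Random(42)
points = [(data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (10, 10), (0, 10)] for _ in range(200)]
sketch = streaming_sketch(points, max_centroids=20, threshold=1.0)
print(len(sketch), sum(w for _, w in sketch))
```

The weights always add up to the number of points seen, so the sketch can stand in for the data as a weighted surrogate in a later in-memory clustering pass.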

Page 30: ACM 2013-02-25

Resulting Surrogate

• Result is large set of centroids
  – these provide approximation of original distribution
  – we can cluster centroids to get a close approximation of clustering original
  – or we can just use the result directly

• Either way, we win

Page 31: ACM 2013-02-25

IMPLEMENTATION

Page 32: ACM 2013-02-25

How Can We Search Faster?

• First rule: don’t do it
  – If we can eliminate most candidates, we can do less work
  – Projection search and k-means search
• Second rule: don’t do it
  – We can convert big floating point math to clever bit-wise integer math
  – Locality sensitive hashing
• Third rule: reduce dimensionality
  – Projection search
  – Random projection for very high dimension

Page 33: ACM 2013-02-25

Projection Search

total ordering!

Page 34: ACM 2013-02-25

How Many Projections?

Page 35: ACM 2013-02-25

LSH Search

• Each random projection produces an independent sign bit
• If two vectors have the same projected sign bits, they probably point in the same direction (i.e. cos θ ≈ 1)
• Distance in L2 is closely related to cosine

• We can replace (some) vector dot products with long integer XOR
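A small sketch of the sign-bit idea, assuming Gaussian random projection vectors and 32 bits per code. The helper names are illustrative and this is not the Mahout LshSearch code. For random hyperplanes the expected fraction of differing bits between two vectors is θ/π, where θ is the angle between them, so a cheap XOR plus bit count gives a rough estimate of the angle.

```python
import random

def make_projections(dim, bits=32, seed=0):
    """One random Gaussian direction per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def sign_bits(vector, projections):
    """Pack the sign of each random projection into one integer."""
    code = 0
    for proj in projections:
        dot = sum(a * b for a, b in zip(vector, proj))
        code = (code << 1) | int(dot > 0)
    return code

def hamming(code_a, code_b):
    """Number of differing sign bits: XOR, then count the ones."""
    return bin(code_a ^ code_b).count("1")

projections = make_projections(dim=3)
a = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]
opposite = [-1.0, -2.0, -3.0]

print(hamming(sign_bits(a, projections), sign_bits(same_direction, projections)))
print(hamming(sign_bits(a, projections), sign_bits(opposite, projections)))
```

Vectors pointing the same way share every bit and opposite vectors share none. Candidates whose codes differ in too many bits can be discarded before any floating point dot product is computed.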

Page 36: ACM 2013-02-25

LSH Bit-match Versus Cosine

Page 37: ACM 2013-02-25

Results with 32 Bits

Page 38: ACM 2013-02-25

The Internals

• Mechanism for extending Mahout Vectors
  – DelegatingVector, WeightedVector, Centroid
• Searcher interface
  – ProjectionSearch, KmeansSearch, LshSearch, Brute
• Super-fast clustering
  – Kmeans, StreamingKmeans

Page 39: ACM 2013-02-25

Parallel Speedup?

Page 40: ACM 2013-02-25

What About Map-Reduce?

• Map-reduce implementation is nearly trivial
  – Compute surrogate on each split
  – Total surrogate is union of all partial surrogates
  – Do in-memory clustering on total surrogate

• Threaded version shows linear speedup already

• Map-reduce version shows the same linear speedup

Page 41: ACM 2013-02-25

How Well Does it Work?

• Theoretical guarantees for well clusterable data
  – Shindler, Wong and Meyerson, NIPS, 2011
• Evaluation on synthetic data
  – Rough clustering produces correct surrogates
  – Ball k-means strategy 1 performance is very good with large k

Page 42: ACM 2013-02-25

How Well Does it Work?

• Empirical evaluation on 20 newsgroups
• Alternative algorithms include ball k-means versus streaming k-means|ball k-means
• Results
  – Average distance to nearest cluster on held-out data same or slightly smaller
  – Median distance to nearest cluster is smaller
  – > 10x faster (I/O and encoding limited)

Page 43: ACM 2013-02-25

APPLICATION

Page 44: ACM 2013-02-25

The Business Case

• Our customer has 100 million cards in circulation

• Quick and accurate decision-making is key
  – Marketing offers
  – Fraud prevention

Page 45: ACM 2013-02-25

Opportunity

• Demand for modeling is increasing rapidly

• So they are testing something simpler and more agile

• Like k-nearest neighbor

Page 46: ACM 2013-02-25

What’s that?

• Find the k nearest training examples – lookalike customers

• This is easy … but hard
  – easy because it is so conceptually simple and you don’t have knobs to turn or models to build
  – hard because of the stunning amount of math
  – also hard because we need top 50,000 results
• Initial rapid prototype was massively too slow
  – 3K queries x 200K examples takes hours
  – needed 20M x 25M in the same time

Page 47: ACM 2013-02-25

K-Nearest Neighbor Example

Page 48: ACM 2013-02-25

Required Scale and Speed and Accuracy

• Want 20 million queries against 25 million references in 10,000 s

• Should be able to search > 100 million references

• Should be linearly and horizontally scalable
• Must have >50% overlap against reference search

Page 49: ACM 2013-02-25

How Hard is That?

• 20 M x 25 M x 100 Flop = 50 P Flop

• 1 CPU = 5 Gflops

• We need 10 M CPU seconds => 1,000 CPUs to finish within 10,000 s

• Real-world efficiency losses may increase that by 10x

• Not good!

Page 50: ACM 2013-02-25

K-means Search

• First do clustering with lots (thousands) of clusters

• Then search nearest clusters to find nearest points

• We win if we find >50% overlap with “true” answer

• We lose if we can’t cluster super-fast
  – more on this later
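A toy sketch of the search strategy on this slide, assuming the centroids are already available and using brute-force scans inside the selected clusters. The names `build_index`, `kmeans_search` and `probes` are illustrative and not the Mahout KmeansSearch API.

```python
import math

def build_index(points, centroids):
    """Group reference points by their nearest centroid."""
    index = {j: [] for j in range(len(centroids))}
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        index[nearest].append(p)
    return index

def kmeans_search(query, centroids, index, probes=1, top=3):
    """Scan only the members of the `probes` clusters nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda j: math.dist(query, centroids[j]))
    candidates = [p for j in order[:probes] for p in index[j]]
    return sorted(candidates, key=lambda p: math.dist(query, p))[:top]

# Hypothetical reference set: a 6 x 6 grid, with one centroid per corner.
points = [(x, y) for x in range(0, 11, 2) for y in range(0, 11, 2)]
centroids = [(0, 0), (10, 0), (0, 10), (10, 10)]
index = build_index(points, centroids)

query = (0.9, 1.2)
approx = kmeans_search(query, centroids, index, probes=1)
exact = sorted(points, key=lambda p: math.dist(query, p))[:3]
overlap = len(set(approx) & set(exact)) / len(exact)
print(approx, overlap)
```

With one probe the search scans 9 of the 36 reference points and still returns the true top 3 in this example. Queries near a cluster boundary need more probes to keep the overlap high.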

Page 51: ACM 2013-02-25

Lots of Clusters Are Fine

Page 52: ACM 2013-02-25

Lots of Clusters Are Fine

Page 53: ACM 2013-02-25

Some Details

• Clumpy data works better
  – Real data is clumpy
• Speedups of 100-200x seem practical with 50% overlap
  – Projection search and LSH give additional 100x

• More experiments needed

Page 54: ACM 2013-02-25

Summary

• Nearest neighbor algorithms can be blazing fast

• But you need blazing fast clustering
  – Which we now have

Page 55: ACM 2013-02-25

Contact Me!

• We’re hiring at MapR in US and Europe

• MapR software available for research use

• Come get the slides at http://www.mapr.com/company/events/acmsf-2-25-13

• Get the code as part of Mahout trunk

• Contact me at [email protected] or @ted_dunning