© MapR Technologies - Confidential

Super-Fast Clustering
Report from MapR workshop

Contact:
– tdunning@maprtech.com
– @ted_dunning

Twitter for this talk
– #mapr_uk

Slides and such:
– http://info.mapr.com/ted-uk-05-2012

Company Background

MapR provides the industry’s best Hadoop Distribution
– Combines the best of the Hadoop community contributions with significant internally financed infrastructure development

Background of Team
– Deep management bench with extensive analytic, storage, virtualization, and open source experience
– Google, EMC, Cisco, VMware, Network Appliance, IBM, Microsoft, Apache Foundation, Aster Data, Brio, ParAccel

Proven
– MapR used across industries (Financial Services, Media, Telecom, Health Care, Internet Services, Government)
– Strategic OEM relationship with EMC and Cisco
– Over 1,000 installs

We Also Do …

Open source development
– Zookeeper
– Hadoop
– Mahout
– Stuff

Partner workshops
– Machine learning
– Information architecture
– Cluster design

The Problem

A certain bank
– had lots of customers
– had lots of prospective customers
– had a non-trivial number of fraudulent customers
– had a non-trivial number of fraudulent merchants

They also
– collected data
– built models
– collected more data
– built more models

But …

These models were arduous to build

And hard to test

So people suggested something simpler

Like k-nearest neighbor

What’s that?

– Find the k nearest training examples
– Use the average value of the target variable from them

This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need the top 50,000 results

Initial prototype was far too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
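To make the cost concrete, here is a minimal brute-force sketch of k-nearest neighbor prediction, plus the arithmetic behind the gap between the prototype and the requirement. The function names are hypothetical and Euclidean distance is an assumption; the slides do not say which distance the prototype used.

```python
import heapq
import math


def knn_predict(query, examples, targets, k):
    """Average the target over the k training examples nearest to query."""
    # Brute force: one distance computation per training example.
    scored = ((math.dist(query, x), y) for x, y in zip(examples, targets))
    nearest = heapq.nsmallest(k, scored, key=lambda pair: pair[0])
    return sum(y for _, y in nearest) / len(nearest)


def distance_computations(n_queries, n_examples):
    """Brute-force work grows as queries times examples."""
    return n_queries * n_examples


prototype = distance_computations(3_000, 200_000)          # 3K x 200K
required = distance_computations(20_000_000, 25_000_000)   # 20M x 25M
print(f"required / prototype = {required / prototype:,.0f}")
```

Under this count the requirement is several hundred thousand times the prototype's workload, which is why a faster searcher is needed rather than a constant-factor tuning of brute force.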

What We Did

Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid

Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute

Super-fast clustering
– Kmeans, StreamingKmeans

Projection Search

K-means Search

But These Require k-means!

Need a new k-means algorithm to get speed

Streaming k-means is
– One pass (through the original data)
– Very fast (20 µs per data point with threads)
– Very parallelizable

How It Works

For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new cluster
– Else add to nearest centroid

If number of centroids > K ≈ C log N
– Recursively cluster centroids with higher threshold

Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
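The steps on this slide can be sketched in a few lines. This is an illustration under stated assumptions, not the Mahout implementation: the "possibly new cluster" step is modeled as a random draw with probability d / threshold, the threshold grows by a fixed factor (`growth`) when there are too many centroids, the nearest centroid is found by brute force rather than an approximate searcher, and all names are hypothetical.

```python
import math
import random


def _insert(centroids, point, weight, threshold, rng):
    """Add one weighted point to the sketch, following the slide's rules."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    # Brute-force nearest centroid (the talk uses an approximate search here).
    nearest = min(centroids, key=lambda c: math.dist(c[0], point))
    d = math.dist(nearest[0], point)
    # Far away: always a new centroid. Close: new centroid only sometimes
    # (assumed probability d / threshold).
    if d > threshold or rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        # Fold the point into the nearest centroid as a weighted mean.
        total = nearest[1] + weight
        nearest[0] = [(a * nearest[1] + b * weight) / total
                      for a, b in zip(nearest[0], point)]
        nearest[1] = total


def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over points; returns a list of [centroid, weight] pairs."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    rng = random.Random(seed)
    centroids = []
    for p in points:
        _insert(centroids, p, 1.0, threshold, rng)
        # Too many centroids: recluster them with a higher threshold.
        while len(centroids) > max_centroids:
            threshold *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for c, w in old:
                _insert(centroids, c, w, threshold, rng)
    return centroids
```

In this sketch `max_centroids` plays the role of K ≈ C log N and is supplied by the caller. The returned weights sum to the number of input points, so the centroids can be fed to an ordinary weighted k-means or used directly as a compressed stand-in for the data.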

Parallel Speedup?

Warning, Recursive Descent

Inner loop requires finding nearest centroid

With lots of centroids, this is slow

But wait, we have classes to accelerate that!

(Let’s not use k-means searcher, though)

Thank You
