What's Right and Wrong with Apache Mahout

Upload: ted-dunning

Post on 10-May-2015

DESCRIPTION

This is a summary of what I think is good and bad about Mahout. Presented on the eve of the 2013 Hadoop Summit.

TRANSCRIPT

Page 1: Whats Right and Wrong with Apache Mahout

1©MapR Technologies 2013- Confidential

Apache Mahout

How it's good, how it's awesome, and where it falls short

Page 2: Whats Right and Wrong with Apache Mahout

What is Mahout?

“Scalable machine learning”
  – not just Hadoop-oriented machine learning
  – not entirely, that is. Just mostly.

Components
  – math library
  – clustering
  – classification
  – decompositions
  – recommendations

Page 5: Whats Right and Wrong with Apache Mahout

What is Right and Wrong with Mahout?

Components
  – recommendations
  – math library
  – clustering
  – classification
  – decompositions
  – other stuff

All the stuff that isn’t there

Page 6: Whats Right and Wrong with Apache Mahout

Mahout Math

Page 7: Whats Right and Wrong with Apache Mahout

Mahout Math

Goals are
  – basic linear algebra,
  – and statistical sampling,
  – and good clustering,
  – decent speed,
  – extensibility,
  – especially for sparse data

But not
  – totally badass speed
  – comprehensive set of algorithms
  – optimization, root finders, quadrature

Page 8: Whats Right and Wrong with Apache Mahout

Matrices and Vectors

At the core:
  – DenseVector, RandomAccessSparseVector
  – DenseMatrix, SparseRowMatrix

Highly composable API

Important ideas:
  – view*, assign and aggregate
  – iteration

m.viewDiagonal().assign(v)

Page 9: Whats Right and Wrong with Apache Mahout

Assign

Matrices

    Matrix assign(double value);
    Matrix assign(double[][] values);
    Matrix assign(Matrix other);
    Matrix assign(DoubleFunction f);
    Matrix assign(Matrix other, DoubleDoubleFunction f);

Vectors

    Vector assign(double value);
    Vector assign(double[] values);
    Vector assign(Vector other);
    Vector assign(DoubleFunction f);
    Vector assign(Vector other, DoubleDoubleFunction f);
    Vector assign(DoubleDoubleFunction f, double y);

Page 10: Whats Right and Wrong with Apache Mahout

Views

Matrices

    Matrix viewPart(int[] offset, int[] size);
    Matrix viewPart(int row, int rlen, int col, int clen);
    Vector viewRow(int row);
    Vector viewColumn(int column);
    Vector viewDiagonal();

Vectors

    Vector viewPart(int offset, int length);

Page 13: Whats Right and Wrong with Apache Mahout

Examples

The trace of a matrix

    m.viewDiagonal().zSum()

Random projection

    m.times(new DenseMatrix(1000, 3).assign(new Normal()))

Low rank random matrix
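
For readers without Mahout on the classpath, here is a minimal sketch of what the trace and random projection one-liners compute, written against plain Java arrays. It does not use the Mahout API at all; the class name `MatrixExamples`, the method names, and the seed parameter are illustrative assumptions, not anything from the slides.

```java
import java.util.Random;

class MatrixExamples {
    // Trace: the sum of the diagonal entries, which is what
    // m.viewDiagonal().zSum() expresses in one line.
    static double trace(double[][] m) {
        double sum = 0.0;
        int n = Math.min(m.length, m[0].length);
        for (int i = 0; i < n; i++) {
            sum += m[i][i];
        }
        return sum;
    }

    // Random projection: multiply m (n x d) by a d x k matrix whose
    // entries are independent standard normal samples.
    static double[][] randomProjection(double[][] m, int k, long seed) {
        Random rng = new Random(seed);
        int n = m.length;
        int d = m[0].length;
        double[][] r = new double[d][k];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < k; j++) {
                r[i][j] = rng.nextGaussian();
            }
        }
        double[][] out = new double[n][k];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < k; j++) {
                for (int l = 0; l < d; l++) {
                    out[i][j] += m[i][l] * r[l][j];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] m = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
        System.out.println("trace = " + trace(m)); // prints "trace = 15.0"
        double[][] p = randomProjection(m, 2, 42L);
        System.out.println("projected shape = " + p.length + " x " + p[0].length);
    }
}
```

The explicit loops are the price of dropping the library. The appeal of the view/assign style is that the same operations compose into single expressions.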

Page 14: Whats Right and Wrong with Apache Mahout

Recommenders

Page 15: Whats Right and Wrong with Apache Mahout

Examples of Recommendations

Customers buying books (Linden et al.)
Web visitors rating music (Shardanand and Maes) or movies (Riedl et al.), (Netflix)
Internet radio listeners not skipping songs (Musicmatch)
Internet video watchers watching >30 s (Veoh)
Visibility in a map UI (new Google maps)

Page 16: Whats Right and Wrong with Apache Mahout

Recommendation Basics

History:

User  Thing
1     3
2     4
3     4
2     3
3     2
1     1
2     1

Page 17: Whats Right and Wrong with Apache Mahout

Recommendation Basics

History as matrix:

     t1  t2  t3  t4
u1    1   0   1   0
u2    1   0   1   1
u3    0   1   0   1

(t1, t3) cooccur 2 times, (t1, t4) once, (t2, t4) once, (t3, t4) once
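
The cooccurrence counts quoted on this slide can be checked mechanically. The sketch below treats the history as a 0/1 user-by-thing matrix A and computes A^T A with plain Java arrays; the class and method names are illustrative assumptions, and nothing here is Mahout code.

```java
class Cooccurrence {
    // Rows are users u1..u3, columns are things t1..t4, as on the slide.
    static final int[][] HISTORY = {
        {1, 0, 1, 0},
        {1, 0, 1, 1},
        {0, 1, 0, 1},
    };

    // Computes A^T A: entry (i, j) counts the users whose history
    // contains both thing i and thing j. The diagonal holds the
    // number of users per thing.
    static int[][] cooccurrence(int[][] a) {
        int things = a[0].length;
        int[][] c = new int[things][things];
        for (int[] user : a) {
            for (int i = 0; i < things; i++) {
                for (int j = 0; j < things; j++) {
                    c[i][j] += user[i] * user[j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int[][] c = cooccurrence(HISTORY);
        // Print each off-diagonal pair that cooccurs at least once.
        for (int i = 0; i < c.length; i++) {
            for (int j = i + 1; j < c.length; j++) {
                if (c[i][j] > 0) {
                    System.out.println("(t" + (i + 1) + ", t" + (j + 1) + ") cooccur " + c[i][j] + " time(s)");
                }
            }
        }
    }
}
```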

Page 18: Whats Right and Wrong with Apache Mahout

A Quick Simplification

Users who do h

Also do r

User-centric recommendations

Item-centric recommendations
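
The formulas that accompanied these four labels did not survive in the transcript. One plausible reading, offered as an assumption rather than as the slide's exact content, takes A to be the user-by-thing history matrix from the previous slide and h to be one user's history as a vector over things. Under that reading, the simplification is just associativity of matrix multiplication:

```latex
% Assumed notation: A is the user-by-thing history matrix,
% h is a single user's history vector, r is the recommendation vector.
\begin{align*}
A h &\quad \text{scores each user by overlap with } h \text{ (users who do } h\text{)} \\
r = A^{\top} (A h) &\quad \text{user-centric: find similar users, then collect what they do} \\
r = (A^{\top} A)\, h &\quad \text{item-centric: apply the cooccurrence matrix } A^{\top} A \text{ to } h
\end{align*}
```

Both groupings give the same r. The item-centric form is attractive because the cooccurrence matrix can be computed offline.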

Page 19: Whats Right and Wrong with Apache Mahout

Clustering

Page 20: Whats Right and Wrong with Apache Mahout

An Example

Page 21: Whats Right and Wrong with Apache Mahout

An Example

Page 22: Whats Right and Wrong with Apache Mahout

Diagonalized Cluster Proximity

Page 23: Whats Right and Wrong with Apache Mahout

Parallel Speedup?

Page 24: Whats Right and Wrong with Apache Mahout

Lots of Clusters Are Fine

Page 25: Whats Right and Wrong with Apache Mahout

Decompositions

Page 26: Whats Right and Wrong with Apache Mahout

Low Rank Matrix

 1   2    5
 2   4   10
10  20   50
20  40  100

Are these scaled up versions of all the same column?

Or should we see it differently?

Page 27: Whats Right and Wrong with Apache Mahout

Low Rank Matrix

Matrix multiplication is designed to make this easy

We can see weighted column patterns, or weighted row patterns. All the same mathematically.

[  1 ]
[  2 ]  x  [ 1  2  5 ]
[ 10 ]
[ 20 ]

Left factor: column pattern (or weights)
Right factor: weights (or row pattern)

Page 28: Whats Right and Wrong with Apache Mahout

Low Rank Matrix

What about here?

 1    2    5
 2    4   10
10  100   50
20   40  100

This is like before, but there is one exceptional value

Page 29: Whats Right and Wrong with Apache Mahout

Low Rank Matrix

OK … add in a simple fixer upper

[  1 ]                     [  0 ]
[  2 ]  x  [ 1  2  5 ]  +  [  0 ]  x  [ 0  8  0 ]
[ 10 ]                     [ 10 ]
[ 20 ]                     [  0 ]

In the second term, the column vector says which row and the row vector is the exception pattern
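
A short check that the rank-one product plus the fixer-upper term really reproduces the matrix with the exceptional value 100. This is a sketch with plain Java arrays; the class and method names are illustrative assumptions.

```java
import java.util.Arrays;

class LowRankPlusException {
    // Outer product: out[i][j] = col[i] * row[j].
    static int[][] outer(int[] col, int[] row) {
        int[][] out = new int[col.length][row.length];
        for (int i = 0; i < col.length; i++) {
            for (int j = 0; j < row.length; j++) {
                out[i][j] = col[i] * row[j];
            }
        }
        return out;
    }

    // Element-wise sum of two matrices of the same shape.
    static int[][] add(int[][] a, int[][] b) {
        int[][] out = new int[a.length][a[0].length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[0].length; j++) {
                out[i][j] = a[i][j] + b[i][j];
            }
        }
        return out;
    }

    // Base pattern plus the single-entry correction from the slide.
    static int[][] reconstruct() {
        int[][] base = outer(new int[]{1, 2, 10, 20}, new int[]{1, 2, 5});
        int[][] fix = outer(new int[]{0, 0, 10, 0}, new int[]{0, 8, 0});
        return add(base, fix);
    }

    public static void main(String[] args) {
        for (int[] row : reconstruct()) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The correction contributes 10 x 8 = 80 in row 3, column 2, which lifts the base value 20 to 100 and leaves every other entry untouched.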

Page 30: Whats Right and Wrong with Apache Mahout

Random Projection

Page 31: Whats Right and Wrong with Apache Mahout

SVD Projection

Page 32: Whats Right and Wrong with Apache Mahout

Classifiers

Page 33: Whats Right and Wrong with Apache Mahout

Mahout Classifiers

Naïve Bayes
  – high quality implementation
  – uses idiosyncratic input format
  – … but it is naïve

SGD
  – sequential, not parallel
  – auto-tuning has foibles
  – learning rate annealing has issues
  – definitely not state of the art compared to Vowpal Wabbit

Random forest
  – scaling limits due to decomposition strategy
  – yet another input format
  – no deployment strategy

Page 34: Whats Right and Wrong with Apache Mahout

The stuff that isn’t there

Page 35: Whats Right and Wrong with Apache Mahout

What Mahout Isn’t

Mahout isn’t R, isn’t SAS

It doesn’t aim to do everything

It aims to scale some few problems of practical interest

The stuff that isn’t there is a feature, not a defect

Page 36: Whats Right and Wrong with Apache Mahout

Contact:
  – [email protected]
  – @ted_dunning
  – @apachemahout
  – @[email protected]

Slides and such: http://www.slideshare.net/tdunning

Hash tags: #mapr #apachemahout