What's New in Apache Mahout

What’s New in Apache Mahout: A Preview of Mahout 1.0. Ted Dunning, 21 May 2014, Boulder/Denver Big Data Meet-up. © 2014 MapR Technologies

Upload: ted-dunning

Post on 27-Aug-2014

DESCRIPTION

Apache Mahout is changing radically. Here is a report on what is coming, notably including an R-like domain-specific language that can use multiple computational engines such as Spark.

TRANSCRIPT

Page 1: What's new in Apache Mahout

© 2014 MapR Technologies

What’s New in Apache Mahout: A Preview of Mahout 1.0
Ted Dunning
21 May 2014, Boulder/Denver Big Data Meet-up

Page 2: What's new in Apache Mahout

What’s New in Apache Mahout: A Preview of Mahout 1.0

21 May 2014 Boulder/Denver Big Data Meet-up #BDBDM

Ted Dunning, Chief Applications Architect, MapR Technologies
Twitter @Ted_Dunning
Email [email protected] [email protected]

Page 3: What's new in Apache Mahout

There was just an explosion in Apache Mahout…

Page 4: What's new in Apache Mahout

Apache Mahout up to now…
• Open source Apache project http://mahout.apache.org/
• Mahout version is 0.9, released Feb 2014; included Scala
  – Summary 0.9 blog at http://bit.ly/1rirUUL
• Library of scalable algorithms for machine learning
  – Some run on Apache Hadoop distributions; others do not require Hadoop
  – Some can be run at small scale
  – Some are run in parallel; others are sequential
• Includes the following main areas:
  – Clustering & related techniques
  – Classification
  – Recommendation
  – Mahout Math Library

Page 5: What's new in Apache Mahout

Roadmap to Mahout 1.0
• Say good-bye to MapReduce
  – New MR algorithms will not be accepted
  – Support for existing ones will continue for now
• Support for Apache Spark
  – Under construction; some features already available
• Support for h2o being explored
• Support for Apache Stratosphere possibly in future

Page 6: What's new in Apache Mahout

Roadmap: Apache Mahout 1.0

Page 7: What's new in Apache Mahout

Apache Spark
• Apache Spark http://spark.apache.org/
  – Open source “fast and general engine for large scale data processing”
  – Especially fast in-memory
  – Made top-level Apache project
    • Feb 2014
    • http://spark.apache.org/
    • over 100 committers
  – Original developers have started company called Databricks (Berkeley CA) http://databricks.com/

Page 8: What's new in Apache Mahout

Mahout and Scala
• Scala http://www.scala-lang.org/
  – Open source; appeared in 2003
  – Wiki describes as “object-functional programming and scripting language”
• Scala provides functional style
  – Makes lazy evaluation much safer
  – Notationally compact
  – Minor syntax extensions allowed
  – Makes math much easier

Page 9: What's new in Apache Mahout

Here’s what DSL & Spark will mean for Mahout
• Scala DSL provides convenient notation for expressing parallel machine learning
• Spark (and other engines) provide execution environment
• Overview of Scala and Apache Spark bindings in Mahout can be found at https://mahout.apache.org/users/sparkbindings/home.html

Page 10: What's new in Apache Mahout

What do Clusters, Cap’n Crunch and Cocoa Puffs have in common?

Page 11: What's new in Apache Mahout

They’re part of the data in the new Mahout Spark shell tutorial…

Page 12: What's new in Apache Mahout

And you shouldn’t be eating them.

Page 13: What's new in Apache Mahout

Tutorial: Mahout-Spark Shell
• Find it here http://bit.ly/RSTeMr
• Early stage code - play with Mahout's Scala DSL for linear algebra and the Mahout-Spark shell
  – Uses publicly available breakfast cereal data set
  – Challenge: Fit linear model that infers customer ratings from ingredients
  – Toy data set, but loaded with Mahout to mimic a huge data set
• Mahout's linear algebra DSL has an abstraction called DistributedRowMatrix (DRM)
  – models a matrix that is partitioned by rows and stored in the memory of a cluster of machines

Page 14: What's new in Apache Mahout

Dissecting the Model
• Components
  – Cereal ingredients are the features
  – Ratings are the target variables
• Linear regression assumes that target variable y is generated by linear combination of feature matrix X with parameter vector β plus the noise ε

  y = Xβ + ε

• Goal: Find estimate of parameter vector β that explains data
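For reference, the estimate that the following slides compute can be written out explicitly. This is the standard ordinary least squares result, added here as a sketch rather than taken from the slide; it assumes the matrix XᵀX is invertible.

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta} \lVert y - X\beta \rVert_2^2 \\
X^{\top} X \, \hat{\beta} &= X^{\top} y \\
\hat{\beta} &= (X^{\top} X)^{-1} X^{\top} y
\end{aligned}
```

The middle line is the system of normal equations; solving it directly avoids forming the inverse.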

Page 15: What's new in Apache Mahout

What do you see in this matrix?

val drmData = drmParallelize(dense(
  (2, 2, 10.5, 10, 29.509541),  // Apple Cinnamon Cheerios
  (1, 2, 12,   12, 18.042851),  // Cap'n'Crunch
  (1, 1, 12,   13, 22.736446),  // Cocoa Puffs
  (2, 1, 11,   13, 32.207582),  // Froot Loops
  (1, 2, 12,   11, 21.871292),  // Honey Graham Ohs
  (2, 1, 16,   8,  36.187559),  // Wheaties Honey Gold
  (6, 2, 17,   1,  50.764999),  // Cheerios
  (3, 2, 13,   7,  40.400208),  // Clusters
  (3, 3, 13,   4,  45.811716)), // Great Grains Pecan
  numPartitions = 2);

Page 16: What's new in Apache Mahout

Add Bias Column

val drmX1 = drmX.mapBlock(ncol = drmX.ncol + 1) {
  case(keys, block) =>
    // create a new block with an additional column
    val blockWithBiasColumn = block.like(block.nrow, block.ncol + 1)
    // copy data from current block into the new block
    blockWithBiasColumn(::, 0 until block.ncol) := block
    // last column consists of ones
    blockWithBiasColumn(::, block.ncol) := 1

    keys -> blockWithBiasColumn
}

Page 17: What's new in Apache Mahout

Solve Linear System, Compute Error

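// Normal equations: collect the small products X'X and X'y into local memory
val XtX = (drmX1.t %*% drmX1).collect
val Xty = (drmX1.t %*% y).collect(::, 0)

// Solve X'X beta = X'y locally
val beta = solve(XtX, Xty)

// Fitted values and the 2-norm of the residual
val fittedY = (drmX1 %*% beta).collect(::, 0)
val error = (y - fittedY).norm(2)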
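The same steps can be traced on a single machine. The sketch below is plain Python, not Mahout code: it appends the bias column, forms X'X and X'y, and solves the normal equations on the cereal data from the earlier slide. The helper names (`transpose`, `matmul`, `solve`) are illustrative and the Gaussian elimination is a stand-in for whatever solver the DSL uses.

```python
# Cereal data from the slides: four ingredient columns, then the rating.
DATA = [
    (2, 2, 10.5, 10, 29.509541),  # Apple Cinnamon Cheerios
    (1, 2, 12, 12, 18.042851),    # Cap'n'Crunch
    (1, 1, 12, 13, 22.736446),    # Cocoa Puffs
    (2, 1, 11, 13, 32.207582),    # Froot Loops
    (1, 2, 12, 11, 21.871292),    # Honey Graham Ohs
    (2, 1, 16, 8, 36.187559),     # Wheaties Honey Gold
    (6, 2, 17, 1, 50.764999),     # Cheerios
    (3, 2, 13, 7, 40.400208),     # Clusters
    (3, 3, 13, 4, 45.811716),     # Great Grains Pecan
]


def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two list-of-lists matrices."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


# Feature matrix with a trailing bias column of ones, and the target vector.
X1 = [list(row[:4]) + [1.0] for row in DATA]
y = [row[4] for row in DATA]

Xt = transpose(X1)
XtX = matmul(Xt, X1)                                      # 5 x 5 matrix
Xty = [sum(a * b for a, b in zip(col, y)) for col in Xt]  # length-5 vector

beta = solve(XtX, Xty)
fitted = [sum(b * v for b, v in zip(beta, row)) for row in X1]
error = sum((a - f) ** 2 for a, f in zip(y, fitted)) ** 0.5  # residual 2-norm

print("beta:", beta)
print("error:", error)
```

Only the 5 x 5 matrix and the length-5 vector need to be held in one place, which is why the Mahout version can keep the tall matrix distributed and collect just the products.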

Page 18: What's new in Apache Mahout

In R

all = matrix(
  c(2, 2, 10.5, 10, 29.509541,
    1, 2, 12,   12, 18.042851,
    1, 1, 12,   13, 22.736446,
    2, 1, 11,   13, 32.207582,
    1, 2, 12,   11, 21.871292,
    2, 1, 16,   8,  36.187559,
    6, 2, 17,   1,  50.764999,
    3, 2, 13,   7,  40.400208,
    3, 3, 13,   4,  45.811716),
  byrow=T, ncol=5)

Page 19: What's new in Apache Mahout

More R

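# assumed split (not shown on the slide): first four columns are features, fifth is the rating
a = all[, 1:4]
y = all[, 5]

a1 = cbind(a, 1)
ata = t(a1) %*% a1
aty = t(a1) %*% y

x1 = solve(a=ata, b=aty)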

Page 20: What's new in Apache Mahout

Well, Actually

df = data.frame(all)

m = lm(X5 ~ X1 + X2 + X3 + X4, df)

plot(df$X5, predict(m))
abline(lm(y ~ x, data.frame(x=df$X5, y=predict(m))), col='red')

Page 21: What's new in Apache Mahout

R Wins

Page 22: What's new in Apache Mahout

R Wins … For Now

Page 23: What's new in Apache Mahout

R Wins … For Now … at Small Scale

Page 24: What's new in Apache Mahout

Recommendation

Behavior of a crowd helps us understand what individuals will do

Page 25: What's new in Apache Mahout

Recommendation

Alice: Alice got an apple and a puppy

Charles: Charles got a bicycle

Bob: Bob got an apple

Page 26: What's new in Apache Mahout

Recommendation

Alice: Alice got an apple and a puppy

Charles: Charles got a bicycle

Bob: Bob got an apple. What else would Bob like?

Page 27: What's new in Apache Mahout

Recommendation

Alice: Alice got an apple and a puppy

Charles: Charles got a bicycle

Bob: A puppy!

Page 28: What's new in Apache Mahout

You get the idea of how recommenders work…

Page 29: What's new in Apache Mahout

By the way, like me, Bob also wants a pony…

Page 30: What's new in Apache Mahout

Recommendation

[Figure: users Alice, Bob, Charles and Amelia]

What if everybody gets a pony?

What else would you recommend for new user Amelia?

Page 31: What's new in Apache Mahout

Recommendation

[Figure: users Alice, Bob, Charles and Amelia]

If everybody gets a pony, it’s not a very good indicator of what else to predict...

What we want is anomalous co-occurrence

Page 32: What's new in Apache Mahout

Get Useful Indicators from Behaviors
• Use log files to build history matrix of users x items
  – Remember: this history of interactions will be sparse compared to all potential combinations
• Transform to a co-occurrence matrix of items x items
• Look for useful co-occurrence by looking for anomalous co-occurrences to make an indicator matrix
  – Log Likelihood Ratio (LLR) can be helpful to judge which co-occurrences can with confidence be used as indicators of preference
  – ItemSimilarityJob in Apache Mahout uses LLR
    • (pony book said RowSimilarityJob, not as good)

Page 33: What's new in Apache Mahout

Model uses three matrices…

Page 34: What's new in Apache Mahout

History Matrix: Users x Items

[Figure: history matrix with one row each for Alice, Bob and Charles; check marks show the items each user interacted with]

Page 35: What's new in Apache Mahout

Co-Occurrence Matrix: Items x Items

[Figure: items x items co-occurrence matrix with small integer counts (0, 1, 2) in the cells]

Use LLR test to turn co-occurrence into indicators of interesting co-occurrence

Page 36: What's new in Apache Mahout

Indicator Matrix: Anomalous Co-Occurrence

[Figure: indicator matrix with check marks on the anomalous co-occurrences]

Page 37: What's new in Apache Mahout

Which one is the anomalous co-occurrence?

          A       not A
B         13      1000
not B     1000    100,000
Score: 0.90

          A       not A
B         1       0
not B     0       10,000
Score: 4.52

          A       not A
B         10      0
not B     0       100,000
Score: 14.3

          A       not A
B         1       0
not B     0       2
Score: 1.95
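The four scores on this slide are consistent with the square root of the log-likelihood ratio (G²) statistic for a 2 x 2 contingency table. The sketch below is plain Python using the entropy form of the statistic; the function names are illustrative and are not taken from the slides. Under that reading it reproduces 0.90, 4.52, 14.3 and 1.95 to within rounding.

```python
from math import log, sqrt


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * log(x)


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for the table [[k11, k12], [k21, k22]]."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def root_llr(k11, k12, k21, k22):
    """Square root of the LLR statistic."""
    return sqrt(llr(k11, k12, k21, k22))


# Tables from the slide, as (A and B, B only, A only, neither).
TABLES = [
    (13, 1000, 1000, 100000),
    (1, 0, 0, 10000),
    (10, 0, 0, 100000),
    (1, 0, 0, 2),
]

for table in TABLES:
    print(table, round(root_llr(*table), 2))
```

The first table has the most co-occurrences (13) yet the lowest score, because A and B are each common enough that 13 joint events is about what chance predicts. The third table has fewer joint events but they are never seen apart, so it scores highest.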

Page 38: What's new in Apache Mahout

Collection of Documents: Insert Meta-Data

Search Technology

Item meta-data

Document for “puppy”
id: t4
title: puppy
desc: The sweetest little puppy ever.
keywords: puppy, dog, pet

Ingest easily via NFS

Page 39: What's new in Apache Mahout

A Quick Simplification
• Users who do h
• Also do

User-centric recommendations

Item-centric recommendations

Page 40: What's new in Apache Mahout

val drmA = sampleDownAndBinarize(
  drmARaw, randomSeed, maxNumInteractions).checkpoint()

val numUsers = drmA.nrow.toInt

// Compute number of interactions per thing in A
val csums = drmBroadcast(drmA.colSums)

// Compute co-occurrence matrix A'A
val drmAtA = drmA.t %*% drmA
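To make the quantities concrete, here is a small plain-Python sketch of the same three values (user count, per-item interaction counts, and A'A) on the Alice, Bob and Charles example from the earlier slides. It is not Mahout code, it skips the down-sampling step, and the item order (apple, puppy, bicycle) is chosen here for illustration.

```python
ITEMS = ["apple", "puppy", "bicycle"]

# Binary history matrix A: rows are users, columns are items.
A = [
    [1, 1, 0],  # Alice got an apple and a puppy
    [1, 0, 0],  # Bob got an apple
    [0, 0, 1],  # Charles got a bicycle
]

num_users = len(A)

# Number of interactions per item (column sums of A).
cols = list(zip(*A))
col_sums = [sum(col) for col in cols]

# Co-occurrence matrix A'A: entry (i, j) counts users who have both items.
AtA = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

print("users:", num_users)
print("column sums:", dict(zip(ITEMS, col_sums)))
for name, row in zip(ITEMS, AtA):
    print(name, row)
```

For a binary A the diagonal of A'A equals the column sums, and the single off-diagonal count here (apple with puppy) comes from Alice. These counts, together with the number of users, are the inputs an LLR test needs for each item pair.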

Page 43: What's new in Apache Mahout

Sandbox

Page 44: What's new in Apache Mahout

Going Further: Multi-Modal Recommendation

Page 45: What's new in Apache Mahout

Going Further: Multi-Modal Recommendation

Page 46: What's new in Apache Mahout

Better Long-Term Recommendations
• Anti-flood
  Avoid having too much of a good thing
• Dithering
  “When making it worse makes it better”

Page 47: What's new in Apache Mahout

Why Use Dithering?

Page 48: What's new in Apache Mahout

What’s New in Apache Mahout? A Preview of Mahout 1.0

21 May 2014 #BDBDM

Ted Dunning, Chief Applications Architect, MapR Technologies
Twitter @Ted_Dunning
Email [email protected] [email protected]

Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout

Page 49: What's new in Apache Mahout

Sample Music Log Files

Time   Event type   User ID   Artist ID   Track ID
13     START        10113     218265      4281
23     BEACON       10113     218265      4281
24     START        10113     796006      11935028
34     BEACON       10113     796006      11935028
44     BEACON       10113     796006      11935028
54     BEACON       10113     796006      11935028
64     BEACON       10113     796006      11935028
74     BEACON       10113     796006      11935028
84     BEACON       10113     796006      11935028
94     BEACON       10113     796006      11935028
104    BEACON       10113     796006      11935028
109    FINISH       10113     796006      11935028
111    START        10113     589999      12011972
121    BEACON       10113     589999      12011972

Page 50: What's new in Apache Mahout

Documents for Music Recommendation

id 1710
mbid 592a3b6d-c42b-4567-99c9-ecf63bd66499
name Chuck Berry
area United States
gender Male
indicator_artists 386685,875994,637954,3418,1344,789739,1460, …

id 541902
mbid 983d4f8f-473e-4091-8394-415c105c4656
name Charlie Winston
area United Kingdom
gender None
indicator_artists 997727,815,830794,59588,900,2591,1344,696268, …

Page 51: What's new in Apache Mahout

Practical Machine Learning: Innovations in Recommendation

28 April 2014 NoSQL Matters Conference #NoSQLMatters

Ted Dunning, Chief Applications Architect, MapR Technologies
Twitter @Ted_Dunning
Email [email protected] [email protected]

Apache Mahout https://mahout.apache.org/
Twitter @ApacheMahout
