Co-occurrence-based recommendations with Mahout, Scala & Spark Sebastian Schelter @sscdotopen BigData Beers 05/29/2014


Page 1: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Co-occurrence-based recommendations with Mahout, Scala & Spark

Sebastian Schelter @sscdotopen

BigData Beers

05/29/2014

Page 3: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Co-occurrence Analysis

Page 4: Co-occurrence Based Recommendations with Mahout, Scala and Spark

History matrix

// real use case: load from DFS
// val A = drmFromHDFS(...)

// our toy example
val A = drmParallelize(dense(
  (1, 1, 1, 0),  // Alice
  (1, 0, 1, 0),  // Bob
  (0, 0, 1, 1)), // Charles
  numPartitions = 2)

Page 5: Co-occurrence Based Recommendations with Mahout, Scala and Spark

How often do items co-occur?

// compute co-occurrence matrix

val C = A.t %*% A
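To see what this product contains, here is a minimal plain-Scala sketch (no Mahout or Spark involved) that computes the same matrix for the toy history matrix above. The object and method names are illustrative only.

```scala
object CooccurrenceSketch {
  type Matrix = Vector[Vector[Int]]

  // Toy history matrix from the slides: rows are users, columns are items.
  val A: Matrix = Vector(
    Vector(1, 1, 1, 0), // Alice
    Vector(1, 0, 1, 0), // Bob
    Vector(0, 0, 1, 1)) // Charles

  // C(j)(k) = number of users who interacted with both item j and item k,
  // i.e. entry (j, k) of A^T A.
  def cooccurrences(a: Matrix): Matrix = {
    val numItems = a.head.length
    Vector.tabulate(numItems, numItems) { (j, k) =>
      a.map(row => row(j) * row(k)).sum
    }
  }

  def main(args: Array[String]): Unit =
    cooccurrences(A).foreach(row => println(row.mkString(" ")))
}
```

For the toy data this yields the rows 2 1 2 0, 1 1 1 0, 2 1 3 1 and 0 0 1 1. The diagonal holds the number of interactions per item, and the matrix is symmetric.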

Page 7: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Which co-occurrences are interesting?

// compute some statistics
val interactionsPerItem = drmBroadcast(A.colSums)

// convert to indicator matrix
val I = C.mapBlock() {
  // compute LLR scores from
  // co-occurrences and statistics
  ...
  // only keep interesting co-occurrences
  ...
}

// save indicator matrix
I.writeDrm(...)
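The elided LLR step could look roughly like the following plain-Scala sketch. It assumes the standard entropy-based form of the log-likelihood ratio test on a 2x2 contingency table. The `indicators` helper, its `threshold` parameter and the rule "score above threshold" are hypothetical stand-ins for "only keep interesting co-occurrences" and are not taken from the slides.

```scala
object LlrSketch {
  private def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // Unnormalized entropy of a set of counts.
  private def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  // k11: users with both items, k12: only the first item,
  // k21: only the second item, k22: neither item.
  def logLikelihoodRatio(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matrixEntropy = entropy(k11, k12, k21, k22)
    // clamp tiny negative values caused by floating-point round-off
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matrixEntropy))
  }

  // Hypothetical filter: marks a pair as an indicator if the items co-occur
  // at least once and their LLR score exceeds the given threshold.
  def indicators(c: Vector[Vector[Int]], interactionsPerItem: Vector[Int],
                 numUsers: Int, threshold: Double): Vector[Vector[Int]] =
    Vector.tabulate(c.length, c.length) { (j, k) =>
      val k11 = c(j)(k).toLong
      if (j == k || k11 == 0L) 0
      else {
        val k12 = interactionsPerItem(j) - k11
        val k21 = interactionsPerItem(k) - k11
        val k22 = numUsers - k11 - k12 - k21
        if (logLikelihoodRatio(k11, k12, k21, k22) > threshold) 1 else 0
      }
    }
}
```

The four counts come straight from the co-occurrence matrix and the per-item interaction counts, which is why the slide broadcasts `A.colSums` before scoring.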

Page 9: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Co-occurrence analysis prototype available

• MAHOUT-1464 provides a full-fledged co-occurrence analysis prototype
– applies selective downsampling to make the computation tractable
– supports cross-recommendations in datasets with multiple interaction types, e.g.
  • “people who watch this video also watch those videos”
  • “people who enter this search query watch those videos”
– includes code to run this on the Movielens and Epinions datasets

• future plan: easy indexing of the indicator matrix with Apache Solr to allow for search-as-recommendation deployments
– a prototype for the MR code already exists at https://github.com/pferrel/solr-recommender
– integration is in the works
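One way to picture the cross-recommendation case is as a product of two history matrices over the same users. The sketch below is an assumption about the idea, not the MAHOUT-1464 code: the data, the names `watched` and `searched`, and the helper `cross` are all made up for illustration.

```scala
object CrossCooccurrenceSketch {
  type Matrix = Vector[Vector[Int]]

  // Hypothetical data: both matrices have one row per user, in the same order.
  val watched: Matrix = Vector( // users x videos
    Vector(1, 1, 0),
    Vector(1, 0, 0),
    Vector(0, 1, 1))
  val searched: Matrix = Vector( // users x search queries
    Vector(1, 0),
    Vector(1, 0),
    Vector(0, 1))

  // Entry (q, v) of B^T A: number of users who entered query q and watched video v.
  def cross(b: Matrix, a: Matrix): Matrix =
    Vector.tabulate(b.head.length, a.head.length) { (q, v) =>
      b.indices.map(u => b(u)(q) * a(u)(v)).sum
    }
}
```

Here `cross(searched, watched)` gives the rows 2 1 0 and 0 1 1: two users entered the first query and watched the first video, and so on. Such counts would then be scored and filtered like ordinary co-occurrences.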

Page 10: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Under the covers

Page 11: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Underlying systems

• currently: runtime based on Apache Spark
– fast and expressive cluster computing system
– general computation graphs, in-memory primitives, rich API, interactive shell

• potentially supported in the future:
– Apache Flink (formerly: “Stratosphere”)
– H2O

Page 12: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Runtime & Optimization

• Execution is deferred: the user composes logical operators

• Computational actions implicitly trigger optimization (= selection of physical plan) and execution

• Optimization factors: size of operands, orientation of operands, partitioning, sharing of computational paths

• e.g. matrix multiplication:
– 5 physical operators for drmA %*% drmB
– 2 operators for drmA %*% inMemA
– 1 operator for drmA %*% x
– 1 operator for x %*% drmA

val C = A.t %*% A

I.writeDrm(path)

val inMemV = (U %*% M).collect

Pages 13-16: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Optimization Example

• Computation of AᵀA in example

val C = A.t %*% A

• Naïve execution
1st pass: transpose A (requires repartitioning of A)
2nd pass: multiply result with A (expensive, potentially requires repartitioning again)

• Logical optimization
Optimizer rewrites plan to use specialized logical operator for Transpose-Times-Self matrix multiplication

Naïve plan: A → Transpose → MatrixMult (with A) → C
Optimized plan: A → Transpose-Times-Self → C
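The deferred execution and the rewrite can be sketched with a tiny logical-plan model in plain Scala. This illustrates the idea only and is not Mahout's optimizer: all class and method names are invented, and the Transpose-Times-Self case is executed with an ordinary multiply as a stand-in for a specialized physical operator.

```scala
object PlanRewriteSketch {
  type Matrix = Vector[Vector[Int]]

  // Logical operators: constructing a plan performs no computation.
  sealed trait Plan
  case class Source(data: Matrix) extends Plan
  case class Transpose(in: Plan) extends Plan
  case class MatrixMult(left: Plan, right: Plan) extends Plan
  case class TransposeTimesSelf(in: Plan) extends Plan

  // Logical optimization: replace the pattern X.t %*% X by the specialized operator.
  def optimize(plan: Plan): Plan = plan match {
    case MatrixMult(Transpose(x), y) if x == y => TransposeTimesSelf(optimize(x))
    case MatrixMult(l, r) => MatrixMult(optimize(l), optimize(r))
    case Transpose(in) => Transpose(optimize(in))
    case TransposeTimesSelf(in) => TransposeTimesSelf(optimize(in))
    case s: Source => s
  }

  private def multiply(a: Matrix, b: Matrix): Matrix = {
    val bCols = b.transpose
    a.map(row => bCols.map(col => row.zip(col).map { case (x, y) => x * y }.sum))
  }

  private def execute(plan: Plan): Matrix = plan match {
    case Source(data) => data
    case Transpose(in) => execute(in).transpose
    case MatrixMult(l, r) => multiply(execute(l), execute(r))
    case TransposeTimesSelf(in) =>
      // stand-in: a real engine would pick a specialized physical operator here
      val a = execute(in)
      multiply(a.transpose, a)
  }

  // Action: only now is the plan optimized and executed.
  def collect(plan: Plan): Matrix = execute(optimize(plan))
}
```

Building `MatrixMult(Transpose(Source(a)), Source(a))` corresponds to writing `A.t %*% A`. Nothing runs until `collect` is called, at which point the plan has already been rewritten to `TransposeTimesSelf(Source(a))`.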

Pages 17-23: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Transpose-Times-Self

• Mahout computes AᵀA via row-outer-product formulation
– executes in a single pass over row-partitioned A

$$A^T A = \sum_{i=0}^{m} a_{i\bullet}^T a_{i\bullet}$$

Aᵀ × A = a1•ᵀ × a1• + a2•ᵀ × a2• + a3•ᵀ × a3• + a4•ᵀ × a4•
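The row-outer-product formulation translates almost directly into code. The following plain-Scala sketch (names are illustrative, dense arrays for simplicity) accumulates the result while reading each row exactly once, without ever materializing the transpose.

```scala
object RowOuterProductSketch {
  // Accumulates A^T A in a single pass over the rows of A.
  def transposeTimesSelf(rows: Iterator[Array[Double]], numCols: Int): Array[Array[Double]] = {
    val c = Array.ofDim[Double](numCols, numCols)
    for {
      row <- rows
      j <- 0 until numCols if row(j) != 0.0 // zero entries contribute nothing
      k <- 0 until numCols
    } c(j)(k) += row(j) * row(k) // entry (j, k) of the outer product of this row
    c
  }
}
```

Because each row contributes independently, the rows can live on different workers, and only the partial sums need to be combined.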

Page 24: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Physical operators for Transpose-Times-Self

• Two physical operators (concrete implementations) available for the Transpose-Times-Self operation
– standard operator AtA
– operator AtA_slim, specialized implementation for tall & skinny matrices

• Optimizer must choose
– currently: depends on user-defined threshold for number of columns
– ideally: cost-based decision, dependent on estimates of intermediate result sizes

Logical plan: A → Transpose-Times-Self → C
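The current, threshold-based choice could be sketched as below. The names and the default value are hypothetical. The slides only state that the decision depends on a user-defined threshold for the number of columns.

```scala
object OperatorChoiceSketch {
  sealed trait PhysicalOperator
  case object AtA extends PhysicalOperator
  case object AtASlim extends PhysicalOperator

  // Hypothetical default; the slides do not give a concrete value.
  val DefaultMaxSlimColumns = 200

  // A matrix with few columns has a small (numCols x numCols) result, so the
  // slim operator is assumed to be the better fit below the threshold.
  def choose(numCols: Int, maxSlimColumns: Int = DefaultMaxSlimColumns): PhysicalOperator =
    if (numCols <= maxSlimColumns) AtASlim else AtA
}
```

A cost-based optimizer would replace the fixed threshold with estimates of the intermediate result sizes.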

Page 25: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Physical operators for the distributed computation of AᵀA

Pages 26-35: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Physical operator AtA

A:
1100
0101
0111

A is row-partitioned into A1 and A2, held by worker 1 and worker 2:

A1:  0101
     0111
A2:  1100

Slices of the row outer products, computed per result partition:

                 for 1st partition   for 2nd partition
A1, row 0111:    0111                0000
                 0111                0111
A1, row 0101:    0000                0000
                 0101                0101
A2, row 1100:    0000                1100
                 0000                1100

Summed partitions of AᵀA:

worker 3:  0111
           0212
worker 4:  1100
           1312
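The data flow on these slides can be simulated on a single machine. The sketch below is an illustration of the scheme as read from the slides, not Mahout's AtA implementation. The names are invented, and the "shuffle" is just a filter on a local collection.

```scala
object AtASketch {
  type Row = Vector[Int]

  // Rows `range` of the outer product of one input row with itself.
  def slice(row: Row, range: Range): Vector[Row] =
    range.map(j => row.map(_ * row(j))).toVector

  def add(x: Vector[Row], y: Vector[Row]): Vector[Row] =
    x.zip(y).map { case (r1, r2) => r1.zip(r2).map { case (a, b) => a + b } }

  // inputPartitions: blocks of A held by the workers
  // resultRanges: row ranges of A^T A, one per result partition
  def ata(inputPartitions: Seq[Vector[Row]], resultRanges: Seq[Range]): Seq[Vector[Row]] = {
    // "map" side: every input row emits one slice per result partition
    val emitted = for {
      block <- inputPartitions
      row <- block
      (range, target) <- resultRanges.zipWithIndex
    } yield target -> slice(row, range)
    // "reduce" side: each result partition sums the slices addressed to it
    resultRanges.indices.map(t => emitted.filter(_._1 == t).map(_._2).reduce(add))
  }
}
```

With A1 and A2 from the slides and the result ranges `2 until 4` and `0 until 2`, the two result partitions come out as 0111 / 0212 and 1100 / 1312, matching the values shown for worker 3 and worker 4.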

Pages 36-39: Co-occurrence Based Recommendations with Mahout, Scala and Spark

Physical operator AtA_slim

A:
1100
0101
0111

A is row-partitioned into A1 and A2, held by worker 1 and worker 2:

A1:  0101
     0111
A2:  1100

Per-partition products (symmetric, shown as triangles):

A1ᵀA1:  0         A2ᵀA2:  1
        02                11
        011               000
        0212              0000

driver: C = AᵀA = A1ᵀA1 + A2ᵀA2

1100
1312
0111
0212
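A single-machine sketch of this scheme, under the assumption that each worker sends one triangle of its block product to the driver, which adds the triangles and mirrors the sum. Names are illustrative and this is not Mahout's AtA_slim code.

```scala
object AtASlimSketch {
  type Row = Vector[Int]

  // Worker side: triangle of block^T * block, entries (j)(k) for k <= j.
  def partialTriangle(block: Vector[Row]): Vector[Vector[Int]] = {
    val n = block.head.length
    Vector.tabulate(n)(j => Vector.tabulate(j + 1)(k => block.map(r => r(j) * r(k)).sum))
  }

  // Driver side: add the triangles, then mirror them into the full symmetric result.
  def ataSlim(blocks: Seq[Vector[Row]]): Vector[Vector[Int]] = {
    val tri = blocks.map(partialTriangle).reduce { (x, y) =>
      x.zip(y).map { case (r1, r2) => r1.zip(r2).map { case (a, b) => a + b } }
    }
    Vector.tabulate(tri.length, tri.length)((j, k) => if (k <= j) tri(j)(k) else tri(k)(j))
  }
}
```

Each triangle has about half as many entries as the full n × n product, which stays small as long as the matrix is tall and skinny. For the blocks on the slides, `partialTriangle` reproduces the triangles 0 / 02 / 011 / 0212 and 1 / 11 / 000 / 0000, and `ataSlim` returns the rows 1100, 1312, 0111, 0212.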