Collaborative Filtering with Spark


DESCRIPTION

Spotify uses a range of machine learning models to power its music recommendation features, including the Discover page and Radio. Because training these models is iterative, they suffer from Hadoop's I/O overhead and are a natural fit for the Spark programming paradigm. In this talk I will present both the right way and the wrong way to implement collaborative filtering models with Spark. Additionally, I will take a deep dive into how matrix factorization is implemented in the MLlib library.

TRANSCRIPT

May 9, 2014

Collaborative Filtering with Spark

Chris Johnson (@MrChrisJohnson)

Who am I?

• Chris Johnson
– Machine learning guy from NYC
– Focused on music recommendations
– Formerly a graduate student at UT Austin

What is MLlib?

Algorithms:

• classification: logistic regression, linear support vector machine (SVM), naive Bayes
• regression: generalized linear regression
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal component analysis (PCA)
• collaborative filtering: alternating least squares (ALS)

http://spark.apache.org/docs/0.9.0/mllib-guide.html



Collaborative Filtering - “The Netflix Prize”

Collaborative Filtering

Hey, I like tracks P, Q, R, S!

Well, I like tracks Q, R, S, T!

Then you should check out track P!

Nice! Btw try track T!

Image via Erik Bernhardsson

Collaborative Filtering at Spotify

• Discover (personalized recommendations)
• Radio
• Related Artists
• Now Playing


Explicit Matrix Factorization

[Figure: a users × movies ratings matrix, e.g. user “Chris” rating the movie “Inception”]

• Users explicitly rate a subset of the movie catalog
• Goal: predict how users will rate new movies

Explicit Matrix Factorization

[Figure: a partially observed ratings matrix (e.g. Chris's rating of Inception),
? 3 5 ?
1 ? ? 1
2 ? 3 2
? ? ? 5
5 2 ? 4
approximated by the product of a user matrix X and a movie matrix Y]

• Approximate the ratings matrix by the product of low-dimensional user and movie matrices
• Minimize RMSE (root mean squared error):

$$\min_{X,Y} \sum_{(u,i)\,\text{observed}} \left(r_{ui} - x_u^\top y_i - \beta_u - \beta_i\right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$

where $r_{ui}$ = user $u$'s rating for movie $i$, $x_u$ = user latent factor vector, $y_i$ = item latent factor vector, $\beta_u$ = bias for user, $\beta_i$ = bias for item, and $\lambda$ = regularization parameter.


Implicit Matrix Factorization

[Figure: a binary play matrix, approximated by the product of a user matrix X and an item matrix Y]

• Replace stream counts with binary labels
– 1 = streamed, 0 = never streamed
• Minimize weighted RMSE (root mean squared error), using a function of stream counts as weights:

$$\min_{X,Y} \sum_{u,i} c_{ui} \left(p_{ui} - x_u^\top y_i - \beta_u - \beta_i\right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$

where $p_{ui}$ = 1 if user $u$ streamed track $i$, else 0; $c_{ui}$ = confidence weight, a function of the stream count; $x_u$ = user latent factor vector; $y_i$ = item latent factor vector; $\beta_u$, $\beta_i$ = user and item biases; $\lambda$ = regularization parameter.


Alternating Least Squares

• Initialize user and item vectors to random noise
• Fix item vectors and solve for optimal user vectors
– Take the derivative of the loss function with respect to the user's vector, set it equal to 0, and solve
– Results in a system of linear equations with a closed-form solution!
• Fix user vectors and solve for optimal item vectors
• Repeat until convergence

code: https://github.com/MrChrisJohnson/implicitMF


Alternating Least Squares

• Note that:

$$Y^\top C^u Y = Y^\top Y + Y^\top (C^u - I)\, Y$$

• Then, we can pre-compute $Y^\top Y$ once per iteration
– $(C^u - I)$ and $p(u)$ only contain non-zero elements for tracks that the user streamed
– Using sparse matrix operations, we can then compute each user's vector efficiently in $O(f^2 n_u + f^3)$ time, where $n_u$ is the number of tracks the user streamed and $f$ is the number of latent factors

code: https://github.com/MrChrisJohnson/implicitMF


Alternating Least Squares

code: https://github.com/MrChrisJohnson/implicitMF


Scaling up Implicit Matrix Factorization with Hadoop


Hadoop at Spotify, 2009

Hadoop at Spotify, 2014

700 nodes in our London data center

Implicit Matrix Factorization with Hadoop

[Figure: the Hadoop blocking scheme. All log entries are split into a K × L grid of map tasks keyed by (u % K, i % L); item vectors are partitioned by i % L and user vectors by u % K, and the reduce step emits the new user vectors for each block u % K = 0, 1, …, K-1. Figure via Erik Bernhardsson]

Implicit Matrix Factorization with Hadoop

[Figure: one map task. Map input: tuples (u, i, count) where u % K = x and i % L = y. The distributed cache holds all user vectors where u % K = x and all item vectors where i % L = y. The mapper emits contributions; the reducer combines them into a new vector. Figure via Erik Bernhardsson]

Hadoop suffers from I/O overhead

[Figure: the I/O bottleneck between map/reduce stages]

Spark to the rescue!!

[Figure: Spark vs. Hadoop performance, via http://www.slideshare.net/Hadoop_Summit/spark-and-shark]


First Attempt

[Figure: ratings, user vectors, and item vectors spread across nodes 1-6]

• For each iteration:
– Compute Y^T Y over item vectors and broadcast
– Join user vectors along with all ratings for that user and all item vectors for which the user rated the item
– Sum up Y^T (C^u - I) Y and Y^T C^u p(u) and solve for optimal user vectors (sketched below)




First Attempt

• Issues:
– Unnecessarily sending multiple copies of item vectors to each node
– Unnecessarily shuffling data across the cluster at each iteration
– Not taking advantage of Spark's in-memory capabilities!

Second Attempt

[Figure: ratings, user vectors, and item vectors spread across nodes 1-6]

• For each iteration:
– Compute Y^T Y over item vectors and broadcast
– Group the ratings matrix into blocks, and join blocks with the necessary user and item vectors, to avoid sending multiple copies of the item vectors to each node (sketched below)
– Sum up Y^T (C^u - I) Y and Y^T C^u p(u) and solve for optimal user vectors



Second Attempt

• Issues:
– Still unnecessarily shuffling data across the cluster at each iteration
– Still not taking advantage of Spark's in-memory capabilities!

So, what are we missing?...

• Partitioner: defines how the elements in a key-value pair RDD are partitioned across the cluster.

[Figure: user vectors split into partitions 1-3 across nodes 1-6]

So, what are we missing?...

• partitionBy(partitioner): partitions all elements of the same key to the same node in the cluster, as defined by the partitioner.

[Figure: user-vector partitions co-located on their assigned nodes]

So, what are we missing?...

• mapPartitions(func): similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator[T] => Iterator[U] when running on an RDD of type T.

[Figure: function() applied once per user-vector partition]

So, what are we missing?...

• persist(storageLevel): sets this RDD's storage level to persist (cache) its values across operations after the first time it is computed.

[Figure: user-vector partitions cached in memory on their nodes]


Third Attempt

[Figure: ratings, user vectors, and item vectors spread across nodes 1-6]

• Partition the ratings matrix, user vectors, and item vectors by user and item blocks, and cache the partitions in memory
• Build InLink and OutLink mappings for users and items (sketched below)
– InLink mapping: the user IDs and vectors for a given block, along with the ratings for each user in the block
– OutLink mapping: the item IDs and vectors for a given block, along with a list of destination blocks to which to send these vectors
• For each iteration:
– Compute Y^T Y over item vectors and broadcast
– On each item block, use the OutLink mapping to send item vectors to the necessary user blocks
– On each user block, use the InLink mapping along with the joined item vectors to update vectors



ALS Running Times

[Figure: ALS running-time comparison. Via Xiangrui Meng (Databricks), http://stanford.edu/~rezab/sparkworkshop/slides/xiangrui.pdf]

Fin