Big, Practical Recommendations with Alternating Least Squares
Sean Owen • Apache Mahout / Myrrix.com
WHERE’S BIG LEARNING?
• Next up the stack: the Application layer (Analytics, Machine Learning)
• Like Apache Mahout, a common Big Data app today: clustering, recommenders, classifiers on Hadoop
• Free, open source; not mature
• Where’s commercialized Big Learning?
(Stack diagram: Storage → Database → Processing → Applications)
A RECOMMENDER SHOULD …
• Answer in real time
  – Ingest new data, now
  – Modify recommendations based on the newest data
  – No “cold start” for new data
• Scale horizontally
  – For queries per second
  – For size of data set
• Accept diverse input
  – Not just people and products
  – Not just explicit ratings: clicks, views, buys
  – Side information
• Be “pretty accurate”
NEED: 2-TIER ARCHITECTURE
• Real-time Serving Layer
  – Quick results based on a precomputed model
  – Incremental update
  – Partitionable for scale
• Batch Computation Layer
  – Builds the model
  – Scales out (on Hadoop?)
  – Asynchronous, occasional, long-lived runs
A PRACTICAL ALGORITHM
MATRIX FACTORIZATION BENEFITS
• Factor the user-item matrix into a user-feature matrix times a feature-item matrix
• Well understood in ML, as: Principal Component Analysis, Latent Semantic Indexing
• Several algorithms, like: Singular Value Decomposition, Alternating Least Squares
• Models intuition
• Factorization is batch parallelizable
• Reconstruction (recommendations) in low dimension is fast
• Allows projection of new data
  – Cold start solution
  – Approximate update solution
A PRACTICAL IMPLEMENTATION
ALTERNATING LEAST SQUARES BENEFITS
• Simple factorization: P ≈ X Yᵀ
• Approximate: X and Y are “skinny” (low-rank)
• Faster than the SVD: trivially parallel, iterative
• Dumber than the SVD: no singular values, no orthonormal basis
• Parallelizable by row, so very Hadoop-friendly
• Iterative: an OK answer fast, refined as long as desired
• Lends itself to a “binary” input model, with ratings acting as regularization instead
  – Sparseness / 0s no longer a problem
ALS ALGORITHM 1
• Input: (user, item, strength) tuples (a sketch in Python follows)
  – Anything you can quantify is input
  – Strength is positive
  – Many tuples per user-item
• R is the sparse user-item interaction matrix
  – rᵢⱼ = total strength of interaction between user i and item j

R =
  1 4 3 · ·
  · · 3 · ·
  · 4 · 3 2
  5 · 2 · 3
  · · · 5 ·
  2 4 · · ·
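A minimal sketch of assembling R from the input tuples, assuming 0-indexed integer IDs; dense NumPy stands in for the sparse structure a real implementation would use, and build_r is an illustrative name, not Myrrix's API.

    import numpy as np

    def build_r(tuples, n_users, n_items):
        # tuples of (user, item, strength); strength is positive, and
        # many tuples per user-item pair simply total up
        R = np.zeros((n_users, n_items))
        for user, item, strength in tuples:
            R[user, item] += strength
        return R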
ALS ALGORITHM 2
Follow “Collaborative Filtering for Implicit Feedback Datasets”www2.research.att.com/~yifanhu/PUB/cf.pdf
Construct “binary” matrix P 1 where R > 0 0 where R = 0
Factor P, not R R returns in regularization
Still sparse; implicit 0s fine
1 1 1 0 0
0 0 1 0 0
0 1 0 1 1
1 0 1 0 1
0 0 0 1 0
1 1 0 0 0 P
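Given R as in the sketch above, constructing P is one line of NumPy:

    # 1 where R > 0, 0 where R = 0
    P = (R > 0).astype(float)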
ALS ALGORITHM 3
• P is m × n; choose k ≪ m, n
• Factor P as Q = X Yᵀ, with Q ≈ P
  – X is m × k; Yᵀ is k × n
• Find the best approximation Q: minimize the L2 norm of the difference, ‖P − Q‖²
  – Minimal squared error: “least squares”
• Recommendations are the largest values in Q
ALS ALGORITHM 4
• Optimizing X and Y simultaneously is non-convex, hard
• If X or Y is fixed, it is a system of linear equations: convex, easy
• Initialize Y with random values
• Solve for X; fix X, solve for Y; repeat (“alternating”, sketched below)
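A minimal sketch of the alternation itself, before the weighting and regularization the next slides add: with one factor fixed, each row of the other is an independent least-squares fit. NumPy's lstsq stands in for the per-row linear systems; all names here are illustrative.

    import numpy as np

    def als_plain(P, k, n_iters=10, seed=0):
        # Factor P ≈ X · Yᵀ, with X (m×k) and Y (n×k), k ≪ m, n
        m, n = P.shape
        Y = np.random.default_rng(seed).standard_normal((n, k))  # random init
        for _ in range(n_iters):
            # Fix Y, solve for X: each row xᵤ is an independent least-squares fit
            X = np.linalg.lstsq(Y, P.T, rcond=None)[0].T
            # Fix X, solve for Y, and repeat ("alternating")
            Y = np.linalg.lstsq(X, P, rcond=None)[0].T
        return X, Y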
ALS ALGORITHM 5
• Define regularization weights cᵤᵢ = 1 + α·rᵤᵢ
• Minimize:
  Σ cᵤᵢ (pᵤᵢ − xᵤᵀyᵢ)² + λ (Σ‖xᵤ‖² + Σ‖yᵢ‖²)
• A simple least-squares regression objective, plus:
  – Squared-error terms weighted by strength: the penalty for not reconstructing a 1 is higher at “strong” associations
  – A standard L2 regularization term
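For concreteness, the objective above evaluates directly; a sketch assuming dense NumPy arrays and the symbols defined on this slide:

    import numpy as np

    def objective(P, R, X, Y, alpha, lam):
        C = 1.0 + alpha * R                            # weights cᵤᵢ = 1 + α·rᵤᵢ
        sq_err = (P - X @ Y.T) ** 2                    # (pᵤᵢ − xᵤᵀyᵢ)² for all u, i
        reg = lam * ((X ** 2).sum() + (Y ** 2).sum())  # λ(Σ‖xᵤ‖² + Σ‖yᵢ‖²)
        return (C * sq_err).sum() + reg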
ALS ALGORITHM 6
• With Y fixed, compute the optimal X; each row xᵤ is independent
• Define Cᵤ as the diagonal matrix of cᵤ (the user’s strength weights); then:
  xᵤ = (YᵀCᵤY + λI)⁻¹ YᵀCᵤpᵤ
• Compare to the simple least-squares regression solution (YᵀY)⁻¹Yᵀpᵤ
  – Adds the Tikhonov / ridge regression regularization term λI
  – Attaches the cᵤ weights to Yᵀ
• See the paper for how YᵀCᵤY is computed efficiently; skipping the engineering! (A sketch follows.)
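A sketch of the closed-form solve, row by row, following the formula above. The rearrangement YᵀCᵤY = YᵀY + Yᵀ(Cᵤ − I)Y is the paper's efficiency trick, since Cᵤ − I is zero wherever the user has no interactions. Dense NumPy again; this is not Myrrix's actual code.

    import numpy as np

    def solve_user_factors(P, R, Y, alpha, lam):
        # With Y fixed, each row xᵤ = (YᵀCᵤY + λI)⁻¹ YᵀCᵤpᵤ is independent
        m, k = P.shape[0], Y.shape[1]
        YtY = Y.T @ Y                          # shared across all users
        X = np.zeros((m, k))
        for u in range(m):
            c_u = 1.0 + alpha * R[u]           # diagonal of Cᵤ
            A = YtY + (Y.T * (c_u - 1.0)) @ Y + lam * np.eye(k)
            b = Y.T @ (c_u * P[u])             # YᵀCᵤpᵤ
            X[u] = np.linalg.solve(A, b)       # solve; don't form the inverse
        return X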
EXAMPLE FACTORIZATION
k = 3, λ = 2, α = 40, 10 iterations

        P                            Q = X·Yᵀ
  1 1 1 0 0            0.96  0.99  0.99  0.38  0.93
  0 0 1 0 0            0.44  0.39  0.98 -0.11  0.39
  0 1 0 1 1     ≈      0.70  0.99  0.42  0.98  0.98
  1 0 1 0 1            1.00  1.04  0.99  0.44  0.98
  0 0 0 1 0            0.11  0.51 -0.13  1.00  0.57
  1 1 0 0 0            0.97  1.00  0.68  0.47  0.91
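Putting the sketches together on the slides' example: by symmetry, the same solver updates Y with the roles of users and items swapped (transpose P and R, pass X in Y's place). The exact entries of Q depend on the random initialization, so expect the pattern shown above (values near 1 where pᵤᵢ = 1) rather than these exact numbers.

    import numpy as np

    R = np.array([[1, 4, 3, 0, 0],
                  [0, 0, 3, 0, 0],
                  [0, 4, 0, 3, 2],
                  [5, 0, 2, 0, 3],
                  [0, 0, 0, 5, 0],
                  [2, 4, 0, 0, 0]], dtype=float)
    P = (R > 0).astype(float)

    k, lam, alpha = 3, 2, 40
    Y = np.random.default_rng(0).standard_normal((P.shape[1], k))
    for _ in range(10):                                  # 10 iterations
        X = solve_user_factors(P, R, Y, alpha, lam)
        Y = solve_user_factors(P.T, R.T, X, alpha, lam)  # same solve, transposed
    Q = X @ Y.T                     # reconstruct; recommend the largest values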
FOLD-IN
• Need immediate, if approximate, updates for new data
• A new user u needs a new row Qᵤ = Xᵤ Yᵀ, and we have Pᵤ ≈ Qᵤ
• Compute Xᵤ via a right inverse: X Yᵀ(Yᵀ)⁻¹ = Q(Yᵀ)⁻¹, so X = Q(Yᵀ)⁻¹
• What is (Yᵀ)⁻¹? Note (YᵀY)(YᵀY)⁻¹ = I, which yields a right inverse of Yᵀ:
  Yᵀ(Y(YᵀY)⁻¹) = I
• So Xᵤ = Qᵤ Y(YᵀY)⁻¹, and therefore:
  Xᵤ ≈ Pᵤ Y(YᵀY)⁻¹
• Recommend as usual: Qᵤ = XᵤYᵀ
• For an existing user, instead add to the existing row Xᵤ (sketch below)
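A sketch of the fold-in using the right-inverse projection above; solving the k×k system avoids forming (YᵀY)⁻¹ explicitly. The new-user vector here is hypothetical.

    import numpy as np

    def fold_in_user(p_u, Y):
        # xᵤ ≈ pᵤ Y (YᵀY)⁻¹, computed via a k×k linear solve
        return np.linalg.solve(Y.T @ Y, Y.T @ p_u)

    p_new = np.array([1., 0., 1., 0., 0.])  # hypothetical new user's row of P
    x_new = fold_in_user(p_new, Y)          # Y from the batch factorization
    q_new = x_new @ Y.T                     # recommend as usual: largest values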
THIS IS MYRRIX
• Soft-launched
• Serving Layer available as an open source download
• Computation Layer available as a beta
• Ready on Amazon EC2 / EMR
• Full launch Q4 2012
• myrrix.com
APPENDIX
EXAMPLES

STACKOVERFLOW TAGS
• Recommend tags to questions
• Tag questions automatically, improve tag coverage
• 3.5M questions × 30K tags
• 4.3 hours × 5 machines on Amazon EMR
• $3.03 ≈ $0.08 per 100,000 recs

WIKIPEDIA LINKS
• Recommend new linked articles from existing links
• Propose missing, related links
• 2.5M articles × 1.8M articles
• 28 hours × 2 PCs on Apache Hadoop 1.0.3