Big, Practical Recommendations with Alternating Least Squares
Sean Owen • Apache Mahout / Myrrix.com
WHERE’S BIG LEARNING?
• Next up the stack: the Application layer (Analytics, Machine Learning)
• Like Apache Mahout, a common Big Data app today: clustering, recommenders, classifiers on Hadoop
• Free, open source; not mature
• Where’s commercialized Big Learning?
(Stack diagram: Storage → Database → Processing → Applications)
A RECOMMENDER SHOULD …
• Answer in real time
  – Ingest new data, now
  – Modify recommendations based on the newest data
  – No “cold start” for new data
• Scale horizontally
  – For queries per second
  – For size of data set
• Accept diverse input
  – Not just people and products
  – Not just explicit ratings: clicks, views, buys
  – Side information
• Be “pretty accurate”
NEED: 2-TIER ARCHITECTURE
• Real-time Serving Layer
  – Quick results based on a precomputed model
  – Incremental update
  – Partitionable for scale
• Batch Computation Layer
  – Builds the model
  – Scales out (on Hadoop?)
  – Asynchronous, occasional, long-lived runs
A PRACTICAL ALGORITHM
MATRIX FACTORIZATION BENEFITS
• Factor the user-item matrix into a user-feature matrix times a feature-item matrix
• Well understood in ML, as: Principal Component Analysis, Latent Semantic Indexing
• Several algorithms, like: Singular Value Decomposition, Alternating Least Squares
• Models intuition
• Factorization is batch parallelizable
• Reconstruction (recommendations) in low dimension is fast
• Allows projection of new data
  – Cold start solution
  – Approximate update solution
A PRACTICAL IMPLEMENTATION
ALTERNATING LEAST SQUARES BENEFITS
• Simple factorization: P ≈ X Yᵀ
• Approximate: X and Y are “skinny” (low-rank)
• Faster than the SVD: trivially parallel, iterative
• Dumber than the SVD: no singular values, no orthonormal basis
• Parallelizable by row, so very Hadoop-friendly
• Iterative: an OK answer fast, refined as long as desired
• Lends itself to a “binary” input model, with ratings acting as regularization instead
  – Sparseness / 0s no longer a problem
ALS ALGORITHM 1
• Input: (user, item, strength) tuples (a sketch in Python follows)
  – Anything you can quantify is input
  – Strength is positive
  – Many tuples per user-item
• R is the sparse user-item interaction matrix
  – rᵢⱼ = total strength of interaction between user i and item j

R =
  1 4 3 · ·
  · · 3 · ·
  · 4 · 3 2
  5 · 2 · 3
  · · · 5 ·
  2 4 · · ·
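A minimal sketch of assembling R from the input tuples, assuming 0-indexed integer IDs; dense NumPy stands in for the sparse structure a real implementation would use, and build_r is an illustrative name, not Myrrix's API.

    import numpy as np

    def build_r(tuples, n_users, n_items):
        # tuples of (user, item, strength); strength is positive, and
        # many tuples per user-item pair simply total up
        R = np.zeros((n_users, n_items))
        for user, item, strength in tuples:
            R[user, item] += strength
        return R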
ALS ALGORITHM 2
Follow “Collaborative Filtering for Implicit Feedback Datasets”www2.research.att.com/~yifanhu/PUB/cf.pdf
Construct “binary” matrix P 1 where R > 0 0 where R = 0
Factor P, not R R returns in regularization
Still sparse; implicit 0s fine
1 1 1 0 0
0 0 1 0 0
0 1 0 1 1
1 0 1 0 1
0 0 0 1 0
1 1 0 0 0 P
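Given R as in the sketch above, constructing P is one line of NumPy:

    # 1 where R > 0, 0 where R = 0
    P = (R > 0).astype(float)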
ALS ALGORITHM 3
• P is m × n; choose k ≪ m, n
• Factor P as Q = X Yᵀ, with Q ≈ P
  – X is m × k; Yᵀ is k × n
• Find the best approximation Q: minimize the L2 norm of the difference, ‖P − Q‖²
  – Minimal squared error: “least squares”
• Recommendations are the largest values in Q
ALS ALGORITHM 4
• Optimizing X and Y simultaneously is non-convex, hard
• If X or Y is fixed, it is a system of linear equations: convex, easy
• Initialize Y with random values
• Solve for X; fix X, solve for Y; repeat (“alternating”, sketched below)
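A minimal sketch of the alternation itself, before the weighting and regularization the next slides add: with one factor fixed, each row of the other is an independent least-squares fit. NumPy's lstsq stands in for the per-row linear systems; all names here are illustrative.

    import numpy as np

    def als_plain(P, k, n_iters=10, seed=0):
        # Factor P ≈ X · Yᵀ, with X (m×k) and Y (n×k), k ≪ m, n
        m, n = P.shape
        Y = np.random.default_rng(seed).standard_normal((n, k))  # random init
        for _ in range(n_iters):
            # Fix Y, solve for X: each row xᵤ is an independent least-squares fit
            X = np.linalg.lstsq(Y, P.T, rcond=None)[0].T
            # Fix X, solve for Y, and repeat ("alternating")
            Y = np.linalg.lstsq(X, P, rcond=None)[0].T
        return X, Y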
ALS ALGORITHM 5
• Define regularization weights cᵤᵢ = 1 + α·rᵤᵢ
• Minimize:
  Σ cᵤᵢ (pᵤᵢ − xᵤᵀyᵢ)² + λ (Σ‖xᵤ‖² + Σ‖yᵢ‖²)
• A simple least-squares regression objective, plus:
  – Squared-error terms weighted by strength: the penalty for not reconstructing a 1 is higher at “strong” associations
  – A standard L2 regularization term
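For concreteness, the objective above evaluates directly; a sketch assuming dense NumPy arrays and the symbols defined on this slide:

    import numpy as np

    def objective(P, R, X, Y, alpha, lam):
        C = 1.0 + alpha * R                            # weights cᵤᵢ = 1 + α·rᵤᵢ
        sq_err = (P - X @ Y.T) ** 2                    # (pᵤᵢ − xᵤᵀyᵢ)² for all u, i
        reg = lam * ((X ** 2).sum() + (Y ** 2).sum())  # λ(Σ‖xᵤ‖² + Σ‖yᵢ‖²)
        return (C * sq_err).sum() + reg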
ALS ALGORITHM 6
• With Y fixed, compute the optimal X; each row xᵤ is independent
• Define Cᵤ as the diagonal matrix of cᵤ (the user’s strength weights); then:
  xᵤ = (YᵀCᵤY + λI)⁻¹ YᵀCᵤpᵤ
• Compare to the simple least-squares regression solution (YᵀY)⁻¹Yᵀpᵤ
  – Adds the Tikhonov / ridge regression regularization term λI
  – Attaches the cᵤ weights to Yᵀ
• See the paper for how YᵀCᵤY is computed efficiently; skipping the engineering! (A sketch follows.)
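A sketch of the closed-form solve, row by row, following the formula above. The rearrangement YᵀCᵤY = YᵀY + Yᵀ(Cᵤ − I)Y is the paper's efficiency trick, since Cᵤ − I is zero wherever the user has no interactions. Dense NumPy again; this is not Myrrix's actual code.

    import numpy as np

    def solve_user_factors(P, R, Y, alpha, lam):
        # With Y fixed, each row xᵤ = (YᵀCᵤY + λI)⁻¹ YᵀCᵤpᵤ is independent
        m, k = P.shape[0], Y.shape[1]
        YtY = Y.T @ Y                          # shared across all users
        X = np.zeros((m, k))
        for u in range(m):
            c_u = 1.0 + alpha * R[u]           # diagonal of Cᵤ
            A = YtY + (Y.T * (c_u - 1.0)) @ Y + lam * np.eye(k)
            b = Y.T @ (c_u * P[u])             # YᵀCᵤpᵤ
            X[u] = np.linalg.solve(A, b)       # solve; don't form the inverse
        return X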
EXAMPLE FACTORIZATION
k = 3, λ = 2, α = 40, 10 iterations

        P                            Q = X·Yᵀ
  1 1 1 0 0            0.96  0.99  0.99  0.38  0.93
  0 0 1 0 0            0.44  0.39  0.98 -0.11  0.39
  0 1 0 1 1     ≈      0.70  0.99  0.42  0.98  0.98
  1 0 1 0 1            1.00  1.04  0.99  0.44  0.98
  0 0 0 1 0            0.11  0.51 -0.13  1.00  0.57
  1 1 0 0 0            0.97  1.00  0.68  0.47  0.91
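Putting the sketches together on the slides' example: by symmetry, the same solver updates Y with the roles of users and items swapped (transpose P and R, pass X in Y's place). The exact entries of Q depend on the random initialization, so expect the pattern shown above (values near 1 where pᵤᵢ = 1) rather than these exact numbers.

    import numpy as np

    R = np.array([[1, 4, 3, 0, 0],
                  [0, 0, 3, 0, 0],
                  [0, 4, 0, 3, 2],
                  [5, 0, 2, 0, 3],
                  [0, 0, 0, 5, 0],
                  [2, 4, 0, 0, 0]], dtype=float)
    P = (R > 0).astype(float)

    k, lam, alpha = 3, 2, 40
    Y = np.random.default_rng(0).standard_normal((P.shape[1], k))
    for _ in range(10):                                  # 10 iterations
        X = solve_user_factors(P, R, Y, alpha, lam)
        Y = solve_user_factors(P.T, R.T, X, alpha, lam)  # same solve, transposed
    Q = X @ Y.T                     # reconstruct; recommend the largest values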
FOLD-IN
• Need immediate, if approximate, updates for new data
• A new user u needs a new row Qᵤ = Xᵤ Yᵀ, and we have Pᵤ ≈ Qᵤ
• Compute Xᵤ via a right inverse: X Yᵀ(Yᵀ)⁻¹ = Q(Yᵀ)⁻¹, so X = Q(Yᵀ)⁻¹
• What is (Yᵀ)⁻¹? Note (YᵀY)(YᵀY)⁻¹ = I, which yields a right inverse of Yᵀ:
  Yᵀ(Y(YᵀY)⁻¹) = I
• So Xᵤ = Qᵤ Y(YᵀY)⁻¹, and therefore:
  Xᵤ ≈ Pᵤ Y(YᵀY)⁻¹
• Recommend as usual: Qᵤ = XᵤYᵀ
• For an existing user, instead add to the existing row Xᵤ (sketch below)
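A sketch of the fold-in using the right-inverse projection above; solving the k×k system avoids forming (YᵀY)⁻¹ explicitly. The new-user vector here is hypothetical.

    import numpy as np

    def fold_in_user(p_u, Y):
        # xᵤ ≈ pᵤ Y (YᵀY)⁻¹, computed via a k×k linear solve
        return np.linalg.solve(Y.T @ Y, Y.T @ p_u)

    p_new = np.array([1., 0., 1., 0., 0.])  # hypothetical new user's row of P
    x_new = fold_in_user(p_new, Y)          # Y from the batch factorization
    q_new = x_new @ Y.T                     # recommend as usual: largest values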
THIS IS MYRRIX
• Soft-launched
• Serving Layer available as an open source download
• Computation Layer available as a beta
• Ready on Amazon EC2 / EMR
• Full launch Q4 2012
• myrrix.com
APPENDIX
EXAMPLES

STACKOVERFLOW TAGS
• Recommend tags to questions
• Tag questions automatically, improve tag coverage
• 3.5M questions × 30K tags
• 4.3 hours × 5 machines on Amazon EMR
• $3.03 ≈ $0.08 per 100,000 recs

WIKIPEDIA LINKS
• Recommend new linked articles from existing links
• Propose missing, related links
• 2.5M articles × 1.8M articles
• 28 hours × 2 PCs on Apache Hadoop 1.0.3