practical data science workshop - recommendation systems - collaborative filtering - strata ny -...

Practical Data Science on Spark & Hadoop

Collaborative Filtering Recommendation Systems

Chris Fregly

Principal Data Solutions Engineer IBM Spark Technology Center

Outline ①  Introduction ②  Live, Interactive, Group Demo! ③  Approximations ④  Similarity ⑤  Recommendations ⑥  Building a Model ⑦ ML Pipelines ⑧  $1 Million Netflix Prize

Who am I?

Live, Interactive, Group Demo! ①  Navigate to sparkafterdark.com

②  Select 3 actresses and 3 actors

③  Wait for me to build the models

https://github.com/fluxcapacitor/pipeline -->

Bloom Filter

7

Approximate set

k-hashes on put/get

False positives

Used all through Spark

From Twitter’s Algebird

Count Min Sketch

8

Approximate counters Better than HashMap Low, fixed memory Known error bounds Large number of counters From Twitter’s Algebird Streaming example in Spark codebase

HyperLogLog

9

Approximate cardinality Approximate count distinct Low memory 1.5KB @ 2% error 10^9 elements! From Twitter’s Algebird Streaming example in Spark codebase countApproxDistinctByKey()

Monte Carlo Simulations

10

From Manhattan Project (A-bomb) Simulate movement of neutrons

Law of Large Numbers (LLN) Average of results of many trials Converge on expected value

SparkPi example in Spark codebase Pi # red dots / # total dots * 4

Demo!

Monte Carlo Simulation

Euclidean Similarity Linear measure Bias toward magnitude

Cosine Similarity Angle measure Corrects magnitude bias

Jaccard Similarity Set Intersection divided by Set Union Bias towards popularity

Log Likelihood Similarity Corrects popularity bias

Calculating Similarity “All-pairs similarity” “Pair-wise similarity” “Similarity join” Naïve impl: O(m*n^2); m=rows, n=cols Must minimize shuffle and computation

Minimizing Shuffle and Computation Approximate!

Reduce m (rows) Sampling Bucketing (aka. “Partitioning” or “Clustering”) Removing rows with sparsity below threshold (ie. inactive)

Reduce n (cols) Remove most frequent value (ie. 0) Remove least popular

Reduce m (rows): Sampling DIMSUM “Dimension Independent Matrix Square Using MR” Remove rows with low probability of similarity

RowMatrix.columnSimilarities()

Twitter 40% efficiency gain over naïve cosine similarity ->

Reduce m (rows): Bucketing LSH “Locality Sensitive Hashing” Split m into b buckets w/ similarity hash func() Requires pre-processing Compare items within buckets Comparison is parallelizable O(m*n^2) -> O(m*n/b*b^2) O(1.25E17) -> O(1.25E13); b=50

Reduce n (cols) Remove most frequent values Replace with (index,value) pairs O(m*n^2) -> O(m*nnz^2); nnz=number of non-zeros, Be sure to choose most frequent value – may not be 0!

Recommendation/ML Terminology User: User seeking recommendations Item: Item being recommended Explicit User Feedback: like or rating Implicit User Feedback: search, click, hover, view, scroll Instances: Rows of user feedback/input data Overfitting: Training a model too closely to the training data & hyperparameters Hold Out Split: Holding out some of the instances to avoid overfitting Features: Columns of instance rows (of feedback/input data) Cold Start Problem: Not enough data to personalize (new) Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations, etc) Model Evaluation: Compare predictions to actual values of hold out split

Features Dimensions: Alias for Features Binary Features: True or False Numeric Discrete Features: Integers Numeric Features: Real values Ordinal Features: Maintains order (S -> M -> L -> XL -> XXL) Temporal Features: Time-based (Time of Day, Binge Watching) Categorical Features: Finite, unique set of categories(NFL teams) Feature Engineering: Modify, reduce, combine features

Feature Engineering Dimension Reduction: Reduce num features or “feature space” Principle Component Analysis (PCA): Find principle features that describe the data

One-Hot Encoding: Convert categorical feature vals to 0’s, 1’s

Bears -> 1 Bears -> 1,0,0 49’ers -> 2 --> 49’ers -> 0,1,0 Steelers-> 3 Steelers-> 0,0,1

Non-Personalized Recommendations “Cold Start” Problem

Top K Aggregations

Summary Statistics

PageRank

Facebook Graph

Demo!

Top K Aggregations PageRank

Personalized Recommendations Collaborative Filtering User-to-Item Item-to-Item

Clustering (Similarity) Users Items

User-to-Item Collaborative Filtering Find similar users based on similarity function(s) Cosine similarity, etc

Recommend items that other similar users have chosen Exclude items that have already been chosen Rank items by num of similar users who have chosen

Alternating Least Squares Matrix Factorization -->

Matrix Factorization

Item-to-Item Collaborative Filtering Made famous by Amazon ~2003 Couldn’t scale traditional User-to-Item algos Offline: Generates ItemID::List[CustomerID] vectors Online: For each item in shopping cart, find similar items based on closest List[CustomerID] vector

User and Item Clustering (Similarity) Based on Similarity ie. Similar Profile/Description Text or Categories

LDA Topic, K-Means, Nearest Neighbor, Eigenfaces, PCA

Streaming K Means Clustering Initial set of k clusters with random centers

Incoming data: Assign to closest cluster: distance to center Update centers: minimize within-cluster-sum-of-squares

Half-life decay factor Reduce contribution of old data to half --> Measured in num batches or num data points

Eliminate dead clusters never assigned new data Split existing cluster and join with dead cluster -->

Demo!

Alternating Least Squares Matrix Factorization

Split Instance Data 3 Roles Model Training (80%) Model Validation (10%) Model Testing (10%) k-folds Cross Validation Divide instances into k sections Alternate each k section between 3 roles above

http://www.slideshare.net/SebastianRaschka/musicmood-20140912

Hyperparameter Selection Select sets of values for each hyperparameter Use GridSearch to find best combo to reduce error Avoid overfitting!

http://www.slideshare.net/ogrisel/strategies-and-tools-for-parallel-machine-learning-in-python

Evaluation Criteria Regression (Distance has meaning) Root Mean Square Error (RMSE) Mean Absolute Error (MAE) Categorical (Distance does not have meaning) Precision/Accuracy

ML Pipelines Inspired by scikit-learn Transformers transform() input for estimation (training) predict() new input

Estimators fit() a model to the transformed dataset (training)

Pipeline Chain everything together

$1 Million Netflix Prize October, 2006 --> Sept 2009 (3 years!!) Winning algorithm beat Netflix by 10.06% based on RMSE Ensemble of 500+ models Combined using Gradient Boosted Decision Trees Computationally intensive and impractical

Winning Algorithm Adjustments “Alice effect”: Alice tends to rate lower than the average user “Inception effect”: Inception is rate higher than average movie “Alice-Inception effect”: Combo of Alice and Inception Number of days since a user’s first rating Number of days since a movie’s first rating Number of people who have rated a movie A movie’s overall mean rating

Factor these out and find the baseline!

Thanks! Chris Fregly @cfregly

References ①  https://github.com/fluxcapacitor/pipeline ②  http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf ③  http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/ ④  http://spark.apache.org/docs/latest/ml-guide.html

practical data science workshop - recommendation systems - collaborative filtering - strata ny -...

Software