Recommender Systems with Apache Spark's ALS Function

Building a Recommender System in PySpark

Will Johnson - Uline - DePaul

LearnByMarketing.com

AGENDA
- RecSys
  * Basics
  * MF
  * Evaluation
  * Advanced
- PySpark
  * Basics
  * ALS

User Based Collaborative Filtering
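As a rough illustration of the user-based idea, the sketch below predicts a missing rating as a similarity-weighted average of other users' ratings. The toy `ratings` dictionary, the helper names, and the choice of cosine similarity are all assumptions made for this example; they are not taken from the talk.

```python
from math import sqrt

# Hypothetical toy data: {user: {item: rating}}
ratings = {
    "u1": {"A": 5.0, "B": 3.0, "C": 4.0},
    "u2": {"A": 4.0, "B": 3.0, "C": 5.0, "D": 4.0},
    "u3": {"A": 1.0, "B": 5.0, "D": 2.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of the other users' ratings for item."""
    num = 0.0
    den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine_similarity(ratings[user], their_ratings)
        num += sim * their_ratings[item]
        den += abs(sim)
    return num / den if den else None

# u1 has not rated D; the estimate leans toward u2, the more similar user.
print(predict("u1", "D"))
```

Item-based filtering follows the same pattern with the roles of users and items swapped.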


Item Based Collaborative Filtering

Matrix Factorization

Evaluation

$RMSE = \sqrt{\frac{1}{n} \sum (\text{Predicted} - \text{Actual})^2}$

$Precision = \frac{|hits_u|}{|RecoSet_u|}$

$Recall = \frac{|hits_u|}{|TestSet_u|}$
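The three metrics above can be sketched in plain Python. The function names and the sample inputs are hypothetical, chosen only to make the formulas concrete.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    pairs = list(pairs)
    return sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def precision_recall(reco_set, test_set):
    """Precision and recall for one user.

    reco_set: items recommended to the user
    test_set: items the user actually interacted with in the held-out data
    """
    reco_set, test_set = set(reco_set), set(test_set)
    hits = reco_set & test_set
    precision = len(hits) / len(reco_set) if reco_set else 0.0
    recall = len(hits) / len(test_set) if test_set else 0.0
    return precision, recall

# Hypothetical inputs
print(rmse([(3.0, 4.0), (5.0, 3.0)]))
print(precision_recall([1, 2, 3, 4], [2, 4, 6]))
```

Precision penalizes recommendations the user never touched, while recall penalizes held-out items the recommender missed.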

Expert Review: Novelty, Context

CRISP-DM

Data Understanding

movielens = sc.textFile("../in/ml-100k/u.data")

Data Understanding

movielens.first()
# u'196\t242\t3\t881250949'

movielens.count()
# 100,000

Data Understanding

clean_data = movielens.map(lambda x: x.split('\t'))

rate = clean_data.map(lambda y: int(y[2]))
rate.mean()  # 3.529863

users = clean_data.map(lambda y: int(y[0]))
users.distinct().count()  # 943

clean_data.map(lambda y: int(y[1])).distinct().count()  # 1,682

Data Preparation

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

mls = movielens.map(lambda l: l.split('\t'))
ratings = mls.map(lambda x: Rating(int(x[0]), int(x[1]), float(x[2])))
# e.g. Rating(user=196, product=242, rating=3.0)

Data Preparation

train, test = ratings.randomSplit([0.7,0.3],7856)

train.count()  # 70,005
test.count()   # 29,995

train.cache()
test.cache()
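The split sizes are close to 70/30 but not exact (70,005 and 29,995), which is what a per-record random assignment produces. The plain-Python function below is only an analogue of that behaviour under the assumption that each record is assigned by an independent random draw; it is not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random

def random_split(items, weights, seed):
    """Assign each item to a split by an independent random draw."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for idx, bound in enumerate(bounds):
            if draw < bound:
                splits[idx].append(item)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(item)
    return splits

train_ids, test_ids = random_split(range(100000), [0.7, 0.3], seed=7856)
print(len(train_ids), len(test_ids))
```

Fixing the seed makes the split reproducible, which matters when comparing models trained with different settings.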

Modeling

rank = 5  # Number of latent factors
numIterations = 10  # Number of ALS iterations

# Create the model on the training data
model = ALS.train(train, rank, numIterations)

Modeling / Evaluation

model.userFeatures()

model.productFeatures()

# For product X, find N users to sell to
model.recommendUsers(242, 100)

# For user Y, find N products to promote
model.recommendProducts(196, 10)

# Predict a single product for a single user
model.predict(196, 242)

# Predict for many users and many products
# Pre-processing: keep only the (user, product) pairs
pred_input = train.map(lambda x: (x[0], x[1]))
# e.g. (196, 242)

# Lots of predictions; returns Rating(user, product, prediction)
pred = model.predictAll(pred_input)
# e.g. Rating(user=894, product=1560, rating=3.845)
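In a matrix factorization model, a predicted rating is the dot product of a user's latent-factor vector and a product's latent-factor vector, which is what the feature sets returned by `userFeatures()` and `productFeatures()` feed into. The sketch below shows that arithmetic in plain Python; the factor values are made up for illustration and do not come from the trained model.

```python
def predict_rating(user_vec, product_vec):
    """Dot product of two latent-factor vectors of equal length (rank)."""
    if len(user_vec) != len(product_vec):
        raise ValueError("vectors must have the same rank")
    return sum(u * p for u, p in zip(user_vec, product_vec))

# Hypothetical rank-5 factors for user 196 and product 242
user_features = {196: [0.9, 0.1, 1.2, 0.4, 0.7]}
product_features = {242: [1.1, 0.3, 1.5, 0.2, 0.9]}

estimate = predict_rating(user_features[196], product_features[242])
print(estimate)
```

With `rank = 5`, every user and every product is described by five numbers, so the whole ratings matrix is approximated by two much smaller matrices.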

Evaluation

User  Item  Actual  Pred
196   242   3.0     3.91
186   302   3.0     3.29
22    377   1.0     1.09
244   51    2.0     3.66
298   474   4.0     4.11

TRAINING RMSE: 0.763

Evaluation

# Organize the data to make (user, product) the key
true_reorg = train.map(lambda x: ((x[0], x[1]), x[2]))
# e.g. ((196, 242), 3.0)
pred_reorg = pred.map(lambda x: ((x[0], x[1]), x[2]))

# Do the actual join
true_pred = true_reorg.join(pred_reorg)
# e.g. ((582, 1014), (4.0, 3.397))

from math import sqrt
MSE = true_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
RMSE = sqrt(MSE)
# Results in 0.7629908117414474
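The key-then-join pattern can be traced without a cluster. The sketch below repeats the same steps with plain dictionaries, using only the five sample rows from the Actual/Pred table; the variable names are illustrative, and the result covers those five rows only, so it will not match the full training RMSE.

```python
from math import sqrt

# (user, product, rating) rows taken from the sample table
actual = [(196, 242, 3.0), (186, 302, 3.0), (22, 377, 1.0),
          (244, 51, 2.0), (298, 474, 4.0)]
predicted = [(196, 242, 3.91), (186, 302, 3.29), (22, 377, 1.09),
             (244, 51, 3.66), (298, 474, 4.11)]

# Re-key both sides by (user, product), as the RDD map steps do
true_by_key = {(u, p): r for u, p, r in actual}
pred_by_key = {(u, p): r for u, p, r in predicted}

# Inner join: keep only keys present on both sides
joined = {k: (true_by_key[k], pred_by_key[k])
          for k in true_by_key if k in pred_by_key}

mse = sum((a - p) ** 2 for a, p in joined.values()) / len(joined)
sample_rmse = sqrt(mse)
print(sample_rmse)
```

The (244, 51) row dominates the error here, which shows how a single badly missed rating can move RMSE on a small sample.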

Evaluation

test_input = test.map(lambda x: (x[0], x[1]))
pred_test = model.predictAll(test_input)
test_reorg = test.map(lambda x: ((x[0], x[1]), x[2]))
pred_reorg = pred_test.map(lambda x: ((x[0], x[1]), x[2]))
test_pred = test_reorg.join(pred_reorg)
test_MSE = test_pred.map(lambda r: (r[1][0] - r[1][1])**2).mean()
test_RMSE = sqrt(test_MSE)

TEST RMSE: 1.0145

CRISP-DM

RecSys are Nearest Neighbors or MF Based

ALS is Implemented in Spark

rank = 5
numIterations = 10
# Create the model on the training data
model = ALS.train(train, rank, numIterations)
# Lots of predictions
pred = model.predictAll(pred_input)
# Examine model features
model.productFeatures()
# Save your model!
model.save(sc, "../out/ml-model")

Questions?
