TRANSCRIPT
Yelp Data Challenge - Discovering Latent Factors using Ratings and Reviews
Ahmed Al-Herz, Tharindu Mathew, Gen Nishida
Problem Statement
● Uncover hidden dimensions of ratings using both ratings and review text
combined.
● From the uncovered hidden dimensions, we can answer the following
questions:
○ What does a particular user care about regarding restaurants?
○ Which aspects should the restaurant improve in order to effectively
increase the rating?
○ Which restaurant is the best for a particular user?
User Preference & Restaurant Aspects
● A user's preference can be represented by a vector in which each component
corresponds to a particular aspect of the restaurant's features.
● A restaurant's aspects can be represented by a similar vector.
              Food   Place  Service  Price  Time  ...
User A        0.6    0.2    0.1      0.1    0.0
User B        0.1    0.1    0.0      0.7    0.1

              Food   Place  Service  Price  Time  ...
Restaurant A  0.9    0.4    0.9      0.02   0.3
Restaurant B  0.3    0.8    0.2      0.9    0.8
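Under this representation, how well a restaurant matches a user can be scored as the dot product of the two vectors. A minimal sketch, using the illustrative numbers from the tables above (this scoring rule is an assumption for illustration; the full model is described later):

```python
import numpy as np

# Vectors from the tables above; dimensions are
# [Food, Place, Service, Price, Time].
user_a = np.array([0.6, 0.2, 0.1, 0.1, 0.0])
user_b = np.array([0.1, 0.1, 0.0, 0.7, 0.1])
rest_a = np.array([0.9, 0.4, 0.9, 0.02, 0.3])
rest_b = np.array([0.3, 0.8, 0.2, 0.9, 0.8])

def affinity(user, restaurant):
    """Dot product of user preference and restaurant aspect vectors."""
    return float(user @ restaurant)

# User A (food-focused) matches Restaurant A (strong food score) best,
# while User B (price-focused) matches Restaurant B.
print(affinity(user_a, rest_a))  # 0.712, higher than affinity(user_a, rest_b)
print(affinity(user_b, rest_b))
```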
Previous Work
● Most previous work uses either ratings or review text, but not both.
○ Latent factor model [1] uses the ratings to predict the missing ratings, but
the extracted dimensions do not necessarily explain the variation present
in ratings.
○ Latent Dirichlet Allocation (LDA) [2] discovers hidden dimensions in text,
but it cannot predict the ratings.
[1] Koren, Y., Bell, R., Volinsky, C. Matrix factorization techniques for recommender systems, Computer (2009).
[2] Hoffman, M., Blei, D. Online learning for latent Dirichlet allocation. Neural Information Processing Systems (2010).
Baselines
● Simplest baseline
The overall average rating is used as the prediction.
● Non-negative matrix factorization (NMF)
A matrix is formed from the ratings such that the entry in the i-th row and j-th
column is the rating given by the i-th user to the j-th restaurant; NMF is then
used to predict the missing ratings.
● Latent factor model
Baseline              MSE     RMSE
Simplest baseline     1.475   1.214
Latent factor model   1.360   1.166
NMF                   10.95   3.31
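The NMF baseline can be sketched as follows. This is a minimal illustration with a toy rating matrix, assuming scikit-learn's NMF as the factorization routine and zeros standing in for missing ratings:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy ratings matrix: rows = users, columns = restaurants, 0 = missing.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Factor R ~ W @ H with k latent dimensions; the reconstruction
# fills in the missing (zero) entries with predicted ratings.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)   # user factors, shape (4, 2)
H = model.components_        # restaurant factors, shape (2, 4)
R_hat = W @ H

print(R_hat.round(2))        # predicted ratings, including missing cells
```

Note that treating missing ratings as zeros pulls the factorization toward zero, which may be one reason this baseline's MSE (10.95) is so high; a factorization that masks missing entries would be more faithful.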
Latent Dirichlet Allocation (LDA)
● Assume a fixed number of topics; each document is a mixture of these topics
● Each topic is a mixture of words, with a weight for each word
● LDA makes no attempt to interpret what each topic means
Ex:
● 30% milk, 10% meow, 20% kitten - Cat
● 35% bone, 20% bark, 10% chase - Dog
LDA Approach
1. Experiment with different numbers of topics, e.g., 50, 5, 10
2. Run the LDA algorithm and look at the topics generated
Ex:
1. 0.012*coffee + 0.010*breakfast + 0.009*bread + 0.009*dish +
0.009*eggs + 0.009*sausage
2. 0.020*chocolate + 0.018*ice + 0.017*cream + 0.017*dessert +
0.013*desert
3. Manually interpret the topics. Ex: topic 1 defines breakfast, topic 2 defines
dessert
4. For each review, create a preference vector θub per user u and business b
over the k topics, e.g., θub = [0.3, 0.2, 0.1, …., 0.2]
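The per-review topic vector of step 4 can be sketched like this. The sketch uses scikit-learn's LatentDirichletAllocation rather than gensim (which the project actually used), on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "reviews", one per user-business pair (illustrative text).
reviews = [
    "coffee breakfast eggs bacon toast",
    "chocolate ice cream dessert cake",
    "coffee eggs sausage breakfast bread",
    "dessert chocolate cake ice cream",
]

# Bag-of-words counts, then LDA with k topics.
k = 2
counts = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=k, random_state=0)
theta = lda.fit_transform(counts)  # shape (n_reviews, k); each row sums to ~1

# theta[r] is the preference vector over the k topics for review r.
print(theta.round(2))
```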
Results of LDA (gensim)

     Topic       Some of the extracted words
1    Breakfast   breakfast, toast, brunch, bacon
2    BBQ         bbq, brisket, pork, chicken, beef, flavor
3    Salad       salad, cheese, bacon, tomato, dressing
4    Fast food   burger, food, fries, drink
5    Coziness    place, people, service, sitting, friend
6    Bar         cocktails, time, server
7    Dessert     chocolate, ice cream, dessert, cake
...  ...         ...
20   Sushi       rolls, sushi, drink
Latent Factor Model
● Let b_i be the restaurant’s feature vector and w_u be the user’s preference vector.
Then, the rating s_ui by user u for restaurant i is predicted by

s_ui ≈ μ + β_u + β_i + w_u · b_i

where μ is the overall average rating and β_u and β_i are the user and restaurant biases.
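As a minimal sketch with illustrative numbers (none of the values below come from the real model), the prediction rule is:

```python
import numpy as np

def predict_rating(mu, beta_u, beta_i, w_u, b_i):
    """Latent factor prediction: global mean + biases + preference-feature match."""
    return mu + beta_u + beta_i + float(w_u @ b_i)

# Hypothetical values for illustration.
mu = 3.7                           # overall average rating
w_u = np.array([0.6, 0.2, 0.2])    # user preference vector
b_i = np.array([0.9, 0.4, 0.3])    # restaurant feature vector
s_hat = predict_rating(mu, beta_u=0.1, beta_i=-0.2, w_u=w_u, b_i=b_i)
print(s_hat)  # 4.28
```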
● The parameters are chosen by minimizing the regularized Mean Squared Error (MSE), i.e.,

min over {w_u, b_i, β_u, β_i} of Σ_(u,i) ( s_ui − (μ + β_u + β_i + w_u · b_i) )² + λ ( ‖w_u‖² + ‖b_i‖² + β_u² + β_i² )
Our Approach
● We extend the latent factor model by incorporating the review text through LDA.
● First, we compute rough estimates of b_i and w_u from the per-review LDA topic
vectors, e.g., by averaging them over the reviews of each restaurant and of each
user, respectively:

b_i ≈ (1 / |U_i|) Σ_(u ∈ U_i) θ_ui,   w_u ≈ (1 / |I_u|) Σ_(i ∈ I_u) θ_ui

where θ_ui is the LDA topic vector of user u’s review of restaurant i, U_i is the
set of users who reviewed restaurant i, and I_u is the set of restaurants reviewed
by user u.
● Next, we optimize b_i and w_u by minimizing the following MSE:

min over {w_u, b_i} of Σ_(u,i) ( s_ui − (μ + β_u + β_i + w_u · b_i) )² + λ ( ‖w_u‖² + ‖b_i‖² )

This objective is convex in b_i for fixed w_u and in w_u for fixed b_i, so we
minimize it by gradient descent.
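This optimization can be sketched with stochastic gradient steps on toy data. All data and hyperparameters below are hypothetical, and the plain L2 regularizer stands in for the full objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: observed (user, restaurant, rating) triples.
n_users, n_rest, k = 5, 4, 3
ratings = [(0, 1, 4.0), (1, 2, 2.0), (2, 0, 5.0), (3, 3, 3.0), (4, 1, 4.5)]
mu = np.mean([r for _, _, r in ratings])   # overall average rating
W = rng.random((n_users, k)) * 0.1         # user preference vectors w_u
B = rng.random((n_rest, k)) * 0.1          # restaurant feature vectors b_i
bu = np.zeros(n_users)                     # user biases
bi = np.zeros(n_rest)                      # restaurant biases

step, lam = 0.01, 0.1                      # hypothetical hyperparameters

for _ in range(2000):
    for u, i, s in ratings:
        err = (mu + bu[u] + bi[i] + W[u] @ B[i]) - s
        # Gradient steps on squared error plus L2 regularization.
        W[u] -= step * (2 * err * B[i] + 2 * lam * W[u])
        B[i] -= step * (2 * err * W[u] + 2 * lam * B[i])
        bu[u] -= step * (2 * err + 2 * lam * bu[u])
        bi[i] -= step * (2 * err + 2 * lam * bi[i])

mse = np.mean([(mu + bu[u] + bi[i] + W[u] @ B[i] - s) ** 2
               for u, i, s in ratings])
print(mse)  # small: the model fits the observed ratings
```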
Results
● w and b are optimized using the selected hyperparameters (stepSize = 0.00001,
mu = 2.0, and lambda = 2.0)

Dataset              MSE     RMSE
Training dataset     0.141   0.375
Validation dataset   0.684   0.827
Test dataset         0.661   0.813
● Up to a 96% performance improvement over the baselines (simplest baseline:
MSE 1.475, RMSE 1.214)
Applications
● Recommendation of restaurants to users
Top 3 predicted ratings for a user who has a significantly high weight on “Breakfast”:

Restaurant             Predicted rating
Z’s Greek              6.4
ZK Grill               6.0
New York Pizza Dept.   5.9

● Recommendations about how to effectively improve the restaurant business

Restaurant   Breakfast   BBQ    Fast food   Coziness   ...
Humble Pie   0.96        5.02   3.93        0.95       ...
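The recommendation step above can be sketched as ranking restaurants by predicted rating for one user. All restaurant names, vectors, and parameters below are illustrative, not model output:

```python
import numpy as np

# Hypothetical restaurant feature vectors over the topics
# [Breakfast, BBQ, Fast food, Coziness].
restaurants = {
    "Restaurant X": np.array([0.9, 0.1, 0.2, 0.5]),
    "Restaurant Y": np.array([0.1, 0.8, 0.3, 0.2]),
    "Restaurant Z": np.array([0.7, 0.2, 0.1, 0.8]),
}

# A user with a significantly high weight on "Breakfast".
w_u = np.array([0.8, 0.05, 0.05, 0.1])
mu, beta_u = 3.7, 0.0   # global mean and user bias (illustrative)

# Predict a rating per restaurant and keep the top 3.
scores = {name: mu + beta_u + float(w_u @ b) for name, b in restaurants.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print(top3)  # breakfast-heavy restaurants rank first
```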
Summary & Future Work
● We have proposed an approach that uses both ratings and review text to
uncover the latent factors of the user preference and restaurant feature
vectors.
● Our approach improves prediction accuracy over the baselines by up to 96%.
● Incorporate sentiment analysis based on local LDA, to account for different
sentiments within a single review
○ Ex: the food was great, but the service was bad
● Establish a method to evaluate the extracted hidden factors (e.g., human
computation)
Work Distribution
Team members Tasks
Ahmed Al-Herz ● Establish our objective function
● Analyze the convexity of our objective function
● Implement the simplest baseline
● Implement the non-negative matrix factorization
Tharindu Mathew ● Establish our objective function
● Implement LDA
● Interpret the extracted topics
● Evaluate sentiment analysis (not implemented)
Gen Nishida ● Establish our objective function
● Extract a denser dataset from the original Yelp dataset
● Implement the latent factor model
● Implement the objective function
● Implement the gradient descent