TRANSCRIPT
Yelp Data Challenge - Discovering Latent Factors using Ratings and Reviews
Ahmed Al-Herz, Tharindu Mathew, Gen Nishida
Problem Statement
● Uncover hidden dimensions of ratings using both ratings and review text
combined.
● From the uncovered hidden dimensions, we can answer the following
questions:
○ What does a particular user care about regarding restaurants?
○ Which aspects should the restaurant improve in order to effectively
increase the rating?
○ Which restaurant is the best for a particular user?
User Preference & Restaurant Aspects
● A user's preference can be represented by a vector in which each component
corresponds to a particular aspect of the restaurant's features.
● A restaurant's aspects can be represented by a similar vector.
              Food   Place  Service  Price  Time  ...
User A        0.6    0.2    0.1      0.1    0.0
User B        0.1    0.1    0.0      0.7    0.1

              Food   Place  Service  Price  Time  ...
Restaurant A  0.9    0.4    0.9      0.02   0.3
Restaurant B  0.3    0.8    0.2      0.9    0.8
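Under this representation, how well a restaurant matches a user can be scored as the dot product of the two vectors. A minimal sketch, using the illustrative numbers from the tables above (this scoring rule is an assumption for illustration; the full model is described later):

```python
import numpy as np

# Vectors from the tables above; dimensions are
# [Food, Place, Service, Price, Time].
user_a = np.array([0.6, 0.2, 0.1, 0.1, 0.0])
user_b = np.array([0.1, 0.1, 0.0, 0.7, 0.1])
rest_a = np.array([0.9, 0.4, 0.9, 0.02, 0.3])
rest_b = np.array([0.3, 0.8, 0.2, 0.9, 0.8])

def affinity(user, restaurant):
    """Dot product of user preference and restaurant aspect vectors."""
    return float(user @ restaurant)

# User A (food-focused) matches Restaurant A (strong food score) best,
# while User B (price-focused) matches Restaurant B.
print(affinity(user_a, rest_a))  # 0.712, higher than affinity(user_a, rest_b)
print(affinity(user_b, rest_b))
```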
Previous Work
● Most previous work uses either ratings or review text, but not both.
○ Latent factor model [1] uses the ratings to predict the missing ratings, but
the extracted dimensions do not necessarily explain the variation present
in ratings.
○ Latent Dirichlet Allocation (LDA) [2] discovers hidden dimensions in text,
but it cannot predict the ratings.
[1] Koren, Y., Bell, R., Volinsky, C. Matrix factorization techniques for recommender systems, Computer (2009).
[2] Hoffman, M., Blei, D. Online learning for latent Dirichlet allocation. Neural Information Processing Systems (2010).
Baselines
● Simplest baseline
The overall average rating is used as the prediction.
● Non-negative matrix factorization (NMF)
A matrix is formed from the ratings such that the entry in the i-th row and j-th
column is the rating given by the i-th user to the j-th restaurant; NMF is then
used to predict the missing ratings.
● Latent factor model
Baseline              MSE     RMSE
Simplest baseline     1.475   1.214
Latent factor model   1.360   1.166
NMF                   10.95   3.31
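The NMF baseline can be sketched as follows. This is a minimal illustration with a toy rating matrix, assuming scikit-learn's NMF as the factorization routine and zeros standing in for missing ratings:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy ratings matrix: rows = users, columns = restaurants, 0 = missing.
R = np.array([
    [5.0, 3.0, 0.0, 1.0],
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 5.0],
    [0.0, 1.0, 5.0, 4.0],
])

# Factor R ~ W @ H with k latent dimensions; the reconstruction
# fills in the missing (zero) entries with predicted ratings.
model = NMF(n_components=2, init="random", random_state=0, max_iter=500)
W = model.fit_transform(R)   # user factors, shape (4, 2)
H = model.components_        # restaurant factors, shape (2, 4)
R_hat = W @ H

print(R_hat.round(2))        # predicted ratings, including missing cells
```

Note that treating missing ratings as zeros pulls the factorization toward zero, which may be one reason this baseline's MSE (10.95) is so high; a factorization that masks missing entries would be more faithful.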
Latent Dirichlet Allocation (LDA)
● Assume a fixed number of topics; each document is a mixture of these topics
● Each topic is a mixture of words, with a weight for each word
● LDA makes no attempt to interpret what each topic means
Ex:
● 30% milk, 10% meow, 20% kitten - Cat
● 35% bone, 20% bark, 10% chase - Dog
LDA Approach
1. Experiment with different numbers of topics, e.g., 50, 5, 10
2. Run the LDA algorithm and look at the topics generated
Ex:
1. 0.012*coffee + 0.010*breakfast + 0.009*bread + 0.009*dish +
0.009*eggs + 0.009*sausage
2. 0.020*chocolate + 0.018*ice + 0.017*cream + 0.017*dessert +
0.013*desert
3. Manually interpret the topics. Ex: topic 1 defines breakfast, topic 2 defines
dessert
4. For each review, create a preference vector θub per user u and business b
over the k topics, e.g., θub = [0.3, 0.2, 0.1, …., 0.2]
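The per-review topic vector of step 4 can be sketched like this. The sketch uses scikit-learn's LatentDirichletAllocation rather than gensim (which the project actually used), on a hypothetical toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy "reviews", one per user-business pair (illustrative text).
reviews = [
    "coffee breakfast eggs bacon toast",
    "chocolate ice cream dessert cake",
    "coffee eggs sausage breakfast bread",
    "dessert chocolate cake ice cream",
]

# Bag-of-words counts, then LDA with k topics.
k = 2
counts = CountVectorizer().fit_transform(reviews)
lda = LatentDirichletAllocation(n_components=k, random_state=0)
theta = lda.fit_transform(counts)  # shape (n_reviews, k); each row sums to ~1

# theta[r] is the preference vector over the k topics for review r.
print(theta.round(2))
```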
Results of LDA (gensim)

     Topic       Some of the extracted words
1    Breakfast   breakfast, toast, brunch, bacon
2    BBQ         bbq, brisket, pork, chicken, beef, flavor
3    Salad       salad, cheese, bacon, tomato, dressing
4    Fast food   burger, food, fries, drink
5    Coziness    place, people, service, sitting, friend
6    Bar         cocktails, time, server
7    Dessert     chocolate, ice cream, dessert, cake
...  ...         ...
20   Sushi       rolls, sushi, drink
Latent Factor Model
● Let b_i be the restaurant’s feature vector and w_u be the user’s preference vector.
Then, the rating s_ui by user u for restaurant i is predicted by

s_ui ≈ μ + β_u + β_i + w_u · b_i

where μ is the overall average rating and β_u and β_i are the user and restaurant biases.
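As a minimal sketch with illustrative numbers (none of the values below come from the real model), the prediction rule is:

```python
import numpy as np

def predict_rating(mu, beta_u, beta_i, w_u, b_i):
    """Latent factor prediction: global mean + biases + preference-feature match."""
    return mu + beta_u + beta_i + float(w_u @ b_i)

# Hypothetical values for illustration.
mu = 3.7                           # overall average rating
w_u = np.array([0.6, 0.2, 0.2])    # user preference vector
b_i = np.array([0.9, 0.4, 0.3])    # restaurant feature vector
s_hat = predict_rating(mu, beta_u=0.1, beta_i=-0.2, w_u=w_u, b_i=b_i)
print(s_hat)  # 4.28
```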
● The parameters are chosen by minimizing the regularized Mean Squared Error (MSE), i.e.,

min over {w_u, b_i, β_u, β_i} of Σ_(u,i) ( s_ui − (μ + β_u + β_i + w_u · b_i) )² + λ ( ‖w_u‖² + ‖b_i‖² + β_u² + β_i² )
Our Approach
● We extend the latent factor model by incorporating the review text through LDA.
● First, we compute rough estimates of b_i and w_u from the per-review LDA topic
vectors, e.g., by averaging them over the reviews of each restaurant and of each
user, respectively:

b_i ≈ (1 / |U_i|) Σ_(u ∈ U_i) θ_ui,   w_u ≈ (1 / |I_u|) Σ_(i ∈ I_u) θ_ui

where θ_ui is the LDA topic vector of user u’s review of restaurant i, U_i is the
set of users who reviewed restaurant i, and I_u is the set of restaurants reviewed
by user u.
● Next, we optimize b_i and w_u by minimizing the following MSE:

min over {w_u, b_i} of Σ_(u,i) ( s_ui − (μ + β_u + β_i + w_u · b_i) )² + λ ( ‖w_u‖² + ‖b_i‖² )

This objective is convex in b_i for fixed w_u and in w_u for fixed b_i, so we
minimize it by gradient descent.
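This optimization can be sketched with stochastic gradient steps on toy data. All data and hyperparameters below are hypothetical, and the plain L2 regularizer stands in for the full objective:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: observed (user, restaurant, rating) triples.
n_users, n_rest, k = 5, 4, 3
ratings = [(0, 1, 4.0), (1, 2, 2.0), (2, 0, 5.0), (3, 3, 3.0), (4, 1, 4.5)]
mu = np.mean([r for _, _, r in ratings])   # overall average rating
W = rng.random((n_users, k)) * 0.1         # user preference vectors w_u
B = rng.random((n_rest, k)) * 0.1          # restaurant feature vectors b_i
bu = np.zeros(n_users)                     # user biases
bi = np.zeros(n_rest)                      # restaurant biases

step, lam = 0.01, 0.1                      # hypothetical hyperparameters

for _ in range(2000):
    for u, i, s in ratings:
        err = (mu + bu[u] + bi[i] + W[u] @ B[i]) - s
        # Gradient steps on squared error plus L2 regularization.
        W[u] -= step * (2 * err * B[i] + 2 * lam * W[u])
        B[i] -= step * (2 * err * W[u] + 2 * lam * B[i])
        bu[u] -= step * (2 * err + 2 * lam * bu[u])
        bi[i] -= step * (2 * err + 2 * lam * bi[i])

mse = np.mean([(mu + bu[u] + bi[i] + W[u] @ B[i] - s) ** 2
               for u, i, s in ratings])
print(mse)  # small: the model fits the observed ratings
```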
Results
● w and b are optimized using the selected hyperparameters (stepSize = 0.00001,
mu = 2.0, and lambda = 2.0)

Dataset              MSE     RMSE
Training dataset     0.141   0.375
Validation dataset   0.684   0.827
Test dataset         0.661   0.813
● Up to a 96% performance improvement over the baselines (simplest baseline:
MSE 1.475, RMSE 1.214)
Applications
● Recommendation of restaurants to users
Top 3 predicted ratings for a user who has a significantly high weight on “Breakfast”:

Restaurant             Predicted rating
Z’s Greek              6.4
ZK Grill               6.0
New York Pizza Dept.   5.9

● Recommendations about how to effectively improve the restaurant business

Restaurant   Breakfast   BBQ    Fast food   Coziness   ...
Humble Pie   0.96        5.02   3.93        0.95       ...
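The recommendation step above can be sketched as ranking restaurants by predicted rating for one user. All restaurant names, vectors, and parameters below are illustrative, not model output:

```python
import numpy as np

# Hypothetical restaurant feature vectors over the topics
# [Breakfast, BBQ, Fast food, Coziness].
restaurants = {
    "Restaurant X": np.array([0.9, 0.1, 0.2, 0.5]),
    "Restaurant Y": np.array([0.1, 0.8, 0.3, 0.2]),
    "Restaurant Z": np.array([0.7, 0.2, 0.1, 0.8]),
}

# A user with a significantly high weight on "Breakfast".
w_u = np.array([0.8, 0.05, 0.05, 0.1])
mu, beta_u = 3.7, 0.0   # global mean and user bias (illustrative)

# Predict a rating per restaurant and keep the top 3.
scores = {name: mu + beta_u + float(w_u @ b) for name, b in restaurants.items()}
top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print(top3)  # breakfast-heavy restaurants rank first
```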
Summary & Future Work
● We have proposed an approach that uses both ratings and review text to
uncover the latent factors of the user preference and restaurant feature
vectors.
● Our approach improves prediction accuracy over the baselines by up to 96%.
● Incorporate sentiment analysis based on local LDA, to account for different
sentiments within a single review
○ Ex: the food was great, but the service was bad
● Establish a method to evaluate the extracted hidden factors (e.g., human
computation)
Work Distribution
Team members Tasks
Ahmed Al-Herz ● Establish our objective function
● Analyze the convexity of our objective function
● Implement the simplest baseline
● Implement the non-negative matrix factorization
Tharindu Mathew ● Establish our objective function
● Implement LDA
● Interpret the extracted topics
● Evaluate sentiment analysis (not implemented)
Gen Nishida ● Establish our objective function
● Extract a denser dataset from the original Yelp dataset
● Implement the latent factor model
● Implement the objective function
● Implement the gradient descent