yelp data challenge - discovering latent factors using ratings and reviews

Post on 18-Jul-2015


Yelp Data Challenge - Discovering Latent Factors using Ratings and Reviews

Ahmed Al-Herz, Tharindu Mathew, Gen Nishida

Problem Statement

● Uncover hidden dimensions of ratings by combining the ratings with the review text.

● From the uncovered hidden dimensions, we can answer the following questions:

○ What does a particular user care about in a restaurant?

○ Which aspects should a restaurant improve in order to raise its rating most effectively?

○ Which restaurant is the best for a particular user?

User Preference & Restaurant Aspects

● The user preference can be represented by a vector in which each component corresponds to a particular aspect of a restaurant.

● The restaurant aspects can be represented by a similar vector.

               Food   Place   Service   Price   Time   ...
User A         0.6    0.2     0.1       0.1     0.0
User B         0.1    0.1     0.0       0.7     0.1

               Food   Place   Service   Price   Time   ...
Restaurant A   0.9    0.4     0.9       0.02    0.3
Restaurant B   0.3    0.8     0.2       0.9     0.8
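One way to read the two tables together is to score a user-restaurant pair by the dot product of the two vectors, which is the interaction term a latent factor model uses. The sketch below assumes that scoring rule and uses only the five aspect values shown in the tables; the function names `match_score` and `best_match` are made up for illustration.

```python
# Aspect order: Food, Place, Service, Price, Time (values from the tables above).
users = {
    "User A": [0.6, 0.2, 0.1, 0.1, 0.0],
    "User B": [0.1, 0.1, 0.0, 0.7, 0.1],
}
restaurants = {
    "Restaurant A": [0.9, 0.4, 0.9, 0.02, 0.3],
    "Restaurant B": [0.3, 0.8, 0.2, 0.9, 0.8],
}


def match_score(preference, aspects):
    """Dot product of a user preference vector and a restaurant aspect vector."""
    return sum(p * a for p, a in zip(preference, aspects))


def best_match(user):
    """Name of the restaurant with the highest match score for `user`."""
    return max(restaurants, key=lambda r: match_score(users[user], restaurants[r]))


for user in users:
    scores = {r: round(match_score(users[user], v), 3) for r, v in restaurants.items()}
    print(user, scores, "->", best_match(user))
```

With these numbers User A, who weights food most heavily, scores highest on Restaurant A (0.712 versus 0.45), while User B, who weights price most heavily, scores highest on Restaurant B (0.82 versus 0.174).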

Previous Work

● Most previous work uses either ratings or review text, but not both.

○ The latent factor model [1] uses the ratings to predict the missing ratings, but the extracted dimensions do not necessarily explain the variation present in the ratings.

○ Latent Dirichlet Allocation (LDA) [2] discovers hidden dimensions in text, but it cannot predict the ratings.

[1] Koren, Y., Bell, R., Volinsky, C. Matrix factorization techniques for recommender systems. Computer (2009).

[2] Hoffman, M., Blei, D., Bach, F. Online learning for latent Dirichlet allocation. Neural Information Processing Systems (2010).

Baselines

● Simplest baseline

The overall average rating is used as the prediction for every user-restaurant pair.

● Non-negative matrix factorization (NMF)

A matrix is formed from the ratings such that the entry in the i-th row and j-th column is the rating given by the i-th user to the j-th restaurant, and NMF is used to predict the missing ratings.

● Latent factor model

Baseline               MSE     RMSE
Simplest baseline      1.475   1.214
Latent factor model    1.360   1.166
NMF                    10.95   3.31
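The simplest baseline and the two error measures in the table can be sketched in a few lines. The ratings below are hypothetical stand-ins for the Yelp data, and the helper names `mse` and `rmse` are chosen for this example only.

```python
import math


def mse(predictions, targets):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def rmse(predictions, targets):
    """Root mean squared error, i.e. the square root of the MSE."""
    return math.sqrt(mse(predictions, targets))


# Hypothetical star ratings; the real experiment used the Yelp dataset.
train_ratings = [5, 4, 3, 5, 2, 4]
test_ratings = [4, 1, 5]

# Simplest baseline: predict the overall average training rating everywhere.
overall_average = sum(train_ratings) / len(train_ratings)
predictions = [overall_average] * len(test_ratings)

print("average:", round(overall_average, 3))
print("MSE:", round(mse(predictions, test_ratings), 3))
print("RMSE:", round(rmse(predictions, test_ratings), 3))
```

The RMSE column of the table is the square root of the MSE column in the same way; for example, the square root of 1.475 is approximately 1.214.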

Latent Dirichlet Allocation (LDA)

● Assume that there is a set number of topics and that each document is a mixture of those topics.

● Represent each topic as a mixture of words, with a weight for each word.

● LDA makes no attempt to interpret what each topic means.

Ex:

● 30% milk, 10% meow, 20% kitten - Cat

● 35% bone, 20% bark, 10% chase - Dog
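LDA is usually described as a generative model, and that view can make the Cat and Dog example concrete. The sketch below is not LDA inference; it only samples a document from a fixed topic mixture using the word weights from the example. The 80/20 mixture, the document length, and the function name `generate_document` are assumptions made for illustration.

```python
import random

# Topic-word weights from the example above. Each topic lists only three words,
# so the weights do not sum to 1; random.choices treats them as relative weights.
topics = {
    "Cat": {"milk": 0.30, "meow": 0.10, "kitten": 0.20},
    "Dog": {"bone": 0.35, "bark": 0.20, "chase": 0.10},
}


def generate_document(topic_mixture, length, rng):
    """Sample words: pick a topic from the mixture, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(topic_mixture), weights=list(topic_mixture.values()))[0]
        word_weights = topics[topic]
        words.append(rng.choices(list(word_weights), weights=list(word_weights.values()))[0])
    return words


rng = random.Random(0)
# A hypothetical document that is 80% about cats and 20% about dogs.
doc = generate_document({"Cat": 0.8, "Dog": 0.2}, length=1000, rng=rng)
cat_share = sum(word in topics["Cat"] for word in doc) / len(doc)
print(doc[:8], round(cat_share, 2))
```

Fitting LDA runs this process in reverse: given only the documents, it estimates the topic-word weights and each document's topic mixture.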

LDA Approach

1. Experiment with different numbers of topics. Ex: 5, 10, 50

2. Run the LDA algorithm and look at the topics generated.

Ex:

Topic 1: 0.012*coffee + 0.010*breakfast + 0.009*bread + 0.009*dish + 0.009*eggs + 0.009*sausage

Topic 2: 0.020*chocolate + 0.018*ice + 0.017*cream + 0.017*dessert + 0.013*desert

3. Interpret the topics manually. Ex: topic 1 describes breakfast, topic 2 describes dessert.

4. Create a preference vector θub per user per business for each review, representing k topics. Ex: θub = [0.3, 0.2, 0.1, …, 0.2]

Results of LDA (gensim)

#     Topic       Some of the extracted words
1     Breakfast   breakfast, toast, brunch, bacon
2     BBQ         bbq, brisket, pork, chicken, beef, flavor
3     Salad       salad, cheese, bacon, tomato, dressing
4     Fast food   burger, food, fries, drink
5     Coziness    place, people, service, sitting, friend
6     Bar         cocktails, time, server
7     Dessert     chocolate, ice cream, dessert, cake
...   ...         ...
20    Sushi       rolls, sushi, drink

Latent Factor Model

● Let $b_i$ be the restaurant’s feature vector and $w_u$ be the user’s preference vector. Then, the rating $s_{ui}$ by user $u$ to restaurant $i$ is predicted by

$$\hat{s}_{ui} = \alpha + \beta_u + \beta_i + w_u \cdot b_i$$

where $\alpha$ is the overall average rating and $\beta_u$ and $\beta_i$ are the user and restaurant biases.

● The parameters are chosen by minimizing the Mean Squared Error (MSE), i.e.,

$$\min_{\beta_u,\,\beta_i,\,w_u,\,b_i} \; \frac{1}{|T|} \sum_{(u,i) \in T} \left(\hat{s}_{ui} - s_{ui}\right)^2$$

where $T$ is the set of observed ratings.
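A latent factor model of this kind is commonly fitted by stochastic gradient descent on the squared error. The sketch below is one such fit on hypothetical data, without regularization; the names `alpha`, `beta_u`, and `beta_i` for the average and the biases, the number of latent dimensions `k`, and the step size are all choices made for this example.

```python
import random

# Hypothetical (user, restaurant, rating) triples standing in for Yelp data.
ratings = [
    (0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 5.0), (3, 0, 4.0), (3, 2, 2.0),
]
n_users, n_items, k = 4, 3, 2  # k latent dimensions (assumed)

rng = random.Random(0)
alpha = sum(s for _, _, s in ratings) / len(ratings)  # overall average rating
beta_u = [0.0] * n_users                               # user biases
beta_i = [0.0] * n_items                               # restaurant biases
w = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]  # w_u
b = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]  # b_i


def predict(u, i):
    """Predicted rating: average + user bias + restaurant bias + w_u . b_i."""
    return alpha + beta_u[u] + beta_i[i] + sum(x * y for x, y in zip(w[u], b[i]))


def mse():
    """Mean squared error over the observed ratings."""
    return sum((predict(u, i) - s) ** 2 for u, i, s in ratings) / len(ratings)


# Error of predicting the overall average everywhere, for comparison.
baseline_mse = sum((alpha - s) ** 2 for _, _, s in ratings) / len(ratings)

step = 0.02  # learning rate (assumed)
for epoch in range(500):
    for u, i, s in ratings:
        err = predict(u, i) - s
        beta_u[u] -= step * 2 * err
        beta_i[i] -= step * 2 * err
        for f in range(k):
            wu_f, bi_f = w[u][f], b[i][f]  # use the old values in both updates
            w[u][f] -= step * 2 * err * bi_f
            b[i][f] -= step * 2 * err * wu_f

print("baseline MSE:", round(baseline_mse, 3), "fitted MSE:", round(mse(), 3))
```

On real data the error would be measured on held-out ratings rather than on the training triples, as in the results tables of this deck.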

Our Approach

● We extend the latent factor model by combining the review text data using LDA.

● First, we compute a rough estimate of $b_i$ and $w_u$ (the equation is missing from this transcript).

● Next, we optimize $b_i$ and $w_u$ by minimizing an MSE objective (the equation is missing from this transcript). This is convex with respect to $b_i$ and $w_u$, so we can obtain the global optimum by gradient descent.
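To illustrate why convexity matters for gradient descent, consider the simpler problem of fitting one user's vector while the restaurant vectors are held fixed: the squared error is then an ordinary least-squares problem, which is convex. The sketch below shows only that property on a toy problem. It is not the authors' objective, whose exact form is not shown here; the ratings and the step size are hypothetical.

```python
# Restaurant aspect vectors b_i, held fixed (values from the earlier example table).
b = [
    [0.9, 0.4, 0.9, 0.02, 0.3],
    [0.3, 0.8, 0.2, 0.9, 0.8],
]
s = [4.0, 2.0]  # hypothetical ratings this user gave to the two restaurants


def mse(w):
    """Mean squared error of the predictions w . b_i against the ratings."""
    return sum((sum(x * y for x, y in zip(w, bi)) - si) ** 2
               for bi, si in zip(b, s)) / len(s)


def gradient(w):
    """Gradient of the MSE with respect to w."""
    grad = [0.0] * len(w)
    for bi, si in zip(b, s):
        err = sum(x * y for x, y in zip(w, bi)) - si
        for f, bif in enumerate(bi):
            grad[f] += 2 * err * bif / len(s)
    return grad


w = [0.0] * 5      # starting point
step_size = 0.1    # assumed for this toy problem
history = []
for _ in range(500):
    history.append(mse(w))
    w = [wf - step_size * g for wf, g in zip(w, gradient(w))]

print("initial MSE:", history[0], "final MSE:", mse(w))
```

Because the toy objective is convex, the result does not depend on the starting point beyond the speed of convergence; a step size that is too large would still make the iteration diverge.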

Results

● Optimize w and b using the selected hyperparameters (stepSize=0.00001, mu=2.0, and lambda=2.0)

Dataset              MSE     RMSE
Training dataset     0.141   0.375
Validation dataset   0.684   0.827
Test dataset         0.661   0.813


Performance improves over the baselines by up to 96%. (Simplest baseline: MSE 1.475, RMSE 1.214)
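The slides do not state how the improvement figure was computed. One plausible definition, assumed here, is the relative reduction in test MSE, (baseline - ours) / baseline. The sketch below applies that definition to the MSE values reported in the baseline and results tables.

```python
# MSE values taken from the baseline and results tables in this deck.
baseline_mse = {
    "Simplest baseline": 1.475,
    "Latent factor model": 1.360,
    "NMF": 10.95,
}
test_mse = 0.661


def relative_reduction(baseline, ours):
    """Fraction by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline


reductions = {name: relative_reduction(value, test_mse)
              for name, value in baseline_mse.items()}
for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1%}")
```

Under this definition the reductions are about 55%, 51%, and 94% respectively. None of these reproduces the 96% figure exactly, so that number was likely computed on a different basis.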

Applications

● Recommendation of restaurants to users

Restaurant             Predicted rating
Z’s Greek              6.4
ZK Grill               6.0
New York Pizza Dept.   5.9

Top 3 ratings predicted for a user who has significantly high weight on “Breakfast”

● Recommendation about how to effectively improve the restaurant business

Restaurant   Breakfast   BBQ    Fast food   Coziness   ...
Humble Pie   0.96        5.02   3.93        0.95       ...
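The slides do not say how a restaurant's aspect weights are turned into advice. One simple reading, assumed here, is that the lowest-weighted aspects are the candidates for improvement. The sketch below ranks the four Humble Pie aspects shown in the deck; the function names are hypothetical.

```python
# Aspect weights for Humble Pie as given in the deck (only the four aspects
# shown there; the remaining columns are elided in the slides).
humble_pie = {"Breakfast": 0.96, "BBQ": 5.02, "Fast food": 3.93, "Coziness": 0.95}


def weakest_aspects(aspects, n=2):
    """Return the n aspect names with the lowest weights, weakest first."""
    return sorted(aspects, key=aspects.get)[:n]


def strongest_aspects(aspects, n=2):
    """Return the n aspect names with the highest weights, strongest first."""
    return sorted(aspects, key=aspects.get, reverse=True)[:n]


print("weakest:", weakest_aspects(humble_pie))
print("strongest:", strongest_aspects(humble_pie))
```

A fuller version would also weigh each aspect by how much the restaurant's reviewers care about it, since improving an aspect that few users value would change few predicted ratings.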

Summary & Future Work

● We have proposed an approach that uses both ratings and review text to uncover the latent factors of user preference and restaurant feature vectors.

● Our approach achieves better prediction accuracy than the baselines, by up to 96%.

● Can we incorporate sentiment analysis based on local LDA? Can we account for different sentiments within a single review?

○ Ex: food was great, but service was bad

● Establish a method to evaluate the extracted hidden factors (e.g., human computation).

Work Distribution

Team members and tasks

Ahmed Al-Herz

● Establish our objective function

● Analyze the convexity of our objective function

● Implement the simplest baseline

● Implement the non-negative matrix factorization

Tharindu Mathew

● Establish our objective function

● Implement LDA

● Interpret the extracted topics

● Evaluate sentiment analysis (not implemented)

Gen Nishida

● Establish our objective function

● Extract a denser dataset from the original Yelp dataset

● Implement the latent factor model

● Implement the objective function

● Implement the gradient descent
