yelp dataset challenge

Yelp Dataset Challenge Aparna Nanda Shivika Thapar Arnab Kumar Mishra Vishesh Tanksale Vraj Parikh

Upload: arnkmish

Post on 14-Apr-2017



TRANSCRIPT

Page 1: Yelp dataset challenge

Yelp Dataset Challenge

Aparna Nanda, Shivika Thapar, Arnab Kumar Mishra, Vishesh Tanksale, Vraj Parikh

Page 2: Yelp dataset challenge

Task 1 - Toolkit / API

Lucene Java API - querying index

MongoDB - Loading JSON files for easy access

Apache Spark - Fast large scale index creation

The Stanford NLP POS Tagging - Generating Effective Queries

Page 3: Yelp dataset challenge

Task 1 - Method / Algorithm

1. Create and index a corpus of business documents using Lucene.

Document -> (Business ID, Review Text, Tip Text, Category)

2. Take a new review/tip from the test set, preprocess it by applying Part-of-Speech tagging, and then query it against the index.

3. Rank the documents for the given query against the index using BM25 Similarity, LMDirichlet Similarity, etc.

4. Based on the top-ranked documents, rank the corresponding categories and assign the top 5 of these categories to the input review/tip.
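Step 4 can be sketched in Python as follows. The slides do not say how categories are ranked from the top documents, so summing retrieval scores per category is an assumption made here for illustration; the function name `top_categories` and the sample hits are hypothetical.

```python
from collections import defaultdict


def top_categories(ranked_hits, k=5):
    """Rank categories from the top-ranked business documents.

    ranked_hits: list of (retrieval_score, [categories]) pairs, one per hit.
    Assumption: a category's weight is the sum of the scores of the hits
    that carry it. The slides do not specify the aggregation rule.
    """
    totals = defaultdict(float)
    for score, categories in ranked_hits:
        for category in categories:
            totals[category] += score
    # Highest accumulated score first; ties broken alphabetically.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [category for category, _ in ranked[:k]]


# Hypothetical hits for one query review.
hits = [
    (7.2, ["Restaurants", "Indian"]),
    (6.8, ["Restaurants", "Buffets"]),
    (3.1, ["Gyms"]),
]
print(top_categories(hits))
```

A simple count of how many hits carry each category would be an equally plausible rule; only the `totals[category] += score` line would change.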

Page 4: Yelp dataset challenge

Task 1 - Evaluation Metrics

Precision = #(relevant items retrieved) / #(retrieved items)

Recall = #(relevant items retrieved) / #(relevant items)

BM25 Similarity

Language Model with Dirichlet Smoothing

Language Model with Jelinek Mercer Smoothing
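A minimal sketch of the precision and recall definitions above, applied to one review. It assumes the "retrieved items" are the categories assigned to the review and the "relevant items" are its true categories; the function name and sample data are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 predicted categories, 3 true categories, 2 in common.
predicted = ["Restaurants", "Indian", "Buffets", "Bars", "Nightlife"]
actual = ["Restaurants", "Indian", "Vegetarian"]
p, r = precision_recall(predicted, actual)
print(p, r)
```

Averaging these per-review values over the test set would give one figure per similarity model.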

Page 5: Yelp dataset challenge

Task 1 - Evaluation Results

Page 6: Yelp dataset challenge

Task 1 - Evaluation Results

Page 7: Yelp dataset challenge

Task 2 - The Challenge

Information Retrieval for City and Category wise comparison of businesses.

What is a business famous for? What is it that the customers like the most about a business? What is it that they don’t like?

Considered all businesses in a city to get consolidated city sentiments

Scope for improvement of a business by fetching negative remarks, complaints from reviews.

City wise comparison of businesses

Suggestions/recommendations based on the above findings

Page 8: Yelp dataset challenge

Task 2 - Toolkit / API

Java

Python

MongoDB

PyMongo

NLTK for chunking and POS tagging

Pattern for sentiment analysis

MatPlotLib for line graph plotting

Tableau

Git

Page 9: Yelp dataset challenge

Task 2 - Method / Algorithm

Filter the businesses so that reviews can be filtered for the selected business types (hospitals, Indian restaurants, gyms).

Filter the reviews by business city (Madison, Pittsburgh, Charlotte) and category, and generate the corresponding MongoDB collections.

Use the built collections to access the review texts one by one for further processing.

Perform sentiment analysis using the Pattern package on the review text to figure out which review is positive and which one is negative.
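The positive/negative split can be sketched as below. To keep the sketch self-contained, the polarity scorer is passed in as a function and a toy lexicon stands in for Pattern; the threshold of 0.0, the function names, and the lexicon are all assumptions, not details from the slides.

```python
def split_by_sentiment(reviews, polarity_of, threshold=0.0):
    """Split review texts into (positive, negative) lists by polarity.

    polarity_of: any callable mapping text to a score in [-1, 1].
    Assumption: scores above the threshold count as positive and
    everything else, including neutral text, counts as negative.
    """
    positive, negative = [], []
    for text in reviews:
        if polarity_of(text) > threshold:
            positive.append(text)
        else:
            negative.append(text)
    return positive, negative


# Toy stand-in for a real sentiment scorer (hypothetical lexicon).
TOY_LEXICON = {"great": 1.0, "wonderful": 1.0, "fresh": 0.5,
               "horrible": -1.0, "worst": -1.0, "understaffed": -0.5}


def toy_polarity(text):
    """Average the lexicon scores of the known words in the text."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


reviews = ["Wonderful stay and great staff.",
           "Horrible hospital, always understaffed."]
positive, negative = split_by_sentiment(reviews, toy_polarity)
print(positive, negative)
```

With Pattern installed, the scorer could presumably be swapped for `lambda t: sentiment(t)[0]` after `from pattern.en import sentiment`, since that function returns a (polarity, subjectivity) pair.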

Page 10: Yelp dataset challenge

Task 2 - Method / Algorithm

For each positive review, fetch phrases using the chunker from the NLTK package. We used {<JJ> <NN>|<JJ> <NNS>|<NN> <NNS>} to fetch chunks from the review. Ex.: wonderful stay, fresh towels, great staff, etc.

For each negative review, fetch phrases using the chunker from the NLTK package. We used {<NNP> <NN|NNP>|<RB> <JJ>|<JJ> <NNS>|<NN> <NN>} to fetch chunks from the review. Ex.: always understaffed, horrible hospital, worst service, parking lot, etc.

Add the good and bad phrases to the “good” and “bad” sets for a city’s business.

Compare each business’s strengths and weaknesses.

Compare businesses across cities.
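Both chunk grammars above only match pairs of adjacent POS tags, so the phrase extraction can be illustrated without NLTK. The sketch below is a plain-Python stand-in for the chunker, not the NLTK API; it assumes the tokens are already POS-tagged and that matches do not overlap. The names are hypothetical.

```python
# Tag pairs taken from the two chunk grammars on the slide.
POSITIVE_PATTERNS = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NNS")}
NEGATIVE_PATTERNS = {("NNP", "NN"), ("NNP", "NNP"), ("RB", "JJ"),
                     ("JJ", "NNS"), ("NN", "NN")}


def extract_phrases(tagged_tokens, patterns):
    """Collect two-word phrases whose tag pair is in `patterns`.

    tagged_tokens: list of (word, pos_tag) pairs.
    Scans left to right and skips past each match, so phrases never overlap.
    """
    phrases = []
    i = 0
    while i < len(tagged_tokens) - 1:
        (w1, t1), (w2, t2) = tagged_tokens[i], tagged_tokens[i + 1]
        if (t1, t2) in patterns:
            phrases.append(f"{w1} {w2}")
            i += 2  # consume both tokens of the matched phrase
        else:
            i += 1
    return phrases


# Hand-tagged examples (the tags are assumptions, not tagger output).
good = extract_phrases(
    [("wonderful", "JJ"), ("stay", "NN"), ("and", "CC"),
     ("fresh", "JJ"), ("towels", "NNS")],
    POSITIVE_PATTERNS)
bad = extract_phrases(
    [("always", "RB"), ("understaffed", "JJ"), ("parking", "NN"), ("lot", "NN")],
    NEGATIVE_PATTERNS)
print(good, bad)
```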

Page 11: Yelp dataset challenge

Pittsburgh Hospitals Negative Word Cloud

Page 12: Yelp dataset challenge

Charlotte Hospitals Negative Word Cloud

Page 13: Yelp dataset challenge

Pittsburgh Hospitals Positive Word Cloud

Page 14: Yelp dataset challenge

Charlotte Hospitals Positive Word Cloud

Page 15: Yelp dataset challenge

Madison Positive Word Cloud

Page 16: Yelp dataset challenge

Madison Negative Word Cloud

Page 17: Yelp dataset challenge

Positive and Negative Score

Page 18: Yelp dataset challenge

Attributes based comparison of cities

Page 19: Yelp dataset challenge

Task 2 - Evaluation Metrics

Percentage Error: Compare the average rating of the reviews against the average rating derived from the sentiment of the reviews for that category.

x: avg of ratings of reviews from data set

y: avg of ratings based on sentiment of reviews

Percentage Error = (|y - x| / x) * 100

Error greatly impacts our analysis and recommendations.
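As a quick check of the formula, the sketch below evaluates it on the Madison averages reported in the results slide (x = 4.0, y = 3.54). The function name is hypothetical.

```python
def percentage_error(x, y):
    """Percentage Error = (|y - x| / x) * 100.

    x: average star rating from the data set
    y: average rating derived from review sentiment
    """
    return abs(y - x) / x * 100


# Madison averages from the results slide.
madison_error = percentage_error(4.0, 3.54)
print(round(madison_error, 1))
```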

Page 20: Yelp dataset challenge

Task 2 - Evaluation Metrics

Accuracy: Estimate the rating of each review based on good and bad phrases.

Positiveness = #good phrases / #phrases

Scale Positiveness to a 5-point scale to get the rating.

Rating_Predicted = Positiveness * 5

Total correct predictions = #(|Actual_Rating - Rating_Predicted| <= Error)

Accuracy = (Total correct predictions/Total predictions)*100
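The accuracy computation can be sketched as below. The slides do not give the value of Error, so the tolerance of 1.0 star used here is an assumption, as are the function names, the sample data, and the choice to treat a review with no phrases as Positiveness 0.

```python
def predicted_rating(good_phrases, total_phrases):
    """Rating_Predicted = Positiveness * 5.

    Assumption: a review with no extracted phrases gets Positiveness 0.
    """
    positiveness = good_phrases / total_phrases if total_phrases else 0.0
    return positiveness * 5


def accuracy(samples, error=1.0):
    """Percentage of reviews whose predicted rating is within `error` stars.

    samples: list of (actual_rating, good_phrases, total_phrases) tuples.
    """
    correct = sum(
        1 for actual, good, total in samples
        if abs(actual - predicted_rating(good, total)) <= error
    )
    return correct / len(samples) * 100


# Hypothetical reviews: (actual stars, #good phrases, #phrases).
samples = [(5, 4, 4), (4, 4, 5), (1, 2, 4), (2, 1, 4)]
result = accuracy(samples, error=1.0)
print(result)
```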

Page 21: Yelp dataset challenge

Task 2 - Evaluation Result

City         Average Rating   Sentiment Average Rating   Error    Accuracy (%)
Madison      4.0              3.54                       11.5%    63.27
Charlotte    3.84             4.11                       7%       75.55
Pittsburgh   3.54             2.79                       21.1%    59.45

Page 22: Yelp dataset challenge

THANK YOU! :)