Yelp Dataset Challenge
Aparna Nanda, Shivika Thapar, Arnab Kumar Mishra, Vishesh Tanksale, Vraj Parikh
Task 1 - Toolkit / API
Lucene Java API - querying the index
MongoDB - loading JSON files for easy access
Apache Spark - fast, large-scale index creation
Stanford NLP POS tagging - generating effective queries
Task 1 - Method / Algorithm
1. Create and index a corpus of business documents using Lucene. Document -> (Business ID, Review Text, Tip Text, Category)
2. Take in a new review/tip from the test set, preprocess it by applying part-of-speech tagging, and query it against the index.
3. Rank the documents for the given query using BM25 similarity, LMDirichlet similarity, etc.
4. Based on the top-ranked documents, rank the corresponding categories and assign the top 5 categories to the input review/tip. (A minimal sketch of these four steps follows below.)
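The slides name Lucene (Java) for indexing and ranking; as an illustration only, the sketch below reproduces the four steps in Python, with the rank_bm25 package standing in for Lucene's BM25 similarity and NLTK's tagger standing in for the Stanford POS tagger. All documents, names, and the top-10 cutoff are hypothetical.

```python
# Hedged sketch of the Task 1 pipeline (not the project's actual code).
from collections import Counter

import nltk
from rank_bm25 import BM25Okapi

# Step 1: corpus of (business_id, text, categories) "documents".
docs = [
    ("b1", "great sushi and fresh sashimi", ["Japanese", "Restaurants"]),
    ("b2", "friendly staff and quick oil change", ["Automotive"]),
    ("b3", "amazing ramen broth, long wait", ["Japanese", "Noodles"]),
]
tokenized = [text.lower().split() for _, text, _ in docs]
bm25 = BM25Okapi(tokenized)

# Step 2: build a query from a new review by keeping nouns/adjectives only.
review = "The sushi rolls were fresh and the service was friendly"
tagged = nltk.pos_tag(nltk.word_tokenize(review.lower()))
query = [word for word, tag in tagged if tag.startswith(("NN", "JJ"))]

# Step 3: BM25-rank all documents for the query.
scores = bm25.get_scores(query)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Step 4: vote the categories of the top-ranked documents, keep the top 5.
votes = Counter(cat for i in ranked[:10] for cat in docs[i][2])
print([cat for cat, _count in votes.most_common(5)])
```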
Task 1 - Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
Similarity functions compared:
BM25 similarity
Language model with Dirichlet smoothing
Language model with Jelinek-Mercer smoothing
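For reference, these are the standard textbook forms of the three ranking functions; the slides do not spell them out, and Lucene's implementations may parameterize them slightly differently.

```latex
% BM25, with term frequency f(q_i, D), document length |D|,
% average length avgdl, and free parameters k_1 and b:
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

% Language model with Dirichlet smoothing (parameter \mu),
% where p(w \mid C) is the collection language model:
p(w \mid D) = \frac{f(w, D) + \mu\, p(w \mid C)}{|D| + \mu}

% Language model with Jelinek-Mercer smoothing (parameter \lambda):
p(w \mid D) = (1 - \lambda)\,\frac{f(w, D)}{|D|} + \lambda\, p(w \mid C)
```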
Task 1 - Evaluation Results
Task 2 - The Challenge
Information retrieval for city-wise and category-wise comparison of businesses.
What is a business famous for? What do customers like most about it? What do they dislike?
Consider all businesses in a city to get consolidated city sentiments.
Surface scope for improvement of a business by fetching negative remarks and complaints from its reviews.
Compare businesses city by city.
Make suggestions/recommendations based on the above findings.
Task 2 - Toolkit / API
Java
Python
MongoDB
PyMongo
NLTK for chunking and POS tagging
Pattern for sentiment analysis
Matplotlib for line-graph plotting
Tableau
Git
Task 2 - Method / Algorithm
Filter the businesses down to the selected business types (hospitals, Indian restaurants, gyms).
Filter the reviews by business city (Madison, Pittsburgh, Charlotte) and category, and generate the corresponding MongoDB collections.
Use the built collections to access the review texts one by one for further processing.
Run sentiment analysis with the Pattern package on each review text to decide which reviews are positive and which are negative (sketched below).
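A minimal sketch of the filtering-plus-sentiment step, assuming a pre-filtered collection as the slides describe. The database, collection, and field names (yelp.reviews, "city", "categories", "text") are illustrative, not the project's actual schema.

```python
# Hedged sketch: pull filtered reviews from MongoDB, bucket by sentiment.
from pymongo import MongoClient
from pattern.en import sentiment  # returns (polarity, subjectivity)

client = MongoClient()
reviews = client.yelp.reviews  # assumed pre-built, pre-filtered collection

positive, negative = [], []
query = {"city": "Madison", "categories": "Hospitals"}
for review in reviews.find(query):
    polarity, _subjectivity = sentiment(review["text"])
    # polarity ranges over [-1, 1]; its sign decides the bucket.
    (positive if polarity >= 0 else negative).append(review["text"])

print(len(positive), "positive,", len(negative), "negative")
```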
Task 2 - Method / Algorithm (continued)
For each positive review, fetch phrases using the chunker from the NLTK package. We used {<JJ> <NN>|<JJ> <NNS>|<NN> <NNS>} to fetch chunks from the review, e.g. wonderful stay, fresh towels, great staff (see the sketch after this list).
For each negative review, fetch phrases using the chunker from the NLTK package. We used {<NNP> <NN|NNP>|<RB> <JJ>|<JJ> <NNS>|<NN> <NN>} to fetch chunks from the review, e.g. always understaffed, horrible hospital, worst service, parking lot.
Add the good and bad phrases to the "good" and "bad" sets for a city's businesses.
Compare each business's strengths and weaknesses.
Compare businesses across cities.
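A sketch of the chunking step using the positive-review grammar quoted on the slide; NLTK's RegexpParser is the regular-expression chunker the toolkit slide names. The chunk label "GOOD" and the sample review are illustrative.

```python
# Hedged sketch: extract adjective-noun style "good" phrases from a review.
import nltk

GOOD_GRAMMAR = "GOOD: {<JJ> <NN>|<JJ> <NNS>|<NN> <NNS>}"  # grammar from the slide
chunker = nltk.RegexpParser(GOOD_GRAMMAR)

def good_phrases(text):
    # POS-tag the review, then collect the words under each GOOD chunk.
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, _tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "GOOD")]

print(good_phrases("wonderful stay, fresh towels and a great staff"))
# expected (tagger permitting): ['wonderful stay', 'fresh towels', 'great staff']
```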
Pittsburgh Hospitals Negative Word Cloud
Charlotte Hospitals Negative Word Cloud
Pittsburgh Hospitals Positive Word Cloud
Charlotte Hospitals Positive Word Cloud
Madison Positive Word Cloud
Madison Negative Word Cloud
Positive and Negative Score
Attributes-based comparison of cities
Task 2 - Evaluation Metrics
Percentage Error: compare the average star rating of the reviews against the average rating derived from the sentiment of the reviews for that category.
x: average of review ratings from the data set
y: average of ratings based on review sentiment
Percentage Error = (|y - x| / x) * 100
For example, for Madison in the results below, x = 4.0 and y = 3.54, giving (|3.54 - 4.0| / 4.0) * 100 = 11.5%.
The error greatly impacts our analysis and recommendations.
Task 2 - Evaluation Metrics
Accuracy: estimate the rating of each review from its good and bad phrases (a sketch of the computation follows).
Positiveness = #good phrases / #phrases
Scale Positiveness to the 5-star range to get the predicted rating:
Rating_Predicted = Positiveness * 5
Total correct predictions = #(|Actual_Rating - Rating_Predicted| <= Error)
Accuracy = (Total correct predictions / Total predictions) * 100
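A direct transcription of the accuracy metric above into Python. The toy data and the tolerance value are assumptions; the slides do not state what threshold "Error" was set to.

```python
# Hedged sketch of the accuracy metric defined on this slide.
def accuracy(reviews, error=1.0):
    """reviews: (actual star rating, #good phrases, #bad phrases) per review."""
    correct = 0
    for actual_rating, n_good, n_bad in reviews:
        positiveness = n_good / (n_good + n_bad)  # fraction of good phrases
        predicted = positiveness * 5              # scale to 5-star range
        if abs(actual_rating - predicted) <= error:
            correct += 1
    return correct / len(reviews) * 100

# Toy data: (actual stars, #good phrases, #bad phrases).
print(accuracy([(4, 3, 1), (2, 1, 4), (5, 6, 0)]))  # 100.0 on this toy set
```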
Task 2 - Evaluation Results

City       | Average Rating | Sentiment Average Rating | Error | Accuracy
Madison    | 4.0            | 3.54                     | 11.5% | 63.27%
Charlotte  | 3.84           | 4.11                     | 7%    | 75.55%
Pittsburgh | 3.54           | 2.79                     | 21.1% | 59.45%
THANK YOU! :)