Yelp Dataset Challenge
Aparna Nanda, Shivika Thapar, Arnab Kumar Mishra, Vishesh Tanksale, Vraj Parikh
Task 1 - Toolkit / API
Lucene Java API - querying the index
MongoDB - loading JSON files for easy access
Apache Spark - fast, large-scale index creation
Stanford NLP POS tagging - generating effective queries
Task 1 - Method / Algorithm
1. Create and index a corpus of business documents using Lucene. Document -> (Business ID, Review Text, Tip Text, Category)
2. Take in a new review/tip from the test set, preprocess it by applying part-of-speech tagging, and query it against the index.
3. Rank the documents for the given query using BM25 similarity, LMDirichlet similarity, etc.
4. Based on the top-ranked documents, rank the corresponding categories and assign the top 5 categories to the input review/tip. (A minimal sketch of these four steps follows below.)
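The slides name Lucene (Java) for indexing and ranking; as an illustration only, the sketch below reproduces the four steps in Python, with the rank_bm25 package standing in for Lucene's BM25 similarity and NLTK's tagger standing in for the Stanford POS tagger. All documents, names, and the top-10 cutoff are hypothetical.

```python
# Hedged sketch of the Task 1 pipeline (not the project's actual code).
from collections import Counter

import nltk
from rank_bm25 import BM25Okapi

# Step 1: corpus of (business_id, text, categories) "documents".
docs = [
    ("b1", "great sushi and fresh sashimi", ["Japanese", "Restaurants"]),
    ("b2", "friendly staff and quick oil change", ["Automotive"]),
    ("b3", "amazing ramen broth, long wait", ["Japanese", "Noodles"]),
]
tokenized = [text.lower().split() for _, text, _ in docs]
bm25 = BM25Okapi(tokenized)

# Step 2: build a query from a new review by keeping nouns/adjectives only.
review = "The sushi rolls were fresh and the service was friendly"
tagged = nltk.pos_tag(nltk.word_tokenize(review.lower()))
query = [word for word, tag in tagged if tag.startswith(("NN", "JJ"))]

# Step 3: BM25-rank all documents for the query.
scores = bm25.get_scores(query)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Step 4: vote the categories of the top-ranked documents, keep the top 5.
votes = Counter(cat for i in ranked[:10] for cat in docs[i][2])
print([cat for cat, _count in votes.most_common(5)])
```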
Task 1 - Evaluation Metrics
Precision = #(relevant items retrieved) / #(retrieved items)
Recall = #(relevant items retrieved) / #(relevant items)
Similarity functions compared:
BM25 similarity
Language model with Dirichlet smoothing
Language model with Jelinek-Mercer smoothing
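For reference, these are the standard textbook forms of the three ranking functions; the slides do not spell them out, and Lucene's implementations may parameterize them slightly differently.

```latex
% BM25, with term frequency f(q_i, D), document length |D|,
% average length avgdl, and free parameters k_1 and b:
\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

% Language model with Dirichlet smoothing (parameter \mu),
% where p(w \mid C) is the collection language model:
p(w \mid D) = \frac{f(w, D) + \mu\, p(w \mid C)}{|D| + \mu}

% Language model with Jelinek-Mercer smoothing (parameter \lambda):
p(w \mid D) = (1 - \lambda)\,\frac{f(w, D)}{|D|} + \lambda\, p(w \mid C)
```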
Task 1 - Evaluation Results
Task 2 - The Challenge
Information retrieval for city-wise and category-wise comparison of businesses.
What is a business famous for? What do customers like most about it? What do they dislike?
Consider all businesses in a city to get consolidated city sentiments.
Surface scope for improvement of a business by fetching negative remarks and complaints from its reviews.
Compare businesses city by city.
Make suggestions/recommendations based on the above findings.
Task 2 - Toolkit / API
Java
Python
MongoDB
PyMongo
NLTK for chunking and POS tagging
Pattern for sentiment analysis
Matplotlib for line-graph plotting
Tableau
Git
Task 2 - Method / Algorithm
Filter the businesses down to the selected business types (hospitals, Indian restaurants, gyms).
Filter the reviews by business city (Madison, Pittsburgh, Charlotte) and category, and generate the corresponding MongoDB collections.
Use the built collections to access the review texts one by one for further processing.
Run sentiment analysis with the Pattern package on each review text to decide which reviews are positive and which are negative (sketched below).
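A minimal sketch of the filtering-plus-sentiment step, assuming a pre-filtered collection as the slides describe. The database, collection, and field names (yelp.reviews, "city", "categories", "text") are illustrative, not the project's actual schema.

```python
# Hedged sketch: pull filtered reviews from MongoDB, bucket by sentiment.
from pymongo import MongoClient
from pattern.en import sentiment  # returns (polarity, subjectivity)

client = MongoClient()
reviews = client.yelp.reviews  # assumed pre-built, pre-filtered collection

positive, negative = [], []
query = {"city": "Madison", "categories": "Hospitals"}
for review in reviews.find(query):
    polarity, _subjectivity = sentiment(review["text"])
    # polarity ranges over [-1, 1]; its sign decides the bucket.
    (positive if polarity >= 0 else negative).append(review["text"])

print(len(positive), "positive,", len(negative), "negative")
```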
Task 2 - Method / Algorithm (continued)
For each positive review, fetch phrases using the chunker from the NLTK package. We used {<JJ> <NN>|<JJ> <NNS>|<NN> <NNS>} to fetch chunks from the review, e.g. wonderful stay, fresh towels, great staff (see the sketch after this list).
For each negative review, fetch phrases using the chunker from the NLTK package. We used {<NNP> <NN|NNP>|<RB> <JJ>|<JJ> <NNS>|<NN> <NN>} to fetch chunks from the review, e.g. always understaffed, horrible hospital, worst service, parking lot.
Add the good and bad phrases to the "good" and "bad" sets for a city's businesses.
Compare each business's strengths and weaknesses.
Compare businesses across cities.
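A sketch of the chunking step using the positive-review grammar quoted on the slide; NLTK's RegexpParser is the regular-expression chunker the toolkit slide names. The chunk label "GOOD" and the sample review are illustrative.

```python
# Hedged sketch: extract adjective-noun style "good" phrases from a review.
import nltk

GOOD_GRAMMAR = "GOOD: {<JJ> <NN>|<JJ> <NNS>|<NN> <NNS>}"  # grammar from the slide
chunker = nltk.RegexpParser(GOOD_GRAMMAR)

def good_phrases(text):
    # POS-tag the review, then collect the words under each GOOD chunk.
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunker.parse(tagged)
    return [" ".join(word for word, _tag in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "GOOD")]

print(good_phrases("wonderful stay, fresh towels and a great staff"))
# expected (tagger permitting): ['wonderful stay', 'fresh towels', 'great staff']
```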
Pittsburgh Hospitals Negative Word Cloud
Charlotte Hospitals Negative Word Cloud
Pittsburgh Hospitals Positive Word Cloud
Charlotte Hospitals Positive Word Cloud
Madison Positive Word Cloud
Madison Negative Word Cloud
Positive and Negative Score
Attributes-based comparison of cities
Task 2 - Evaluation Metrics
Percentage Error: compare the average star rating of the reviews against the average rating derived from the sentiment of the reviews for that category.
x: average of review ratings from the data set
y: average of ratings based on review sentiment
Percentage Error = (|y - x| / x) * 100
For example, for Madison in the results below, x = 4.0 and y = 3.54, giving (|3.54 - 4.0| / 4.0) * 100 = 11.5%.
The error greatly impacts our analysis and recommendations.
Task 2 - Evaluation Metrics
Accuracy: estimate the rating of each review from its good and bad phrases (a sketch of the computation follows).
Positiveness = #good phrases / #phrases
Scale Positiveness to the 5-star range to get the predicted rating:
Rating_Predicted = Positiveness * 5
Total correct predictions = #(|Actual_Rating - Rating_Predicted| <= Error)
Accuracy = (Total correct predictions / Total predictions) * 100
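A direct transcription of the accuracy metric above into Python. The toy data and the tolerance value are assumptions; the slides do not state what threshold "Error" was set to.

```python
# Hedged sketch of the accuracy metric defined on this slide.
def accuracy(reviews, error=1.0):
    """reviews: (actual star rating, #good phrases, #bad phrases) per review."""
    correct = 0
    for actual_rating, n_good, n_bad in reviews:
        positiveness = n_good / (n_good + n_bad)  # fraction of good phrases
        predicted = positiveness * 5              # scale to 5-star range
        if abs(actual_rating - predicted) <= error:
            correct += 1
    return correct / len(reviews) * 100

# Toy data: (actual stars, #good phrases, #bad phrases).
print(accuracy([(4, 3, 1), (2, 1, 4), (5, 6, 0)]))  # 100.0 on this toy set
```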
Task 2 - Evaluation Results

City       | Average Rating | Sentiment Average Rating | Error | Accuracy
Madison    | 4.0            | 3.54                     | 11.5% | 63.27%
Charlotte  | 3.84           | 4.11                     | 7%    | 75.55%
Pittsburgh | 3.54           | 2.79                     | 21.1% | 59.45%
THANK YOU! :)