Download - yelp data challenge
Content
• Yelp Data Challenge• Problem Definition• Classical Machine Learning• Deep Learning• Docker Image• Conclusion
Yelp dataset
Academic Dataset.Information about local business.Five Json files:1. business2. check-in3. review4. tip5. user
https://www.yelp.com/dataset_challenge/dataset
User Reviews
Jason Files contains user reviews:Example:
{"user_id": "Iu6AxdBYGR4A0wspR9BYHA", "review_id": "KPvLNJ21_4wbYNctrOwWdQ", "stars": 5, "date": "2014-02-13", "text": "Excellent food. Superb customer service. I miss the mario machines they used to have, but it's still a great place steeped in tradition.", "type": "review", "business_id": "5UmKMjUEUNdYWqANhGckJw"}
Simpler Problem
Assuming That:
• bad rate (1-3)• good rate (4-5)we will predict whether the user like the
service (rate=4 or 5) or not(rate=1,2 or 3)
Pre-processing
Extract the first 2000 review as dataset.
Features: user reviews text.Labels: if Rate >3: Positive Else: Negative
Classical Machine Learningcombination of several machine learning algorithms.
Logistic Regression. Naive Bayes. Stochastic gradient Descent. Support Vector Machine.
combine results using majority votes.
https://github.com/amrqura/yelpRatePrediction
Classical Machine LearningFeatures:
use words of the review text as features as the following:
1.Extract most common 3000 words in the training dataset.
In each statement examine if the frequent words exists or not.
Feature set : matrix of [2000*3000] 2000: number of statements. 3000: boolean values examine existence of frequent word.
https://github.com/amrqura/yelpRatePrediction
Classical Machine LearningTraining:
Train model in 80% statements and validate on 20%.
Run:
$ python3 TextClassification.py
https://github.com/amrqura/yelpRatePrediction
Classical Machine LearningResults(5- cross validation):
Naive Bayes accuracy percent: 66.5 MultinomialNB accuracy percent: 67.95 BernoulliNB accuracy percent: 65.9 Logistic regression accuracy percent: 67.15 SGD accuracy percent: 63.3 SVC accuracy percent: 54.0 LinearSVC accuracy percent: 63.75 NuSVC accuracy percent: 66.35 Voted accuracy percent: 67.35
https://github.com/amrqura/yelpRatePrediction
CNN DatasetEach statement is represented by n*k matrix. n= number of words k= vector length.
we use word2vec to convert each word to vector.
https://github.com/amrqura/deepYelpRatePrediction
CNN Training
implementation using Tensorflow. number of filters=128 filter size=3,4,5. drop out=0.5
Apply max pooling.
output layer: two nodes with softmax function.
use cross-entropy error function. use L2 regularization.
https://github.com/amrqura/deepYelpRatePrediction
CNNTraining:
5 cross validation. Train model in 1600 statements and validate on 400.
Run:
$ python rating_prediction.py
Accuracy: 0.73549998
https://github.com/amrqura/deepYelpRatePrediction
Docker ImageDocker Image:
https://hub.docker.com/r/amrkoura/yelpchallenge/
Run: docker pull amrkoura/yelpchallange docker run -t -i amrkoura/yelpchallenge /bin/bash cd /src/
Run classical machine learning:
python3 TextClassification.py
Run Deep learning:
python rating_prediction.py
Conclusion2 implementations to predict the user rating from user review. • classical machine learning • Deep learning , convolutional neural network.
public Docker image:
https://hub.docker.com/r/amrkoura/yelpchallenge/