

    Data Mining on YELP Dataset

    Advisor - Duc Tran Thanh

    Team - Data Crackers

    Prashanth Sandela

    Vimal Chandra Gorijala

    Parineetha Gandhi Tirumali


Table of Contents

1. Project Vision
2. Data Mining Task
   2.1. Data Mining Problem
   2.2. Evaluation Metrics
3. Hypothesis
4. Data Processing
   4.1. Data
   4.2. Initial Dataset
   4.3. Data Quality Problems
   4.4. Data Processing Tasks
   4.5. Resulting Dataset
5. Feature Selection
   5.1. Dataset
   5.2. Rationales Behind Feature Selection
   5.3. Feature Selection Tasks
   5.4. Selected Features
6. Model Development and Tuning by Prashanth Sandela
   6.1. Implementation of Own Model (Naïve Bayes)
   6.2. Naïve Bayes Multinomial Classification Model
   6.3. Experimental Results
7. Model Development and Tuning by Vimal Chandra Gorijala
   7.1. Naïve Bayes Multinomial Model
   7.2. Naïve Bayes Multinomial Text Model
   7.3. Results Comparison
8. Model Development and Tuning by Parineetha Gandhi
   8.1. K Nearest Neighbors Model
   8.2. Decision Tree
9. Main Findings in the Project
10. Results and Comparison
11. Project Management
12. List of Queries


1. Project Vision

In today's fast-growing world there are many businesses: startups, growing companies and well-established ones. For every business, its rating is critical to its survival in the market. This rating is given by users who consume the goods and services of the business. A user expresses his or her experience with a business in the form of a review and a star rating through many platforms, and a famous one among them is YELP. A review can be positive, negative or neutral. The aim of our project is to build a classifier that classifies any given review into star-rating labels (-1, 0 and 1). We planned to use various data mining models to classify reviews into user star-rating labels, applying various model-tuning techniques to attain optimal classification accuracy.

2. Data Mining Task

2.1. Data Mining Problem

The data mining task we are trying to solve is multi-class classification. The classes we used in this project are -1, 0 and 1 (-1 = negative, 0 = neutral, 1 = positive).

2.2. Evaluation Metrics

The following are the evaluation metrics we used to assess the quality of the solution.

1) Percentage (accuracy) of correctly classified instances: this metric is appropriate because it tells us exactly how well the model is performing, but we cannot rely on it alone.

2) ROC value: this value relates the true positive rate to the false positive rate; the ROC area is one of the most important values output by WEKA. An "optimal" classifier will have ROC values approaching 1, with 0.5 being comparable to random guessing.

Through the combination of the above metrics we can assess the performance of the model and attain the best results.
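For reference, the same two metrics can also be computed outside WEKA. The sketch below is a minimal Python illustration using scikit-learn, which was not a tool used in this project; the label and probability arrays are made-up placeholders, not project results.

# Minimal sketch: accuracy and one-vs-rest ROC AUC for the three classes -1, 0, 1.
# scikit-learn stands in for WEKA here; all numbers below are placeholders.
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [1, -1, 0, 1, 1, -1]          # gold class labels
y_pred = [1, -1, 1, 1, 0, -1]          # labels predicted by some classifier
y_prob = [[0.1, 0.2, 0.7],             # per-class probabilities,
          [0.8, 0.1, 0.1],             # columns ordered as [-1, 0, 1]
          [0.2, 0.3, 0.5],
          [0.1, 0.1, 0.8],
          [0.2, 0.5, 0.3],
          [0.7, 0.2, 0.1]]

print("accuracy:", accuracy_score(y_true, y_pred))
print("ROC AUC :", roc_auc_score(y_true, y_prob, multi_class="ovr", labels=[-1, 0, 1]))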

3. Hypothesis

As this is a classification task based on text, the words in the reviews are the features we should consider in order to classify them correctly. For example, if a review contains words like "good", "excellent", "awesome" or "yumm food", that review should be classified into the Positive class label. We planned to concentrate on these words and apply transformations such as stop-word removal and stemming, so that, with the help of different tools, the words can be used in the best way possible and the review can be classified correctly. We intended to concentrate mainly on Bayesian algorithms, as they perform well for text classification.

We also intended to use combinations of words called bigrams, for example "very good" or "yum yum". In general, the view of a user about a business is expressed mostly in combinations of words, so we thought using bigrams could give the model good accuracy.

Using other features such as business id and user id individually can improve the accuracy, but they should not be used in combination.

We will discuss in the results below how model learning is affected by the approaches in this hypothesis.
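As a small illustration of the bigram idea in this hypothesis, the snippet below builds bigrams from one toy review in plain Python; the actual project used Hive's ngrams() function and WEKA's NGramTokenizer instead.

# Toy illustration of bigrams: adjacent word pairs from a single review.
review = "very good food and very friendly staff"
tokens = review.split()
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
print(bigrams)
# ['very good', 'good food', 'food and', 'and very', 'very friendly', 'friendly staff']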


4. Data Processing

4.1. Data

We obtained the data from http://www.yelp.com/dataset_challenge. It has 40,000 businesses, 1.3 million reviews and 250,000 users. The data was in JSON format, and we did some pre-processing and converted it into CSV format to obtain the review text, class labels and other features. There are many irrelevant fields, such as neighborhoods and votes, which we removed, keeping only the required ones. Initially the reviews have star-rating class labels from 1 to 5; we reduced these so that 1 and 2 become negative (-1), 3 becomes neutral (0) and 4 and 5 become positive (1). The figure summarized below shows all the reviews and the modified class labels associated with them.

Graph showing the number of reviews per class label: Negative 213,509; Neutral 163,761; Positive 748,188.

4.2. Initial Dataset

The initial YELP dataset basically consists of data about Businesses, Users and Reviews. Below is a snapshot of the dataset in JSON format.

Business:

{
  'type': 'business',
  'business_id': (encrypted business id),
  'name': (business name),
  'neighborhoods': [(hood names)],
  'full_address': (localized address),
  'city': (city),
  'state': (state),
  'latitude': latitude,
  'longitude': longitude,
  'stars': (star rating, rounded to half-stars),
  'review_count': review count,
  'categories': [(localized category names)],
  'open': True / False,
}

User:

{
  'type': 'user',
  'user_id': (encrypted user id),
  'name': (first name),
  'review_count': (review count),
  'average_stars': (floating point average, like 4.31),
  'votes': {(vote type): (count)},
  'friends': [(friend user_ids)],
  'elite': [(years_elite)],
  'yelping_since': (date, formatted like '2012-03'),
  'compliments': {(compliment_type): (count)},
}

Review:

{
  'type': 'review',
  'business_id': (encrypted business id),
  'user_id': (encrypted user id),
  'stars': (star rating, rounded to half-stars),
  'text': (review text),
  'date': (date, formatted like '2012-03-14'),
  'votes': {(vote type): (count)},
}


4.3. Data Quality Problems

The dataset has several quality issues, as listed below:

1) Presence of unwanted columns, and the need to merge the files in the dataset
2) Special characters
3) Numeric data
4) Other-language characters
5) Stop words
6) business_id, review_id and user_id are hash values which occupy a lot of space

4.4. Data Processing Tasks

4.4.1. Removing Unwanted Columns and Merging All the Files

Among all the columns we kept only business_id, user_id, review_id, review_text, review_count and stars. Furthermore, the three files were combined into a single dataset containing only these attributes. To accomplish this, the entire dataset (all 3 files) was first converted from JSON to CSV using a Python script. Next, the datasets were combined using the ETL tool Pentaho (ETL mapping screenshot omitted).
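The project's original Python conversion script is not reproduced in the report. The sketch below is only a minimal illustration of that step for the reviews file, assuming one JSON object per line; the file names are hypothetical.

# Hedged sketch of the JSON-to-CSV step for the reviews file (one JSON object per line).
# File names are illustrative; review_count comes from the user/business files and is
# joined in the Pentaho ETL step, so it is not handled here.
import csv
import json

FIELDS = ["business_id", "user_id", "text", "stars"]

with open("yelp_academic_dataset_review.json") as src, \
        open("reviews.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        record = json.loads(line)
        writer.writerow([record.get(field, "") for field in FIELDS])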

4.4.2. Removing Special Characters, Numeric Data and Other-language Words

The main field of interest is review_text, the text a user entered as a review of a business, together with its star rating. It contains special characters, new-line characters and other-language characters. These have been removed by the PHP script below.

4.4.3. Removing Stop Words and Converting to Lower Case

Stop words are low-information words: removing them does not change the meaning or weight of a sentence, so dropping them reduces the number of tokens. Converting the complete text to lower case makes it easy to compare two words, since both are then in lowercase. Below is the PHP script that performs operations 4.4.2 and 4.4.3.


// Fetch the raw reviews from MySQL.
$result = mysql_query("select business_id, user_id, review_id, text, stars, review_count from reviews");

$i = 0;
$optString = "";

while ($rows = mysql_fetch_array($result)) {
    $i++;

    // Strip special characters, digits and other-language characters, then new lines.
    $text = preg_replace("/[^A-Za-z ]/", " ", $rows['text']);
    $text = str_replace("\n", " ", $text);

    // Lower-case each token and drop stop words ($stopWords holds the stop-word list).
    $words = explode(" ", $text);
    $processed_text = "";
    foreach ($words as $s) {
        $s = strtolower($s);
        if ($s != "" && array_search($s, $stopWords) === false)
            $processed_text .= $s . " ";
    }

    // Build one CSV line per review.
    $optString .= "'" . $rows['business_id'] . "',";
    $optString .= "'" . $rows['user_id'] . "',";
    $optString .= "'" . $rows['review_id'] . "',";
    $optString .= "'" . $processed_text . "',";
    $optString .= $rows['stars'] . ",";
    $optString .= $rows['review_count'];
    $optString .= "\n";

    // Flush to the output CSV every 1000 reviews.
    if ($i % 1000 == 0) {
        $fd = fopen("reviews_DetailedStopWords.csv", "a+");
        fwrite($fd, $optString);
        fclose($fd);
        $optString = "";
        echo "$i\n";
    }
}

4.5. Resulting Dataset

The resulting dataset consists of business_id, review_id, user_id, stars and review_text in CSV format. A sample of the file is shown below.

nYer89hXYAoddMEKTxw7kA,k2u1F6spBGhgk2JtAe97QA,HeDqdFYkKaeDvPtiFy6Xmw,event favorite event a long time lindsey a fabulous job setting up keeping movie played completely hush hush absolutely love filmbar always a great beer wine selection wonderful staff a wacky selection art film movie night wayne world one time favorites naturally beyond thrilled invited a super foxey date a guy a delicious dog short leash a fabulous time thank lindsey filmbar yelp a fantastic evening party time excellent ,5

nYer89hXYAoddMEKTxw7kA,hdZ3rlgFXctCOUhzoOebvA,XXblLOSqYlq0tXhxHfXUHQ,great time funny movie loved going film bar first time tables eat fantastic ,4

nYer89hXYAoddMEKTxw7kA,usQTOj7LQ9v0Fl98gRa3Iw,2fPxXAysOrZLrahZQyJCNg,wayne world short leash filmbar need more a great tuesday night adventure wayne world took waaaaaay back favorite aiko a chicken dog short leash ages great yelp coordinated outing usual thanks lindsey yelp crew thanks kelly a staff filmbar place bomb day week thanks brad kat short leash bunch having trailer available event ,5

nYer89hXYAoddMEKTxw7kA,XTFE2ERq7YvaqGUgQYzVNA,OO6prfuGEMalQcQcU3WCaw,fab concept test out a new well new hadn t film bar previously independent cinema drink a generously gratis beer wine choosing short lease hot dogs oh conveniently parked outside plus samples appetizers heck yes add excitement anticipation knowing filmed az movie going shown a perfect weeknight event now know movies don t actually based arizona awesome ideas next mention great filmbar date ideas online dating attempts pan out thanks film bar yelp short lease fun fellow yelpers a great time ps post movie trivia answers ,5


5. Feature Selection

5.1. Dataset

After data preprocessing, the dataset is a clean, structured CSV file with the required columns. This file is loaded into HDFS and a `reviews` table is created on top of it (the commands at the end of this section show how); this table is used for feature selection.

5.2. Rationales Behind Feature Selection

Now that the content is processed, the next task is to reduce the size of the dataset by replacing the hash values of business_id, review_id and user_id with numeric identifiers. We create a new table to store those values, so that the initial id values are not lost.

Our final aim is to classify stars based on the reviews, so we narrowed stars 1-5 down to three classes: Positive, Negative and Neutral. 1- and 2-star reviews fall under Negative, 4- and 5-star reviews fall under Positive, and 3-star reviews fall under Neutral.

The essential feature for classifying a review is review_text. We tested classifying 25,000 reviews in WEKA with unigrams, bigrams and trigrams, using 66% of the data to train the model and 34% for testing. According to the results, using n-grams gave significantly higher accuracy, so based on this experiment we planned to use n-grams to train the model.

We planned to use 67% of the data to train the model and 33% for testing. Stop words were removed in the data preprocessing step. The review text is typed by the end users of the YELP application, so there is a high chance of spelling variants: users express their feelings in various ways, and some may type "gooood" instead of "good" or "coooool" instead of "cool". When we calculate term frequencies, such variants risk being ignored unless words are normalized to a common form; lemmatization maps a word to its dictionary form, while stemming reduces it to its root.

For this phase we implemented the Lovins stemmer algorithm (http://www.cs.waikato.ac.nz/~eibe/stemmers/). A UDF was created for this algorithm; it takes the complete text as input, processes it and returns the stemmed output. After this phase we divided the table into unigrams, bigrams and trigrams and calculated the frequency of the words.

$> hadoop fs -put review.csv
$> hive

HIVE> create table reviews(business_id String, review_id String, user_id String, review_text String, stars int)
      row format delimited
      fields terminated by ','
      lines terminated by '\n';

HIVE> load data inpath 'review.csv' into table reviews;

HIVE> select * from reviews LIMIT 10;
      /* Displays the columns in the correct format */


5.3. Feature Selection Tasks

5.3.1. Assigning Numeric Ids to Key Attributes

Below are the queries used to assign a numeric id to the key attributes business_id, user_id and review_id.

5.3.2. Narrowing Stars

Narrowing stars means converting 1 and 2 stars to negative (-1), 3 stars to neutral (0), and 4 and 5 stars to positive (1).

5.3.3. Removing Stop Words

Stop words were already removed in the data processing phase.

5.3.4. Stemming Review Text

We created a UDF `lovinsStemmer()` based on the Lovins stemming algorithm provided by WAIKATO. After applying stemming, we removed some newly generated stop words using the UDF `stopWords()`. The queries for these tasks are below.

HIVE> CREATE TABLE business AS
      SELECT DISTINCT id, business_id FROM
      (SELECT RANK() OVER(ORDER BY business_id) as id,
              RANK() OVER(ORDER BY user_id) as user_id,
              RANK() OVER(ORDER BY review_id) as review_id,
              business_id
       from reviews) a;

HIVE> CREATE TABLE users AS
      SELECT DISTINCT id, user_id FROM
      (SELECT RANK() OVER(ORDER BY business_id) as business_id,
              RANK() OVER(ORDER BY user_id) as id,
              RANK() OVER(ORDER BY review_id) as review_id,
              user_id
       from reviews) a;

HIVE> CREATE TABLE processed_reviews AS
      SELECT RANK() OVER(ORDER BY business_id) as business_id,
             RANK() OVER(ORDER BY user_id) as user_id,
             RANK() OVER(ORDER BY review_id) as review_id,
             review_text,
             stars
      from reviews;

HIVE> CREATE TABLE processed_stars_reviews AS
      SELECT business_id, review_id, user_id, review_text,
             CASE WHEN stars = 1 or stars = 2 THEN -1
                  WHEN stars = 3 THEN 0
                  WHEN stars = 4 or stars = 5 THEN 1
             END AS stars
      FROM processed_reviews;

HIVE> CREATE TABLE stemmed_stars_reviews AS
      SELECT review_id,
             stopWords(lovinsStemmer(review_text)) as review_text,
             stars
      FROM processed_stars_reviews;
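The stemming and stop-word UDFs above run inside Hive. As a rough outside illustration of what `stopWords(lovinsStemmer(...))` does to a review, here is a hedged Python sketch; NLTK has no Lovins stemmer, so the Porter stemmer (which the report later switches to in section 6.1.3) is used instead, and the stop-word list is a toy one.

# Hedged analogue of the stemming + stop-word UDFs, using NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stop_words = {"a", "the", "was", "and"}          # toy stop-word list
stemmer = PorterStemmer()

def stem_review(text):
    # lower-case, drop stop words, then stem each remaining token
    tokens = [w.lower() for w in text.split() if w.lower() not in stop_words]
    return " ".join(stemmer.stem(w) for w in tokens)

print(stem_review("The staff was friendly and the food was excellent"))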


5.3.5. Dividing Training and Test Data

The total number of reviews in the dataset is 1,125,458. 67% of them, i.e. 754,056 reviews, are used for training the model and the remaining 371,402 reviews are used for testing it. Since we have already created unique review_ids from 1 to 1,125,458, the split can be made on the id. Below is the query to split the dataset.

HIVE> CREATE TABLE training_data as
      SELECT * FROM processed_stars_reviews
      WHERE review_id < 754056;

HIVE> CREATE TABLE test_data as
      SELECT * FROM processed_stars_reviews
      WHERE review_id >= 754056;

5.3.6. Classification Using WEKA

In the initial report we used the StringToWordVector filter, which uses the WordTokenizer to convert the review text into word vectors, and applied the Naïve Bayes Multinomial classification algorithm. We got 66.7% of instances correctly classified. The data sample had 5,000 instances, with 66% used as training data and 34% as testing data.

In this phase we applied the NGram tokenizer, which converts the text into n-grams (unigrams, bigrams, trigrams). On top of this we also applied the Attribute Selection filter, which uses the InfoGainAttributeEval function to evaluate the worth of each attribute by measuring its information gain with respect to the class, and got 70.02% of instances correctly classified. This data sample had 25,000 instances, with 66% used as training data and the remainder as testing data.
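The WEKA pipeline described in 5.3.6 (n-gram tokenizer, information-gain attribute selection, Naïve Bayes Multinomial) can be approximated outside WEKA. The sketch below is a hedged scikit-learn analogue: the chi-squared filter only stands in for InfoGainAttributeEval, and the tiny corpus is made up.

# Hedged analogue of the WEKA pipeline: n-grams + attribute selection + multinomial NB.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_reviews = ["very good food great staff",
                 "worst service bad food",
                 "ok nothing special"]
train_labels = [1, -1, 0]

pipeline = Pipeline([
    ("ngrams", CountVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("select", SelectKBest(chi2, k=5)),                # keep the top-ranked n-grams
    ("nbm", MultinomialNB()),                          # multinomial Naive Bayes
])
pipeline.fit(train_reviews, train_labels)
print(pipeline.predict(["good food and great staff"]))  # classify a new review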


5.3.7. Generating N-grams

N-gram generation is a predefined function in HIVE. We use this function to build unigrams, bigrams and trigrams and to calculate their frequencies. Below are the queries that create the unigram, bigram and trigram tables; we select the top 2,000 n-grams in each case.

HIVE> CREATE TABLE unigrams as
      SELECT ngrams(sentences(review_text), 1, 2000, 2000) from training_data;

HIVE> CREATE TABLE bigrams as
      SELECT ngrams(sentences(review_text), 2, 2000, 2000) from training_data;

HIVE> CREATE TABLE trigrams as
      SELECT ngrams(sentences(review_text), 3, 2000, 2000) from training_data;

5.4. Selected Features

These are the selected features for our classification:

1) Business Id
2) User Id
3) Bigrams
4) Review Text

6. Model Development and Tuning by Prashanth Sandela

6.1. Implementation of Own Model (Naïve Bayes)

6.1.1. Idea to Develop the Model in Hive

I developed my own model for classifying star ratings based on text. For this model I considered the features business id, user id, review id, review text and stars. The implementation is based on the probability of each n-gram, conditioned on either business id or user id, estimated from the training data.


The model is then applied to the test data to classify the stars of each review as -1, 0 or 1, i.e. negative, neutral or positive. This model achieved an accuracy of 69.5%.

6.1.2. Model Development & Description

This model is developed purely with HIVE queries on Amazon Web Services, storing data in S3 and developing and deploying on Elastic MapReduce with 3 EC2 instances. The model runs on the entire YELP dataset of 1.3 million records, with 67% training data and 33% test data.

Steps followed to develop the model:

1. Divide the data into training and test sets
2. Find each n-gram, its frequency, star and probability from the training data
3. Find review id, n-gram and frequency in the test data
4. Apply the trained model to the test data
5. Compare the words of the test and training datasets, including a few other features
6. Retrieve the percentage match between training and test data

The queries Bigrams1, Numerics, Bigrams_1, Bigrams_stag_1 and Bigrams_stag_2 (see section 12) have been used to design the model; this is the final, tuned model. Below is an example illustrating the model implementation.

Below is an example of classification with unigrams.

Training Data:

Word      | Frequency | Star | Probability
Good      | 100       | 1    | 0.33
Excellent | 50        | 1    | 0.16
Bad       | 100       | -1   | 0.5
Good      | 10        | -1   | 0.05
Nice      | 15        | 0    | 0.1

Total Stars:

Stars | Total Count
1     | 300
-1    | 200
0     | 150

The queries bigrams_test_1_1 and stats are used to compute the results. In the example above I calculated the probability of each word from its frequency and the total word count per star. The word "Good" appears under both the 1 and -1 star ratings, so based on its probability "Good" is classified towards the +1 star. Below is how a review is then classified based on its text.

Test Data:

Review_id | Word  | Count in reviews | New_Star | Original_Star
1         | Good  | 10               | 1        | 1
1         | Bad   | 5                | -1       | 1
2         | Bad   | 30               | -1       | -1
2         | Worst | 40               | -1       | -1

For review id 1 the word counts are compared and the review is classified with star rating 1; similarly review id 2 is classified with star rating -1.
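The full implementation of this probability model is the set of Hive queries listed in section 12. The snippet below is only a compact Python sketch of the same idea, with made-up reviews: per-class word probabilities are estimated from the training data, and a test review is assigned the class whose words carry the highest summed probability.

# Hedged Python sketch of the word-probability model; the real model is in Hive (section 12).
from collections import Counter, defaultdict

train = [("good excellent food", 1), ("bad worst service", -1), ("nice ok place", 0)]
test_review = "good food bad day"

word_counts = defaultdict(Counter)   # star -> word frequencies
totals = Counter()                   # star -> total word count
for text, star in train:
    for word in text.split():
        word_counts[star][word] += 1
        totals[star] += 1

def score(text, star):
    # sum of P(word | star) over the words of the review (unseen words contribute 0)
    return sum(word_counts[star][w] / totals[star] for w in text.split())

predicted = max(word_counts, key=lambda star: score(test_review, star))
print(predicted)    # star label with the highest summed word probability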


I considered only businesses with at least 10 reviews and users with at least 10 reviews. I divided the data at the level of each business and user. For example, if there are 100 reviews for a specific business, then 66 reviews go into training and the rest into test data when I consider business id as a feature; this division happens for every business id. A similar process is repeated for user id, and also when both business id and user id are considered together.

In Hive I was not able to compute an ROC measure for the result metrics.

6.1.3. Model Tuning

The dataset supplied to this model already had stop words removed and text enrichment applied. Below are the model tuning procedures:

1) Refining and sampling of training and test data
   Initially I just cut my dataset down to 100,000 records, taking the first 67,000 as training data and the remaining 33,000 as test data. I realized that more than 90% of the records in the training data were positive, so I drew the training data as random samples across several parts, which improved the accuracy from 43% to ~51%, a gain of roughly 7-8 percentage points.

2) Change of stemmer
   Initially I used the Lovins stemmer; after some research I found that the Porter stemmer performs better than the Lovins stemmer. I used a Java program to implement this stemmer, which improved accuracy by ~0.5%.

3) N-grams and frequency count
   Using different n-grams changed the accuracy, with bigrams showing the best results. Furthermore, there was a slight increase in accuracy when I set the term-frequency count to 5,000. With this tuning, accuracy increased by 4%.

4) Determining the best approach to increase accuracy
   Before arriving at the probability model, I tried various other approaches, such as sum and count models, which did not help much in producing accurate results. Using the probability model increased accuracy significantly.

5) Change of features
   We have two more features, business id and user id. When I used business id or user id as an extra feature, accuracy increased significantly. But when I used business id and user id together, accuracy actually dropped. This makes sense: using both features means the model searches for the same business id and user id pair when classifying stars, and since this combination is nearly unique, the accuracy is reduced.

6) Applying the model to the overall dataset
   When I supplied the overall dataset at a ratio of 67% training to 33% test data, I got an accuracy of ~74%. Accuracy on the sample data was ~71%, so on the overall data it is a bit higher.

6.1.4. Pros and Cons

a) HiveQL is an SQL-like language, so the model is easy to implement.
b) The main advantage of this model is that we can tune it to any extent.
c) There is no limitation on the size of the data or the number of fields.
d) It never runs out of memory.
e) The model can be implemented on a cluster using all the required resources.


f) It is difficult to design and implement this model; any change requires considering many implications, e.g. if a query is changed, what the effect on the result will be. One should be very careful while making changes.
g) HIVE already provides many predefined functions, and any new extension can easily be accommodated by designing a UDF (User Defined Function).

6.2. Naïve Bayes Multinomial Classification Model

6.2.1. About the Model

I used the WEKA data modeling tool to classify stars using the Naïve Bayes Multinomial model, which is available in WEKA's list of Bayes models. WEKA has pre-defined models implemented with many filters and features.

6.2.2. Model Tuning

This tuning was performed on a dataset of 100,000 records, with 67% training data and 33% test data.

1) The initial accuracy was around 47% without any tuning.
2) I supplied a new list of stop words rather than using the default stop-word list. There was a slight increase in accuracy, but only about ~0.5%.
3) Using n-grams instead of the WordTokenizer gave better accuracy.
4) Among the n-grams, the accuracy was best when bigrams were used on my dataset.
5) The default NullStemmer was replaced by the LovinsStemmer, which gave a slight increase in accuracy.
6) The wordsToKeep setting also affects the accuracy of the result: increasing it from 1,000 to 5,000 changed the accuracy by ~2%.
7) I used the Attribute Selection filter with the Ranker search strategy (threshold 0, generateRanking: True, numToSelect: -1, startSet left empty), which increased accuracy by 1.5%.
8) Using different features changes the accuracy of the output. I used business id and user id together expecting an increase in accuracy, but these two attributes together reduced it, which is expected: a user gives only one or two reviews of a given business, so when both features are used together the number of instances per review narrows down to one or two, which decreases the probability estimates and the accuracy. So I used only one attribute at a time; user id as a feature gave the better increase in accuracy, about 3%.
9) The overall accuracy is 76%.

6.2.3. Pros and Cons

1) Using this model in WEKA gives the flexibility to use many filters and attributes, for both supervised and unsupervised learning.
2) WEKA only works on small datasets; working with larger datasets is not possible.
3) The algorithm is already implemented, so no effort is needed to build it.
4) If we want to add new functionality that is not available, it is difficult to implement.


6.3. Experimental Results

Sl. No | Action | *Naïve Model | *Naïve Bayes Multinomial | ROC for NBM | Discussion of Results
1 | Initial dataset | 44% | 46% | 0.47 | At this stage no filters are applied; these are the initial model results.
2 | Refining of training and test data | +7% | N/A | 0.54 | The default training data I selected was mostly positive, so sampling the training data helped increase accuracy in my model. In Naïve Bayes Multinomial, randomization is handled automatically by WEKA (using the randomizer).
3 | Change of stemmer | +0.5% | N/A | 0.55 | Changing from the Lovins stemmer to the Porter stemmer showed a slight increase in accuracy. The Lovins stemmer is not in WEKA's default stemmer list, so it could not be used for Naïve Bayes Multinomial.
4 | N-grams: Unigrams | +3% | +3.5% | 0.59 | Of the three n-gram settings, bigrams gave the optimal results in both cases, so I went ahead and implemented bigrams.
  | Bigrams | +7% | +7.5% | 0.68 |
  | Trigrams | +2% | +2% | 0.60 |
5 | Including features: Business id and User id together | +2% | +1% | 0.70 | Using both features together in the probability model reduced accuracy, possibly because the full outer join on business id and user id looks for specific instances, and instances present in the training set might not be available in the test set. In this case Naïve Bayes Multinomial gave the better result.
  | Business id | +2% | +3% | 0.74 | Using business id or user id increased accuracy, and accuracy was highest when user id alone was considered. This is close to per-user sentiment analysis, because a user tends to use the same sort of text to express his feelings, and a small number of users wrote a large share of the reviews, so user id as a feature clearly explains the increase in accuracy.
  | User id | 5% | +5% | 0.74 |
6 | Bag of words | 5% | 4% | 0.75 | Initially I used 1,000 words for the frequency count; using 3,000 words increased accuracy.
7 | Overall accuracy on 100,000 records | ~73% | ~76% | 0.78 |
8 | Accuracy on complete dataset | ~75% | N/A | | I was not able to fit the complete dataset into WEKA's memory even after allocating 6 GB to WEKA.

* All accuracy rates are rounded to the nearest value.


7. Model Development and Tuning by Vimal Chandra Gorijala

7.1. Naïve Bayes Multinomial Model

7.1.1. About the Model

Multinomial Naïve Bayes is a version of Naïve Bayes designed for text documents, and it is mainly useful for multi-class classification. Initially we had 5 classes for the reviews, but we reduced them to three (positive, neutral, negative) so that we can train the model better. The probability of a review d being in class c is computed as

    P(c | d) ∝ P(c) · ∏_{1 ≤ k ≤ n_d} P(t_k | c)

where P(t_k | c) is the conditional probability of term t_k occurring in a review of class c. We interpret P(t_k | c) as a measure of how much evidence t_k contributes that c is the correct class, and P(c) is the prior probability of a review occurring in class c. If the terms of a review do not provide clear evidence for one class versus another, we choose the class with the higher probability.

We used WEKA to implement the model. First the dataset containing the features review text, business id, review id, user id and the class label is fed to the tool. Preprocessing is done and the reviews are converted into word vectors or n-grams depending on the filters applied. The model is then trained on them.

7.1.2. Model Tuning

In WEKA we can change various properties to increase the performance of the model. The data sample has 60,000 records. The following are some of the properties tuned:

1. Using n-grams rather than word vectors; bigrams in particular increased accuracy.
2. Increasing the wordsToKeep count from 1,000 to 5,000 or 10,000, depending on the size of the dataset.
3. Increasing the minimum term frequency from 1 to 10, so that a term with fewer than 10 occurrences is not considered.
4. Converting all the text to lower-case tokens.
5. Using the attribute selection filter InfoGainAttributeEval on top of the n-grams, so that only the top-ranked attributes are fed to the model.
6. Using the cross-fold option instead of the percentage-split option.
7. Utilizing additional features like business_id or user_id.


7.1.3. Experimental Results

Features and Parameters | Percentage of Accuracy | ROC
Initial % with review text | 48 | 0.46
Stop words | 53 | 0.51
Stemmer | 54 | 0.52
Unigrams | 59 | 0.58
Min word frequency from 5-10 | 65 | 0.64
Business id and User id | 65.23 | 0.64
Bigrams | 72 | 0.73
Trigrams | 63 | 0.62
User id | 74 | 0.75
Business id | 74 | 0.75
Attribute selection filter | 78 | 0.79
Bag of words count to 5000 | 79 | 0.81
Overall accuracy | 79.49 | 0.83

Observation: in the Naïve Bayes Multinomial model, varying the minimum term frequency and using bigrams improved the performance drastically. The reason is that the data contains many bigrams, and these settings concentrate the model on the most frequent words in the reviews.

7.2. Naïve Bayes Multinomial Text Model

7.2.1. About the Model

The Multinomial Naïve Bayes Text model operates directly on string attributes; other types of input attributes are accepted but ignored during training and classification. It uses word frequencies rather than a binary bag-of-words representation, so it is mainly useful for text data.

We used WEKA to implement the model, with a data sample of 60,000 instances.

7.2.2. Model Tuning

In WEKA we can change various properties to increase the performance of the model. The following are some of them:

1. Converting all the text to lower-case tokens.
2. Varying the minimum word frequency.
3. Using n-grams instead of word vectors.
4. Utilizing additional features like business_id or user_id.
5. Using the cross-fold option instead of the percentage-split option.


    6. Increasing the WordsToKeep count from 1000 to 5000 or 10000 depending on the size of the

    dataset.

7.2.3. Experimental Results

Features and Parameters | Percentage of Accuracy | ROC
Initial % with review text | 54 | 0.53
Stop words | 56 | 0.57
Stemmer | 59 | 0.60
Unigrams | 61 | 0.62
Min word frequency from 5-10 | 66 | 0.70
Business id and User id | 65 | 0.64
Bigrams | 73 | 0.75
Trigrams | 64 | 0.66
User id | 77 | 0.79
Business id | 77 | 0.79
Overall accuracy | 79.6 | 0.84

Observation: the same reasons given for the Naïve Bayes Multinomial model are responsible for the large increase in the accuracy of this model. From the comparison of the results above we can say that the Naïve Bayes Multinomial Text model has slightly higher accuracy (about 0.11 percentage points) than the Naïve Bayes Multinomial model. The reason could be that the Naïve Bayes Multinomial Text model carries out some extra processing, which gives it this slight edge.

8. Model Development and Tuning by Parineetha Gandhi

The dataset fed to the tool has 25,000 reviews, of which 16,576 are positive, 5,650 are negative and 2,772 are neutral.

8.1. K Nearest Neighbors Model

8.1.1. About the Model

k is a constant given by the user, and an unlabeled vector is classified by assigning the label that is most frequent among the k training samples nearest to that vector. The neighbors are found with a distance measure such as Euclidean distance, which is the one used in the experiments below.


The value of k should be chosen according to the data: a larger k reduces the effect of noise on the classification but makes the boundaries between the classes less distinct.
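The project ran KNN inside WEKA (the IBk classifier). The sketch below is only a hedged scikit-learn analogue of the setup described here, with a made-up bag-of-words corpus; k is set to 3 purely because the toy corpus is tiny, whereas the experiments below vary k over 1, 5 and 15.

# Hedged analogue of the KNN setup: bag-of-words/bigram features + k nearest neighbours.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

train_reviews = ["great food great staff", "terrible service bad food",
                 "average place nothing special", "excellent food wonderful place"]
train_labels = [1, -1, 0, 1]

knn = Pipeline([
    ("vectorize", CountVectorizer(ngram_range=(1, 2))),              # unigrams + bigrams
    ("knn", KNeighborsClassifier(n_neighbors=3, metric="euclidean")),
])
knn.fit(train_reviews, train_labels)
print(knn.predict(["great food wonderful staff"]))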

8.1.2. Model Tuning

I tuned the model by varying the k value. Initially I applied k=1 and observed that changing the tokenizer did not make much difference, which gave a result of 67.2314%.

In a second attempt I tuned the model further by applying k=15 with a bigram tokenizer, and observed that the accuracy increased to 69.125%.

Parameters tuned:

The following are the parameters I changed while tuning the model. The table in section 8.1.3 shows the results obtained by varying them.

- TFTransform and IDFTransform
- minTermFreq
- outputWordCounts
- lowercaseTokens
- stemmer
- stopwords
- tokenizer

I also tuned the model with model-specific parameters: for KNN, parameters such as the distance function (Euclidean distance) and the number of nearest neighbors were changed; for the decision tree, parameters such as the Laplace value and the binary-split option were changed.

8.1.3. Experimental Results


Observation: the best results were obtained with k=5. A likely reason is that when classifying into more than two groups, or when using an even value of k, it may be necessary to break ties among the nearest neighbors; an odd k such as 5 avoids many of these ties.

When KNN-specific parameters such as the distance function were considered (Euclidean versus Manhattan distance), Euclidean distance gave the better results.

Results obtained when using KNN-specific parameters

8.2. Decision Tree

8.2.1. About the Model

A decision tree classifies instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each internal node tests an attribute of the instance, and each branch leaving a node corresponds to one of the possible values of that attribute.

8.2.2. Model Tuning

Parameters tuned:

I applied the same parameters as for the KNN model and obtained an accuracy of 74.25%.

I performed a percentage split in most cases, as cross-fold validation was taking quite a long time for each experiment.

Initially I ran the model without applying any parameters to the dataset and observed an accuracy of 72.41% with an ROC of 0.72.

The best results were obtained when the Laplace value was set to true, giving an ROC of about 0.82. The reason could be that the Laplace correction biases the probability estimates towards a uniform distribution.

Decision-tree-specific parameters:

Parameter | Value | Accuracy | ROC
Binary split | TRUE | 71.54 | 0.72
numFolds | 10 | 69.93 | 0.73
useLaplace | TRUE | 69.98 | 0.82

Results obtained when using decision-tree-specific parameters
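The decision-tree experiments above were run with WEKA's tree learner, whose useLaplace and binarySplits options have no direct equivalent elsewhere. The sketch below is a loose scikit-learn analogue on a made-up corpus, shown only to make the workflow concrete; it is not the configuration used in the project.

# Loose analogue of the decision-tree experiments; WEKA-specific options (useLaplace,
# binarySplits, numFolds) are not reproduced here.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

train_reviews = ["great food great staff", "terrible service bad food",
                 "average place nothing special", "excellent food wonderful place"]
train_labels = [1, -1, 0, 1]

tree = Pipeline([
    ("vectorize", CountVectorizer(ngram_range=(1, 2))),
    ("tree", DecisionTreeClassifier(min_samples_leaf=1, random_state=0)),
])
tree.fit(train_reviews, train_labels)
print(tree.predict(["bad food terrible place"]))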


Result comparison between KNN and Decision Tree

Observation:

For KNN the best results were obtained with k=5; as noted above, with more than two classes or an even value of k it may be necessary to break ties among the nearest neighbors, which an odd k avoids.

For the decision tree the best results were obtained when the Laplace value was set to true, which increased the ROC to 0.819.

9. Main Findings in the Project

The Naïve Bayes Multinomial Text model performed the best among all the models we tried. The reasons are listed below.

- Varying the minimum term frequency drastically affected performance: words that are not repeated frequently and are not useful for classification are ignored.
- The review text mostly contains bigrams like "very good" or "feeling awesome", so using bigrams as features for classifying the reviews helped a lot.
- Using additional features like user id and business id increased performance. For example, if a user gives most of his reviews as positive for different businesses, the next review given by him for any other business is most likely positive; likewise, if a business has mostly positive reviews, the next incoming review is most likely positive. For these features to work, the reviews of a user must be present in both the training and the test set, and the same holds for businesses.
- The Naïve Bayes Multinomial model has almost the same accuracy as the above model for the same reasons, but the Multinomial Naïve Bayes Text model performs some extra processing which increases its accuracy.


    10. Results and Comparison

Graph showing the accuracy results obtained by the different models

The graph above shows that Naïve Bayes Multinomial Text gives the best accuracy, at ~80%. It can be observed that, going from left to right, the Naïve Bayes models started with accuracies between 40% and 55%, whereas KNN and Decision Tree started with better accuracy; we were therefore expecting better accuracy from those models, but it did not turn out the way we expected. Accuracy increased as we added features and filters. Among the n-grams, bigrams showed good results, so we used bigrams for further processing. Among the features business id, user id and both together, we observed the best accuracy when we considered only user id. For KNN the accuracy was good with k=5, and with KNN-specific settings such as Euclidean distance the results were considerably higher. For the Decision Tree, setting the Laplace value increased the ROC. Including these model-specific options improved the accuracy of these two models slightly.

So the Naïve Bayes Multinomial Text classifier gave good accuracy compared with the other models.


    11. Project Management

    11.1. Task Allocation and Timelines

We used the project management website www.Asana.com to manage the entire project and workload; below is the timeline allocation of the workload for each team member. We also used this tool to store all intermediate files, reports, scripts and snippets.

11.2. Self-Assessment

- Everyone on the team contributed equally; there was no total dependency on, or delay from, anyone in the team.
- Everyone was equally active and enthusiastic to learn something new.
- Before taking any decision, we made sure that everyone was clear about the requirements and the expected output. We followed a process of Knowledge Transfer and Reverse Knowledge Transfer to make sure that everyone was on the same page.
- There were a lot of discussions in the initial phase of the project so that everything would go without any hurdles at the end.
- Everyone in the team has decent knowledge of different tools and technologies such as Pentaho Data Integration, WEKA, MySQL, Java, PHP and Big Data components, so whenever a decision had to be made there was always someone to address it.
- Everyone used the Asana project management tool actively.

11.3. What Can Be Improved?

- Domain knowledge.
- Increasing awareness and usability of the tools among all the members of the team.


    12. List of Queries:

12.1. Bigrams1

CREATE TABLE bigrams AS
SELECT word, star, frequency FROM (
  SELECT word, star, frequency, rank() over (order by frequency desc) as slno FROM (
    SELECT word,
           CASE WHEN pos_count >= neg_count AND pos_count >= nut_count THEN 1
                WHEN neg_count >= nut_count THEN -1
                ELSE 0
           END AS star,
           CASE WHEN pos_count >= neg_count AND pos_count >= nut_count THEN pos_count
                WHEN neg_count >= nut_count THEN neg_count
                ELSE nut_count
           END AS frequency
    FROM (
      SELECT DISTINCT
             CASE WHEN neg.gram.ngram[0] IS NOT NULL THEN concat(neg.gram.ngram[0], " ", neg.gram.ngram[1])
                  WHEN nut.gram.ngram[0] IS NOT NULL THEN concat(nut.gram.ngram[0], " ", nut.gram.ngram[1])
                  ELSE concat(pos.gram.ngram[0], " ", pos.gram.ngram[1])
             END AS word,
             CASE WHEN pos.gram.estfrequency IS NULL THEN 0 ELSE pos.gram.estfrequency END AS pos_count,
             CASE WHEN neg.gram.estfrequency IS NULL THEN 0 ELSE neg.gram.estfrequency END AS neg_count,
             CASE WHEN nut.gram.estfrequency IS NULL THEN 0 ELSE nut.gram.estfrequency END AS nut_count
      FROM bigrams_neg as neg
      FULL OUTER JOIN bigrams_nut as nut ON neg.gram.ngram = nut.gram.ngram
      FULL OUTER JOIN bigrams_pos as pos ON pos.gram.ngram = nut.gram.ngram
    ) as a
  ) as b
) as c
WHERE slno


12.5. Bigrams_stag_2

CREATE TABLE bigrams_stag_2 AS
SELECT review_id, max(prob_sum) as prob_max
FROM bigrams_stag_1
GROUP BY review_id;

12.6. Bigrams_test_1_1

CREATE TABLE bigram_test_1_1 AS
SELECT test.review_id, new_star, test.stars as original_star
FROM (
  SELECT stag1.review_id AS review_id, star AS new_star
  FROM bigrams_stag_1 AS stag1
  INNER JOIN bigrams_stag_2 AS stag2
    ON stag1.review_id = stag2.review_id AND stag1.prob_sum = stag2.prob_max
) a
INNER JOIN test_data test ON a.review_id = test.review_id;

12.7. Stats

This gives the final statistics, showing the number of correctly and wrongly classified instances.

SELECT stats, COUNT(*)
FROM (
  SELECT CASE WHEN new_star = original_star THEN 1 ELSE 0 END as stats,
         new_star, original_star
  FROM bigram_test_1_1
) res
GROUP BY res.stats;

SELECT original_star, count(*) FROM bigram_test_1_1 GROUP BY original_star;

SELECT new_star, count(*) FROM bigram_test_1_1 GROUP BY new_star;