
Classifying Yelp Restaurants

Team Yelpadvisor: Stephanie Wuerth, Bichen Wu, Tsu-Fang Lu

    December 14, 2015

    Problem Statement and Background 

    Our goal is to classify restaurants into existing labels using the Yelp academic dataset. We also hope to

    further classify restaurants with more specific labels than their given labels. For example, a restaurant

may be labeled only as “Chinese,” but can we further classify it as “Sichuan” or “Taiwanese”?

Our dataset is Yelp’s academic dataset, which is provided for use as part of the Yelp Dataset Challenge [1]. This dataset spans approximately 10 years of Yelp reviews (text and star rating) and 5 years

    of Yelp tips, along with hourly sums of check-ins for each business. It also includes general information

    about the business, such as its categories, ambiance, business hours, and address, and some information

    about the Yelp reviewers such as the number of reviews they have written and the average star rating they

    have given. The dataset includes 10 cities: Edinburgh, Karlsruhe, Charlotte, Urbana, Madison, Las Vegas,

    Phoenix, Pittsburgh, Montreal, and Waterloo (Canada).

    To measure accuracy, we compare our predicted label to the true label. We measure basic accuracy,

     precision and recall, as well as AUC. We also compare our model’s accuracy to a baseline model.

    Some potential applications of our classification models include:

    1. Help Yelp automate restaurant labeling without user inputs.

    2. Label vaguely labeled restaurants more specifically or label restaurants with missing labels.

    3. Inform customers about restaurants’ specialties and particular cuisines by further sub-categorizing

    restaurants into more specific labels.

Methods

Data collection

We used the Yelp academic dataset, which is made available by Yelp for the Yelp Dataset Challenge [1]. To obtain this data, we registered for the Yelp Dataset Challenge at http://www.yelp.com/dataset_challenge.

    Data preparation

The data provided is in JSON format, but a Python script for converting to CSV is offered at
https://github.com/Yelp/dataset-examples. We used this script (json_to_csv_converter.py) to convert the JSON data into CSV files, then read those into a Python notebook and stored the data in Pandas dataframes. We subset the data to what is potentially useful for our chosen problem. We use 9 of the 10 cities in the Yelp Academic Dataset for our model. The Karlsruhe, Germany data is omitted because most reviews there are not written in English, and review text is the richest component of our dataset. We

    further subset by selecting only restaurants (excluding Hotels, Spas, etc.). Within restaurants we further

    subset for the 20 most common types of restaurant, as dictated by their given labels. Labels chosen and

number of restaurants with each label in our subset are given in Fig. 1 (see Appendix). We also removed end-of-line and carriage-return characters and certain regex-matched patterns from the review texts so that our bag-of-words model would work better. A rough sketch of this preparation step is given below.
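A minimal sketch of this step, assuming hypothetical CSV file and column names produced by json_to_csv_converter.py (this is an illustration of the approach, not the exact code we ran):

    import re
    import pandas as pd

    # Illustrative file and column names for the converted CSVs (assumptions).
    businesses = pd.read_csv("yelp_academic_dataset_business.csv")
    reviews = pd.read_csv("yelp_academic_dataset_review.csv")

    # Keep only restaurants and drop Karlsruhe (its reviews are mostly not in English).
    is_restaurant = businesses["categories"].fillna("").str.contains("Restaurants")
    not_karlsruhe = businesses["city"] != "Karlsruhe"
    restaurants = businesses[is_restaurant & not_karlsruhe]

    # Keep only the reviews written for the retained restaurants.
    reviews = reviews[reviews["business_id"].isin(restaurants["business_id"])]

    # Strip EOL / carriage-return characters and other clutter from the review text.
    def clean_text(text):
        text = str(text).replace("\n", " ").replace("\r", " ")
        return re.sub(r"[^A-Za-z0-9' ]+", " ", text)

    reviews["text"] = reviews["text"].map(clean_text)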


    Featurization

    We featurize our review text using a Bag of Words (BoW) model, building a training matrix of number of

    restaurants by size of vocabulary as follows:

    All reviews received by each restaurant in the training set (70% of total) are joined and tokenized with

    stopwords removed, then words are counted to create the sparse BoW vector for each restaurant.

We tested several different feature inclusions (a sketch of this featurization pipeline follows the list):

-   N-grams: unigrams only, or bigrams + unigrams

-   Number of features retained: 6,000, 15,000, 100,000, or ~200,000 (the total count of unique words in our training corpus)

-   Feature weights: raw frequencies, or term frequency-inverse document frequency (TF-IDF) weighting. We note here the specifics of the TF-IDF weighting: we used the default parameters of the sklearn.feature_extraction.text.TfidfTransformer() tool (norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False). The norm parameter means we normalize the final vectors, and the smooth_idf and use_idf parameters mean our features are weighted according to tf * (idf + 1), where tf is the frequency of the feature in the restaurant's merged reviews and idf is the inverse document frequency of the feature in the entire training corpus (all restaurant reviews).
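The sketch below shows how this featurization can be assembled with scikit-learn. It assumes a list restaurant_reviews_train holding one merged review string per training restaurant; it illustrates the approach rather than reproducing our notebook code exactly.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # Assumed input: one merged review string per training-set restaurant.
    train_docs = restaurant_reviews_train

    # Raw-count BoW features; parameters mirror the choices described above.
    vectorizer = CountVectorizer(
        stop_words="english",   # remove stopwords
        ngram_range=(1, 2),     # bigrams + unigrams (use (1, 1) for unigrams only)
        max_features=None,      # keep the full vocabulary, or e.g. 6000, 15000, 100000
    )
    X_counts = vectorizer.fit_transform(train_docs)   # sparse (n_restaurants x n_features)

    # Optional TF-IDF reweighting with the default parameters noted above.
    tfidf = TfidfTransformer(norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
    X_tfidf = tfidf.fit_transform(X_counts)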

    Another featurization we tried, but did not implement in the final pipeline, is to use the star rating matrix,

    which is a matrix of number of users by number of restaurants. Each element in the matrix corresponds to

a user’s rating for a certain restaurant. We then performed matrix factorization (through PCA and Alternating Least Squares) to obtain a factor matrix of number of factors by number of restaurants, and treated each restaurant's factor vector as a data point representing that restaurant.
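For reference, a minimal dense-matrix sketch of Alternating Least Squares is given below. It assumes the ratings are held in a NumPy array with zeros marking missing ratings, and is only meant to illustrate the factorization we tried, not production-quality code.

    import numpy as np

    def als_factorize(R, n_factors=20, n_iters=10, reg=0.1):
        """Alternating Least Squares on a user-by-restaurant rating matrix R.
        Missing ratings are encoded as 0 and ignored via the mask."""
        n_users, n_items = R.shape
        mask = R > 0
        U = np.random.rand(n_users, n_factors)
        V = np.random.rand(n_items, n_factors)
        eye = reg * np.eye(n_factors)
        for _ in range(n_iters):
            # Fix V and solve a regularized least-squares problem for each user.
            for u in range(n_users):
                obs = mask[u]
                if obs.any():
                    Vo = V[obs]
                    U[u] = np.linalg.solve(Vo.T @ Vo + eye, Vo.T @ R[u, obs])
            # Fix U and solve for each restaurant's factor vector.
            for i in range(n_items):
                obs = mask[:, i]
                if obs.any():
                    Uo = U[obs]
                    V[i] = np.linalg.solve(Uo.T @ Uo + eye, Uo.T @ R[obs, i])
        return U, V   # each row of V is the factor representation of one restaurant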

    Learning

    First we describe the learning methods used for the supervised problem of classifying restaurants into

    their existing labels. Then we describe the methods for the unsupervised problem of classifying

    restaurants into subcategories.

     Supervised text-based classification into existing labels 

    Models tested: Logistic regression and random forest

    Logistic regression marginally outperformed our random forest models, so we have chosen the logistic

    regression model as our primary model.

    Parameter choices:

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, max_iter=100, multi_class='ovr',
                   penalty='l2', random_state=None, solver='liblinear',
                   tol=0.0001, verbose=0)

Logistic regression: multi_class='ovr' indicates that a binary problem is fit for each label. So in our

    case, for each of the 20 categories we model whether a restaurant does or does not fall into that category.


    This is a logical choice since some restaurants fall into more than one category (for example, many “Sushi

    Bars” are also “Japanese”).
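A sketch of this per-label setup is shown below. The names X_train, Y_train, X_test, and category_names are assumptions standing in for our BoW training matrix, a binary indicator matrix with one column per category, the test matrix, and the 20 category names.

    from sklearn.linear_model import LogisticRegression

    # One binary classifier per category (the one-vs-rest framing described above).
    models, predictions = {}, {}
    for j, label in enumerate(category_names):
        clf = LogisticRegression(C=1.0, penalty='l2', solver='liblinear', tol=0.0001)
        clf.fit(X_train, Y_train[:, j])          # does this restaurant carry this label?
        models[label] = clf
        predictions[label] = clf.predict(X_test)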

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features=6000, max_leaf_nodes=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=4,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

      Random forests: We tested a number of parameter choices, but the best performance was achieved by

    keeping 6000 features, 100 estimators, bootstrap on, and gini criterion. We initially included fewer

features because, according to the scikit-learn documentation [2], for classification tasks the number of features considered in a random forest model is often set to max_features=sqrt(n_features). n_features in our

    case is ~200,000, so ~500 would be a good choice for max_features. However, we saw increased

    accuracy when we included more features.

    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True) 

Multinomial Naïve Bayes (MNB) (baseline model):

The alpha parameter is set to its default of 1 to apply additive (Laplace) smoothing.

    We also tried Bernoulli Naive Bayes by binarizing features such that presence of a word (count of 1 or

    more) gave a feature value of 1 while absence of a word gave a feature value of 0. This method gave us

    fairly high accuracy, but zero recall for all categories, so we present Multinomial NB as our baseline

    model. 
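The baseline and the binarized variant can be fit as in the sketch below, where X_train is the count matrix and y_train_label is the binary target for one of the 20 categories (both names are assumptions for illustration).

    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    # Multinomial NB on raw word counts (our baseline), with alpha=1 additive smoothing.
    mnb = MultinomialNB(alpha=1.0)
    mnb.fit(X_train, y_train_label)

    # Bernoulli NB on presence/absence features: any positive count becomes 1.
    # (BernoulliNB can also binarize internally via its binarize parameter.)
    X_train_bin = (X_train > 0).astype(int)
    bnb = BernoulliNB()
    bnb.fit(X_train_bin, y_train_label)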

    Clustering for Sub-Categorization

For sub-categorization, we implemented spectral clustering, which can be summarized as the following procedure [3]:

1. Form the affinity matrix A, with Aij = exp(-||si - sj||^2 / (2δ^2)) for i ≠ j, and Aii = 0.

2. Define D to be the diagonal matrix whose (i,i) element is the sum of A's i-th row, and construct L = D^(-1/2) A D^(-1/2).

3. Find the eigenvectors x1, ..., xk corresponding to the k largest eigenvalues of L, and form the matrix X = [x1 ... xk].

4. Re-normalize each row of X to unit length to form the matrix Y.

5. Treat each row of Y as a data point and run k-means clustering on the rows of Y.

We implemented this algorithm ourselves and applied it to cluster (1) all restaurants into groups, in order to see whether or not these groups correspond to a sensible composition of given restaurant types, and (2) Chinese restaurants into subcategories. The parameter in this algorithm is δ, which controls the connectivity of data points. The smaller δ is, the more separated the clusters will appear. We set δ to the value that makes the number of separated clusters equal to 5. A sketch of our implementation is given below.
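A compact sketch of this procedure, working on a dense data matrix and using scikit-learn's k-means for the final step; it illustrates the algorithm above rather than reproducing our notebook code exactly.

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clustering(S, k, delta):
        """Ng-Jordan-Weiss spectral clustering on the rows of S (n_points x n_features)."""
        # 1. Affinity matrix A_ij = exp(-||s_i - s_j||^2 / (2 delta^2)), zero diagonal.
        sq_dists = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1)
        A = np.exp(-sq_dists / (2.0 * delta ** 2))
        np.fill_diagonal(A, 0.0)

        # 2. L = D^(-1/2) A D^(-1/2), where D holds the row sums of A on its diagonal.
        d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
        L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

        # 3. Eigenvectors for the k largest eigenvalues of L (L is symmetric).
        eigvals, eigvecs = np.linalg.eigh(L)
        X = eigvecs[:, -k:]

        # 4. Re-normalize each row of X to unit length.
        Y = X / np.linalg.norm(X, axis=1, keepdims=True)

        # 5. K-means on the rows of Y; returns a cluster label per data point.
        return KMeans(n_clusters=k).fit_predict(Y)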


Other things we tried:

    Initially we were working on a different problem involving time series analysis of daily review counts.

    We took the time series of the daily review counts as our features and our hope was to (1) find different

    customer influx patterns for different types of restaurants, and (2) predict customer influx to certain cities

    and venues based on these time series. Our analysis failed because after running some statistical tests, we

    learned that there is not enough information in the time series for us to distinguish different types of

    restaurants and predict customer influx.

    Another method we tried was to factorize the star-rating matrix to yield a factor matrix corresponding to

    restaurants, then use this as features and apply k-means on it to find restaurant types. However, the

star-rating matrix is very sparse. Even using ALS (Alternating Least Squares) factorization, our average prediction error (measured by root mean squared error) was larger than 1 star. So we abandoned this

    feature and used bag of words instead. 

    Results

    Supervised Labeling

    How many features should we include?

    Figure 2 in the appendix shows that, for the case of unigrams only and no TF-IDF weighting, accuracy,

     precision, and AUC are all maximized by including the entire corpus. The effect of increased corpus size

on recall is less clear-cut. Since recall is not drastically decreased by including more features, we can base our choice on precision and accuracy.

    Should we weight our data by TF-IDF? Should we include bigrams?

    In Figure 3, we compare the performance of our logistic regression model for four featurization choices.

    In all cases, ~200,000 features are used, but we vary inclusion of unigrams vs. unigrams + bigrams, and

we test whether or not to weight with TF-IDF. The top plot shows that using bigrams in addition to unigrams has little effect on the overall accuracies. We see a slight improvement for Italian and Chinese

    restaurants when adding bigrams, but this improvement is not substantial. The middle and bottom plots

    show precision and recall. We see that using TF-IDF weights generally increases precision but decreases

    recall. We average over all 20 categories for these measures in the table below:

                 Raw frequencies,   TF-IDF weights,   Raw frequencies,      TF-IDF weights,
                 unigrams only      unigrams only     bigrams + unigrams    bigrams + unigrams

    Accuracy     0.9517             0.9486            0.9562*               0.9447
    AUC          0.9220             0.9492            0.9347                0.9498*
    Precision    0.7215             0.8611            0.7737                0.9040*
    Recall       0.5672*            0.3304            0.5551                0.2665

    Table 1. Comparison of different featurization choices for the logistic regression model (with ~200,000

    features retained). Measures are averaged across scores for the 20 categories.


     

The highest score for each accuracy measure is marked with an asterisk in Table 1. The highest overall accuracy is achieved by including bigrams and unigrams and weighting the features by their raw counts. Precision and AUC

    are improved by weighting by TF-IDF, but recall is markedly decreased. Such low recall would cause us,

    for example, to fail to recommend a relevant restaurant to a Yelp user, so we choose not to weight our

    data by TF-IDF. Thus our final choice for featurization is: ~200,000 features of bigrams and unigrams,

    weighted by raw word counts.

    Discussion of individual category accuracies

    Figure 4 shows accuracy, precision and recall for each category for our chosen model and featurization.

    Alongside our model’s accuracies, we include “Always False” accuracies, which is the accuracy for a

    model that simply predicts false uniformly for each label. We see that for all categories, our model

    outperforms the "Always False" classifier. However, accuracy is very close to this "Always False"

    classifier for the rarest categories: Sushi Bars, Delis, Steakhouses, Seafood, and Chicken Wings. With

    larger numbers of these types of restaurants, performance for these categories would potentially be

    improved.

Taking into account accuracy, precision, and recall, we see that our classifier is best at labeling Mexican, Pizza, and Chinese restaurants. It is not as successful at classifying American (Traditional and New), or at classifying

    the label "Food." This makes sense because American restaurants and "Food" restaurants have less

    obviously identifying word features than Mexican or Chinese restaurants. To visualize this effect, we

    examine word clouds for some of these cases (Figure 5). The word clouds display the most frequent

    words in all reviews for a given category, sized by their frequencies. Stopwords are removed in addition

    to the word “food,” which is common to all categories.

    Random Forest Model Results 

    Here we also report the accuracy measures for our Random Forest model, because it performed nearly as

    well as the Logistic Regression model. For the results presented here, we used the same training matrix as

    in the primary model (unigrams + bigrams, raw counts), but we only retain 6000 features. The parameters

used are given in the Methods section. We split our test dataset in half into a validation set (for testing parameter combinations) and a final test set. Table 2 summarizes accuracy scores for this model (both validation

    and final test scores), with logistic regression included for comparison. Accuracy and recall are below the

    logistic regression model, but precision is higher. If given more time to test more parameter combinations, it

    is plausible that we could achieve higher accuracy with this random forests approach. Recall might

    improve with shallower trees or fewer features considered, since these parameters give a simpler model

    with lower variance, but with this comes potentially higher bias (lower accuracy).

                                  Accuracy   Precision   Recall

    Logistic Regression           0.9562     0.7737      0.5551
    Random Forest (validation)    0.9530     0.8867      0.4666
    Random Forest (final test)    0.9522     0.8972      0.4550

    Table 2. Accuracy measure comparisons between primary (logistic regression) and a random forest

    model. Measures are averaged across scores for the 20 categories.


The random forest model allows us to examine the most important features. Here we list some of the most important features (in decreasing order): pizza, chinese, bar, mexican, pizzas, burger, subway, mcdonalds, mexican food, chinese food, sandwiches, bartender, sandwich, sushi, tacos, taco, bartenders, italian, coffee, burrito, crust, fries, burgers, bar food, drive, fried rice, pizza good, asada, salsa, pepperoni, beer, good pizza, fast food, pasta, rice, carne asada, waitress, breakfast, italian food, burritos, subs, wings, best pizza, happy hour, bars, bread, mein, drinks, pizza place, beers, sub, fast, great pizza, cafe, restaurant, italian restaurant, eggs, japanese, place, rice beans, deli, taco bell, great, carne, chinese restaurant, pub.

    Many of these features are obvious identifiers for certain labels.
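These terms can be read off a fitted forest as in the sketch below, where rf stands for one of our fitted RandomForestClassifier models and vectorizer for the CountVectorizer that built its 6000-feature training matrix (both names are assumptions for illustration).

    import numpy as np

    # Map feature importances back to vocabulary terms.
    feature_names = np.array(vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn
    order = np.argsort(rf.feature_importances_)[::-1]             # most important first
    print(feature_names[order][:50])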

    Comparison to baseline model

    In the table below we summarize accuracy measures (averaged across our 20 categories) for our primary

    model and our baseline model. We see significant improvement in all measures except for recall. The low

     precision of the baseline model indicates that it underfits our data, which is expected of a simple model

    such as Naive Bayes.

                                    Accuracy   AUC      Precision   Recall

    Multinomial NB (Baseline)       0.9119     0.8690   0.4818      0.7648
    Logistic Regression (Primary)   0.9562     0.9347   0.7737      0.5551

    Table 3. Accuracy measure comparisons between primary (logistic regression) and baseline (multinomial

    naive Bayes) models. Measures are averaged across scores for the 20 categories.

    We compare performance against the baseline model for all categories in Figure 6. In the top panel we see

    that our model is more accurate than our baseline model for all categories. While the improvement does

    not appear drastic, it should be noted that "Always False" never outperforms our primary model, but it

    outperforms the baseline model for 11 of the 20 categories (these 11 being Fast Food, American

    (Traditional), Sandwiches, Food, American (New), Breakfast and Brunch, Cafes, Delis, Steakhouses,

Seafood, and Chicken Wings). The 9 categories for which the baseline model surpasses “Always False” in accuracy are all categories we expect to have more unique vocabularies, such as ethnic cuisines. Our

     primary model outperforms the baseline model most significantly for labels Sandwiches (improvement by

    >20%) and Fast Food (improvement by >10%). Better accuracy is expected for logistic regression as

    compared to naive Bayes for a problem such as ours because naive Bayes is a simplification of logistic

    regression. Naive Bayes assumes that features (words) are generated independently given the class (in our

    case, the “class” is true or false for each label), whereas logistic regression does not make this

    assumption. As such, we expect the naive Bayes model to have higher bias but lower variance, and that it

    will underfit our data, leading to low precision and high recall.

    Spectral Clustering

    We applied spectral clustering on: (1) all restaurants to classify them into groups and analyze the true

    labels that comprise these groups, and (2) Chinese restaurants to classify them into subcategories. In order

    to figure out which labels each cluster corresponds to, we printed out the top 5 true labels of each of the 5

clusters. As shown in Figure 7, cluster 3 corresponds to pizza or Italian restaurants, and cluster 2 corresponds to bar and nightlife types of restaurants. The other three clusters are more difficult to interpret

     because they contain mixed types of restaurants. Figure 8 is the result of applying spectral clustering on


    Chinese restaurants. Many of the Chinese restaurants have true labels in addition to “Chinese,” such as

    “Taiwanese” or “Buffet.” So, as we did for the clusters of all restaurants, we can again print the most

common true labels (other than Chinese) for the restaurants in our Chinese clusters. First we notice that the most frequent labels in each cluster are “Asian Fusion” and “Buffet,” which provide little information about the clusters' types. Beyond that, in the first cluster we observe the Japanese and Sushi Bars labels, which implies that these restaurants are more dominated by Japanese-style food. In the fifth cluster we observe Thai, Vietnamese, and Szechuan restaurants, whose cuisines are relatively spicy.

    Tools

    We performed all of our analysis in iPython notebooks because this platform is useful for visualizing

    results alongside code. We used Pandas and NumPy for data manipulations because these are tools all

    group members use. At first, we built our BoW features (and TF-IDF weights) using handwritten code

    adapted from CS294 homework, but later we migrated towards scikit-learn tools for this task.

    sklearn.feature_extraction.text.CountVectorizer() was used to form BoW training matrices. This tool

    simplified a few tasks:

    (1) setting the maximum feature retention count (“max_features” parameter),

(2) setting which n-grams to include (the ngram_range parameter), and
(3) setting which stop words to remove (we removed words from the built-in “english” stop word list).

    Once those matrices were built, we could transform the counts into their TF-IDF representation with

    sklearn.feature_extraction.text.TfidfTransformer().

    For supervised labeling, we implemented several models from scikit-learn. The justification is that these

    tools are easy to use, especially in an iPython notebook. Models we used include:

    from sklearn.linear_model: LogisticRegression()

    from sklearn.naive_bayes: BernoulliNB() and MultinomialNB()

    from  sklearn.ensemble: RandomForestClassifier()

We also used these tools for quantifying model performance (a sketch of our per-category evaluation loop follows the list):

    from  sklearn.metrics: roc_curve, roc_auc_score, auc
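A sketch of that per-category evaluation loop is shown below. The names models, category_names, X_test, and Y_test are assumptions (the per-label classifiers, the 20 category names, the test BoW matrix, and the binary label matrix); accuracy_score, precision_score, and recall_score from sklearn.metrics are used here for the remaining measures.

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

    scores = {"accuracy": [], "precision": [], "recall": [], "auc": []}
    for j, label in enumerate(category_names):
        clf = models[label]
        y_true = Y_test[:, j]
        y_pred = clf.predict(X_test)
        y_prob = clf.predict_proba(X_test)[:, 1]   # probability of carrying the label
        scores["accuracy"].append(accuracy_score(y_true, y_pred))
        scores["precision"].append(precision_score(y_true, y_pred))
        scores["recall"].append(recall_score(y_true, y_pred))
        scores["auc"].append(roc_auc_score(y_true, y_prob))

    # Averages across the 20 categories, as reported in the tables.
    averages = {name: np.mean(vals) for name, vals in scores.items()}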

For unsupervised clustering, we used k-means from scikit-learn (as the final step of our own spectral clustering implementation).

    For visualization, we used Matplotlib because it is well-suited for simple graphics, and can be used inline

    in an iPython notebook. We also used the wordcloud package to create some appealing visualizations of

    our review text.

    Lessons Learned

    Supervised Labeling: We explored a number of machine learning approaches for the supervised problem

    of classifying Yelp restaurants into existing labels. Our best model was a logistic regression model,

closely followed by a random forests model. We thus selected the logistic regression model as our primary model and compared it to a baseline model (multinomial naive Bayes). The features used were

    the words from all of the reviews written for each restaurant that we aimed to classify. We evaluated a

    number of featurization choices for these words including:

    (1) whether to use unigrams only or whether to additionally include bigrams,

    (2) whether to weight the features by raw word counts or TF-IDF weights, and

    (3) how many features to include.


    As seen in Figure 2 and Table 1, we achieved the best performance for the logistic regression model by

    using bigrams+unigrams, retaining 200,000+ features, and representing features as raw word counts.

    We measure accuracy in a number of ways:

    (1) accuracy (did we correctly predict that a restaurant does or does not fall within a certain category?),

(2) area under the ROC curve,
(3) precision, and

    (4) recall.

Scores for these accuracy measures are displayed in Table 3. Our logistic regression model outperforms

    our baseline model substantially in accuracy, AUC, and precision, but the baseline model has higher

    recall. We also show that our primary model outperforms the “Always False” model for all 20 categories,

    whereas our baseline model does not for many categories. Our primary model performs best at classifying

    ethnic cuisine such as “Chinese” and “Mexican,” which we hypothesize is due to these types of

    restaurants having special and unique identifying words such as “Mexican” and “tacos” for Mexican

    restaurants and “Chinese” and “noodles” for Chinese restaurants. This is corroborated by the word cloud

    visualizations in Figure 5 and in looking at the most important features for our random forests model.

    Unsupervised Labeling:

    Unsupervised learning for subcategorization is relatively more difficult. In this project, we applied

spectral clustering on the review text in order to find subcategories of restaurants. The intuition is that, for Chinese restaurants for example, people may use “hot” and “spicy” to describe a Sichuan restaurant, and “milk tea” and “salted popcorn chicken” in reviews of Taiwanese restaurants. The difficulty, however, is that it is not obvious what each cluster corresponds to.

    One way to figure this out is to look at the percentage of existing labels. For example, if in a cluster, 50%

    of restaurants are “bar”, 25% are “night life”, then we could reason this cluster corresponds to the bar type

    of restaurants. Though we do observe this in some of the clusters (refer to the results section), there are

    also clusters with mixed labels that are not easy to interpret. A more fundamental question to ask is, is the

clustering based on restaurant types? Or is it perhaps more related to something else like star rating, cost, or other latent factors? A key lesson for us is that unsupervised learning doesn’t always give us the result

    we expect.

    Team Contributions

    *CS294* Bichen (40%): Time series analysis (majority of the “Project Preliminary Data Analysis”

    submission), star-rating matrix factorization, spectral clustering of review texts for unsupervised

    subcategory classification.

    *CS294* Stephanie (40%): Initial reading in of data and exploration of business dataset (majority of

    “Project Data Exploration” submission). Completion of bag of words featurization. Small scale

supervised labeling (majority of results presented in the PowerPoint presentation). Majority of text featurization and supervised labeling presented in the poster presentation and presented here.

    *CS194* Tsu-Fang (20%): Data exploration on review texts and user data. Started bag of words

    featurization and TF-IDF analysis. Tested value of adding restaurant name feature and TF-IDF effects on

model accuracies after logistic regression and naive Bayes (not shown). Ported and formatted results for the poster and presentation.


    References

    (1) Yelp academic dataset. https://www.yelp.com/academic_dataset.

    (2) “Ensemble Methods.” http://scikit-learn.org/stable/modules/ensemble.html

(3) Ng, Andrew Y., Michael I. Jordan, and Yair Weiss. "On spectral clustering: Analysis and an algorithm." Advances in Neural Information Processing Systems 14 (2002): 849-856.

Our GitHub repository is here: https://github.com/tsufanglu/Yelp-Dataset-Challenge

    The most relevant notebooks to this report are:

    CatsAllCities.ipynb

    Yelp_Restaurnats_Spectral_Clustering.ipynb

    They are located in the code folder of the repo:

    https://github.com/tsufanglu/Yelp-Dataset-Challenge/tree/master/code 


    Appendix (Figures)

    Figure 1. Chosen restaurant labels and their counts.


     

    Figure 2. Accuracies, precisions, and recalls for our logistic regression model colored by the number of

words retained in the training corpora (no TF-IDF weighting). These indicate that we ought to keep as

    many words as possible as features.


     

    Figure 3. Comparison of accuracies for 4 different featurization choices. In each case 211964 words (or

    211964 bigrams + unigrams in the bigram case) are retained for training.


     

    Figure 4. Accuracy measures for our chosen model, broken down by category.

    Figure 5. Word clouds for American (New) (Upper Left), American (Traditional) (Upper Right),

Mexican, and Chinese. Notice Chinese has words unique to it such as “Chinese,” “noodle,” “rice,” and “dumpling”; Mexican has unique words like “Mexican,” “taco,” and “burrito,” but the upper two word clouds

    do not show obviously unique words.


    Figure 6. Baseline comparisons. There is substantial improvement over baseline for accuracy, AUC, and

     precision. The simple baseline model has higher recall. Bottom panel labels serve as a guide for all

     panels.


     

Figure 7. Spectral clustering results for all restaurants, showing the five most frequent labels in each cluster.

Figure 8. Spectral clustering results for Chinese restaurants, showing the five most frequent labels in each cluster.
