
Classification of Amazon Product Metadata: Support Vector Machine & Kernel Methods

Research Paper

STAT5703 | Amy Li, Mitchell Hughes

STAT4601 | Nolan Hodge

Monday, March 26, 2018

ABSTRACT

Automated classification of metadata into predefined categories is an important means of managing and processing the expanding amount of information that Amazon stores. This enormous body of web documentation lacks structure, which prevents potentially useful knowledge from being derived from its collection. Classification is a key technique for organizing such digital data: in this paper, we research and apply Support Vector Machines (SVM) and kernel methods to Amazon.com's product metadata. From our model, we predict products that are likely to be co-purchased or co-viewed based on their textual titles and category classifications. We find that a high degree of accuracy can be obtained with our parsimonious model.

Keywords: Machine Learning, Support Vector Machines, Kernel Methods, Text Classification, Data Mining, Metadata


TABLE OF CONTENTS

SECTION

I: STATEMENT OF PROBLEM

II: LITERATURE REVIEW

IIIA: ANALYSIS OF SIMILAR APPLICATIONS

IIIB: DISCUSSION OF TECHNICAL & INTUITIVE MODEL FRAMEWORK

IV: APPLICATION OF METHODOLOGY & CODE OVERVIEW

V: CONCLUSION

BIBLIOGRAPHY

APPENDIX I: CODE

APPENDIX II: WORK DISTRIBUTION STATEMENT


I. STATEMENT OF PROBLEM

In this paper, we are interested in uncovering Amazon consumer purchasing habits by classifying the relationship between products and their co-purchases. We thus seek to research the underlying technical aspects of designing a model in which co-purchased products can be predicted with Support Vector Machines (SVM), specifically via the k-means technique. By parsing product titles, descriptions, categories and Amazon's unique identifiers (ASINs), we categorize common consumer baskets. This posited problem inherently requires modeling the Amazon metadata and analyzing purchasing patterns from alphanumeric inputs.

First and foremost, this research paper serves to form a deeper understanding of the topic by

thoroughly digesting scholarly and technical articles. From these expert sources, we formulate a

review and discussion of methodologies involved in constructing SVMs with k-means.

The task of classifying data under predefined categories, where either single or multiple labels exist, is meant to automatically assign classes based on content. This technique is valuable for data mining raw material. Classification is defined as algorithmically assigning a document or object to one or more classes based on attributes, behavior or subject. The broad problem is identified as training on a dataset consisting of records, such that each record has a unique record identifier and corresponding fields. The amount of information available online is rapidly increasing; therefore, research on automatic classifiers has integral meaning for machine learning and information extraction. In addition, the evolution of this technique has vital implications for its present-day applications. The goal is to create a model from the training dataset, based on the class labels' attributes, which can classify new data. With regard to machine learning approaches, this area of research has been steadily expanding in the published domain and, likewise, SVM algorithms have been rapidly adopted on a plethora of applicable databases.

The paper is organized as follows. Section II describes the literature that encompasses SVM; in Section III we discuss both the applications of the Amazon dataset performed by others and our own model's technical and intuitive framework. The penultimate section outlines our application to the Amazon dataset, analyzes the code and evaluates our SVM model's predictions. Finally, we conclude in Section V with future directions and a summary of this research project.

II. LITERATURE REVIEW

This research paper discusses the varied architectures and approaches to machine learning classifiers and alternative data mining methods, and we consult the following experts' writings. Our literature review comprises peer-reviewed papers and less formal yet informative sources on SVM topics and relevant datasets. We must stress that though the authors of these informal sources are not traditional academics, due to the rapidity of machine learning developments these data scientists' blogs contain a breadth of cutting-edge knowledge.

This section begins with a brief outline of SVM origins, then relays relevant papers that touch on our research topic before reviewing more specific sources that utilize this specific dataset. This latter literature necessitates further technical discussion; thus, we include the informal sources within Section III.

SVM is a fairly new learning method whose algorithmic roots were introduced to solve statistical problems before the online era. Vapnik and colleagues (1963) introduced the original maximum-margin hyperplane algorithm, and Vapnik continued developing the technique into the 1990s, when nonlinear classifiers were obtained by applying the kernel method to maximum-margin hyperplanes.

In another pivotal paper, Thorsten Joachims explores the use of SVM on text data and highlights its automation, as it eliminates the need for manual parameter tuning. His paper provides an excellent overview of SVM's functions; he notes that SVMs are universal learners: in their fundamental form, SVMs learn linear threshold functions, but by plugging in an appropriate kernel function they remain pliable enough to learn polynomial classifiers, radial basis function networks and even neural nets.

Joachims continues to explain that a unique property of SVMs is that their learnability can be independent of the dimensionality of the feature space, since SVM "measures the complexity of hypotheses based on the margin with which they separate the data, not the number of features". The paper proposes SVMs for text categorization, arguing that this built-in overfitting protection means performance does not depend on the number of features, so SVMs can handle large feature spaces.

For a paper that encapsulates recent research, Jindal et al.'s Techniques for Text Classification: Literature Review and Current Trends (2015) surveys existing work in this discipline and evaluates competing methodologies. They begin by defining:


“Text classification consists of document representation, feature selection or feature

transformation, application of data mining algorithm and finally an evaluation of the applied

algorithm.” (Jindal et al 2)

Their collection gathers varied research from digital and analog portals. Jindal et al. found that SVM algorithms were the most studied and the most popular means of text classification (followed by k-nearest neighbours), though many authors proposed advanced methods to enhance applicability. Of their sample of 132 papers, 65% (88 papers) involve these two algorithms; thus, researchers display a clear preference for SVM and KNN machine learning.

For instance, Leopold and Kindermann (2002) examine how much pre-processing and feature selection kernel functions require, and argue that weighting methods can reduce dimensionality and have a larger impact on SVM performance. Namburu et al. (2005), meanwhile, argue that SVM is more suitable for binary classification. Saporta (1990) states that linear association between variables, together with suitable transformation of the original variables or proper distance measures, can produce satisfactory solutions.

Perhaps optimistically, these authors note that though many papers' conclusions are data-dependent and contain black-box solutions, it remains promising that there is a seismic shift from traditional statistical methods towards modern machine learning.

In terms of this specific dataset, a few projects and academic studies have been conducted to reveal new machine learning practices. Julian McAuley et al. (2015) proposed content-based recommender systems to model user preferences towards types of foods, which is akin to our approach. He analyzes metadata from a user's previous activity, instead of collaborative means of recommending based on other users' activities, and combines the two methods to address sparsity and cold-start problems (items with no reviews yet). In a similar paper, he leverages the same dataset for a scalable, personalized, temporally evolving and interpretable visually-aware recommender system. McAuley et al. address how "long-tailed" corpora, with new items continually introduced as cold-starts, are common problems that require remedies.

III-A. ANALYSIS OF SIMILAR APPLICATIONS

In this twofold section, we first reference machine learning works that have already been conducted on this Amazon dataset. Second, we examine the technical components of our SVM model along with an intuitive explanation of it.

Professor Julian McAuley of the University of California San Diego is the source of our dataset, which ranges from Amazon's online debut in 1995 to 2013. Analyzing the 9.4 million products (roughly 10 GB) proved to be a difficult computational task for any laptop, so we create a subset. If we deduplicate the entries, we find that there are fewer distinct products: many are variations of the same product (i.e. different colours of a mug), and many entries contain nulls without pricing information.

Max Woolf, an associate data scientist at BuzzFeed, has performed a similar analysis on McAuley's review dataset, a collection totalling 142.8 million product reviews. He concluded that in certain categories, reviews rated 4 or 5 stars out of 5 are frequently recognized as helpful by other users; likewise, 1-star Amazon Electronics reviews (thereby signalling harsh disapproval) are also considered helpful. However, reviews of 2-3 stars are not, which Woolf notes as a signal that Amazon could benefit from a binary like/dislike system instead of its present rating schema.

Fig. 1: Woolf rating distribution

Fig. 2: Woolf review summary


Trevor Smith, another data scientist (at Metris, and by training an economist), delves into this dataset as well for machine learning applications. Smith transforms the data using a "Bag-of-Words" technique: reviewing all text, he creates a sparse matrix of the words, in which each word is configured as a feature/attribute for the classification algorithm:

Fig. 3: Smith classifier matrix

From the figure above, it is evident that the two sentences do not share all of their words: each unique word fills one column of the matrix. Parsing through the reviews, Smith examines whether an entry contains the word in the first position of the matrix: if yes, a 1 is assigned, and otherwise a 0. This process is repeated for every column. The next step is to fit the SVM classification model to the training data. The resulting accuracy is quite high for a simple application practice.
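To make the encoding concrete, below is a minimal R sketch of this binary Bag-of-Words step. The two sentences and all variable names are our own illustration, not Smith's actual data or code.

# A minimal sketch of the binary Bag-of-Words encoding described above;
# the sentences are illustrative placeholders
sentences <- c("this mug is great", "this mug broke fast")

tokens <- strsplit(sentences, " ")          # split each sentence into words
vocab  <- sort(unique(unlist(tokens)))      # collect the shared vocabulary

# One row per sentence, one column per word: 1 if the word occurs, else 0
bow <- t(sapply(tokens, function(words) as.integer(vocab %in% words)))
colnames(bow) <- vocab
bow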

III-B. DISCUSSION OF TECHNICAL & INTUITIVE MODEL FRAMEWORK

For a rudimentary example, consider a training dataset of points $(\mathbf{x}_i, y_i)$, $i = 1, \dots, n$, where each $y_i \in \{1, -1\}$ indicates the class to which $\mathbf{x}_i$ belongs. Each $\mathbf{x}_i$ is a $p$-dimensional real vector, and our objective is to maximize the distance between the group of points for which $y_i = 1$ and those for which $y_i = -1$, so that the hyperplane and the nearest points of either group are as far apart as possible.
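Written out, this maximum-margin objective is the standard hard-margin SVM optimization problem (stated here for completeness; the notation is ours, matching the definitions above):

\[
\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{subject to} \quad
y_i\left(\mathbf{w}^\top \mathbf{x}_i + b\right) \ge 1, \qquad i = 1, \dots, n,
\]

where the separating hyperplane is $\mathbf{w}^\top \mathbf{x} + b = 0$ and the resulting margin between the two groups is $2 / \lVert \mathbf{w} \rVert$.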

From there, clusters are defined to be externally heterogeneous and internally homogeneous, meaning members are like one another yet dissimilar to members of other clusters. In other words, the algorithm aims to create a strong association structure among variables.

Our classifier's aim is to use a set of pre-classified data to classify data which have not yet been seen. Thus, our first objective in classification is to include the maximum number of product entries that pass the defined inclusion criterion. Selecting a relevant training subset is essential; this was undertaken in the following steps.

Firstly, the data is downloaded in JavaScript Object Notation (JSON) format and transformed into comma-separated values (CSV) by parsing through the entries for all key headings. We remove irrelevant entries where there are null values for any of the observed columns. Once this list is created, the entries are flattened into their respective columns. However, the categories column remains compacted: subcategories are nested within square brackets, running from the broadest category label for the product entry to increasingly granular subcategories. A sketch of this step appears below.
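We performed this conversion in Python; as a rough equivalent in R, the outline below uses the jsonlite package, assuming the packets are strict JSON (the raw dump may require light cleaning first). File and column names are placeholders.

# A sketch of the JSON-to-CSV preprocessing described above
library(jsonlite)

lines   <- readLines("meta.json")        # newline-separated JSON packets
records <- lapply(lines, fromJSON)

# Keep only entries with no nulls in the observed columns
keep <- vapply(records, function(r) {
  !is.null(r$asin) && !is.null(r$title) && !is.null(r$price) && !is.null(r$categories)
}, logical(1))
records <- records[keep]

# Flatten each record into one CSV row; the nested categories list is
# collapsed from broadest label to most granular subcategory
rows <- lapply(records, function(r) {
  data.frame(asin     = r$asin,
             title    = r$title,
             price    = r$price,
             category = paste(unlist(r$categories), collapse = " / "),
             stringsAsFactors = FALSE)
})
write.csv(do.call(rbind, rows), "meta.csv", row.names = FALSE)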

Secondly, once preprocessing has completed, we transform the dataset, which contains strings of characters, into a suitable format for the learning algorithm and classification task. Word stems are one means of accomplishing this; word ordering is a negligible concern for the purposes of our project. We then build an attribute-value representation of the text: one could assign each distinct word to a corresponding feature, with the number of times it occurs in the document or dataset as its value. For compactness, words correspond to features only if they are present in the documents at least thrice and if they are not "stop-words" (i.e. prepositions, articles, etc.).

Bag-of-Words is a commonly utilized representation method, in which a document is represented as the collection of words that occur in it at least once. We evaluate our model using a novel corpus of words from Amazon's product titles, or from descriptions and categories. A sketch of this representation follows.
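Below is a minimal sketch of this representation using the tm package (also used in Appendix I). The titles are illustrative rather than drawn from the dataset, and the "at least thrice" criterion is interpreted here as appearing in at least three titles.

# A sketch of the attribute-value representation described above
library(tm)

titles <- c("Girls Ballet Tutu Zebra Hot Pink",
            "Girls Ballet Tutu Neon Pink",
            "Girls Ballet Tutu Neon Blue",
            "Hot Pink Zebra Print Mug")

corpus <- VCorpus(VectorSource(tolower(titles)))
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop-words
corpus <- tm_map(corpus, removePunctuation)

# Keep only terms appearing in at least three titles; cell values are
# term frequencies
dtm <- DocumentTermMatrix(corpus,
                          control = list(bounds = list(global = c(3, Inf))))
inspect(dtm)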

For simplicity, we design our clustering in a binary manner: with two groups, we posit whether we can accurately predict, given a product's title, whether said product will appear as also_bought with another product.

For the also_bought array, we loop through the whole CSV file for each unique ASIN identifier and verify whether its value appears in another product's also_bought array. If this is confirmed, we increment the top-level ASIN's counter and repeat the process. For an individual product, also_bought_count is thus the total number of times it appears in other products' also_bought arrays, computed as sketched below.
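A compact R sketch of this counting pass; the file name and the assumption that also_bought is stored as a comma-separated string are placeholders for the converted CSV's layout.

# A sketch of the also_bought counting pass described above
meta <- read.csv("meta.csv", stringsAsFactors = FALSE)

# Pool every ASIN mentioned in any product's also_bought array...
all.related <- unlist(strsplit(meta$also_bought, ","))

# ...then count, for each top-level ASIN, how often other products list it
meta$also_bought_count <- vapply(meta$asin,
                                 function(a) sum(all.related == a),
                                 integer(1))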

We construct a model which learns the words that most likely lead to a co-purchasing scenario given a product's specific attribute. Our model deciphers "someword" → the percentage likelihood that it will exist in the dataset and that the product will be within also_bought. A corpus of words selected from the title column of both the training and test sets informs our model's predictions; titles are stripped into words separated by spaces and punctuation. The True/False indicator is the result of the model's predictions against the test set. 1,539 products were never listed in also_bought; most of the results indicate that solo purchases commonly occur among Amazon consumers.


IV: APPLICATION OF METHODOLOGY & CODE OVERVIEW

The initial format for Amazon metadata consists of newline-separated JSON packets, as displayed below:

{
  "asin": "0000031852",
  "title": "Girls Ballet Tutu Zebra Hot Pink",
  "price": 3.17,
  "imUrl": "http://link_to_image/.jpg",
  "related": {
    "also_bought": ["ASIN_0", ..., "ASIN_n-m", "ASIN_n"],
    "also_viewed": ["ASIN_0", ..., "ASIN_n-m", "ASIN_n"],
    "buy_after_viewing": ["ASIN_0", ..., "ASIN_n-m", "ASIN_n"],
    "bought_together": ["ASIN_0", ..., "ASIN_n-m", "ASIN_n"]
  },
  "salesRank": {"Toys & Games": 211836},
  "brand": "Coxlures",
  "categories": [["Sports & Outdoors", "Other Sports", "Dance"]]
}

Source: http://jmcauley.ucsd.edu/data/amazon/links.html

A loose description of non-intuitive fields that we are concerned with:

asin              | unique identifier for a product
also_bought       | list of asins | products purchased alongside the top-level asin
also_viewed       | list of asins | products viewed alongside the top-level asin
buy_after_viewing | list of asins | products purchased after viewing the asin
bought_together   | list of asins | products commonly purchased together with the asin

Since we will be using a linear SVM classification model, we require a means to label the data. This is where the "related" co-purchasing fields are used: the latter four headings above. Using Python, the raw JSON metadata is converted into a workable comma-separated values (CSV) file that, for each ASIN, records a count of how many times the product appears in a related field of any other product.


Grabbing the values that we require from the shuffled JSON packets, each dictionary entry is written as a CSV line. We cap the file at a 10,000-line threshold to begin our testing phase. Now, R-based SVM algorithms can be used to build the classification model. We develop and apply our model on an example that uses the title column and also_bought_count.


Here is the distribution of also_bought_count values (top row) and their frequencies (bottom row):

also_bought_count     0    1   2   3   4   5   6   7   8   9  10  11  12  13  14  17  19  24
frequency          1539  224  89  50  27  25  15   8   8   3   2   2   1   1   2   2   1   1

The model will classify products by whether they have been also bought or not; in other words, 1 for also_bought (also_bought_count > 1) and 0 for not also_bought (also_bought_count ≤ 1).

The training set will consist of the first 2,000 entries; the test set will consist of the next 500 entries, split as sketched below.
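Using the same RTextTools container call as Appendix I, this split would read as follows (a sketch; product.matrix and data.clean are the objects built in Appendix I, and note the appendix code experiments with smaller index ranges):

# Train on the first 2,000 rows, test on the next 500
library(RTextTools)

product.container <- create_container(product.matrix,
                                      data.clean$count_new,
                                      trainSize = 1:2000,
                                      testSize  = 2001:2500,
                                      virgin    = FALSE)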


CODE ANALYSIS

Using a corpus of words (akin to the "Bag of Words" technique), we collect the terms in all rows of titles. From there, we construct a Document Term Matrix for all terms, and remove overly sparse terms so that the matrix's sparsity is less than 0.95.
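In tm, this reduction corresponds to removeSparseTerms; a one-line sketch, where dtm.full stands for the full matrix of title terms:

library(tm)

# Drop terms absent from more than 95% of the titles
dtm.reduced <- removeSparseTerms(dtm.full, sparse = 0.95)

The reduced matrix prints as follows: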

<<DocumentTermMatrix (documents: 2000, terms: 45)>>
Non-/sparse entries: 3304/86696
Sparsity           : 94%
Maximal term length: 12
Weighting          : term frequency (tf)

Here is a look at the SVM model:

> svm.model.title.also.bought.count
L2 Regularized Support Vector Machine (dual) with Linear Kernel

2000 samples
  45 predictors
   2 classes: '0', '1'

No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 2000, 2000, 2000, 2000, 2000, 2000, ...
Resampling results across tuning parameters:

  cost  Loss  Accuracy   Kappa
  0.25  L1    0.8870779  0.0031141372
  0.25  L2    0.8884432  0.0003427000
  0.50  L1    0.8868551  0.0040584027
  0.50  L2    0.8881687  0.0005571252
  1.00  L1    0.8864654  0.0053518981
  1.00  L2    0.8882257  0.0024991084

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were cost = 0.25 and Loss = L2.


Here is a look at the first 5 rows and columns in the Document Term Matrix:

      Terms
Docs   memoir rock roll soldier anderson
   1        1    1    1       1        0
   2        0    0    0       0        1
   3        0    0    0       0        0
   4        0    0    0       0        0
   5        0    0    0       0        0

OUTPUT:

FALSE  TRUE
0.208 0.792

Using this model, we predicted the test-set labels with ~79% accuracy.

V. CONCLUSION

In this paper, we research SVM methodology from widespread sources to develop a deeper understanding of this machine learning technique, and apply this knowledge to a chosen dataset. In our own application, we uncover Amazon consumer purchasing habits by classifying the relationship between products and their co-purchases, which are predicted with Support Vector Machines (SVM) and k-means. By parsing product titles, descriptions, categories and Amazon's unique identifiers (ASINs), we wield Amazon metadata and analyze purchasing patterns from alphanumeric inputs. Our parsimonious text classification model allows us to predict co-product relations with high accuracy.

For future research endeavors, we could flexibly expand our analysis by including an additional column in the initial CSV data entry process, so as to include other attributes associated with ASINs, such as categories or descriptions instead of titles. This would entail changing the product.container initialization in the R code so that testSize runs to the last row. The result would be a prediction of whether, for instance, a certain product would show up pairwise in also_bought with other products. In addition, this analysis could become more granular and thorough if we further investigated the True also_bought_count co-product relationships. A hypothetical variant is sketched below.
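The sketch below illustrates this extension: classifying on the category column and extending testSize to the final row. The names mirror Appendix I; this is untested illustration, not the model we ran.

# A hypothetical variant of the Appendix I pipeline for future work
library(RTextTools)

category.matrix <- create_matrix(data.clean$category, language = "English",
                                 removeNumbers = TRUE, removePunctuation = TRUE,
                                 removeStopwords = FALSE, stemWords = FALSE)

n <- nrow(data.clean)
category.container <- create_container(category.matrix,
                                       data.clean$count_new,
                                       trainSize = 1:2000,
                                       testSize  = 2001:n,   # testSize as the last row
                                       virgin    = FALSE)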


BIBLIOGRAPHY

Hastie, Trevor, et al. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2017.

He, Ruining, and Julian McAuley. "Ups and Downs: Modeling the Visual Evolution of Fashion Trends with One-Class Collaborative Filtering." Proceedings of the 25th International Conference on World Wide Web - WWW '16, 2016, doi:10.1145/2872427.2883037.

James, Gareth, et al. An Introduction to Statistical Learning: with Applications in R. Springer, 2017.

Jindal, Rajni, et al. "Techniques for Text Classification: Literature Review and Current Trends." Webology, vol. 12, no. 2, Dec. 2015.

Joachims, Thorsten. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features." Proceedings of the 10th European Conference on Machine Learning (ECML '98), Springer, 1998.

Leopold, Edda, and Jörg Kindermann. "Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?" Machine Learning, vol. 46, 2002, pp. 423-444.

McAuley, Julian, et al. "Image-Based Recommendations on Styles and Substitutes." Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR '15, 2015, doi:10.1145/2766462.2767755.

Namburu, S. M., et al. "Experiments on Supervised Learning Algorithms for Text Categorization." 2005 IEEE Aerospace Conference, 2005, doi:10.1109/aero.2005.1559612.

Saporta, Gilbert. Probabilités, Analyse des Données et Statistique. Technip, 1990.

Vapnik, Vladimir, and Aleksandr Lerner. "Pattern Recognition Using Generalized Portrait Method." Automation and Remote Control, vol. 24, 1963, pp. 774-780.


APPENDIX I: CODE

#######################################################

# File and Directory setup

#######################################################

dir.root <- paste("C:", "Users", "trash", "dev", "data-mining", "support-vector-machines", sep="/") # set me

dir.code <- "Code"

dir.data <- paste(dir.root, "Data", sep="/")

amazon.csv <- paste("asin_metadata_no_list_dict","csv",sep=".")

amazon.data <- paste(dir.data, amazon.csv, sep="/")

# Get data

amazon.counts.csv <- paste("bought_with_count_categories_asin_metadata_small","csv",sep=".")

amazon.counts.data <- paste(dir.data, amazon.counts.csv, sep="/")

data.amazon <- read.csv(amazon.counts.data, header = TRUE, fill = TRUE, sep=",")

#######################################################
# Model building and evaluation
#######################################################

library(RTextTools)

library(tm)

library(dplyr)

data.amazon <- read.csv(amazon.counts.data, header = TRUE, fill = TRUE, sep=",")

# Binary label: 1 if the product appears in other products' also_bought
# arrays more than once, 0 otherwise
data.clean <- data.amazon %>% mutate(count_new = if_else(also_bought_count > 1, 1, 0))

# Document Term Matrix over the category column (no stemming or
# stop-word removal)
product.matrix <- create_matrix(data.clean$category, language = "English",
                                removeNumbers = TRUE,
                                removePunctuation = TRUE,
                                removeStopwords = FALSE, stemWords = FALSE)


# Container holding the training rows (1:1000) and test rows (1051:1074)
product.container <- create_container(product.matrix,
                                      data.clean$count_new,
                                      trainSize = 1:1000, testSize = 1051:1074,
                                      virgin = FALSE)

# Train the linear SVM and classify the held-out rows
product.model <- train_model(product.container, algorithm = "SVM")
product.result <- classify_model(product.container, product.model)

# Compare actual vs. predicted labels; SVM_LABEL is a factor, so subtract 1
# after coercion to recover the original 0/1 coding
x <- as.data.frame(cbind(data.clean$count_new[1051:1074], product.result$SVM_LABEL))
colnames(x) <- c("actual.count", "predicted.count")
x <- x %>% mutate(predicted.count = predicted.count - 1)

# Proportion of correct predictions
round(prop.table(table(x$actual.count == x$predicted.count)), 3)


APPENDIX II: WORK DISTRIBUTION STATEMENT

This research project was done in a collaborative manner. All team members contributed insights and coordinated tasks.

AMY LI – Writing, I, II, IIIA, IIIB, V, Slide Deck & Presentation

NOLAN HODGE – Coding, IIIB, IV

MITCHELL HUGHES – Editing, BIBLIOGRAPHY, Slide Deck & Presentation
