TRANSCRIPT
DATA ANALYTICS ON AMAZON PRODUCT REVIEWS USING NOSQL HIVE AND MACHINE LEARNING WITH SPARK ON THE HADOOP FILE
SYSTEM.
PRESENTED BY:
DEVANG PATEL (2671221)
SONAL DESHMUKH (2622863)
INTRODUCTION:
• The significance of online shopping is growing day by day because purchases can be made with just one click.
• Amazon is one such world-renowned e-commerce website. Initially it was known for its huge collection of books, but it later expanded to other items.
• E-commerce is about making money, so customer satisfaction and opinion are an important part of e-commerce websites. This gave rise to "User Reviews".
• User reviews are customer opinions that help other customers make decisions about a product.
HOW TO GET THE AMAZON REVIEW DATASET?:
• We emailed the dataset maintainers to get access to the Amazon review dataset, and they provided a link from which we could download it.
• The data was in JSON file format.
SOFTWARE AND TOOLS:
REVIEW SAMPLE AND DATASET DESCRIPTION:
• Rating (1-5 stars)
• Review text
• Summary
• Number of people who found the review helpful
• Product ID
• Reviewer ID
CONTINUED…
• We got a JSON dataset that contains the following fields:
• reviewerID: ID of the Reviewer.
• asin: ID of the Product.
• reviewerName: Name of the reviewer.
• helpful: Helpfulness rating of the review.
• reviewText: Text of the review.
• overall: Rating of the Product.
• summary: Summary of the review.
• unixReviewTime: Time of the review (Unix time).
• reviewTime: Time of the review.
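A single record with these fields can be parsed like this (a sketch; the field values below are illustrative, not taken from the actual dataset):

```python
import json

# Illustrative review record (values are made up) using the fields listed above.
raw = ('{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
       '"reviewerName": "J. McDonald", "helpful": [2, 3], '
       '"reviewText": "I bought this for my husband.", '
       '"overall": 5.0, "summary": "Heavenly Highway Hymns", '
       '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}')

review = json.loads(raw)
print(review["asin"], review["overall"])  # product ID and star rating
```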
IMPLEMENTATION:
• The data was in JSON format and we were using Hive, so we either needed to convert the JSON to a CSV file or use a JSON SerDe; we chose the JSON SerDe.
• hive-serdes-1.0-SNAPSHOT.jar
• We downloaded the JSONSerDe jar file and copied it into the hive/lib folder.
CONTINUED…
• Uploaded the data of the Music_Instruments.JSON file to HDFS.
• After uploading it to HDFS, we loaded it into the Hive table.
CONTINUED…
• Table: MI_table
• Row format: JSONSerDe.class
• Location: HDFS path where JSON
file is stored.
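The table creation described above can be sketched roughly as follows (the column types, jar path, SerDe class name, and HDFS location are assumptions based on the listed fields, not taken from the slides):

```sql
ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE MI_table (
  reviewerID     STRING,
  asin           STRING,
  reviewerName   STRING,
  helpful        ARRAY<INT>,
  reviewText     STRING,
  overall        DOUBLE,
  summary        STRING,
  unixReviewTime BIGINT,
  reviewTime     STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hive/amazon/music_instruments';
```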
PROBLEMS FACED DURING TABLE CREATION IN HIVE
• SELECT * FROM MI_table;
• It fetched all 10,261 rows, but the problem was NULL values.
• The NULL values occurred because the key names were in capital case.
• Solution: SERDEPROPERTIES("case.insensitive"="false")
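This property can also be applied without recreating the table, for example (a sketch):

```sql
ALTER TABLE MI_table SET SERDEPROPERTIES ("case.insensitive" = "false");
```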
PROBLEM WITH THE DATA FORMAT
• The metadata file of Amazon reviews has key-value pairs in single quotes.
• We tried all the types of JSON SerDe available, but none worked.
• Python's json.dumps functionality converted the data into the correct JSON format.
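Single-quoted records are valid Python literals rather than strict JSON, so one common way to apply the json.dumps fix mentioned above is to evaluate each line and re-serialize it (a sketch; the sample line and helper name are illustrative):

```python
import ast
import json

def fix_line(line):
    """Convert a single-quoted, Python-literal-style record into strict JSON."""
    record = ast.literal_eval(line)  # safely parse the Python-style dict
    return json.dumps(record)        # re-serialize with double quotes

# Example metadata-style line with single quotes (values are made up):
bad = "{'asin': '0000013714', 'title': 'Hymn Book', 'price': 9.99}"
good = fix_line(bad)
print(good)  # now valid JSON with double-quoted keys
```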
SELECT * DESERIALIZES
CAN WE USE THIS TABLE AT THE PRODUCTION LEVEL?:
MOST REVIEWED PRODUCT:
AVERAGE RATING OF MUSICAL INSTRUMENTS:
AVERAGE RATINGS ON PRODUCTS
AVERAGE RATINGS FOR 5 DIFFERENT AMAZON PRODUCT CATEGORIES:
• Automotive (4.18)
• Cellphones (4.12)
• Lawn_Garden (4.18)
• Musical Instruments (4.48)
• Pet_Supplies (4.22)
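Each per-category average above can be computed with a query of this shape (a sketch; one table per category is an assumption):

```sql
-- Overall average rating for one category table:
SELECT AVG(overall) AS avg_rating
FROM MI_table;

-- Average rating per product within that category:
SELECT asin, AVG(overall) AS avg_rating
FROM MI_table
GROUP BY asin
ORDER BY avg_rating DESC;
```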
REVIEWS WERE POSITIVE OR NEGATIVE?:
(Steps 1-6 were shown as screenshots in the original slides.)
WHICH YEAR PRODUCTS WERE REVIEWED MOST?
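The year-wise review counts can be derived from the unixReviewTime field with a query of this shape (a sketch):

```sql
SELECT YEAR(FROM_UNIXTIME(unixReviewTime)) AS review_year,
       COUNT(*) AS num_reviews
FROM MI_table
GROUP BY YEAR(FROM_UNIXTIME(unixReviewTime))
ORDER BY num_reviews DESC;
```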
COST OF MOST REVIEWED (TOP 5) PRODUCTS
HDFS IN THE BROWSER: http://localhost:50070
TRUST AND HELPFULNESS IN AMAZON PRODUCT REVIEWS
• The 'helpful' column contains values that look like '[56, 63]'.
• The first value represents the number of helpful votes; the second represents the overall votes.
• We computed the helpfulness percentage and also a binary column that states whether the review is helpful or not.
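The parsing of the '[56, 63]' values into a percentage and a binary flag can be sketched in Python (the 50% threshold for the binary column is an assumption for illustration):

```python
import ast

def helpfulness(raw):
    """Parse a "[helpful, total]" string into (percentage, is_helpful).

    The 50% threshold for the binary flag is an assumption for illustration.
    """
    helpful_votes, total_votes = ast.literal_eval(raw)
    if total_votes == 0:
        return 0.0, 0  # no votes: treat the review as not helpful
    pct = 100.0 * helpful_votes / total_votes
    return pct, 1 if pct >= 50.0 else 0

print(helpfulness("[56, 63]"))  # about 88.9% helpful, flagged as helpful
```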
OUTPUT FILE AND TABLEAU DATA SOURCE.
Helpful Ratings and Distribution
PIPELINE MODEL OF SPARK'S MLLIB:
• Cylinders indicate DataFrames.
• The Tokenizer.transform() method splits the raw text documents into words, adding a new column with the words to the DataFrame.
• The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame.
• Logistic regression is the machine learning algorithm.
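The Tokenizer and HashingTF stages above can be illustrated in plain Python as a conceptual sketch of the hashing trick (this is not the actual MLlib implementation; the bucket count and hash function are assumptions):

```python
def tokenize(text):
    """Split a raw document into lowercase words (what Tokenizer does)."""
    return text.lower().split()

def hashing_tf(words, num_features=16):
    """Map words to a fixed-length term-frequency vector (what HashingTF does)."""
    vec = [0] * num_features
    for w in words:
        vec[hash(w) % num_features] += 1  # hash each word into a bucket
    return vec

words = tokenize("Great strings great tone")
vec = hashing_tf(words)
print(sum(vec))  # total term count equals the number of words
```

Logistic regression then classifies these fixed-length vectors; fixing the vector length up front is what lets every document, regardless of vocabulary, feed the same model.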
PROBLEMS FACED IN RUNNING THE SPARK MACHINE LEARNING PROGRAM
• Creating a DataFrame for the Amazon data.
• Downloaded sklearn, but PySpark has its own classification library, MLlib.
• The execution threw an error that it requires NumPy 1.4 or a higher version. It was solved by correcting a bug in the __init__.py program of MLlib.
SPARK EXECUTION: bin/pyspark SHELL COMMAND
Start Spark on top of Hadoop.
TRAINING SET AND TEST SET RESULTS