TRANSCRIPT
DATA ANALYTICS ON AMAZON PRODUCT REVIEWS USING NOSQL HIVE AND MACHINE LEARNING WITH SPARK ON THE HADOOP FILE
SYSTEM.
PRESENTED BY:
DEVANG PATEL (2671221)
SONAL DESHMUKH (2622863)
INTRODUCTION:
• The significance of online shopping is growing day by day because purchases can be made with just one click.
• Amazon is one such world-renowned e-commerce website. Initially it was known for its huge collection of books, but it later expanded to other items.
• E-commerce is about making money, so customer satisfaction and opinion are an important part of e-commerce websites. This gave rise to "User Reviews".
• User reviews are customer opinions that help other customers make decisions about a product.
HOW TO GET THE AMAZON REVIEW DATASET?:
• We emailed the dataset maintainers to get access to the Amazon review dataset, and they provided a link from which we could download it.
• The data was in JSON file format.
SOFTWARE AND TOOLS:
REVIEW SAMPLE AND DATASET DESCRIPTION:
• Rating (1-5 stars)
• Review text
• Summary
• Number of people who found the review helpful
• Product ID
• Reviewer ID
CONTINUED…
• We got a JSON dataset that contains the following fields:
• reviewerID: ID of the Reviewer.
• asin: ID of the Product.
• reviewerName: Name of the reviewer.
• helpful: Helpfulness rating of the review.
• reviewText: Text of the review.
• overall: Rating of the Product.
• summary: Summary of the review.
• unixReviewTime: Time of the review (Unix time).
• reviewTime: Time of the review.
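A single record with these fields can be parsed like this (a sketch; the field values below are illustrative, not taken from the actual dataset):

```python
import json

# Illustrative review record (values are made up) using the fields listed above.
raw = ('{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
       '"reviewerName": "J. McDonald", "helpful": [2, 3], '
       '"reviewText": "I bought this for my husband.", '
       '"overall": 5.0, "summary": "Heavenly Highway Hymns", '
       '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}')

review = json.loads(raw)
print(review["asin"], review["overall"])  # product ID and star rating
```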
IMPLEMENTATION:
• The data was in JSON format and we were using Hive, so we either needed to convert the JSON to a CSV file or use a JSON SerDe; we chose the JSON SerDe.
• hive-serdes-1.0-SNAPSHOT.jar
• We downloaded the JSONSerDe jar file and copied it into the hive/lib folder.
CONTINUED…
• Uploaded the data of the Music_Instruments.JSON file to HDFS.
• After uploading it to HDFS, we loaded it into the Hive table.
CONTINUED…
• Table: MI_table
• Row format: JSONSerDe.class
• Location: HDFS path where JSON
file is stored.
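The table creation described above can be sketched roughly as follows (the column types, jar path, SerDe class name, and HDFS location are assumptions based on the listed fields, not taken from the slides):

```sql
ADD JAR /usr/local/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;

CREATE EXTERNAL TABLE MI_table (
  reviewerID     STRING,
  asin           STRING,
  reviewerName   STRING,
  helpful        ARRAY<INT>,
  reviewText     STRING,
  overall        DOUBLE,
  summary        STRING,
  unixReviewTime BIGINT,
  reviewTime     STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/hive/amazon/music_instruments';
```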
PROBLEMS FACED DURING TABLE CREATION IN HIVE
• SELECT * FROM MI_table;
• It fetched all 10,261 rows, but the problem was NULL values.
• The NULL values occurred because the key names were in capital case.
• Solution: SERDEPROPERTIES("case.insensitive"="false")
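This property can also be applied without recreating the table, for example (a sketch):

```sql
ALTER TABLE MI_table SET SERDEPROPERTIES ("case.insensitive" = "false");
```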
PROBLEM WITH THE DATA FORMAT
• The metadata file of Amazon reviews has key-value pairs in single quotes.
• We tried all the types of JSON SerDe available, but none worked.
• Python's json.dumps functionality converted the data into the correct JSON format.
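Single-quoted records are valid Python literals rather than strict JSON, so one common way to apply the json.dumps fix mentioned above is to evaluate each line and re-serialize it (a sketch; the sample line and helper name are illustrative):

```python
import ast
import json

def fix_line(line):
    """Convert a single-quoted, Python-literal-style record into strict JSON."""
    record = ast.literal_eval(line)  # safely parse the Python-style dict
    return json.dumps(record)        # re-serialize with double quotes

# Example metadata-style line with single quotes (values are made up):
bad = "{'asin': '0000013714', 'title': 'Hymn Book', 'price': 9.99}"
good = fix_line(bad)
print(good)  # now valid JSON with double-quoted keys
```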
SELECT * DESERIALIZES
CAN WE USE THIS TABLE AT THE PRODUCTION LEVEL?:
MOST REVIEWED PRODUCT:
AVERAGE RATING OF MUSICAL INSTRUMENTS:
AVERAGE RATINGS ON PRODUCTS
AVERAGE RATINGS FOR 5 DIFFERENT AMAZON PRODUCT CATEGORIES:
• Automotive (4.18)
• Cellphones (4.12)
• Lawn_Garden (4.18)
• Musical Instruments (4.48)
• Pet_Supplies (4.22)
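Each per-category average above can be computed with a query of this shape (a sketch; one table per category is an assumption):

```sql
-- Overall average rating for one category table:
SELECT AVG(overall) AS avg_rating
FROM MI_table;

-- Average rating per product within that category:
SELECT asin, AVG(overall) AS avg_rating
FROM MI_table
GROUP BY asin
ORDER BY avg_rating DESC;
```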
REVIEWS WERE POSITIVE OR NEGATIVE?:
(Steps 1-6 were shown as screenshots in the original slides.)
WHICH YEAR PRODUCTS WERE REVIEWED MOST?
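The year-wise review counts can be derived from the unixReviewTime field with a query of this shape (a sketch):

```sql
SELECT YEAR(FROM_UNIXTIME(unixReviewTime)) AS review_year,
       COUNT(*) AS num_reviews
FROM MI_table
GROUP BY YEAR(FROM_UNIXTIME(unixReviewTime))
ORDER BY num_reviews DESC;
```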
COST OF MOST REVIEWED (TOP 5) PRODUCTS
HDFS IN THE BROWSER: http://localhost:50070
TRUST AND HELPFULNESS IN AMAZON PRODUCT REVIEWS
• The 'helpful' column contains values that look like '[56, 63]'.
• The first value represents the number of helpful votes; the second represents the overall votes.
• We computed the helpfulness percentage and also a binary column that states whether the review is helpful or not.
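The parsing of the '[56, 63]' values into a percentage and a binary flag can be sketched in Python (the 50% threshold for the binary column is an assumption for illustration):

```python
import ast

def helpfulness(raw):
    """Parse a "[helpful, total]" string into (percentage, is_helpful).

    The 50% threshold for the binary flag is an assumption for illustration.
    """
    helpful_votes, total_votes = ast.literal_eval(raw)
    if total_votes == 0:
        return 0.0, 0  # no votes: treat the review as not helpful
    pct = 100.0 * helpful_votes / total_votes
    return pct, 1 if pct >= 50.0 else 0

print(helpfulness("[56, 63]"))  # about 88.9% helpful, flagged as helpful
```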
OUTPUT FILE AND TABLEAU DATA SOURCE.
Helpful Ratings and Distribution
PIPELINE MODEL OF SPARK'S MLLIB:
• Cylinders indicate DataFrames.
• The Tokenizer.transform() method splits the raw text documents into words, adding a new column with the words to the DataFrame.
• The HashingTF.transform() method converts the words column into feature vectors, adding a new column with those vectors to the DataFrame.
• Logistic regression is the machine learning algorithm.
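The Tokenizer and HashingTF stages above can be illustrated in plain Python as a conceptual sketch of the hashing trick (this is not the actual MLlib implementation; the bucket count and hash function are assumptions):

```python
def tokenize(text):
    """Split a raw document into lowercase words (what Tokenizer does)."""
    return text.lower().split()

def hashing_tf(words, num_features=16):
    """Map words to a fixed-length term-frequency vector (what HashingTF does)."""
    vec = [0] * num_features
    for w in words:
        vec[hash(w) % num_features] += 1  # hash each word into a bucket
    return vec

words = tokenize("Great strings great tone")
vec = hashing_tf(words)
print(sum(vec))  # total term count equals the number of words
```

Logistic regression then classifies these fixed-length vectors; fixing the vector length up front is what lets every document, regardless of vocabulary, feed the same model.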
PROBLEMS FACED IN RUNNING THE SPARK MACHINE LEARNING PROGRAM
• Creating a DataFrame for the Amazon data.
• Downloaded sklearn, but PySpark has its own classification library, MLlib.
• The execution threw an error that it requires NumPy 1.4 or a higher version. It was solved by correcting a bug in the __init__.py program of MLlib.
SPARK EXECUTION: bin/pyspark SHELL COMMAND
Start Spark on top of Hadoop.
TRAINING SET AND TEST SET RESULTS