Improving Amazon Data Quality
![Page 1: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/1.jpg)
Data Quality Analysis and Reporting
Data Record Science
Derek Pappas
![Page 2: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/2.jpg)
Detecting data quality problems
❖ amazon.com product data quality can be improved
❖ DRS software will detect the problems
![Page 3: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/3.jpg)
Filtering
There are so many opportunities to improve the user experience on amazon.com by identifying and fixing or filtering out the problematic data.
![Page 4: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/4.jpg)
Suggestions for Fixing Data
❖ Product matching
❖ Variant elimination
❖ Identify bad data
❖ Identify duplicate products with different names from the same vendor
❖ Identify missing data
❖ Suggest fixes for data
❖ Identify over/underpriced items at third-party stores (significantly overpriced items on amazon.com make Amazon look bad, in my opinion)
❖ Find and correct bad product classification
❖ Wrong product images
❖ Wrong specifications
❖ Google SEO violations
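As one illustration of the "identify missing data" item above, here is a minimal sketch of a per-record completeness check. The field names in `REQUIRED_FIELDS`, the sample records, and the brand name are hypothetical; they are not the DRS schema or real Amazon data.

```python
# Hypothetical required fields; a real catalog schema would differ.
REQUIRED_FIELDS = ("title", "brand", "price", "image_url")

def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [f for f in required if record.get(f) in (None, "", [])]

# Made-up sample records for illustration only.
records = [
    {"title": "Massage ball", "brand": "Acme", "price": 9.99, "image_url": "a.jpg"},
    {"title": "Massage ball, blue", "brand": "", "price": None, "image_url": "b.jpg"},
]

# Map each product title to the list of fields it is missing.
report = {r["title"]: missing_fields(r) for r in records}
```

A report like this could feed both the "identify missing data" and "suggest fixes" steps, since it names the exact fields that need attention.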
![Page 5: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/5.jpg)
Data Processing Pipeline
❖ Our pipeline is built on Hadoop MapReduce, which scales. It processed 200 million records last week and can process billions.
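To show the shape of a map/reduce data quality job, here is a small local simulation in plain Python. It is only a sketch: the `mapper`, `reducer`, and `run_local` names, the record layout, and the "missing price" check are assumptions for illustration, not the actual DRS pipeline code, and no Hadoop API is used.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    """Emit (vendor, 1) for each record that is missing a price."""
    if record.get("price") is None:
        yield record["vendor"], 1

def reducer(vendor, counts):
    """Sum the counts emitted for one vendor."""
    return vendor, sum(counts)

def run_local(records):
    """Simulate the shuffle phase by sorting mapper output by key."""
    pairs = sorted(kv for r in records for kv in mapper(r))
    return dict(
        reducer(key, (count for _, count in group))
        for key, group in groupby(pairs, key=itemgetter(0))
    )

# Made-up records for illustration only.
records = [
    {"vendor": "store-a", "price": None},
    {"vendor": "store-b", "price": 4.50},
    {"vendor": "store-a", "price": None},
]
bad_counts = run_local(records)
```

Because the mapper looks at one record at a time and the reducer only sees values for one key, the same logic can be distributed across many machines, which is what lets this style of job scale.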
![Page 6: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/6.jpg)
Detecting problems
The following are just a few examples of problems that the DRS pipeline can detect.
![Page 7: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/7.jpg)
Overpricing
See the attached image of the massage balls. We can group those product variants and we can identify the overpricing.
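A minimal sketch of how grouped variants might be checked for overpricing: compare each offer with the median price of its group and flag large outliers. The `group` key, the sellers, the prices, and the 2x threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

def flag_overpriced(offers, ratio=2.0):
    """Return sellers whose price exceeds `ratio` times the group median."""
    groups = defaultdict(list)
    for offer in offers:
        groups[offer["group"]].append(offer)

    flagged = []
    for items in groups.values():
        mid = median(o["price"] for o in items)
        flagged += [o["seller"] for o in items if o["price"] > ratio * mid]
    return flagged

# Made-up offers for one variant group.
offers = [
    {"group": "massage-ball", "seller": "s1", "price": 8.0},
    {"group": "massage-ball", "seller": "s2", "price": 9.0},
    {"group": "massage-ball", "seller": "s3", "price": 10.0},
    {"group": "massage-ball", "seller": "s4", "price": 45.0},
]
overpriced = flag_overpriced(offers)
```

The median is used instead of the mean so that a single extreme price does not shift the baseline it is being compared against.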
![Page 8: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/8.jpg)
Overpricing Example
![Page 9: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/9.jpg)
"Jamming" the Amazon Index
❖ The link below shows the same product over and over with different product names; these are not variants. The vendor is "jamming" the Amazon index so that their product shows up under different search terms. By comparison, Google algorithmically reduces the number of links it indexes for a site that looks "spammy", and it manually excludes a site from the Google index, or reduces the number of its indexed links, when the site uses black hat SEO tactics. See the image below.
❖ https://www.amazon.com/s/ref=sr_st_price-asc-rank?keywords=ab+straps+hanging&rh=i%3Aaps%2Ck%3Aab+straps+hanging&qid=1480277091&sort=price-asc-rank
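One way such "jamming" might be detected is to group listings by vendor and a product fingerprint, then flag groups that carry many distinct titles. The sketch below assumes an `image_hash` field serves as the fingerprint; that field, the sample listings, and the threshold of three titles are hypothetical.

```python
from collections import defaultdict

def find_jammed(listings, min_titles=3):
    """Return (vendor, image_hash) groups listed under many distinct titles."""
    titles = defaultdict(set)
    for listing in listings:
        key = (listing["vendor"], listing["image_hash"])
        titles[key].add(listing["title"].strip().lower())
    return {k: sorted(v) for k, v in titles.items() if len(v) >= min_titles}

# Made-up listings for illustration only.
listings = [
    {"vendor": "v1", "image_hash": "h1", "title": "Ab Straps Hanging"},
    {"vendor": "v1", "image_hash": "h1", "title": "Hanging Ab Slings"},
    {"vendor": "v1", "image_hash": "h1", "title": "Pull Up Bar Ab Straps"},
    {"vendor": "v2", "image_hash": "h2", "title": "Ab Straps"},
]
jammed = find_jammed(listings)
```

Genuine variants (size, color) would usually need a separate check, since they can also share an image; this sketch only surfaces candidates for review.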
![Page 10: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/10.jpg)
“Jamming” the Amazon Index
![Page 11: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/11.jpg)
Bad Classification
❖ In other instances on amazon.com I see misclassified items. In most cases we can already identify the classification problems.
![Page 12: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/12.jpg)
Bad Classification
There are biking and racing helmets mixed together.
https://www.amazon.com/s/ref=sr_nr_p_36_2?srs=2592626011&fst=as%3Aoff&rh=n%3A3375251%2Cn%3A%213375301%2Cn%3A706814011%2Cn%3A3403201%2Cn%3A6389202011%2Cn%3A3404571%2Ck%3ARACING%2Cp_36%3A1253557011&bbn=3404571&sort=price-asc-rank&keywords=RACING&ie=UTF8&qid=1480301345&rnid=386589011
![Page 13: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/13.jpg)
Wrong Product Image
❖ The search does not know who the manufacturer is: searching for "racing" within Giro returns Fox and Bell products at the top of the search results.
![Page 14: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/14.jpg)
Wrong Product Image (Socks)
![Page 15: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/15.jpg)
Bad Specifications
❖ Name-value pairs do not match
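A minimal sketch of a name-value consistency check: each known specification name gets a pattern its value should satisfy, and pairs that fail are reported. The `VALIDATORS` table, its patterns, and the sample specifications are assumptions for illustration, not rules taken from the DRS software.

```python
import re

# Hypothetical per-field patterns; a real rule set would be far larger.
VALIDATORS = {
    "weight": re.compile(r"^\d+(\.\d+)?\s*(oz|lb|g|kg)$", re.I),
    "color": re.compile(r"^[A-Za-z ]+$"),
}

def mismatched_specs(specs):
    """Return spec names whose values do not fit the expected pattern."""
    return [
        name
        for name, value in specs.items()
        if name in VALIDATORS and not VALIDATORS[name].match(value.strip())
    ]

# Made-up specification block: the weight field holds a color.
specs = {"weight": "Black", "color": "Navy Blue", "material": "Cotton"}
bad_specs = mismatched_specs(specs)
```

Fields without a validator (here `material`) are skipped rather than flagged, so the check only reports what it has a rule for.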
![Page 16: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/16.jpg)
Bad Specifications
![Page 17: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/17.jpg)
Mining Reviews
❖ Product quality issues (including AmazonBasics)
❖ Store customer service issues
❖ Graph ratings vs. number of reviews (a validity question: is one 5-star review better than fifty 4-star reviews?)
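The validity question above can be made concrete with a Bayesian average, which pulls ratings backed by few reviews toward a prior. This is one common technique, not necessarily what DRS uses; the prior mean of 3.5 and prior weight of 10 are arbitrary assumptions.

```python
def bayesian_average(mean_rating, n_reviews, prior_mean=3.5, prior_weight=10):
    """Blend the observed mean rating with a prior, weighted by review count."""
    total = prior_weight * prior_mean + n_reviews * mean_rating
    return total / (prior_weight + n_reviews)

# One 5-star review versus fifty 4-star reviews.
one_five_star = bayesian_average(5.0, 1)
fifty_four_star = bayesian_average(4.0, 50)
```

Under these assumed prior values, the product with fifty 4-star reviews scores higher than the product with a single 5-star review, because the single review barely moves the score away from the prior.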
![Page 18: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/18.jpg)
Sort by Price Does Not Include Shipping
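A small sketch of why this matters: sorting by item price alone can rank an offer first even when its total cost is higher. The offers, sellers, and amounts below are made up for illustration.

```python
# Made-up offers: s1 has the lower item price but charges for shipping.
offers = [
    {"seller": "s1", "price": 5.00, "shipping": 6.99},
    {"seller": "s2", "price": 9.50, "shipping": 0.00},
]

# Sorting by item price alone versus by total cost to the buyer.
by_price = sorted(offers, key=lambda o: o["price"])
by_total = sorted(offers, key=lambda o: o["price"] + o["shipping"])
```

Here `by_price` puts `s1` first, while `by_total` puts `s2` first, which is the order a buyer comparing final cost would expect.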
![Page 19: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/19.jpg)
Product Quality Issues
❖ https://www.amazon.com/AmazonBasics-Micro-USB-USB-Cable-2-Pack/dp/B00NH13O7K/ref=pd_sim_147_5?_encoding=UTF8&psc=1&refRID=7QRCXVWVQB9F4J9EVGV7
![Page 20: Improving amazon data quality](https://reader034.vdocument.in/reader034/viewer/2022051521/5a658cf77f8b9a23688b4d05/html5/thumbnails/20.jpg)
Reporting and Analysis
❖ Our data analysis and reporting can find the good/bad records and the good/bad/missing fields/images.
❖ Moreover, our software can often suggest fixes on the data analysis website.