Using Density-Based Clustering Approaches to Address Product Classification based on Kaggle Data
Denis Whelan & Jin Ming
December 3, 2017
▶ Background:
  ◦ The Otto Group is one of the largest e-commerce companies in the world.
  ◦ Arranging millions of products from a variety of different products and countries is a complex task that requires a sophisticated approach.
▶ Data:
  ◦ ~62,000 samples, 93 numeric features, 9 target labels
  ◦ Target labels are hidden but represent key categories such as electronics, fashion, etc.
▶ Kaggle Competition Purpose:
  ◦ Build a predictive model which can accurately classify products into the 9 appropriate categories (supervised learning)
Introduction: Product Classification Challenge
▶ Our Purpose:
  ◦ Apply density-based clustering methods to this product classification problem to cluster all products (unsupervised)
● CLARA (1990):
  ● The basic k-medoids method for large data applications
● DBSCAN (1996):
  ● The original density-based method
● NG-DBSCAN (2016):
  ● Modified DBSCAN method
Methods: CLARA, DBSCAN, NG-DBSCAN
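As a rough illustration of the k-medoids idea behind CLARA, the Python sketch below draws a few random samples, runs a simplified PAM on each sample, and keeps the medoids that score best on the full data. The function names, the random initialization inside `pam`, and the parameters `n_samples`, `sample_size`, and `seed` are assumptions made for this example; they are not the configuration used in the project.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Simplified PAM: random start, then greedy swaps while the cost drops."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids

def clara(points, k, n_samples=5, sample_size=20, seed=0):
    """Run PAM on several random samples; keep the medoids best on all points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # scored on the full data set
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids

# Toy data (hypothetical): two well-separated 5 x 6 grids of points.
points = [(x, y) for x in range(5) for y in range(6)]
points += [(100 + x, 100 + y) for x in range(5) for y in range(6)]
medoids = clara(points, k=2)
```

Because PAM only ever sees `sample_size` points at a time, the cost per sample does not grow with the size of the full data set, which is the property that makes this family of methods attractive for large data.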
● CLARA (Clustering Large Applications, 1990):
  ● Runs PAM (Partitioning Around Medoids) on random samples of the data rather than on the full data set
● DBSCAN (Ester et al., 1996):
  ● Uses density-reachable and density-connected points
  ● Groups data packed in high-density regions of the feature space
  ● Separates 'core points' from 'noise points'
  ● Recognizes clusters with arbitrary shapes
CLARA, DBSCAN
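The DBSCAN idea of core points, density-reachability, and noise can be sketched in a few lines of plain Python. This is a minimal, quadratic-time illustration with a brute-force neighbor search; the function names and the toy data are assumptions for the example, not the implementation used for the results in this deck.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # density-reachable border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

# Toy data (hypothetical): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
# labels == [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force `region_query` is what makes this sketch scale poorly: every query scans all points, so the whole run is quadratic in the number of samples.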
NG-DBSCAN (Lulli et al. 2016)
● Limitations of DBSCAN
  ● Scalability is limited
  ● Cannot handle arbitrary similarity measures; only uses Euclidean distance
  ● Requires choosing Eps and MinPts
● NG-DBSCAN:
  ● An approximate, distributed implementation of DBSCAN
  ● More efficient because of approximation
  ● Can represent item dissimilarity through any symmetric distance function
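To make "any symmetric distance function" concrete, here is one possible non-Euclidean example: Jaccard distance between sets of tags. It is only an illustration of the symmetry requirement, d(a, b) = d(b, a); the tag sets are hypothetical, and the Otto data itself consists of numeric features.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical product tags, used only to show the symmetry property.
x = {"shoes", "leather", "black"}
y = {"shoes", "canvas", "black"}
d_xy = jaccard_distance(x, y)  # 2 shared tags out of 4 distinct tags
d_yx = jaccard_distance(y, x)  # same value with the arguments swapped
```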
NG-DBSCAN
● Phase 1:
  ● Create the ε-graph
    i. Form the neighbor graph by connecting each node to k random other nodes
    ii. Edges are added to the ε-graph if the distance is less than ε
    iii. As soon as a node has M_max neighbors in the ε-graph, remove it from the neighbor graph
(Lulli et al. 2016)
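The three Phase 1 steps above can be sketched on a single machine as follows. This is a simplified reading, not the distributed algorithm itself: the fixed number of `rounds`, the rule of comparing each node with its neighbors' neighbors, and the rule of keeping the `k` closest candidates are assumptions added so the sketch has a concrete refinement loop. All names are hypothetical.

```python
import math
import random

def build_eps_graph(points, eps, k, m_max, rounds=10, seed=0, dist=math.dist):
    """Approximate eps-graph as an adjacency dict {node: set of nodes within eps}."""
    rng = random.Random(seed)
    n = len(points)
    # Step i: neighbor graph, each node linked to k random other nodes.
    neighbor = {
        u: set(rng.sample([v for v in range(n) if v != u], min(k, n - 1)))
        for u in range(n)
    }
    eps_graph = {u: set() for u in range(n)}
    active = set(range(n))  # nodes still present in the neighbor graph

    for _ in range(rounds):  # fixed round count is an assumption of this sketch
        for u in sorted(active):
            # Assumption: compare u with its neighbors and their neighbors.
            candidates = set(neighbor[u])
            for v in neighbor[u]:
                candidates |= neighbor[v]
            candidates.discard(u)
            # Step ii: add an eps-graph edge when the distance is below eps.
            for v in candidates:
                if dist(points[u], points[v]) < eps:
                    eps_graph[u].add(v)
                    eps_graph[v].add(u)
            # Assumption: keep the k closest candidates as u's new neighbors.
            ranked = sorted(candidates, key=lambda v: dist(points[u], points[v]))
            neighbor[u] = set(ranked[:k])
        # Step iii: drop nodes that already have m_max eps-neighbors.
        done = {u for u in active if len(eps_graph[u]) >= m_max}
        active -= done
        for u in neighbor:
            neighbor[u] = set() if u in done else neighbor[u] - done
    return eps_graph

# Toy data (hypothetical): two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
eps_graph = build_eps_graph(points, eps=1.5, k=3, m_max=5)
```

Because the neighbor graph starts random and is only refined for a limited number of rounds, the result is an approximation: every edge it contains is a true ε-neighbor pair, but some true pairs may be missing. The `dist` parameter is where a different symmetric distance function could be plugged in.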
NG-DBSCAN
● Phase 2:
  ● Discovering dense regions
    i. Coreness dissemination
    ii. Seed identification
    iii. Seed propagation
(Lulli et al. 2016)
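The slide only names the three Phase 2 steps, so the sketch below is one plausible single-machine reading rather than the published procedure. It assumes a node is core when it has at least `min_pts` neighbors in the ε-graph, picks the largest unlabeled core id as a seed (an arbitrary choice for this example), and spreads that seed's id through connected core nodes. The function name and the hand-written graph are hypothetical.

```python
from collections import deque

def discover_dense_regions(eps_graph, min_pts):
    """Map each node to the seed id of its cluster, or None for noise."""
    # Coreness: assumed here to mean at least min_pts eps-neighbors.
    core = {u for u, nbrs in eps_graph.items() if len(nbrs) >= min_pts}
    labels = {u: None for u in eps_graph}
    # Seed identification: largest core id not yet labeled (sketch's choice).
    for seed in sorted(core, reverse=True):
        if labels[seed] is not None:
            continue
        labels[seed] = seed
        queue = deque([seed])
        # Seed propagation: spread the seed id along eps-graph edges.
        while queue:
            u = queue.popleft()
            for v in eps_graph[u]:
                if labels[v] is None:
                    labels[v] = seed
                    if v in core:  # only core nodes pass the seed on
                        queue.append(v)
    return labels

# Hand-written eps-graph (hypothetical): nodes 0-3 are core for min_pts=3,
# node 4 is a border node attached to 3, and nodes 5-8 are not dense enough.
eps_graph = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3},
    5: {6, 7}, 6: {5, 7}, 7: {5, 6}, 8: set(),
}
labels = discover_dense_regions(eps_graph, min_pts=3)
# labels == {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: None, 6: None, 7: None, 8: None}
```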
Results: CLARA
▶ Runtime:
  ◦ 0.95 seconds
Results: CLARA & DBSCAN
Results: DBSCAN
▶ Runtime:
  ◦ 18.3 minutes
Questions?