IE 5300 – DM Project Presentation
TRANSCRIPT
![Page 1: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/1.jpg)
CLASSIFICATION PROJECT FOR THE RETAIL INDUSTRY
Group Members: Raghavendra Kumar Punugu, Mewan Wijemanne
IE 5300 – Data Mining, Professor Shouyi Wang
![Page 2: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/2.jpg)
Introduction
- Competition data set based on the retail industry
- Variables are unknown (anonymized)
- Goal: determine the optimal classification method
- Objective: achieve the highest accuracy
![Page 3: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/3.jpg)
DATASET
- Retail industry
- Classes: loyal / not loyal
- Behavioral variables (purchase)
![Page 4: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/4.jpg)
Data Classification Approach
DATA → DATA EXPLORATION → DATA QUALITY → DATA PREPROCESSING → CLASSIFICATION
![Page 5: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/5.jpg)
DATA
DATA EXPLORATION DATA QUALITY DATA PREPROCESSING CLASSIFICATION
- Size: Train (25676 × 255), Test (11042 × 255)
- Attributes: 255 (anonymous)
- Attribute types: numerical (continuous), categorical (ordinal)
- Target variable: Loyal (1), Not loyal (0)
![Page 6: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/6.jpg)
EXPLORATION
Summary statistics for two sample attributes:

| Statistic | Sample attribute 1 | Sample attribute 2 |
|---|---|---|
| Mean | 14.5 | 18.4 |
| Range | 40 | 353 |
| Standard deviation | 12.09 | 38.49 |
| Mode | 0 | 0 |
| Median | 11 | 4 |

Exploration is the investigation of the data for a better understanding of its specific characteristics. It:
- Helps in selecting preprocessing tools
- Helps in identifying outliers, noise, or missing values
- Helps in better data visualization
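The summary statistics the slide reports can be sketched in a few lines; this is a minimal Python/NumPy version with hypothetical attribute values (the competition data itself is not shown here):

```python
import numpy as np
from collections import Counter

def summarize(column):
    """Summary statistics used in the exploration step."""
    column = np.asarray(column, dtype=float)
    return {
        "mean": column.mean(),
        "range": column.max() - column.min(),
        "std": column.std(ddof=1),  # sample standard deviation
        "mode": Counter(column.tolist()).most_common(1)[0][0],
        "median": float(np.median(column)),
    }

# Hypothetical attribute values, not taken from the competition data
print(summarize([0, 0, 4, 11, 25, 40]))
```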
![Page 7: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/7.jpg)
DATA QUALITY
- Missing values: Based on the analysis conducted, under 2% of the data was missing or non-applicable. To allow classification, these values were replaced with 0.
- Noise: Since the origin of the data set is unknown, it is difficult to determine whether noise is present.
- Outliers: Very few outliers were found based on boxplots. As future work, formal outlier tests such as the Bonferroni outlier test could be applied to detect any that exist.
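The two quality fixes described above, zero-imputation of missing values and boxplot-style outlier flagging, can be sketched as follows. This is an illustrative NumPy version using the standard 1.5 × IQR rule (the rule behind boxplot whiskers), not the project's actual code:

```python
import numpy as np

def clean_and_flag(X):
    """Replace missing values with 0 (as on the slide) and flag
    boxplot-style outliers per column via the 1.5 * IQR rule."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    X = np.where(missing, 0.0, X)  # impute missing entries with 0
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    outliers = (X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)
    return X, missing, outliers

# Hypothetical 4x2 sample with one missing value and one extreme value
X = np.array([[1.0, 2.0], [np.nan, 3.0], [2.0, 100.0], [1.5, 2.5]])
X_clean, missing, outliers = clean_and_flag(X)
```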
![Page 8: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/8.jpg)
DATA PREPROCESSING
- Dimensionality reduction: MRMR (minimum redundancy maximum relevance), principal component analysis
- Classification methods to follow: ensemble, linear discriminant analysis, support vector machine
![Page 9: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/9.jpg)
Dimensionality reduction
- Reduces a high-dimensional space to a lower-dimensional space
- Reduces time and storage requirements
- Makes the data easier to visualize (e.g., when reduced from 3D to 2D)

Principal component analysis:
- SVD: eigenvalues of the covariance matrix
- Variance retained is 99.99% [161 components]

Minimum redundancy maximum relevance:
- Statistical dependency between selected features and the classification variable
- Features ranked in decreasing order of importance [255]
![Page 10: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/10.jpg)
Principal component analysis and MRMR
`[COEFF, SCORE, latent] = princomp(X)`
- COEFF: for an N×P input matrix, returns a P×P matrix in which each column is one principal component; the columns are ordered by decreasing variance.
- latent: the variance accounted for by each component; computed via singular value decomposition (default), eigenvalue decomposition of the covariance matrix, or alternating least squares (for missing values in the data).
- SCORE: the representation of X in principal component space.
- MRMR: returns the features in decreasing order of importance: 11, 43, 33, 103, 247, 93, 96, 13, 244, 212, …
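The slides use MATLAB's `princomp` and an MRMR implementation; a rough Python equivalent is sketched below with scikit-learn. Here `components_` plays the role of COEFF, the transformed data plays the role of SCORE, and `explained_variance_` plays the role of latent. As a simplified stand-in for MRMR, the sketch ranks features by mutual information with the label only, which captures the relevance term but not MRMR's redundancy penalty; the toy data (10 features, label driven by feature 3) is entirely hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))  # stand-in for the 255 anonymous attributes
y = (X[:, 3] + 0.1 * rng.normal(size=200) > 0).astype(int)

# PCA: components_ ~ COEFF, fit_transform(X) ~ SCORE, explained_variance_ ~ latent;
# a float n_components keeps 99.99% of the variance, as on the slide.
pca = PCA(n_components=0.9999)
scores = pca.fit_transform(X)

# Relevance-only ranking in decreasing order of importance; true MRMR would
# also penalize redundancy among already-selected features.
relevance = mutual_info_classif(X, y, random_state=0)
ranking = np.argsort(relevance)[::-1]
print(ranking[:3])
```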
![Page 11: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/11.jpg)
DATA CLASSIFICATION
Methods:
- Ensemble
- Linear discriminant analysis
- Support vector machine
![Page 12: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/12.jpg)
Linear Discriminant Analysis (LDA)
- Developed by R. A. Fisher in 1936
- Comparable to more complex methods
- Uses Fisher's score function
- Searches for the linear combination of features that best separates the classes
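A minimal LDA sketch in Python with scikit-learn, using two hypothetical Gaussian classes rather than the competition data; the classifier finds the linear combination of features that best separates "loyal" from "not loyal":

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Two hypothetical, well-separated Gaussian classes (not the competition data)
rng = np.random.default_rng(1)
loyal = rng.normal(loc=2.0, size=(100, 5))
not_loyal = rng.normal(loc=0.0, size=(100, 5))
X = np.vstack([loyal, not_loyal])
y = np.array([1] * 100 + [0] * 100)

# LDA fits a linear decision boundary via Fisher's criterion
lda = LinearDiscriminantAnalysis().fit(X, y)
accuracy = lda.score(X, y)
```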
![Page 13: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/13.jpg)
Support Vector Machine (SVM)
- Creates hyperplanes to define decision boundaries
- Selects the hyperplane that gives the largest minimum distance (margin) to the training data
- Multiple kernel functions can be used with SVM
- Kernels used: radial basis function (RBF), linear, polynomial (orders 2, 8, 10)
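The three kernel families above can be tried side by side; this is an illustrative scikit-learn sketch on synthetic data (the slides run the same kernels on the retail set in MATLAB):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical toy data standing in for the retail set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The three kernel families tried on the slides; degree sets the
# polynomial order (the slides try orders 2, 8, and 10)
kernels = [
    SVC(kernel="rbf"),
    SVC(kernel="linear"),
    SVC(kernel="poly", degree=2),
]
train_acc = {clf.kernel: clf.fit(X, y).score(X, y) for clf in kernels}
```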
![Page 14: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/14.jpg)
Ensemble Classifier
- Itself a supervised learning algorithm that can be trained
- Combines a set of trained weak-learner models with the data set used
- Running multiple learning cycles adjusts the algorithm to classify better
- Multiple types of ensemble techniques exist
- Ensemble types used: AdaBoost, LogitBoost
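The slides fit these ensembles with MATLAB's fitted-ensemble tooling; scikit-learn offers AdaBoost but no LogitBoost, so this hedged sketch shows AdaBoost only, on synthetic stand-in data. Each learning cycle reweights the training set so the next weak learner (a depth-1 decision stump, the default) focuses on earlier mistakes:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical toy data standing in for the retail set
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 300 learning cycles, matching the best LogitBoost setting on the slides;
# each cycle adds one weak learner trained on reweighted data
ens = AdaBoostClassifier(n_estimators=300).fit(X, y)
accuracy = ens.score(X, y)
```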
![Page 15: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/15.jpg)
Visualization of Output
Figure: error vs. number of learning cycles (ensemble)
![Page 16: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/16.jpg)
Computational Results

| Variable Selection Method | Classification Method | Function | Order / Cycles | Variable Size | Accuracy |
|---|---|---|---|---|---|
| MRMR | Linear Discriminant Analysis | – | – | 150 | 0.662751 |
| MRMR | Linear Discriminant Analysis | – | – | 130 | 0.663422 |
| MRMR | Linear Discriminant Analysis | – | – | 100 | 0.665319 |
| MRMR | Linear Discriminant Analysis | – | – | 90 | 0.664914 |
| MRMR | Linear Discriminant Analysis | – | – | 95 | 0.665929 |
| MRMR | Linear Discriminant Analysis | – | – | 98 | 0.666967 |
| MRMR | Support Vector Machine | RBF | – | 98 | 0.543112 |
| MRMR | Support Vector Machine | Polynomial | 10 | 98 | 0.548724 |
| MRMR | Support Vector Machine | Polynomial | 2 | 98 | 0.465846 |
| MRMR | Support Vector Machine | Polynomial | 8 | 98 | 0.503412 |
| MRMR | Support Vector Machine | Linear | – | 98 | 0.534284 |
| MRMR | Fitted Ensemble | AdaBoost | 100 | 98 | 0.669126 |
| MRMR | Fitted Ensemble | AdaBoost | 80 | 98 | 0.669867 |
| MRMR | Fitted Ensemble | AdaBoost | 50 | 98 | 0.671196 |
| MRMR | Fitted Ensemble | AdaBoost | 40 | 98 | 0.671196 |
| MRMR | Fitted Ensemble | AdaBoost | 43 | 98 | 0.671498 |
| MRMR | Fitted Ensemble | LogitBoost | 43 | 98 | 0.667572 |
| MRMR | Fitted Ensemble | LogitBoost | 100 | 98 | 0.669082 |
| MRMR | Fitted Ensemble | LogitBoost | 200 | 98 | 0.670894 |
| MRMR | Fitted Ensemble | LogitBoost | 300 | 98 | 0.674819 |
| PCA | Linear Discriminant Analysis | – | – | 161 | 0.587564 |
![Page 17: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/17.jpg)
Conclusion
- Lowest accuracy achieved: 46.5846%
- Worst combination: MRMR variable selection, support vector machine, polynomial kernel (order 2), variable size 98
- Current rank: 32nd out of 215
- Highest accuracy achieved: 67.4819%
- Best combination: MRMR variable selection, fitted ensemble (LogitBoost, 300 learning cycles), variable size 98
![Page 18: IE 5300- DM Project Presentation](https://reader033.vdocument.in/reader033/viewer/2022042610/58ac0bf31a28ab33178b4ae3/html5/thumbnails/18.jpg)
Thank you!