sliq
A Data Mining Paper Presentation on Classification
SLIQ: A Fast Scalable Classifier for Data Mining
Manish Mehta, Rakesh Agrawal, Jorma Rissanen
Presentation by: Sara Alaee, Zahra Taheri
Presented in: 5th International Conference on Extending Database Technology, Avignon, France, March 25–29, 1996, Proceedings
927 citations
Outline
• Introduction
• Motivation
• SLIQ Algorithm
  • Building tree
  • Pruning
  • Example
• Evaluation
• Conclusion
Introduction
• Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large training datasets
• Solution: build a scalable classifier, SLIQ
• SLIQ: Supervised Learning In Quest
Motivation
• Goal: improve the scalability of tree classifiers
• Previous proposals:
  • Sampling data at each node
  • Discretization of numerical attributes
  • Partitioning the input data and building a tree for each partition
• All of these methods lose accuracy!
• SLIQ: improve learning time without loss in accuracy!
Motivation (cont.)
Recall (ID3, C4.5, CART):
Motivation (cont.)
Non-scalable decision trees:
• Complexity in determining the best split for each attribute
  • Cost of evaluating splits for numerical attributes = cost of sorting values at each node
  • Cost of evaluating splits for categorical attributes = cost of searching for the best subset
• Pruning
  • Cross-validation: inapplicable for large datasets
  • Dividing the data into two parts (training and test set): problems with the sizes and distribution of the two parts
SLIQ – Algorithm
Key features:
• Tree classifier that handles both numerical and categorical attributes
• Pre-sorts numerical attributes before the tree is built
• Breadth-first growing strategy
• Goodness test: Gini index
• Inexpensive tree-pruning algorithm based on Minimum Description Length (MDL)
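The Gini index named above as the goodness test can be sketched as follows. This is a minimal illustration that assumes the standard definitions, gini(S) = 1 - sum of p_j^2 over the classes and a size-weighted average for a binary split; the function names `gini` and `gini_split` are illustrative and do not come from the slides.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left, right):
    """Size-weighted Gini index of a binary split into `left` and `right`."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
```

A lower `gini_split` value means a purer split; a split that separates the classes perfectly scores 0.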
SLIQ – Algorithm (cont.)
Pre-sorting:
• Eliminates the need to sort the data at each node
• Create a sorted list for each numerical attribute
• Create a class list
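The pre-sorting step can be sketched roughly as below. The record layout (a list of dicts), the `label_key` parameter, and the function name `presort` are assumptions made for illustration; the slides only state that one sorted list per numerical attribute and one class list are created.

```python
def presort(records, numeric_attrs, label_key="class"):
    """Build one sorted attribute list per numeric attribute plus a class list.

    records: list of dicts, one per training example (hypothetical layout).
    Returns (attribute_lists, class_list) where
      attribute_lists[a] is a list of (value, record_index) sorted by value,
      class_list[i] is [class_label, leaf_id]; every example starts at the
      root, which is given leaf id 0 here.
    """
    class_list = [[rec[label_key], 0] for rec in records]
    attribute_lists = {
        a: sorted((rec[a], i) for i, rec in enumerate(records))
        for a in numeric_attrs
    }
    return attribute_lists, class_list
```

Each attribute list is sorted once, up front; the record index stored next to each value is what links an attribute-list entry back to its class-list entry.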
SLIQ – Algorithm (cont.)
Example:
SLIQ – Algorithm (cont.)
Split evaluation:
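One way to picture split evaluation over a pre-sorted attribute list is the sketch below: a single scan updates per-leaf class histograms and scores each candidate threshold with the Gini index. It is a simplified illustration, not the slide's algorithm verbatim. The data layout matches the hypothetical `(value, record_index)` and `[class_label, leaf_id]` lists, the test has the form `value <= threshold`, and the threshold is taken to be an observed value rather than a midpoint.

```python
from collections import Counter, defaultdict

def gini(counts):
    """Gini index of a class histogram: 1 - sum_j p_j^2."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def evaluate_splits(attr_list, class_list):
    """Best 'value <= threshold' split per leaf for one numeric attribute.

    attr_list: (value, record_index) pairs, already sorted by value.
    class_list: [class_label, leaf_id] per record.
    Returns {leaf_id: (weighted_gini, threshold)}.
    """
    above = defaultdict(Counter)   # examples not yet passed by the scan
    for label, leaf in class_list:
        above[leaf][label] += 1
    below = defaultdict(Counter)   # examples already passed by the scan
    last_value, best = {}, {}
    for value, idx in attr_list:
        label, leaf = class_list[idx]
        # A new distinct value for this leaf closes a candidate split
        # at the previous value, so ties never straddle the threshold.
        if leaf in last_value and value > last_value[leaf]:
            n_below = sum(below[leaf].values())
            n_above = sum(above[leaf].values())
            n = n_below + n_above
            score = (n_below / n) * gini(below[leaf]) \
                  + (n_above / n) * gini(above[leaf])
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, last_value[leaf])
        # Move the current example from "above" to "below".
        below[leaf][label] += 1
        above[leaf][label] -= 1
        last_value[leaf] = value
    return best
```

Because the histograms are kept per leaf, one pass over the attribute list scores candidate splits for every current leaf at once, which fits the breadth-first growing strategy.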
SLIQ – Algorithm (cont.)
Example:
SLIQ – Algorithm (cont.)
Update class list:
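A rough sketch of the class-list update: once a split has been chosen for a leaf, the attribute list of the splitting attribute is scanned and each example's leaf reference is switched to the matching child. The `splits` dictionary layout and the function name are assumptions for illustration only.

```python
def update_class_list(attr_list, class_list, splits):
    """Reassign examples to child nodes after splitting on one numeric attribute.

    attr_list: (value, record_index) pairs for the splitting attribute.
    class_list: [class_label, leaf_id] per record, modified in place.
    splits: {leaf_id: (threshold, left_child_id, right_child_id)} for the
            leaves that are split on this attribute (hypothetical layout).
    """
    for value, idx in attr_list:
        leaf = class_list[idx][1]
        if leaf in splits:
            threshold, left_id, right_id = splits[leaf]
            # Apply the split test and point the example at its new leaf.
            class_list[idx][1] = left_id if value <= threshold else right_id
    return class_list
```

Only the class list changes; the sorted attribute lists stay as they are, so no re-sorting is needed at the next level.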
SLIQ – Algorithm (cont.)
Example:
SLIQ – Algorithm (cont.)
• When a node becomes pure, stop splitting it
• Condense the attribute lists by discarding the examples that belong to the pure node
• For categorical attributes, a cardinality threshold decides how the best split is found: below the threshold all possible subsets are evaluated, above it the best split is computed greedily
• SLIQ scales to large datasets with no loss in accuracy: the splits evaluated with or without pre-sorting are identical
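The two ways of finding a categorical split can be sketched as follows. This is an illustration under assumptions: `hist` maps each attribute value to a class histogram, `max_set_size` is a hypothetical name and default for the cardinality threshold, and the greedy variant grows the subset one value at a time until the Gini score stops improving.

```python
from collections import Counter
from itertools import combinations

def gini(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_score(hist, subset):
    """Weighted Gini of the test 'value in subset'."""
    left, right = Counter(), Counter()
    for value, counts in hist.items():
        (left if value in subset else right).update(counts)
    n_left, n_right = sum(left.values()), sum(right.values())
    n = n_left + n_right
    return (n_left / n) * gini(left) + (n_right / n) * gini(right)

def best_subset(hist, max_set_size=10):
    """Exhaustive search for small cardinality, greedy search otherwise."""
    values = sorted(hist)
    if len(values) < 2:
        return frozenset()  # nothing to split on
    if len(values) < max_set_size:
        # Evaluate every non-empty proper subset.
        candidates = (frozenset(c)
                      for r in range(1, len(values))
                      for c in combinations(values, r))
        return min(candidates, key=lambda s: split_score(hist, s))
    # Greedy: add the value that helps most, stop when nothing improves.
    chosen, best = frozenset(), float("inf")
    while len(chosen) < len(values) - 1:
        options = [chosen | {v} for v in values if v not in chosen]
        candidate = min(options, key=lambda s: split_score(hist, s))
        score = split_score(hist, candidate)
        if score >= best:
            break
        chosen, best = candidate, score
    return chosen
```

The exhaustive branch grows exponentially with the number of distinct values, which is why a threshold is needed at all.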
SLIQ - Pruning
• Post-pruning algorithm based on the Minimum Description Length principle
• Find a model that minimizes:
  Cost(M, D) = Cost(D|M) + Cost(M)
  • Cost(M): cost of the model
  • Cost(D|M): cost of encoding the data D if model M is given
SLIQ - Pruning
• Cost of the data: classification error
• Cost of the model:
  • Encoding the tree: number of bits
  • Encoding the splits:
    • numerical attribute: constant (empirically 1)
    • categorical attribute: depends on cardinality
• MDL pruning evaluates the code length at each node to decide whether to prune
SLIQ - Pruning
Pruning Algorithm:
C’(ti) : cost of encoding the children’s examples using the parent’s statistics.
SLIQ - Pruning
Three pruning strategies:
• Full: prune both children and convert the node into a leaf
• Partial: convert the node into a leaf, prune only the left child, prune only the right child, or leave the node intact
• Hybrid: apply the Full method first, then the Partial options (prune left, prune right, or leave intact)
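The Full strategy can be sketched as a bottom-up comparison of code lengths. The sketch below rests on assumptions that the slides do not spell out: one bit encodes whether a node is a leaf or an internal node (`NODE_BITS`), the data cost of a leaf is its number of training errors, and the split cost defaults to 1 as for a numerical attribute. The `Node` class is a hypothetical helper.

```python
NODE_BITS = 1.0  # assumed: one bit to mark a node as leaf or internal

class Node:
    """Hypothetical tree node used only for this sketch."""
    def __init__(self, errors, left=None, right=None, split_cost=1.0):
        self.errors = errors          # training errors if this node were a leaf
        self.left = left
        self.right = right
        self.split_cost = split_cost  # assumed cost of encoding the split test

def prune_full(node):
    """Prune bottom-up; return the code length of the resulting subtree."""
    leaf_cost = NODE_BITS + node.errors
    if node.left is None or node.right is None:
        return leaf_cost
    subtree_cost = (NODE_BITS + node.split_cost
                    + prune_full(node.left) + prune_full(node.right))
    if leaf_cost <= subtree_cost:
        # Encoding the node as a leaf is no more expensive: drop both children.
        node.left = node.right = None
        return leaf_cost
    return subtree_cost
```

The Partial and Hybrid strategies would add two more options at each node (keep only the left child, keep only the right child), each with its own code length; they are left out here to keep the sketch short.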
Evaluation
Metrics:
• Primary: classification accuracy
• Secondary: classification time and size of the decision tree
Setup:
• Small benchmarks: datasets from the STATLOG classification benchmark
• Synthetic databases: 9 attributes for each tuple, 2 classification functions
Evaluation
STATLOG benchmark:
Evaluation
Pruning strategy comparison:
Hybrid pruning is the preferred approach and is used for the experiments in this paper.
Evaluation
Small datasets:
• IND-Cart:
  • good accuracy
  • small trees
  • an order of magnitude slower than the others
• IND-C4:
  • accurate
  • fast
  • large decision trees
• SLIQ:
  • accurate
  • smaller trees than IND-C4
  • faster than IND-Cart
Evaluation
Scalability:
Conclusion
• SLIQ proves to be a fast, low-cost, and scalable classifier that builds accurate trees
• In the empirical tests, SLIQ stays accurate while producing smaller decision trees than the other algorithms
• Scalability? Memory becomes a problem as the number of attributes or the number of classes increases
THANK YOU!