
SLIQ: A Fast Scalable Classifier for Data Mining

Manish Mehta, Rakesh Agrawal, Jorma Rissanen

Presentation by: Sara Alaee , Zahra Taheri


Presented in: 5th International Conference on Extending Database Technology, Avignon, France, March 25–29, 1996, Proceedings

927 citations


Outline

• Introduction
• Motivation
• SLIQ Algorithm
    • Building tree
    • Pruning
    • Example
• Evaluation
• Conclusion

04/13/23

Introduction

• Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large training datasets
• Solution: build a scalable classifier, SLIQ
• SLIQ: Supervised Learning In Quest





Motivation

• Improve the scalability of tree classifiers
• Previous proposals:
    • Sampling data at each node
    • Discretization of numerical attributes
    • Partitioning the input data and building a tree for each partition
• All of these methods reduce classification accuracy!
• SLIQ: improve learning time without loss in accuracy!


Motivation (cont.)

Recall (ID3, C4.5, CART):


Motivation (cont.)

Non-scalable decision trees:
• Complexity in determining the best split for each attribute
    • Cost of evaluating splits for numerical attributes = cost of sorting values at each node
    • Cost of evaluating splits for categorical attributes = cost of searching for the best subset
• Pruning
    • Cross-validation: inapplicable for large datasets
    • Dividing the data into two parts, a training set and a test set: problems with choosing their sizes and distributions




SLIQ – Algorithm

Key features:
• Tree classifier that handles both numerical and categorical attributes
• Pre-sorts numerical attributes before the tree is built
• Breadth-first growing strategy
• Goodness test: Gini index
• Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)
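The Gini index goodness test can be sketched in a few lines. For a set with class frequencies p_j, gini = 1 - sum(p_j^2), and a binary split is scored by the size-weighted average of its two sides (lower is better). The function names below are illustrative and do not come from the paper.

```python
from collections import Counter

def gini(labels):
    """Gini index of a collection of class labels: 1 - sum(p_j ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left_labels, right_labels):
    """Size-weighted Gini index of a binary split (lower is better)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + \
           (len(right_labels) / n) * gini(right_labels)
```

A pure set scores 0, and a perfectly mixed two-class set scores 0.5.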


SLIQ – Algorithm (cont.)

Pre-sorting:
• Eliminates the need to sort the data at each node
• Create a sorted list for each numerical attribute
• Create a class list
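A minimal sketch of the pre-sorting step, assuming the training data fits in a list of dictionaries. The layout (attribute-list entries as `(value, record_index)`, class-list entries as `[class_label, leaf]`), the root name `"N1"`, and the toy records are assumptions made for illustration, not the paper's on-disk format.

```python
def presort(records, numeric_attrs, label_key):
    """Build one sorted attribute list per numeric attribute plus a class list.

    Attribute-list entry: (value, record_index), sorted by value.
    Class-list entry:     [class_label, leaf]; every record starts at root "N1".
    """
    class_list = [[rec[label_key], "N1"] for rec in records]
    attr_lists = {
        attr: sorted((rec[attr], i) for i, rec in enumerate(records))
        for attr in numeric_attrs
    }
    return attr_lists, class_list

# Made-up toy data, for illustration only.
records = [
    {"age": 30, "salary": 65, "cls": "G"},
    {"age": 23, "salary": 15, "cls": "B"},
    {"age": 40, "salary": 75, "cls": "G"},
    {"age": 55, "salary": 40, "cls": "B"},
]
attr_lists, class_list = presort(records, ["age", "salary"], "cls")
```

Sorting happens once, up front. The record index stored next to each value is what links an attribute-list entry back to its class label and current leaf.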


Example:



Split evaluation:
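A sketch of how split evaluation over one pre-sorted attribute list could look: a single scan updates a per-leaf class histogram and scores the test "value <= v" with the Gini index after each distinct value. The data layout and names are assumptions, and the paper's exact choice of split point (for example, a midpoint between adjacent values) may differ from the plain "value <= v" used here.

```python
from collections import Counter, defaultdict
from itertools import groupby

def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_splits(attr_list, class_list):
    """One scan of a pre-sorted attribute list; returns {leaf: (gini, threshold)}.

    attr_list:  [(value, record_index), ...] sorted by value.
    class_list: [[class_label, leaf], ...] indexed by record_index.
    """
    # Every leaf starts with all of its records on the "above" side.
    above = defaultdict(Counter)
    for label, leaf in class_list:
        above[leaf][label] += 1
    below = defaultdict(Counter)
    best = {}
    # Group equal values so "value <= v" is scored only after all ties are seen.
    for value, group in groupby(attr_list, key=lambda entry: entry[0]):
        touched = set()
        for _, idx in group:
            label, leaf = class_list[idx]
            below[leaf][label] += 1
            above[leaf][label] -= 1
            touched.add(leaf)
        for leaf in touched:
            n_below = sum(below[leaf].values())
            n_above = sum(above[leaf].values())
            if n_above == 0:
                continue  # everything is on one side, so nothing is split
            n = n_below + n_above
            score = ((n_below / n) * gini_from_counts(below[leaf])
                     + (n_above / n) * gini_from_counts(above[leaf]))
            if leaf not in best or score < best[leaf][0]:
                best[leaf] = (score, value)
    return best
```

Because the histograms are kept per leaf, one pass over the list scores candidate splits for all leaves of the current tree level at once, which fits the breadth-first growing strategy.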



Example:



Update class list:
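A sketch of the class-list update, assuming each split leaf has chosen a test of the form "value <= threshold" on the attribute being scanned. The `splits` mapping and the child names are hypothetical.

```python
def update_class_list(attr_list, class_list, splits):
    """Reassign records to child leaves after splitting on one attribute.

    attr_list:  [(value, record_index), ...] for the splitting attribute.
    class_list: [[class_label, leaf], ...]; the leaf field is updated in place.
    splits:     {leaf: (threshold, left_child, right_child)}.
    """
    for value, idx in attr_list:
        leaf = class_list[idx][1]
        if leaf in splits:
            threshold, left, right = splits[leaf]
            class_list[idx][1] = left if value <= threshold else right
    return class_list
```

Only the leaf reference in the class list changes. The attribute lists themselves are never reordered or rewritten.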



Example:


SLIQ – Algorithm (cont.)

• When a node becomes pure, stop splitting it
• Condense the attribute lists by discarding examples that belong to pure nodes
• For categorical attributes, a cardinality threshold decides how the best split is found: all possible subsets are evaluated for small cardinalities, and a greedy search is used for large ones
• SLIQ scales to large datasets with no loss in accuracy: the splits evaluated with or without pre-sorting are identical
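The greedy search for large-cardinality categorical attributes can be sketched as follows: start from an empty subset and keep adding the category that most improves the Gini score of the test "attribute in subset", stopping when no addition helps. The input layout and function names are assumptions for illustration, and the details of the paper's procedure may differ.

```python
from collections import Counter

def gini(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def greedy_subset(value_counts):
    """value_counts: {category: Counter({class_label: count})}.

    Returns (subset, score) for the test "attribute in subset".
    """
    total = Counter()
    for counts in value_counts.values():
        total.update(counts)
    n = sum(total.values())

    def split_score(inside):
        outside = total - inside  # Counter subtraction drops zero counts
        n_in = sum(inside.values())
        return (n_in / n) * gini(inside) + ((n - n_in) / n) * gini(outside)

    subset, inside = set(), Counter()
    best = gini(total)  # score of not splitting at all
    # Stop before the subset would swallow every category.
    while len(subset) + 1 < len(value_counts):
        score, value = min(
            ((split_score(inside + value_counts[v]), v)
             for v in value_counts if v not in subset),
            key=lambda pair: pair[0],
        )
        if score >= best:
            break  # no remaining category improves the split
        subset.add(value)
        inside = inside + value_counts[value]
        best = score
    return subset, best
```

Exhaustive evaluation has to consider a number of subsets that grows exponentially with the cardinality. The greedy loop needs only a quadratic number of score evaluations, at the price of possibly missing the optimal subset.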


SLIQ - Pruning

• Post-pruning algorithm based on the Minimum Description Length principle
• Find a model that minimizes:
    Cost(M, D) = Cost(D|M) + Cost(M)
    • Cost(M): cost of the model
    • Cost(D|M): cost of encoding the data D if model M is given


SLIQ - Pruning

• Cost of the data: classification error
• Cost of the model:
    • Encoding the tree: number of bits
    • Encoding the splits:
        • numerical attribute: constant (empirically 1)
        • categorical attribute: depends on cardinality
• MDL pruning evaluates the code length at each node to decide on pruning


SLIQ - Pruning

Pruning Algorithm:

C’(ti) : cost of encoding the children’s examples using the parent’s statistics.


SLIQ - Pruning

Three pruning strategies:
• Full: prune both children and convert the node to a leaf
• Partial: convert the node to a leaf, prune the left child, prune the right child, or leave the node intact
• Hybrid: apply the Full method first, then the Partial options (prune left, prune right, or leave intact)
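The Full strategy can be sketched as a bottom-up cost comparison: a node is collapsed to a leaf when encoding it as a leaf (node bits plus its classification errors) costs no more than encoding the split plus both subtrees. The `Node` class, the `node_bits` and `split_bits` values, and the tie-breaking toward the leaf are assumptions for illustration. The Partial and Hybrid options are not covered here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    errors: int                      # examples misclassified if this node is a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def prune_full(node, node_bits=1.0, split_bits=1.0):
    """Bottom-up full pruning; returns the cost of the (possibly pruned) subtree.

    node_bits and split_bits are assumed encoding costs, not values from the paper.
    """
    leaf_cost = node_bits + node.errors
    if node.left is None or node.right is None:
        return leaf_cost
    both_cost = (node_bits + split_bits
                 + prune_full(node.left, node_bits, split_bits)
                 + prune_full(node.right, node_bits, split_bits))
    if leaf_cost <= both_cost:
        node.left = node.right = None  # collapse the subtree into a leaf
        return leaf_cost
    return both_cost
```

A split that removes only one error is pruned in this sketch, because the bits spent on the extra nodes and the test outweigh the saved error. A split that removes many errors is kept.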


Evaluation

Metrics:
• Primary: classification accuracy
• Secondary: classification time and size of the decision tree

Setup:
• Small benchmarks: datasets from the STATLOG classification benchmark
• Synthetic databases: 9 attributes per tuple, 2 classification functions

Evaluation

STATLOG benchmark:


Evaluation

Pruning strategy comparison:

Hybrid pruning is the preferred approach, and is used for the experiments in this paper.

Evaluation

Small datasets:
• IND-Cart:
    • good accuracy
    • small trees
    • an order of magnitude slower than the others
• IND-C4:
    • accurate
    • fast
    • large decision trees
• SLIQ:
    • accurate
    • smaller trees than IND-C4
    • faster than IND-Cart


Evaluation

Scalability:


Conclusion

• SLIQ proves to be a fast, low-cost, and scalable classifier that builds accurate trees
• In empirical tests, SLIQ achieves comparable accuracy while producing smaller decision trees than other algorithms
• Scalability??? Memory problems arise when the number of attributes or the number of classes increases


THANK YOU!
