SLIQ: A Fast Scalable Classifier for Data Mining Manish Mehta, Rakesh Agrawal, Jorma Rissanen Presentation by: Sara Alaee , Zahra Taheri

Upload: sara-alaee

Post on 20-Jun-2015

DESCRIPTION

A Data Mining Paper Presentation on Classification

TRANSCRIPT

Page 1: SLIQ

SLIQ: A Fast Scalable Classifier for Data Mining

Manish Mehta, Rakesh Agrawal, Jorma Rissanen

Presentation by: Sara Alaee, Zahra Taheri

Page 2: SLIQ

SLIQ: A Fast Scalable Classifier for Data Mining

Presented at: 5th International Conference on Extending Database Technology (EDBT), Avignon, France, March 25–29, 1996, Proceedings

927 citations

Page 3: SLIQ

Outline
• Introduction
• Motivation
• SLIQ Algorithm
  • Building tree
  • Pruning
  • Example
• Evaluation
• Conclusion

04/13/23

Page 4: SLIQ

Introduction

• Most classification algorithms are designed for memory-resident data, which limits their suitability for mining large training datasets
• Solution: build a scalable classifier, SLIQ
• SLIQ: Supervised Learning In Quest


Page 6: SLIQ

Motivation
• Improve the scalability of tree classifiers
• Previous proposals:
  • Sampling the data at each node
  • Discretizing numerical attributes
  • Partitioning the input data and building a tree for each partition
• All of these methods reduce classification accuracy
• SLIQ: improve learning time without loss in accuracy!

Page 7: SLIQ

Motivation (cont.)

Recall (ID3, C4.5, CART):


Page 8: SLIQ

Motivation (cont.)

Non-scalable decision trees:
• Complexity of determining the best split for each attribute
  • Cost of evaluating splits for numerical attributes = cost of sorting the values at each node
  • Cost of evaluating splits for categorical attributes = cost of searching for the best subset
• Pruning
  • Cross-validation: inapplicable for large datasets
  • Dividing the data into two parts (training and test set): problems with sizes and distribution


Page 10: SLIQ

SLIQ – Algorithm

Key features:
• Tree classifier that handles both numerical and categorical attributes
• Pre-sorts numerical attributes before the tree is built
• Breadth-first growing strategy
• Goodness test: Gini index
• Inexpensive tree pruning algorithm based on Minimum Description Length (MDL)

Page 11: SLIQ

SLIQ – Algorithm (cont.)

Pre-sorting:
• Eliminates the need to sort the data at each node
• Create a sorted list for each numerical attribute
• Create a class list
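
The pre-sorting step can be sketched roughly as follows. This is a minimal in-memory illustration, not the authors' implementation: the record layout, the function name `build_lists`, the root label "N1", and the sample data are all assumptions made for the example.

```python
def build_lists(records, numeric_attrs, class_key="class", root="N1"):
    """Build one sorted attribute list per numerical attribute plus the class list.

    records is a list of dicts; this layout and every name here are illustrative.
    """
    # Class list: [class label, current leaf] per record; all records start at the root.
    class_list = [[r[class_key], root] for r in records]
    # Attribute lists: (value, record index) pairs, sorted once by value.
    attr_lists = {
        a: sorted(((r[a], i) for i, r in enumerate(records)), key=lambda p: p[0])
        for a in numeric_attrs
    }
    return attr_lists, class_list


# Made-up training data, used only to show the shape of the lists.
records = [
    {"age": 30, "salary": 65, "class": "G"},
    {"age": 23, "salary": 15, "class": "B"},
    {"age": 40, "salary": 75, "class": "G"},
    {"age": 55, "salary": 40, "class": "B"},
]
attr_lists, class_list = build_lists(records, ["age", "salary"])
print(attr_lists["salary"])
print(class_list)
```

Each attribute list entry keeps the record index, so the class label and current leaf can be looked up in the class list without re-sorting at any node.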

Page 12: SLIQ

SLIQ – Algorithm (cont.)

Example:


Page 13: SLIQ

SLIQ – Algorithm (cont.)

Split evaluation:
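
A rough sketch of split evaluation on one pre-sorted numerical attribute, using the Gini index as the goodness test. It is simplified to a single node; SLIQ itself evaluates all current leaves in one pass over each attribute list. The function names, the midpoint threshold, and the sample data are assumptions for illustration.

```python
from collections import Counter


def gini(hist):
    """Gini index of a class histogram: 1 - sum of squared class frequencies."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())


def best_numeric_split(attr_list, labels):
    """Scan a sorted attribute list once and return (weighted gini, threshold).

    attr_list: (value, record index) pairs sorted by value.
    labels: class label per record index.
    The candidate test is 'value <= threshold' (midpoint is an assumption).
    """
    right = Counter(labels[rid] for _, rid in attr_list)  # not yet scanned
    left = Counter()                                      # already scanned
    n = len(attr_list)
    best = (float("inf"), None)
    for pos, (value, rid) in enumerate(attr_list[:-1]):
        left[labels[rid]] += 1
        right[labels[rid]] -= 1
        next_value = attr_list[pos + 1][0]
        if next_value == value:
            continue  # no threshold can separate equal values
        n_left = pos + 1
        score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


# Made-up data: salary list sorted by value, labels indexed by record.
labels = ["G", "B", "G", "B"]
salary_list = [(15, 1), (40, 3), (65, 0), (75, 2)]
result = best_numeric_split(salary_list, labels)
print(result)
```

Because the list is already sorted, the two histograms are updated incrementally and every candidate threshold is scored in a single pass.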


Page 14: SLIQ

SLIQ – Algorithm (cont.)

Example:


Page 15: SLIQ

SLIQ – Algorithm (cont.)

Update class list:
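
One way to picture the class list update, as a sketch only: after a split has been chosen for a leaf, the attribute list of the splitting attribute is traversed and each record's leaf reference in the class list is redirected to the matching child. The `splits` dictionary, the node labels, and the sample data are hypothetical.

```python
def update_class_list(attr_list, class_list, splits):
    """Redirect leaf references after splitting on one numerical attribute.

    attr_list: (value, record index) pairs for the splitting attribute.
    class_list: [class label, current leaf] per record, modified in place.
    splits: {leaf id: (threshold, left child id, right child id)} for the
            leaves that were split on this attribute (illustrative layout).
    """
    for value, rid in attr_list:
        leaf = class_list[rid][1]
        if leaf in splits:
            threshold, left_child, right_child = splits[leaf]
            # Records satisfying 'value <= threshold' go to the left child.
            class_list[rid][1] = left_child if value <= threshold else right_child


# Made-up data: the root N1 is split on salary <= 52.5 into N2 and N3.
class_list = [["G", "N1"], ["B", "N1"], ["G", "N1"], ["B", "N1"]]
salary_list = [(15, 1), (40, 3), (65, 0), (75, 2)]
update_class_list(salary_list, class_list, {"N1": (52.5, "N2", "N3")})
print(class_list)
```

Only the class list changes; the attribute lists stay in their pre-sorted order for the next level of the tree.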


Page 16: SLIQ

SLIQ – Algorithm (cont.)

Example:


Page 17: SLIQ

SLIQ – Algorithm (cont.)
• When a node becomes pure, stop splitting it
• Condense the attribute lists by discarding the examples that belong to pure nodes
• For categorical attributes: if the cardinality is below a threshold, all possible subsets are evaluated; for large-cardinality attributes, the best split is computed greedily
• SLIQ scales to large datasets with no loss in accuracy: the splits evaluated with or without pre-sorting are identical
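
For the greedy case on large-cardinality categorical attributes, a possible sketch is shown here. It starts from an empty subset and keeps adding the category value that most improves the Gini score until nothing improves. The per-value histograms, function names, and sample data are assumptions; the exact stopping rule and threshold used by SLIQ are not reproduced.

```python
from collections import Counter


def gini(hist):
    """Gini index of a class histogram."""
    n = sum(hist.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in hist.values())


def split_score(in_hist, out_hist):
    """Size-weighted Gini of the two partitions."""
    n_in, n_out = sum(in_hist.values()), sum(out_hist.values())
    return (n_in * gini(in_hist) + n_out * gini(out_hist)) / (n_in + n_out)


def greedy_subset(cat_hists):
    """cat_hists: {category value: Counter of class labels} (illustrative).

    Returns (subset, score) for the test 'value in subset'.
    """
    total = Counter()
    for h in cat_hists.values():
        total += h
    subset, inside = set(), Counter()
    best = gini(total)  # score of the unsplit node
    while True:
        candidate = None
        for value, h in cat_hists.items():
            if value in subset:
                continue
            trial_in = inside + h
            score = split_score(trial_in, total - trial_in)
            if score < best:
                best, candidate = score, value
        if candidate is None:
            return subset, best  # no single addition improves the split
        subset.add(candidate)
        inside += cat_hists[candidate]


# Made-up class histograms per car type.
cat_hists = {
    "family": Counter(G=3),
    "sports": Counter(B=3),
    "truck": Counter(B=2, G=1),
}
subset, score = greedy_subset(cat_hists)
print(subset, score)
```

The greedy search scores a number of subsets that grows quadratically with the cardinality, instead of exponentially as in the exhaustive search.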


Page 18: SLIQ

SLIQ - Pruning

Post-pruning algorithm based on the Minimum Description Length (MDL) principle

Find a model M that minimizes:

Cost(M, D) = Cost(D | M) + Cost(M)

• Cost(M): cost of encoding the model
• Cost(D | M): cost of encoding the data D given the model M

Page 19: SLIQ

SLIQ - Pruning
• Cost of the data: classification error
• Cost of the model:
  • Encoding the tree: number of bits
  • Encoding the splits:
    • numerical attribute: constant (empirically 1)
    • categorical attribute: depends on cardinality
• MDL pruning evaluates the code length at each node to decide on pruning
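
A simplified sketch of how such a code-length comparison could drive pruning. It only compares "turn the node into a leaf" against "keep both children". The `Node` class, the 1-bit cost per node, the default split cost, and the tie-breaking toward the leaf are all assumptions for illustration, not the exact encoding from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    errors: int                      # training errors if this node were a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    split_cost: float = 1.0          # bits for the split test (assumed value)


def prune(node, node_bits=1.0):
    """Return the code length of the (possibly pruned) subtree rooted at node.

    Cost as a leaf  = node_bits + errors.
    Cost as a split = node_bits + split_cost + cost(left) + cost(right).
    The cheaper option wins; ties go to the leaf (an assumption).
    """
    leaf_cost = node_bits + node.errors
    if node.left is None or node.right is None:
        return leaf_cost
    both_cost = (node_bits + node.split_cost
                 + prune(node.left, node_bits) + prune(node.right, node_bits))
    if leaf_cost <= both_cost:
        node.left = node.right = None  # convert the node into a leaf
        return leaf_cost
    return both_cost


# The split removes 5 errors, so it pays for its extra bits and is kept.
useful = Node(errors=5, left=Node(errors=0), right=Node(errors=0))
useful_cost = prune(useful)
# The split removes only 1 error, so the node is cheaper as a leaf.
weak = Node(errors=1, left=Node(errors=0), right=Node(errors=0))
weak_cost = prune(weak)
print(useful_cost, weak_cost)
```

Children are evaluated before their parent, so a subtree is already in its cheapest form when the parent decides whether to keep it.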


Page 20: SLIQ

SLIQ - Pruning

Pruning Algorithm:

C'(ti): cost of encoding the children's examples using the parent's statistics.

Page 21: SLIQ

SLIQ - Pruning

Three pruning strategies:
• Full: prune both children and convert the node into a leaf
• Partial: convert the node into a leaf, prune the left child, prune the right child, or leave the node intact
• Hybrid: apply the Full method first, then the Partial options (prune left, prune right, or leave intact)


Page 23: SLIQ

Evaluation

Metrics:
• Primary: classification accuracy
• Secondary: classification time and size of the decision tree

Setup:
• Small benchmarks: datasets from the STATLOG classification benchmark
• Synthetic databases: 9 attributes per tuple, 2 classification functions

Page 24: SLIQ

Evaluation

STATLOG benchmark:


Page 25: SLIQ

Evaluation

Pruning strategy comparison:

Hybrid pruning is the preferred approach, and is used for the experiments in this paper.

Page 26: SLIQ

Evaluation

Small datasets:
• IND-Cart:
  • good accuracy
  • small trees
  • an order of magnitude slower than the others
• IND-C4:
  • accurate
  • fast
  • large decision trees
• SLIQ:
  • accurate
  • smaller trees than IND-C4
  • faster than IND-Cart

Page 27: SLIQ

Evaluation

Scalability:



Page 29: SLIQ

Conclusion
• SLIQ proves to be a fast, low-cost, and scalable classifier that builds accurate trees
• In empirical tests, SLIQ achieves comparable accuracy while producing smaller decision trees than other algorithms
• Scalability? Memory becomes a problem as the number of attributes or the number of classes grows

Page 30: SLIQ

THANK YOU!
