SPRINT: A Scalable Parallel Classifier for Data Mining

CHAN Siu Lung, Daniel
CHAN Wai Kin, Ken
CHOW Chin Hung, Victor
KOON Ping Yin, Bob


Page 1

CHAN Siu Lung, Daniel
CHAN Wai Kin, Ken
CHOW Chin Hung, Victor
KOON Ping Yin, Bob

SPRINT: A Scalable Parallel Classifier for Data Mining

Page 2

Overview

• Classification
• Decision Tree
• SPRINT
• Serial Algorithm
• Parallelizing Classification
• Performance Evaluation
• Conclusion
• Reference

Page 3

Classification

• Objective
  – To build a model of the classifying attribute based upon the other attributes
• Classification is given
  – A set of example records, called a training set, where each record consists of several attributes
• Attributes are either
  1. Continuous (e.g. Age: 23, 24, 25)
  2. Categorical (e.g. Gender: Female, Male)
• One of the attributes, called the classifying attribute, indicates the class to which each record belongs

Page 4

Decision Trees

• Advantages:
  – Good for data mining
  – Can be constructed fast
  – Models are simple
  – Easy to understand
  – Can be easily converted into SQL statements
  – Can obtain better accuracy than other classification methods

Page 5

Example – Car Insurance

(Figure: sample training set. Continuous attribute: Age. Categorical attribute: CarType.)
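For concreteness, the following Python records give a six-record car-insurance training set that is consistent with the attribute lists and count matrices on the later slides (the exact records on the original slide may differ); the later sketches reuse it.

    # Hypothetical reconstruction of the car-insurance training set used in the
    # later examples: Age is continuous, CarType is categorical, Risk is the
    # classifying attribute, rid is the index of the record.
    training_set = [
        # (rid, Age, CarType, Risk)
        (0, 23, "family", "High"),
        (1, 17, "sports", "High"),
        (2, 43, "sports", "High"),
        (3, 68, "family", "Low"),
        (4, 32, "truck",  "Low"),
        (5, 20, "family", "High"),
    ]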

Page 6

Classification algorithms

• Two classification algorithms are presented in the paper
  – SLIQ
  – SPRINT
• SLIQ
  – Handles large disk-resident data
  – Requires some data per record to stay memory-resident at all times
• SPRINT
  – Removes all of the memory restrictions
  – Fast, scalable, and easily parallelized
• How good is it?
  – Excellent scaleup, speedup and sizeup properties

Page 7

SPRINT

Serial Algorithm

• A decision tree classifier is built in two phases
  – Growth phase
  – Prune phase
• The growth phase is computationally much more expensive than pruning

Page 8

Serial Algorithm

Growth Phase

• Key issues
  – Finding the split points that define node tests
  – Having chosen a split point, how to partition the data
• Data structures
  – Attribute lists
  – Histograms

Page 9

Serial Algorithm

Data Structure (Attribute lists)

• SPRINT creates an attribute list for each attribute
• Entries are called attribute records; each contains
  – Attribute value
  – Class label
  – Index (rid) of the record
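A minimal sketch of how these attribute lists could be built from the training set above (build_attribute_lists is an illustrative name, not from the paper). Continuous lists are sorted once on the attribute value, which is what makes the later one-pass gini scan possible; categorical lists are left unsorted.

    def build_attribute_lists(training_set):
        # One list per attribute; each entry is (attribute value, class label, rid).
        age_list = [(age, risk, rid) for rid, age, cartype, risk in training_set]
        car_list = [(cartype, risk, rid) for rid, age, cartype, risk in training_set]
        age_list.sort(key=lambda rec: rec[0])   # continuous attribute: sort by value once
        return {"Age": age_list, "CarType": car_list}

    attribute_lists = build_attribute_lists(training_set)
    # attribute_lists["Age"][0] -> (17, 'High', 1)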

Page 10

Serial Algorithm

Data Structure (Attribute lists)

• The initial lists are associated with the root

• As the tree is grown, the attribute lists belonging to each node are partitioned and associated with the children

Page 11

Serial Algorithm

Data Structure (Histograms)

• For continuous attributes, two histograms are associated with each decision-tree node, denoted Cbelow and Cabove
  – Cbelow: maintains the class distribution of the attribute records that have already been processed
  – Cabove: maintains the class distribution of the attribute records that have not yet been processed

Page 12

Serial Algorithm

Data Structure (Histograms)

• For categorical attributes, only one histogram is associated with each node; it is called the count matrix

Page 13

Serial Algorithm

Finding split points

• While growing the tree, the goal at each node is to determine the split point that “best” divides the training records belonging to that leaf
• The paper uses the gini index
• Gini(S) = 1 - Σj pj²
  – where pj is the relative frequency of class j in S
• Ginisplit(S) = (n1/n)·Gini(S1) + (n2/n)·Gini(S2)
  – where the split divides S (n records) into S1 (n1 records) and S2 (n2 records)
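In code the two formulas could look like this (a minimal sketch; gini and gini_split are my names, not the paper's, and class_counts is simply a list of per-class record counts):

    def gini(class_counts):
        # class_counts: number of records of each class in a set S
        n = sum(class_counts)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in class_counts)

    def gini_split(counts1, counts2):
        # Weighted gini index of a binary partition of S into S1 and S2
        n1, n2 = sum(counts1), sum(counts2)
        n = n1 + n2
        return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

    # e.g. gini([3, 0]) == 0.0 and round(gini_split([3, 0], [1, 2]), 2) == 0.22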

Page 14

Serial Algorithm

Example – Split point for a continuous attribute

Age  Class  rid
17   High   1
20   High   5
23   High   0
32   Low    4
43   High   2
68   Low    3

Cursor position 3:

         H  L
Cbelow   3  0
Cabove   1  2

Example:
Ginisplit(S) = (n1/n)·gini(S1) + (n2/n)·gini(S2)
gini(S1) = 1 - [(3/3)² + (0/3)²] = 0
gini(S2) = 1 - [(1/3)² + (2/3)²] = 0.44
Ginisplit3 = (3/6)·gini(S1) + (3/6)·gini(S2) = 0.22
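Using the sorted Age attribute list and the gini_split helper from the earlier sketch, the scan over cursor positions can be written as follows (again a sketch with illustrative names; class labels High/Low are mapped to histogram indices 0/1). At cursor position 3 it reproduces the 0.22 above.

    def best_continuous_split(attr_list, classes=("High", "Low")):
        # Cbelow counts the records already scanned, Cabove the rest (per class).
        idx = {label: i for i, label in enumerate(classes)}
        c_above = [0] * len(classes)
        for _, label, _ in attr_list:
            c_above[idx[label]] += 1
        c_below = [0] * len(classes)

        best = (float("inf"), None)              # (gini_split value, cursor position)
        for pos in range(len(attr_list) + 1):
            best = min(best, (gini_split(c_below, c_above), pos))
            if pos < len(attr_list):             # advance the cursor past one record
                c_below[idx[attr_list[pos][1]]] += 1
                c_above[idx[attr_list[pos][1]]] -= 1
        return best

    # best_continuous_split(attribute_lists["Age"]) -> (0.2222..., 3),
    # i.e. split between ages 23 and 32: Age < 27.5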

Page 15

Serial Algorithm

Example – Split point for a continuous attribute

• After computing the gini index at every cursor position, we choose the lowest as the split point
• Therefore, we split at position 3, where the candidate split point is the mid-point between age 23 and 32 (i.e. Age < 27.5)

Ginisplit0 = 0.44
Ginisplit1 = 0.40
Ginisplit2 = 0.33
Ginisplit3 = 0.22
Ginisplit4 = 0.41
Ginisplit5 = 0.26
Ginisplit6 = 0.44

Page 16

Serial Algorithm

Example – Split point for a categorical attribute

Count matrix for CarType:

         H  L
Family   2  1
Sports   2  0
Truck    0  1

Example (splitting on CarType in {Sports}):
gini(S1) = 1 - [(2/2)² + (0/2)²] = 0
gini(S2) = 1 - [(2/4)² + (2/4)²] = 0.5
ginisplit(sports) = (2/6)·gini(S1) + (4/6)·gini(S2) = 0.33
ginisplit(family) = 0.44
ginisplit(truck) = 0.266
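A corresponding sketch for the categorical attribute: build the count matrix from the CarType list and evaluate the gini index of splitting out each single attribute value (the paper also considers larger subsets of values; restricting to singletons keeps the sketch short and matches the numbers on this slide). gini_split is the helper from the earlier sketch.

    from collections import defaultdict

    def categorical_split_ginis(attr_list, classes=("High", "Low")):
        idx = {label: i for i, label in enumerate(classes)}
        # Count matrix: one row of per-class counts for each attribute value.
        count_matrix = defaultdict(lambda: [0] * len(classes))
        for value, label, _ in attr_list:
            count_matrix[value][idx[label]] += 1
        total = [sum(col) for col in zip(*count_matrix.values())]

        # Gini index of splitting out each single value v: S1 = {v}, S2 = the rest.
        return {value: gini_split(row, [t - r for t, r in zip(total, row)])
                for value, row in count_matrix.items()}

    # categorical_split_ginis(attribute_lists["CarType"])
    # -> {'family': 0.444..., 'sports': 0.333..., 'truck': 0.266...}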

Page 17

Serial Algorithm

Performing the split

• Once the best split point has been found for a node, we then execute the split by creating child nodes and dividing the attribute records between them

• For the rest of the attribute lists (i.e. CarType) we need to retrieve the information by using rids
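One way the probe structure could be realized, as a rough sketch: a hash set of the rids that the winning attribute's records send to the left child, against which the other attribute lists are then probed. perform_split and the Age < 27.5 predicate reuse the example data from the earlier sketches and are illustrative, not the paper's code.

    def perform_split(attribute_lists, win_attr, predicate):
        # 1. Split the winning attribute's list by scanning it, remembering which
        #    rids went to the left child; this set is the probe structure.
        left, right = {}, {}
        probe = set()
        win_left, win_right = [], []
        for rec in attribute_lists[win_attr]:
            if predicate(rec[0]):
                win_left.append(rec)
                probe.add(rec[2])            # rid goes into the probe structure
            else:
                win_right.append(rec)
        left[win_attr], right[win_attr] = win_left, win_right

        # 2. Split every other attribute list by probing with each record's rid.
        for attr, lst in attribute_lists.items():
            if attr == win_attr:
                continue
            left[attr] = [rec for rec in lst if rec[2] in probe]
            right[attr] = [rec for rec in lst if rec[2] not in probe]
        return left, right

    left_lists, right_lists = perform_split(attribute_lists, "Age", lambda age: age < 27.5)
    # left_lists["CarType"] keeps the records with rid 0, 1 and 5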

Page 18

Serial Algorithm

Comparison with SLIQ

• SLIQ does not have separate sets of attribute lists for each node
• Advantage
  – Does not have to rewrite these lists during a split
  – Reassignment of records is easy
• Disadvantage
  – The class list must stay in memory all the time, which limits the amount of data that can be classified by SLIQ

The purpose of SPRINT is not to outperform SLIQ. Rather, the purpose of the algorithm is to develop an accurate classifier for very large datasets, and to develop such a classifier efficiently. Furthermore, SPRINT is designed to be easily parallelizable, so the workload can be shared among N processors.

Page 19

SPRINT

Parallelizing Classification

• SPRINT was specifically designed to remove any dependence on data structures that are either centralized or memory-resident
• The parallel algorithms assume a shared-nothing environment, where each of the N processors has private memory and disks. The processors are connected by a communication network and can communicate only by passing messages

Page 20

Parallelizing Classification

Data Placement and Workload Balancing

• The main data structures are
  – attribute lists
  – class histograms
• SPRINT achieves uniform data placement and workload balancing by distributing the attribute lists evenly over the N processors
• Each processor works on only 1/N of the total data

Page 21

Parallelizing Classification

Finding split points (Continuous attributes)

• In a parallel environment, each processor has a separate contiguous section of a “global” attribute list
  – Cbelow: must initially reflect the class distribution of all sections of the attribute list assigned to processors of lower rank
  – Cabove: must initially reflect the class distribution of the local section as well as all sections assigned to processors of higher rank

Page 22

Parallelizing Classification

Example: Split point for a continuous attribute

Processor 0:

Age  Class  rid
17   High   1
20   High   5
23   High   0

         H  L
Cbelow   0  0
Cabove   4  2

Processor 1:

Age  Class  rid
32   Low    4
43   High   2
68   Low    3

         H  L
Cbelow   3  0
Cabove   1  2
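As a sketch of this initialization, note that each processor only needs the per-class totals of every section; one exchange of those local counts (a prefix-sum-style step) is enough to set up Cbelow and Cabove. The exchange is simulated with plain Python lists here instead of real message passing, and the function name is illustrative. It reproduces the histograms shown above.

    def initial_histograms(local_sections, classes=("High", "Low")):
        # local_sections[p] is processor p's contiguous section of the sorted list.
        idx = {label: i for i, label in enumerate(classes)}
        # Per-processor class totals (in reality gathered via message passing).
        totals = []
        for section in local_sections:
            counts = [0] * len(classes)
            for _, label, _ in section:
                counts[idx[label]] += 1
            totals.append(counts)

        histograms = []
        for p in range(len(local_sections)):
            # Cbelow: everything held by lower-ranked processors.
            c_below = [sum(totals[q][c] for q in range(p)) for c in range(len(classes))]
            # Cabove: the local section plus everything held by higher-ranked processors.
            c_above = [sum(totals[q][c] for q in range(p, len(local_sections)))
                       for c in range(len(classes))]
            histograms.append((c_below, c_above))
        return histograms

    age = attribute_lists["Age"]
    # Processor 0 holds ages 17, 20, 23; processor 1 holds 32, 43, 68.
    # -> [([0, 0], [4, 2]), ([3, 0], [1, 2])], matching the histograms above.
    print(initial_histograms([age[:3], age[3:]]))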

Page 23

Parallelizing Classification

Finding split points (Categorical attributes)

• Each processor builds a count matrix for each leaf
• The processors exchange these matrices
• Each processor sums the local matrices to obtain the “global” count matrix
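The summation itself is just an element-wise addition of the local matrices. A tiny sketch (the exchange is again simulated; a real implementation would use message passing, e.g. an all-reduce):

    def global_count_matrix(local_matrices):
        # local_matrices: one {attribute value: per-class counts} dict per processor.
        combined = {}
        for matrix in local_matrices:
            for value, counts in matrix.items():
                row = combined.setdefault(value, [0] * len(counts))
                for c, count in enumerate(counts):
                    row[c] += count
        return combined

    # Two processors, each holding 3 of the 6 CarType records:
    # global_count_matrix([{"family": [1, 0], "sports": [2, 0]},
    #                      {"family": [1, 1], "truck": [0, 1]}])
    # -> {'family': [2, 1], 'sports': [2, 0], 'truck': [0, 1]}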

Page 24

Parallelizing Classification

Performing the Splits

• Splitting the attribute lists for each leaf is identical to the serial algorithm
• The only difference is that before building the probe structure, we need to collect rids from all the processors
• The processors exchange their rids
• Each processor then constructs a probe structure with all the rids and uses it to split the leaf's remaining attribute lists
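A minimal sketch of the rid-collection step, continuing the earlier example: each processor contributes the rids its section of the winning attribute sends to the left child, and the union of these sets is the probe structure, used exactly as in the serial case. The exchange is simulated; names are illustrative.

    def gather_probe_structure(local_left_rids):
        # local_left_rids[p]: rids that processor p's section of the winning
        # attribute assigns to the left child.  In a real run these sets are
        # exchanged between processors; here they are simply unioned.
        probe = set()
        for rids in local_left_rids:
            probe |= rids
        return probe

    # Processor 0 holds ages 17, 20, 23 (all < 27.5); processor 1 holds none below 27.5:
    # gather_probe_structure([{1, 5, 0}, set()]) -> {0, 1, 5}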

Page 25

Parallelizing Classification

Parallelizing SLIQ

• Two approaches for parallelizing SLIQ
  – The class list is replicated in the memory of every processor (SLIQ/R)
  – The class list is distributed such that each processor's memory holds only a portion of the entire list (SLIQ/D)
• Disadvantages
  – SLIQ/R: the size of the training set is limited by the memory size of a single processor
  – SLIQ/D: high communication cost

Page 26

Performance Evaluation

• The primary metrics for evaluating classifier performance are
  – Classification accuracy
  – Classification time
  – Size of the decision tree
• The accuracy and tree-size characteristics of SPRINT are identical to SLIQ's
• The evaluation therefore focuses only on classification time

Page 27

Performance Evaluation

Datasets

• Each record in the synthetic database consists of nine attributes
• The paper presents results for two of the classification functions used to generate the data
  – Function 2 results in a fairly small decision tree
  – Function 7 produces a very large tree
• Both functions divide the database into two classes

Page 28

Performance Evaluation

Serial Performance

• Training sets ranging in size from 10,000 records to 2.5 million records were used
• This range examines how well SPRINT performs in operating regions where SLIQ can and cannot run

Page 29

Performance Evaluation

Comparison of Parallel Algorithms

• In these experiments, each processor contained 50,000 training examples and the number of processors varied from 2 to 16

Page 30

Performance Evaluation

Scaleup

Page 31

Performance Evaluation

Speedup

Page 32

Performance Evaluation

Sizeup

Page 33

Performance Evaluation

Conclusion

• Classification can handle very large databases by building scalable classifiers
• SPRINT allows efficient classification of datasets of virtually any size and is easily parallelizable
• SPRINT scales nicely with the size of the dataset
• SPRINT parallelizes efficiently in a shared-nothing environment

Page 34

References

• John Shafer, Rakesh Agrawal and Manish Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. In Proceedings of the 22nd International Conference on Very Large Databases (VLDB), 1996.

Page 35