
Page 1: PowerPoint

CS 267: Applications of Parallel Computers

Lecture 25:

Data Mining

Kathy Yelick

Material based on lecture by

Vipin Kumar and Mahesh Joshi

http://www-users.cs.umn.edu/~mjoshi/hpdmtut/

Page 2: PowerPoint

Lecture Schedule

• 12/3: Three topics
  – Projects and performance analysis (N-body assignment observations)
  – Data Mining
  – HKN Review at 3:40

• 12/5: The Future of Parallel Computing (David Bailey)

• 12/13: CS267 Poster Session (2-4pm, Woz)

• 12/14: Final Papers due

Page 3: PowerPoint

N-Body Assignment

• Some observations on your N-Body assignments
  – Problems and pitfalls to avoid in the final project

• Performance analysis
  – Micro-benchmarks are good
  – To understand application performance, build up a performance model from measured pieces, e.g., network performance

• Noise is expected, but quantifying it is also useful
  – Means alone can be misleading
  – Median + variance is better (see the sketch below)

• Carefully select problem sizes
  – Are they large enough to justify the number of processors?
  – What do real users want?
  – Can you vary the problem size in some reasonable way?
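
Since means alone can hide noise, here is a minimal sketch, using Python's statistics module and made-up timing numbers, of reporting median plus variance for repeated runs:

```python
import statistics

# Hypothetical wall-clock times (seconds) from 7 repeated runs of the
# same benchmark; one run was perturbed by OS noise.
times = [1.02, 1.01, 1.03, 1.02, 1.74, 1.01, 1.02]

mean = statistics.mean(times)      # pulled upward by the noisy run
median = statistics.median(times)  # robust to the outlier
var = statistics.variance(times)   # quantifies the noise itself

print(f"mean={mean:.3f}s  median={median:.3f}s  variance={var:.4f}")
```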

Page 4: PowerPoint

N-Body Assignment

• Minor comments on N-Body results
  – Describe your performance graphs: what is expected, what is surprising

• Sanity check your numbers
  – Are you getting more than P-times speedup on P processors?
  – Does the observed running time (via the "time" command) match your totals?
  – What is your Mflops rate? Is it between 10% and 90% of hardware peak?

• Be careful of different timers (see the sketch below)
  – gettimeofday is wall-clock time (you are charged for the OS and other processes)
  – clock is process time (Linux creates a process per thread)
  – The RT clock on the Cray is wall-clock time

• Check captions, titles, and axes of figures/graphs

• Run a spell checker
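
To make the wall-clock vs. process-time distinction concrete, here is a small Python sketch; time.time() and time.process_time() stand in for gettimeofday and clock:

```python
import time

def busy_work(n=10_000_000):
    # CPU-bound loop, so process time and wall time roughly agree here
    s = 0
    for i in range(n):
        s += i
    return s

t0_wall, t0_cpu = time.time(), time.process_time()
busy_work()
time.sleep(1.0)  # sleeping advances wall-clock time but not process time
t1_wall, t1_cpu = time.time(), time.process_time()

print(f"wall-clock elapsed: {t1_wall - t0_wall:.2f}s")  # includes the sleep
print(f"process time:       {t1_cpu - t0_cpu:.2f}s")    # excludes the sleep
```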

Page 5: PowerPoint

Outline

• Overview of Data Mining

• Serial Algorithms for Classification

• Parallel Algorithms for Classification

• Summary

Page 6: PowerPoint

Data Mining Overview

• What is Data Mining?

• Data Mining Tasks
  – Classification
  – Clustering
  – Association Rules and Sequential Patterns
  – Regression
  – Deviation Detection

Page 7: PowerPoint

What is Data Mining?

• Several definitions:
  – Search for valuable information in large volumes of data
  – Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover useful rules
  – A step in the Knowledge Discovery in Databases (KDD) process

Page 8: PowerPoint

Knowledge Discovery Process

• Knowledge Discovery in Databases: identify valid, novel, useful, and understandable patterns in data

[Figure: the KDD pipeline — operational databases are cleaned, collected, and summarized into a data warehouse; data preparation yields training data; data mining yields models/patterns; verification and evaluation close the loop.]

Page 9: PowerPoint

Why Mine Data?

• Data is collected and stored at an enormous rate
  – Remote sensors on satellites
  – Telescopes scanning the skies
  – Microarrays generating gene expression data
  – Scientific simulations

• Traditional analysis techniques are infeasible at this scale

• Data mining for data reduction
  – Cataloging, classifying, segmenting
  – Helps scientists formulate hypotheses

Page 10: PowerPoint

Data Mining Tasks

• Predictive methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Deviation Detection

• Descriptive methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association Rule Discovery
  – Sequential Pattern Discovery

Page 11: PowerPoint

Classification

• Given a collection of records (the training set)
  – Each record contains a set of attributes, one of which is the class

• Find a model for the class attribute as a function of the values of the other attributes

• Goal: previously unseen records should be assigned a class as accurately as possible
  – A test set is used to determine the accuracy (see the sketch below)

• Examples:
  – Direct marketing: targeted mailings based on a buy/don't-buy class
  – Fraud detection: predict fraudulent use of credit cards, insurance, telephones, etc.
  – Sky survey cataloging: catalog objects as star or galaxy
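
As a minimal illustration of the train/test workflow, here is a sketch using scikit-learn on synthetic data; both the library choice and the data are my additions, not from the lecture:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a labeled training set of records
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out a test set to estimate accuracy on previously unseen records
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(criterion="gini").fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```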

Page 12: PowerPoint

Classification Example: Sky Survey

• Currently 3K images, 23K x 23K pixels each

• Approach:
  – Segment the image
  – Measure image attributes: 40 per object
  – Model the class (star/galaxy or stage) based on the attributes

Images from: http://aps.umn.edu

Page 13: PowerPoint

Clustering

• Given a set of data points:
  – Each has a set of attributes
  – There is a similarity measure among them

• Find clusters such that:
  – Points in one cluster are more similar to each other than to points in other clusters

• Similarity measures are problem-specific:
  – E.g., Euclidean distance for continuous data (see the sketch below)
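
A minimal sketch of Euclidean distance as the similarity measure, assigning made-up points to their nearest of two hypothetical cluster centers:

```python
import math

def euclidean(p, q):
    # Euclidean distance between two points with continuous attributes
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

centers = [(0.0, 0.0), (5.0, 5.0)]             # hypothetical cluster centers
points = [(0.5, 1.0), (4.8, 5.2), (1.0, 0.2)]  # made-up data points

for p in points:
    nearest = min(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
    print(f"{p} -> cluster {nearest}")
```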

Page 14: PowerPoint

Clustering Applications

• Market segmentation:
  – Divide a market into distinct subsets of customers

• Document clustering:
  – Find groups of related documents based on common keywords
  – Used in information retrieval

• Financial market analysis:
  – Find groups of companies with common stock behavior

Page 15: PowerPoint

Association Rule Discovery

• Given a set of records, each containing a set of items
  – Produce dependency rules that predict occurrences of an item based on occurrences of other items

• Applications:
  – Marketing, sales promotion, and shelf management
  – Inventory management

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} → {Coke}
{Diaper, Milk} → {Beer}
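
A small sketch that computes the support and confidence of the {Diaper, Milk} → {Beer} rule directly from the transaction table above:

```python
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"Diaper", "Milk"}, {"Beer"}
s = support(antecedent | consequent)
confidence = s / support(antecedent)
print(f"support={s:.2f}  confidence={confidence:.2f}")  # 0.40 and 0.67
```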

Page 16: PowerPoint

Other Data Mining Problems

• Sequential Pattern Discovery
  – Given a set of objects, each with a timeline of events
  – Find rules that predict sequential dependencies
  – Example: patterns in telecommunications alarm logs

• Regression:
  – Predict the value of one variable given the others
  – Assume a linear or non-linear model of dependence
  – Examples: predict sales amounts from advertising expenditures; predict wind velocities from temperature, pressure, etc.

• Deviation Detection
  – Discover the most significant changes in data from previously measured values

Page 17: PowerPoint

Serial Algorithms for Classification

• Decision tree classifiers: inexpensive, easy to interpret, easy to integrate into databases
  – Overview of decision trees
  – Tree induction
  – Tree pruning

• Rule-based methods

• Memory-based reasoning

• Neural networks

• Genetic algorithms

• Bayesian networks

Page 18: PowerPoint

Decision Tree Algorithms

• Many algorithms:
  – Hunt's Algorithm
  – CART
  – ID3, C4.5
  – SLIQ, SPRINT

Example training data:

Tid  Refund  Marital  Income  Cheat
1    Yes     S        125K    No
2    No      M        100K    No
3    No      S        70K     No
4    Yes     M        120K    No
5    No      D        95K     Yes
6    No      M        60K     No
7    Yes     D        220K    No
8    No      S        85K     Yes
9    No      M        75K     No
10   No      S        90K     Yes

[Figure: the induced decision tree — the root splits on Refund (Yes → NO); Refund = No splits on Marital (M → NO); Marital = S or D splits on Income (<= 80K → NO, > 80K → YES).]
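
A minimal sketch of the induced tree applied as nested rules; the dictionary keys mirror the table's columns, and income is in thousands (my encoding, for illustration):

```python
def classify(record):
    # Decision tree from the slide: Refund, then Marital, then Income
    if record["Refund"] == "Yes":
        return "No"
    if record["Marital"] == "M":
        return "No"
    # Single (S) or divorced (D): the class depends on income
    return "Yes" if record["Income"] > 80 else "No"

print(classify({"Refund": "No", "Marital": "S", "Income": 90}))  # Yes
```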

Page 19: PowerPoint

Tree Induction

• Greedy strategy
  – Split based on the attribute that optimizes a splitting criterion

• Two phases at each node in the tree (sketched below)
  – Split determining phase:
    – Which attribute to split on
    – How to split
      – Two-way split of a multi-valued attribute (Marital: S,D vs. M)
      – Continuous attributes: discretize in advance or cluster on the fly
  – Splitting phase: do the split and create the child nodes
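
A compact, simplified sketch of the greedy two-phase recursion, using the GINI criterion from the next slides; only equality-based two-way splits are tried, and pruning and stopping rules are omitted:

```python
from collections import Counter

def gini(labels):
    # GINI(t) = 1 - sum_j p(j|t)^2 over the classes j present at the node
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels, attrs):
    # Split determining phase: try each (attribute, value) two-way split
    # and keep the one with the lowest weighted GINI of the children.
    best = None
    for a in attrs:
        for v in set(r[a] for r in records):
            left = [i for i, r in enumerate(records) if r[a] == v]
            right = [i for i, r in enumerate(records) if r[a] != v]
            if not left or not right:
                continue
            score = (len(left) * gini([labels[i] for i in left]) +
                     len(right) * gini([labels[i] for i in right])) / len(records)
            if best is None or score < best[0]:
                best = (score, a, v, left, right)
    return best

def grow(records, labels, attrs):
    # Splitting phase: perform the best split and recurse on the children
    if len(set(labels)) == 1:
        return labels[0]                             # pure leaf
    split = best_split(records, labels, attrs)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    _, a, v, left, right = split
    return {"attr": a, "value": v,
            "eq":  grow([records[i] for i in left],  [labels[i] for i in left],  attrs),
            "neq": grow([records[i] for i in right], [labels[i] for i in right], attrs)}

# Tiny demo on made-up records
records = [{"Refund": "Yes"}, {"Refund": "No"}, {"Refund": "No"}]
labels = ["No", "Yes", "Yes"]
print(grow(records, labels, ["Refund"]))
```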

Page 20: PowerPoint

GINI Splitting Criterion

• Gini index: GINI(t) = 1 − Σ_j [p(j|t)]²
  where p(j|t) is the relative frequency of class j at node t

• Measures the impurity of a node
  – Maximum (1 − 1/n_c, for n_c classes) when records are equally distributed across the classes
  – Minimum (0.0) when all records belong to one class, the most informative case

• Other criteria may be better, but evaluate similarly

Example (six records split between classes C1 and C2):

C1: 0  C2: 6  →  Gini = 0.000
C1: 1  C2: 5  →  Gini = 0.278
C1: 2  C2: 4  →  Gini = 0.444
C1: 3  C2: 3  →  Gini = 0.500
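
A one-function sketch of the Gini computation that reproduces the four example values above:

```python
def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, where p(j|t) = count_j / total
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, f"Gini = {gini(counts):.3f}")  # 0.000, 0.278, 0.444, 0.500
```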

Page 21: PowerPoint

Splitting Based on GINI

• Used in CART, SLIQ, and SPRINT

• Criterion: minimize the GINI index of the split

• When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{j=1..k} (n_j / n) · GINI(j)

  where n_j = number of records at child j, and n = number of records at node p

• To evaluate:
  – Categorical attributes: compute the counts of each class for each value
  – Continuous attributes: sort, then choose one or more split points (see the sketch below)
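
A sketch of evaluating a continuous attribute: sort by value, sweep the candidate thresholds, and keep the one minimizing the weighted GINI. The data reuses the Income/Cheat columns from the earlier table:

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_continuous_split(values, labels):
    # Sort records by attribute value, then sweep candidate thresholds,
    # maintaining class counts on each side incrementally.
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}
    right = {c: labels.count(c) for c in classes}
    n, best = len(pairs), None
    for i in range(n - 1):
        v, c = pairs[i]
        left[c] += 1
        right[c] -= 1
        if v == pairs[i + 1][0]:
            continue  # can only split between distinct values
        # GINI_split = (n_left/n) GINI(left) + (n_right/n) GINI(right)
        score = ((i + 1) / n) * gini(list(left.values())) + \
                ((n - i - 1) / n) * gini(list(right.values()))
        threshold = (v + pairs[i + 1][0]) / 2
        if best is None or score < best[0]:
            best = (score, threshold)
    return best

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_continuous_split(incomes, cheat))  # (0.3, 97.5)
```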

Page 22: PowerPoint

Splitting Based on INFO

• Information/entropy:

  INFO(t) = − Σ_j p(j|t) log p(j|t)

• Information gain:

  GAIN_split = INFO(p) − Σ_{j=1..k} (n_j / n) · INFO(j)

• Measures the reduction in entropy; choose the split that maximizes the gain

• Used in ID3 and C4.5

• Problem: tends to prefer splits with a large number of partitions
  – Variations (such as C4.5's gain ratio) avoid this

• Computation is similar to GINI (see the sketch below)
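
A sketch of entropy and information gain over class-count vectors, using base-2 logarithms and a hypothetical parent split into two children:

```python
import math

def entropy(counts):
    # INFO(t) = -sum_j p(j|t) log2 p(j|t); the 0 log 0 term is taken as 0
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def info_gain(parent, children):
    # GAIN_split = INFO(parent) - sum_j (n_j / n) INFO(child_j)
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

# Hypothetical node with classes (C1, C2) = (5, 5), split into two children
print(info_gain((5, 5), [(4, 1), (1, 4)]))  # ~0.278 bits
```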

Page 23: PowerPoint

C4.5 Classification

• Simple depth-first construction of the tree

• Sorts continuous attributes at each node

• Needs to fit the data in memory
  – To avoid an out-of-core sort
  – Limits scalability

Page 24: PowerPoint

SLIQ Classification

• Arrays of continuous attributes are pre-sorted

• The classification tree is grown breadth-first

• A class list structure maintains the mapping: record id → tree node

• Split determining phase: the class list is consulted to compute the best split for each attribute (breadth-first)

• Splitting phase: the list of the splitting attribute is used to update the leaf labels in the class list (no physical splitting; see the sketch below)

• Problem: the class list is frequently and randomly accessed
  – It must stay in memory for efficient performance
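
A toy sketch of the class-list idea, with structures simplified from the SLIQ design: the attribute list holds pre-sorted (value, record id) pairs, and the class list maps record id → class label and current leaf:

```python
# Pre-sorted attribute list for Income: (value, record_id), sorted by value
income_list = [(60, 6), (70, 3), (75, 9), (85, 8), (90, 10),
               (95, 5), (100, 2), (120, 4), (125, 1), (220, 7)]

# Class list: record_id -> [class_label, current_leaf]; all start at the root
class_list = {rid: ["Yes" if rid in (5, 8, 10) else "No", "root"]
              for rid in range(1, 11)}

# Splitting phase: suppose the root splits on Income <= 97.5. Scan only
# the splitting attribute's list and relabel leaves in the class list --
# the attribute lists themselves are never physically split.
for value, rid in income_list:
    class_list[rid][1] = "L" if value <= 97.5 else "R"

print(class_list[5])  # ['Yes', 'L']
print(class_list[7])  # ['No', 'R']
```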

Page 25: PowerPoint

SLIQ Example

Page 26: PowerPoint

SPRINT

• Arrays of continuous attributes are pre-sorted
  – Sorted order is maintained during splits

• The classification tree is grown breadth-first

• Attribute lists are physically split among the tree nodes

• The split determining phase is just a linear scan of the lists at each node

• A hashing scheme is used in the splitting phase (see the sketch below)
  – Record ids from the splitting attribute's list are hashed with the destination tree node
  – The remaining attribute arrays are split by querying this hash table

• Problem: the hash table is O(N) at the root
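
A toy sketch of the hashing step, with list layouts simplified: scanning the splitting attribute's list fills a record-id → child table, and the other attribute lists are partitioned by probing it, which preserves their sorted order:

```python
# Pre-sorted attribute lists: (value, record_id, class)
income = [(60, 6, "No"), (70, 3, "No"), (85, 8, "Yes"), (95, 5, "Yes")]
age    = [(23, 5, "Yes"), (30, 8, "Yes"), (40, 3, "No"), (45, 6, "No")]

# Splitting phase, part 1: split on Income <= 80 and hash rid -> child
child_of = {rid: ("L" if v <= 80 else "R") for v, rid, c in income}

# Part 2: partition every other attribute list by probing the hash table;
# a stable scan keeps each child's list in sorted order.
age_L = [e for e in age if child_of[e[1]] == "L"]
age_R = [e for e in age if child_of[e[1]] == "R"]

print(age_L)  # [(40, 3, 'No'), (45, 6, 'No')]
print(age_R)  # [(23, 5, 'Yes'), (30, 8, 'Yes')]
```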

Page 27: PowerPoint

Parallel Algorithms for Classification

• Driven by the need to handle large data sets
  – Larger aggregate memory on parallel machines
  – Scales on cluster architectures

• I/O time dominates
  – The benefits (cost/performance) are harder to analyze than for a simple MFLOP-limited problem
  – I.e., buy disks for parallel bandwidth vs. buying processors + memory

Page 28: PowerPoint

Parallel Tree Construction: Approach 1

• First approach: partition the data; perform data-parallel operations across the tree nodes (see the sketch below)

• Requires a global reduction per tree node

• Expensive when the number of tree nodes is large
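
A sketch of the data-parallel pattern using mpi4py (my choice of library; the lecture does not name one): each process counts classes over its local partition, and a global reduction combines the counts used by the splitting criterion:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

# Hypothetical local partition: class labels 0/1 for this process's records
rng = np.random.default_rng(comm.Get_rank())
local_labels = rng.integers(0, 2, size=1000)

# Local class counts for the current tree node
local_counts = np.bincount(local_labels, minlength=2).astype(np.int64)
global_counts = np.empty_like(local_counts)

# One global reduction per tree node: this cost grows with the tree
comm.Allreduce(local_counts, global_counts, op=MPI.SUM)

if comm.Get_rank() == 0:
    print("global class counts:", global_counts)
```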

Page 29: PowerPoint

Parallel Tree Construction: Approach 2

• Task parallelism: exploit parallelism between tree nodes

• Load imbalance, since the number of records varies across nodes

• Locality: child and parent need the same data

Page 30: PowerPoint

Parallel Tree Construction: Hybrid Approach

• Switch from data parallelism (within a tree node) to task parallelism (between tree nodes) when:

  total communication cost >= moving cost + load-balancing cost

• This splitting rule ensures:

  communication cost <= 2 × optimal communication cost
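
The switching rule as a one-line predicate; the three cost estimates would come from a machine and application model, and the names here are hypothetical:

```python
def should_switch_to_task_parallelism(comm_cost, moving_cost, balance_cost):
    # Switch when continuing data-parallel communication would cost at
    # least as much as redistributing the data plus rebalancing the load.
    return comm_cost >= moving_cost + balance_cost

print(should_switch_to_task_parallelism(10.0, 6.0, 3.0))  # True
```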

Page 31: PowerPoint

Continuous Data

• With continuous data, parallel mining additionally requires:
  – A parallel sort
    – Essentially a transpose of the data: an all-to-all communication (see the sketch below)
  – Parallel hashing
    – Random small accesses

• Both are very hard to do efficiently on current machines
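
A minimal mpi4py sketch of the all-to-all (transpose) communication behind a parallel sort, assuming equal-sized blocks so the simple Alltoall applies; the bucketing logic of a real sample sort is omitted:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p = comm.Get_size()

# Each process holds p equal-sized blocks; block i is destined for process i
send = np.arange(p * 4, dtype=np.int64) + 100 * comm.Get_rank()
recv = np.empty_like(send)

# The transpose step of a parallel sort: every process exchanges one
# block with every other process
comm.Alltoall(send, recv)

print(comm.Get_rank(), recv)
```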

Page 32: PowerPoint

Performance Results from ScalParC

• Parallel running time on the Cray T3E

Page 33: PowerPoint

Performance Results from ScalParC

• Runtime with a constant problem size per processor, also on the T3E