
CS570 Introduction to Data Mining

Classification and Prediction 2

Partial slide credits:
Han and Kamber
Tan, Steinbach, Kumar

Classification and Prediction

• Last lecture
  • Overview
  • Decision tree induction
  • Bayesian classification
• Today
  • Bayesian network learning
  • Model evaluation
  • kNN classification and collaborative filtering
  • Rule-based methods
• Upcoming lectures
  • Support Vector Machines (SVM)
  • Neural Networks
  • Regression
  • Ensemble methods

Training Bayesian Networks

• Several scenarios:
  • Given both the network structure and all variables observable: learn only the CPTs
  • Network structure known, some hidden variables: gradient descent (greedy hill-climbing) method, analogous to neural network learning
  • Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  • Unknown structure, all hidden variables: no good algorithms known for this purpose
• Ref.: D. Heckerman, Bayesian networks for data mining

Training Bayesian Networks

• Scenario: Given both the network structure and all variables observable: learn only the CPTs (similar to naive Bayes)
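Since every variable is observed, each CPT entry is just a conditional relative frequency, as in naive Bayes parameter estimation. A minimal sketch; the Smoker → Cancer network, the data, and the pandas-based approach are illustrative assumptions, not from the slides:

```python
import pandas as pd

# Hypothetical fully observed data for a network with a single arc Smoker -> Cancer
df = pd.DataFrame({
    "Smoker": ["yes", "yes", "no", "no", "yes", "no"],
    "Cancer": ["yes", "no", "no", "no", "yes", "no"],
})

# CPT P(Cancer | Smoker): one row per parent configuration, each row sums to 1
cpt = pd.crosstab(index=df["Smoker"], columns=df["Cancer"], normalize="index")
print(cpt)
```

For a node with several parents, the `index` argument can be a list of parent columns; the normalized row counts are then the CPT entries for each parent configuration.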

Training Bayesian Networks

• Scenario: Network structure known, some variables hidden: gradient descent (greedy hill-climbing) method, i.e., search for a solution along the steepest descent of a criterion function (similar to neural network training)
  • Example optimization function: likelihood of observing the data
  • Weights are initialized to random probability values
  • At each iteration, the search moves toward what appears to be the best solution at the moment, without backtracking
  • Weights are updated at each iteration and converge to a local optimum

Training Bayesian Networks

• Scenario: Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
  • Define a total order of the variables
  • Construct conditional sequences and, for each sequence, remove the variables that do not affect the current variable
  • Create an arc using the remaining dependencies

Classification and Prediction

• Last lecture
  • Overview
  • Decision tree induction
  • Bayesian classification
• Today
  • Bayesian network learning
  • Model evaluation
  • kNN classification and collaborative filtering
  • Rule-based methods
• Upcoming lectures
  • Support Vector Machines (SVM)
  • Neural Networks
  • Regression
  • Ensemble methods

Model Evaluation

• Metrics for Performance Evaluation
• Methods for Model Comparison
• Methods for Performance Evaluation

Metrics for Performance Evaluation

• Focus on the predictive capability of a model
• Accuracy of a classifier: percentage of test set tuples that are correctly classified by the model – limitations?
• Binary classification:

  Accuracy = (TP + TN) / (TP + TN + FP + FN)

  • Error rate (misclassification rate) = 1 – accuracy
• Confusion matrix: given m classes, CM_{i,j} indicates the # of tuples in class i that are labeled by the classifier as class j
• Binary classification confusion matrix:

                        PREDICTED CLASS
                        positive   negative
  ACTUAL    positive    TP         FN
  CLASS     negative    FP         TN

  TP (true positive), FN (false negative), FP (false positive), TN (true negative)
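To make the definitions concrete, a short sketch computing accuracy and error rate from the four confusion-matrix counts (the counts themselves are made-up numbers):

```python
# Confusion-matrix counts for a hypothetical binary classifier
TP, FN, FP, TN = 50, 10, 5, 35

accuracy = (TP + TN) / (TP + TN + FP + FN)   # fraction of test tuples classified correctly
error_rate = 1 - accuracy                    # misclassification rate

print(f"accuracy   = {accuracy:.3f}")        # 0.850
print(f"error rate = {error_rate:.3f}")      # 0.150
```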

Limitation of Accuracy

• Consider a 2-class problem
  • Number of Class 0 examples = 9990
  • Number of Class 1 examples = 10
• If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
• Accuracy is misleading because the model does not detect any class 1 example

Cost-Sensitive Measures

Given the binary confusion matrix above (TP, FN, FP, TN):

• Precision = TP / (TP + FP)
• Sensitivity (recall, true positive rate) = TP / (TP + FN)
• Specificity (true negative rate) = TN / (TN + FP)
• False Positive Rate (FPR) = FP / (FP + TN)
• False Negative Rate (FNR) = FN / (FN + TP)
• F-measure = 2 × Precision × Recall / (Precision + Recall)

Cost-Sensitive Measures: Example

The same measures applied to the imbalanced example above, where the model predicts everything as negative (class 0):

                        PREDICTED CLASS
                        positive   negative
  ACTUAL    positive    0          10
  CLASS     negative    0          9990

Here sensitivity/recall = 0/(0 + 10) = 0: despite 99.9% accuracy, the model finds no positive example, and precision and F-measure are undefined (0/0).
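A short sketch computing these measures for the imbalanced example, guarding against zero denominators:

```python
def binary_measures(TP, FN, FP, TN):
    """Cost-sensitive measures from confusion-matrix counts."""
    def safe_div(a, b):
        return a / b if b else float("nan")   # undefined when denominator is 0
    precision = safe_div(TP, TP + FP)
    recall = safe_div(TP, TP + FN)            # sensitivity / true positive rate
    specificity = safe_div(TN, TN + FP)       # true negative rate
    f_measure = safe_div(2 * precision * recall, precision + recall)
    return precision, recall, specificity, f_measure

# Model that predicts everything as negative on the 10-vs-9990 data
print(binary_measures(TP=0, FN=10, FP=0, TN=9990))
# precision = nan (0/0), recall = 0.0, specificity = 1.0, f_measure = nan
```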

Predictor Error Measures

• Measure predictor accuracy: how far off the predicted value is from the actual known value
• Loss function: measures the error between y_i and the predicted value y_i'
  • Absolute error: |y_i − y_i'|
  • Squared error: (y_i − y_i')²
• Test error (generalization error): the average loss over the test set
  • Mean absolute error: (1/d) Σ_{i=1}^{d} |y_i − y_i'|
  • Mean squared error: (1/d) Σ_{i=1}^{d} (y_i − y_i')²
  • Relative absolute error: Σ_{i=1}^{d} |y_i − y_i'| / Σ_{i=1}^{d} |y_i − ȳ|
  • Relative squared error: Σ_{i=1}^{d} (y_i − y_i')² / Σ_{i=1}^{d} (y_i − ȳ)²
    (where ȳ is the mean of the actual values)
• The mean squared error exaggerates the presence of outliers
• The (square) root mean squared error is popularly used; similarly, the root relative squared error
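A compact sketch of these error measures on made-up prediction data:

```python
import math

def error_measures(y, y_pred):
    """Mean/relative absolute and squared errors, plus RMSE."""
    d = len(y)
    y_mean = sum(y) / d
    mae = sum(abs(a - p) for a, p in zip(y, y_pred)) / d
    mse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / d
    rae = sum(abs(a - p) for a, p in zip(y, y_pred)) / sum(abs(a - y_mean) for a in y)
    rse = sum((a - p) ** 2 for a, p in zip(y, y_pred)) / sum((a - y_mean) ** 2 for a in y)
    return mae, mse, math.sqrt(mse), rae, rse

# Hypothetical actual vs. predicted values
print(error_measures([3.0, 5.0, 2.0, 8.0], [2.5, 5.0, 3.0, 7.0]))
```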

Model Evaluation

• Metrics for Performance Evaluation
• Methods for Model Comparison
• Methods for Performance Evaluation

Model Comparison: ROC (Receiver Operating Characteristic)

• From signal detection theory
• True positive rate vs. false positive rate, i.e., sensitivity vs. (1 − specificity)
• Each prediction result represents one point (varying threshold, sample distribution, etc.)
• (Figure: ROC space, with "perfect classification" at the top-left corner and the diagonal "line of no discrimination")

How to Construct an ROC Curve

  Instance   P(+|A)   True Class
  1          0.95     +
  2          0.93     +
  3          0.87     -
  4          0.85     -
  5          0.85     -
  6          0.85     +
  7          0.76     -
  8          0.53     +
  9          0.43     -
  10         0.25     +

• Sort instances according to posterior probability P(+|A) in decreasing order
• Apply a threshold at each unique value of P(+|A)
• Compute and plot TPR and FPR

How to Construct an ROC Curve (cont.)

  Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  Class         +     -     +     -     -     -     +     -     +     +
  TP            5     4     4     3     3     3     3     2     2     1     0
  FP            5     5     4     4     3     2     1     1     0     0     0
  TN            0     0     1     1     2     3     4     4     5     5     5
  FN            0     1     1     2     2     2     2     3     3     4     5
  TPR           1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR           1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

(ROC curve: plot of TPR vs. FPR at each threshold)
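A sketch that reproduces this threshold sweep on the ten instances above; the trapezoidal AUC at the end is an extra illustration, not part of the slide:

```python
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
P = labels.count('+')   # 5 positives
N = labels.count('-')   # 5 negatives

points = []
for t in sorted(set(scores)) + [1.00]:        # threshold at each unique score, plus 1.0
    tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '+')
    fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == '-')
    points.append((fp / N, tp / P))           # one (FPR, TPR) point per threshold
    print(f"threshold >= {t:.2f}: TPR={tp / P:.1f}, FPR={fp / N:.1f}")

# Trapezoidal area under the curve (points sorted by FPR)
points.sort()
auc = sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))
print("AUC ~", auc)
```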

Using ROC for Model Comparison

• Area under the ROC curve (AUC)
  • Ideal: area = 1
  • Diagonal: area = 0.5
• M1 vs. M2? (Figure: the ROC curves of two models M1 and M2 being compared)

Test of Significance

• Given two models:
  • Model M1: accuracy = 85%, tested on 30 instances
  • Model M2: accuracy = 75%, tested on 5000 instances
• Can we say M1 is better than M2?
  • How much confidence can we place on the accuracy of M1 and M2?
  • Can the difference in performance be explained as a result of random fluctuations in the test set?

Confidence Interval for Accuracy

• A prediction can be regarded as a Bernoulli trial
  • A Bernoulli trial has 2 possible outcomes; for prediction: correct or wrong
• A collection of Bernoulli trials has a binomial distribution
• Given x (# of correct predictions), or equivalently acc = x/N, and N (# of test instances), can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy (cont.)

• For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 − p)/N:

  P( Z_{α/2} < (acc − p) / sqrt(p(1 − p)/N) < Z_{1−α/2} ) = 1 − α

  (the area under the normal curve between Z_{α/2} and Z_{1−α/2} is 1 − α)

• Solving for p gives the confidence interval:

  p = ( 2·N·acc + Z² ± Z·sqrt(Z² + 4·N·acc − 4·N·acc²) ) / ( 2·(N + Z²) ),  where Z = Z_{α/2}

Confidence Interval for Accuracy: Example

• Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  • N = 100, acc = 0.8
  • Let 1 − α = 0.95 (95% confidence)
  • From the standard normal table, Z_{α/2} = 1.96

  1 − α    Z
  0.99     2.58
  0.98     2.33
  0.95     1.96
  0.90     1.65

• Resulting interval for acc = 0.8 at different test-set sizes N:

  N          50     100    500    1000   5000
  p(lower)   0.670  0.711  0.763  0.774  0.789
  p(upper)   0.888  0.866  0.833  0.824  0.811
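A sketch that reproduces this table using the interval formula above:

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p (z = Z_alpha/2)."""
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lo, hi = accuracy_interval(0.8, n)
    print(f"N={n:5d}: p in [{lo:.3f}, {hi:.3f}]")
# e.g. N=100 gives p in [0.711, 0.867], matching the table up to rounding
```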

Model Evaluation

• Metrics for Performance Evaluation
• Methods for Model Comparison
• Methods for Performance Evaluation

Methods of Evaluation

• Holdout method
  • The given data is randomly partitioned into two independent sets
    • Training set (e.g., 2/3) for model construction
    • Test set (e.g., 1/3) for accuracy estimation
  • Random sampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained
• Cross-validation (k-fold, where k = 10 is most popular; see the sketch below)
  • Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  • At the i-th iteration, use k − 1 subsets as the training set and the remaining one as the test set
  • Leave-one-out: k folds where k = # of tuples, for small-sized data
  • Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
• Bootstrapping: sampling with replacement
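A minimal k-fold cross-validation sketch; the `train_and_score` function is a hypothetical stand-in for model construction plus accuracy estimation:

```python
import random

def k_fold_cross_validation(records, k, train_and_score):
    """Average accuracy over k mutually exclusive folds."""
    records = records[:]
    random.shuffle(records)                     # random partition
    folds = [records[i::k] for i in range(k)]   # k roughly equal subsets
    scores = []
    for i in range(k):
        test = folds[i]                         # one fold held out for testing
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# Toy usage: "accuracy" of a majority-class predictor on 0/1 labels
data = [0] * 70 + [1] * 30
def train_and_score(train, test):
    majority = max(set(train), key=train.count)
    return sum(1 for y in test if y == majority) / len(test)

print(k_fold_cross_validation(data, k=10, train_and_score=train_and_score))
```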

Classification and Prediction

• Last lecture
  • Overview
  • Decision tree induction
  • Bayesian classification
• Today
  • Bayesian network learning
  • Model evaluation
  • kNN classification and collaborative filtering
  • Rule-based methods
• Upcoming lectures
  • Support Vector Machines (SVM)
  • Neural Networks
  • Regression
  • Ensemble methods

Lazy vs. Eager Learning

• Lazy vs. eager learning
  • Lazy learning (e.g., instance-based learning): stores training data (or does only minor processing) and waits until receiving test data
  • Eager learning (e.g., decision tree, Bayesian): constructs a classification model before receiving test data
• Efficiency
  • Lazy learning: less time in training but more in predicting
  • Eager learning: more time in training but less in predicting
• Accuracy
  • Lazy learning: effectively uses a richer hypothesis space, using many local linear functions to form its global approximation to the target function
  • Eager learning: must commit to a single hypothesis that covers the entire instance space

Lazy Learner: Instance-Based Methods

• Typical approaches
  • k-nearest neighbor approach: instances represented as points in a Euclidean space
  • Locally weighted regression: constructs a local approximation

Nearest Neighbor Classifiers

• Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck
• (Figure: compute the distance from the test record to the training records, then choose the k "nearest" records)

Nearest-Neighbor Classifiers

• Algorithm (for an unknown record):
  1. Compute the distance from the test record to the training records
  2. Identify the k nearest neighbors
  3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Nearest Neighbor Classification

• Compute the distance between two points, e.g., Euclidean distance:

  d(p, q) = sqrt( Σ_i (p_i − q_i)² )

• Determine the class from the nearest-neighbor list
  • Take the majority vote of class labels among the k nearest neighbors
  • Optionally weigh each vote according to distance, e.g., weight factor w = 1/d²
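A self-contained sketch of distance-weighted kNN classification as just described (the toy data are made up):

```python
import math
from collections import defaultdict

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; query: point. Distance-weighted vote."""
    def euclidean(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    neighbors = sorted(train, key=lambda pl: euclidean(pl[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        d = euclidean(point, query)
        votes[label] += 1.0 / (d * d) if d > 0 else float("inf")  # w = 1/d^2
    return max(votes, key=votes.get)

# Toy 2-D data: two classes separated along the diagonal
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((6, 5), "B"), ((7, 6), "B"), ((6, 6), "B")]
print(knn_classify(train, (2, 2), k=3))   # -> "A"
```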

Nearest Neighbor Classification

• Choosing the value of k:
  • If k is too small, the classifier is sensitive to noise points
  • If k is too large, the neighborhood may include points from other classes

Nearest Neighbor Classification

• Scaling issues
  • Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
  • Example:
    • height of a person may vary from 1.5m to 1.8m
    • weight of a person may vary from 90lb to 300lb
    • income of a person may vary from $10K to $1M
  • Solution? Normalize the attributes (see the sketch below)
• Real-valued prediction for a given unknown tuple: return the mean value of the k nearest neighbors
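A small sketch of min-max normalization, one common way to put attributes on a comparable scale before computing distances; the attribute triple follows the example above, and the choice of min-max scaling is an assumption:

```python
def min_max_scale(rows):
    """Rescale each attribute (column) to [0, 1] via min-max normalization."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)) for row in rows]

# (height m, weight lb, income $): income would dominate a raw Euclidean distance
people = [(1.5, 90, 10_000), (1.8, 300, 1_000_000), (1.7, 160, 50_000)]
for row in min_max_scale(people):
    print(row)
```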

Collaborative filtering: kNN in action

Customers who bought this book also bought:
• Data Preparation for Data Mining, by Dorian Pyle
• The Elements of Statistical Learning, by T. Hastie et al.
• Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham
• Mining the Web: Analysis of Hypertext and Semi-Structured Data

Basic Approaches for Recommendation

• Collaborative Filtering (CF)
  • Look at users' collective behavior
  • Look at the active user's history
  • Combine!
• Content-based Filtering
  • Recommend items based on keywords
  • More appropriate for information retrieval

Collaborative Filtering for Recommendation

• Each user has a profile
• Users rate items
  • Explicitly: a score from 1..5
  • Implicitly: web usage mining
    • Time spent viewing the item
    • Navigation path
    • Etc.
• The system does the rest. How? Collaborative filtering (based on kNN!)

Collaborative Filtering: A Framework

(Figure: a user-item rating matrix. Rows are users u1, u2, …, ui, …, um; columns are items i1, i2, …, ij, …, in; cells hold known ratings such as 3, 1.5, 2, with the active user's rating r_ij unknown.)

• Unknown function f: U × I → R
• The task:
  • Q1: Find the unknown ratings
  • Q2: Which items should we recommend to this user?

Collaborative Filtering

• User-User Methods
  • Identify like-minded users
  • Memory-based: k-NN
  • Model-based: clustering
• Item-Item Methods
  • Identify buying patterns
  • Correlation analysis
  • Linear regression
  • Belief networks
  • Association rule mining

User-User Similarity: Intuition

(Figure: the target customer's ratings compared with those of other users)

Q1: How to measure similarity?
Q2: How to select neighbors?
Q3: How to combine?

How to Measure Similarity?

• Pearson correlation coefficient, computed over the items j commonly rated by users a and i:

  w_p(a, i) = Σ_j (r_{a,j} − r̄_a)(r_{i,j} − r̄_i) / sqrt( Σ_j (r_{a,j} − r̄_a)² × Σ_j (r_{i,j} − r̄_i)² )

• Cosine measure: users are vectors in product-dimension space

  w_c(a, i) = (r_a · r_i) / (‖r_a‖ ‖r_i‖)
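A sketch of both similarity measures over two users' rating dictionaries (item → rating); only commonly rated items enter the Pearson computation, as in the formula above, and the sample ratings are made up:

```python
import math

def pearson(ra: dict, ri: dict) -> float:
    """Pearson correlation over commonly rated items."""
    common = ra.keys() & ri.keys()
    ma = sum(ra[j] for j in common) / len(common)   # means over the common items
    mi = sum(ri[j] for j in common) / len(common)
    num = sum((ra[j] - ma) * (ri[j] - mi) for j in common)
    den = math.sqrt(sum((ra[j] - ma) ** 2 for j in common) *
                    sum((ri[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

def cosine(ra: dict, ri: dict) -> float:
    """Cosine similarity, treating users as vectors in item space."""
    common = ra.keys() & ri.keys()
    dot = sum(ra[j] * ri[j] for j in common)
    return dot / (math.sqrt(sum(v * v for v in ra.values())) *
                  math.sqrt(sum(v * v for v in ri.values())))

alice = {"i1": 5, "i2": 3, "i3": 4}
bob = {"i1": 4, "i2": 2, "i3": 5, "i4": 1}
print(pearson(alice, bob), cosine(alice, bob))
```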

Nearest Neighbor Approaches [SAR00a]

• Offline phase: do nothing… just store the transactions
• Online phase:
  • Identify users highly similar to the active one
    • The best k ones, or
    • All with a similarity measure greater than a threshold
• Prediction: user a's estimated rating for item j is a's neutral (mean) rating plus a's estimated deviation, a weighted average of each neighbor i's deviation on item j:

  p_{a,j} = r̄_a + Σ_i w(a, i) (r_{i,j} − r̄_i) / Σ_i |w(a, i)|
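A sketch of the prediction step; it assumes the `pearson` function from the previous sketch is in scope, and the rating data are made up:

```python
def predict_rating(active: dict, others: list, item: str, k: int = 2) -> float:
    """Predict the active user's rating for item via the k most similar users."""
    mean_a = sum(active.values()) / len(active)          # user a's neutral rating
    # Neighbors who rated the item, ranked by similarity to the active user
    rated = [u for u in others if item in u]
    neighbors = sorted(rated, key=lambda u: pearson(active, u), reverse=True)[:k]
    num = sum(pearson(active, u) * (u[item] - sum(u.values()) / len(u))
              for u in neighbors)                        # weighted deviations
    den = sum(abs(pearson(active, u)) for u in neighbors)
    return mean_a + num / den if den else mean_a

alice = {"i1": 5, "i2": 3, "i3": 4}
others = [{"i1": 4, "i2": 2, "i3": 5, "i4": 1},
          {"i1": 2, "i2": 5, "i3": 1, "i4": 4}]
print(predict_rating(alice, others, "i4"))
```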

Classification and Prediction

• Last lecture
  • Overview
  • Decision tree induction
  • Bayesian classification
• Today
  • Bayesian network learning
  • Model evaluation
  • kNN classification and collaborative filtering
  • Rule-based methods
• Upcoming lectures
  • Support Vector Machines (SVM)
  • Neural Networks
  • Regression
  • Ensemble methods

Rule-Based Classifier

• Classify records by using a collection of IF-THEN rules
• Basic concepts
  • IF (Condition) THEN y, written (Condition) → y
  • LHS: rule antecedent or condition
  • RHS: rule consequent
  • E.g., IF age = youth AND student = yes THEN buys_computer = yes
• Using the rules
• Learning the rules

Rule-based Classifier: Example

  Name           Blood Type  Give Birth  Can Fly  Live in Water  Class
  human          warm        yes         no       no             mammals
  python         cold        no          no       no             reptiles
  salmon         cold        no          no       yes            fishes
  whale          warm        yes         no       yes            mammals
  frog           cold        no          no       sometimes      amphibians
  komodo         cold        no          no       no             reptiles
  bat            warm        yes         yes      no             mammals
  pigeon         warm        no          yes      no             birds
  cat            warm        yes         no       no             mammals
  leopard shark  cold        yes         no       yes            fishes
  turtle         cold        no          no       sometimes      reptiles
  penguin        warm        no          no       sometimes      birds
  porcupine      warm        yes         no       no             mammals
  eel            cold        no          no       yes            fishes
  salamander     cold        no          no       sometimes      amphibians
  gila monster   cold        no          no       no             reptiles
  platypus       warm        no          no       no             mammals
  owl            warm        no          yes      no             birds
  dolphin        warm        yes         no       yes            mammals
  eagle          warm        no          yes      no             birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals

Assessment of a Rule

• Coverage of a rule: fraction of records that satisfy the antecedent of the rule
  • coverage(R) = n_covers / |D|, where n_covers = # of tuples covered by R and D is the training data set
• Accuracy of a rule: fraction of covered records that satisfy both the antecedent and the consequent of the rule
  • accuracy(R) = n_correct / n_covers, where n_correct = # of tuples correctly classified by R
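A sketch computing coverage and accuracy for rule R1 over a slice of the animal table above; representing records as dictionaries is an assumption for illustration:

```python
def rule_stats(records, antecedent, consequent_class):
    """coverage(R) = n_covers/|D|; accuracy(R) = n_correct/n_covers."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if r["Class"] == consequent_class]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

animals = [
    {"Name": "bat", "Give Birth": "yes", "Can Fly": "yes", "Class": "mammals"},
    {"Name": "pigeon", "Give Birth": "no", "Can Fly": "yes", "Class": "birds"},
    {"Name": "owl", "Give Birth": "no", "Can Fly": "yes", "Class": "birds"},
    {"Name": "eagle", "Give Birth": "no", "Can Fly": "yes", "Class": "birds"},
    {"Name": "human", "Give Birth": "yes", "Can Fly": "no", "Class": "mammals"},
]
# R1: (Give Birth = no) AND (Can Fly = yes) -> Birds
r1 = lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes"
print(rule_stats(animals, r1, "birds"))   # coverage 3/5 = 0.6, accuracy 3/3 = 1.0
```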

Characteristics of Rule-Based Classifier

• Mutually exclusive rules
  • The classifier contains mutually exclusive rules if the rules are independent of each other
  • Every record is covered by at most one rule
• Exhaustive rules
  • The classifier has exhaustive coverage if it accounts for every possible combination of attribute values
  • Every record is covered by at least one rule

Using the Rules

• Rules that are mutually exclusive and exhaustive
• Rules that are not mutually exclusive
  • A record may trigger more than one rule
  • Solution? Conflict resolution:
    • Rule ordering
    • Unordered rule set: use voting schemes
• Rules that are not exhaustive
  • A record may not trigger any rule
  • Solution? Use a default class

Rule Ordering

• Rule-based ordering
  • Individual rules are ranked based on their quality
  • The rule set is known as a decision list
• Class-based ordering
  • Classes are sorted in order of decreasing importance
  • Rules are sorted by class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

  Name    Blood Type  Give Birth  Can Fly  Live in Water  Class
  turtle  cold        no          no       sometimes      ?
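A decision-list sketch for the turtle example: rules are tried in rank order, the first match fires, and a default class handles non-exhaustive rule sets (the rule encoding is illustrative):

```python
# Ordered rule list (decision list): (antecedent, class) pairs, first match wins
rules = [
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "yes", "birds"),         # R1
    (lambda r: r["Give Birth"] == "no" and r["Live in Water"] == "yes", "fishes"),  # R2
    (lambda r: r["Give Birth"] == "yes" and r["Blood Type"] == "warm", "mammals"),  # R3
    (lambda r: r["Give Birth"] == "no" and r["Can Fly"] == "no", "reptiles"),       # R4
    (lambda r: r["Live in Water"] == "sometimes", "amphibians"),                    # R5
]

def classify(record, rules, default="unknown"):
    for antecedent, label in rules:
        if antecedent(record):
            return label          # conflict resolution: highest-ranked rule wins
    return default                # non-exhaustive rules: fall back to default class

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no",
          "Live in Water": "sometimes"}
print(classify(turtle, rules))    # turtle matches both R4 and R5; R4 wins -> "reptiles"
```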

Building Classification Rules

• Indirect method: extract rules from other classification models
  • Decision trees, e.g., C4.5 rules
• Direct method: extract rules directly from data
  • Sequential covering, e.g., CN2, RIPPER
  • Associative classification

Rule Extraction from a Decision Tree

• One rule is created for each path from the root to a leaf; each attribute-value pair forms a conjunct, and the leaf holds the class prediction
• Rules are mutually exclusive and exhaustive
• Pruning (C4.5): class-based ordering
• Example: rule extraction from our buys_computer decision tree

  (Figure: root age? with branches <=30, 31..40, >40; the <=30 branch tests student? (no → no, yes → yes); the 31..40 branch predicts yes; the >40 branch tests credit rating? (excellent → yes, fair → no))

  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = yes
  IF age = old AND credit_rating = fair THEN buys_computer = no

Direct Method: Sequential Covering

1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove the training records covered by the rule
4. Repeat steps (2) and (3) until a stopping criterion is met
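A skeleton of this loop; `learn_one_rule` here is a greedy stub that picks only the single best attribute test, standing in for the full Learn-One-Rule procedure described on a later slide:

```python
def learn_one_rule(records, target_class):
    """Greedy stub: pick the single (attribute, value) test with the best accuracy."""
    best, best_acc = None, -1.0
    for attr in records[0]:
        if attr == "Class":
            continue
        for value in {r[attr] for r in records}:
            covered = [r for r in records if r[attr] == value]
            acc = sum(r["Class"] == target_class for r in covered) / len(covered)
            if acc > best_acc:
                best, best_acc = (attr, value), acc
    return best

def sequential_covering(records, target_class):
    """Steps 1-4: repeatedly grow a rule, then remove the records it covers."""
    rules = []
    while any(r["Class"] == target_class for r in records):   # stopping criterion
        rule = learn_one_rule(records, target_class)          # step 2: grow a rule
        rules.append((rule, target_class))
        attr, value = rule
        records = [r for r in records if r[attr] != value]    # step 3: remove covered
    return rules
```

In CN2 or RIPPER, Learn-One-Rule instead grows a conjunction of tests greedily, guided by a quality measure such as FOIL gain (next slides).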

Example of Sequential Covering

(Figure: (ii) Step 1: rule R1 is grown around a region of positive examples; (iii) Step 2: the records covered by R1 are removed; (iv) Step 3: a second rule R2 is grown on the remaining records)

Rule Growing

• Two common strategies: general-to-specific (start with an empty rule and add conjuncts) and specific-to-general (start from a specific rule and drop conjuncts)

Learn-One-Rule

• Start with the most general rule possible: condition = empty
• Add new attribute tests by adopting a greedy depth-first strategy
  • Pick the test that most improves the rule quality
• Rule-quality measures consider both coverage and accuracy
  • FOIL gain (in FOIL and RIPPER) assesses the information gained by extending the condition, where (pos, neg) and (pos', neg') are the # of positive/negative tuples covered before and after the extension:

  FOIL_Gain = pos' × ( log₂( pos' / (pos' + neg') ) − log₂( pos / (pos + neg) ) )

  It favors rules that have high accuracy and cover many positive tuples.

• Rule pruning is based on an independent set of test tuples. With pos/neg the # of positive/negative tuples covered by R:

  FOIL_Prune(R) = (pos − neg) / (pos + neg)

  If FOIL_Prune is higher for the pruned version of R, prune R.
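The two measures as straightforward functions, evaluated on hypothetical counts:

```python
import math

def foil_gain(pos, neg, pos2, neg2):
    """Info gained by extending a rule: (pos, neg) covered before, (pos2, neg2) after."""
    return pos2 * (math.log2(pos2 / (pos2 + neg2)) - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """Pruning criterion, computed on an independent pruning set."""
    return (pos - neg) / (pos + neg)

# Extending a rule shrinks its cover from (100 pos, 80 neg) to (60 pos, 10 neg)
print(foil_gain(100, 80, 60, 10))             # > 0: the extension helps
# If the pruned version of R scores higher, prune R
print(foil_prune(60, 10), foil_prune(58, 4))  # 0.714 vs. 0.871 -> prune
```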

Direct Method: Multi-Class

• For a 2-class problem, choose one of the classes as the positive class and the other as the negative class
  • Learn rules for the positive class
  • The negative class will be the default class
• For a multi-class problem
  • Order the classes according to increasing class prevalence (fraction of instances that belong to a particular class)
  • Learn the rule set for the smallest class first, treating the rest as the negative class
  • Repeat with the next smallest class as the positive class

Associative Classification

• Associative classification
  • Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  • Classification: based on evaluating a set of rules of the form p1 ∧ p2 ∧ … ∧ pl → "Aclass = C" (confidence, support)
• Why effective?
  • It explores highly confident associations among multiple attributes, which may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
  • In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5

Rule-Based Classifiers: Comments

• As highly expressive as decision trees
• Easy to interpret
• Easy to generate
• Can classify new instances rapidly
• Performance comparable to decision trees

Classification and Prediction

• Last lecture
  • Overview
  • Decision tree induction
  • Bayesian classification
• Today
  • Bayesian network learning
  • Model evaluation
  • kNN classification and collaborative filtering
  • Rule-based methods
• Upcoming lectures
  • Support Vector Machines (SVM)
  • Neural Networks
  • Regression
  • Ensemble methods
