Classification
Data Mining: Concepts and Techniques, Chapter 6
© Jiawei Han and Micheline Kamber, http://www-sal.cs.uiuc.edu/~hanj/bk2/
Modified by Donghui Zhang; integrated with slides from Prof. Andrew W. Moore, http://www.cs.cmu.edu/~awm/tutorials


Page 1:

Classification © Jiawei Han and Micheline Kamber

http://www-sal.cs.uiuc.edu/~hanj/bk2/

Chp 6

modified by Donghui Zhang

Integrated with slides from Prof. Andrew W. Moore

http://www.cs.cmu.edu/~awm/tutorials

Page 2:

Content

What is classification?
Decision tree
Naïve Bayesian classifier
Bayesian networks
Neural networks
Support vector machines (SVM)

Page 3:

Classification vs. Prediction

Classification: models categorical class labels (discrete or nominal), e.g. given a new customer, does she belong to the "likely to buy a computer" class?

Prediction: models continuous-valued functions, e.g. how many computers will a customer buy?

Typical applications: credit approval, target marketing, medical diagnosis, treatment effectiveness analysis.

Page 4:

Classification—A Two-Step Process

Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute.
The set of tuples used for model construction is the training set.
The model is represented as classification rules, decision trees, or mathematical formulae.

Model usage: classifying future or unknown objects
Estimate the accuracy of the model: the known label of each test sample is compared with the classified result from the model; the accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set is independent of the training set, otherwise over-fitting will occur.
If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known.

Page 5:

Classification Process (1): Model Construction

Training Data

NAME   RANK            YEARS  TENURED
Mike   Assistant Prof    3    no
Mary   Assistant Prof    7    yes
Bill   Professor         2    yes
Jim    Associate Prof    7    yes
Dave   Assistant Prof    6    no
Anne   Associate Prof    3    no

Classification Algorithms

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classifier (Model)

Page 6:

Classification Process (2): Use the Model in Prediction

Classifier

Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof    2    no
Merlisa  Associate Prof    7    no
George   Professor         5    yes
Joseph   Assistant Prof    7    yes

Unseen Data

(Jeff, Professor, 4)

Tenured?
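To make the two-step process concrete, here is a minimal Python sketch (mine, not from the original slides) that hard-codes the rule induced in the construction step, measures its accuracy on the test set, and then classifies the unseen tuple (Jeff, Professor, 4).

def tenured(rank, years):
    # The rule from the model-construction step.
    return "yes" if rank == "Professor" or years > 6 else "no"

# Test set from the slide: (name, rank, years, actual label)
test_set = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(tenured(r, y) == label for _, r, y, label in test_set)
print("accuracy:", correct / len(test_set))   # 3 of 4 correct -> 0.75 (Merlisa is misclassified)
print("Jeff:", tenured("Professor", 4))       # unseen tuple -> 'yes'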

Page 7:

Supervised vs. Unsupervised Learning

Supervised learning (classification)
Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
New data is classified based on the training set.

Unsupervised learning (clustering)
The class labels of the training data are unknown.
Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data.

Page 8:

Evaluating Classification Methods

Predictive accuracy
Speed and scalability: time to construct the model, time to use the model
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability: understanding and insight provided by the model
Goodness of rules: decision tree size, compactness of classification rules

Page 9:

Content

What is classification?
Decision tree
Naïve Bayesian classifier
Bayesian networks
Neural networks
Support vector machines (SVM)

Page 10:

Training Dataset

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

This follows an example from Quinlan’s ID3

Page 11:

Output: A Decision Tree for “buys_computer”

age?
  <=30: student?  (no -> no, yes -> yes)
  31...40: yes
  >40: credit rating?  (excellent -> no, fair -> yes)

Page 12:

Extracting Classification Rules from Trees

Represent the knowledge in the form of IF-THEN rules.
One rule is created for each path from the root to a leaf.
Each attribute-value pair along a path forms a conjunction.
The leaf node holds the class prediction.
Rules are easier for humans to understand.

Example:

IF age = "<=30" AND student = "no" THEN buys_computer = "no"

IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"

IF age = "31...40" THEN buys_computer = "yes"

IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"

IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
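A small sketch (mine, not from the slides) of the same rules as a Python function, with the >40 branch following the tree above:

def buys_computer(age, student, credit_rating):
    # One IF-THEN rule per root-to-leaf path of the decision tree.
    if age == "<=30":
        return "yes" if student == "yes" else "no"
    if age == "31...40":
        return "yes"
    # age == ">40": the decision depends on credit rating
    return "no" if credit_rating == "excellent" else "yes"

print(buys_computer("<=30", "no", "fair"))       # no
print(buys_computer(">40", "yes", "excellent"))  # no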

Page 13:

Algorithm for Decision Tree Induction

Basic algorithm (a greedy algorithm):
The tree is constructed in a top-down, recursive, divide-and-conquer manner.
At the start, all the training examples are at the root.
Attributes are categorical (if continuous-valued, they are discretized in advance).
Examples are partitioned recursively based on selected attributes.
Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping partitioning:
All samples for a given node belong to the same class.
There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
There are no samples left.

Page 14:

Information gain slides adapted from:

Andrew W. Moore, Associate Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Page 15:

Bits

You are watching a set of independent random samples of X.

You see that X has four possible values:
P(X=A) = 1/4, P(X=B) = 1/4, P(X=C) = 1/4, P(X=D) = 1/4

So you might see: BAACBADCDADDDA...

You transmit data over a binary serial link. You can encode each reading with two bits (e.g. A = 00, B = 01, C = 10, D = 11):

0100001001001110110011111100...

Page 16:

Fewer Bits

Someone tells you that the probabilities are not equal

It’s possible…

…to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?

P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8

Page 17:

Fewer Bits

Someone tells you that the probabilities are not equal

It’s possible……to invent a coding for your transmission that only uses 1.75 bits on average per symbol. How?

(This is just one of several ways)

P(X=A) = 1/2 P(X=B) = 1/4 P(X=C) = 1/8 P(X=D) = 1/8

A 0

B 10

C 110

D 111
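As a quick check, the expected code length under this coding is 1/2 x 1 + 1/4 x 2 + 1/8 x 3 + 1/8 x 3 = 1.75 bits per symbol, which (as the next slides show) is exactly the entropy of this distribution.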

Page 18:

Fewer Bits

Suppose there are three equally likely values…

Here’s a naïve coding, costing 2 bits per symbol

Can you think of a coding that would need only 1.6 bits per symbol on average?

In theory, it can in fact be done with 1.58496 bits per symbol.

P(X=A) = 1/3 P(X=B) = 1/3 P(X=C) = 1/3

A 00

B 01

C 10

Page 19:

General Case

Suppose X can have one of m values: V1, V2, ..., Vm, with
P(X=V1) = p1, P(X=V2) = p2, ..., P(X=Vm) = pm.

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is H(X), the entropy of X:

H(X) = -p1 log2(p1) - p2 log2(p2) - ... - pm log2(pm) = -Σj pj log2(pj)

"High entropy" means X is from a uniform (boring) distribution.
"Low entropy" means X is from a varied (peaks and valleys) distribution.
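A tiny Python helper (mine, not part of the slides) that evaluates H(X) for the distributions used in this section:

from math import log2

def entropy(probs):
    # H(X) = -sum_j p_j * log2(p_j); terms with p_j = 0 are taken as 0.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits (four equally likely values)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits (the skewed example)
print(entropy([1/3, 1/3, 1/3]))            # ~1.58496 bits (three equally likely values)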

Page 20:

General Case

Suppose X can have one of m values: V1, V2, ..., Vm, with
P(X=V1) = p1, P(X=V2) = p2, ..., P(X=Vm) = pm.

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is H(X), the entropy of X:

H(X) = -p1 log2(p1) - p2 log2(p2) - ... - pm log2(pm) = -Σj pj log2(pj)

"High entropy" means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat.
"Low entropy" means X is from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs.

Page 21:

General Case

Suppose X can have one of m values: V1, V2, ..., Vm, with
P(X=V1) = p1, P(X=V2) = p2, ..., P(X=Vm) = pm.

What's the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution? It is H(X), the entropy of X:

H(X) = -p1 log2(p1) - p2 log2(p2) - ... - pm log2(pm) = -Σj pj log2(pj)

"High entropy" means X is from a uniform (boring) distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
"Low entropy" means X is from a varied (peaks and valleys) distribution: a histogram would have many lows and one or two highs, and so the values sampled from it would be more predictable.

Page 22:

Entropy in a nut-shell

Low Entropy High Entropy

Page 23:

Entropy in a nut-shell

Low Entropy: the values (locations of soup) are sampled entirely from within the soup bowl.
High Entropy: the values (locations of soup) are unpredictable, almost uniformly sampled throughout our dining room.

Page 24:

Exercise:

Suppose 100 customers have two classes: “Buy Computer” and “Not Buy Computer”.

Uniform distribution: 50 buy. Entropy? Skewed distribution: 100 buy. Entropy?

Page 25:

Specific Conditional Entropy

Suppose I'm trying to predict output Y and I have input X.

X = College Major
Y = Likes "Gladiator"

Let's assume this data reflects the true probabilities. E.g., from this data we estimate:

• P(LikeG = Yes) = 0.5
• P(Major = Math & LikeG = No) = 0.25
• P(Major = Math) = 0.5
• P(LikeG = Yes | Major = History) = 0

Note:
• H(X) = 1.5
• H(Y) = 1

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 26:

Specific Conditional Entropy

Definition of Specific Conditional Entropy:

H(Y|X=v) = The entropy of Y among only those records in which X has value v

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 27:

Specific Conditional Entropy

Definition of Conditional Entropy:

H(Y|X=v) = The entropy of Y among only those records in which X has value v

Example:

• H(Y|X=Math) = 1

• H(Y|X=History) = 0

• H(Y|X=CS) = 0

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 28:

Conditional Entropy

Definition of Conditional Entropy:

H(Y|X) = The average conditional entropy of Y

= if you choose a record at random what will be the conditional entropy of Y, conditioned on that row’s value of X

= Expected number of bits to transmit Y if both sides will know the value of X

= ΣjProb(X=vj) H(Y | X = vj)

X = College Major

Y = Likes “Gladiator”

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 29:

Conditional Entropy

Definition of general Conditional Entropy:

H(Y|X) = The average conditional entropy of Y

= ΣjProb(X=vj) H(Y | X = vj)

X = College Major

Y = Likes “Gladiator”

Example:

vj Prob(X=vj) H(Y | X = vj)

Math 0.5 1

History 0.25 0

CS 0.25 0

H(Y|X) = 0.5 * 1 + 0.25 * 0 + 0.25 * 0 = 0.5

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes

Page 30:

Information Gain

Definition of Information Gain:

IG(Y|X) = I must transmit Y. How many bits on average would it save me if both ends of the line knew X?

IG(Y|X) = H(Y) - H(Y | X)

X = College Major

Y = Likes “Gladiator”

Example:

• H(Y) = 1

• H(Y|X) = 0.5

• Thus IG(Y|X) = 1 – 0.5 = 0.5

X Y

Math Yes

History No

CS Yes

Math No

Math No

CS Yes

History No

Math Yes
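The same numbers can be reproduced with a short script (a sketch of mine, reusing the Major / Likes-"Gladiator" table above):

from collections import Counter
from math import log2

def H(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def H_cond(xs, ys):
    # H(Y|X) = sum_v P(X=v) * H(Y | X=v)
    n, total = len(xs), 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]
        total += (len(ys_v) / n) * H(ys_v)
    return total

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

print(H(Y))                 # 1.0
print(H_cond(X, Y))         # 0.5
print(H(Y) - H_cond(X, Y))  # IG(Y|X) = 0.5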

Page 31:

What is Information Gain used for?

Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find…

•IG(LongLife | HairColor) = 0.01

•IG(LongLife | Smoker) = 0.2

•IG(LongLife | Gender) = 0.25

•IG(LongLife | LastDigitOfSSN) = 0.00001

IG tells you how interesting a 2-d contingency table is going to be.

Page 32:

Conditional entropy H(C|age)

H(C|age<=30) = 2/5 * lg(5/2) + 3/5 * lg(5/3) = 0.971
H(C|age in 31...40) = 1 * lg(1) + 0 * lg(1/0) = 0
H(C|age>40) = 3/5 * lg(5/3) + 2/5 * lg(5/2) = 0.971

age      buy  no
<=30      2    3
31...40   4    0
>40       3    2

H(C|age) = 5/14 * H(C|age<=30) + 4/14 * H(C|age in 31...40) + 5/14 * H(C|age>40) = 0.694

Page 33:

Select the attribute with lowest conditional entropy

H(C|age) = 0.694
H(C|income) = 0.911
H(C|student) = 0.789
H(C|credit_rating) = 0.892

Select "age" to be the tree root!

age?
  <=30: student?  (no -> no, yes -> yes)
  31...40: yes
  >40: credit rating?  (excellent -> no, fair -> yes)
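Here is a sketch (mine, not from the slides) that recomputes the four conditional entropies on the 14-tuple training set and confirms that age is chosen as the root:

from collections import Counter
from math import log2

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]
attrs = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}

def H(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def H_cond(attr):
    # H(C|attr) = sum over values v of P(attr=v) * H(C | attr=v)
    i, n = attrs[attr], len(rows)
    total = 0.0
    for v in {r[i] for r in rows}:
        subset = [r[-1] for r in rows if r[i] == v]
        total += len(subset) / n * H(subset)
    return total

for a in attrs:
    print(a, round(H_cond(a), 3))        # age 0.694, income 0.911, student 0.789, credit_rating 0.892
print("root:", min(attrs, key=H_cond))   # age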

Page 34:

Goodness in Decision Tree Induction

relatively faster learning speed (than other classification methods)

convertible to simple and easy to understand classification rules

can use SQL queries for accessing databases

comparable classification accuracy with other methods

Page 35:

Scalable Decision Tree Induction Methods in Data Mining Studies

SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.

SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure.

PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping growth of the tree earlier.

RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label).

Page 36:

Visualization of a Decision Tree in SGI/MineSet 3.0

Page 37:

Content

What is classification?
Decision tree
Naïve Bayesian classifier
Bayesian networks
Neural networks
Support vector machines (SVM)

Page 38:

Bayesian Classification: Why?

Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical approaches to certain types of learning problems

Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities

Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

Page 39:

Bayesian Classification

X: a data sample whose class label is unknown, e.g. X = (Income=medium, Credit_rating=Fair, Age=40).

Hi: the hypothesis that a record belongs to class Ci, e.g. Hi = "the record belongs to the buy-computer class".

P(Hi), P(X): probabilities.

P(Hi|X): a conditional probability: among all records with medium income and fair credit rating, what is the probability of buying a computer?

This is what we need for classification! Given X, P(Hi|X) tells us the probability that it belongs to each class Ci.

What if we need to determine a single class for X?

Page 40:

Bayesian Theorem

Another concept, P(X|Hi): the probability of observing the sample X given that the hypothesis holds, e.g. among all people who buy a computer, what percentage has the same attribute values as X.

We know P(X ^ Hi) = P(Hi|X) P(X) = P(X|Hi) P(Hi), so

P(Hi|X) = P(X|Hi) P(Hi) / P(X)

We should assign X to the class Ci whose P(Hi|X) is maximized; since P(X) is the same for every class, this is equivalent to maximizing P(X|Hi) P(Hi).

Page 41:

Basic Idea

Read the training data:
Compute P(Hi) for each class.
Compute P(Xk|Hi) for each distinct instance Xk of X among records in class Ci.

To predict the class for a new data sample X:
For each class, compute P(X|Hi) P(Hi).
Return the class which has the largest value.

Any problem?

Too many combinations of Xk: e.g. 50 ages, 50 credit ratings, and 50 income levels give 125,000 combinations of Xk!

Page 42:

Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class.

The probability of observing, say, two attribute values x1 and x2 together, given the current class Ci, is the product of the probabilities of each value taken separately, given the same class:

P([x1, x2] | Ci) = P(x1 | Ci) * P(x2 | Ci)

No dependence relation between attributes; this greatly reduces the number of probabilities to maintain. In general:

P(X|Ci) = Πk P(xk|Ci),  k = 1..n

Page 43:

Sample quiz questions

age      income  student  credit_rating  buys_computer
<=30     high    no       fair           no
<=30     high    no       excellent      no
31...40  high    no       fair           yes
>40      medium  no       fair           yes
>40      low     yes      fair           yes
>40      low     yes      excellent      no
31...40  low     yes      excellent      yes
<=30     medium  no       fair           no
<=30     low     yes      fair           yes
>40      medium  yes      fair           yes
<=30     medium  yes      excellent      yes
31...40  medium  no       excellent      yes
31...40  high    yes      fair           yes
>40      medium  no       excellent      no

1. What data does the naïve Bayesian classifier maintain?

2. Given X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair): buy or not buy?

Page 44:

Naïve Bayesian Classifier: Example

Compute P(X|Ci) for each class:

P(age="<=30" | buys_computer="yes") = 2/9 = 0.222
P(age="<=30" | buys_computer="no") = 3/5 = 0.6
P(income="medium" | buys_computer="yes") = 4/9 = 0.444
P(income="medium" | buys_computer="no") = 2/5 = 0.4
P(student="yes" | buys_computer="yes") = 6/9 = 0.667
P(student="yes" | buys_computer="no") = 1/5 = 0.2
P(credit_rating="fair" | buys_computer="yes") = 6/9 = 0.667
P(credit_rating="fair" | buys_computer="no") = 2/5 = 0.4

X = (age<=30, income=medium, student=yes, credit_rating=fair)

P(X|Ci):
P(X | buys_computer="yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer="no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
P(X | buys_computer="yes") * P(buys_computer="yes") = 0.044 x 9/14 = 0.028
P(X | buys_computer="no") * P(buys_computer="no") = 0.019 x 5/14 = 0.007

X belongs to class "buys_computer=yes". Pitfall: don't forget P(Ci)!
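The same computation as a Python sketch (mine, not the book's code), with no smoothing, using the 14-tuple training set:

rows = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
    ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
    (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
    ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
    ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
    ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
    ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no"),
]

def score(x, c):
    # P(X|C=c) * P(C=c) under the conditional-independence assumption (no smoothing).
    in_class = [r for r in rows if r[-1] == c]
    p = len(in_class) / len(rows)                                    # P(C=c)
    for i, value in enumerate(x):
        p *= sum(r[i] == value for r in in_class) / len(in_class)    # P(x_i | C=c)
    return p

x = ("<=30", "medium", "yes", "fair")
for c in ("yes", "no"):
    print(c, round(score(x, c), 4))   # yes ~0.0282, no ~0.0069 -> predict buys_computer = yes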

Page 45:

Naïve Bayesian Classifier: Comments

Advantages:
Easy to implement.
Good results obtained in most of the cases.

Disadvantages:
Assumption of class-conditional independence, therefore loss of accuracy.
Practically, dependencies exist among variables. E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.).
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier.

How to deal with these dependencies? Bayesian Belief Networks.

Page 46:

Content

What is classification?
Decision tree
Naïve Bayesian classifier
Bayesian networks
Neural networks
Support vector machines (SVM)

Page 47:

Bayesian Networks slides adapted from:

Andrew W. Moore, Associate Professor
School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm
[email protected]
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Page 48:

What we’ll discuss

Recall the numerous and dramatic benefits of Joint Distributions for describing uncertain worlds

Reel with terror at the problem with using Joint Distributions

Discover how Bayes Net methodology allows us to build Joint Distributions in manageable chunks

Discover there’s still a lurking problem… …Start to solve that problem

Page 49:

Why this matters

In Andrew’s opinion, the most important technology in the Machine Learning / AI field to have emerged in the last 10 years.

A clean, clear, manageable language and methodology for expressing what you’re certain and uncertain about

Already, many practical applications in medicine, factories, helpdesks:

Inference: P(this problem | these symptoms)
Anomaly Detection: anomalousness of this observation
Active Data Collection: choosing the next diagnostic test | these observations

Page 50:

Ways to deal with Uncertainty

Three-valued logic: True / False / Maybe
Fuzzy logic (truth values between 0 and 1)
Non-monotonic reasoning (especially focused on Penguin informatics)
Dempster-Shafer theory (and an extension known as quasi-Bayesian theory)
Possibilistic logic
Probability

Page 51:

Discrete Random Variables

A is a Boolean-valued random variable if A denotes an event, and there is some degree of uncertainty as to whether A occurs.

Examples:
A = The US president in 2023 will be male
A = You wake up tomorrow with a headache
A = You have Ebola

Page 52:

Probabilities

We write P(A) as “the fraction of possible worlds in which A is true”

We could at this point spend 2 hours on the philosophy of this.

But we won’t.

Page 53:

Visualizing A

[Diagram] Event space of all possible worlds; its area is 1. Some worlds have A true, the rest have A false.

P(A) = area of the region (reddish oval) in which A is true.

Page 54:

Interpreting the axioms

0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B)

The area of A can’t get any smaller than 0

And a zero area would mean no world could ever have A true

Page 55:

Interpreting the axioms

0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B)

The area of A can’t get any bigger than 1

And an area of 1 would mean all worlds will have A true

Page 56:

Interpreting the axioms

0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B)

A

B

Page 57:

Interpreting the axioms

0 <= P(A) <= 1 P(True) = 1 P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B)

A

B

P(A or B)

P(A and B)

Simple addition and subtraction

Page 58:

These Axioms are Not to be Trifled With

There have been attempts to do different methodologies for uncertainty

Fuzzy Logic Three-valued logic Dempster-Shafer Non-monotonic reasoning

But the axioms of probability are the only system with this property:

If you gamble using them you can't be unfairly exploited by an opponent using some other system [de Finetti 1931]

Page 59:

Theorems from the Axioms

0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B)

From these we can prove: P(not A) = P(~A) = 1 - P(A)

How?

Page 60:

Another important theorem

0 <= P(A) <= 1, P(True) = 1, P(False) = 0 P(A or B) = P(A) + P(B) - P(A and B)

From these we can prove: P(A) = P(A ^ B) + P(A ^ ~B)

How?

Page 61:

Conditional Probability

P(A|B) = Fraction of worlds in which B is true that also have A true

H = "Have a headache"
F = "Coming down with Flu"

P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

“Headaches are rare and flu is rarer, but if you’re coming down with ‘flu there’s a 50-50 chance you’ll have a headache.”

Page 62:

Conditional Probability

H = "Have a headache"
F = "Coming down with Flu"

P(H) = 1/10
P(F) = 1/40
P(H|F) = 1/2

P(H|F) = fraction of flu-inflicted worlds in which you have a headache
       = (#worlds with flu and headache) / (#worlds with flu)
       = (area of "H and F" region) / (area of "F" region)
       = P(H ^ F) / P(F)

Page 63:

Definition of Conditional Probability

P(A|B) = P(A ^ B) / P(B)

Corollary: The Chain Rule

P(A ^ B) = P(A|B) P(B)

Page 64:

Bayes Rule

P(B|A) = P(A ^ B) / P(A) = P(A|B) P(B) / P(A)

This is Bayes Rule

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
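Worked example with the numbers from the headache/flu slide: P(F|H) = P(H|F) P(F) / P(H) = (1/2 x 1/40) / (1/10) = 1/8, i.e. even with a headache there is only a 12.5% chance you are coming down with flu.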

Page 65:

Using Bayes Rule to Gamble

The “Win” envelope has a dollar and four beads in it

$1.00

The “Lose” envelope has three beads and no money

Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay?

R R B B R B B

Page 66:

Using Bayes Rule to Gamble

The “Win” envelope has a dollar and four beads in it

$1.00

The “Lose” envelope has three beads and no money

Interesting question: before deciding, you are allowed to see one bead drawn from the envelope.

Suppose it’s black: How much should you pay? Suppose it’s red: How much should you pay?
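One way to work this out (a sketch of mine; it assumes the bead colours drawn on the slide are 2 red + 2 black in the Win envelope and 1 red + 2 black in the Lose envelope):

p_win = p_lose = 0.5                                # the envelope is drawn at random

def posterior_win(colour):
    # Bayes rule: P(Win | colour) = P(colour | Win) P(Win) / P(colour)
    like_win  = {"red": 2/4, "black": 2/4}[colour]  # assumed bead mix in the Win envelope
    like_lose = {"red": 1/3, "black": 2/3}[colour]  # assumed bead mix in the Lose envelope
    return like_win * p_win / (like_win * p_win + like_lose * p_lose)

for colour in ("black", "red"):
    p = posterior_win(colour)
    print(colour, round(p, 3))   # black ~0.429, red ~0.6; the fair price is p x $1.00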

Page 67:

Another Example

Your friend told you that she has two children (not twins). Probability that both children are male?

You asked her whether she has at least one son, and she said yes. Probability that both children are male?

You visited her house and saw one son of hers. Probability that both children are male?

Note: P(SeeASon|OneSon) and P(TellASon|OneSon) are different!
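A quick Monte Carlo sketch (mine) of the three questions; it assumes each child is independently a boy with probability 1/2, and that in the "visit" case the child you happen to see is picked uniformly at random:

import random
random.seed(0)

N = 100_000
families = [(random.random() < 0.5, random.random() < 0.5) for _ in range(N)]  # True = boy

both = [f for f in families if f[0] and f[1]]
at_least_one = [f for f in families if f[0] or f[1]]
saw_a_boy = [f for f in families if f[random.randint(0, 1)]]   # the child you happen to see

print(len(both) / N)                                           # ~0.25: no extra information
print(len(both) / len(at_least_one))                           # ~0.33: she tells you she has a son
print(sum(f[0] and f[1] for f in saw_a_boy) / len(saw_a_boy))  # ~0.50: you see a son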

Page 68:

Multivalued Random Variables

Suppose A can take on more than 2 values. A is a random variable with arity k if it can take on exactly one value out of {v1, v2, ..., vk}. Thus:

P(A=vi ^ A=vj) = 0 if i ≠ j

P(A=v1 or A=v2 or ... or A=vk) = 1

Page 69:

An easy fact about Multivalued Random Variables:

Using the axioms of probability:
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

and assuming that A obeys
P(A=vi ^ A=vj) = 0 if i ≠ j,  P(A=v1 or A=v2 or ... or A=vk) = 1,

it is easy to prove that

P(A=v1 or A=v2 or ... or A=vi) = Σj=1..i P(A=vj)

Page 70:

An easy fact about Multivalued Random Variables:

Using the axioms of probability:
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

and assuming that A obeys
P(A=vi ^ A=vj) = 0 if i ≠ j,  P(A=v1 or A=v2 or ... or A=vk) = 1,

it is easy to prove that

P(A=v1 or A=v2 or ... or A=vi) = Σj=1..i P(A=vj)

and thus we can prove

Σj=1..k P(A=vj) = 1

Page 71:

Another fact about Multivalued Random Variables:

Using the axioms of probability:
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

and assuming that A obeys
P(A=vi ^ A=vj) = 0 if i ≠ j,  P(A=v1 or A=v2 or ... or A=vk) = 1,

it is easy to prove that

P(B ^ [A=v1 or A=v2 or ... or A=vi]) = Σj=1..i P(B ^ A=vj)

Page 72:

Another fact about Multivalued Random Variables:

Using the axioms of probability:
0 <= P(A) <= 1, P(True) = 1, P(False) = 0
P(A or B) = P(A) + P(B) - P(A and B)

and assuming that A obeys
P(A=vi ^ A=vj) = 0 if i ≠ j,  P(A=v1 or A=v2 or ... or A=vk) = 1,

it is easy to prove that

P(B ^ [A=v1 or A=v2 or ... or A=vi]) = Σj=1..i P(B ^ A=vj)

and thus we can prove

P(B) = Σj=1..k P(B ^ A=vj)

Page 73:

More General Forms of Bayes Rule

P(A|B) = P(B|A) P(A) / P(B) = P(B|A) P(A) / [ P(B|A) P(A) + P(B|~A) P(~A) ]

P(A | B ^ X) = P(B | A ^ X) P(A ^ X) / P(B ^ X)

Page 74:

More General Forms of Bayes Rule

P(A=vi | B) = P(B | A=vi) P(A=vi) / Σk=1..nA P(B | A=vk) P(A=vk)

Page 75:

Useful Easy-to-prove facts

P(A|B) + P(~A|B) = 1

Σk=1..nA P(A=vk | B) = 1

Page 76:

From Probability to Bayesian Net

• Suppose there are some diseases, symptoms and related facts.

• E.g. flu, headache, fever, lung cancer, smoker.
• We are interested to know some (conditional or unconditional) probabilities such as
  • P(flu)
  • P(lung cancer | smoker)
  • P(flu | headache ^ ~fever)
• What shall we do?

Page 77:

The Joint Distribution

Recipe for making a joint distribution of M variables:

Example: Boolean variables A, B, C

Page 78:

The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

Example: Boolean variables A, B, C

A B C
0 0 0

0 0 1

0 1 0

0 1 1

1 0 0

1 0 1

1 1 0

1 1 1

Page 79:

The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

2. For each combination of values, say how probable it is.

Example: Boolean variables A, B, C

A B C Prob
0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

Page 80:

The Joint Distribution

Recipe for making a joint distribution of M variables:

1. Make a truth table listing all combinations of values of your variables (if there are M Boolean variables then the table will have 2^M rows).

2. For each combination of values, say how probable it is.

3. If you subscribe to the axioms of probability, those numbers must sum to 1.

Example: Boolean variables A, B, C

A B C Prob
0 0 0 0.30

0 0 1 0.05

0 1 0 0.10

0 1 1 0.05

1 0 0 0.05

1 0 1 0.10

1 1 0 0.25

1 1 1 0.10

[Venn diagram of A, B and C with the eight probabilities from the table placed in the corresponding regions]

Page 81:

Using the Joint

Once you have the JD you can ask for the probability of any logical expression E involving your attributes:

P(E) = Σ over rows matching E of P(row)
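A sketch (mine) that stores the A, B, C joint from the earlier slide and answers queries exactly this way, by summing matching rows:

joint = {  # (A, B, C): probability, from the truth-table slide
    (0,0,0): 0.30, (0,0,1): 0.05, (0,1,0): 0.10, (0,1,1): 0.05,
    (1,0,0): 0.05, (1,0,1): 0.10, (1,1,0): 0.25, (1,1,1): 0.10,
}

def P(matches):
    # P(E) = sum of P(row) over the rows matching E
    return sum(p for row, p in joint.items() if matches(row))

def P_given(e1, e2):
    # P(E1 | E2) = P(E1 and E2) / P(E2)
    return P(lambda r: e1(r) and e2(r)) / P(e2)

print(P(lambda r: r[0] == 1))                              # P(A)      = 0.50
print(P(lambda r: r[0] == 1 or r[1] == 1))                 # P(A or B) = 0.65
print(P_given(lambda r: r[2] == 1, lambda r: r[0] == 1))   # P(C | A)  = 0.20 / 0.50 = 0.4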

Page 82:

Using the Joint

P(Poor ^ Male) = 0.4654

P(E) = Σ over rows matching E of P(row)

Page 83:

Using the Joint

P(Poor) = 0.7604

P(E) = Σ over rows matching E of P(row)

Page 84:

Inference with the Joint

P(E1 | E2) = P(E1 ^ E2) / P(E2) = [ Σ over rows matching E1 and E2 of P(row) ] / [ Σ over rows matching E2 of P(row) ]

Page 85:

Inference with the Joint

P(E1 | E2) = P(E1 ^ E2) / P(E2) = [ Σ over rows matching E1 and E2 of P(row) ] / [ Σ over rows matching E2 of P(row) ]

P(Male | Poor) = 0.4654 / 0.7604 = 0.612

Page 86:

Joint distributions

Good news

Once you have a joint distribution, you can ask important questions about stuff that involves a lot of uncertainty

Bad news

Impossible to create for more than about ten attributes because there are so many numbers needed when you build the damn thing.

Page 87:

Using fewer numbers

Suppose there are two events:
M: Manuela teaches the class (otherwise it's Andrew)
S: It is sunny

The joint p.d.f. for these events contains four entries. If we want to build the joint p.d.f. we'll have to invent those four numbers. OR WILL WE??

We don't have to specify bottom-level conjunctive events such as P(~M^S) if instead it is more convenient for us to specify things like P(M) and P(S). But just P(M) and P(S) don't determine the joint distribution, so you can't answer all questions.

Page 88:

Using fewer numbers

Suppose there are two events:
M: Manuela teaches the class (otherwise it's Andrew)
S: It is sunny

The joint p.d.f. for these events contains four entries. If we want to build the joint p.d.f. we'll have to invent those four numbers. OR WILL WE??

We don't have to specify bottom-level conjunctive events such as P(~M^S) if instead it is more convenient for us to specify things like P(M) and P(S). But just P(M) and P(S) don't determine the joint distribution, so you can't answer all questions.

What extra assumption can you make?

Page 89:

Independence

"The sunshine levels do not depend on and do not influence who is teaching."

This can be specified very simply:

P(S | M) = P(S)

This is a powerful statement! It required extra domain knowledge, a different kind of knowledge than numerical probabilities. It needed an understanding of causation.

Page 90:

Independence

From P(S | M) = P(S), the rules of probability imply (can you prove these?):

P(~S | M) = P(~S)

P(M | S) = P(M)

P(M ^ S) = P(M) P(S)

P(~M ^ S) = P(~M) P(S),  P(M ^ ~S) = P(M) P(~S),  P(~M ^ ~S) = P(~M) P(~S)

Page 91:

Independence

From P(S | M) = P(S), the rules of probability imply (can you prove these?):

P(~S | M) = P(~S)

P(M | S) = P(M)

P(M ^ S) = P(M) P(S)

P(~M ^ S) = P(~M) P(S),  P(M ^ ~S) = P(M) P(~S),  P(~M ^ ~S) = P(~M) P(~S)

And in general:

P(M=u ^ S=v) = P(M=u) P(S=v)

for each of the four combinations of u = True/False and v = True/False.

Page 92:

Independence

We've stated:
P(M) = 0.6
P(S) = 0.3
P(S | M) = P(S)

M S Prob

T T

T F

F T

F F

From these statements, we can derive the full joint pdf.

And since we now have the joint pdf, we can make any queries we like.
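One way to fill in the table above: independence lets each entry factorize, so P(M ^ S) = P(M) P(S) = 0.6 x 0.3 = 0.18, P(M ^ ~S) = 0.6 x 0.7 = 0.42, P(~M ^ S) = 0.4 x 0.3 = 0.12, and P(~M ^ ~S) = 0.4 x 0.7 = 0.28; the four entries sum to 1, as they must.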

Page 93:

A more interesting case

M : Manuela teaches the class S : It is sunny L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive later than Manuela.

Page 94:

A more interesting case

M : Manuela teaches the class S : It is sunny L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

Let's begin by writing down knowledge we're happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6.

Lateness is not independent of the weather and is not independent of the lecturer.

Page 95:

A more interesting case

M : Manuela teaches the class S : It is sunny L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

Let's begin by writing down knowledge we're happy about: P(S | M) = P(S), P(S) = 0.3, P(M) = 0.6.

Lateness is not independent of the weather and is not independent of the lecturer.

We already know the joint of S and M, so all we need now is P(L | S=u, M=v) in the 4 cases of u, v = True/False.

Page 96:

A more interesting case

M : Manuela teaches the class S : It is sunny L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

Now we can derive a full joint p.d.f. with a “mere” six numbers instead of seven*

*Savings are larger for larger numbers of variables.

Page 97:

A more interesting case

M : Manuela teaches the class S : It is sunny L : The lecturer arrives slightly late.

Assume both lecturers are sometimes delayed by bad weather. Andrew is more likely to arrive late than Manuela.

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

Question: Express P(L=x ^ M=y ^ S=z) in terms that only need the above expressions, where x, y and z may each be True or False.

Page 98:

A bit of notation

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

[Diagram: nodes S and M, each with an arrow into L; P(S)=0.3 is attached to S, P(M)=0.6 to M, and the four P(L | ...) values to L]

Page 99:

A bit of notation

P(S | M) = P(S)
P(S) = 0.3
P(M) = 0.6

P(L | M ^ S) = 0.05
P(L | M ^ ~S) = 0.1
P(L | ~M ^ S) = 0.1
P(L | ~M ^ ~S) = 0.2

[Diagram: nodes S and M, each with an arrow into L, annotated with the probabilities above]

Read the absence of an arrow between S and M to mean "it would not help me predict M if I knew the value of S".

Read the two arrows into L to mean that if I want to know the value of L it may help me to know M and to know S.

This kind of stuff will be thoroughly formalized later.

Page 100:

An even cuter trick

Suppose we have these three events:
M: Lecture taught by Manuela
L: Lecturer arrives late
R: Lecture concerns robots

Suppose: Andrew has a higher chance of being late than Manuela. Andrew has a higher chance of giving robotics lectures.

What kind of independence can we find? How about: P(L | M) = P(L)? P(R | M) = P(R)? P(L | R) = P(L)?

Page 101:

Conditional independence

Once you know who the lecturer is, then whether they arrive late doesn't affect whether the lecture concerns robots:

P(R | M, L) = P(R | M) and P(R | ~M, L) = P(R | ~M)

We express this in the following way: "R and L are conditionally independent given M"

...which is also notated by the following diagram:

[Diagram: M with arrows to L and to R]

Given knowledge of M, knowing anything else in the diagram won't help us with L, etc.

Page 102:

Conditional Independence formalized

R and L are conditionally independent given M if for all x, y, z in {T, F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally: Let S1, S2 and S3 be sets of variables. Set-of-variables S1 and set-of-variables S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)

Page 103:

Example:

R and L are conditionally independent given M if for all x, y, z in {T, F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally: Let S1, S2 and S3 be sets of variables. S1 and S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)

"Shoe-size is conditionally independent of Glove-size given height, weight and age" means:

for all s, g, h, w, a:
P(ShoeSize=s | Height=h, Weight=w, Age=a) = P(ShoeSize=s | Height=h, Weight=w, Age=a, GloveSize=g)

Page 104:

Example:

R and L are conditionally independent given M if for all x, y, z in {T, F}:

P(R=x | M=y ^ L=z) = P(R=x | M=y)

More generally: Let S1, S2 and S3 be sets of variables. S1 and S2 are conditionally independent given S3 if, for all assignments of values to the variables in the sets,

P(S1's assignments | S2's assignments & S3's assignments) = P(S1's assignments | S3's assignments)

"Shoe-size is conditionally independent of Glove-size given height, weight and age" does not mean:

for all s, g, h:
P(ShoeSize=s | Height=h) = P(ShoeSize=s | Height=h, GloveSize=g)

Page 105:

Conditional independence

[Diagram: M with arrows to L and to R]

We can write down P(M). And then, since we know L is only directly influenced by M, we can write down the values of P(L|M) and P(L|~M) and know we've fully specified L's behavior. Ditto for R.

P(M) = 0.6
P(L | M) = 0.085
P(L | ~M) = 0.17
P(R | M) = 0.3
P(R | ~M) = 0.6

"R and L conditionally independent given M"


Conditional independence

Diagram:  M → L,  M → R

P(M) = 0.6
P(L | M) = 0.085
P(L | ~M) = 0.17
P(R | M) = 0.3
P(R | ~M) = 0.6

Conditional Independence:

P(R | M, L) = P(R | M),

P(R | ~M, L) = P(R | ~M)

Again, we can obtain any member of the joint prob dist that we desire:

P(L=x ^ R=y ^ M=z) = P(L=x | M=z) · P(R=y | M=z) · P(M=z)
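For concreteness, here is a minimal Python sketch (not from the slides; the helper names are mine) that builds all eight joint entries of P(L, R, M) from exactly these five numbers and checks that they sum to 1.

```python
from itertools import product

P_M = 0.6
P_L_given = {True: 0.085, False: 0.17}   # P(L | M), P(L | ~M)
P_R_given = {True: 0.3,   False: 0.6}    # P(R | M), P(R | ~M)

def joint(l, r, m):
    # P(L=l ^ R=r ^ M=m) = P(M=m) * P(L=l | M=m) * P(R=r | M=m)
    pm = P_M if m else 1.0 - P_M
    pl = P_L_given[m] if l else 1.0 - P_L_given[m]
    pr = P_R_given[m] if r else 1.0 - P_R_given[m]
    return pm * pl * pr

total = sum(joint(l, r, m) for l, r, m in product([True, False], repeat=3))
assert abs(total - 1.0) < 1e-12            # the 8 entries form a valid joint
```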


Assume five variables

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny

T only directly influenced by L (i.e. T is conditionally independent of R,M,S given L)

L only directly influenced by M and S (i.e. L is conditionally independent of R given M & S)

R only directly influenced by M (i.e. R is conditionally independent of L,S, given M)

M and S are independent


Making a Bayes net

Nodes so far (no links yet):  S   M   R   L   T

Step One: add variables.
• Just choose the variables you’d like to be included in the net.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny


Making a Bayes net

Diagram:  S → L,  M → L,  M → R,  L → T

Step Two: add links.
• The link structure must be acyclic.
• If node X is given parents Q1, Q2, …, Qn you are promising that any variable that’s a non-descendant of X is conditionally independent of X given {Q1, Q2, …, Qn}.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny


Making a Bayes net

Diagram:  S → L,  M → L,  M → R,  L → T

P(S) = 0.3        P(M) = 0.6
P(R | M) = 0.3    P(R | ~M) = 0.6
P(T | L) = 0.3    P(T | ~L) = 0.8
P(L | M ^ S) = 0.05    P(L | M ^ ~S) = 0.1    P(L | ~M ^ S) = 0.1    P(L | ~M ^ ~S) = 0.2

Step Three: add a probability table for each node.
• The table for node X must list P(X | Parent Values) for each possible combination of parent values.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny


Making a Bayes net

Diagram:  S → L,  M → L,  M → R,  L → T

P(S) = 0.3        P(M) = 0.6
P(R | M) = 0.3    P(R | ~M) = 0.6
P(T | L) = 0.3    P(T | ~L) = 0.8
P(L | M ^ S) = 0.05    P(L | M ^ ~S) = 0.1    P(L | ~M ^ S) = 0.1    P(L | ~M ^ ~S) = 0.2

• Two unconnected variables may still be correlated.
• Each node is conditionally independent of all non-descendants in the tree, given its parents.
• You can deduce many other conditional independence relations from a Bayes net. See the next lecture.

T: The lecture started by 10:35
L: The lecturer arrives late
R: The lecture concerns robots
M: The lecturer is Manuela
S: It is sunny


Bayes Nets Formalized

A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair (V, E) where:

  V is a set of vertices.
  E is a set of directed edges joining vertices. No loops of any length are allowed.

Each vertex in V contains the following information:
  The name of a random variable
  A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.


Building a Bayes Net

1. Choose a set of relevant variables.
2. Choose an ordering for them.
3. Assume they’re called X1 .. Xm (where X1 is the first in the ordering, X2 is the second, etc.).
4. For i = 1 to m:
   1. Add the Xi node to the network.
   2. Set Parents(Xi) to be a minimal subset of {X1…Xi-1} such that we have conditional independence of Xi and all other members of {X1…Xi-1} given Parents(Xi).
   3. Define the probability table of P(Xi = k | Assignments of Parents(Xi)).


Computing a Joint Entry

How to compute an entry in a joint distribution?
E.g.: What is P(S ^ ~M ^ L ^ ~R ^ T)?

Diagram:  S → L,  M → L,  M → R,  L → T

P(S) = 0.3        P(M) = 0.6
P(R | M) = 0.3    P(R | ~M) = 0.6
P(T | L) = 0.3    P(T | ~L) = 0.8
P(L | M ^ S) = 0.05    P(L | M ^ ~S) = 0.1    P(L | ~M ^ S) = 0.1    P(L | ~M ^ ~S) = 0.2


Computing with Bayes Net

P(T ^ ~R ^ L ^ ~M ^ S)
  = P(T | ~R ^ L ^ ~M ^ S) * P(~R ^ L ^ ~M ^ S)
  = P(T | L) * P(~R ^ L ^ ~M ^ S)
  = P(T | L) * P(~R | L ^ ~M ^ S) * P(L ^ ~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L ^ ~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M ^ S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M | S) * P(S)
  = P(T | L) * P(~R | ~M) * P(L | ~M ^ S) * P(~M) * P(S).

Diagram:  S → L,  M → L,  M → R,  L → T

P(S) = 0.3        P(M) = 0.6
P(R | M) = 0.3    P(R | ~M) = 0.6
P(T | L) = 0.3    P(T | ~L) = 0.8
P(L | M ^ S) = 0.05    P(L | M ^ ~S) = 0.1    P(L | ~M ^ S) = 0.1    P(L | ~M ^ ~S) = 0.2


The general case

P(X1=x1 ^ X2=x2 ^ … ^ Xn-1=xn-1 ^ Xn=xn)
  = P(Xn=xn ^ Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
  = P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1)
  = P(Xn=xn | Xn-1=xn-1 ^ … ^ X2=x2 ^ X1=x1) * P(Xn-1=xn-1 | Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
        * P(Xn-2=xn-2 ^ … ^ X2=x2 ^ X1=x1)
  = …
  = Π (i = 1 to n)  P(Xi=xi | Xi-1=xi-1 ^ … ^ X1=x1)
  = Π (i = 1 to n)  P(Xi=xi | Assignments of Parents(Xi))

So any entry in joint pdf table can be computed. And so any conditional probability can be computed.
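A minimal Python sketch of this computation for the S, M, L, R, T example (the representation and function names are mine, not the lecture's; the CPT values are the ones on the slides):

```python
# Each node: (tuple of parents, table giving P(node=True | parent assignment)).
CPTS = {
    "S": ((), {(): 0.3}),
    "M": ((), {(): 0.6}),
    "L": (("M", "S"), {(True, True): 0.05, (True, False): 0.1,
                       (False, True): 0.1, (False, False): 0.2}),
    "R": (("M",), {(True,): 0.3, (False,): 0.6}),
    "T": (("L",), {(True,): 0.3, (False,): 0.8}),
}

def joint_prob(assignment):
    """P(full assignment) = product over nodes of P(node | its parents)."""
    p = 1.0
    for var, (parents, table) in CPTS.items():
        p_true = table[tuple(assignment[q] for q in parents)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

# The entry P(T ^ ~R ^ L ^ ~M ^ S) worked out above:
# P(T|L) * P(~R|~M) * P(L|~M^S) * P(~M) * P(S) = 0.3 * 0.4 * 0.1 * 0.4 * 0.3 = 0.00144
print(joint_prob({"S": True, "M": False, "L": True, "R": False, "T": True}))
```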


Where are we now?

We have a methodology for building Bayes nets.
We don’t require exponential storage to hold our probability table. Only exponential in the maximum number of parents of any node.
We can compute probabilities of any given assignment of truth values to the variables. And we can do it in time linear with the number of nodes.
So we can also compute answers to any questions.

E.g. What could we do to compute P(R | T, ~S)?

Diagram:  S → L,  M → L,  M → R,  L → T

P(S) = 0.3        P(M) = 0.6
P(R | M) = 0.3    P(R | ~M) = 0.6
P(T | L) = 0.3    P(T | ~L) = 0.8
P(L | M ^ S) = 0.05    P(L | M ^ ~S) = 0.1    P(L | ~M ^ S) = 0.1    P(L | ~M ^ ~S) = 0.2

Step 1: Compute P(R ^ T ^ ~S)
        = sum of all the rows in the joint that match R ^ T ^ ~S   (4 joint computes)
Step 2: Compute P(T ^ ~S)
        = sum of all the rows in the joint that match T ^ ~S       (8 joint computes)
Step 3: Return P(R ^ T ^ ~S) / P(T ^ ~S)

Each of these sums is obtained by the “computing a joint probability entry” method of the earlier slides.
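A sketch of this enumerate-the-joint procedure in Python (the toy network is repeated so the snippet stands alone; the helper names are mine):

```python
from itertools import product

CPTS = {
    "S": ((), {(): 0.3}),
    "M": ((), {(): 0.6}),
    "L": (("M", "S"), {(True, True): 0.05, (True, False): 0.1,
                       (False, True): 0.1, (False, False): 0.2}),
    "R": (("M",), {(True,): 0.3, (False,): 0.6}),
    "T": (("L",), {(True,): 0.3, (False,): 0.8}),
}
ORDER = ["S", "M", "L", "R", "T"]

def joint_prob(a):
    p = 1.0
    for var, (parents, table) in CPTS.items():
        p_true = table[tuple(a[q] for q in parents)]
        p *= p_true if a[var] else 1.0 - p_true
    return p

def prob_of(event):
    """Sum joint entries over every assignment consistent with `event`."""
    free = [v for v in ORDER if v not in event]
    return sum(joint_prob({**event, **dict(zip(free, vals))})
               for vals in product([True, False], repeat=len(free)))

numerator = prob_of({"R": True, "T": True, "S": False})   # Step 1: 4 joint computes
denominator = prob_of({"T": True, "S": False})            # Step 2: 8 joint computes
print(numerator / denominator)                            # Step 3: P(R | T ^ ~S)
```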


The good news

We can do inference. We can compute any conditional probability:
P( some variable | some other variable values )

P(E1 | E2) = P(E1 ^ E2) / P(E2)
           = (sum of joint entries matching E1 and E2) / (sum of joint entries matching E2)

Suppose you have m binary-valued variables in your Bayes Net and expression E2 mentions k variables.

How much work is the above computation?


The sad, bad news

Conditional probabilities by enumerating all matching entries in the joint are expensive:

Exponential in the number of variables.

But perhaps there are faster ways of querying Bayes nets? In fact, if I ever ask you to manually do a Bayes Net inference, you’ll find there are often many tricks to save you time. So we’ve just got to program our computer to do those tricks too, right?

Sadder and worse news: general querying of Bayes nets is NP-complete.


Bayes nets inference algorithms

A poly-tree is a directed acyclic graph in which no two nodes have more than one path between them.

(Figure: two example graphs. The S, M, L, R, T net is a poly-tree. A five-node net X1…X5 in which some pair of nodes is joined by two different paths is not a poly-tree, but is still a legal Bayes net.)

• If net is a poly-tree, there is a linear-time algorithm (see a later Andrew lecture).

• The best general-case algorithms convert a general net to a poly-tree (often at huge expense) and call the poly-tree algorithm.

• Another popular, practical approach (doesn’t assume poly-tree): Stochastic Simulation.


Sampling from the Joint Distribution

It’s pretty easy to generate a set of variable-assignments at random with the same probability as the underlying joint distribution.

How?

Diagram:  S → L,  M → L,  M → R,  L → T

P(S) = 0.3        P(M) = 0.6
P(R | M) = 0.3    P(R | ~M) = 0.6
P(T | L) = 0.3    P(T | ~L) = 0.8
P(L | M ^ S) = 0.05    P(L | M ^ ~S) = 0.1    P(L | ~M ^ S) = 0.1    P(L | ~M ^ ~S) = 0.2


Sampling from the Joint Distribution

1. Randomly choose S. S = True with prob 0.3
2. Randomly choose M. M = True with prob 0.6
3. Randomly choose L. The probability that L is true depends on the assignments of S and M. E.g. if steps 1 and 2 had produced S=True, M=False, then the probability that L is true is 0.1
4. Randomly choose R. Probability depends on M.
5. Randomly choose T. Probability depends on L.
(These five steps are sketched in code below.)

Diagram:  S → L,  M → L,  M → R,  L → T

P(S) = 0.3        P(M) = 0.6
P(R | M) = 0.3    P(R | ~M) = 0.6
P(T | L) = 0.3    P(T | ~L) = 0.8
P(L | M ^ S) = 0.05    P(L | M ^ ~S) = 0.1    P(L | ~M ^ S) = 0.1    P(L | ~M ^ ~S) = 0.2
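A minimal Python sketch of those five sampling steps (the names are illustrative; the probabilities are the slide's):

```python
import random

P_S, P_M = 0.3, 0.6
P_L = {(True, True): 0.05, (True, False): 0.1, (False, True): 0.1, (False, False): 0.2}
P_R = {True: 0.3, False: 0.6}
P_T = {True: 0.3, False: 0.8}

def sample_once():
    s = random.random() < P_S          # 1. S = True with prob 0.3
    m = random.random() < P_M          # 2. M = True with prob 0.6
    l = random.random() < P_L[(m, s)]  # 3. L depends on M and S
    r = random.random() < P_R[m]       # 4. R depends on M
    t = random.random() < P_T[l]       # 5. T depends on L
    return {"S": s, "M": m, "L": l, "R": r, "T": t}

print(sample_once())   # one variable-assignment drawn from the joint distribution
```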


A general sampling algorithm

Let’s generalize the example on the previous slide to a general Bayes Net.

As in Slides 16-17 , call the variables X1 .. Xn, where Parents(Xi) must be a subset of {X1 .. Xi-1}.

For i=1 to n:

1. Find parents, if any, of Xi. Assume n(i) parents. Call them Xp(i,1), Xp(i,2), …Xp(i,n(i)).

2. Recall the values that those parents were randomly given: xp(i,1), xp(i,2), …xp(i,n(i)).

3. Look up in the lookup-table for: P(Xi=True | Xp(i,1)=xp(i,1), Xp(i,2)=xp(i,2), … Xp(i,n(i))=xp(i,n(i)))

4. Randomly set xi=True according to this probability

x1, x2,…xn are now a sample from the joint distribution of X1, X2,…Xn.


Stochastic Simulation Example

Someone wants to know P(R=True | T=True ^ S=False)

We’ll do lots of random samplings and count the number of occurrences of the following:
  Nc : Num. samples in which T=True and S=False.
  Ns : Num. samples in which R=True, T=True and S=False.
  N  : Number of random samplings

Now if N is big enough:
  Nc / N is a good estimate of P(T=True and S=False).
  Ns / N is a good estimate of P(R=True, T=True, S=False).
  P(R | T ^ ~S) = P(R ^ T ^ ~S) / P(T ^ ~S), so Ns / Nc can be a good estimate of P(R | T ^ ~S).


General Stochastic Simulation

Someone wants to know P(E1 | E2)

We’ll do lots of random samplings and count the number of occurrences of the following:
  Nc : Num. samples in which E2
  Ns : Num. samples in which E1 and E2
  N  : Number of random samplings

Now if N is big enough:
  Nc / N is a good estimate of P(E2).
  Ns / N is a good estimate of P(E1, E2).
  P(E1 | E2) = P(E1 ^ E2) / P(E2), so Ns / Nc can be a good estimate of P(E1 | E2).
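A generic sketch of this counting estimate (the function and parameter names are mine; `sample_joint` can be any sampler such as the one sketched above, and the two predicates test whether a sample matches E1 and E2):

```python
def estimate_conditional(sample_joint, matches_E1, matches_E2, n=100_000):
    """Estimate P(E1 | E2) as Ns / Nc from n random samplings."""
    nc = ns = 0
    for _ in range(n):
        x = sample_joint()
        if matches_E2(x):               # sample counts towards Nc
            nc += 1
            if matches_E1(x):           # ...and towards Ns
                ns += 1
    return ns / nc

# e.g. for P(R | T ^ ~S):
# estimate_conditional(sample_once, lambda x: x["R"], lambda x: x["T"] and not x["S"])
```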


Likelihood weighting

Problem with stochastic sampling: with lots of constraints in E2, or unlikely events in E2, most of the simulations will be thrown away (they’ll have no effect on Nc or Ns).

Imagine we’re part way through our simulation.

In E2 we have the constraint Xi = v.

We’re just about to generate a value for Xi at random. Given the values assigned to the parents, we see that P(Xi = v | parents) = p.

Now we know that with stochastic sampling: we’ll generate “Xi = v” proportion p of the time, and proceed. And we’ll generate a different value proportion 1-p of the time, and the simulation will be wasted.

Instead, always generate Xi = v, but weight the answer by weight “p” to compensate.


Likelihood weighting

Set Nc := 0, Ns := 0

1. Generate a random assignment of all variables that matches E2. This process returns a weight w.
2. Define w to be the probability that this assignment would have been generated instead of an unmatching assignment during its generation in the original algorithm. Fact: w is a product of all likelihood factors involved in the generation.
3. Nc := Nc + w
4. If our sample matches E1 then Ns := Ns + w
5. Go to 1

Again, Ns / Nc estimates P(E1 | E2)
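A Python sketch of likelihood weighting for the toy network, estimating P(R | T=True ^ S=False) (the representation and names are mine; the CPT values are the slides'):

```python
import random

CPTS = {
    "S": ((), {(): 0.3}),
    "M": ((), {(): 0.6}),
    "L": (("M", "S"), {(True, True): 0.05, (True, False): 0.1,
                       (False, True): 0.1, (False, False): 0.2}),
    "R": (("M",), {(True,): 0.3, (False,): 0.6}),
    "T": (("L",), {(True,): 0.3, (False,): 0.8}),
}
ORDER = ["S", "M", "L", "R", "T"]          # parents always precede children

def weighted_sample(evidence):
    a, w = {}, 1.0
    for var in ORDER:
        parents, table = CPTS[var]
        p_true = table[tuple(a[q] for q in parents)]
        if var in evidence:
            a[var] = evidence[var]         # always generate Xi = v ...
            w *= p_true if evidence[var] else 1.0 - p_true   # ...but weight by p
        else:
            a[var] = random.random() < p_true
    return a, w

def lw_estimate(n=100_000):
    nc = ns = 0.0
    for _ in range(n):
        a, w = weighted_sample({"T": True, "S": False})
        nc += w                            # every sample matches E2 by construction
        if a["R"]:                         # does it also match E1?
            ns += w
    return ns / nc                         # estimates P(R | T ^ ~S)

print(lw_estimate())
```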


Case Study I

Pathfinder system. (Heckerman 1991, Probabilistic Similarity Networks, MIT Press, Cambridge MA).

Diagnostic system for lymph-node diseases.
  60 diseases and 100 symptoms and test-results.
  14,000 probabilities.
Expert consulted to make the net:
  8 hours to determine variables.
  35 hours for net topology.
  40 hours for probability table values.
Apparently, the experts found it quite easy to invent the causal links and probabilities.
Pathfinder is now outperforming the world experts in diagnosis. Being extended to several dozen other medical domains.


Questions

What are the strengths of probabilistic networks compared with propositional logic?

What are the weaknesses of probabilistic networks compared with propositional logic?

What are the strengths of probabilistic networks compared with predicate logic?

What are the weaknesses of probabilistic networks compared with predicate logic?

(How) could predicate logic and probabilistic networks be combined?


What you should know

The meanings and importance of independence and conditional independence.
The definition of a Bayes net.
Computing probabilities of assignments of variables (i.e. members of the joint p.d.f.) with a Bayes net.
The slow (exponential) method for computing arbitrary, conditional probabilities.
The stochastic simulation method and likelihood weighting.


Content

What is classification?
Decision tree
Naïve Bayesian Classifier
Bayesian Networks
Neural Networks
Support Vector Machines (SVM)


Neural Networks

Assume each record X is a vector of length n.
There are two classes which are valued 1 and -1. E.g. class 1 means “buy computer”, and class -1 means “no buy”.
We can incrementally change weights to learn to produce these outputs using the perceptron learning rule.


A Neuron

The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping

(Figure: inputs x0, x1, …, xn, each multiplied by a weight w0, w1, …, wn from the weight vector w, feed a weighted sum; the bias term −μk is added and an activation function f produces the output y.)


A Neuron

(Figure: the same neuron — input vector x, weight vector w, weighted sum, activation function, output y.)

Example:  y = sign( Σ (i = 0 to n) wi xi − μk )
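A tiny sketch of that computation (pure Python; the argument names are mine):

```python
def neuron_output(x, w, mu_k):
    # weighted sum of the inputs minus the bias/threshold, then a sign activation
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return 1 if s >= 0 else -1          # class +1 or class -1

print(neuron_output(x=[1.0, 2.0, -1.0], w=[0.5, -0.25, 1.0], mu_k=0.1))   # -> -1
```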


Multi-Layer Perceptron

(Figure: a layered network — the input vector xi feeds the input nodes, which connect through weights wij to hidden nodes, which in turn connect to output nodes producing the output vector.)

Net input to unit j:  Ij = Σi wij xi


Network Training

The ultimate objective of training:
  obtain a set of weights that makes almost all the tuples in the training data classified correctly
Steps:
  Initialize weights with random values
  Feed the input tuples into the network one by one
  For each unit:
    Compute the net input to the unit as a linear combination of all the inputs to the unit
    Compute the output value using the activation function
    Compute the error
    Update the weights and the bias
(A perceptron-style version of this loop is sketched below.)
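A minimal sketch of these steps for a single unit, using the perceptron learning rule mentioned earlier (pure Python; the learning rate `eta`, the epoch count and the toy data are illustrative assumptions, not from the slides):

```python
import random

def train_perceptron(data, n_features, eta=0.1, epochs=20):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_features)]  # random initial weights
    b = 0.0                                                       # bias
    for _ in range(epochs):
        for x, target in data:                                    # target is +1 or -1
            net = sum(wi * xi for wi, xi in zip(w, x)) + b        # net input (linear combination)
            output = 1 if net >= 0 else -1                        # activation
            if output != target:                                  # error -> update weights and bias
                w = [wi + eta * target * xi for wi, xi in zip(w, x)]
                b += eta * target
    return w, b

# toy data: class +1 = "buys computer", class -1 = "does not buy"
data = [([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], -1), ([0.0, 0.0], -1)]
w, b = train_perceptron(data, n_features=2)
```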


Content

What is classification?
Decision tree
Naïve Bayesian Classifier
Bayesian Networks
Neural Networks
Support Vector Machines (SVM)


Linear Support Vector Machines

Goal: find a plane that divides the two sets of points.
(Figure: one class of points has value(x) = 1, e.g. buys computer; the other has value(x) = -1, e.g. does not buy computer.)
Also, need to maximize the margin.


Linear Support Vector Machines

(Figure: two separating hyperplanes for the same data, one with a small margin and one with a large margin; the points lying on the margin boundaries are the support vectors.)


Linear Support Vector Machines

The hyperplane is defined by a vector w and a value b.
For any point x where value(x) = -1:  w·x + b ≤ -1.
For any point x where value(x) = 1:   w·x + b ≥ 1.
Here, writing x = (xh, xv) and w = (wh, wv), we have w·x = wh xh + wv xv.
(Figure: the two margin lines are labelled w·x + b = -1 and w·x + b = 1.)


Linear Support Vector Machines

For instance, the hyperplane is defined by the vector w = (1, 1/√3) and the value b = -11.
(Figure: the minus plane w·x + b = -1 crosses the horizontal axis at 10, the plus plane w·x + b = 1 at 12, and the lines make a 60° angle with the horizontal axis.)

Some examples:
  x = (10, 0):     w·x + b = -1
  x = (0, 12√3):   w·x + b = 1
  x = (0, 0):      w·x + b = -11

A little more detail: w is perpendicular to the minus plane, and the margin is
  M = 2 / |w|,  e.g.  2 / √(1² + (1/√3)²) = √3.


Linear Support Vector Machines

Each record xi in the training data has a vector and a class.
  There are two classes: 1 and -1. E.g. class 1 = “buy computer”, class -1 = “not buy”.
  E.g. a record (age, salary) = (45, 50k), class = 1.
Read the training data, find the hyperplane defined by w and b.
  Should maximize the margin and minimize the error. Can be done by a quadratic programming technique.
Predict the class of a new data point x by looking at the sign of w·x + b.
  Positive: class 1. Negative: class -1.
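A short sketch of this train-then-predict workflow using scikit-learn's linear SVM (a library choice I am assuming; the slides do not prescribe one, and the toy records are illustrative):

```python
from sklearn.svm import LinearSVC

# toy (age, salary) records; class 1 = "buy computer", class -1 = "not buy"
X = [[25, 30000], [30, 28000], [45, 50000], [50, 70000]]
y = [-1, -1, 1, 1]

clf = LinearSVC(C=1.0).fit(X, y)        # finds the hyperplane (w, b) via a QP solver
w, b = clf.coef_[0], clf.intercept_[0]

print(clf.predict([[40, 60000]]))       # class of a new record, i.e. the sign of w·x + b
```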


SVM – Cont.

What if the data is not linearly separable?
Project the data to a high-dimensional space where it is linearly separable, and then we can use a linear SVM (using kernels).

(Figure: 1-D points at -1, 0, +1 whose classes alternate (+, -, +) are not linearly separable; after projecting them into two dimensions they become separable.)


Non-Linear SVM

Classification using SVM (w, b): check whether  w·xi + b ≥ 0.

In the non-linear case we can see this as checking whether  K(xi, w) + b ≥ 0.

Kernel – Can be thought of as doing dot product in some high dimensional space
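A sketch of the kernel idea with scikit-learn's SVC (again an assumed library choice): an RBF kernel implicitly computes the dot product in a high-dimensional space, so the 1-D example above becomes separable without building the projection by hand.

```python
from sklearn.svm import SVC

X = [[-1.0], [0.0], [1.0]]      # the 1-D points from the previous slide
y = [1, -1, 1]                  # not linearly separable in one dimension

clf = SVC(kernel="rbf", gamma=1.0, C=10.0).fit(X, y)
print(clf.predict([[-0.9], [0.05], [0.9]]))   # expected: [ 1 -1  1]
```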


Example of Non-linear SVM (figure)


Results (figure)


SVM vs. Neural Network

SVM:
  Relatively new concept
  Nice generalization properties
  Hard to learn – learned in batch mode using quadratic programming techniques
  Using kernels can learn very complex functions

Neural Network:
  Quite old
  Generalizes well but doesn’t have a strong mathematical foundation
  Can easily be learned in incremental fashion
  To learn complex functions – use a multilayer perceptron (not that trivial)


SVM Related Links

http://svm.dcs.rhbnc.ac.uk/
http://www.kernel-machines.org/
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998.
SVMlight – software (in C): http://ais.gmd.de/~thorsten/svm_light
Book: An Introduction to Support Vector Machines. N. Cristianini and J. Shawe-Taylor. Cambridge University Press.


Content

What is classification?
Decision tree
Naïve Bayesian Classifier
Bayesian Networks
Neural Networks
Support Vector Machines (SVM)
Bagging and Boosting


Bagging and Boosting

General idea (diagram):

  Training data          → classification method (CM) → classifier C
  Altered training data  → CM → classifier C1
  Altered training data  → CM → classifier C2
  ……
  Aggregation            → combined classifier C*


Bagging

Given a set S of s samples:
  Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.
  Repeat this sampling procedure, getting a sequence of k independent training sets.
  A corresponding sequence of classifiers C1, C2, …, Ck is constructed for each of these training sets, by using the same classification algorithm.
To classify an unknown sample X, let each classifier predict or vote.
The bagged classifier C* counts the votes and assigns X to the class with the “most” votes.
(A short sketch of this procedure follows.)
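A minimal sketch of bagging (the library and helper names are my assumptions — decision trees from scikit-learn as the shared classification algorithm):

```python
import random
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagged_classifiers(S, labels, k=25):
    classifiers, n = [], len(S)
    for _ in range(k):
        idx = [random.randrange(n) for _ in range(n)]      # bootstrap: draw with replacement
        T = [S[i] for i in idx]
        yT = [labels[i] for i in idx]
        classifiers.append(DecisionTreeClassifier().fit(T, yT))
    return classifiers

def bagged_predict(classifiers, x):
    votes = Counter(c.predict([x])[0] for c in classifiers)
    return votes.most_common(1)[0][0]                      # class with the most votes
```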


Boosting Technique — Algorithm

Assign every example an equal weight 1/N
For t = 1, 2, …, T do:
  Obtain a hypothesis (classifier) h(t) under w(t)
  Calculate the error of h(t) and re-weight the examples based on the error. Each classifier is dependent on the previous ones. Samples that are incorrectly predicted are weighted more heavily.
  Normalize w(t+1) to sum to 1 (so the example weights again form a distribution)
Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set
(See the AdaBoost-style sketch below.)
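A compact AdaBoost-style sketch of this loop (my assumptions: decision stumps from scikit-learn as the base learner and labels in {+1, -1}; the slides do not fix these details):

```python
import math
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, T=10):
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    w = np.full(len(X), 1.0 / len(X))              # equal initial weights 1/N
    hypotheses = []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = w[pred != y].sum()                   # weighted error of h(t)
        if err == 0 or err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)    # hypothesis weight, based on its accuracy
        w = w * np.exp(-alpha * y * pred)          # misclassified samples weighted more heavily
        w /= w.sum()                               # normalize w(t+1) to sum to 1
        hypotheses.append((alpha, h))
    return hypotheses

def boosted_predict(hypotheses, x):
    score = sum(a * h.predict([x])[0] for a, h in hypotheses)
    return 1 if score >= 0 else -1                 # sign of the weighted sum of hypotheses
```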


Summary

Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)

Classification is probably one of the most widely used data mining techniques, with a lot of extensions

Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic

Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.