
Data Mining 2012: Bayesian Network Classifiers

Ad Feelders

Universiteit Utrecht

October 25, 2012

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 1 / 48


Literature

N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning, 29, pp. 131-163 (1997) (except Section 6)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 2 / 48


Bayesian Network Classifiers

Bayesian Networks are models of the joint distribution of a collection of random variables. The joint distribution is simplified by introducing independence assumptions.

In many applications we are in fact interested in the conditional distribution of one variable (the class variable) given the other variables (attributes).

Can we use Bayesian Networks as classifiers?

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 3 / 48


The Naive Bayes Classifier

[Figure: naive Bayes DAG with class node C and arcs C → A1, C → A2, ..., C → Ak]

This Bayesian Network is equivalent to its undirected version (why?):

[Figure: the same graph with undirected edges between C and each attribute A1, ..., Ak]

Attributes are independent given the class label.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 4 / 48


The Naive Bayes Classifier

[Figure: naive Bayes DAG with class node C and arcs C → A1, C → A2, ..., C → Ak]

BN factorisation:

P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})

So factorisation corresponding to NB classifier is:

P(C, A_1, \ldots, A_k) = P(C)\, P(A_1 \mid C) \cdots P(A_k \mid C)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 5 / 48


Naive Bayes assumption

Via Bayes rule we have

P(C = i \mid A) = \frac{P(A_1, A_2, \ldots, A_k, C = i)}{P(A_1, A_2, \ldots, A_k)}    (product rule)

= \frac{P(A_1, A_2, \ldots, A_k \mid C = i)\, P(C = i)}{\sum_{j=1}^{c} P(A_1, A_2, \ldots, A_k \mid C = j)\, P(C = j)}    (product rule and sum rule)

= \frac{P(A_1 \mid C = i)\, P(A_2 \mid C = i) \cdots P(A_k \mid C = i)\, P(C = i)}{\sum_{j=1}^{c} P(A_1 \mid C = j)\, P(A_2 \mid C = j) \cdots P(A_k \mid C = j)\, P(C = j)}    (NB factorisation)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 6 / 48


Why Naive Bayes is competitive

The conditional independence assumption is often clearly inappropriate, yet the predictive accuracy of Naive Bayes is competitive with more complex classifiers. How come?

Probability estimates of Naive Bayes may be way off, but this does not necessarily result in wrong classification!

Naive Bayes has only a few parameters compared to more complex models, so it can estimate parameters more reliably.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 7 / 48


Naive Bayes: Example

P(C = 0) = 0.4, P(C = 1) = 0.6

Joint distribution of (A1, A2) given C = 0:

                A2 = 0   A2 = 1   P(A1)
   A1 = 0        0.2      0.1      0.3
   A1 = 1        0.1      0.6      0.7
   P(A2)         0.3      0.7       1

Joint distribution of (A1, A2) given C = 1:

                A2 = 0   A2 = 1   P(A1)
   A1 = 0        0.5      0.2      0.7
   A1 = 1        0.1      0.2      0.3
   P(A2)         0.6      0.4       1

We have that

P(C = 1 \mid A_1 = 0, A_2 = 0) = \frac{0.5 \times 0.6}{0.5 \times 0.6 + 0.2 \times 0.4} = 0.79

According to naive Bayes

P(C = 1 \mid A_1 = 0, A_2 = 0) = \frac{0.7 \times 0.6 \times 0.6}{0.7 \times 0.6 \times 0.6 + 0.3 \times 0.3 \times 0.4} = 0.88

Naive Bayes assigns to the right class.
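
For reference, a small R sketch (not part of the original slides) that reproduces the naive Bayes posterior above from the class prior and the conditional marginals P(A1 | C) and P(A2 | C) read off the tables:

prior <- c(0.4, 0.6)                      # P(C = 0), P(C = 1)
pA1   <- rbind(c(0.3, 0.7), c(0.7, 0.3))  # rows: C = 0, 1; columns: A1 = 0, 1
pA2   <- rbind(c(0.3, 0.7), c(0.6, 0.4))  # rows: C = 0, 1; columns: A2 = 0, 1

unnorm <- prior * pA1[, 1] * pA2[, 1]     # unnormalised P(C) P(A1 = 0 | C) P(A2 = 0 | C)
unnorm / sum(unnorm)                      # c(0.125, 0.875): naive Bayes gives 0.88 for C = 1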

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 8 / 48


What about this model?

[Figure: DAG with an arc from each attribute A1, ..., Ak into the class node C]

BN factorisation:

P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})

So factorisation is:

P(C, A_1, \ldots, A_k) = P(C \mid A_1, \ldots, A_k)\, P(A_1) \cdots P(A_k)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 9 / 48


Bayesian Networks as Classifiers

[Figure: example Bayesian network over the class C and attributes A1, ..., A8]

Markov Blanket: Parents, Children and Parents of Children.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 10 / 48


Markov Blanket of C: Moral Graph

[Figure: moral graph of the network on the previous slide, showing the Markov blanket of C]

Markov Blanket: Parents, Children and Parents of Children.

Local Markov property: C ⊥⊥ rest | boundary(C)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 11 / 48


Bayesian Networks as Classifiers

Loglikelihood under model M is

L(M \mid D) = \sum_{j=1}^{n} \log P_M(X^{(j)})

where X^{(j)} = (A_1^{(j)}, A_2^{(j)}, \ldots, A_k^{(j)}, C^{(j)}).

We can rewrite this as

L(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A^{(j)}) + \sum_{j=1}^{n} \log P_M(A^{(j)})

If there are many attributes, the second term will dominate the loglikelihood score.

But we are not interested in modeling the distribution of the attributes!

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 12 / 48


Bayesian Networks as Classifiers

[Figure: −log P(x) plotted against P(x), for P(x) between 0 and 1]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 13 / 48


Naive Bayes vs. Unrestricted BN

[Figure 2 of Friedman et al.: error curves (top) and scatter plot (bottom) comparing unsupervised Bayesian networks with naive Bayes on 25 data sets]

Figure 2. Error curves (top) and scatter plot (bottom) comparing unsupervised Bayesian networks (solid line, x axis) to naive Bayes (dashed line, y axis). In the error curves, the horizontal axis lists the data sets, which are sorted so that the curves cross only once, and the vertical axis measures the percentage of test instances that were misclassified (i.e., prediction errors). Thus, the smaller the value, the better the accuracy. Each data point is annotated by a 90% confidence interval. In the scatter plot, each point represents a data set, where the x coordinate of a point is the percentage of misclassifications according to unsupervised Bayesian networks and the y coordinate is the percentage of misclassifications according to naive Bayes. Thus, points above the diagonal line correspond to data sets on which unrestricted Bayesian networks perform better, and points below the diagonal line correspond to data sets on which naive Bayes performs better.

To confirm this hypothesis, we conducted an experiment comparing the classification accuracy of Bayesian networks learned using the MDL score (i.e., classifiers based on unrestricted networks) to that of the naive Bayesian classifier. We ran this experiment on [...]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 14 / 48


Use Conditional Log-likelihood?

Discriminative vs. Generative learning.

Conditional loglikelihood function:

CL(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A_1^{(j)}, \ldots, A_k^{(j)})

No closed form solution for ML estimates.

Remark: can be done via Logistic Regression for models with perfect graphs (Naive Bayes, TANs).

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 15 / 48


NB and Logistic Regression

The logistic regression assumption is

\log \left\{ \frac{P(C = 1 \mid A)}{P(C = 0 \mid A)} \right\} = \alpha + \sum_{i=1}^{k} \beta_i A_i,

that is, the log odds is a linear function of the attributes.

Under the naive Bayes assumption, this is exactly true.

Assign to class 1 if \alpha + \sum_{i=1}^{k} \beta_i A_i > 0, and to class 0 otherwise.

Logistic regression maximizes conditional likelihood under this assumption (it is a so-called discriminative model).

There is no closed form solution for the maximum likelihood estimates of α and β_i, but the loglikelihood function is globally concave (unique global optimum).
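
For comparison, this is what the discriminative fit looks like in R with glm(); the simulated data frame d and its column names are made up for the illustration (an illustrative sketch, not code from the slides):

# simulate binary attributes that are conditionally independent given the class
set.seed(1)
n  <- 5000
C  <- rbinom(n, 1, 0.6)
A1 <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.5))
A2 <- rbinom(n, 1, ifelse(C == 1, 0.6, 0.3))
d  <- data.frame(C, A1, A2)

# logistic regression maximizes the conditional likelihood
fit <- glm(C ~ A1 + A2, family = binomial, data = d)
coef(fit)   # (Intercept) estimates alpha; the other coefficients estimate the beta_i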

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 16 / 48


Proof (for binary attributes Ai)

Under the naive Bayes assumption we have:

\frac{P(C = 1 \mid a)}{P(C = 0 \mid a)} = \frac{P(a_1 \mid C = 1) \cdots P(a_k \mid C = 1)\, P(C = 1)}{P(a_1 \mid C = 0) \cdots P(a_k \mid C = 0)\, P(C = 0)}

= \prod_{i=1}^{k} \left\{ \left( \frac{P(a_i = 1 \mid C = 1)}{P(a_i = 1 \mid C = 0)} \right)^{a_i} \left( \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} \right)^{1 - a_i} \right\} \times \frac{P(C = 1)}{P(C = 0)}

Taking the log we get

\log \left( \frac{P(C = 1 \mid a)}{P(C = 0 \mid a)} \right) = \sum_{i=1}^{k} \left\{ a_i \log \left( \frac{P(a_i = 1 \mid C = 1)}{P(a_i = 1 \mid C = 0)} \right) + (1 - a_i) \log \left( \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} \right) \right\} + \log \left( \frac{P(C = 1)}{P(C = 0)} \right)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 17 / 48


Proof (continued)

Expand and collect terms.

\log \left( \frac{P(C = 1 \mid a)}{P(C = 0 \mid a)} \right) = \sum_{i=1}^{k} a_i \underbrace{\log \frac{P(a_i = 1 \mid C = 1)\, P(a_i = 0 \mid C = 0)}{P(a_i = 1 \mid C = 0)\, P(a_i = 0 \mid C = 1)}}_{\beta_i} + \underbrace{\sum_{i=1}^{k} \log \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} + \log \frac{P(C = 1)}{P(C = 0)}}_{\alpha}

which is a linear function of a.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 18 / 48


Example

Suppose P(C = 1) = 0.6, P(a_1 = 1 | C = 1) = 0.8, P(a_1 = 1 | C = 0) = 0.5, P(a_2 = 1 | C = 1) = 0.6, P(a_2 = 1 | C = 0) = 0.3.

Then

\log \left( \frac{P(C = 1 \mid a_1, a_2)}{P(C = 0 \mid a_1, a_2)} \right) = 1.386\, a_1 + 1.253\, a_2 - 1.476 + 0.405 = -1.071 + 1.386\, a_1 + 1.253\, a_2

Classify a point with a_1 = 1 and a_2 = 0:

\log \left( \frac{P(C = 1 \mid 1, 0)}{P(C = 0 \mid 1, 0)} \right) = -1.071 + 1.386 \times 1 + 1.253 \times 0 = 0.315

Decision rule: assign to class 1 if

\alpha + \sum_{i=1}^{k} \beta_i A_i > 0

and to class 0 otherwise. Linear decision boundary.
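
A quick numerical check in R (an illustration, not code from the slides) of the coefficients above:

p1     <- 0.6                  # P(C = 1)
pa1_c1 <- 0.8; pa1_c0 <- 0.5   # P(a1 = 1 | C = 1), P(a1 = 1 | C = 0)
pa2_c1 <- 0.6; pa2_c0 <- 0.3   # P(a2 = 1 | C = 1), P(a2 = 1 | C = 0)

beta1 <- log((pa1_c1 * (1 - pa1_c0)) / (pa1_c0 * (1 - pa1_c1)))  # 1.386
beta2 <- log((pa2_c1 * (1 - pa2_c0)) / (pa2_c0 * (1 - pa2_c1)))  # 1.253
alpha <- log((1 - pa1_c1) / (1 - pa1_c0)) +
         log((1 - pa2_c1) / (1 - pa2_c0)) +
         log(p1 / (1 - p1))                                      # -1.071

alpha + beta1 * 1 + beta2 * 0  # log odds for a1 = 1, a2 = 0: 0.315 > 0, so class 1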

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 19 / 48


Linear Decision Boundary

[Figure: the unit square of (A1, A2) values with the linear decision boundary a2 = 0.855 − 1.106 a1; the CLASS 1 region lies above the line and the CLASS 0 region below it]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 20 / 48


Relax strong assumptions of NB

Conditional independence assumption of NB is often incorrect, and could lead to suboptimal classification performance.

Relax this assumption by allowing (restricted) dependencies between attributes.

This may produce more accurate probability estimates, possibly leading to better classification performance.

This is not guaranteed, because the more complex model may be overfitting.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 21 / 48


Tree Structured BN

For tree structured Bayesian Networks there is an algorithm, due to Chow and Liu, that produces the optimal structure in polynomial time.

This algorithm is guaranteed to produce the tree structure that maximizes the loglikelihood score.

Tree structure: each node (except the root of the tree) has exactly one parent.

Why no penalty for complexity?

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 22 / 48


Mutual Information

Measure of association between (discrete) random variables X and Y:

I_P(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}

"Distance" (Kullback-Leibler divergence) between the joint distribution of X and Y, and their joint distribution under the independence assumption.

If X and Y are independent, their mutual information is zero, otherwise itis some positive quantity.
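
A minimal R sketch (not part of the original slides) that computes this quantity from a joint probability table, treating 0 · log 0 as 0 and using natural logarithms; the function name is illustrative:

mutual_information <- function(pxy) {
  px <- rowSums(pxy)                        # marginal P(x)
  py <- colSums(pxy)                        # marginal P(y)
  terms <- pxy * log(pxy / outer(px, py))   # P(x, y) log( P(x, y) / (P(x) P(y)) )
  sum(terms[pxy > 0])                       # skip zero-probability cells
}

mutual_information(matrix(0.25, 2, 2))      # 0: the uniform joint of two independent coins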

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 23 / 48


Algorithm Construct-Tree of Chow and Liu

Compute I_P̂D(Xi; Xj) between each pair of variables (O(nk²))

Build complete undirected graph with weights I_P̂D(Xi; Xj)

Build a maximum weighted spanning tree (O(k² log k))

Choose root, and let all edges point away from it
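
Below is a compact R sketch of Construct-Tree (illustrative only, not code from the slides): pairwise empirical mutual information followed by Prim's algorithm for a maximum weighted spanning tree. The function and variable names are made up, and the input is assumed to be a data frame of discrete variables.

construct_tree <- function(data) {
  k <- ncol(data)
  # pairwise empirical mutual information I(Xi; Xj)
  W <- matrix(0, k, k)
  for (i in 1:(k - 1)) for (j in (i + 1):k) {
    pxy <- table(data[[i]], data[[j]]) / nrow(data)
    px <- rowSums(pxy); py <- colSums(pxy)
    W[i, j] <- W[j, i] <- sum((pxy * log(pxy / outer(px, py)))[pxy > 0])
  }
  # Prim's algorithm: repeatedly add the heaviest edge leaving the current tree
  in_tree <- c(TRUE, rep(FALSE, k - 1))     # node 1 is taken as the root here
  parent  <- rep(NA_integer_, k)
  for (step in 1:(k - 1)) {
    best <- c(NA, NA); best_w <- -Inf
    for (i in which(in_tree)) for (j in which(!in_tree)) {
      if (W[i, j] > best_w) { best_w <- W[i, j]; best <- c(i, j) }
    }
    parent[best[2]] <- best[1]              # direct the new edge away from the root
    in_tree[best[2]] <- TRUE
  }
  parent                                    # parent[i] is the parent of node i (NA = root)
}

On the example data set of the next slide, the heaviest edge (X1–X4, with weight 0.55) is the first edge this sketch adds.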

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 24 / 48


Example Data Set

obs   X1   X2   X3   X4
  1    1    1    1    1
  2    1    1    1    1
  3    1    1    2    1
  4    1    2    2    1
  5    1    2    2    2
  6    2    1    1    2
  7    2    1    2    3
  8    2    1    2    3
  9    2    2    2    3
 10    2    2    1    3

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 25 / 48


Mutual Information

I_{\hat{P}_D}(X_1; X_4) = \sum_{x_1, x_4} \hat{P}_D(x_1, x_4) \log \frac{\hat{P}_D(x_1, x_4)}{\hat{P}_D(x_1)\, \hat{P}_D(x_4)}

= 0.4 \log \frac{0.4}{(0.5)(0.4)} + 0.1 \log \frac{0.1}{(0.5)(0.2)} + 0 \log \frac{0}{(0.5)(0.4)} + 0 \log \frac{0}{(0.5)(0.4)} + 0.1 \log \frac{0.1}{(0.5)(0.2)} + 0.4 \log \frac{0.4}{(0.5)(0.4)}

= 0.55
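
The same value can be checked numerically in R (a sketch, not from the slides), starting from the empirical joint distribution of X1 and X4:

# empirical joint of X1 (rows: 1, 2) and X4 (columns: 1, 2, 3) from the 10 observations
pxy <- matrix(c(0.4, 0.1, 0.0,
                0.0, 0.1, 0.4), nrow = 2, byrow = TRUE)
px <- rowSums(pxy); py <- colSums(pxy)
sum((pxy * log(pxy / outer(px, py)))[pxy > 0])   # 0.5545, i.e. 0.55 with natural logs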

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 26 / 48


Build Graph with Weights

[Figure: complete undirected graph on nodes 1–4 with mutual information edge weights: 1–4: 0.55; 2–3, 2–4, 3–4: 0.032 each; 1–2, 1–3: 0]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 27 / 48


Maximum weighted spanning tree

[Figure: a maximum weighted spanning tree on nodes 1–4; it contains the edge 1–4 (weight 0.55) plus two of the 0.032 edges]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 28 / 48


Choose root node

[Figure: the spanning tree with a root chosen and all edges directed away from it]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 29 / 48


Algorithm to construct TAN

Compute I_P̂D(Ai; Aj | C) between each pair of attributes, where

I_P(A_i; A_j \mid C) = \sum_{a_i, a_j, c} P(a_i, a_j, c) \log \frac{P(a_i, a_j \mid c)}{P(a_i \mid c)\, P(a_j \mid c)}

is the conditional mutual information between Ai and Aj given C.

Build complete undirected graph with weights I_P̂D(Ai; Aj | C)

Build a maximum weighted spanning tree

Choose root, and let all edges point away from it

Construct a TAN by adding C and an arc from C to all attributes.

This algorithm is guaranteed to produce the TAN structure with optimalloglikelihood score.
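
As an illustration (not from the slides), the conditional mutual information weights can be estimated from data in much the same way as the unconditional ones; the function and argument names below are made up:

cond_mutual_information <- function(ai, aj, cls) {
  p3 <- table(ai, aj, cls) / length(cls)            # empirical P(ai, aj, c)
  mi <- 0
  for (cv in dimnames(p3)[[3]]) {
    pc  <- sum(p3[, , cv])                          # P(C = c)
    pij <- p3[, , cv] / pc                          # P(ai, aj | C = c)
    pi_ <- rowSums(pij); pj_ <- colSums(pij)
    mi  <- mi + pc * sum((pij * log(pij / outer(pi_, pj_)))[pij > 0])
  }
  mi
}

With these weights in place of I(Xi; Xj), the remaining steps are the same as in Construct-Tree.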

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 30 / 48


TAN for Pima Indians Data

[Figure 3: a TAN model learned for the data set "pima", with class node C and attributes Pregnant, Age, Insulin, DPF, Glucose and Mass]

Figure 3. A TAN model learned for the data set "pima." The dashed lines are those edges required by the naive Bayesian classifier. The solid lines are correlation edges between attributes.

These approaches, however, remove the strong assumptions of independence in naive Bayes by finding correlations among attributes that are warranted by the training data.

4.1. Augmented naive Bayesian networks as classifiers

We argued above that the performance of a Bayesian network as a classifier may improve if the learning procedure takes into account the special status of the class variable. An easy way to ensure this is to bias the structure of the network, as in the naive Bayesian classifier, such that there is an edge from the class variable to each attribute. This ensures that, in the learned network, the probability P(C | A1, ..., An) will take all attributes into account. In order to improve the performance of a classifier based on this bias, we propose to augment the naive Bayes structure with edges among the attributes, when needed, thus dispensing with its strong assumptions about independence. We call these structures augmented naive Bayesian networks and these edges augmenting edges.

In an augmented structure, an edge from Ai to Aj implies that the influence of Ai on the assessment of the class variable also depends on the value of Aj. For example, in Figure 3, the influence of the attribute "Glucose" on the class C depends on the value of "Insulin," while in the naive Bayesian classifier the influence of each attribute on the class variable is independent of other attributes. These edges affect the classification process in that a value of "Glucose" that is typically surprising (i.e., P(g|c) is low) may be unsurprising if the value of its correlated attribute, "Insulin," is also unlikely (i.e., P(g|c, i) is high). In this situation, the naive Bayesian classifier will overpenalize the probability of the class variable by considering two unlikely observations, while the augmented network of Figure 3 will not.

Adding the best set of augmenting edges is an intractable problem, since it is equivalent to learning the best Bayesian network among those in which C is a root. Thus, even if we could improve the performance of a naive Bayes classifier in this way, the computational effort required may not be worthwhile. However, by imposing acceptable restrictions on the form of the allowed interactions, we can actually learn the optimal set of augmenting edges in polynomial time.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 31 / 48


Interpretation of TANs

Just like NB models, TANs are equivalent to their undirected counterparts (why?).

[Figure: the Pima TAN with abbreviated node labels (C, P = Pregnant, A = Age, I = Insulin, D = DPF, M = Mass, G = Glucose)]

Since there is an edge between Pregnant and Age, the influence of Age on the class label depends on (is different for different values of) Pregnant (and vice versa).

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 32 / 48


Smoothing by adding “prior counts”

Sometimes we have few observations to estimate (conditional) probabilities.

Add "prior counts" to "smooth" the estimates.

\hat{p}^{s}(x_i \mid x_{pa(i)}) = \frac{n(x_{pa(i)})\, \hat{p}(x_i \mid x_{pa(i)}) + m(x_{pa(i)})\, p^{0}(x_i \mid x_{pa(i)})}{n(x_{pa(i)}) + m(x_{pa(i)})}

where m(x_pa(i)) is the prior precision, p̂^s(x_i | x_pa(i)) is the smoothed estimate, and p^0(x_i | x_pa(i)) is our prior estimate of p(x_i | x_pa(i)).

Common to take m(xpa(i)) to be the same for all parent configurations.

Weighted average of ML estimate and prior estimate.
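
A one-line R version of this smoothed estimate (a sketch using the notation above, not code from the slides):

# n_pa = n(x_pa(i)), p_ml = maximum likelihood estimate, p0 = prior estimate, m = prior precision
smooth_estimate <- function(p_ml, n_pa, p0, m) {
  (n_pa * p_ml + m * p0) / (n_pa + m)
}

smooth_estimate(p_ml = 0,   n_pa = 2,  p0 = 0.4, m = 2)   # 0.2, as in the example two slides ahead
smooth_estimate(p_ml = 0.5, n_pa = 10, p0 = 0.5, m = 2)   # 0.5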

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 33 / 48


ML estimates

For example

\hat{p}_1(1) = \frac{n(x_1 = 1)}{n} = \frac{5}{10} = \frac{1}{2}

and

\hat{p}_{3|1,2}(1 \mid 1, 2) = \frac{n(x_1 = 1, x_2 = 2, x_3 = 1)}{n(x_1 = 1, x_2 = 2)} = \frac{0}{2} = 0

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 34 / 48


Smoothed estimates

Suppose we set m = 2, and p^0(x_i | x_pa(i)) = p̂(x_i).

Then we get

\hat{p}^{s}_{1}(1) = \frac{10 \times 0.5 + 2 \times 0.5}{10 + 2} = \frac{1}{2}

and

\hat{p}^{s}_{3|1,2}(1 \mid 1, 2) = \frac{2 \times 0 + 2 \times 0.4}{2 + 2} = 0.2

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 35 / 48


Bayesian Multinets

Build structure on attributes for each class separately and use

P_M(C = i, A_1, \ldots, A_k) = P(C = i)\, P_M(A_1, \ldots, A_k \mid C = i) = P(C = i)\, P_{M_i}(A_1, \ldots, A_k), \quad i = 1, \ldots, c

Using trees for class-conditional structures

Split D into c partitions, D_1, D_2, ..., D_c, where c is the number of distinct values of class label C. D_i contains all records in D with C = i.

Set P(C = i) = P̂_D(C = i) for i = 1, ..., c.

Apply Construct-Tree on Di to construct Mi
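
A brief R sketch of the multinet procedure (illustrative only; it assumes the class C is the first column of the data frame D and reuses a construct_tree() routine like the one sketched earlier):

build_multinet <- function(D) {
  prior <- table(D[[1]]) / nrow(D)                        # P(C = i)
  trees <- lapply(split(D[-1], D[[1]]), construct_tree)   # one tree Mi per class partition Di
  list(prior = prior, trees = trees)
}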

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 36 / 48


Experimental Study of Friedman et al.

The following classifiers were compared:

NB: Naive Bayes Classifier

SNB: Naive Bayes with attribute selection

BN: Unrestricted Bayesian Network (Markov Blanket of Class)

C4.5: Classification Tree

And also

                        structure per class
   structure          same        different
   tree               TAN         CL
   dag                ANB         MN

Superscript s indicates smoothing of parameter estimates.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 37 / 48


Table 1. Description of data sets used in the experiments.

      Dataset          # Attributes   # Classes   Instances (Train)   Instances (Test)
  1   australian            14            2              690               CV-5
  2   breast                10            2              683               CV-5
  3   chess                 36            2             2130               1066
  4   cleve                 13            2              296               CV-5
  5   corral                 6            2              128               CV-5
  6   crx                   15            2              653               CV-5
  7   diabetes               8            2              768               CV-5
  8   flare                 10            2             1066               CV-5
  9   german                20            2             1000               CV-5
 10   glass                  9            7              214               CV-5
 11   glass2                 9            2              163               CV-5
 12   heart                 13            2              270               CV-5
 13   hepatitis             19            2               80               CV-5
 14   iris                   4            3              150               CV-5
 15   letter                16           26            15000               5000
 16   lymphography          18            4              148               CV-5
 17   mofn-3-7-10           10            2              300               1024
 18   pima                   8            2              768               CV-5
 19   satimage              36            6             4435               2000
 20   segment               19            7             1540                770
 21   shuttle-small          9            7             3866               1934
 22   soybean-large         35           19              562               CV-5
 23   vehicle               18            4              846               CV-5
 24   vote                  16            2              435               CV-5
 25   waveform-21           21            3              300               4700

The accuracy of each classifier is based on the percentage of successful predictions on the test sets of each data set. We used the MLC++ system (Kohavi et al., 1994) to estimate the prediction accuracy for each classifier, as well as the variance of this accuracy. Accuracy was measured via the holdout method for the larger data sets (that is, the learning procedures were given a subset of the instances and were evaluated on the remaining instances), and via five-fold cross validation, using the methods described by Kohavi (1995), for the smaller ones.

Since we do not deal, at present, with missing data, we removed instances with missing values from the data sets. Currently, we also do not handle continuous attributes. Instead, we applied a pre-discretization step in the manner described by Dougherty et al. (1995). This pre-discretization is based on a variant of Fayyad and Irani's (1993) discretization method. These preprocessing stages were carried out by the MLC++ system. Runs with the various learning procedures were carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds were the same for all the experiments on each data set.

Table 2 displays the accuracies of the main classification approaches we have discussed throughout the paper using the abbreviations:

NB: the naive Bayesian classifier

BN: unrestricted Bayesian networks learned with the MDL score

TANs: TAN networks learned according to Theorem 2, with smoothed parameters

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 38 / 48


Table 2. Experimental results of the primary approaches discussed in this paper.

[...] performance. For the naive Bayesian classifier, he shows that, under certain conditions, the low variance associated with this classifier can dramatically mitigate the effect of the high bias that results from the strong independence assumptions.

One goal of the work described in this paper has been to improve the performance of the naive Bayesian classifier by relaxing these independence assumptions. Indeed, our empirical results indicate that a more accurate modeling of the dependencies amongst features leads to improved classification. Previous extensions to the naive Bayesian classifier also identified the strong independence assumptions as the source of classification errors, but differ in how they address this problem. These works fall into two categories.

Work in the first category, such as that of Langley and Sage (1994) and of John and Kohavi (1997), has attempted to improve prediction accuracy by rendering some of the attributes irrelevant. The rationale is as follows. As we explained in Section 4.1, if two attributes, say Ai and Aj, are correlated, then the naive Bayesian classifier may overamplify the weight of the evidence of these two attributes on the class. The proposed solution in this category is simply to ignore one of these two attributes. (Removing attributes is also useful if some attributes are irrelevant, since they only introduce noise in the classification problem.) This is a straightforward application of feature subset selection. The usual approach to this problem is to search for a good subset of the attributes, using an estimation scheme, such as cross validation, to repeatedly evaluate the predictive accuracy of the naive Bayesian classifier on various subsets. The resulting classifier is called the selective naive Bayesian classifier, following Langley and Sage (1994).

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 39 / 48


Table 3. Experimental results describing the effect of smoothing parameters.

It is clear that, if two attributes are perfectly correlated, then the removal of one can only improve the performance of the naive Bayesian classifier. Problems arise, however, if two attributes are only partially correlated. In these cases the removal of an attribute may lead to the loss of useful information, and the selective naive Bayesian classifier may still retain both attributes. In addition, this wrapper-based approach is, in general, computationally expensive. Our experimental results (see Figure 6) show that the methods we examine here are usually more accurate than the selective naive Bayesian classifier as used by John and Kohavi (1997).

Work in the second category (Kononenko, 1991; Pazzani, 1995; Ezawa & Schuermann, 1995) are closer in spirit to our proposal, since they attempt to improve the predictive accuracy by removing some of the independence assumptions. The semi-naive Bayesian classifier (Kononenko, 1991) is a model of the form

P(C, A_1, \ldots, A_n) = P(C) \cdot P(\mathbf{A}_1 \mid C) \cdots P(\mathbf{A}_k \mid C)    (9)

where \mathbf{A}_1, ..., \mathbf{A}_k are pairwise disjoint groups of attributes. Such a model assumes that A_i is conditionally independent of A_j if, and only if, they are in different groups. Thus, no assumption of independence is made about attributes that are in the same group. Kononenko's method uses statistical tests of independence to partition the attributes into groups. This procedure, however, tends to select large groups, which can lead to overfitting problems. The number of parameters needed to estimate P(\mathbf{A}_i \mid C) is |Val(C)| \cdot (\prod_{A_j \in \mathbf{A}_i} |Val(A_j)| - 1), which grows exponentially with the number of attributes in the group. Thus, the parameters

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 40 / 48


Table 4. Experimental results of comparing tree-like networks with unrestricted augmented naive Bayes and multinets.

assessed for P(\mathbf{A}_i \mid C) may quickly become unreliable if \mathbf{A}_i contains more than two or three attributes.

Pazzani suggests that this problem can be solved by using a cross-validation scheme to evaluate the accuracy of a classifier. His procedure starts with singleton groups (i.e., \mathbf{A}_1 = {A_1}, ..., \mathbf{A}_n = {A_n}) and then combines, in a greedy manner, pairs of groups. (He also examines a procedure that performs feature subset selection after the stage of joining attributes.) This procedure does not, in general, select large groups, since these lead to poor prediction accuracy in the cross-validation test. Thus, Pazzani's procedure learns classifiers that partition the attributes into many small groups. Since each group of attributes is considered independent of the rest given the class, these classifiers can capture only a small number of correlations among the attributes.

Both Kononenko and Pazzani essentially assume that all of the attributes in each group \mathbf{A}_i can be arbitrarily correlated. To understand the implications of this assumption, we use a Bayesian network representation. If we let \mathbf{A}_i = {A_{i_1}, ..., A_{i_l}}, then, using the chain rule, we get

P(\mathbf{A}_i \mid C) = P(A_{i_1} \mid C) \cdot P(A_{i_2} \mid A_{i_1}, C) \cdots P(A_{i_l} \mid A_{i_1}, \ldots, A_{i_{l-1}}, C).

Applying this decomposition to each of the terms in Equation 9, we get a product form from which a Bayesian network can be built. Indeed, this is an augmented naive Bayes network, in which there is a complete subgraph, that is, one to which we cannot add arcs [...]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 41 / 48


Example: data on death penalty

Data provided by the Georgia parole board.

Variable    Description
death       Did the defendant get the death penalty?
blkdef      Is the defendant black?
whtvict     Is the victim white?
aggcirc     Number of aggravating circumstances
stranger    Were victim and defendant strangers?

We have 100 observations.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 42 / 48


Example in bnlearn package

> death.small[1:5,]

death blkdef whtvict aggcirc stranger

1 0 1 0 1 0

2 0 1 0 1 0

3 0 1 0 1 0

4 1 1 0 2 0

5 0 1 0 1 0

> summary(death.small)

death blkdef whtvict aggcirc stranger

0:51 0:47 0:26 1:22 0:49

1:49 1:53 1:74 2:78 1:51

# fit naive Bayes with "death" as class variable
> death.nb <- naive.bayes("death",data=death.small)

# predict class on training sample
> death.nb.pred <- predict(death.nb,death.small)

# make confusion matrix
> table(death.small[,1],death.nb.pred)

death.nb.pred

0 1

0 36 15

1 12 37

# compute accuracy
> sum(diag(table(death.small[,1],death.nb.pred))/nrow(death.small))

[1] 0.73

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 43 / 48


Example in bnlearn package

# fit TAN to death penalty data with "death" as class variable

> death.tan <- tree.bayes(death.small,"death")

# fit TAN parameters using maximum likelihood estimation ("mle")

> death.tan.fit <- bn.fit(death.tan,death.small,"mle")

# predict class on training sample

> death.tan.pred <- predict(death.tan.fit,death.small)

# make confusion matrix

> table(death.small[,1],death.tan.pred)

death.tan.pred

0 1

0 32 19

1 8 41

# compute accuracy

> sum(diag(table(death.small[,1],death.tan.pred))/nrow(death.small))

[1] 0.73

# plot the TAN structure

> plot(death.tan)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 44 / 48


TAN for death penalty data

[Figure: TAN structure learned for the death penalty data, with class node death and attribute nodes blkdef, whtvict, aggcirc and stranger]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 45 / 48


Example in bnlearn package

# learn network structure with hill-climber

> death.hc <- hc(death.small)

> plot(death.hc)

> death.hc.fit <- bn.fit(death.hc,death.small,"mle")

> death.hc.pred <- predict(death.hc.fit,node="death",data=death.small)

> table(death.small[,1],death.hc.pred)

death.hc.pred

0 1

0 20 31

1 6 43

> sum(diag(table(death.small[,1],death.hc.pred))/nrow(death.small))

[1] 0.63

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 46 / 48


Network structure with hill climbing for death penalty data

[Figure: network structure learned by hill climbing for the death penalty data, over the nodes death, blkdef, whtvict, aggcirc and stranger]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 47 / 48


Network structure with hill climbing for death penalty data

[Figure: the same hill-climbing network structure for the death penalty data, over the nodes death, blkdef, whtvict, aggcirc and stranger]

Death penalty and race of defendant are independent given race of victim.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 48 / 48