
Page 1: Data Mining 2016 Bayesian Network Classifiers

Data Mining 2016: Bayesian Network Classifiers

Ad Feelders

Universiteit Utrecht

Ad Feelders ( Universiteit Utrecht ) Data Mining 1 / 48

Page 2: Data Mining 2016 Bayesian Network Classifiers

Literature

N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning, 29, pp. 131-163 (1997) (except section 6)

Ad Feelders ( Universiteit Utrecht ) Data Mining 2 / 48

Page 3: Data Mining 2016 Bayesian Network Classifiers

Bayesian Network Classifiers

Bayesian Networks are models of the joint distribution of a collection of random variables. The joint distribution is simplified by introducing independence assumptions.

In many applications we are in fact interested in the conditional distribution of one variable (the class variable) given the other variables (attributes).

Can we use Bayesian Networks as classifiers?

Ad Feelders ( Universiteit Utrecht ) Data Mining 3 / 48

Page 4: Data Mining 2016 Bayesian Network Classifiers

The Naive Bayes Classifier

[Figure: the naive Bayes structure, with arcs C → A1, C → A2, . . . , C → Ak.]

This Bayesian Network is equivalent to its undirected version (why?):

[Figure: the same structure with the arrows dropped: undirected edges between C and each Ai.]

Attributes are independent given the class label.

Ad Feelders ( Universiteit Utrecht ) Data Mining 4 / 48

Page 5: Data Mining 2016 Bayesian Network Classifiers

The Naive Bayes Classifier

[Figure: the naive Bayes structure, with arcs C → A1, . . . , C → Ak.]

BN factorisation:

P(X) = ∏_{i=1}^{k} P(Xi | X_pa(i)),

so the factorisation corresponding to the NB classifier is:

P(C, A1, . . . , Ak) = P(C) P(A1|C) · · · P(Ak|C)

Ad Feelders ( Universiteit Utrecht ) Data Mining 5 / 48

Page 6: Data Mining 2016 Bayesian Network Classifiers

Naive Bayes assumption

Via Bayes rule we have

P(C = i | A) = P(A1, A2, . . . , Ak, C = i) / P(A1, A2, . . . , Ak)    (product rule)

= [ P(A1, A2, . . . , Ak | C = i) P(C = i) ] / [ Σ_{j=1}^{c} P(A1, A2, . . . , Ak | C = j) P(C = j) ]    (product rule and sum rule)

= [ P(A1|C = i) P(A2|C = i) · · · P(Ak|C = i) P(C = i) ] / [ Σ_{j=1}^{c} P(A1|C = j) P(A2|C = j) · · · P(Ak|C = j) P(C = j) ]    (NB factorisation)

Ad Feelders ( Universiteit Utrecht ) Data Mining 6 / 48

Page 7: Data Mining 2016 Bayesian Network Classifiers

Why Naive Bayes is competitive

The conditional independence assumption is often clearly inappropriate, yet the predictive accuracy of Naive Bayes is competitive with more complex classifiers. How come?

Probability estimates of Naive Bayes may be way off, but this does not necessarily result in wrong classification!

Naive Bayes has only a few parameters compared to more complex models, so it can estimate parameters more reliably.

Ad Feelders ( Universiteit Utrecht ) Data Mining 7 / 48

Page 8: Data Mining 2016 Bayesian Network Classifiers

Naive Bayes: Example

P(C = 0) = 0.4, P(C = 1) = 0.6

C = 0:
              A2 = 0   A2 = 1   P(A1)
  A1 = 0        0.2      0.1     0.3
  A1 = 1        0.1      0.6     0.7
  P(A2)         0.3      0.7     1

C = 1:
              A2 = 0   A2 = 1   P(A1)
  A1 = 0        0.5      0.2     0.7
  A1 = 1        0.1      0.2     0.3
  P(A2)         0.6      0.4     1

We have that

P(C = 1 | A1 = 0, A2 = 0) = (0.5 × 0.6) / (0.5 × 0.6 + 0.2 × 0.4) = 0.79

According to naive Bayes

P(C = 1 | A1 = 0, A2 = 0) = (0.7 × 0.6 × 0.6) / (0.7 × 0.6 × 0.6 + 0.3 × 0.3 × 0.4) = 0.88

Naive Bayes assigns to the right class.

Ad Feelders ( Universiteit Utrecht ) Data Mining 8 / 48
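As a quick check, the two posteriors above can be reproduced in a few lines of R (a sketch of ours, not part of the original slides; the object names are made up):

# class prior
p.c <- c(0.4, 0.6)                                   # P(C = 0), P(C = 1)

# joint distribution of (A1, A2) within each class; rows A1 = 0,1, columns A2 = 0,1
p.a.c0 <- matrix(c(0.2, 0.1,
                   0.1, 0.6), nrow = 2, byrow = TRUE)
p.a.c1 <- matrix(c(0.5, 0.2,
                   0.1, 0.2), nrow = 2, byrow = TRUE)

# exact posterior P(C = 1 | A1 = 0, A2 = 0): 0.5 * 0.6 / (0.5 * 0.6 + 0.2 * 0.4), the 0.79 above
p.a.c1[1, 1] * p.c[2] / (p.a.c1[1, 1] * p.c[2] + p.a.c0[1, 1] * p.c[1])

# naive Bayes uses the class-conditional marginals P(A1 | C) and P(A2 | C) instead
p.a1.c1 <- rowSums(p.a.c1); p.a2.c1 <- colSums(p.a.c1)   # (0.7, 0.3) and (0.6, 0.4)
p.a1.c0 <- rowSums(p.a.c0); p.a2.c0 <- colSums(p.a.c0)   # (0.3, 0.7) and (0.3, 0.7)

# NB posterior P(C = 1 | A1 = 0, A2 = 0): the 0.88 above
p.a1.c1[1] * p.a2.c1[1] * p.c[2] /
  (p.a1.c1[1] * p.a2.c1[1] * p.c[2] + p.a1.c0[1] * p.a2.c0[1] * p.c[1])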

Page 9: Data Mining 2016 Bayesian Network Classifiers

What about this model?

[Figure: a network in which every attribute is a parent of C: arcs A1 → C, A2 → C, . . . , Ak → C.]

BN factorisation:

P(X) = ∏_{i=1}^{k} P(Xi | X_pa(i)),

so the factorisation is:

P(C, A1, . . . , Ak) = P(C | A1, . . . , Ak) P(A1) · · · P(Ak)

Ad Feelders ( Universiteit Utrecht ) Data Mining 9 / 48

Page 11: Data Mining 2016 Bayesian Network Classifiers

Bayesian Networks as Classifiers

[Figure: a Bayesian network over the class variable C and attributes A1, . . . , A8.]

Markov Blanket: Parents, Children and Parents of Children.

Ad Feelders ( Universiteit Utrecht ) Data Mining 10 / 48

Page 12: Data Mining 2016 Bayesian Network Classifiers

Markov Blanket of C: Moral Graph

[Figure: the moral graph of the network on the previous slide; the neighbours of C in the moral graph form its Markov blanket.]

Markov Blanket: Parents, Children and Parents of Children.

Local Markov property: C ⊥⊥ rest | boundary(C)

Ad Feelders ( Universiteit Utrecht ) Data Mining 11 / 48

Page 13: Data Mining 2016 Bayesian Network Classifiers

Bayesian Networks as Classifiers

Loglikelihood under model M is

L(M|D) = Σ_{j=1}^{n} log P_M(X^(j)) = Σ_{j=1}^{n} log P_M(A^(j), C^(j))

where A^(j) = (A_1^(j), A_2^(j), . . . , A_k^(j)).

We can rewrite this as (product rule and log ab = log a + log b):

L(M|D) = Σ_{j=1}^{n} log P_M(C^(j) | A^(j)) + Σ_{j=1}^{n} log P_M(A^(j))

If there are many attributes, the second term will dominate the loglikelihood score.

But we are not interested in modeling the distribution of the attributes!

Ad Feelders ( Universiteit Utrecht ) Data Mining 12 / 48
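To see how lopsided the two terms typically are, here is a small simulation of ours (not from the slides): data are generated from a naive Bayes model with k = 20 binary attributes, and both terms of the decomposition are evaluated under the true model.

set.seed(1)
n <- 1000; k <- 20
prior <- c(0.4, 0.6)                        # P(C = 0), P(C = 1)
theta <- rbind(rep(0.3, k), rep(0.7, k))    # theta[c + 1, i] = P(Ai = 1 | C = c)

C <- rbinom(n, 1, prior[2])
A <- matrix(rbinom(n * k, 1, theta[C + 1, ]), n, k)

# log P_M(A, C = c) for both values of c
logjoint <- sapply(0:1, function(cc)
  log(prior[cc + 1]) + A %*% log(theta[cc + 1, ]) + (1 - A) %*% log(1 - theta[cc + 1, ]))

logPA  <- apply(logjoint, 1, function(v) max(v) + log(sum(exp(v - max(v)))))  # log P_M(A) via log-sum-exp
logPCgA <- logjoint[cbind(1:n, C + 1)] - logPA                                # log P_M(C | A) of the observed class

sum(logPCgA)   # first term: conditional loglikelihood, modest in size
sum(logPA)     # second term: far larger in magnitude; it dominates the score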

Page 14: Data Mining 2016 Bayesian Network Classifiers

Bayesian Networks as Classifiers

[Figure: −log P(x) plotted against P(x) for P(x) between 0 and 1; the penalty −log P(x) grows without bound as P(x) approaches 0.]

Ad Feelders ( Universiteit Utrecht ) Data Mining 13 / 48

Page 15: Data Mining 2016 Bayesian Network Classifiers


Table 1. Description of data sets used in the experiments.

     Dataset         # Attributes   # Classes   # Instances (Train)   Test
  1  australian      14             2             690                 CV-5
  2  breast          10             2             683                 CV-5
  3  chess           36             2            2130                 1066
  4  cleve           13             2             296                 CV-5
  5  corral           6             2             128                 CV-5
  6  crx             15             2             653                 CV-5
  7  diabetes         8             2             768                 CV-5
  8  flare           10             2            1066                 CV-5
  9  german          20             2            1000                 CV-5
 10  glass            9             7             214                 CV-5
 11  glass2           9             2             163                 CV-5
 12  heart           13             2             270                 CV-5
 13  hepatitis       19             2              80                 CV-5
 14  iris             4             3             150                 CV-5
 15  letter          16            26           15000                 5000
 16  lymphography    18             4             148                 CV-5
 17  mofn-3-7-10     10             2             300                 1024
 18  pima             8             2             768                 CV-5
 19  satimage        36             6            4435                 2000
 20  segment         19             7            1540                  770
 21  shuttle-small    9             7            3866                 1934
 22  soybean-large   35            19             562                 CV-5
 23  vehicle         18             4             846                 CV-5
 24  vote            16             2             435                 CV-5
 25  waveform-21     21             3             300                 4700

The accuracy of each classifier is based on the percentage of successful predictions on the test sets of each data set. We used the MLC++ system (Kohavi et al., 1994) to estimate the prediction accuracy for each classifier, as well as the variance of this accuracy. Accuracy was measured via the holdout method for the larger data sets (that is, the learning procedures were given a subset of the instances and were evaluated on the remaining instances), and via five-fold cross validation, using the methods described by Kohavi (1995), for the smaller ones.

Since we do not deal, at present, with missing data, we removed instances with missing values from the data sets. Currently, we also do not handle continuous attributes. Instead, we applied a pre-discretization step in the manner described by Dougherty et al. (1995). This pre-discretization is based on a variant of Fayyad and Irani's (1993) discretization method. These preprocessing stages were carried out by the MLC++ system. Runs with the various learning procedures were carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds were the same for all the experiments on each data set.

Table 2 displays the accuracies of the main classification approaches we have discussed throughout the paper using the abbreviations:

NB: the naive Bayesian classifier

BN: unrestricted Bayesian networks learned with the MDL score

TANs: TAN networks learned according to Theorem 2, with smoothed parameters

Ad Feelders ( Universiteit Utrecht ) Data Mining 14 / 48

Page 16: Data Mining 2016 Bayesian Network Classifiers

Naive Bayes vs. Unrestricted BN

[Figure 2 (reproduced from Friedman et al.): error curves per data set and a scatter plot of naive Bayes error (y axis) against Bayesian network error (x axis); see the caption below.]

Figure 2. Error curves (top) and scatter plot (bottom) comparing unsupervised Bayesian networks (solid line, x axis) to naive Bayes (dashed line, y axis). In the error curves, the horizontal axis lists the data sets, which are sorted so that the curves cross only once, and the vertical axis measures the percentage of test instances that were misclassified (i.e., prediction errors). Thus, the smaller the value, the better the accuracy. Each data point is annotated by a 90% confidence interval. In the scatter plot, each point represents a data set, where the x coordinate of a point is the percentage of misclassifications according to unsupervised Bayesian networks and the y coordinate is the percentage of misclassifications according to naive Bayes. Thus, points above the diagonal line correspond to data sets on which unrestricted Bayesian networks perform better, and points below the diagonal line correspond to data sets on which naive Bayes performs better.

To confirm this hypothesis, we conducted an experiment comparing the classification accuracy of Bayesian networks learned using the MDL score (i.e., classifiers based on unrestricted networks) to that of the naive Bayesian classifier. We ran this experiment on

Ad Feelders ( Universiteit Utrecht ) Data Mining 15 / 48

Page 17: Data Mining 2016 Bayesian Network Classifiers

Use Conditional Log-likelihood?

Discriminative vs. Generative learning.

Conditional loglikelihood function:

CL(M|D) = Σ_{j=1}^{n} log P_M(C^(j) | A_1^(j), . . . , A_k^(j))

No closed form solution for ML estimates.

Remark: can be done via Logistic Regression for models with perfect graphs (Naive Bayes, TAN's).

Ad Feelders ( Universiteit Utrecht ) Data Mining 16 / 48

Page 18: Data Mining 2016 Bayesian Network Classifiers

NB and Logistic Regression

The logistic regression assumption is

log { P(C = 1|A) / P(C = 0|A) } = α + Σ_{i=1}^{k} βi Ai,

that is, the log odds is a linear function of the attributes.

Under the naive Bayes assumption, this is exactly true.

Assign to class 1 if α + Σ_{i=1}^{k} βi Ai > 0 and to class 0 otherwise.

Logistic regression maximizes conditional likelihood under this assumption (it is a so-called discriminative model).

There is no closed form solution for the maximum likelihood estimates of α and βi, but the loglikelihood function is globally concave (unique global optimum).

Ad Feelders ( Universiteit Utrecht ) Data Mining 17 / 48
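A small simulation of ours illustrating the last two points: data are drawn from a naive Bayes model (the same parameter values as in the worked example a few slides further on), and glm() recovers, for large n, the coefficients implied by that model.

set.seed(1)
n <- 100000
C  <- rbinom(n, 1, 0.6)                       # P(C = 1) = 0.6
a1 <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.5))  # P(a1 = 1 | C = 1) = 0.8, P(a1 = 1 | C = 0) = 0.5
a2 <- rbinom(n, 1, ifelse(C == 1, 0.6, 0.3))  # P(a2 = 1 | C = 1) = 0.6, P(a2 = 1 | C = 0) = 0.3

# discriminative fit; the concave loglikelihood guarantees a unique optimum
coef(glm(C ~ a1 + a2, family = binomial))
# approaches (alpha, beta1, beta2) = (-1.07, 1.39, 1.25) as n grows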

Page 19: Data Mining 2016 Bayesian Network Classifiers

Proof (for binary attributes Ai)

Under the naive Bayes assumption we have:

P(C = 1|a) / P(C = 0|a) = [ P(a1|C = 1) · · · P(ak|C = 1) P(C = 1) ] / [ P(a1|C = 0) · · · P(ak|C = 0) P(C = 0) ]

= ∏_{i=1}^{k} { ( P(ai = 1|C = 1) / P(ai = 1|C = 0) )^ai · ( P(ai = 0|C = 1) / P(ai = 0|C = 0) )^(1−ai) } × P(C = 1) / P(C = 0)

Taking the log we get

log ( P(C = 1|a) / P(C = 0|a) ) = Σ_{i=1}^{k} { ai log ( P(ai = 1|C = 1) / P(ai = 1|C = 0) ) + (1 − ai) log ( P(ai = 0|C = 1) / P(ai = 0|C = 0) ) } + log ( P(C = 1) / P(C = 0) )

Ad Feelders ( Universiteit Utrecht ) Data Mining 18 / 48

Page 20: Data Mining 2016 Bayesian Network Classifiers

Proof (continued)

Expand and collect terms.

log ( P(C = 1|a) / P(C = 0|a) ) = Σ_{i=1}^{k} ai βi + α,

where

βi = log [ P(ai = 1|C = 1) P(ai = 0|C = 0) ] / [ P(ai = 1|C = 0) P(ai = 0|C = 1) ]

and

α = Σ_{i=1}^{k} log [ P(ai = 0|C = 1) / P(ai = 0|C = 0) ] + log [ P(C = 1) / P(C = 0) ],

which is a linear function of a.

Ad Feelders ( Universiteit Utrecht ) Data Mining 19 / 48

Page 21: Data Mining 2016 Bayesian Network Classifiers

Example

Suppose P(C = 1) = 0.6, P(a1 = 1|C = 1) = 0.8, P(a1 = 1|C = 0) = 0.5, P(a2 = 1|C = 1) = 0.6, P(a2 = 1|C = 0) = 0.3.

Then

log ( P(C = 1|a1, a2) / P(C = 0|a1, a2) ) = 1.386 a1 + 1.253 a2 − 1.476 + 0.405
                                          = −1.071 + 1.386 a1 + 1.253 a2

Classify a point with a1 = 1 and a2 = 0:

log ( P(C = 1|1, 0) / P(C = 0|1, 0) ) = −1.071 + 1.386 × 1 + 1.253 × 0 = 0.315

Decision rule: assign to class 1 if α + Σ_{i=1}^{k} βi Ai > 0 and to class 0 otherwise. Linear decision boundary.

Ad Feelders ( Universiteit Utrecht ) Data Mining 20 / 48
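The numbers above follow directly from the βi and α expressions of the proof; a short R check (ours):

p1  <- 0.6                     # P(C = 1)
pa1 <- c(0.5, 0.8)             # P(a1 = 1 | C = 0), P(a1 = 1 | C = 1)
pa2 <- c(0.3, 0.6)             # P(a2 = 1 | C = 0), P(a2 = 1 | C = 1)

beta1 <- log((pa1[2] / pa1[1]) * ((1 - pa1[1]) / (1 - pa1[2])))            # 1.386
beta2 <- log((pa2[2] / pa2[1]) * ((1 - pa2[1]) / (1 - pa2[2])))            # 1.253
alpha <- log((1 - pa1[2]) / (1 - pa1[1])) +
         log((1 - pa2[2]) / (1 - pa2[1])) + log(p1 / (1 - p1))             # -1.071

alpha + beta1 * 1 + beta2 * 0        # log odds for a1 = 1, a2 = 0: 0.315, so class 1
c(-alpha / beta2, -beta1 / beta2)    # the boundary on the next slide: a2 = 0.855 - 1.106 a1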

Page 22: Data Mining 2016 Bayesian Network Classifiers

Linear Decision Boundary

[Figure: the decision boundary a2 = 0.855 − 1.106 a1 in the (A1, A2) unit square; points above the line are classified as class 1, points below as class 0.]

Ad Feelders ( Universiteit Utrecht ) Data Mining 21 / 48

Page 23: Data Mining 2016 Bayesian Network Classifiers

Relax strong assumptions of NB

Conditional independence assumption of NB is often incorrect, and could lead to suboptimal classification performance.

Relax this assumption by allowing (restricted) dependencies between attributes.

This may produce more accurate probability estimates, possibly leading to better classification performance.

This is not guaranteed, because the more complex model may be overfitting.

Ad Feelders ( Universiteit Utrecht ) Data Mining 22 / 48

Page 24: Data Mining 2016 Bayesian Network Classifiers

Tree Structured BN

Tree structure: each node (except the root of the tree) has exactly one parent.

For tree structured Bayesian Networks there is an algorithm, due to Chow and Liu, that produces the optimal structure in polynomial time.

This algorithm is guaranteed to produce the tree structure that maximizes the loglikelihood score.

Why no penalty for complexity?

Ad Feelders ( Universiteit Utrecht ) Data Mining 23 / 48

Page 25: Data Mining 2016 Bayesian Network Classifiers

Mutual Information

Measure of association between (discrete) random variables X and Y :

I_P(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / ( P(x) P(y) ) ]

"Distance" (Kullback-Leibler divergence) between joint distribution of X and Y, and their joint distribution under the independence assumption.

If X and Y are independent, their mutual information is zero, otherwise it is some positive quantity.

Ad Feelders ( Universiteit Utrecht ) Data Mining 24 / 48
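A small helper of ours for the next slides: it computes the empirical mutual information of two discrete vectors with natural logarithms, treating the 0 log 0 terms as 0.

mutual.info <- function(x, y) {
  pxy <- table(x, y) / length(x)            # joint relative frequencies P(x, y)
  px  <- rowSums(pxy); py <- colSums(pxy)   # marginals P(x) and P(y)
  terms <- pxy * log(pxy / outer(px, py))   # P(x, y) log [ P(x, y) / (P(x) P(y)) ]
  sum(terms[pxy > 0])                       # skip the cells with P(x, y) = 0
}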

Page 26: Data Mining 2016 Bayesian Network Classifiers

Algorithm Construct-Tree of Chow and Liu

Compute I_P̂D(Xi; Xj) between each pair of variables. (O(nk²))

Build complete undirected graph with weights I_P̂D(Xi; Xj)

Build a maximum weighted spanning tree (O(k² log k))

Choose root, and let all edges point away from it

Ad Feelders ( Universiteit Utrecht ) Data Mining 25 / 48

Page 27: Data Mining 2016 Bayesian Network Classifiers

Example Data Set

obs   X1   X2   X3   X4
  1    1    1    1    1
  2    1    1    1    1
  3    1    1    2    1
  4    1    2    2    1
  5    1    2    2    2
  6    2    1    1    2
  7    2    1    2    3
  8    2    1    2    3
  9    2    2    2    3
 10    2    2    1    3

Ad Feelders ( Universiteit Utrecht ) Data Mining 26 / 48

Page 28: Data Mining 2016 Bayesian Network Classifiers

Mutual Information

I_P̂D(X1; X4) = Σ_{x1,x4} P̂D(x1, x4) log [ P̂D(x1, x4) / ( P̂D(x1) P̂D(x4) ) ]

= 0.4 log [ 0.4 / (0.5)(0.4) ] + 0.1 log [ 0.1 / (0.5)(0.2) ]
+ 0 log [ 0 / (0.5)(0.4) ] + 0 log [ 0 / (0.5)(0.4) ]
+ 0.1 log [ 0.1 / (0.5)(0.2) ] + 0.4 log [ 0.4 / (0.5)(0.4) ]

= 0.55

Ad Feelders ( Universiteit Utrecht ) Data Mining 27 / 48
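Using the mutual.info helper sketched a few slides back, the remaining steps of Construct-Tree on this example can be reproduced in R (our sketch; the Kruskal loop for the maximum weighted spanning tree is written out by hand):

d <- data.frame(
  X1 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  X2 = c(1, 1, 1, 2, 2, 1, 1, 1, 2, 2),
  X3 = c(1, 1, 2, 2, 2, 1, 2, 2, 2, 1),
  X4 = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3))

# edge weights of the complete graph: all pairwise mutual informations
edges <- t(combn(names(d), 2))
w <- apply(edges, 1, function(p) mutual.info(d[[p[1]]], d[[p[2]]]))
cbind(edges, weight = round(w, 3))   # about 0.55 for X1-X4, 0.032 for the pairs among X2, X3, X4, 0 otherwise

# maximum weighted spanning tree (Kruskal: take the heaviest edges that do not close a cycle)
comp <- setNames(names(d), names(d))    # component label of each node
tree <- NULL
for (i in order(w, decreasing = TRUE)) {
  a <- edges[i, 1]; b <- edges[i, 2]
  if (comp[a] != comp[b]) {             # different components: the edge is safe to add
    tree <- rbind(tree, edges[i, ])
    comp[comp == comp[b]] <- comp[a]    # merge the two components
  }
}
tree   # three edges, always including X1-X4; finally pick a root and direct edges away from it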

Page 29: Data Mining 2016 Bayesian Network Classifiers

Build Graph with Weights

[Figure: the complete undirected graph on X1, X2, X3, X4 with edge weights I(X1; X4) = 0.55, I(X2; X3) = I(X2; X4) = I(X3; X4) = 0.032, and I(X1; X2) = I(X1; X3) = 0.]

Ad Feelders ( Universiteit Utrecht ) Data Mining 28 / 48

Page 30: Data Mining 2016 Bayesian Network Classifiers

Maximum weighted spanning tree

[Figure: a maximum weighted spanning tree on X1, . . . , X4; it contains the edge between X1 and X4 plus two of the three edges of weight 0.032.]

Ad Feelders ( Universiteit Utrecht ) Data Mining 29 / 48

Page 31: Data Mining 2016 Bayesian Network Classifiers

Choose root node

[Figure: the spanning tree with a root chosen and all edges directed away from it.]

Ad Feelders ( Universiteit Utrecht ) Data Mining 30 / 48

Page 32: Data Mining 2016 Bayesian Network Classifiers

Algorithm to construct TAN

Compute I_P̂D(Ai; Aj | C) between each pair of attributes, where

I_P(Ai; Aj | C) = Σ_{ai,aj,c} P(ai, aj, c) log [ P(ai, aj | c) / ( P(ai | c) P(aj | c) ) ]

is the conditional mutual information between Ai and Aj given C.

Build complete undirected graph with weights I_P̂D(Ai; Aj | C)

Build a maximum weighted spanning tree

Choose root, and let all edges point away from it

Construct a TAN by adding C and an arc from C to all attributes.

This algorithm is guaranteed to produce the TAN structure with optimal loglikelihood score.

Ad Feelders ( Universiteit Utrecht ) Data Mining 31 / 48
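Compared to Construct-Tree, the only new ingredient is the conditional mutual information; here is a small estimator of ours (again with natural logarithms and 0 log 0 = 0):

cond.mutual.info <- function(ai, aj, cl) {
  p <- table(ai, aj, cl) / length(cl)             # joint relative frequencies P(ai, aj, c)
  total <- 0
  for (k in seq_len(dim(p)[3])) {
    pijc <- p[, , k]                              # P(ai, aj, c = k)
    pc   <- sum(pijc)                             # P(c = k)
    pic  <- rowSums(pijc); pjc <- colSums(pijc)   # P(ai, c = k) and P(aj, c = k)
    # P(ai, aj | c) / (P(ai | c) P(aj | c)) = P(ai, aj, c) P(c) / (P(ai, c) P(aj, c))
    terms <- pijc * log(pijc * pc / outer(pic, pjc))
    total <- total + sum(terms[pijc > 0])
  }
  total
}

In the RHC example later in these slides, the whole TAN construction is done in one call with bnlearn's tree.bayes().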

Page 33: Data Mining 2016 Bayesian Network Classifiers

TAN for Pima Indians Data

[Figure 3 (caption below): the TAN learned for the "pima" data, with nodes C, Pregnant, Insulin, Age, DPF, Glucose and Mass.]

Figure 3. A TAN model learned for the data set "pima." The dashed lines are those edges required by the naive Bayesian classifier. The solid lines are correlation edges between attributes.

These approaches, however, remove the strong assumptions of independence in naive Bayes by finding correlations among attributes that are warranted by the training data.

4.1. Augmented naive Bayesian networks as classifiers

We argued above that the performance of a Bayesian network as a classifier may improve if the learning procedure takes into account the special status of the class variable. An easy way to ensure this is to bias the structure of the network, as in the naive Bayesian classifier, such that there is an edge from the class variable to each attribute. This ensures that, in the learned network, the probability P(C|A1, . . . , An) will take all attributes into account. In order to improve the performance of a classifier based on this bias, we propose to augment the naive Bayes structure with edges among the attributes, when needed, thus dispensing with its strong assumptions about independence. We call these structures augmented naive Bayesian networks and these edges augmenting edges.

In an augmented structure, an edge from Ai to Aj implies that the influence of Ai on the assessment of the class variable also depends on the value of Aj. For example, in Figure 3, the influence of the attribute "Glucose" on the class C depends on the value of "Insulin," while in the naive Bayesian classifier the influence of each attribute on the class variable is independent of other attributes. These edges affect the classification process in that a value of "Glucose" that is typically surprising (i.e., P(g|c) is low) may be unsurprising if the value of its correlated attribute, "Insulin," is also unlikely (i.e., P(g|c, i) is high). In this situation, the naive Bayesian classifier will overpenalize the probability of the class variable by considering two unlikely observations, while the augmented network of Figure 3 will not.

Adding the best set of augmenting edges is an intractable problem, since it is equivalent to learning the best Bayesian network among those in which C is a root. Thus, even if we could improve the performance of a naive Bayes classifier in this way, the computational effort required may not be worthwhile. However, by imposing acceptable restrictions on the form of the allowed interactions, we can actually learn the optimal set of augmenting edges in polynomial time.

Ad Feelders ( Universiteit Utrecht ) Data Mining 32 / 48

Page 34: Data Mining 2016 Bayesian Network Classifiers

Interpretation of TAN’s

Just like NB models, TAN's are equivalent to their undirected counterparts (why?).

[Figure: the undirected counterpart of the Pima TAN of Figure 3.]

Since there is an edge between Pregnant and Age, the influence of Age on the class label depends on (is different for different values of) Pregnant (and vice versa).

Ad Feelders ( Universiteit Utrecht ) Data Mining 33 / 48

Page 35: Data Mining 2016 Bayesian Network Classifiers

Smoothing by adding “prior counts”

Recall that the maximum likelihood estimate of p(xi | x_pa(i)) is:

p̂(xi | x_pa(i)) = n(xi, x_pa(i)) / n(x_pa(i)),

where

n(x_pa(i)) is the number of observations (rows) with parent configuration x_pa(i), and

n(xi, x_pa(i)) is the number of observations with parent configuration x_pa(i) and value xi for variable Xi.

But sometimes we have no (n(x_pa(i)) = 0) or very few observations to estimate these (conditional) probabilities.

Ad Feelders ( Universiteit Utrecht ) Data Mining 34 / 48

Page 36: Data Mining 2016 Bayesian Network Classifiers

Smoothing by adding “prior counts”

Add “prior counts” to “smooth” the estimates.

p̂_s(xi | x_pa(i)) = [ n(x_pa(i)) p̂(xi | x_pa(i)) + m(x_pa(i)) p0(xi | x_pa(i)) ] / [ n(x_pa(i)) + m(x_pa(i)) ]

where m(x_pa(i)) is the prior precision, p̂_s(xi | x_pa(i)) is the smoothed estimate, and p0(xi | x_pa(i)) is our prior estimate of p(xi | x_pa(i)).

Common to take m(x_pa(i)) to be the same for all parent configurations.

Weighted average of ML estimate and prior estimate.

Ad Feelders ( Universiteit Utrecht ) Data Mining 35 / 48

Page 37: Data Mining 2016 Bayesian Network Classifiers

ML estimate vs. smoothed estimate

For example,

p̂_{3|1,2}(1 | 1, 2) = n(x1 = 1, x2 = 2, x3 = 1) / n(x1 = 1, x2 = 2) = 0/2 = 0

Suppose we set m = 2, and p0(xi | x_pa(i)) = p̂(xi).

Then we get

p̂s_{3|1,2}(1 | 1, 2) = (2 × 0 + 2 × 0.4) / (2 + 2) = 0.2,

since p̂_3(1) = 0.4.

Ad Feelders ( Universiteit Utrecht ) Data Mining 36 / 48
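Both numbers can be checked on the toy data of the Chow and Liu example; the sketch below is ours, with m and p0 chosen as on the slide.

d <- data.frame(
  X1 = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  X2 = c(1, 1, 1, 2, 2, 1, 1, 1, 2, 2),
  X3 = c(1, 1, 2, 2, 2, 1, 2, 2, 2, 1))

n.pa  <- sum(d$X1 == 1 & d$X2 == 2)               # n(x1 = 1, x2 = 2) = 2
n.xpa <- sum(d$X1 == 1 & d$X2 == 2 & d$X3 == 1)   # n(x1 = 1, x2 = 2, x3 = 1) = 0
p.ml  <- n.xpa / n.pa                             # ML estimate: 0

m  <- 2                                           # prior precision
p0 <- mean(d$X3 == 1)                             # prior estimate: marginal estimate 0.4
(n.pa * p.ml + m * p0) / (n.pa + m)               # smoothed estimate: 0.2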

Page 38: Data Mining 2016 Bayesian Network Classifiers

Bayesian Multinets

Build structure on attributes for each class separately and use

P_M(C = i, A1, . . . , Ak) = P(C = i) P_M(A1, . . . , Ak | C = i) = P(C = i) P_{Mi}(A1, . . . , Ak),   i = 1, . . . , c

Using trees for class-conditional structures

Split D into c partitions D1, D2, . . . , Dc, where c is the number of distinct values of class label C. Di contains all records in D with C = i.

Set P(C = i) = P̂D(C = i) for i = 1, . . . , c.

Apply Construct-Tree on Di to construct Mi

Unlike in TAN’s you may get a different tree structure for each class!

Ad Feelders ( Universiteit Utrecht ) Data Mining 37 / 48
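A hedged sketch (ours) of this construction in R, assuming bnlearn's chow.liu() tree learner is available (the RHC slides below already load bnlearn); the data frame and column names are placeholders.

library(bnlearn)

learn.multinet <- function(dat, class) {
  # dat: data frame of discrete attributes plus the class column named in class
  attrs <- setdiff(names(dat), class)
  parts <- split(dat[, attrs], dat[[class]])    # the partitions D_1, ..., D_c
  prior <- prop.table(table(dat[[class]]))      # P(C = i): relative class frequencies
  trees <- lapply(parts, chow.liu)              # Construct-Tree on each D_i separately
  # chow.liu() returns undirected trees; direct them away from a chosen root (e.g. with cextend())
  list(prior = prior, trees = trees)
}

To classify a new case one would fit the parameters of each tree on its own partition and pick the class i that maximises P(C = i) P_{Mi}(a1, . . . , ak).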

Page 39: Data Mining 2016 Bayesian Network Classifiers

Experimental Study of Friedman et al.

The following classifiers were compared:

NB: Naive Bayes Classifier

SNB: Naive Bayes with attribute selection

BN: Unrestricted Bayesian Network (Markov Blanket of Class)

C4.5: Classification Tree

And also

                 structure per class
structure        same       different
tree             TAN        CL
dag              ANB        MN

Superscript s indicates smoothing of parameter estimates.

Ad Feelders ( Universiteit Utrecht ) Data Mining 38 / 48

Page 40: Data Mining 2016 Bayesian Network Classifiers


Table 1. Description of data sets used in the experiments.

     Dataset         # Attributes   # Classes   # Instances (Train)   Test
  1  australian      14             2             690                 CV-5
  2  breast          10             2             683                 CV-5
  3  chess           36             2            2130                 1066
  4  cleve           13             2             296                 CV-5
  5  corral           6             2             128                 CV-5
  6  crx             15             2             653                 CV-5
  7  diabetes         8             2             768                 CV-5
  8  flare           10             2            1066                 CV-5
  9  german          20             2            1000                 CV-5
 10  glass            9             7             214                 CV-5
 11  glass2           9             2             163                 CV-5
 12  heart           13             2             270                 CV-5
 13  hepatitis       19             2              80                 CV-5
 14  iris             4             3             150                 CV-5
 15  letter          16            26           15000                 5000
 16  lymphography    18             4             148                 CV-5
 17  mofn-3-7-10     10             2             300                 1024
 18  pima             8             2             768                 CV-5
 19  satimage        36             6            4435                 2000
 20  segment         19             7            1540                  770
 21  shuttle-small    9             7            3866                 1934
 22  soybean-large   35            19             562                 CV-5
 23  vehicle         18             4             846                 CV-5
 24  vote            16             2             435                 CV-5
 25  waveform-21     21             3             300                 4700

The accuracy of each classifier is based on the percentage of successful predictions on the test sets of each data set. We used the MLC++ system (Kohavi et al., 1994) to estimate the prediction accuracy for each classifier, as well as the variance of this accuracy. Accuracy was measured via the holdout method for the larger data sets (that is, the learning procedures were given a subset of the instances and were evaluated on the remaining instances), and via five-fold cross validation, using the methods described by Kohavi (1995), for the smaller ones.

Since we do not deal, at present, with missing data, we removed instances with missing values from the data sets. Currently, we also do not handle continuous attributes. Instead, we applied a pre-discretization step in the manner described by Dougherty et al. (1995). This pre-discretization is based on a variant of Fayyad and Irani's (1993) discretization method. These preprocessing stages were carried out by the MLC++ system. Runs with the various learning procedures were carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds were the same for all the experiments on each data set.

Table 2 displays the accuracies of the main classification approaches we have discussed throughout the paper using the abbreviations:

NB: the naive Bayesian classifier

BN: unrestricted Bayesian networks learned with the MDL score

TANs: TAN networks learned according to Theorem 2, with smoothed parameters

Ad Feelders ( Universiteit Utrecht ) Data Mining 39 / 48

Page 41: Data Mining 2016 Bayesian Network Classifiers


Table 2. Experimental results of the primary approaches discussed in this paper.

performance. For the naive Bayesian classifier, he shows that, under certain conditions, the low variance associated with this classifier can dramatically mitigate the effect of the high bias that results from the strong independence assumptions.

One goal of the work described in this paper has been to improve the performance of the naive Bayesian classifier by relaxing these independence assumptions. Indeed, our empirical results indicate that a more accurate modeling of the dependencies amongst features leads to improved classification. Previous extensions to the naive Bayesian classifier also identified the strong independence assumptions as the source of classification errors, but differ in how they address this problem. These works fall into two categories.

Work in the first category, such as that of Langley and Sage (1994) and of John and Kohavi (1997), has attempted to improve prediction accuracy by rendering some of the attributes irrelevant. The rationale is as follows. As we explained in Section 4.1, if two attributes, say Ai and Aj, are correlated, then the naive Bayesian classifier may overamplify the weight of the evidence of these two attributes on the class. The proposed solution in this category is simply to ignore one of these two attributes. (Removing attributes is also useful if some attributes are irrelevant, since they only introduce noise in the classification problem.) This is a straightforward application of feature subset selection. The usual approach to this problem is to search for a good subset of the attributes, using an estimation scheme, such as cross validation, to repeatedly evaluate the predictive accuracy of the naive Bayesian classifier on various subsets. The resulting classifier is called the selective naive Bayesian classifier, following Langley and Sage (1994).

Ad Feelders ( Universiteit Utrecht ) Data Mining 40 / 48

Page 42: Data Mining 2016 Bayesian Network Classifiers


Table 3. Experimental results describing the effect of smoothing parameters.

It is clear that, if two attributes are perfectly correlated, then the removal of one can only improve the performance of the naive Bayesian classifier. Problems arise, however, if two attributes are only partially correlated. In these cases the removal of an attribute may lead to the loss of useful information, and the selective naive Bayesian classifier may still retain both attributes. In addition, this wrapper-based approach is, in general, computationally expensive. Our experimental results (see Figure 6) show that the methods we examine here are usually more accurate than the selective naive Bayesian classifier as used by John and Kohavi (1997).

Work in the second category (Kononenko, 1991; Pazzani, 1995; Ezawa & Schuermann, 1995) is closer in spirit to our proposal, since it attempts to improve the predictive accuracy by removing some of the independence assumptions. The semi-naive Bayesian classifier (Kononenko, 1991) is a model of the form

P(C, A1, . . . , An) = P(C) · P(A1|C) · · · P(Ak|C)    (9)

where A1, . . . , Ak are pairwise disjoint groups of attributes. Such a model assumes that Ai is conditionally independent of Aj if, and only if, they are in different groups. Thus, no assumption of independence is made about attributes that are in the same group. Kononenko's method uses statistical tests of independence to partition the attributes into groups. This procedure, however, tends to select large groups, which can lead to overfitting problems. The number of parameters needed to estimate P(Ai|C) is |Val(C)| · (∏_{Aj ∈ Ai} |Val(Aj)| − 1), which grows exponentially with the number of attributes in the group. Thus, the parameters

Ad Feelders ( Universiteit Utrecht ) Data Mining 41 / 48

Page 43: Data Mining 2016 Bayesian Network Classifiers


Table 4. Experimental results of comparing tree-like networks with unrestricted augmented naive Bayes and multinets.

assessed for P(Ai|C) may quickly become unreliable if Ai contains more than two or three attributes.

Pazzani suggests that this problem can be solved by using a cross-validation scheme to evaluate the accuracy of a classifier. His procedure starts with singleton groups (i.e., A1 = {A1}, . . . , An = {An}) and then combines, in a greedy manner, pairs of groups. (He also examines a procedure that performs feature subset selection after the stage of joining attributes.) This procedure does not, in general, select large groups, since these lead to poor prediction accuracy in the cross-validation test. Thus, Pazzani's procedure learns classifiers that partition the attributes into many small groups. Since each group of attributes is considered independent of the rest given the class, these classifiers can capture only a small number of correlations among the attributes.

Both Kononenko and Pazzani essentially assume that all of the attributes in each group Ai can be arbitrarily correlated. To understand the implications of this assumption, we use a Bayesian network representation. If we let Ai = {A_{i1}, . . . , A_{il}}, then, using the chain rule, we get

P(Ai | C) = P(A_{i1} | C) · P(A_{i2} | A_{i1}, C) · · · P(A_{il} | A_{i1}, . . . , A_{i(l−1)}, C).

Applying this decomposition to each of the terms in Equation 9, we get a product form from which a Bayesian network can be built. Indeed, this is an augmented naive Bayes network, in which there is a complete subgraph, that is, one to which we cannot add arcs

Ad Feelders ( Universiteit Utrecht ) Data Mining 42 / 48

Page 44: Data Mining 2016 Bayesian Network Classifiers

Example: RHC small data set

Variable    Description
cat1        Primary disease category
death       Did patient die within 180 days?
swang1      Did patient get Swan-Ganz catheter?
gender      Gender
race        Race
ninsclas    Insurance class
income      Income
ca          Cancer status
age         Age
meanbp1     Mean blood pressure

We have 4000 observations for training and 1735 for testing.

Ad Feelders ( Universiteit Utrecht ) Data Mining 43 / 48

Page 45: Data Mining 2016 Bayesian Network Classifiers

Naive Bayes

> library(bnlearn)

> library(Rgraphviz)

# fit NB structure to data with "death" as class variable
> death.nb <- naive.bayes(rhcsmall.dat[train.index,],"death")
# plot the network structure
> plot(as(amat(death.nb),"graphNEL"))
# predict class on test sample
> death.nb.pred <- predict(death.nb,rhcsmall.dat[-train.index,])
Warning message:
In check.data(data, allowed.types = discrete.data.types) :
variable cat1 has levels that are not observed in the data.
# make confusion matrix
> confmat <- table(rhcsmall.dat[-train.index,"death"],death.nb.pred)
> confmat
     death.nb.pred
       No Yes
  No  259 341
  Yes 217 918
# compute accuracy
> sum(diag(confmat))/sum(confmat)
[1] 0.6783862
# proportion of "Yes" in the test set: the accuracy of always predicting the majority class
> sum(confmat[2,])/sum(confmat)
[1] 0.6541787

Ad Feelders ( Universiteit Utrecht ) Data Mining 44 / 48

Page 46: Data Mining 2016 Bayesian Network Classifiers

NB for RHC data

[Figure: the learned naive Bayes structure, with arcs from death to cat1, swang1, gender, race, ninsclas, income, ca, age and meanbp1.]

Ad Feelders ( Universiteit Utrecht ) Data Mining 45 / 48

Page 47: Data Mining 2016 Bayesian Network Classifiers

Tree Augmented Naive Bayes (TAN)

# fit TAN structure to data with "death" as class variable

> death.tan <- tree.bayes(rhcsmall.dat[train.index,],"death")

# fit TAN parameters using maximum likelihood estimation ("mle")

> death.tan.fit <- bn.fit(death.tan,rhcsmall.dat[train.index,],"mle")

# predict class on test sample

> death.tan.pred <- predict(death.tan.fit,rhcsmall.dat[-train.index,])

# make confusion matrix

> confmat <- table(rhcsmall.dat[-train.index,"death"],death.tan.pred)

> confmat

death.tan.pred

No Yes

No 230 370

Yes 171 964

# compute accuracy

> sum(diag(confmat))/sum(confmat)

[1] 0.6881844

# plot the TAN structure

> plot(as(amat(death.tan),"graphNEL"))

Ad Feelders ( Universiteit Utrecht ) Data Mining 46 / 48

Page 48: Data Mining 2016 Bayesian Network Classifiers

TAN for RHC data

[Figure: the learned TAN structure over death, cat1, swang1, gender, race, ninsclas, income, ca, age and meanbp1.]

Ad Feelders ( Universiteit Utrecht ) Data Mining 47 / 48

Page 49: Data Mining 2016 Bayesian Network Classifiers

Logistic Regression

# fit model on train set

> death.logreg <- glm(death~.,data=rhcsmall.dat[train.index,],

family="binomial")

# make predictions on test set

> death.logreg.pred <- predict(death.logreg,rhcsmall.dat[-train.index,],

type="response")

# compute confusion matrix

> confmat <- table(rhcsmall.dat[-train.index,"death"],

death.logreg.pred > 0.5)

> confmat

FALSE TRUE

No 212 388

Yes 156 979

# compute accuracy

> sum(diag(confmat))/sum(confmat)

[1] 0.6864553

Ad Feelders ( Universiteit Utrecht ) Data Mining 48 / 48