
Data Mining 2012: Bayesian Network Classifiers

Ad Feelders

Universiteit Utrecht

October 25, 2012

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 1 / 48


Literature

N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine Learning, 29, pp. 131-163 (1997) (except Section 6)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 2 / 48


Bayesian Network Classifiers

Bayesian Networks are models of the joint distribution of a collection of random variables. The joint distribution is simplified by introducing independence assumptions.

In many applications we are in fact interested in the conditional distribution of one variable (the class variable) given the other variables (attributes).

Can we use Bayesian Networks as classifiers?

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 3 / 48


The Naive Bayes Classifier

[Figure: naive Bayes DAG with class node C and arcs C → A1, C → A2, ..., C → Ak]

This Bayesian Network is equivalent to its undirected version (why?):

[Figure: the same graph with undirected edges between C and each attribute A1, ..., Ak]

Attributes are independent given the class label.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 4 / 48


The Naive Bayes Classifier

[Figure: naive Bayes DAG with class node C and arcs C → A1, C → A2, ..., C → Ak]

BN factorisation:

P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})

So factorisation corresponding to NB classifier is:

P(C, A_1, \ldots, A_k) = P(C)\, P(A_1 \mid C) \cdots P(A_k \mid C)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 5 / 48


Naive Bayes assumption

Via Bayes rule we have

P(C = i \mid A) = \frac{P(A_1, A_2, \ldots, A_k, C = i)}{P(A_1, A_2, \ldots, A_k)}    (product rule)

= \frac{P(A_1, A_2, \ldots, A_k \mid C = i)\, P(C = i)}{\sum_{j=1}^{c} P(A_1, A_2, \ldots, A_k \mid C = j)\, P(C = j)}    (product rule and sum rule)

= \frac{P(A_1 \mid C = i)\, P(A_2 \mid C = i) \cdots P(A_k \mid C = i)\, P(C = i)}{\sum_{j=1}^{c} P(A_1 \mid C = j)\, P(A_2 \mid C = j) \cdots P(A_k \mid C = j)\, P(C = j)}    (NB factorisation)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 6 / 48


Why Naive Bayes is competitive

The conditional independence assumption is often clearly inappropriate, yet the predictive accuracy of Naive Bayes is competitive with more complex classifiers. How come?

Probability estimates of Naive Bayes may be way off, but this does not necessarily result in wrong classification!

Naive Bayes has only a few parameters compared to more complex models, so it can estimate parameters more reliably.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 7 / 48


Naive Bayes: Example

P(C = 0) = 0.4, P(C = 1) = 0.6

Joint distribution of (A1, A2) given C = 0:

                A2 = 0   A2 = 1   P(A1)
   A1 = 0        0.2      0.1      0.3
   A1 = 1        0.1      0.6      0.7
   P(A2)         0.3      0.7       1

Joint distribution of (A1, A2) given C = 1:

                A2 = 0   A2 = 1   P(A1)
   A1 = 0        0.5      0.2      0.7
   A1 = 1        0.1      0.2      0.3
   P(A2)         0.6      0.4       1

We have that

P(C = 1 \mid A_1 = 0, A_2 = 0) = \frac{0.5 \times 0.6}{0.5 \times 0.6 + 0.2 \times 0.4} = 0.79

According to naive Bayes

P(C = 1 \mid A_1 = 0, A_2 = 0) = \frac{0.7 \times 0.6 \times 0.6}{0.7 \times 0.6 \times 0.6 + 0.3 \times 0.3 \times 0.4} = 0.88

Naive Bayes assigns to the right class.
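
For reference, a small R sketch (not part of the original slides) that reproduces the naive Bayes posterior above from the class prior and the conditional marginals P(A1 | C) and P(A2 | C) read off the tables:

prior <- c(0.4, 0.6)                      # P(C = 0), P(C = 1)
pA1   <- rbind(c(0.3, 0.7), c(0.7, 0.3))  # rows: C = 0, 1; columns: A1 = 0, 1
pA2   <- rbind(c(0.3, 0.7), c(0.6, 0.4))  # rows: C = 0, 1; columns: A2 = 0, 1

unnorm <- prior * pA1[, 1] * pA2[, 1]     # unnormalised P(C) P(A1 = 0 | C) P(A2 = 0 | C)
unnorm / sum(unnorm)                      # c(0.125, 0.875): naive Bayes gives 0.88 for C = 1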

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 8 / 48


What about this model?

[Figure: DAG with an arc from each attribute A1, ..., Ak into the class node C]

BN factorisation:

P(X) = \prod_{i=1}^{k} P(X_i \mid X_{pa(i)})

So factorisation is:

P(C, A_1, \ldots, A_k) = P(C \mid A_1, \ldots, A_k)\, P(A_1) \cdots P(A_k)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 9 / 48


Bayesian Networks as Classifiers

[Figure: example Bayesian network over the class C and attributes A1, ..., A8]

Markov Blanket: Parents, Children and Parents of Children.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 10 / 48


Markov Blanket of C: Moral Graph

[Figure: moral graph of the network on the previous slide, showing the Markov blanket of C]

Markov Blanket: Parents, Children and Parents of Children.

Local Markov property: C ⊥⊥ rest | boundary(C)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 11 / 48


Bayesian Networks as Classifiers

Loglikelihood under model M is

L(M \mid D) = \sum_{j=1}^{n} \log P_M(X^{(j)})

where X^{(j)} = (A_1^{(j)}, A_2^{(j)}, \ldots, A_k^{(j)}, C^{(j)}).

We can rewrite this as

L(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A^{(j)}) + \sum_{j=1}^{n} \log P_M(A^{(j)})

If there are many attributes, the second term will dominate the loglikelihood score.

But we are not interested in modeling the distribution of the attributes!

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 12 / 48


Bayesian Networks as Classifiers

[Figure: −log P(x) plotted against P(x), for P(x) between 0 and 1]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 13 / 48


Naive Bayes vs. Unrestricted BN

[Figure 2 of Friedman et al.: error curves (top) and scatter plot (bottom) comparing unsupervised Bayesian networks with naive Bayes on 25 data sets]

Figure 2. Error curves (top) and scatter plot (bottom) comparing unsupervised Bayesian networks (solid line, x axis) to naive Bayes (dashed line, y axis). In the error curves, the horizontal axis lists the data sets, which are sorted so that the curves cross only once, and the vertical axis measures the percentage of test instances that were misclassified (i.e., prediction errors). Thus, the smaller the value, the better the accuracy. Each data point is annotated by a 90% confidence interval. In the scatter plot, each point represents a data set, where the x coordinate of a point is the percentage of misclassifications according to unsupervised Bayesian networks and the y coordinate is the percentage of misclassifications according to naive Bayes. Thus, points above the diagonal line correspond to data sets on which unrestricted Bayesian networks perform better, and points below the diagonal line correspond to data sets on which naive Bayes performs better.

To confirm this hypothesis, we conducted an experiment comparing the classification accuracy of Bayesian networks learned using the MDL score (i.e., classifiers based on unrestricted networks) to that of the naive Bayesian classifier. We ran this experiment on [...]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 14 / 48


Use Conditional Log-likelihood?

Discriminative vs. Generative learning.

Conditional loglikelihood function:

CL(M \mid D) = \sum_{j=1}^{n} \log P_M(C^{(j)} \mid A_1^{(j)}, \ldots, A_k^{(j)})

No closed form solution for ML estimates.

Remark: can be done via Logistic Regression for models with perfect graphs (Naive Bayes, TANs).

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 15 / 48


NB and Logistic Regression

The logistic regression assumption is

\log \left\{ \frac{P(C = 1 \mid A)}{P(C = 0 \mid A)} \right\} = \alpha + \sum_{i=1}^{k} \beta_i A_i,

that is, the log odds is a linear function of the attributes.

Under the naive Bayes assumption, this is exactly true.

Assign to class 1 if \alpha + \sum_{i=1}^{k} \beta_i A_i > 0, and to class 0 otherwise.

Logistic regression maximizes conditional likelihood under this assumption (it is a so-called discriminative model).

There is no closed form solution for the maximum likelihood estimates of α and β_i, but the loglikelihood function is globally concave (unique global optimum).
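
For comparison, this is what the discriminative fit looks like in R with glm(); the simulated data frame d and its column names are made up for the illustration (an illustrative sketch, not code from the slides):

# simulate binary attributes that are conditionally independent given the class
set.seed(1)
n  <- 5000
C  <- rbinom(n, 1, 0.6)
A1 <- rbinom(n, 1, ifelse(C == 1, 0.8, 0.5))
A2 <- rbinom(n, 1, ifelse(C == 1, 0.6, 0.3))
d  <- data.frame(C, A1, A2)

# logistic regression maximizes the conditional likelihood
fit <- glm(C ~ A1 + A2, family = binomial, data = d)
coef(fit)   # (Intercept) estimates alpha; the other coefficients estimate the beta_i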

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 16 / 48


Proof (for binary attributes Ai)

Under the naive Bayes assumption we have:

\frac{P(C = 1 \mid a)}{P(C = 0 \mid a)} = \frac{P(a_1 \mid C = 1) \cdots P(a_k \mid C = 1)\, P(C = 1)}{P(a_1 \mid C = 0) \cdots P(a_k \mid C = 0)\, P(C = 0)}

= \prod_{i=1}^{k} \left\{ \left( \frac{P(a_i = 1 \mid C = 1)}{P(a_i = 1 \mid C = 0)} \right)^{a_i} \left( \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} \right)^{1 - a_i} \right\} \times \frac{P(C = 1)}{P(C = 0)}

Taking the log we get

\log \left( \frac{P(C = 1 \mid a)}{P(C = 0 \mid a)} \right) = \sum_{i=1}^{k} \left\{ a_i \log \left( \frac{P(a_i = 1 \mid C = 1)}{P(a_i = 1 \mid C = 0)} \right) + (1 - a_i) \log \left( \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} \right) \right\} + \log \left( \frac{P(C = 1)}{P(C = 0)} \right)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 17 / 48


Proof (continued)

Expand and collect terms.

\log \left( \frac{P(C = 1 \mid a)}{P(C = 0 \mid a)} \right) = \sum_{i=1}^{k} a_i \underbrace{\log \frac{P(a_i = 1 \mid C = 1)\, P(a_i = 0 \mid C = 0)}{P(a_i = 1 \mid C = 0)\, P(a_i = 0 \mid C = 1)}}_{\beta_i} + \underbrace{\sum_{i=1}^{k} \log \frac{P(a_i = 0 \mid C = 1)}{P(a_i = 0 \mid C = 0)} + \log \frac{P(C = 1)}{P(C = 0)}}_{\alpha}

which is a linear function of a.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 18 / 48


Example

Suppose P(C = 1) = 0.6, P(a_1 = 1 | C = 1) = 0.8, P(a_1 = 1 | C = 0) = 0.5, P(a_2 = 1 | C = 1) = 0.6, P(a_2 = 1 | C = 0) = 0.3.

Then

\log \left( \frac{P(C = 1 \mid a_1, a_2)}{P(C = 0 \mid a_1, a_2)} \right) = 1.386\, a_1 + 1.253\, a_2 - 1.476 + 0.405 = -1.071 + 1.386\, a_1 + 1.253\, a_2

Classify a point with a_1 = 1 and a_2 = 0:

\log \left( \frac{P(C = 1 \mid 1, 0)}{P(C = 0 \mid 1, 0)} \right) = -1.071 + 1.386 \times 1 + 1.253 \times 0 = 0.315

Decision rule: assign to class 1 if

\alpha + \sum_{i=1}^{k} \beta_i A_i > 0

and to class 0 otherwise. Linear decision boundary.
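
A quick numerical check in R (an illustration, not code from the slides) of the coefficients above:

p1     <- 0.6                  # P(C = 1)
pa1_c1 <- 0.8; pa1_c0 <- 0.5   # P(a1 = 1 | C = 1), P(a1 = 1 | C = 0)
pa2_c1 <- 0.6; pa2_c0 <- 0.3   # P(a2 = 1 | C = 1), P(a2 = 1 | C = 0)

beta1 <- log((pa1_c1 * (1 - pa1_c0)) / (pa1_c0 * (1 - pa1_c1)))  # 1.386
beta2 <- log((pa2_c1 * (1 - pa2_c0)) / (pa2_c0 * (1 - pa2_c1)))  # 1.253
alpha <- log((1 - pa1_c1) / (1 - pa1_c0)) +
         log((1 - pa2_c1) / (1 - pa2_c0)) +
         log(p1 / (1 - p1))                                      # -1.071

alpha + beta1 * 1 + beta2 * 0  # log odds for a1 = 1, a2 = 0: 0.315 > 0, so class 1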

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 19 / 48


Linear Decision Boundary

[Figure: the unit square of (A1, A2) values with the linear decision boundary a2 = 0.855 − 1.106 a1; the CLASS 1 region lies above the line and the CLASS 0 region below it]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 20 / 48


Relax strong assumptions of NB

Conditional independence assumption of NB is often incorrect, and could lead to suboptimal classification performance.

Relax this assumption by allowing (restricted) dependencies between attributes.

This may produce more accurate probability estimates, possibly leading to better classification performance.

This is not guaranteed, because the more complex model may be overfitting.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 21 / 48


Tree Structured BN

For tree structured Bayesian Networks there is an algorithm, due to Chow and Liu, that produces the optimal structure in polynomial time.

This algorithm is guaranteed to produce the tree structure that maximizes the loglikelihood score.

Tree structure: each node (except the root of the tree) has exactly one parent.

Why no penalty for complexity?

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 22 / 48


Mutual Information

Measure of association between (discrete) random variables X and Y:

I_P(X; Y) = \sum_{x, y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}

"Distance" (Kullback-Leibler divergence) between the joint distribution of X and Y, and their joint distribution under the independence assumption.

If X and Y are independent, their mutual information is zero, otherwise itis some positive quantity.
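
A minimal R sketch (not part of the original slides) that computes this quantity from a joint probability table, treating 0 · log 0 as 0 and using natural logarithms; the function name is illustrative:

mutual_information <- function(pxy) {
  px <- rowSums(pxy)                        # marginal P(x)
  py <- colSums(pxy)                        # marginal P(y)
  terms <- pxy * log(pxy / outer(px, py))   # P(x, y) log( P(x, y) / (P(x) P(y)) )
  sum(terms[pxy > 0])                       # skip zero-probability cells
}

mutual_information(matrix(0.25, 2, 2))      # 0: the uniform joint of two independent coins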

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 23 / 48


Algorithm Construct-Tree of Chow and Liu

Compute I_P̂D(Xi; Xj) between each pair of variables (O(nk²))

Build complete undirected graph with weights I_P̂D(Xi; Xj)

Build a maximum weighted spanning tree (O(k² log k))

Choose root, and let all edges point away from it
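
Below is a compact R sketch of Construct-Tree (illustrative only, not code from the slides): pairwise empirical mutual information followed by Prim's algorithm for a maximum weighted spanning tree. The function and variable names are made up, and the input is assumed to be a data frame of discrete variables.

construct_tree <- function(data) {
  k <- ncol(data)
  # pairwise empirical mutual information I(Xi; Xj)
  W <- matrix(0, k, k)
  for (i in 1:(k - 1)) for (j in (i + 1):k) {
    pxy <- table(data[[i]], data[[j]]) / nrow(data)
    px <- rowSums(pxy); py <- colSums(pxy)
    W[i, j] <- W[j, i] <- sum((pxy * log(pxy / outer(px, py)))[pxy > 0])
  }
  # Prim's algorithm: repeatedly add the heaviest edge leaving the current tree
  in_tree <- c(TRUE, rep(FALSE, k - 1))     # node 1 is taken as the root here
  parent  <- rep(NA_integer_, k)
  for (step in 1:(k - 1)) {
    best <- c(NA, NA); best_w <- -Inf
    for (i in which(in_tree)) for (j in which(!in_tree)) {
      if (W[i, j] > best_w) { best_w <- W[i, j]; best <- c(i, j) }
    }
    parent[best[2]] <- best[1]              # direct the new edge away from the root
    in_tree[best[2]] <- TRUE
  }
  parent                                    # parent[i] is the parent of node i (NA = root)
}

On the example data set of the next slide, the heaviest edge (X1–X4, with weight 0.55) is the first edge this sketch adds.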

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 24 / 48


Example Data Set

obs   X1   X2   X3   X4
  1    1    1    1    1
  2    1    1    1    1
  3    1    1    2    1
  4    1    2    2    1
  5    1    2    2    2
  6    2    1    1    2
  7    2    1    2    3
  8    2    1    2    3
  9    2    2    2    3
 10    2    2    1    3

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 25 / 48


Mutual Information

I_{\hat{P}_D}(X_1; X_4) = \sum_{x_1, x_4} \hat{P}_D(x_1, x_4) \log \frac{\hat{P}_D(x_1, x_4)}{\hat{P}_D(x_1)\, \hat{P}_D(x_4)}

= 0.4 \log \frac{0.4}{(0.5)(0.4)} + 0.1 \log \frac{0.1}{(0.5)(0.2)} + 0 \log \frac{0}{(0.5)(0.4)} + 0 \log \frac{0}{(0.5)(0.4)} + 0.1 \log \frac{0.1}{(0.5)(0.2)} + 0.4 \log \frac{0.4}{(0.5)(0.4)}

= 0.55
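
The same value can be checked numerically in R (a sketch, not from the slides), starting from the empirical joint distribution of X1 and X4:

# empirical joint of X1 (rows: 1, 2) and X4 (columns: 1, 2, 3) from the 10 observations
pxy <- matrix(c(0.4, 0.1, 0.0,
                0.0, 0.1, 0.4), nrow = 2, byrow = TRUE)
px <- rowSums(pxy); py <- colSums(pxy)
sum((pxy * log(pxy / outer(px, py)))[pxy > 0])   # 0.5545, i.e. 0.55 with natural logs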

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 26 / 48


Build Graph with Weights

[Figure: complete undirected graph on nodes 1–4 with mutual information edge weights: 1–4: 0.55; 2–3, 2–4, 3–4: 0.032 each; 1–2, 1–3: 0]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 27 / 48


Maximum weighted spanning tree

[Figure: a maximum weighted spanning tree on nodes 1–4; it contains the edge 1–4 (weight 0.55) plus two of the 0.032 edges]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 28 / 48


Choose root node

[Figure: the spanning tree with a root chosen and all edges directed away from it]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 29 / 48


Algorithm to construct TAN

Compute I_P̂D(Ai; Aj | C) between each pair of attributes, where

I_P(A_i; A_j \mid C) = \sum_{a_i, a_j, c} P(a_i, a_j, c) \log \frac{P(a_i, a_j \mid c)}{P(a_i \mid c)\, P(a_j \mid c)}

is the conditional mutual information between Ai and Aj given C.

Build complete undirected graph with weights I_P̂D(Ai; Aj | C)

Build a maximum weighted spanning tree

Choose root, and let all edges point away from it

Construct a TAN by adding C and an arc from C to all attributes.

This algorithm is guaranteed to produce the TAN structure with optimalloglikelihood score.
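
As an illustration (not from the slides), the conditional mutual information weights can be estimated from data in much the same way as the unconditional ones; the function and argument names below are made up:

cond_mutual_information <- function(ai, aj, cls) {
  p3 <- table(ai, aj, cls) / length(cls)            # empirical P(ai, aj, c)
  mi <- 0
  for (cv in dimnames(p3)[[3]]) {
    pc  <- sum(p3[, , cv])                          # P(C = c)
    pij <- p3[, , cv] / pc                          # P(ai, aj | C = c)
    pi_ <- rowSums(pij); pj_ <- colSums(pij)
    mi  <- mi + pc * sum((pij * log(pij / outer(pi_, pj_)))[pij > 0])
  }
  mi
}

With these weights in place of I(Xi; Xj), the remaining steps are the same as in Construct-Tree.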

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 30 / 48


TAN for Pima Indians Data

[Figure 3: a TAN model learned for the data set "pima", with class node C and attributes Pregnant, Age, Insulin, DPF, Glucose and Mass]

Figure 3. A TAN model learned for the data set "pima." The dashed lines are those edges required by the naive Bayesian classifier. The solid lines are correlation edges between attributes.

These approaches, however, remove the strong assumptions of independence in naive Bayes by finding correlations among attributes that are warranted by the training data.

4.1. Augmented naive Bayesian networks as classifiers

We argued above that the performance of a Bayesian network as a classifier may improve if the learning procedure takes into account the special status of the class variable. An easy way to ensure this is to bias the structure of the network, as in the naive Bayesian classifier, such that there is an edge from the class variable to each attribute. This ensures that, in the learned network, the probability P(C | A1, ..., An) will take all attributes into account. In order to improve the performance of a classifier based on this bias, we propose to augment the naive Bayes structure with edges among the attributes, when needed, thus dispensing with its strong assumptions about independence. We call these structures augmented naive Bayesian networks and these edges augmenting edges.

In an augmented structure, an edge from Ai to Aj implies that the influence of Ai on the assessment of the class variable also depends on the value of Aj. For example, in Figure 3, the influence of the attribute "Glucose" on the class C depends on the value of "Insulin," while in the naive Bayesian classifier the influence of each attribute on the class variable is independent of other attributes. These edges affect the classification process in that a value of "Glucose" that is typically surprising (i.e., P(g|c) is low) may be unsurprising if the value of its correlated attribute, "Insulin," is also unlikely (i.e., P(g|c, i) is high). In this situation, the naive Bayesian classifier will overpenalize the probability of the class variable by considering two unlikely observations, while the augmented network of Figure 3 will not.

Adding the best set of augmenting edges is an intractable problem, since it is equivalent to learning the best Bayesian network among those in which C is a root. Thus, even if we could improve the performance of a naive Bayes classifier in this way, the computational effort required may not be worthwhile. However, by imposing acceptable restrictions on the form of the allowed interactions, we can actually learn the optimal set of augmenting edges in polynomial time.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 31 / 48


Interpretation of TANs

Just like NB models, TANs are equivalent to their undirected counterparts (why?).

[Figure: the Pima TAN with abbreviated node labels (C, P = Pregnant, A = Age, I = Insulin, D = DPF, M = Mass, G = Glucose)]

Since there is an edge between Pregnant and Age, the influence of Age on the class label depends on (is different for different values of) Pregnant (and vice versa).

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 32 / 48


Smoothing by adding “prior counts”

Sometimes we have few observations to estimate (conditional) probabilities.

Add "prior counts" to "smooth" the estimates.

\hat{p}^{s}(x_i \mid x_{pa(i)}) = \frac{n(x_{pa(i)})\, \hat{p}(x_i \mid x_{pa(i)}) + m(x_{pa(i)})\, p^{0}(x_i \mid x_{pa(i)})}{n(x_{pa(i)}) + m(x_{pa(i)})}

where m(x_pa(i)) is the prior precision, p̂^s(x_i | x_pa(i)) is the smoothed estimate, and p^0(x_i | x_pa(i)) is our prior estimate of p(x_i | x_pa(i)).

Common to take m(xpa(i)) to be the same for all parent configurations.

Weighted average of ML estimate and prior estimate.
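
A one-line R version of this smoothed estimate (a sketch using the notation above, not code from the slides):

# n_pa = n(x_pa(i)), p_ml = maximum likelihood estimate, p0 = prior estimate, m = prior precision
smooth_estimate <- function(p_ml, n_pa, p0, m) {
  (n_pa * p_ml + m * p0) / (n_pa + m)
}

smooth_estimate(p_ml = 0,   n_pa = 2,  p0 = 0.4, m = 2)   # 0.2, as in the example two slides ahead
smooth_estimate(p_ml = 0.5, n_pa = 10, p0 = 0.5, m = 2)   # 0.5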

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 33 / 48


ML estimates

For example

\hat{p}_1(1) = \frac{n(x_1 = 1)}{n} = \frac{5}{10} = \frac{1}{2}

and

\hat{p}_{3|1,2}(1 \mid 1, 2) = \frac{n(x_1 = 1, x_2 = 2, x_3 = 1)}{n(x_1 = 1, x_2 = 2)} = \frac{0}{2} = 0

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 34 / 48


Smoothed estimates

Suppose we set m = 2, and p^0(x_i | x_pa(i)) = p̂(x_i).

Then we get

\hat{p}^{s}_{1}(1) = \frac{10 \times 0.5 + 2 \times 0.5}{10 + 2} = \frac{1}{2}

and

\hat{p}^{s}_{3|1,2}(1 \mid 1, 2) = \frac{2 \times 0 + 2 \times 0.4}{2 + 2} = 0.2

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 35 / 48


Bayesian Multinets

Build structure on attributes for each class separately and use

P_M(C = i, A_1, \ldots, A_k) = P(C = i)\, P_M(A_1, \ldots, A_k \mid C = i) = P(C = i)\, P_{M_i}(A_1, \ldots, A_k), \quad i = 1, \ldots, c

Using trees for class-conditional structures

Split D into c partitions, D_1, D_2, ..., D_c, where c is the number of distinct values of class label C. D_i contains all records in D with C = i.

Set P(C = i) = P̂_D(C = i) for i = 1, ..., c.

Apply Construct-Tree on Di to construct Mi
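
A brief R sketch of the multinet procedure (illustrative only; it assumes the class C is the first column of the data frame D and reuses a construct_tree() routine like the one sketched earlier):

build_multinet <- function(D) {
  prior <- table(D[[1]]) / nrow(D)                        # P(C = i)
  trees <- lapply(split(D[-1], D[[1]]), construct_tree)   # one tree Mi per class partition Di
  list(prior = prior, trees = trees)
}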

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 36 / 48


Experimental Study of Friedman et al.

The following classifiers were compared:

NB: Naive Bayes Classifier

SNB: Naive Bayes with attribute selection

BN: Unrestricted Bayesian Network (Markov Blanket of Class)

C4.5: Classification Tree

And also

                        structure per class
   structure          same        different
   tree               TAN         CL
   dag                ANB         MN

Superscript s indicates smoothing of parameter estimates.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 37 / 48


Table 1. Description of data sets used in the experiments.

      Dataset          # Attributes   # Classes   Instances (Train)   Instances (Test)
  1   australian            14            2              690               CV-5
  2   breast                10            2              683               CV-5
  3   chess                 36            2             2130               1066
  4   cleve                 13            2              296               CV-5
  5   corral                 6            2              128               CV-5
  6   crx                   15            2              653               CV-5
  7   diabetes               8            2              768               CV-5
  8   flare                 10            2             1066               CV-5
  9   german                20            2             1000               CV-5
 10   glass                  9            7              214               CV-5
 11   glass2                 9            2              163               CV-5
 12   heart                 13            2              270               CV-5
 13   hepatitis             19            2               80               CV-5
 14   iris                   4            3              150               CV-5
 15   letter                16           26            15000               5000
 16   lymphography          18            4              148               CV-5
 17   mofn-3-7-10           10            2              300               1024
 18   pima                   8            2              768               CV-5
 19   satimage              36            6             4435               2000
 20   segment               19            7             1540                770
 21   shuttle-small          9            7             3866               1934
 22   soybean-large         35           19              562               CV-5
 23   vehicle               18            4              846               CV-5
 24   vote                  16            2              435               CV-5
 25   waveform-21           21            3              300               4700

The accuracy of each classifier is based on the percentage of successful predictions on the test sets of each data set. We used the MLC++ system (Kohavi et al., 1994) to estimate the prediction accuracy for each classifier, as well as the variance of this accuracy. Accuracy was measured via the holdout method for the larger data sets (that is, the learning procedures were given a subset of the instances and were evaluated on the remaining instances), and via five-fold cross validation, using the methods described by Kohavi (1995), for the smaller ones.

Since we do not deal, at present, with missing data, we removed instances with missing values from the data sets. Currently, we also do not handle continuous attributes. Instead, we applied a pre-discretization step in the manner described by Dougherty et al. (1995). This pre-discretization is based on a variant of Fayyad and Irani's (1993) discretization method. These preprocessing stages were carried out by the MLC++ system. Runs with the various learning procedures were carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds were the same for all the experiments on each data set.

Table 2 displays the accuracies of the main classification approaches we have discussed throughout the paper using the abbreviations:

NB: the naive Bayesian classifier

BN: unrestricted Bayesian networks learned with the MDL score

TANs: TAN networks learned according to Theorem 2, with smoothed parameters

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 38 / 48


Table 2. Experimental results of the primary approaches discussed in this paper.

[...] performance. For the naive Bayesian classifier, he shows that, under certain conditions, the low variance associated with this classifier can dramatically mitigate the effect of the high bias that results from the strong independence assumptions.

One goal of the work described in this paper has been to improve the performance of the naive Bayesian classifier by relaxing these independence assumptions. Indeed, our empirical results indicate that a more accurate modeling of the dependencies amongst features leads to improved classification. Previous extensions to the naive Bayesian classifier also identified the strong independence assumptions as the source of classification errors, but differ in how they address this problem. These works fall into two categories.

Work in the first category, such as that of Langley and Sage (1994) and of John and Kohavi (1997), has attempted to improve prediction accuracy by rendering some of the attributes irrelevant. The rationale is as follows. As we explained in Section 4.1, if two attributes, say Ai and Aj, are correlated, then the naive Bayesian classifier may overamplify the weight of the evidence of these two attributes on the class. The proposed solution in this category is simply to ignore one of these two attributes. (Removing attributes is also useful if some attributes are irrelevant, since they only introduce noise in the classification problem.) This is a straightforward application of feature subset selection. The usual approach to this problem is to search for a good subset of the attributes, using an estimation scheme, such as cross validation, to repeatedly evaluate the predictive accuracy of the naive Bayesian classifier on various subsets. The resulting classifier is called the selective naive Bayesian classifier, following Langley and Sage (1994).

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 39 / 48


Table 3. Experimental results describing the effect of smoothing parameters.

It is clear that, if two attributes are perfectly correlated, then the removal of one can only improve the performance of the naive Bayesian classifier. Problems arise, however, if two attributes are only partially correlated. In these cases the removal of an attribute may lead to the loss of useful information, and the selective naive Bayesian classifier may still retain both attributes. In addition, this wrapper-based approach is, in general, computationally expensive. Our experimental results (see Figure 6) show that the methods we examine here are usually more accurate than the selective naive Bayesian classifier as used by John and Kohavi (1997).

Work in the second category (Kononenko, 1991; Pazzani, 1995; Ezawa & Schuermann, 1995) are closer in spirit to our proposal, since they attempt to improve the predictive accuracy by removing some of the independence assumptions. The semi-naive Bayesian classifier (Kononenko, 1991) is a model of the form

P(C, A_1, \ldots, A_n) = P(C) \cdot P(\mathbf{A}_1 \mid C) \cdots P(\mathbf{A}_k \mid C)    (9)

where \mathbf{A}_1, ..., \mathbf{A}_k are pairwise disjoint groups of attributes. Such a model assumes that A_i is conditionally independent of A_j if, and only if, they are in different groups. Thus, no assumption of independence is made about attributes that are in the same group. Kononenko's method uses statistical tests of independence to partition the attributes into groups. This procedure, however, tends to select large groups, which can lead to overfitting problems. The number of parameters needed to estimate P(\mathbf{A}_i \mid C) is |Val(C)| \cdot (\prod_{A_j \in \mathbf{A}_i} |Val(A_j)| - 1), which grows exponentially with the number of attributes in the group. Thus, the parameters

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 40 / 48


Table 4. Experimental results of comparing tree-like networks with unrestricted augmented naive Bayes and multinets.

assessed for P(\mathbf{A}_i \mid C) may quickly become unreliable if \mathbf{A}_i contains more than two or three attributes.

Pazzani suggests that this problem can be solved by using a cross-validation scheme to evaluate the accuracy of a classifier. His procedure starts with singleton groups (i.e., \mathbf{A}_1 = {A_1}, ..., \mathbf{A}_n = {A_n}) and then combines, in a greedy manner, pairs of groups. (He also examines a procedure that performs feature subset selection after the stage of joining attributes.) This procedure does not, in general, select large groups, since these lead to poor prediction accuracy in the cross-validation test. Thus, Pazzani's procedure learns classifiers that partition the attributes into many small groups. Since each group of attributes is considered independent of the rest given the class, these classifiers can capture only a small number of correlations among the attributes.

Both Kononenko and Pazzani essentially assume that all of the attributes in each group \mathbf{A}_i can be arbitrarily correlated. To understand the implications of this assumption, we use a Bayesian network representation. If we let \mathbf{A}_i = {A_{i_1}, ..., A_{i_l}}, then, using the chain rule, we get

P(\mathbf{A}_i \mid C) = P(A_{i_1} \mid C) \cdot P(A_{i_2} \mid A_{i_1}, C) \cdots P(A_{i_l} \mid A_{i_1}, \ldots, A_{i_{l-1}}, C).

Applying this decomposition to each of the terms in Equation 9, we get a product form from which a Bayesian network can be built. Indeed, this is an augmented naive Bayes network, in which there is a complete subgraph, that is, one to which we cannot add arcs [...]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 41 / 48


Example: data on death penalty

Data provided by the Georgia parole board.

Variable    Description
death       Did the defendant get the death penalty?
blkdef      Is the defendant black?
whtvict     Is the victim white?
aggcirc     Number of aggravating circumstances
stranger    Were victim and defendant strangers?

We have 100 observations.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 42 / 48


Example in bnlearn package

> death.small[1:5,]

death blkdef whtvict aggcirc stranger

1 0 1 0 1 0

2 0 1 0 1 0

3 0 1 0 1 0

4 1 1 0 2 0

5 0 1 0 1 0

> summary(death.small)

death blkdef whtvict aggcirc stranger

0:51 0:47 0:26 1:22 0:49

1:49 1:53 1:74 2:78 1:51

# fit naive Bayes with "death" as class variable
> death.nb <- naive.bayes("death",data=death.small)

# predict class on training sample
> death.nb.pred <- predict(death.nb,death.small)

# make confusion matrix
> table(death.small[,1],death.nb.pred)

death.nb.pred

0 1

0 36 15

1 12 37

# compute accuracy
> sum(diag(table(death.small[,1],death.nb.pred))/nrow(death.small))

[1] 0.73

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 43 / 48


Example in bnlearn package

# fit TAN to death penalty data with "death" as class variable

> death.tan <- tree.bayes(death.small,"death")

# fit TAN parameters using maximum likelihood estimation ("mle")

> death.tan.fit <- bn.fit(death.tan,death.small,"mle")

# predict class on training sample

> death.tan.pred <- predict(death.tan.fit,death.small)

# make confusion matrix

> table(death.small[,1],death.tan.pred)

death.tan.pred

0 1

0 32 19

1 8 41

# compute accuracy

> sum(diag(table(death.small[,1],death.tan.pred))/nrow(death.small))

[1] 0.73

# plot the TAN structure

> plot(death.tan)

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 44 / 48


TAN for death penalty data

[Figure: TAN structure learned for the death penalty data, with class node death and attribute nodes blkdef, whtvict, aggcirc and stranger]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 45 / 48


Example in bnlearn package

# learn network structure with hill-climber

> death.hc <- hc(death.small)

> plot(death.hc)

> death.hc.fit <- bn.fit(death.hc,death.small,"mle")

> death.hc.pred <- predict(death.hc.fit,node="death",data=death.small)

> table(death.small[,1],death.hc.pred)

death.hc.pred

0 1

0 20 31

1 6 43

> sum(diag(table(death.small[,1],death.hc.pred))/nrow(death.small))

[1] 0.63

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 46 / 48


Network structure with hill climbing for death penalty data

[Figure: network structure learned by hill climbing for the death penalty data, over the nodes death, blkdef, whtvict, aggcirc and stranger]

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 47 / 48


Network structure with hill climbing for death penalty data

[Figure: the same hill-climbing network structure for the death penalty data, over the nodes death, blkdef, whtvict, aggcirc and stranger]

Death penalty and race of defendant are independent given race of victim.

Ad Feelders ( Universiteit Utrecht ) Data Mining October 25, 2012 48 / 48