
Pattern Recognition Letters 27 (2006) 892–899

Learning probabilistic decision trees for AUC

Harry Zhang, Jiang Su

Faculty of Computer Science, University of New Brunswick, P.O. Box 4400, Fredericton, NB, Canada E3B 5A3

Available online 13 December 2005

Abstract

Accurate ranking, measured by AUC (the area under the ROC curve), is crucial in many real-world applications. Most traditional learning algorithms, however, aim only at high classification accuracy. It has been observed that traditional decision trees produce good classification accuracy but poor probability estimates. Since the ranking generated by a decision tree is based on the class probabilities, a probability estimation tree (PET) with accurate probability estimates is desired in order to yield high AUC. Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. To our observation, however, the representation also plays an important role. In this paper, we propose to extend decision trees to represent a joint distribution and conditional independence, called conditional independence trees (CITrees), which is a more suitable model for yielding high AUC. We propose a novel AUC-based algorithm for learning CITrees, and our experiments show that the CITree algorithm outperforms the state-of-the-art decision tree learning algorithm C4.4 (a variant of C4.5), naive Bayes, and NBTree in AUC. Our work provides an effective model and algorithm for applications in which an accurate ranking is required.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Decision trees; AUC; Naive Bayes; Ranking

1. Introduction

Classification is one of the most important tasks in machine learning and pattern recognition. In classification, a classifier is built from a set of training examples with class labels. A key performance measure of a classifier is its predictive accuracy (or error rate, 1 − accuracy). Many classifiers can also produce class probability estimates p(c|E), the probability that an example E is in class c. However, this information is largely ignored: the error rate does not consider how "far off" (be it 0.45 or 0.01) the prediction of each example is from its target, but only the class with the largest probability estimate.

In many applications, however, classification and error rate are not enough. For example, in direct marketing, we often need to promote the top X% of customers during gradual roll-out, or we often deploy different promotion strategies to customers with different likelihoods of buying some products. To accomplish these tasks, we need more than a mere classification of buyers and non-buyers. We need (at least) a ranking of customers in terms of their likelihoods of buying. Thus, a ranking is much more desirable than just a classification.

If we are aiming at accurate ranking from a classifier, one might naturally think that we need the true ranking of the training examples. In most scenarios, however, that is not possible. Most likely, what we are given is a data set of examples with class labels. Fortunately, when only a training set with class labels is given, the area under the ROC (receiver operating characteristic) curve (Swets, 1988; Provost and Fawcett, 1997), or simply AUC, can be used to evaluate classifiers that also produce rankings. Hand and Till (2001) show that, for binary classification, AUC is equivalent to the probability that a randomly chosen example of class − will have a smaller estimated probability of belonging to class + than a randomly chosen example of class +. They present a simple approach to calculating the AUC of a classifier G below:


$$\hat{A} = \frac{S_0 - n_0(n_0 + 1)/2}{n_0 n_1}, \qquad (1)$$

where n_0 and n_1 are the numbers of negative and positive examples, respectively, and S_0 = Σ_i r_i, where r_i is the rank of the ith positive example in the ranking.

From Eq. (1), it is clear that AUC is essentially a measure of the quality of a ranking. For example, the AUC of a ranking is 1 (the maximum value of AUC) if there is no positive example preceding a negative example.
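To make the computation concrete, here is a minimal sketch (our own, not from the paper) of rank-based AUC for the binary case. It uses the standard Mann–Whitney form, summing the ranks of the positive examples and subtracting n1(n1 + 1)/2; ties are given average ranks, which is a common convention and an assumption on our part.

```python
def auc_from_scores(scores, labels):
    """scores: estimated P(+|E) per example; labels: 1 = positive, 0 = negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                       # assign 1-based ranks, averaging ties
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    n1 = sum(labels)                            # number of positive examples
    n0 = len(labels) - n1                       # number of negative examples
    s_pos = sum(r for r, y in zip(ranks, labels) if y == 1)   # rank sum of positives
    return (s_pos - n1 * (n1 + 1) / 2.0) / (n0 * n1)

print(auc_from_scores([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```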

If we are aiming at accurate probability-based ranking, what is the performance of traditional learning algorithms, such as decision trees and naive Bayes? While decision trees perform quite well in classification, it has also been found that their probability estimates are poor (Pazzani et al., 1994; Provost et al., 1998). Building decision trees with accurate probability estimates, called probability estimation trees (PETs), has received a great deal of attention recently (Provost and Domingos, 2003). Some researchers ascribe the poor probability estimates of decision trees to the decision tree learning algorithms. Thus, many techniques have been proposed to improve the learning algorithms in producing accurate probability estimates (Provost and Domingos, 2003).

To our observation, however, the representation also plays an important role. Indeed, the representation of decision trees is fully expressive theoretically, but it is often impractical to learn such a representation with accurate probability estimates from limited training data. In a decision tree, the class probability p(c|E) is estimated by the fraction of the examples of class c in the leaf into which E falls. Thus, the class probabilities of all the examples in the same leaf are equal. This is an obstacle in building an accurate PET, because two contradictory factors are in play at the same time. On one hand, traditional decision tree algorithms, such as C4.5, prefer a small tree; then each leaf has more examples and the class probability estimates are more reliable. A small tree, however, has a small number of leaves, so more examples share the same class probability, which prevents the learning algorithm from building an accurate PET. On the other hand, if the tree is large, not only may the tree overfit the training data, but the number of examples in each leaf is also small, and thus the probability estimates would not be accurate and reliable. Such a contradiction does exist in traditional decision trees.

Our motivation is to build a model to produce accurate ranking by extending the representation of traditional decision trees, not only to represent accurate probabilities but also to be easily learnable from limited data in practice. Naturally, if an accurate PET is built, the ranking yielded by it should also be accurate, since an accurate approximation of p(c|E) is found and can be used for ranking. In other words, its AUC should be high.

In this paper, a training example is represented by a vector of attribute values and a class label. We denote a vector of attributes by a bold-face upper-case letter A, A = (A1, A2, ..., An), and an assignment of values to each attribute in A by the corresponding bold-face lower-case letter a. We use C to denote the class variable and c to denote its value. Thus, a training example E = (a, c), where a = (a1, a2, ..., an) and ai is the value of attribute Ai. A classifier is a function that maps an example to a class label.

The rest of the paper is organized as follows. Section 2 introduces the related work on learning decision trees with accurate probability estimates and ranking. Section 3 presents a novel model for ranking and a corresponding algorithm. In Section 4, we present empirical experiments. The paper concludes with discussion and some directions for future work.

2. Related work

Traditional decision tree algorithms, such as C4.5, have been observed to produce poor estimates of probabilities (Pazzani et al., 1994; Provost et al., 1998). According to Provost and Domingos (2003), the decision tree representation, however, is not (inherently) doomed to produce poor probability estimates, and a part of the problem is that modern decision tree algorithms are biased against building the tree with accurate probability estimates. Provost and Domingos propose the following techniques to improve the AUC of C4.5.

(1) Smooth probability estimates by the Laplace correction. Assume that there are p examples of the class at a leaf, N total examples, and C total classes. The frequency-based estimation calculates the estimated probability as p/N, whereas the Laplace estimation calculates it as (p + 1)/(N + C) (see the sketch after this list).

(2) Turn off pruning and collapsing. Provost and Domingos (2003) show that pruning a large tree damages the probability estimation. Thus, a simple strategy to improve the probability estimation is to build a large tree without pruning.
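As a concrete illustration, here is a minimal sketch (our own, with a hypothetical function name) of the two leaf estimates; the Laplace correction simply adds pseudo-counts:

```python
# p: examples of the class at the leaf, N: total examples at the leaf,
# C: total number of classes. The function name is hypothetical.

def leaf_probability(p, N, C, laplace=True):
    return (p + 1) / (N + C) if laplace else p / N

print(leaf_probability(3, 3, 2, laplace=False))  # 1.0: over-confident on a pure 3-example leaf
print(leaf_probability(3, 3, 2))                 # 0.8: smoothed Laplace estimate
```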

Provost and Domingos call the resulting algorithm C4.4. They compared C4.4 to C4.5 by empirical experiments, and found that C4.4 is a significant improvement over C4.5 with regard to AUC.

Ling and Yan (2003) propose a method to improve the AUC of a decision tree. They present a novel probability estimation algorithm, in which the class probability of an example is an average of the probability estimates from all leaves of the tree, instead of only the leaf into which it falls. In other words, each leaf contributes to the class probability estimate of an example. Ferri et al. (2003) propose a new probability smoothing technique, m-branch smoothing, for decision trees, in which the class distributions of all nodes from the root to each leaf are taken into account.

In learning a decision tree, a critical step is to choose the "best" attribute at each step. The entropy-based splitting criteria, such as information gain and gain ratio, have been widely used. Recently, Ferri et al. (2002) propose a novel splitting criterion based on the ROC curve. Their experiments show that the new algorithm results in better probability estimates, without sacrificing accuracy.

[Fig. 1. An example of a probabilistic tree: internal nodes test attributes A1, A2, and A3, and each leaf stores a class distribution such as P(C = +) = 0.7.]

A questionable point of traditional decision trees (including probabilistic trees) is that only the attributes along the path from the root to a leaf are used in both classification and probability estimation. Since a small tree is preferred by traditional decision tree learning algorithms, many attributes may not be used. This is a more serious issue in ranking than in classification. Kohavi (1996) proposes to deploy a naive Bayes in each leaf; the resulting decision tree is called an NBTree. The algorithm for learning an NBTree is similar to C4.5. After a tree is grown, a naive Bayes is constructed for each leaf using the data associated with that leaf. An NBTree classifies an example by sorting it to a leaf and applying the naive Bayes in that leaf to assign a class label to it. Actually, deploying a model at the leaves to calibrate the probability estimates of a decision tree has been proposed by Smyth et al. (1996). They also notice that every example from a particular leaf has the same probability estimate, and thus suggest placing a kernel-based probability density estimator at each leaf.

Our work is inspired by the work of Kohavi and of Smyth et al., but from a different point of view. Indeed, if a local model that incorporates the attributes not occurring on the path is deployed at each leaf, together with the conditional probability of the attributes occurring on the path, the resulting tree represents accurate probabilities. If the structure of standard decision trees is learned and used in the same way as in C4.5, however, the leaf models would not directly and explicitly benefit from the structure, and thus would still play a role of smoothing. Our motivation is to learn and use the structure of a tree to explore conditional independencies among attributes, such that a simple leaf model, like naive Bayes, gives accurate probability estimates. Then, the resulting model is more compact and more easily learnable, while its representation is still accurate. Since the probability estimates are more accurate, the ranking yielded by the model is also more accurate.

3. Understanding decision trees from a probabilistic perspective

Even though there theoretically exists a decision tree with accurate probability estimates for any given problem, such a tree tends to be large and learnable only when sufficient (huge) training data are available. This issue is called the fragmentation problem (Pagallo and Haussler, 1990; Kohavi, 1996). In practice, a small tree is preferred, and thus poor probability estimates are yielded. Therefore, the representation of a decision tree should be extended to represent accurate probabilities and be learnable from limited training data.

3.1. Probabilistic decision trees

Fig. 1 shows an example of a probabilistic tree (Buntine, 1991), in which each leaf L represents a conditional distribution p(C|A_p(L)), where A_p(L) are the attributes that occur on the path from the root to L. For simplicity, the attributes that occur on the path are called the path attributes of L, and all other attributes are called the leaf attributes of L, denoted by A_l(L).

In practice, p(C|A_p(L)) is often estimated by the fraction of examples of class C in L, and the classification of a decision tree is based on p(C|A_p(L)). Thus, from the probabilistic point of view, a decision tree defines a classifier, shown as

$$C_{dt}(E) = \arg\max_{c}\, p(c \mid a_p(L)), \qquad (2)$$

where L is the leaf into which E falls, a_p(L) is the value of the path attributes of L, and C_dt(E) is the classification given by the decision tree.

In a decision tree, p(c|a_p(L)) is actually used as an approximation of p(c|E). Thus, all the examples falling into the same leaf have the same class probability.

3.2. Conditional independence trees

In a probabilistic tree, a leaf L represents the conditional probability distribution p(C|A_p(L)). If there is a representation of the conditional probability distribution over the leaf attributes at each leaf, called the local conditional distribution and denoted by p(A_l(L)|A_p(L), C), then each leaf represents a full joint distribution over all the attributes, shown as

$$p(A, C) = \alpha\, p(C \mid A_p(L))\, p(A_l(L) \mid A_p(L), C), \qquad (3)$$

where α is a normalization factor.

A probabilistic decision tree T is called a joint probabilistic tree if each of its leaves represents both the conditional probability distribution p(C|A_p(L)) and p(A_l(L)|A_p(L), C).

A joint probabilistic tree T is called a conditional independence tree, or simply CITree, if the local conditional independence assumption, shown in Eq. (4), is true for each leaf L:

$$p(A_l(L) \mid A_p(L), C) = \prod_{i=1}^{m} p(A_{li} \mid C, A_p(L)), \qquad (4)$$

where A_l(L) = (A_{l1}, A_{l2}, ..., A_{lm}) are the leaf attributes of L.

[Fig. 2. A CITree to represent any joint distribution p(A1, A2, ..., An), where A1, A2, ..., An are Boolean attributes.]

Given an example E, the class probability p(c|E) is computed as follows. E is sorted from the root to a leaf using its attribute values (the path attributes), and then the local model on that leaf is applied to compute the probability p(c|E) using only the leaf attributes.
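To illustrate how this computation proceeds, here is a small sketch (our own construction, not code from the paper) of routing an example to a leaf and applying the leaf's local naive Bayes, in the spirit of Eqs. (3), (4) and (6); the Leaf and Node structures and their fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Union

@dataclass
class Leaf:
    class_prior: Dict[str, float]                   # p_L(c), estimated at the leaf
    cond: Dict[str, Dict[str, Dict[str, float]]]    # cond[attr][c][value] = p_L(value | c)

@dataclass
class Node:
    attr: str                                       # path attribute tested at this node
    children: Dict[str, Union["Node", "Leaf"]]      # one subtree per attribute value

def class_probability(tree, example, cls):
    """Route `example` to a leaf, then apply the leaf naive Bayes over leaf attributes."""
    node = tree
    while isinstance(node, Node):                   # follow the path attributes
        node = node.children[example[node.attr]]
    scores = {}
    for c in node.class_prior:                      # p_L(c) * prod_i p_L(a_li | c)
        p = node.class_prior[c]
        for attr, tables in node.cond.items():
            p *= tables[c][example[attr]]
        scores[c] = p
    return scores[cls] / sum(scores.values())       # normalize, cf. Eq. (3)

# Toy CITree: the root splits on A1; each leaf models A2 given the class.
leaf1 = Leaf({"+": 0.7, "-": 0.3},
             {"A2": {"+": {"0": 0.2, "1": 0.8}, "-": {"0": 0.6, "1": 0.4}}})
leaf0 = Leaf({"+": 0.1, "-": 0.9},
             {"A2": {"+": {"0": 0.5, "1": 0.5}, "-": {"0": 0.9, "1": 0.1}}})
tree = Node("A1", {"1": leaf1, "0": leaf0})
print(class_probability(tree, {"A1": "1", "A2": "1"}, "+"))   # ~0.82
```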

The structure of a CITree represents the conditional independencies among attributes, and its leaves represent a joint distribution. A CITree is different from a probabilistic tree in the following aspects.

(1) A CITree represents a joint distribution over all the attributes, but a probabilistic tree represents only the conditional probability distribution of the path attributes.

(2) A CITree explicitly defines conditional dependencies among attributes.

Notice the conditional independence assumption on which naive Bayes is based, shown as

$$p(a \mid c) = \prod_{i=1}^{n} p(a_i \mid c). \qquad (5)$$

Comparing Eq. (4) with Eq. (5), we notice that the local conditional independence assumption of CITrees is a relaxation of the (global) conditional independence assumption of naive Bayes. Thus, the local conditional independence assumption is more realistic in applications. In addition, the local conditional independence represented in a CITree is also different from the conditional independence in a Bayesian network. In a Bayesian network, an attribute A1 being conditionally independent of attribute A2 given A3 means that for all values of A3, A1 is independent of A2. In a CITree, however, the conditional independence is that A1 is independent of A2 given a specified value of A3. The granularity in a CITree is finer than that in a Bayesian network.

It is interesting to notice that, after growing a CITree, if a naive Bayes is deployed on each leaf using only the data associated with it, that naive Bayes, called a leaf naive Bayes, represents the actual joint distribution. A leaf naive Bayes in leaf L is shown as

$$C_{lnb}(E) = \arg\max_{c}\, p_L(c) \prod_{i=1}^{m} p_L(a_{li} \mid c), \qquad (6)$$

where p_L(c) denotes the probability of examples in L being in class c, and p_L(a_{li}|c) is the probability that the examples of class c have A_{li} = a_{li} in L. It is obvious that p_L(c) = p(c|a_p(L)) and p_L(a_{li}|c) = p(a_{li}|c, a_p(L)) on the whole training data. So p_L(c) Π_{i=1}^{m} p_L(a_{li}|c) is proportional to p(c|E). Thus, if the structure of the CITree is found, naive Bayes is a perfect model for the leaves.

Generally, if the local model is naive Bayes, a CITree can be viewed as a combination of a decision tree and naive Bayes. It is well known that decision trees are fully expressive within the class of propositional languages; that is, any Boolean function is representable by a decision tree. However, naive Bayes has limited expressive power; it can only represent linear Boolean functions (Domingos and Pazzani, 1997). Interestingly, any joint distribution is representable by a CITree. According to the product rule,

$$p(A_1, A_2, \ldots, A_n, C) = p(C)\, p(A_1 \mid C)\, p(A_2 \mid A_1, C) \cdots p(A_n \mid A_1, \ldots, A_{n-1}, C). \qquad (7)$$

A CITree representing any joint distribution p(A1, A2, ..., An) is shown in Fig. 2. Thus, CITrees are also fully expressive.

The representation of CITrees, however, is more compact than that of decision trees. To show this, let us consider only full dependencies among attributes. An attribute Ai is said to fully depend on Aj if Ai = Aj. Notice that if an attribute is conditionally independent of all other attributes, it does not occur on any path. If several attributes conditionally depend on one attribute, only that attribute occurs in the path. In the extreme case that the global conditional independence assumption is true, a CITree has only one node, which is just a global naive Bayes. Assume that there are n attributes. The maximum height of a CITree is n/2, which corresponds to the case in which each attribute depends on exactly one other attribute. The maximum height of a decision tree is n. Our experiments in Section 4 show that the average size of CITrees is much smaller than that of decision trees.

3.3. An AUC-based algorithm for learning CITree

From the discussion in the preceding section, a CITree can represent any joint distribution. Thus, a CITree is a perfect PET, and the ranking yielded by a CITree is accurate. In practice, however, learning the structure of a CITree is just as time-consuming as learning an optimal decision tree. Nevertheless, a good approximation of a CITree, which gives good estimates of class probabilities, is sufficient in many applications. If the structure of a CITree is determined, a leaf naive Bayes is a perfect model for representing the local conditional distributions at the leaves.

Building a CITree can also be a greedy and recursive process, similar to building a decision tree. At each step, choose the "best" attribute as the root of the (sub)tree, split the associated data into disjoint subsets corresponding to the values of the attribute, and then repeat this process for each subset until certain criteria are satisfied.

Table 1
Description of the data sets used in the experiments

Data set                   Size    Number of attributes  Number of classes
Letter                     20,000  17                    26
Mushroom                   8124    22                    2
Waveform                   5000    41                    3
Sick                       3772    30                    2
Hypothyroid                3772    30                    4
Chess end-game             3196    36                    2
Splice                     3190    62                    3
Segment                    2310    20                    7
German credit              1000    24                    2
Vowel                      990     14                    11
Anneal                     898     39                    6
Vehicle                    846     19                    4
Pima Indians diabetes      768     8                     2
Wisconsin-breast-cancer    699     9                     2
Credit approval            690     15                    2
Soybean                    683     36                    19
Balance-scale              625     5                     3
Vote                       435     16                    2
Horse colic                368     28                    2
Ionosphere                 351     34                    2
Primary-tumor              339     18                    22
Heart-c                    303     14                    5
Breast cancer              286     9                     2
Heart-statlog              270     13                    2
Audiology                  226     70                    24
Glass                      214     10                    7
Sonar                      208     61                    2
Autos                      205     26                    7
Hepatitis domain           155     19                    2
Iris                       150     5                     3
Lymph                      148     19                    4
Zoo                        101     18                    7
Labor                      57      16                    2

Notice as well, however, the difference between learning a CITree and learning a decision tree. In building a decision tree, we are looking for a sequence of attributes that leads to the least impurity in all leaves of the tree. The key in choosing an attribute is whether the resulting partition of the examples is "pure" or not. This is natural, since the most common class of a leaf is used as the class of all the examples in that leaf. However, such a selection strategy does not necessarily lead to the truth of the local conditional independence assumption. In building a CITree, we intend to choose the attributes that make the local conditional independence among the rest of the attributes true as much as possible. That means that, even though the impurity of its leaves is high, a tree could still be a good CITree, as long as the leaf attributes are independent. Thus, traditional decision tree learning algorithms are not directly suitable for learning CITrees.

In learning a CITree, an attribute given which all other attributes have the maximum conditional independence should be selected at each step. Thus, we should select the attribute with the greatest influence on other attributes. Our idea is to try each possible attribute as the root, evaluate the resulting tree, and choose the attribute that achieves the highest AUC.

Similar to C4.5, our learning algorithm has two steps: growing a tree and pruning. In growing a tree, each possible attribute is evaluated at each step, and the attribute that gives the most improvement in AUC is selected. The algorithm is depicted below.

Algorithm AUC-CITree(T, S, A)
Input: a CITree T, a set S of labeled examples, and a set of attributes A.
Output: a CITree.

(1) For each attribute A in A:
    • Partition S into S1, ..., Sk, each of which corresponds to a value of A.
    • Create a leaf naive Bayes for each Si.
    • Evaluate the AUC on S of the resulting CITree.
(2) If the best AUC of the resulting CITrees is not significantly better than the one produced by the naive Bayes on S, make the current node a leaf and return.
(3) For each value a of the attribute Aopt that achieves the most improvement in AUC:
    • CITree(Ta, Sa, A − {Aopt}).
    • Add Ta as a child of T.
(4) For each child A of the parent Ap of Aopt:
    • Make the node A a leaf and evaluate the resulting AUC on Ap.
    • If it is not significantly worse than the original AUC, make the node A a leaf.
(5) Return T.

In the preceding algorithm, we use a relative AUC increase (or reduction), (AUC_c − AUC_o)/AUC_c, of 5% to define the significance of the improvement of the resulting CITree, where AUC_c and AUC_o are the new AUC score and the original one, respectively. Notice that the AUC for the children of a node is computed by putting the instances from all the leaves together rather than computing the AUC for each leaf separately.
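To make the growing step concrete, below is a simplified sketch (our own construction, not the authors' implementation) of the greedy, AUC-driven growth for a binary class and discrete attributes. The post-pruning of step (4) is omitted, ties in the AUC computation are broken arbitrarily for brevity, and names such as grow_citree and NaiveBayesLeaf are hypothetical.

```python
from collections import Counter, defaultdict

class NaiveBayesLeaf:
    """Leaf naive Bayes with Laplace-corrected counts over the leaf attributes."""
    def __init__(self, examples, attributes):        # examples: list of (dict, label)
        self.attributes = attributes
        self.class_counts = Counter(c for _, c in examples)
        self.value_counts = {a: defaultdict(Counter) for a in attributes}
        for x, c in examples:
            for a in attributes:
                self.value_counts[a][c][x[a]] += 1

    def prob_pos(self, x):
        """Score proportional to P(C=1 | x), using only the leaf attributes."""
        n = sum(self.class_counts.values())
        scores = {}
        for c in (0, 1):
            p = (self.class_counts[c] + 1.0) / (n + 2.0)
            for a in self.attributes:
                counts = self.value_counts[a][c]
                p *= (counts[x[a]] + 1.0) / (self.class_counts[c] + len(counts) + 1.0)
            scores[c] = p
        return scores[1] / (scores[0] + scores[1])

def auc(scores, labels):
    """Rank-based binary AUC (ties broken arbitrarily for brevity)."""
    ranked = sorted(zip(scores, labels))
    pos_ranks = [i + 1 for i, (_, y) in enumerate(ranked) if y == 1]
    n1, n0 = len(pos_ranks), len(labels) - len(pos_ranks)
    if n0 == 0 or n1 == 0:
        return 1.0                                    # degenerate partition
    return (sum(pos_ranks) - n1 * (n1 + 1) / 2.0) / (n0 * n1)

def grow_citree(examples, attributes, min_gain=0.05):
    """Pick the split whose per-partition leaf naive Bayes most improves AUC
    (relative gain >= min_gain) over a single naive Bayes on the data; recurse."""
    labels = [c for _, c in examples]
    base = NaiveBayesLeaf(examples, attributes)
    base_auc = auc([base.prob_pos(x) for x, _ in examples], labels)
    best = None
    for a in attributes:
        parts = defaultdict(list)
        for x, c in examples:
            parts[x[a]].append((x, c))
        rest = [b for b in attributes if b != a]
        leaves = {v: NaiveBayesLeaf(p, rest) for v, p in parts.items()}
        split_auc = auc([leaves[x[a]].prob_pos(x) for x, _ in examples], labels)
        if best is None or split_auc > best[1]:
            best = (a, split_auc, parts)
    if best is None or best[1] <= base_auc or (best[1] - base_auc) / best[1] < min_gain:
        return base                                   # stop: keep the naive Bayes leaf
    a, _, parts = best
    rest = [b for b in attributes if b != a]
    return {"split": a,
            "children": {v: grow_citree(p, rest, min_gain) for v, p in parts.items()}}
```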

Our AUC-CITree algorithm is different from the NBTree algorithm (Kohavi, 1996) in several aspects:

(1) The AUC-CITree algorithm is based on AUC instead of accuracy.

(2) The AUC-CITree algorithm adopts a post-pruning strategy rather than early stopping. It has been noticed that pruning is detrimental to probability estimation for traditional decision trees (Provost and Domingos, 2003; Ferri et al., 2003). However, it is different for CITrees. Notice that a local model is deployed on each leaf. Without pruning, the probability estimates given by the local models would not be reliable, because the number of training examples at each leaf is small. Thus, pruning is necessary for building a good CITree. An alternative strategy is early stopping, adopted in NBTree: the tree-growing process stops when the size of the training data is smaller than a threshold (30 in NBTree). From our experiments, however, pruning is more effective.

Notice that the methods of both Ling and Yan (2003) and Ferri et al. (2003) are essentially smoothing techniques based on the structure of traditional decision trees. CITrees, however, use the structure of a decision tree to represent conditional dependence and deploy a local model on each leaf to produce the class probabilities. Intuitively, CITrees could be more powerful than smoothing techniques.

4. Experiments

We conduct experiments to compare our algorithm CITree with C4.4, naive Bayes, and NBTree. The implementation of C4.4, naive Bayes, and NBTree is from Weka (Witten and Frank, 2000); C4.4 is J48 in Weka with the Laplace correction and with pruning and collapsing turned off. Notice that C4.4 is designed specifically for improving the AUC score of decision trees, but the naive Bayes used in our experiments is not.

We used 33 UCI (Merz et al., 1997) data sets assigned by Weka, described in Table 1. Numeric attributes are discretized using the ten-bin discretization implemented in Weka. Missing values are also processed using the mechanism in Weka. In our experiments, multi-class AUC has been calculated by the M-measure (Hand and Till, 2001), and the average AUC on each data set is obtained using 10-fold stratified cross-validation 10 times. In our implementation, we used the Laplace estimation to avoid the zero-frequency problem. We conducted a two-tailed t-test with a 95% confidence level to compare each pair of algorithms on each data set.
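To show how the multi-class AUC is obtained, here is a small sketch (our own construction) in the spirit of Hand and Till's M-measure: the pairwise AUC for classes i and j, averaged over both directions, is itself averaged over all class pairs. The binary_auc argument is assumed to be a rank-based binary AUC function, such as the one sketched in the Introduction.

```python
from itertools import combinations

def m_measure(probs, labels, classes, binary_auc):
    """probs[k][c]: estimated p(c | example k); labels[k]: true class of example k."""
    total = 0.0
    for i, j in combinations(classes, 2):
        idx = [k for k, y in enumerate(labels) if y in (i, j)]
        a_ij = binary_auc([probs[k][i] for k in idx],
                          [1 if labels[k] == i else 0 for k in idx])
        a_ji = binary_auc([probs[k][j] for k in idx],
                          [1 if labels[k] == j else 0 for k in idx])
        total += (a_ij + a_ji) / 2.0
    return 2.0 * total / (len(classes) * (len(classes) - 1))
```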

Table 2 shows the average AUC obtained by the four algorithms. Our observations are summarized below.

(1) The CITree algorithm outperforms naive Bayes significantly in terms of AUC: it wins in 9 data sets, ties in 24 data sets, and loses in 0 data sets. The average AUC for CITree is 90.31%, higher than the average AUC of naive Bayes, 89.74%.

Table 2
Experimental results on AUC

Data set                   AUC-CITree       NBTree           C4.4             NB
Letter                     98.59 ± 0.15     98.49 ± 0.17     95.26 ± 0.32     96.88 ± 0.21
Mushroom                   100 ± 0          100 ± 0          100 ± 0          99.79 ± 0.07
Waveform                   95.29 ± 0.68     93.69 ± 0.96     80.95 ± 1.47     95.29 ± 0.68
Sick                       98.42 ± 1.43     94.27 ± 3.62     99.08 ± 0.52     95.83 ± 2.4
Hypothyroid                88.26 ± 5.67     87.47 ± 6.34     82.74 ± 7.58     87.78 ± 6.12
Chess end-game             99.79 ± 0.24     99.44 ± 0.6      99.95 ± 0.06     95.16 ± 1.2
Splice                     99.45 ± 0.28     99.43 ± 0.31     97.91 ± 0.72     99.45 ± 0.28
Segment                    99.33 ± 0.23     99.08 ± 0.34     98.98 ± 0.38     98.5 ± 0.41
German credit              79.03 ± 4.2      77.49 ± 5.34     68.58 ± 4.67     79.02 ± 4.22
Vowel                      99.22 ± 0.62     98.46 ± 0.84     90.57 ± 2.33     95.58 ± 1.12
Anneal                     96.05 ± 2.03     96.23 ± 1.29     94.53 ± 2.31     96.1 ± 1.19
Vehicle                    86.18 ± 2.71     85.66 ± 3.43     85.35 ± 3.07     80.31 ± 3.09
Pima Indians diabetes      82.47 ± 5.03     81.99 ± 5.1      73.89 ± 5.33     82.51 ± 5
Wisconsin-breast-cancer    99.15 ± 0.85     99.25 ± 0.75     97.98 ± 1.44     99.25 ± 0.75
Credit approval            91.04 ± 3.19     91.15 ± 3.44     87.5 ± 3.75      91.67 ± 3.17
Soybean                    99.75 ± 0.33     99.68 ± 0.41     91.32 ± 1.58     99.73 ± 0.34
Balance-scale              84.08 ± 4.42     84.08 ± 4.42     58.83 ± 5.31     84.08 ± 4.42
Vote                       98.26 ± 1.73     98.51 ± 1.67     97.28 ± 2.53     96.95 ± 2.14
Horse colic                84.59 ± 7.15     86.28 ± 6.91     81.91 ± 7.32     83.32 ± 7.57
Ionosphere                 95.33 ± 3.5      94.04 ± 4.42     92.09 ± 5.2      93.4 ± 4.79
Primary-tumor              78.75 ± 1.72     78.12 ± 1.8      74.9 ± 2.37      78.88 ± 1.76
Heart-c                    84.05 ± 0.6      83.93 ± 0.62     83.11 ± 0.83     84.05 ± 0.69
Breast cancer              66.94 ± 11.36    66.01 ± 10.4     58.05 ± 9.93     68.24 ± 11.93
Heart-statlog              90.78 ± 5.1      89.28 ± 6.26     80.82 ± 9.39     90.85 ± 5.12
Audiology                  70.8 ± 0.86      71.06 ± 0.69     70.51 ± 0.72     71.08 ± 0.64
Glass                      84.27 ± 5.15     82 ± 6.08        82.72 ± 5.24     80.89 ± 5.9
Sonar                      78.94 ± 10.44    77.54 ± 9.9      76.32 ± 9.05     84.17 ± 9.52
Autos                      93.36 ± 2.97     93.84 ± 3.15     91.21 ± 3.33     89.84 ± 5.09
Hepatitis domain           85.65 ± 13.17    82.77 ± 13.96    75.6 ± 16.57     87.25 ± 11.93
Iris                       98.42 ± 2.13     98.84 ± 2.01     96.86 ± 2.86     98.64 ± 2.17
Lymph                      89.92 ± 1.82     89.05 ± 2.47     86.33 ± 4.84     90.01 ± 1.71
Zoo                        89.44 ± 2.44     89.48 ± 2.37     88.43 ± 2.7      89.48 ± 2.37
Labor                      95.5 ± 15.55     97.42 ± 12.06    82.21 ± 20.45    97.5 ± 8.58

Mean                       90.31 ± 3.58     89.82 ± 3.72     85.51 ± 4.37     89.74 ± 3.53

Table 4
Results of two-tailed t-test on AUC

              NB        C4.4      NBTree
AUC-CITree    9-24-0    19-14-0   4-29-0
NB                      22-6-5    2-23-8
C4.4                              1-13-19

Note: Each entry w/t/l means that the algorithm in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets, compared to the algorithm in the corresponding column.

(2) The CITree algorithm also outperforms C4.4 significantly in terms of AUC: it wins in 19 data sets, ties in 14 data sets, and loses in 0 data sets. The average AUC for decision trees is 85.81%, lower than CITree's.

(3) The CITree algorithm performs better than NBTree in terms of AUC: it wins in 4 data sets, ties in 29 data sets, and loses in 0 data sets. The average AUC for NBTree is 89.82%, lower than CITree's.

Table 3 shows the tree size and training time obtained by the three tree learning algorithms. Notice that C4.4 is much more efficient than both NBTree and CITree; thus, we do not include the running time of C4.4. From Table 3, we can see that the CITree learning algorithm is more efficient than NBTree and that the size of the CITrees is smaller than that of NBTree and C4.4. Some detailed observations are summarized below (Table 4).

(1) The tree size for CITree is significantly smaller than the tree size for C4.4 over most of these data sets. Here the size of a tree is the number of nodes. The total tree size for CITree is 477, and for C4.4 it is 28,356. Notice that pruning is not suitable for C4.4: the basic idea of C4.4 is to obtain a large tree and then use the Laplace correction to smooth the probability estimates. According to Provost and Domingos (2003), pruning damages the probability estimation of traditional decision trees.

(2) The tree size for CITree is also significantly smaller than the tree size for NBTree over most of these data sets. The total tree size for NBTree is 2158. NBTree avoids producing a large tree by early stopping instead of pruning, but it essentially prefers a small tree.

Table 3
Experimental results on the tree size and training time (s)

Data set                   CITree(S)  NBTree(S)  C4.4(S)  CITree(T)  NBTree(T)
Letter                     18         1298       14,162   52.63      246.11
Mushroom                   20         26         30       5.01       6.73
Waveform                   1          57         4161     7.36       64.26
Sick                       58         63         359      7          16.26
Hypothyroid                10         5          1463     7          6.29
Chess end-game             56         47         88       12.73      20.71
Splice                     1          3          588      9.61       13.49
Segment                    12         121        759      2.73       12.35
German credit              1          16         800      0.34       2.53
Vowel                      22         64         899      0.99       8.4
Anneal                     11         43         100      2.3        9.52
Vehicle                    94         123        937      0.88       9.57
Pima Indians diabetes      1          8          689      0.06       0.55
Wisconsin-breast-cancer    2          2          149      0.07       0.56
Credit approval            8          14         429      0.27       2.21
Soybean                    4          37         131      4.44       24.04
Balance-scale              1          1          308      0.02       0.14
Vote                       18         17         7        0.36       1.67
Horse colic                29         28         210      0.45       5.42
Ionosphere                 28         13         164      0.63       7.69
Primary-tumor              2          11         196      0.53       2.49
Heart-c                    1          11         203      0.08       1.32
Breast cancer              7          11         30       0.05       0.52
Heart-statlog              1          13         271      0.05       0.83
Audiology                  9          23         93       5.21       28.88
Glass                      12         29         276      0.09       0.81
Sonar                      11         13         154      0.78       15.56
Autos                      15         25         216      0.37       4.12
Hepatitis domain           6          10         76       0.12       1.74
Iris                       7          7          55       0.01       0.15
Lymph                      2          9          73       0.08       1.42
Zoo                        6          7          23       0.11       0.92
Labor                      3          4          21       0.02       0.51

Total                      477        2158       28,356   122.38     517.71

Note: CITree(S), NBTree(S), and C4.4(S) represent the tree size obtained by the corresponding algorithm, and CITree(T) and NBTree(T) represent the corresponding training time.

(3) The training time for CITree is significantly shorter than that of NBTree over most of these data sets. The total training time for CITree is 123 s, and for NBTree it is 517 s.

5. Conclusions

In this paper, we extend the traditional decision tree model to represent accurate probabilities in order to yield accurate ranking, or high AUC. We propose a model, CITree, whose structure explicitly represents conditional independencies among attributes. CITrees are more expressive than naive Bayes and more compact than decision trees. We present and implement a novel AUC-based learning algorithm, AUC-CITree, which builds a CITree for ranking by exploring the conditional independencies among attributes, different from traditional decision tree learning algorithms. Our experiments show that the AUC-CITree algorithm performs better than C4.4, naive Bayes, and NBTree in AUC. In addition, the AUC-CITree algorithm is more efficient and produces smaller trees compared to NBTree.

CITree can be viewed as a bridge between probabilistic models, such as Bayesian networks, and non-parametric models, such as decision trees. However, a more effective CITree learning algorithm is desired. Currently, our learning algorithm is based on cross-validation. We believe that if a better learning algorithm is found, a CITree will benefit much from its structure, and thus will be a good model for applications.

References

Buntine, W., 1991. Learning Classification Trees. Artificial Intelligence Frontiers in Statistics. Chapman and Hall, London, pp. 182–201.

Domingos, P., Pazzani, M., 1997. Beyond independence: conditions for the optimality of the simple Bayesian classifier. Machine Learn. 29, 103–130.

Ferri, C., Flach, P.A., Hernandez-Orallo, J., 2002. Learning decision trees using the area under the ROC curve. In: Proc. of the 19th Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 139–146.

Ferri, C., Flach, P.A., Hernandez-Orallo, J., 2003. Improving the AUC of probabilistic estimation trees. In: Proc. of the 14th European Conf. on Machine Learning. Springer, Berlin, pp. 121–132.

Hand, D.J., Till, R.J., 2001. A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learn. 45, 171–186.

Kohavi, R., 1996. Scaling up the accuracy of naive-Bayes classifiers: a decision-tree hybrid. In: Proc. of the Second Internat. Conf. on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, pp. 202–207.

Ling, C.X., Yan, R.J., 2003. Decision tree with better ranking. In: Proc. of the 20th Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 480–487.

Merz, C., Murphy, P., Aha, D., 1997. UCI repository of machine learning databases. Dept of ICS, University of California, Irvine. Available from: <http://www.ics.uci.edu/mlearn/MLRepository.html>.

Pagallo, G., Haussler, D., 1990. Boolean feature discovery in empirical learning. Machine Learn. 5 (1), 71–100.

Pazzani, M., Merz, C., Murphy, P., Ali, K., Hume, T., Brunk, C., 1994. Reducing misclassification costs. In: Proc. of the 11th Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 217–225.

Provost, F.J., Domingos, P., 2003. Tree induction for probability-based ranking. Machine Learn. 52 (3), 199–215.

Provost, F., Fawcett, T., 1997. Analysis and visualization of classifier performance: comparison under imprecise class and cost distribution. In: Proc. of the Third Internat. Conf. on Knowledge Discovery and Data Mining. AAAI Press, pp. 43–48.

Provost, F., Fawcett, T., Kohavi, R., 1998. The case against accuracy estimation for comparing induction algorithms. In: Proc. of the Fifteenth Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 445–453.

Smyth, P., Gray, A., Fayyad, U., 1996. Retrofitting decision tree classifiers using kernel density estimation. In: Proc. of the Twelfth Internat. Conf. on Machine Learning. Morgan Kaufmann, Los Altos, CA, pp. 506–514.

Swets, J., 1988. Measuring the accuracy of diagnostic systems. Science 240, 1285–1293.

Witten, I.H., Frank, E., 2000. Data Mining—Practical Machine Learning Tools and Techniques with Java Implementation. Morgan Kaufmann, Los Altos, CA.

Further reading

Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.