
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 42, NO. 1, FEBRUARY 2012

Geometric Decision Tree

Naresh Manwani and P. S. Sastry, Senior Member, IEEE

Manuscript received November 2, 2010; revised March 30, 2011 and June 22, 2011; accepted July 10, 2011. Date of current version December 7, 2011. This paper was recommended by Associate Editor M. S. Obaidat. The authors are with the Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TSMCB.2011.2163392

Abstract—In this paper, we present a new algorithm for learning oblique decision trees. Most of the current decision tree algorithms rely on impurity measures to assess the goodness of hyperplanes at each node while learning a decision tree in top-down fashion. These impurity measures do not properly capture the geometric structures in the data. Motivated by this, our algorithm uses a strategy for assessing the hyperplanes in such a way that the geometric structure in the data is taken into account. At each node of the decision tree, we find the clustering hyperplanes for both the classes and use their angle bisectors as the split rule at that node. We show through empirical studies that this idea leads to small decision trees and better performance. We also present some analysis to show that the angle bisectors of clustering hyperplanes that we use as the split rules at each node are solutions of an interesting optimization problem and hence argue that this is a principled method of learning a decision tree.

Index Terms—Decision trees, generalized eigenvalue problem, multiclass classification, oblique decision tree.

I. INTRODUCTION

THE DECISION tree is a well-known and widely used method for classification. The popularity of the decision tree is because of its simplicity and easy interpretability as a classification rule. In a decision tree classifier, each nonleaf node is associated with a so-called split rule or a decision function, which is a function of the feature vector and is often binary valued. Each leaf node in the tree is associated with a class label. To classify a feature vector using a decision tree, at every nonleaf node that we encounter (starting with the root node), we branch to one of the children of that node based on the value assumed by the split rule of that node on the given feature vector. This process follows a path in the tree, and when we reach a leaf, the class label of the leaf is what is assigned to that feature vector. In this paper, we address the problem of learning an oblique decision tree, given a set of labeled training samples. We present a novel algorithm that attempts to build the tree by capturing the geometric structure of the class regions.

Decision trees can be broadly classified into two types, i.e., axis parallel and oblique [1]. In an axis-parallel decision tree, the split rule at each node is a function of only one of the components of the feature vector. Axis-parallel decision trees are particularly attractive when all features are nominal; in such cases, we can have a nonbinary tree where, at each node, we test one feature value, and the node can have as many children as the values assumed by that feature [2]. However, in more general situations, we have to approximate even arbitrary linear segments in the class boundary with many axis-parallel pieces; hence, the size of the resulting tree becomes large. The oblique decision trees, on the other hand, use a decision function that depends on a linear combination of all feature components. Thus, an oblique decision tree is a binary tree where we associate a hyperplane with each node. To classify a pattern, we follow a path in the tree by taking the left or right child at each node based on which side of the hyperplane (of that node) the feature vector falls in. Oblique decision trees represent the class boundary as a general piecewise linear surface. Oblique decision trees are more versatile (and hence are more popular) when features are real valued.

The approaches for learning oblique decision trees can be classified into two broad categories. In one set of approaches, the structure of the tree is fixed beforehand, and we try to learn the optimal tree with this fixed structure. This methodology has been adopted by several researchers, and different optimization algorithms have been proposed [3]-[8]. The problem with these approaches is that they are applicable only in situations where we know the structure of the tree a priori, which is often not the case. The other class of approaches learns the tree in a top-down manner. Top-down approaches have been more popular because of their versatility. Top-down approaches are recursive algorithms for building the tree in a top-down fashion. We start with the given training data and decide on the "best" hyperplane, which is assigned to the root of the tree. Then, we partition the training examples into two sets that go to the left child and the right child of the root node using this hyperplane. Then, at each of the two child nodes, we repeat the same procedure (using the appropriate subset of the training data). The recursion stops when the set of training examples that come to a node is pure, that is, all these training patterns are of the same class. Then, we make it a leaf node and assign that class to the leaf node. (We can also have other stopping criteria, such as making a node a leaf node if, for example, 95% of the training examples reaching that node belong to one class.) A detailed survey of top-down decision tree algorithms is available in [9].

There are two main issues in top-down decision tree learning algorithms: 1) given the training examples at a node, how to rate different hyperplanes that can be associated with this node and 2) given a rating function, how to find the optimal hyperplane at each node. One way of rating hyperplanes is to look for hyperplanes that are reasonably good classifiers for the training data at that node. In [10], two parallel hyperplanes are learned at each node such that one side of each hyperplane contains points of only one class and the space between these two hyperplanes contains the points that are not separable. A slight variant of the aforementioned algorithm is proposed in [11], where only



one hyperplane is learned at each decision node in such a way that one side of the hyperplane contains points of only one class. However, in many cases, such approaches produce very large trees that have poor generalization performance. Another approach is to learn a good linear classifier (e.g., a least mean square error classifier) at each node (see, e.g., [12]). Decision tree learning for multiclass classification problems using linear-classifier-based approaches is discussed in [13], [14]. Instead of finding a linear classifier at each node, Cline [15], which is a family of decision tree algorithms, uses various heuristics to determine hyperplanes at each node. However, they do not provide any results to show why these heuristics help or how one chooses a method. A Fisher-linear-discriminant-based decision tree algorithm is proposed in [16]. All the previously discussed approaches produce crisp decision boundaries. A decision tree approach giving a probabilistic decision boundary is discussed in [17]. Its fuzzy variants are discussed in [18] and [19].

In a decision tree, each hyperplane at a nonleaf node should split the data in such a way that it aids further classification; the hyperplane itself need not be a good classifier at that stage. In view of this, many classical top-down decision tree learning algorithms are based on rating hyperplanes using the so-called impurity measures. The main idea is given as follows: Given the set of training patterns at a node and a hyperplane, we know the set of patterns that go into the left and right children of this node. If each of these two sets of patterns has a predominance of one class over the others, then, presumably, the hyperplane can be considered to have contributed positively to further classification. At any stage in the learning process, the level of purity of a node is some measure of how skewed the distribution of different classes is in the set of patterns landing at that node. If the class distribution is nearly uniform, then the node is highly impure; if the number of patterns of one class is much larger than that of all others, then the purity of the node is high. The impurity measures used in the algorithms give a higher rating to a hyperplane that results in higher purity of the child nodes. The Gini index, entropy, and twoing rule are some of the frequently used impurity measures [9]. A slightly different measure is the area under the ROC curve [20], which is also called AUCsplit and is related inversely to the Gini index. Some of the popular algorithms that learn oblique decision trees by optimizing some impurity measures are discussed in [9].

Many of the impurity measures are not differentiable with respect to the hyperplane parameters. Thus, the algorithms for decision tree learning using impurity measures need to use some search techniques for finding the best hyperplane at each node. For example, CART-LC [1] uses a deterministic hill-climbing algorithm; OC1 [21] uses a randomized search. Both of these approaches search in one dimension at a time, which becomes computationally cumbersome in high-dimensional feature spaces. In contrast to these approaches, evolutionary approaches are able to optimize in all dimensions simultaneously. Some examples of decision tree algorithms in which the rating function is optimized using evolutionary approaches are in [18], [22], and [23]. Evolutionary approaches are tolerant to noisy evaluations of the rating function and also facilitate optimizing multiple rating functions simultaneously [24], [25].
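To make the preceding idea concrete, here is a small illustrative sketch (ours, not taken from any of the cited packages) that computes the two most commonly used impurity-based ratings of a candidate split, the Gini index and the entropy, purely from the class labels falling on either side of a hyperplane; the function names are hypothetical.

```python
import math
from collections import Counter

def gini(counts):
    """Gini impurity of a node from its per-class counts."""
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy (in bits) of a node from its per-class counts."""
    n = sum(counts)
    return 0.0 if n == 0 else -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def split_impurity(left_labels, right_labels, measure=gini):
    """Weighted impurity of a split; impurity-based methods prefer lower values."""
    left = list(Counter(left_labels).values())
    right = list(Counter(right_labels).values())
    n = sum(left) + sum(right)
    return (sum(left) / n) * measure(left) + (sum(right) / n) * measure(right)

# A split that sends mostly one class each way scores low (good) ...
print(split_impurity([1, 1, 1, -1], [-1, -1, -1, 1]))
# ... while a split that leaves both children mixed scores high (bad).
print(split_impurity([1, -1, 1, -1], [1, -1, 1, -1]))
```

Note that such a rating sees only the label counts on the two sides of the hyperplane; this is precisely the limitation taken up next.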

A problem with all impurity measures is that they depend only on the number of (training) patterns of different classes on either side of the hyperplane. Thus, if we change the class regions without changing the effective areas of class regions on either side of a hyperplane, the impurity measure of the hyperplane will not change. Thus, the impurity measures do not really capture the geometric structure of class regions. In [26], a different approach is suggested, where the function for rating hyperplanes gives high values to hyperplanes which promote the "degree of linear separability" of the set of patterns landing at the child nodes. It has been found experimentally that the decision trees learned using this criterion are more compact than those using impurity measures. In [26], a simple heuristic is used to define what is meant by "degree of linear separability." This function also does not try to capture the geometry of pattern classes. Again, the cost function is not differentiable with respect to the parameters of the hyperplanes, and the method uses a stochastic search technique called Alopex [27] to find the optimal hyperplane at each node.

In this paper, we present a new decision tree learning algorithm, which is based on the idea of capturing, to some extent, the geometric structure of the underlying class regions. For this, we borrow ideas from some recent variants of the support vector machine (SVM) method, which are quite good at capturing the (linear) geometric structure of the data. For a two-class classification problem, the multisurface proximal SVM (GEPSVM) algorithm [28] finds two clustering hyperplanes, i.e., one for each class. Each hyperplane is close to patterns of one class while being far from patterns of the other class. Then, new patterns are classified based on the nearness to the hyperplanes. In problems where one pair of hyperplanes like this does not give sufficient accuracy, Mangasarian and Wild [28] suggest the idea of using the kernel trick of (effectively) learning the pair of hyperplanes in a high-dimensional space to which the patterns are transformed.

Motivated by GEPSVM, we derive our decision tree approach as follows: At each node of the tree, we find the clustering hyperplanes as in GEPSVM. After finding these hyperplanes, we choose the split rule at this node as the angle bisectors of the two hyperplanes. Then, we split the data based on the angle bisector and recursively learn the left and right subtrees of this node. Since, in general, there will be two angle bisectors, we select the one that is better based on an impurity measure. Thus, the algorithm combines the ideas of linear tendencies in data and purity of nodes to find better decision trees. We also present some analysis to bring out some interesting properties of our angle bisectors that can explain why this may be a good technique to learn decision trees.

The rest of this paper is organized as follows: We describe our algorithm in Section II. Section III presents some analysis that brings out some properties of the angle bisectors of the clustering hyperplanes. Based on these results, we argue that our angle bisectors are a good choice for the split rule at a node while learning the decision tree. Experimental results are given in Section IV. We conclude this paper in Section V. (A preliminary version of this work has been presented in [29], where some experimental results for the two-class case are presented without any analysis.)


Fig. 1. Example to illustrate the proposed decision tree algorithm. (a) Hyperplane learned at the root node using an algorithm (OC1) that relies on the impurity measure of Gini index. (b, solid line) Angle bisectors of (dashed line) the clustering hyperplanes at the root node on this problem, which were obtained using our method.

II. GEOMETRIC DECISION TREE

The performance of any top-down decision tree algorithm depends on the measure used to rate different hyperplanes at each node. The issue of having a suitable algorithm to find the hyperplane that optimizes the chosen rating function is also important. For example, for all impurity measures, the optimization is difficult because finding the gradient of the impurity function with respect to the parameters of the hyperplane is not possible. Motivated by these considerations, here, we propose a new criterion function to assess the suitability of a hyperplane at a node that can capture the geometric structure of the class regions. For our criterion function, the optimization problem can also be solved more easily.

We first explain our method by considering a two-class problem. Given the set of training patterns at a node, we first find two hyperplanes, i.e., one for each class. Each hyperplane is such that it is closest to all patterns of one class and is farthest from all patterns of the other class. We call these hyperplanes the clustering hyperplanes (for the two classes). Because of the way they are defined, these clustering hyperplanes capture the dominant linear tendencies in the examples of each class that are useful for discriminating between the classes. Hence, a hyperplane that passes in between them could be good for splitting the feature space. Thus, we take the hyperplane that bisects the angle between the clustering hyperplanes as the split rule at this node. Since, in general, there would be two angle bisectors, we choose the bisector that is better, based on an impurity measure, i.e., the Gini index. If the two clustering hyperplanes happen to be parallel to each other, then we take a hyperplane midway between the two as the split rule.

Before presenting the full algorithm, we illustrate it through an example. Consider the 2-D classification problem shown in Fig. 1, where the two classes are not linearly separable. The hyperplane learned at the root node using OC1, which is an oblique decision tree algorithm that uses the impurity measure of the Gini index, is shown in Fig. 1(a). As can be seen, although this hyperplane promotes the (average) purity of the child nodes, it does not really simplify the classification problem; it does not capture the symmetric distribution of class regions in this problem. Fig. 1(b) shows the two clustering hyperplanes for the two classes and the two angle bisectors, obtained through our algorithm, at the root node on this problem. As can be seen, choosing any of the angle bisectors as the hyperplane at the root node to split the data results in linearly separable classification problems at both child nodes. Thus, we see here that our idea of using angle bisectors of two clustering hyperplanes actually captures the right geometry of the classification problem. This is the reason we call our approach "geometric decision tree (GDT)." We also note here that neither of our angle bisectors scores high on any impurity-based measure; if we use either of these hyperplanes as the split rule at the root, both child nodes would contain roughly equal numbers of patterns of each class.

This example is only for explaining the motivation behind our approach. Not all classification problems have such a nice symmetric structure in class regions. However, in most problems, our approach seems to be able to capture the geometric structure well, as seen from the results in Section IV.

A. GDT Algorithm: Two-Class Case

Let S = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {-1, +1}, i = 1, ..., n} be the training data set. Let C+ be the set of points for which y_i = +1. In addition, let C- be the set of points for which y_i = -1. For an oblique decision tree learning algorithm, the main computational task is given as follows: Given a set of data points at a node, find the best hyperplane to split the data. Let St be the set of points at node t. Let n_t+ and n_t- denote the number of patterns of the two classes at that node. Let A ∈ R^{n_t+ × d} be the matrix containing the points of class C+ at node t as rows, and let B ∈ R^{n_t- × d} be the matrix whose rows contain the points of class C- at node t. (We use C+ and C- to denote both the sets of examples of the two classes and the labels of the two classes; the meaning will be clear from context.) Let h1(w1, b1) : w1^T x + b1 = 0 and h2(w2, b2) : w2^T x + b2 = 0 be the two clustering hyperplanes. Hyperplane h1 is to be closest to all points of class C+ and farthest from points of class C-.


Similarly, hyperplane h2 is to be closest to all points of class C- and farthest from points of class C+. To find the clustering hyperplanes, we use the idea as in GEPSVM [28]. The nearness of a set of points to a hyperplane is represented by the average of squared distances. The average of squared distances of points of class C+ from a hyperplane w^T x + b = 0 is

D_+(w, b) = \frac{1}{n_{t+} \|w\|^2} \sum_{x_i \in C_+} (w^T x_i + b)^2

where ||·|| denotes the standard Euclidean norm. Let w̃ = [w^T b]^T ∈ R^{d+1} and x̃ = [x^T 1]^T ∈ R^{d+1}. Then, w^T x_i + b = w̃^T x̃_i. Note that, by the definition of matrix A, we have Σ_{x_i ∈ C+} (w^T x_i + b)^2 = ||Aw + b e_{n_t+}||^2, where e_{n_t+} is an n_t+-dimensional column vector of ones (unless stated otherwise, all vectors are assumed to be column vectors). Now, D_+(w, b) can be further simplified as

D_+(w, b) = \frac{1}{n_{t+} \|w\|^2} \sum_{x_i \in C_+} (\tilde{w}^T \tilde{x}_i)^2 = \frac{1}{\|w\|^2} \, \tilde{w}^T \Big( \frac{1}{n_{t+}} \sum_{x_i \in C_+} \tilde{x}_i \tilde{x}_i^T \Big) \tilde{w} = \frac{1}{\|w\|^2} \, \tilde{w}^T G \tilde{w}

where G = (1/n_{t+}) Σ_{x_i ∈ C+} x̃_i x̃_i^T = (1/n_{t+}) [A e_{n_t+}]^T [A e_{n_t+}].

Similarly, the average of the squared distances of points of class C- from the hyperplane will be D_-(w, b) = (1/||w||^2) w̃^T H w̃, where H = (1/n_{t-}) Σ_{x_i ∈ C-} x̃_i x̃_i^T = (1/n_{t-}) [B e_{n_t-}]^T [B e_{n_t-}] and e_{n_t-} is the n_t--dimensional vector of ones. To find each clustering hyperplane, we need to find a hyperplane such that one of D_+ or D_- is maximized while minimizing the other. Hence, the two clustering hyperplanes, which are specified by w̃1 = [w1^T b1]^T and w̃2 = [w2^T b2]^T, can be formalized as the solutions of optimization problems as follows:

\tilde{w}_1 = \arg\min_{\tilde{w} \ne 0} \frac{D_+(w, b)}{D_-(w, b)} = \arg\min_{\tilde{w} \ne 0} \frac{\tilde{w}^T G \tilde{w}}{\tilde{w}^T H \tilde{w}} = \arg\max_{\tilde{w} \ne 0} \frac{\tilde{w}^T H \tilde{w}}{\tilde{w}^T G \tilde{w}}    (1)

\tilde{w}_2 = \arg\min_{\tilde{w} \ne 0} \frac{D_-(w, b)}{D_+(w, b)} = \arg\min_{\tilde{w} \ne 0} \frac{\tilde{w}^T H \tilde{w}}{\tilde{w}^T G \tilde{w}}.    (2)

It can easily be verified that G = (1/n_{t+}) Σ_{x_i ∈ C+} x̃_i x̃_i^T is a (d + 1) × (d + 1) symmetric positive semidefinite matrix. G is strictly positive definite when matrix A has full column rank. Similarly, matrix H is also a positive semidefinite matrix, and it is strictly positive definite when matrix B has full column rank. The problems given by (1) and (2) are standard optimization problems, and their solutions essentially involve solving the following generalized eigenvalue problem [30]:

H \tilde{w} = \lambda G \tilde{w}.    (3)

It can be shown [30] that any w̃ that is a local solution of the optimization problems given by (1) and (2) will satisfy (3), and the value of the corresponding objective function is given by the eigenvalue. Thus, the problem of finding the parameters (w1, b1) and (w2, b2) of the two clustering hyperplanes gets reduced to finding the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem described by (3). It is easy to see that, if w̃1 is a solution of problem (3), kw̃1 also happens to be a solution for any k. Here, for our purpose, we choose k = 1/||w1||. That is, the clustering hyperplanes [obtained as the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem (3)] are w̃1 = [w1^T b1]^T and w̃2 = [w2^T b2]^T, and they are scaled such that ||w1|| = ||w2|| = 1.

We solve this generalized eigenvalue problem using the standard LU-decomposition-based method [30] in the following way: Let matrix G have full rank. Let G = F F^T, which can be done using LU factorization. Now, from (3), we get the following:

H \tilde{w} = \lambda F F^T \tilde{w} \;\;\Rightarrow\;\; F^{-1} H F^{-T} y = \lambda y

where y = F^T w̃, which means that y is an eigenvector of F^{-1} H F^{-T}. Since F^{-1} H F^{-T} is symmetric, we can find orthonormal eigenvectors of F^{-1} H F^{-T}. If w̃ is an eigenvector corresponding to the generalized eigenvalue problem H w̃ = λ G w̃, then F^T w̃ will be the eigenvector of F^{-1} H F^{-T}.

Once we find the clustering hyperplanes, the hyperplane we associate with the current node will be one of the angle bisectors of these two hyperplanes. Let w3^T x + b3 = 0 and w4^T x + b4 = 0 be the angle bisectors of w1^T x + b1 = 0 and w2^T x + b2 = 0. Assuming w1 ≠ w2, it is easily shown that (note that w1 and w2 are such that ||w1|| = ||w2|| = 1)

\tilde{w}_3 = \tilde{w}_1 + \tilde{w}_2, \qquad \tilde{w}_4 = \tilde{w}_1 - \tilde{w}_2.    (4)

As we mentioned earlier, we choose the angle bisector that has a lower value of the Gini index. Let w̃_t be a hyperplane that is used for dividing the set of patterns St into two parts Stl and Str. Let n_tl^+ and n_tl^- denote the number of patterns of the two classes in set Stl, and let n_tr^+ and n_tr^- denote the number of patterns of the two classes in set Str. Then, the Gini index of hyperplane w̃_t is given by [1]

\mathrm{Gini}(\tilde{w}_t) = \frac{n_{tl}}{n_t}\left[1 - \left(\frac{n_{tl}^{+}}{n_{tl}}\right)^2 - \left(\frac{n_{tl}^{-}}{n_{tl}}\right)^2\right] + \frac{n_{tr}}{n_t}\left[1 - \left(\frac{n_{tr}^{+}}{n_{tr}}\right)^2 - \left(\frac{n_{tr}^{-}}{n_{tr}}\right)^2\right]    (5)

where n_t = n_t+ + n_t- is the number of points in St. In addition, n_tl = n_tl^+ + n_tl^- is the number of points falling in set Stl, and n_tr = n_tr^+ + n_tr^- is the number of points falling in set Str. We choose w̃3 or w̃4 to be the split rule for St based on which of the two gives the lesser value of the Gini index.

When the clustering hyperplanes are parallel (that is, when w1 = w2), we choose the hyperplane given by w̃ = (w, b) = (w1, (b1 + b2)/2) as the splitting hyperplane.
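The node-level computation just described can be written in a few lines of NumPy/SciPy. The following is a minimal sketch (ours, not the authors' implementation), assuming binary labels in {-1, +1}; scipy.linalg.eigh solves the generalized symmetric-definite eigenvalue problem (3) directly, playing the role of the LU-decomposition-based reduction above, and a small ridge term stands in for the rank-deficient case treated in Section II-B.

```python
import numpy as np
from scipy.linalg import eigh

def gini_of_split(y_left, y_right):
    """Gini index of a split as in (5): weighted Gini impurity of the two children."""
    def node_gini(y):
        if len(y) == 0:
            return 0.0
        p = np.mean(y == 1)
        return 1.0 - p ** 2 - (1.0 - p) ** 2
    n = len(y_left) + len(y_right)
    return (len(y_left) / n) * node_gini(y_left) + (len(y_right) / n) * node_gini(y_right)

def gdt_split(X, y):
    """Return the augmented split vector [w, b] chosen at one node (two-class case)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Xt = np.hstack([X, np.ones((len(X), 1))])          # augmented patterns [x, 1]
    A, B = Xt[y == 1], Xt[y == -1]
    G = A.T @ A / len(A)                               # (1/n_t+) [A e]^T [A e]
    H = B.T @ B / len(B)                               # (1/n_t-) [B e]^T [B e]

    # Generalized eigenvalue problem (3): H w = lambda G w (eigenvalues ascending).
    _, V = eigh(H, G + 1e-8 * np.eye(G.shape[0]))
    w1, w2 = V[:, -1], V[:, 0]                         # clustering hyperplanes (max/min eigenvalue)
    w1 = w1 / np.linalg.norm(w1[:-1])                  # scale so that ||w1|| = ||w2|| = 1
    w2 = w2 / np.linalg.norm(w2[:-1])

    # Angle bisectors (4); keep the one with the lower Gini index (5).
    # (The parallel-hyperplane special case is omitted in this sketch.)
    best, best_gini = None, np.inf
    for w in (w1 + w2, w1 - w2):
        left = Xt @ w < 0
        g = gini_of_split(y[left], y[~left])
        if g < best_gini:
            best, best_gini = w, g
    return best
```

A pattern x is then sent to the left child when the chosen vector w satisfies w^T [x, 1] < 0 and to the right child otherwise.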


As is easy to see, in our method, the optimization problem of finding the best hyperplane at each node is solved exactly rather than by relying on a search technique. The clustering hyperplanes are obtained by solving the eigenvalue problem. After that, to find the hyperplane at the node, we need to compare only two hyperplanes based on the Gini index.

The complete algorithm for learning the decision tree is given as follows: At any given node, given the set of patterns St, we find the two clustering hyperplanes (by solving the generalized eigenvalue problem) and choose one of the two angle bisectors, based on the Gini index, as the hyperplane to be associated with this node. We then use this hyperplane to split St into two sets, i.e., those that go into the left and right child nodes of this node. We then recursively do the same at the two child nodes. The recursion stops when the set of patterns at a node is such that the fraction of patterns belonging to the minority class of this set is below a user-specified threshold or the depth of the tree reaches a prespecified maximum limit.

B. Handling the Small-Sample-Size Problem

In our method, we solve the generalized eigenvalue problem using the standard LU-decomposition-based technique. In the optimization problem (1), the LU-decomposition-based method is applicable only when matrix G has full rank (which happens when matrix A has full column rank). In general, if there are a large number of examples, then (under the usual assumption that no feature is a linear combination of others) we would have full column rank for A. (This is the case, for example, in the proximal SVM method [28], which also finds the clustering hyperplanes like this.) However, in our decision tree algorithm, as we go down the tree, the number of points falling at nonleaf nodes keeps decreasing. Hence, there might be cases where matrix G becomes rank deficient.

We describe a method of handling this problem of small sample size by adopting the technique presented in [31]. Suppose that matrix G has rank r < d + 1. Let N be the null space of G. Let Q = [q_1 ... q_{d+1-r}] be the matrix whose columns are an orthonormal basis for N. According to the method given in [31], we first project all the points in class C- to the null space of G. Every vector x̃ belonging to class C- after projection becomes QQ^T x̃. Let the matrix corresponding to H after projection be H̃. Then,

\tilde{H} = \frac{1}{n_{t-}} \sum_{x_i \in C_-} Q Q^T \tilde{x}_i \tilde{x}_i^T Q Q^T = Q Q^T H Q Q^T.

Note that G̃ = QQ^T G QQ^T would be zero because the columns of Q span the null space of G. Now, the eigenvector corresponding to the largest eigenvalue of H̃ is selected as the desired vector (clustering hyperplane).

We now explain how this approach works. The whole analysis is based on the following result.

Theorem 1 [31]: Suppose that R is a set in the d-dimensional space and, for all x ∈ R, f(x) ≥ 0, g(x) ≥ 0, and f(x) + g(x) > 0. Let h1(x) = f(x)/g(x) and h2(x) = f(x)/(f(x) + g(x)). Then, h1(x) has a maximum (including positive infinity) at point x0 in R if h2(x) has a maximum at point x0.

Using Theorem 1, it is clear that a w̃ which maximizes the ratio (w̃^T H w̃)/(w̃^T G w̃ + w̃^T H w̃) will also maximize (w̃^T H w̃)/(w̃^T G w̃). It is obvious that (w̃^T H w̃)/(w̃^T G w̃ + w̃^T H w̃) = 1 if and only if w̃^T H w̃ ≠ 0 and w̃^T G w̃ = 0. This implies that, for each w̃ which satisfies w̃^T H w̃ ≠ 0, we want to maximize (w̃^T H w̃)/(w̃^T G w̃ + w̃^T H w̃). However, simply maximizing the modified ratio (w̃^T H w̃)/(w̃^T (G + H) w̃) is not sufficient because w̃^T H w̃ may not reach its maximum value [31]. Selecting w̃ in the null space N enforces w̃^T G w̃ = 0.
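A sketch of the null-space projection described above, assuming G and H have already been formed at the node (our code, with NumPy's SVD used to obtain an orthonormal basis of the null space of G):

```python
import numpy as np

def clustering_direction_small_sample(G, H, tol=1e-10):
    """Clustering hyperplane for class C+ when G is rank deficient (cf. Section II-B)."""
    # Orthonormal basis Q of the null space of G, taken from the SVD of G.
    U, s, _ = np.linalg.svd(G)
    Q = U[:, s <= tol * s.max()]
    if Q.shape[1] == 0:
        raise ValueError("G has full rank; solve the generalized eigenvalue problem (3) instead.")

    # Project H onto the null space of G: H_tilde = Q Q^T H Q Q^T (and Q Q^T G Q Q^T = 0).
    P = Q @ Q.T
    H_tilde = P @ H @ P

    # The principal eigenvector of H_tilde lies in the null space of G, so it makes
    # w^T G w = 0 while maximizing w^T H w, as required by the argument above.
    _, V = np.linalg.eigh(H_tilde)
    return V[:, -1]
```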


To summarize, when matrix G becomes rank deficient, we find its null space and project all the feature vectors of class C- onto this null space. The clustering hyperplane for class C+ is chosen as the principal eigenvector of the matrix H̃ described earlier. The small-sample-size problem can occur only when matrix G becomes rank deficient. The rank deficiency of matrix H does not affect the solution of the optimization problem given by (1).

C. GDT for Multiclass Classification

The algorithm presented in the previous section can be easily generalized to handle the case when we have more than two classes. Let S = {(x_i, y_i) : x_i ∈ R^d, y_i ∈ {1, ..., K}, i = 1 ... n} be the training data set, where K is the number of classes. At a node t of the tree, we divide the set of points St at that node into two subsets, i.e., St+ and St-. St+ contains points of the majority class in St, whereas St- contains the rest of the points. We learn the tree as in the binary case discussed earlier. The only difference here is that we use the fraction of the points of the majority class to decide whether a given node is a leaf node or not. A complete description of the decision tree method for multiclass classification is given in Algorithm 1. Algorithm 1 recursively calls the procedure GrowTreeMulticlass(St), which will learn a split rule for node t and return a subtree at that node.

Algorithm 1: Multiclass GDT

Input: S = {(x_i, y_i)}_{i=1}^n, Max-Depth, ε1
Output: Pointer to the root of a decision tree
begin
    Root = GrowTreeMulticlass(S);
    return Root;
end

GrowTreeMulticlass(St)
Input: Set of patterns at node t (St)
Output: Pointer to a subtree
begin
    Divide set St into two parts, i.e., St+ and St-; St+ contains points of the majority class, and St- contains points of the remaining classes;
    Find matrix A corresponding to the points of St+;
    Find matrix B corresponding to the points of St-;
    Find w̃1 and w̃2, which are the solutions of optimization problems (1) and (2);
    Find angle bisectors w̃3 and w̃4 using (4);
    Choose the angle bisector having the lesser Gini index [cf. (5)] value. Call it w̃;
    Let w̃_t denote the split rule at node t. Assign w̃_t ← w̃;
    Let Stl = {x_i ∈ St : w̃_t^T x̃_i < 0} and Str = {x_i ∈ St : w̃_t^T x̃_i ≥ 0};
    Define p(St) = max(n_t1, ..., n_tK)/n_t;
    if (Tree-Depth = Max-Depth) then
        Get a node tl, and make tl a leaf node;
        Assign the class label associated to the majority class to tl;
        Make tl the left child of t;
    else if (p(Stl) > 1 - ε1) then
        Get a node tl, and make tl a leaf node;
        Assign the class label associated to the majority class in set Stl to tl;
        Make tl the left child of t;
    else
        tl = GrowTreeMulticlass(Stl);
        Make tl the left child of t;
    end
    if (Tree-Depth = Max-Depth) then
        Get a node tr, and make tr a leaf node;
        Assign the class label associated to the majority class to tr;
        Make tr the right child of t;
    else if (p(Str) > 1 - ε1) then
        Get a node tr, and make tr a leaf node;
        Assign the class label associated to the majority class in the set Str to tr;
        Make tr the right child of t;
    else
        tr = GrowTreeMulticlass(Str);
        Make tr the right child of t;
    end
    return t;
end

III. ANALYSIS

In this section, we present some analysis of our algorithm. We consider only the binary classification problem. We prove some interesting properties of the angle bisector hyperplanes to indicate why angle bisectors may be a good choice (in a decision tree) for the split rule at a node.

Let S be a set of n patterns (feature vectors) of which n+ are of class C+ and n- are of class C-. Recall that, as per our notation, A is a matrix whose rows are feature vectors of class C+ and B is a matrix whose rows are feature vectors of class C-. Let μ+, μ- ∈ R^d be the sample means, and let Σ+, Σ- be the sample covariance matrices. Then, we have

\Sigma_+ = \frac{1}{n_+} \left(A - e_{n_+} \mu_+^T\right)^T \left(A - e_{n_+} \mu_+^T\right)    (6)

where e_{n+} is an n+-dimensional vector having all elements one. Similarly, we will have

\Sigma_- = \frac{1}{n_-} \left(B - e_{n_-} \mu_-^T\right)^T \left(B - e_{n_-} \mu_-^T\right).    (7)

Case 1 (Σ+ = Σ- = Σ): We have the following result.

Theorem 2: Let S be a set of feature vectors with equal sample covariance matrices of the two classes. Then, the angle bisector of the two clustering hyperplanes will have the same orientation as the Fisher linear discriminant hyperplane.

Proof: Given any arbitrary w ∈ R^d and b ∈ R, we can show through simple algebra that (the complete calculations in the proof are given in [32])

\frac{1}{n_+} \|A w + b e_{n_+}\|^2 = \sigma^2 + \eta_+^2, \qquad \frac{1}{n_-} \|B w + b e_{n_-}\|^2 = \sigma^2 + \eta_-^2

where σ^2 = w^T Σ w, η+ = w^T μ+ + b, and η- = w^T μ- + b. Let f1(w1, b1) and f2(w2, b2) be the objective functions of optimization problems (1) and (2), respectively. Then, we have

f_1(w_1, b_1) = \frac{\sigma_1^2 + \eta_{1+}^2}{\sigma_1^2 + \eta_{1-}^2}, \qquad f_2(w_2, b_2) = \frac{\sigma_2^2 + \eta_{2-}^2}{\sigma_2^2 + \eta_{2+}^2}

where σ_j^2 = w_j^T Σ w_j, η_{j+} = w_j^T μ+ + b_j, and η_{j-} = w_j^T μ- + b_j, j = 1, 2. Now, by taking the derivatives of f_i(w_i, b_i) with respect to (w_i, b_i), i = 1, 2, and equating them to zero, we get a set of equations [32] that gives us

w_1 = \alpha_1 \Sigma^{-1} (\mu_+ - \mu_-), \qquad w_2 = \alpha_2 \Sigma^{-1} (\mu_+ - \mu_-)

where α1 and α2 are some scalars. This means that both clustering hyperplanes are parallel to each other and (w3, b3) is such that w3 ∝ Σ^{-1}(μ+ - μ-). This is the same direction as that of the Fisher linear discriminant, thus proving the theorem.
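The claim of Theorem 2 is easy to check numerically. The short script below (our construction, with an assumed synthetic setup) samples two Gaussian classes with a shared covariance matrix, computes the clustering hyperplanes from the generalized eigenvalue problem (3), and compares the angle-bisector direction with the Fisher direction Σ^{-1}(μ+ - μ-); the reported cosine approaches 1 as the sample size grows, since the two sample covariance matrices are then nearly equal.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
d, n = 5, 50000
Sigma = np.diag([3.0, 2.0, 1.5, 1.0, 0.5])                # shared covariance
mu_p = np.array([1.0, 0.0, 0.0, 0.0, 0.0])                # mean of class C+
mu_m = np.array([-1.0, 0.5, 0.0, 0.0, 0.0])               # mean of class C-
Xp = rng.multivariate_normal(mu_p, Sigma, n)
Xm = rng.multivariate_normal(mu_m, Sigma, n)

aug = lambda X: np.hstack([X, np.ones((len(X), 1))])
G = aug(Xp).T @ aug(Xp) / n                               # class C+ matrix G
H = aug(Xm).T @ aug(Xm) / n                               # class C- matrix H

_, V = eigh(H, G)                                         # H w = lambda G w
w1, w2 = V[:, -1], V[:, 0]
w1 = w1 / np.linalg.norm(w1[:-1])                         # scale spatial parts to unit norm
w2 = w2 / np.linalg.norm(w2[:-1])
if w1[:-1] @ w2[:-1] < 0:                                 # remove the eigenvector sign ambiguity
    w2 = -w2
bisector = (w1 + w2)[:-1]                                 # spatial part of the bisector w1 + w2

Sw = 0.5 * (np.cov(Xp.T) + np.cov(Xm.T))                  # pooled sample covariance
fisher = np.linalg.solve(Sw, Xp.mean(axis=0) - Xm.mean(axis=0))

cosine = abs(bisector @ fisher) / (np.linalg.norm(bisector) * np.linalg.norm(fisher))
print(f"|cos(bisector, Fisher direction)| = {cosine:.4f}")   # close to 1
```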

Case 2 (μ+ = μ- = μ): Next, we discuss the case of data distributions where both classes have the same mean.

Theorem 3: If the sample means of the two classes are the same, then the clustering hyperplanes found by solving optimization problems (1) and (2) will pass through the common mean.

Proof: Optimization problem (1), which finds the clustering hyperplane for C+, is

\max_{(w,b) \ne 0} \; \frac{\frac{1}{n_-} \sum_{x_i \in C_-} (x_i^T w + b)^2}{\frac{1}{n_+} \sum_{x_i \in C_+} (x_i^T w + b)^2} \;=\; \max_{\tilde{w} \ne 0} \; \frac{\tilde{w}^T H \tilde{w}}{\tilde{w}^T G \tilde{w}}.    (8)

This problem can be equivalently written as a constrained optimization problem in the following way:

\max_{(w,b) \ne 0} \; \frac{1}{n_-} \sum_{x_i \in C_-} (x_i^T w + b)^2 \quad \text{s.t.} \quad \frac{1}{n_+} \sum_{x_i \in C_+} (x_i^T w + b)^2 = 1.

Equating the derivative of the Lagrangian of the preceding problem with respect to b to zero, we get (with λ as the Lagrange multiplier)

\frac{2}{n_-} \sum_{x_i \in C_-} (x_i^T w + b) \;-\; \lambda \, \frac{2}{n_+} \sum_{x_i \in C_+} (x_i^T w + b) = 0 \;\;\Rightarrow\;\; b = -w^T \mu.

This means that the clustering hyperplane for class C+ passes through the common mean. Similarly, we can show that the clustering hyperplane for class C- also passes through the common mean.

When μ+ = μ- = μ, Theorem 3 says that b = -w^T μ. Now, putting this value of b in (8), we get the optimization problem for finding w as

\max_{w \ne 0} \; \frac{\frac{1}{n_-} \sum_{x_i \in C_-} (x_i^T w - \mu^T w)^2}{\frac{1}{n_+} \sum_{x_i \in C_+} (x_i^T w - \mu^T w)^2} \;=\; \max_{w \ne 0} \; \frac{w^T \Sigma_- w}{w^T \Sigma_+ w}.

Hence, w1 and w2 will be the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem Σ- w = λ Σ+ w, respectively. Since an eigenvector can be determined only up to a scale factor, under our notation, we take ||w1|| = ||w2|| = 1. Since the ratio w^T Σ- w / w^T Σ+ w is invariant to the scaling of vector w, we can maximize or minimize the ratio by constraining the denominator to have any constant value. Thus, we can write w1 and w2 as

w_1 = \arg\max_{w} \; w^T \Sigma_- w \quad \text{s.t.} \quad w^T \Sigma_+ w = \beta    (9)

w_2 = \arg\min_{w} \; w^T \Sigma_- w \quad \text{s.t.} \quad w^T \Sigma_+ w = \beta    (10)

where the value of β can be chosen so that it is consistent with our scaling of w1 and w2. Now, the parameters of the two angle bisectors can be written as (w3, b3) = (w1 + w2, b1 + b2) and (w4, b4) = (w1 - w2, b1 - b2). We show that the pair of vectors w3 and w4 is the solution to the following optimization problem:

\max_{w_a, w_b} \; w_a^T \Sigma_- w_b \quad \text{s.t.} \quad w_a^T \Sigma_+ w_a = 2\beta = w_b^T \Sigma_+ w_b, \quad w_a^T \Sigma_+ w_b = 0.    (11)

Consider the possible solution to the optimization problem (11) given by wa = w1 + w2 and wb = w1 - w2. We know that w1 and w2 are feasible solutions to problems (9) and (10), respectively. In addition, because w1 and w2 are eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem Σ- w = λ Σ+ w, they satisfy w1^T Σ+ w2 = 0. Thus, we see that the pair of vectors wa = w1 + w2 and wb = w1 - w2 satisfies all the constraints of the optimization problem (11) and hence is a feasible solution. We can rewrite the objective function of problem (11) as

w_a^T \Sigma_- w_b = \frac{1}{4} \left[ (w_a + w_b)^T \Sigma_- (w_a + w_b) - (w_a - w_b)^T \Sigma_- (w_a - w_b) \right].

The difference above will be maximum when the first term is maximized and the second term is minimized. For any real symmetric matrices Σ- and Σ+, if we want to maximize w^T Σ- w subject to the constraint w^T Σ+ w = constant, the solution is the eigenvector corresponding to the maximum eigenvalue of the generalized eigenvalue problem Σ- w = λ Σ+ w. Similarly, to minimize w^T Σ- w subject to the same constraint, the solution is the eigenvector corresponding to the minimum eigenvalue. Hence, the optimal solution to (11) is obtained when (wa + wb) is the eigenvector corresponding to the maximum eigenvalue and (wa - wb) is the eigenvector corresponding to the minimum eigenvalue of the generalized eigenvalue problem Σ- w = λ Σ+ w. Thus, wa = w1 + w2 and wb = w1 - w2 constitute the solution to optimization problem (11).

Now, we try to interpret this optimization problem to argue that this is a good optimization problem to solve when we want to find the best hyperplane to split the data at a node while learning a decision tree. Let X and Y be random variables denoting the feature vectors from classes C+ and C-, respectively. Define new random variables Xa, Xb, Ya, and Yb as Xa = wa^T X, Xb = wb^T X, Ya = wa^T Y, and Yb = wb^T Y. Now, let us assume that we have enough samples from both classes, so that we can assume that the empirical averages are close to the expectations. We can rewrite the objective function in the optimization problem given by (11) as

w_a^T \Sigma_- w_b = \frac{1}{n_-} \sum_{x \in C_-} w_a^T (x - \mu)(x - \mu)^T w_b \approx E\left[(Y_a - E[Y_a])(Y_b - E[Y_b])\right] = \operatorname{cov}(Y_a, Y_b).

Similarly, we can rewrite the constraints of that problem as

w_a^T \Sigma_+ w_a \approx E\left[(X_a - E[X_a])^2\right] = \operatorname{var}(X_a)

w_b^T \Sigma_+ w_b \approx E\left[(X_b - E[X_b])^2\right] = \operatorname{var}(X_b)

w_a^T \Sigma_+ w_b \approx E\left[(X_a - E[X_a])(X_b - E[X_b])\right] = \operatorname{cov}(X_a, X_b).

Hence, the angle bisectors, which are the solution of (11), would be the solution of the optimization problem given as

\max_{w_a, w_b} \; \operatorname{cov}(Y_a, Y_b) \quad \text{s.t.} \quad \operatorname{var}(X_a) = \operatorname{var}(X_b) = 2\beta, \quad \operatorname{cov}(X_a, X_b) = 0.    (12)

This optimization problem seeks to find wa and wb (which would be our angle bisectors) such that the covariance between Ya and Yb is maximized while keeping Xa and Xb uncorrelated. (The constraints on the variances are needed only to ensure that the optimization problem has a bounded solution.)


Ya and Yb represent random variables that are projections of a class C- feature vector onto wa and wb, respectively, and Xa and Xb are projections of the class C+ feature vector on wa and wb. Thus, we are looking for two directions such that one class pattern becomes uncorrelated when projected onto these two directions, whereas the correlation between the projections of the other class feature vector becomes maximum. Thus, our angle bisectors give us directions that are good for discriminating between the two classes; hence, we feel that our choice of the angle bisectors as the split rule is a sound choice while learning a decision tree.
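The interpretation above can also be verified numerically. In the sketch below (ours, with an assumed synthetic setup), both classes share a zero mean; the generalized eigenvectors of (Σ-, Σ+) are Σ+-orthogonal and, with SciPy's normalization w^T Σ+ w = 1, the bisector directions wa = w1 + w2 and wb = w1 - w2 give exactly uncorrelated class C+ projections, while the covariance of the class C- projections equals the spread between the extreme generalized eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(1)
d, n = 4, 10000
mu = np.zeros(d)                                   # common mean for both classes
Cp = np.diag([1.0, 4.0, 0.5, 2.0])                 # class C+ covariance
Cm = np.array([[3.0, 1.0, 0.0, 0.0],
               [1.0, 2.0, 0.5, 0.0],
               [0.0, 0.5, 1.0, 0.2],
               [0.0, 0.0, 0.2, 0.8]])              # class C- covariance
Xp = rng.multivariate_normal(mu, Cp, n)
Xm = rng.multivariate_normal(mu, Cm, n)

Sp, Sm = np.cov(Xp.T), np.cov(Xm.T)                # sample covariance matrices
evals, V = eigh(Sm, Sp)                            # Sigma_- w = lambda Sigma_+ w
w1, w2 = V[:, -1], V[:, 0]                         # max / min generalized eigenvectors
wa, wb = w1 + w2, w1 - w2                          # angle-bisector directions

Xa, Xb = Xp @ wa, Xp @ wb                          # class C+ projections
Ya, Yb = Xm @ wa, Xm @ wb                          # class C- projections

print("cov(Xa, Xb) =", np.cov(Xa, Xb)[0, 1])       # ~0: C+ projections uncorrelated
print("cov(Ya, Yb) =", np.cov(Ya, Yb)[0, 1])       # large: equals lambda_max - lambda_min
print("lambda_max - lambda_min =", evals[-1] - evals[0])
```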

Case 3 (Σ+ ≠ Σ- and μ+ ≠ μ-): We next consider the general case of different covariance matrices and different means of the two classes. Recall that the parameters of the two clustering hyperplanes are w̃1 and w̃2, which are the eigenvectors corresponding to the maximum and minimum eigenvalues of the generalized eigenvalue problem H w̃ = λ G w̃. Then, using similar arguments as in Case 2, one can show that the angle bisectors are the solution of the following optimization problem:

\max_{\tilde{w}_a, \tilde{w}_b} \; \tilde{w}_a^T H \tilde{w}_b \quad \text{s.t.} \quad \tilde{w}_a^T G \tilde{w}_a = 2\beta = \tilde{w}_b^T G \tilde{w}_b, \quad \tilde{w}_a^T G \tilde{w}_b = 0.    (13)

Again, consider X as a random feature vector coming from class C+ and Y as a random feature vector coming from class C-. We define new random variables Xa, Xb, Ya, and Yb as Xa = wa^T X + ba, Xb = wb^T X + bb, Ya = wa^T Y + ba, and Yb = wb^T Y + bb. As earlier, we assume that there are enough examples from both the classes, so that the empirical averages can be replaced by expectations. Then, as in the earlier case, we can rewrite the optimization problem given by (13) as

\max_{\tilde{w}_a, \tilde{w}_b} \; E[Y_a Y_b] \quad \text{s.t.} \quad E[X_a^2] = 2\beta = E[X_b^2], \quad E[X_a X_b] = 0.

This is very similar to the optimization problem we derived in the previous case, with the only difference being that the covariances are replaced by the cross expectation or correlation. Thus, finding the two angle bisector hyperplanes is the same as finding two vectors w̃a and w̃b in R^{d+1} such that the cross expectation of the projections of class C- points on these vectors is maximized while keeping the cross expectation of the projections of class C+ points on these vectors at zero. Again, E[Xa^2] and E[Xb^2] are kept constant to ensure that the solutions of the optimization problem are bounded. Once again, we feel that the preceding discussion shows that the angle bisectors are a good choice as the split rule at a node in the decision tree.

IV. EXPERIMENTS

In this section, we present empirical results to show the effectiveness of our decision tree learning algorithm. We test the performance of our algorithm on several synthetic and real data sets. We compare our approach with OC1 [21] and CART-LC [1], which are among the best state-of-the-art oblique decision tree algorithms. We also compare our approach with the recently proposed linear-discriminant-analysis-based decision tree (LDDT) [15] and with the SVM classifier, which is among the best generic classifiers today. We also compare our approach with GEPSVM [28] on binary classification problems. The experimental comparisons are presented on four synthetic data sets and ten "real" data sets from the UCI ML repository [33].

Data Set Description: We generated four synthetic data sets in different dimensions, which are described here.

1) 2 × 2 checkerboard data set: 2000 points are sampled uniformly from [-1, 1] × [-1, 1]. A point is labeled +1 if it is in ([-1, 0] × [0, 1]) ∪ ([0, 1] × [-1, 0]); otherwise, it is labeled -1. Out of 2000 sampled points, 979 points are labeled +1, and 1021 points are labeled -1. Now, all the points are rotated by an angle of π/6 with respect to the first axis in the counterclockwise direction to form the final training set (a small generation sketch for this data set is given after this list).

2) 4 × 4 checkerboard data set: 2000 points are sampled uniformly from [0, 4] × [0, 4]. This whole square is divided into 16 unit squares having unit length in both dimensions. These squares are given indexes ranging from 1 to 4 on both axes. If a point falls in a unit square such that the sum of its two indices is even, then we assign label +1 to that point; otherwise, we assign label -1 to it. Out of 2000 sampled points, 997 points are labeled +1, and 1003 points are labeled -1. Now, all the points are rotated by an angle of π/6 with respect to the first axis in the counterclockwise direction.

3) Ten-dimensional synthetic data set: 2000 points are sampled uniformly from [-1, 1]^10. Consider three hyperplanes in R^10 whose parameters are given by the vectors w1 = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0]^T, w2 = [1, -1, 0, 0, 1, 0, 0, 1, 0, 0, 0]^T, and w3 = [0, 1, -1, 0, -1, 0, 1, 1, -1, 1, 0]^T. A point x is labeled +1 if the signs of w1^T x̃, w2^T x̃, and w3^T x̃ (with x̃ = [x^T 1]^T) follow one of four specified sign patterns, so that the class boundary is the piecewise linear surface formed by the three hyperplanes; otherwise, it is labeled -1. Out of 2000 sampled points, 1020 points are labeled +1, and 980 points are labeled -1.

4) 100-dimensional synthetic data set: 2000 points are sampled uniformly from [-1, 1]^100. Consider two hyperplanes in R^100 whose parameters are given by the vectors w1 = [2e_50^T, -e_50^T, 3]^T and w2 = [-e_50^T, e_50^T, 5]^T, where e_50 is a 50-dimensional


vector whose elements are all 1. The points are then labeled +1 or -1 according to the signs of w1^T x̃ and w2^T x̃, so that the class boundary is formed by the two hyperplanes. Out of 2000 sampled points, 862 points are labeled +1, and 1138 points are labeled -1.
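As an illustration of how such data can be constructed, here is a short sketch (ours) that generates a 2 × 2 checkerboard set along the lines of item 1 above: uniform sampling on [-1, 1]^2, label +1 on two diagonal quadrants, and a counterclockwise rotation by π/6. Because the sampling is random, the class counts will differ slightly from the 979/1021 split reported above.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(2000, 2))

# +1 on two diagonal quadrants (coordinates of opposite sign), -1 on the other two.
y = np.where((X[:, 0] <= 0) != (X[:, 1] <= 0), 1, -1)

# Rotate all points counterclockwise by pi/6 about the origin.
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T

print("labeled +1:", np.sum(y == 1), " labeled -1:", np.sum(y == -1))
```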

Apart from these four data sets, we also tested the GDT on several benchmark data sets downloaded from the UCI ML repository [33]. The ten data sets that we used are described in Table I. The U.S. Congressional Votes data set available on the UCI ML repository has many observations with missing values for some features. For our experiments, we choose only those observations for which there are no missing values for any feature. We also do not use all the observations in the Magic data set. It has a total of 19 020 samples of both classes. However, for our experiments, we randomly choose a total of 6000 points, with 3000 from each class.

TABLE I: DETAILS OF REAL-WORLD DATA SETS USED FROM THE UCI ML REPOSITORY

Experimental Setup: We implemented GDT and LDDT in MATLAB. For OC1 and CART-LC, we have used the downloadable package available on the Internet [34]. To learn SVM classifiers, we use the libsvm [35] code. Libsvm-2.84 [35] uses the one-versus-rest approach for multiclass classification. We have implemented GEPSVM in MATLAB. GDT has only one user-defined parameter, which is ε1 (the threshold on the fraction of the points used to decide on any node being a leaf node). For all our experiments, we have chosen ε1 using tenfold cross validation. SVM has two user-defined parameters, i.e., the penalty parameter C and the width parameter for the Gaussian kernel. The best values for these parameters are found using fivefold cross validation, and the results reported are with these parameters. Both OC1 and CART use 90% of the total number of points for training and 10% of the points for pruning. OC1 needs two more user-defined parameters. These parameters are the number of restarts R and the number of random jumps J. For our experiments, we have set R = 20 and J = 5, which are the default values suggested in the package. For the cases where we use GEPSVM with the Gaussian kernel, we found the best width parameter using fivefold cross validation.

TABLE II: COMPARISON RESULTS BETWEEN GEOMETRIC DECISION TREE AND OTHER DECISION TREE APPROACHES

Simulation Results: We now discuss the performance of GDT in comparison with other approaches on different data sets. The results provided are based on ten repetitions of tenfold cross validation. We show the average values and standard deviations computed over the ten repetitions (we do not show the standard deviation if it is less than 0.001). Table II shows the comparison results of GDT with the other decision tree approaches. In the table, we show the average and standard deviation of the accuracy, size, and depth of the tree and the time taken for each of the algorithms on each of the problems. We can intuitively take the confidence interval of the estimated accuracy of an algorithm to be one standard deviation on either side of the average. Then, we can say that, on a problem, one algorithm has significantly better accuracy than another if the confidence interval for the accuracy of the first is completely to the right of that of the second. From Table II, we see that the average accuracy of GDT is better than that of all the other decision tree algorithms, except for the Wine, Votes, and Heart data sets, where LDDT has the same or better average accuracy. In terms of the confidence interval of the average accuracy, the performance of GDT is comparable to the best of the other decision tree algorithms on the Breast Cancer, Bupa Liver, Magic, Heart, Votes, and Wine data sets. On the remaining eight data sets, the performance of GDT is significantly better than that of all the other decision tree approaches. Thus, overall, in terms of accuracy, the performance of the GDT is quite good.


Fig. 2. Comparison of GDT with OC1 on the 4 × 4 checkerboard data. (a) Hyperplanes learned at the root node and its left child using GDT. (b) Hyperplane learned at the root node and its left child node using the OC1 (oblique) decision tree.

Fig. 3. Sensitivity of the performance of GDT to the parameter ε1. The first column shows how the average cross-validation accuracy changes with ε1, and the second column shows the change of the average number of leaves with ε1.

In the majority of the cases, GDT generates trees with smaller depth and fewer leaves compared with the other decision tree approaches. This supports the idea that our algorithm better exploits the geometric structure of the data set while generating decision trees.

Timewise, the GDT algorithm is much faster than OC1 and CART, as can be seen from the results in the table. In most cases, the time taken by GDT is less by at least a factor of ten. We feel that this is because the problem of obtaining the best split rule at each node is solved using an efficient linear algebra


algorithm in the case of GDT, whereas these other approaches have to resort to search techniques because optimizing impurity-based measures is tough. In all cases, the time taken by GDT is comparable to that of LDDT. This is also to be expected because LDDT uses similar computational strategies.

TABLE III: COMPARISON RESULTS OF GEOMETRIC DECISION TREE WITH SVM AND GEPSVM

We next consider comparisons of the GDT algorithm with SVM and GEPSVM. Table III shows these comparison results. GEPSVM with the linear kernel performs the same as GDT for the 2 × 2 checkerboard problem because, for this problem, the two approaches work in a similar way. However, when more than two hyperplanes are required, GEPSVM with the Gaussian kernel performs worse than our decision tree approach. Moreover, with the Gaussian kernel, GEPSVM solves a generalized eigenvalue problem of the size of the number of points, whereas our decision tree solves a generalized eigenvalue problem of the dimension of the data at each node (which is the case with GEPSVM only when it uses the linear kernel). This gives us an extra advantage in computational cost over GEPSVM. For all binary classification problems, GDT outperforms GEPSVM.

The performance of GDT is comparable to that of SVM in terms of accuracy. GDT performs significantly better than SVM on the 10- and 100-dimensional synthetic data sets and the Balance Scale data set. GDT performs comparably to SVM on the 2 × 2 checkerboard, Bupa Liver, Pima Indian, Magic, Heart,


and Votes data sets. GDT performs worse than SVM on the 4 × 4 checkerboard and the Breast Cancer, Vehicle, and Waveforms data sets. In terms of the time taken to learn the classifier, GDT is faster than SVM in the majority of the cases. At every node of the tree, we are solving a generalized eigenvalue problem that takes time on the order of (d + 1)^3, where d is the dimension of the feature space. On the other hand, SVM solves a quadratic program whose time complexity is O(n^k), where k is between 2 and 3 and n is the number of points. Thus, in general, when the number of points is large compared to the dimension of the feature space, GDT learns the classifier faster than SVM.

Finally, in Fig. 2, we show the effectiveness of our algorithm in terms of capturing the geometric structure of the classification problem. We show the first two hyperplanes learned by our approach and by OC1 for the 4 × 4 checkerboard data. We see that our approach learns the correct geometric structure of the classification boundary, whereas OC1, which uses the Gini index as the impurity measure, does not capture that. Although GDT gets the correct decision boundary for the 4 × 4 checkerboard data set, as shown in Fig. 2, its cross-validation accuracy is less than that of SVM. This may be because the data here are dense, and hence, numerical round-off errors can affect the classification of points near the boundary. On the other hand, if we allow some margin between the data points and the decision boundary (by ensuring that all the sampled points are at least 0.05 distance away from the decision boundary), then we observed that SVM and GDT both achieve 99.8% cross-validation accuracy.

In the GDT algorithm described in Section II, ε1 is a parameter. If more than a (1 - ε1) fraction of the points falls into the majority class, then we declare that node as a leaf node and assign the class label of the majority class to that node. As we increase ε1, the chances of any node becoming a leaf node increase. This leads to smaller decision trees, and the learning time also decreases. However, the accuracy will suffer. To understand the robustness of our algorithm with respect to this parameter, we show, in Fig. 3, the variation in cross-validation accuracy and the average number of leaves with ε1. The range of values of ε1 is chosen to be 0.05-0.35. We see that the cross-validation accuracy does not change too much with ε1. However, with increasing ε1, the average number of leaves decreases. Thus, even though the tree size decreases with ε1, the cross-validation accuracy remains in a small interval. This happens because, for most of the points, the decision is governed by nodes closer to the root node. The few remaining examples, which are tough to classify, lead the decision tree to grow further. However, as the value of ε1 increases, only the nodes containing these tough-to-classify points become leaf nodes. From the results in Fig. 3, we can say that ε1 in the range of 0.1-0.3 would be appropriate for all data sets.

V. CONCLUSION

In this paper, we have presented a new algorithm for learning oblique decision trees. The novelty is in learning hyperplanes that capture the geometric structure of the class regions. At each node, we have found the two clustering hyperplanes and


chosen one of the angle bisectors as the split rule. We have presented some analysis to derive the optimization problem for which the angle bisectors are the solution. Based on this, we argued that our method of choosing the hyperplane at each node is sound. Through extensive empirical studies, we showed that the method performs better than the other decision tree approaches in terms of accuracy, size of the tree, and time. We have also shown that the classifier obtained with GDT is as good as that with SVM, whereas it is faster than SVM. Thus, overall, the algorithm presented here is a good and novel classification method.


Naresh Manwani received the B.E. degree in electronics and communication from Rajasthan University, Jaipur, India, in 2003 and the M.Tech. degree in information and communication technology from Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, India, in 2006. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. His research interests are machine learning and pattern recognition.

P. S. Sastry (S'82-M'85-SM'97) received the B.Sc. (Hons.) degree in physics from the Indian Institute of Technology, Kharagpur, India, and the B.E. degree in electrical communications engineering and the Ph.D. degree from the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India. Since 1986, he has been a faculty member with the Department of Electrical Engineering, Indian Institute of Science, where he is currently a Professor. He has held visiting positions at the University of Massachusetts, Amherst; the University of Michigan, Ann Arbor; and General Motors Research Labs, Warren. His research interests include machine learning, pattern recognition, data mining, and computational neuroscience. Dr. Sastry is a Fellow of the Indian National Academy of Engineering. He is an Associate Editor for the IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B and the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING. He was the recipient of the Sir C. V. Raman Young Scientist Award from the Government of Karnataka; the Hari Om Ashram Dr. Vikram Sarabhai Research Award from the Physical Research Laboratory, Ahmedabad, India; and the Most Valued Colleague Award from General Motors Corporation, Detroit, MI.