

Evaluating Classifiers’ Performance In A Constrained Environment

Anna Olecka

[email protected]

Fleet Boston Financial Database Marketing Department

1075 Main St. Waltham, MA 02451

RUTCOR Rutgers University

640 Bartholomew Road Piscataway, NJ 08854-8003

ABSTRACT

In this paper we focus on a methodology for finding a classifier with minimal cost in the presence of additional performance constraints. ROCCH analysis, where accuracy and cost are intertwined in the solution space, was a revolutionary tool for two-class problems. We propose an alternative formulation, as an optimization problem of the kind commonly used in Operations Research. This approach extends ROCCH analysis to locate optimal solutions when outside constraints are present. As in ROCCH analysis, we combine cost and class distribution in defining the objective function. Rather than focusing on the slopes of the edges of the convex hull of the solution space, however, we treat cost as an objective function to be minimized over the solution space by selecting the best performing classifier(s) (one or more vertices of the solution space). The Linear Programming framework provides a theoretical and computational methodology for finding the vertex (classifier) that minimizes the objective function.

1. INTRODUCTION

Consider a problem where classifiers' performance has to be evaluated taking into account additional constraints related to error rates. Such constraints often arise from implementation. They could, for example, involve a limited workforce to resolve cases of suspected fraud, a limited size of a direct mail campaign, or restrictions on the cost of incentives for responders.

An application example used throughout this paper involves an attrition model for a bank. A bank plans a calling campaign to lower the attrition rate among its customers. Naturally, one of the implementation concerns is the limited availability of resources, such as phone representatives. Traditionally, a model is selected first and summarized by a modeler into several performance buckets. Then the implementation team will "eyeball" the thresholds of the performance buckets and pick the threshold that matches the constrained resources most closely. Such business practice often results in a sub-optimal solution. If the constraints are known a priori, they can be built into the system evaluating classifiers' performance. If they are not known until implementation time, a system analogous to the ROCCH can be built to select the best classifier at that time. To this end, we propose an evaluation system that can deal with additional constraints related to prediction errors. We will also show how to apply such a system in the business scenario described above.

Provost and Fawcett have shown in [3] that some specific metrics frequently used in Machine Learning (e.g., workforce constraints and the Neyman-Pearson decision criterion) are optimized by the ROCCH method. The optimization approach proposed here extends their results to any linear constraint. In fact, our approach can be applied to an unlimited number of constraints, as long as they remain linear. Numerically finding the intersection points of such additional constraints with the ROCCH can be computationally tedious. A mathematical programming approach provides efficient tools for finding optimal solutions without explicitly calculating all the intersection points.

ROCCH analysis is a powerful and widely accepted tool for visualizing and optimizing two-class classification problems. It plots the performance of all classifiers under consideration in a two-dimensional space, with false positive rate on one axis and true positive rate on the other. It measures classifiers' performance for various costs of misclassification and under various class distributions. It allows for a visual representation of classifiers and enables quick decisions in choosing the right classifier for given costs. The main advantage of this methodology is its flexibility under varying conditions.

One drawback of this methodology is that it is somewhat rigid in looking for an optimal classifier. It forms an optimal slope from a combination of class probabilities and error costs, and then looks for one of two scenarios: either an edge of the convex hull with a slope equal to the optimal one, or a vertex between two edges where the difference between the edge slope and the optimal slope changes sign.

Similarly to the ROCCH, we construct the convex hull of all classifiers in the error space.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD ’02, July 23-26, 2002, Edmonton, Alberta, Canada. Copyright 2002 ACM 1-58113-567-X/02/0007…$5.00.


Convexity of the error space is obtained in the following steps. First, we note that a convex combination of two classifiers is also a viable classifier. Then we remove dominated classifiers. Finally, we impose additional performance constraints, if any. Those additional constraints also form a convex region, and the intersection of the two regions is a new convex hull: the potential solution space. Finally, similarly to the ROCCH, we treat a combination of costs and class probabilities as an objective function to be minimized over the solution space. Due to the existence of additional constraints, however, iterating over the slopes of the edges of the convex hull is no longer computationally efficient, because not all vertices of the convex hull are known. The theory of linear programming provides computational tools for finding the optimal solution(s) without explicit knowledge of all vertices.

2. ROC CONVEX HULL & HYBRID CLASSIFIERS

Suppose we construct a series of k-1 classifiers by varying the positive decision threshold, starting from "never" and gradually increasing the probability of the decision "Yes". By allowing more instances to be classified as positive, each new threshold will increase the number of correctly classified positive instances, but it may also increase the frequency of false positive classifications. In the (FP rate, TP rate) space, each new classifier will be positioned up and to the right of the previous one. Connecting each pair of points with a line segment generates a convex set, where each vertex represents a classifier (Figure 1).
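To make the construction concrete, here is a minimal sketch (ours, not from the original paper) that sweeps a decision threshold over a score-based classifier and records the resulting (FP rate, TP rate) points. The helper name and its NumPy inputs are illustrative assumptions.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """(FP rate, TP rate) pairs for a scoring classifier evaluated at a
    decreasing sweep of decision thresholds. scores and labels are
    NumPy arrays; labels are 0/1. Hypothetical helper, for illustration."""
    pos = labels == 1
    neg = ~pos
    points = [(0.0, 0.0)]                 # the "never say Yes" classifier
    for t in sorted(thresholds, reverse=True):
        yes = scores >= t                 # lower cutoff => more "Yes" decisions
        points.append((yes[neg].mean(),   # FP rate = p(Y | n)
                       yes[pos].mean()))  # TP rate = p(Y | y)
    return points                         # each point is up/right of the last
```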

Table 1. Neural net model for bank attrition

  Cutoff/Threshold   FP rate p(Y|n)   TP rate p(Y|y)
  Never              0                0
  0.54               0.06             0.25
  0.48               0.14             0.43
  0.42               0.23             0.58
  0.36               0.33             0.70
  0.30               0.42             0.82
  0.24               0.53             0.90
  Always             1                1

Table 1 represents a set of classifiers obtained from a neural network model for a bank attrition problem. Figure 1 shows the resulting convex set.

Figure 1. ROC curve for the neural net attrition model

Figure 2 shows a set of classifiers obtained by varying the thresholds of two logistic regression models for the attrition data. In this representation we can visually recognize dominated classifiers: they are positioned inside the convex region, while potential candidates for an optimal solution lie on the boundary. It is easy to see [3] that under any cost structure there is a classifier on the boundary which will outperform a dominated classifier.

Figure 2. Two competing logistic regression models

Figure 3 shows a hybrid classifier obtained by removing dominated points. Any point positioned inside the bounded region can be outperformed by some point on the boundary. The boundary forms a hybrid classifier, a set of potential candidates for an optimal solution.

Figure 3. ROCCH for the two logistic regression models

2.1 Remark on convexity

The hybrid classifier obtained from our two logistic regression models formed a convex set in the (FP rate, TP rate) plane. In general, this does not have to be the case. Figure 4 provides such an example: at point B, where the region boundary transitions from the neural network model to the logistic model, the boundary "dips" below the line from A to C. This can be "fixed" by creating a new classifier B' on the (A, C) line. For example, if we want a point halfway between A and C, the new classifier is obtained by using classifiers A and C randomly, each with probability 0.5. Provost and Fawcett describe a similar scheme, using random sampling to create a new classifier. In general, any convex combination of two classifiers is itself a classifier. For any point X on a line segment created by classifiers π1 and π2, we can construct a new classifier corresponding to that point. We start by describing such a point as a convex combination



X = α*π1 + (1-α)*π2, where 0 ≤ α ≤ 1. Then we randomly partition the population being classified into two groups, in proportions α and 1-α, and apply the appropriate model to each group.
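A minimal sketch of this randomized construction, assuming pi1 and pi2 are available as 0/1 prediction functions (the names are illustrative, not from the paper): each instance is routed to π1 with probability α, so the expected (FP rate, TP rate) of the mixture is the convex combination of the two operating points.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def mix_classifiers(pi1, pi2, alpha, instances):
    """Realize X = alpha*pi1 + (1-alpha)*pi2 by randomly partitioning the
    population: each instance goes to pi1 with probability alpha."""
    use_pi1 = rng.random(len(instances)) < alpha
    return [pi1(inst) if take else pi2(inst)
            for inst, take in zip(instances, use_pi1)]
```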

Figure 4. Boundary of ROC space for two classifiers is not always convex

A new, convex boundary for the attrition problem is shown in Figure 5.

Figure 5. Convexity is obtained by creating convex combinations of existing classifiers

3. BASIC TERMINOLOGY

The following terminology is used throughout the remaining sections.

− Two classes: positive (y) and negative (n), with probabilities p and 1-p respectively
− Classification decision: positive (Y) and negative (N)
− FP, TP, FN, TN represent the number of instances of each kind: false positive, true positive, false negative and true negative respectively
− The rates of those instances are: FP_rate = p(Y|n), TP_rate = p(Y|y), FN_rate = p(N|y), TN_rate = p(N|n)
− In the ROC space, the horizontal axis represents FP_rate and the vertical axis represents TP_rate. We will use x to denote FP_rate and y to denote TP_rate
− Unit cost of a false positive error = c(Y|n)
− Unit cost of a false negative error = c(N|y)

In defining the cost, we start with the expected number of errors and transform the resulting formula into ROC space terms, where the variables are error rates. The scheme in Table 2 visualizes the interdependencies between these terms.

Table 2. Two-class classification scheme

  Population M    Class y (p*M)     Class n ((1-p)*M)
  Decision Y      TP  (rate y)      FP  (rate x)
  Decision N      FN  (rate 1-y)    TN  (rate 1-x)

Let M be the total number of instances being classified. For given p, x and y, we can calculate the expected number of classified cases of each kind as follows:

TP = M*p*y
FN = M*p*(1-y)
FP = M*(1-p)*x
TN = M*(1-p)*(1-x)
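These formulas translate directly into a small helper, used in later sketches (the function name is ours):

```python
def expected_counts(M, p, x, y):
    """Expected confusion counts for population size M, positive-class
    prior p, FP rate x and TP rate y (formulas from the text)."""
    TP = M * p * y
    FN = M * p * (1 - y)
    FP = M * (1 - p) * x
    TN = M * (1 - p) * (1 - x)
    return TP, FN, FP, TN
```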

4. COST FUNCTION

We now apply the cost function to the attrition problem and look for a classifier on the boundary of the convex hull which will minimize the total cost. A false negative classification means that we will not recognize an attriting customer and will lose the account. The cost of losing a customer is tied to the net income after taxes (NIAT) this customer brings to the bank. In addition, both types of error generate labor costs related to the preventive action. The line of business decided to assign error costs as shown in Table 3.

Table 3. Misclassification costs assignment

  Error                     Cost per unit
  False positive, c(Y|n)    $30
  False negative, c(N|y)    $2,575

Given the cost of a false negative error c(N|y) and of a false positive error c(Y|n), the total expected cost is

EC = c(Y|n)*E(FP) + c(N|y)*E(FN)

where E(FN) is the expected number of false negatives and E(FP) is the expected number of false positives.
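Combined with the expected_counts helper above, the expected cost is one line (function name ours, Table 3 unit costs as defaults):

```python
def expected_cost(M, p, x, y, c_fp=30.0, c_fn=2575.0):
    """EC = c(Y|n)*E(FP) + c(N|y)*E(FN), with Table 3 costs as defaults."""
    _, FN, FP, _ = expected_counts(M, p, x, y)
    return c_fp * FP + c_fn * FN
```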


We want to minimize the expected total cost in terms of the decision variables x and y:

Minimize EC = c(Y|n)*E(FP) + c(N|y)*E(FN)
            = c(Y|n)*(1-p)*M*x + c(N|y)*p*M*(1-y)
            = c(Y|n)*(1-p)*M*x + c(N|y)*p*M - c(N|y)*p*M*y

After subtracting the constant term (not dependent on the decision variables) and dividing by M, this is equivalent to:

Minimize EC/M - c(N|y)*p = c(Y|n)*(1-p)*x - c(N|y)*p*y

This in turn is equivalent to maximizing the opposite function:

Maximize C' = - c(Y|n)*(1-p)*x + c(N|y)*p*y

In the (x, y) space, C' defines a collection of parallel lines whose slope depends on the misclassification costs and the a priori probability of the positive class. In the attrition example we are analyzing, p = 3.75%. We are now ready to calculate the slope of the cost lines:

m = [c(Y|n)*(1-p)] / [c(N|y)*p] = 30*(1-0.0375) / (2575*0.0375) ≈ 0.3

The intercepts of those lines vary with the position of each line on the plane: the higher a line sits in relation to y, the higher the value of C'. Since our objective is to maximize C', we can start at the bottom of the convex hull and move the line up as far as possible. Every point of the region touching the line in a given position has the same cost; this defines iso-performance lines. We want to find a point in the convex region that the line touches in its highest possible position. Figure 6 shows the convex hull for the attrition problem and the cost function moving up through the region. The highest position of the cost line is attained at vertex A. The theory of Linear Programming shows that for a convex and bounded region, the objective function is maximized either at one of the vertices or on a line segment joining two vertices (Bazaraa [1]). Thus we can always pick one or more best-performing classifiers.
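Since the optimum is attained at a vertex (or an edge), the unconstrained search reduces to scoring each vertex by C'. A minimal sketch, with the vertex list read off Table 1 and the spreadsheet figures (Figures 9-10):

```python
# Vertices (FP rate, TP rate) of the hybrid classifier for the attrition data
vertices = [(0.00, 0.00), (0.06, 0.25), (0.14, 0.43), (0.23, 0.58),
            (0.33, 0.70), (0.42, 0.82), (0.53, 0.90), (0.64, 0.96),
            (0.75, 0.98), (0.88, 0.99), (1.00, 1.00)]

c_fp, c_fn, p = 30.0, 2575.0, 0.0375   # Table 3 costs, attrition rate

def modified_cost(x, y):
    """C' = -c(Y|n)*(1-p)*x + c(N|y)*p*y, to be maximized."""
    return -c_fp * (1 - p) * x + c_fn * p * y

best = max(vertices, key=lambda v: modified_cost(*v))
print(best)   # (0.64, 0.96): vertex A of Figure 6
```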

Figure 6. Cost functions traversing the ROC convex hull

As in ROCCH analysis, we have iso-performance lines which help visualize the performance of classifiers under various cost structures. In the Linear Programming setting, however, the feasible region can incorporate any additional constraints; we gain flexibility and control.

5. ADDING ADDITIONAL CONSTRAINTS

In the attrition example, some implicit constraints are already built into the ROC space: error rates are non-negative and cannot exceed 1. Those constraints bound the convex set and guarantee the existence of an optimal solution. In addition, the planned calling campaign is limited by the availability of customer service resources. The phone call scenario included approaching a customer to find out whether their intention was indeed to leave the bank, and attempting to entice them to stay. Naturally, there are limited resources a bank can devote to this undertaking. Traditionally, modelers would select the best model, and any additional constraints would be imposed at implementation time, based on the pre-selected model. But if a modeler is aware of the constraints, they can be built into the system evaluating the model's performance. The line of business for which these classifiers were developed determined that they could handle calling 20% of their customer base. This results in the constraint

FP + TP ≤ 0.2*M

In order to plot the constraint in the (x, y) error rate space, we need to formulate it in terms of the decision variables. Since FP = M*(1-p)*x and TP = M*p*y, dividing by M gives

(1-p)*x + p*y ≤ 0.2

Geometrically, this inequality is represented by a half-plane in the (x, y) space. Figure 7 shows the attrition problem with the additional constraint. The feasible region is now the intersection of the original convex set with the half-plane: the area to the left of (below) the dotted line.

Figure 7. The capacity constraint bounds the feasible region

Our optimal classifier is no longer feasible, since it lies outside the new feasible region. We need to "slide" the cost line back down to the feasible region. It will touch the feasible region at the point where the hybrid classifier intersects the constraint line. As noted earlier, this point defines a new classifier.


The new classifier is now optimal, as shown in Figure 8.

Figure 8. New optimal solution

In the next section we show how to find the new vertex. Here we point out an actual implementation of the new solution, as outlined by Provost and Fawcett in [3]. Assume that the intersection point divides the line segment between A and B at ratio α. To obtain the new classifier, we can proceed as follows:

1. With probability α use classifier A
2. With probability 1-α use classifier B

If A and B were obtained from the same model by varying the decision threshold, this process can be simplified by finding an appropriate threshold between A and B, as sketched below.
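Finding the crossing point and the ratio α is linear interpolation along the segment A-B. A sketch (function name ours); the usage line takes A = (0.14, 0.43) and B = (0.23, 0.58), the classifiers adjacent to the crossing, and the binding capacity value 0.16 shown in the Appendix report:

```python
def mixing_ratio(A, B, p, cap):
    """alpha such that X = alpha*A + (1-alpha)*B lies on the capacity line
    (1-p)*x + p*y = cap. A and B are (FP rate, TP rate) pairs."""
    g = lambda pt: (1 - p) * pt[0] + p * pt[1]   # constraint value at a point
    return (cap - g(B)) / (g(A) - g(B))          # linear interpolation

# Reproduces the reported solution (0.15, 0.44): alpha comes out near 0.9
alpha = mixing_ratio((0.14, 0.43), (0.23, 0.58), p=0.0375, cap=0.16)
```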

6. LINEAR PROGRAMMING FORMULATION

A linear program is a constrained optimization problem where the objective function, as well as all constraints, are linear. We need to select values for all decision variables so that all constraints are satisfied and the objective function is minimized (or maximized). In our case, we need to pick the point(s) in the ROC space which minimize the cost function (or maximize the modified cost function). The decision variables are then the coordinates (x, y) of points in the ROC space. Some of the constraints arise from the classifiers' performance: in the ROC space, points under consideration need to be below the boundary of the convex region. Additional constraints are usually related to implementation and/or quality issues. Finally, we have non-negativity and bounding constraints, since the ROC variables have to be between 0 and 1. A canonical form of a linear (maximization) program takes the following format:

Maximize C = Σ_j c_j*x_j

subject to

Σ_j a_ij*x_j ≤ b_i,   i = 1,…,k
x_j ≥ 0,              j = 1,…,d

The theory of linear programming assures us that if the set of constraints forms a convex and bounded set (called the feasible region), then an optimal solution is found on the boundary of the feasible region [1]. In our two-dimensional case, a solution can be found at a vertex, or on an edge joining two vertices. Note that the feasible set is created as a conjunction of several linear inequalities, so not all vertices are known explicitly. A number of computational techniques have been designed to find optimal solutions without explicitly iterating over the vertices; more details can be found in [1] and [2].

The theory of Linear Programming also aids the analysis of a solution once found, if we want to play some "what if" scenarios. At a point of optimality, some constraints will be "binding": the left-hand side will be equal to the right-hand side. Others will be satisfied as a strict inequality, leaving "slack", or room for improvement. In a two-dimensional case, if two constraints are binding at an optimal solution, the classifier at the intersection of those constraints is optimal. An analysis of slack at neighboring points (called marginal analysis or sensitivity analysis) often provides insight into alternative solutions and how close they are to optimality. All commercially available optimization packages, including the module that comes with Excel, will provide slack information for all constraints. We show an example of an Excel sensitivity analysis report in the Appendix. In that example, the capacity constraint, which was based on workforce availability, is binding. That means that any further improvement of the classifier's performance will require additional workforce resources. Additional resources will provide slack on the capacity constraint, allowing the optimal solution to move along the convex hull boundary in a direction improving the objective function.

We will now formulate the problem of looking for an optimal classifier in the presence of additional constraints as a linear programming problem. We already have a formal representation of the objective function:

Maximize C'(x, y) = - c(Y|n)*(1-p)*x + c(N|y)*p*y

Now we need to formulate the set of constraints related to the classifiers, and add the additional, bounding and non-negativity constraints. Consider two classifiers P_i and P_j on the boundary, with error rates (x_i, y_i) and (x_j, y_j), and assume that x_i ≤ x_j. The line segment through the points P_i and P_j has slope

m = (y_j - y_i) / (x_j - x_i)

and equation

y - y_i = m*(x - x_i)

The feasible region is positioned below each such line, so it is determined by a collection of inequalities

y - y_i ≤ ((y_j - y_i) / (x_j - x_i))*(x - x_i)

which can be rearranged as

- (y_j - y_i)*x + (x_j - x_i)*y ≤ - x_i*(y_j - y_i) + y_i*(x_j - x_i)


The optimization problem can now be defined as follows. Given classifiers P_i with error rates (x_i, y_i), i = 1,2,…,k, such that x_i ≤ x_(i+1):

Maximize C'(x, y) = - c(Y|n)*(1-p)*x + c(N|y)*p*y

subject to

a1_ij*x + a2_ij*y ≤ b_ij   for i = 1,2,…,k; j = i+1
π1_l*x + π2_l*y ≤ q_l      for l = 1,2,…,l0
0 ≤ x ≤ 1, 0 ≤ y ≤ 1

where

a1_ij = - (y_j - y_i),  a2_ij = (x_j - x_i)
b_ij = - x_i*(y_j - y_i) + y_i*(x_j - x_i)

and π1_l, π2_l and q_l form the additional constraints.
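This formulation drops directly into an off-the-shelf LP solver. Below is a sketch using SciPy's linprog in place of the Excel solver used in the paper; the vertex list is read off Table 1 and the spreadsheet figures, and since linprog minimizes, we negate C'. The function name and its defaults are ours.

```python
import numpy as np
from scipy.optimize import linprog

# Vertices (FP rate, TP rate) of the hybrid classifier, sorted by x
pts = np.array([(0.00, 0.00), (0.06, 0.25), (0.14, 0.43), (0.23, 0.58),
                (0.33, 0.70), (0.42, 0.82), (0.53, 0.90), (0.64, 0.96),
                (0.75, 0.98), (0.88, 0.99), (1.00, 1.00)])
c_fp, c_fn, p = 30.0, 2575.0, 0.0375

def solve(capacity=None, tp_min=None):
    """Maximize C' over the region under the ROCCH, with an optional
    capacity constraint (1-p)*x + p*y <= capacity and an optional
    missed-opportunity constraint y >= tp_min."""
    A_ub, b_ub = [], []
    for (x1, y1), (x2, y2) in zip(pts[:-1], pts[1:]):   # hull edges
        m = (y2 - y1) / (x2 - x1)
        A_ub.append([-m, 1.0])                # -m*x + y <= -m*x1 + y1
        b_ub.append(-m * x1 + y1)
    if capacity is not None:
        A_ub.append([1 - p, p]); b_ub.append(capacity)
    if tp_min is not None:
        A_ub.append([0.0, -1.0]); b_ub.append(-tp_min)  # y >= tp_min
    c = [c_fp * (1 - p), -c_fn * p]           # linprog minimizes, so use -C'
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1), (0, 1)])
    return res.x, -res.fun

print(solve())               # about (0.64, 0.96), as in Figure 9
print(solve(capacity=0.16))  # about (0.15, 0.44), as in Figure 10
```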

6.1 Computational considerations

In the absence of additional constraints, all vertices are known a priori. In such a case it is computationally efficient to explicitly calculate the expected cost of each classifier and choose the one with the smallest cost, rather than actually sliding the cost lines over the feasible region. When additional constraints are introduced, the situation changes: finding the new vertices formed by the additional intersection points can get cumbersome. Fortunately, the theory and practice of mathematical programming provide efficient algorithms for finding optimal solutions without a need to iterate over all vertices. There are also efficient solvers on the market; an add-on to Excel provides one such solver. It is efficient for problems in small dimensions, such as the ones described here, and it is available in almost any business setting. Figure 9 shows the original attrition problem, without the additional constraint, solved in Excel. The spreadsheet cell labeled Cost contains the cost formula. The value of that cell is maximized by changing the decision variables x and y, subject to all constraints being satisfied. The point (0.64, 0.96) maximizes the modified cost function and thus minimizes the expected cost.

Figure 9. Excel optimizer solves the optimization problem

The slopes of the line segments on the boundary are calculated in the third column. Note that at the point (0.64, 0.96) the slope changes from 0.55 to 0.21. As noted earlier, the slope of the cost line is 0.3, so the point (0.64, 0.96) would have been selected as optimal by traditional ROCCH analysis as well. Figure 10 shows the new solution after the additional capacity constraint is added: (0.15, 0.44).

Figure 10. Excel solution incorporating the capacity constraint

6.2 Note on practical considerations

For all practical purposes, a business setting often prefers speed and expediency of delivery to an optimal solution, as long as the classifier's performance is near optimal. In the case of the attrition problem we were lucky, in that one of the existing classifiers, (0.14, 0.43), was in close proximity to the optimal solution (0.15, 0.44) and still within the feasible region. Selecting a near-optimal solution was in this case a simple decision, because this was a relatively simple problem with just one additional constraint.

6.3 New performance constraints

Analyzing the new optimal solution, we notice that the true positive rate is 0.44. In other words, out of all instances of the positive class, only 44% (less than half) are classified correctly. This is not a desirable performance: 56% of defecting customers remain unrecognized and, without being contacted, will leave the bank. Solution analysis (see Appendix) shows that the capacity constraint is binding at the optimal point. We cannot increase the true positive rate y without leaving the feasible region and violating the capacity constraint. Any further improvement in the attritors' recognition rate will require additional resources, so that the capacity constraint can be relaxed. The optimization template set up in Excel allowed us to play a number of "what if" scenarios. In a final compromise, the line of business decided to relax the capacity constraint to allow 30% of the population to be classified as positive. In return, we added a new "missed opportunity" constraint, requiring that the true positive rate be no less than 60%.

[Figure 9 spreadsheet: data points (FP rate, TP rate) = (0, 0), (0.06, 0.25), (0.14, 0.43), (0.23, 0.58), (0.33, 0.70), (0.42, 0.82), (0.53, 0.90), (0.64, 0.96), (0.75, 0.98), (0.88, 0.99), (1.00, 1) with boundary slopes; the modified cost function attains its maximum of 74.29 at (x, y) = (0.64, 0.96).]

[Figure 10 spreadsheet: the same data points with the capacity constraint row added; the maximum of 38.33 is attained at (x, y) = (0.15, 0.44), with the capacity constraint binding.]

The optimal solution for this problem with two additional constraints is shown in Figure 11. The feasible region is now the triangle contained between three lines: the classifier constraint, the performance constraint y ≥ 0.6, and the relaxed capacity constraint (1-p)*x + p*y ≤ 0.3.
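With the solve() sketch from Section 6, this relaxed scenario is a one-line call:

```python
# Relaxed capacity plus the missed-opportunity constraint of Section 6.3
point, value = solve(capacity=0.30, tp_min=0.60)
print(point)   # about (0.29, 0.65), as in Figure 12
```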

Figure 11. Feasible region with two additional constraints

The optimal solution (0.29, 0.65) found by the Excel optimizer is shown in Figure 12.

Figure 12. Excel solution with two additional constraints: maximum modified cost 54.39 at (0.29, 0.65)

Other examples of frequently used performance constraints include:

− limited size of a direct mail campaign
− restrictions on processing capacity for responders to a campaign
− limited expense of incentives for responders
− limited capacity of response processing systems (indirectly limiting a campaign size)

In each case, we need to express the constraints in terms of the decision variables, to create the new convex region to intersect with the one created by the classifier set. Note that in this formulation we do not need to know the intersection points of the ROCCH with the constraints' convex region: the optimal solution can be found with no explicit knowledge of the vertices of the combined convex hull.

7. CONCLUSION

Linear Programming is part of a broad field of constrained optimization called Mathematical Programming. Mathematical Programming is a rich discipline and its applications have been growing fast in the past several decades. Once a problem has been framed as a mathematical programming problem, we gain a powerful tool and, more importantly, we can draw from that field's rich legacy. Some future developments may, for example, include multi-dimensional problems, where a classification problem has more than two classes. Computational difficulties grow fast with a problem's dimension, but the techniques and methods of Linear Programming can help overcome computational issues. We could also consider problems involving a non-linear cost function; the theory of Non-linear Programming may guide us in minimizing non-linear costs, such as piecewise linear or quadratic ones. Other areas of pursuit may include methods to deal with classifier and/or constraint uncertainty. Stochastic Programming methods could perhaps be employed to find optimal solutions when the standard errors of classifiers need to be taken into account.

At the same time, we are not losing the main benefits of the ROCCH. We maintain the benefit of visualization, except that the cost function now "slides" along the convex hull. The benefit of being able to choose the optimal classifier at run time, if constraints change, is maintained as well. The selection process would proceed as follows:

− Input new costs and constraints
− Intersect the constraint space with the ROC space
− Input all competing points
− Solve the optimization problem

8. ACKNOWLEDGEMENTS

Thanks to all who helped to inspire, develop and formulate the above thoughts. In particular, my gratitude goes to Endre Boros, Stan Matwin and Vera Helman, who generously shared their knowledge and experience and provided feedback throughout the various stages of this work.

9. REFERENCES

[1] Bazaraa, Mokhtar S., et al. Linear Programming and Network Flows. John Wiley & Sons, Inc., 1997.
[2] Murty, Katta G. Operations Research: Deterministic Optimization Models. Prentice Hall, Inc., 1995.
[3] Provost, F., Fawcett, T. Robust Classification Systems for Imprecise Environments. Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[4] Provost, F., Fawcett, T. Robust Classification for Imprecise Environments. Machine Learning, Vol. 42, No. 3, 2001, pp. 203-231.


10. APPENDIX

Answer report and sensitivity analysis report produced by the Excel optimizer for the attrition problem with the capacity constraint.

Microsoft Excel 8.0e Answer Report
Worksheet: [ROCCH1.xls]OptimizationReport
Report Created: 2/28/2002 6:16:12 PM

Target Cell (Max)
  Cell   Name   Original Value   Final Value
  $D$4   Max    0.00             38.33

Adjustable Cells
  Cell   Name   Original Value   Final Value
  $E$4   x      0.00             0.15
  $F$4   y      0.00             0.44

Constraints
  Cell    Name            Cell Value   Formula        Status       Slack
  $D$7    - mx + y        0.00         $D$7<=$F$7     Binding      0
  $D$8    - mx + y        -0.14        $D$8<=$F$8     Not Binding  0.14
  $D$9    - mx + y        0.09         $D$9<=$F$9     Not Binding  0.01
  $D$10   - mx + y        0.19         $D$10<=$F$10   Binding      0
  $D$11   - mx + y        0.24         $D$11<=$F$11   Not Binding  0.02
  $D$12   - mx + y        0.25         $D$12<=$F$12   Not Binding  0.04
  $D$13   - mx + y        0.33         $D$13<=$F$13   Not Binding  0.19
  $D$14   - mx + y        0.36         $D$14<=$F$14   Not Binding  0.25
  $D$15   - mx + y        0.41         $D$15<=$F$15   Not Binding  0.41
  $D$16   - mx + y        0.43         $D$16<=$F$16   Not Binding  0.49
  $D$20   (1-p)*x + p*y   0.16         $D$20<=$F$20   Binding      0
  $F$4    y               0.44         $F$4<=1        Not Binding  0.56
  $E$4    x               0.15         $E$4<=1        Not Binding  0.85

Microsoft Excel 8.0e Sensitivity Report
Worksheet: [ROCCH1.xls]OptimizationReport
Report Created: 2/28/2002 6:20:01 PM

Adjustable Cells
  Cell   Name   Final Value   Reduced Cost   Objective Coefficient   Allowable Increase   Allowable Decrease
  $E$4   x      0.15          0              -28.9                   2507.31              131.01
  $F$4   y      0.44          0              96.6                    1E+30                79.12

Constraints
  Cell    Name            Final Value   Shadow Price   Constraint R.H. Side   Allowable Increase   Allowable Decrease
  $D$7    - mx + y        0.00          0              0                      1E+30                0
  $D$8    - mx + y        -0.14         0              0                      1E+30                0.14
  $D$9    - mx + y        0.09          0              0.10                   1E+30                0.01
  $D$10   - mx + y        0.19          91.77          0.19                   0.01                 0.47
  $D$11   - mx + y        0.24          0              0.26                   1E+30                0.02
  $D$12   - mx + y        0.25          0              0.29                   1E+30                0.04
  $D$13   - mx + y        0.33          0              0.52                   1E+30                0.19
  $D$14   - mx + y        0.36          0              0.61                   1E+30                0.25
  $D$15   - mx + y        0.41          0              0.82                   1E+30                0.41
  $D$16   - mx + y        0.43          0              0.92                   1E+30                0.49
  $D$20   (1-p)*x + p*y   0.16          127.87         0.16                   0.08                 0.01