GeCo: Quality Counterfactual Explanations in Real Time

Maximilian Schleich, University of Washington
Zixuan Geng, University of Washington
Yihong Zhang, University of Washington
Dan Suciu, University of Washington
ABSTRACT
Machine learning is increasingly applied in high-stakes decision making that directly affects people's lives, and this leads to an increased demand for systems to explain their decisions. Explanations often take the form of counterfactuals, which consist of conveying to the end user what she/he needs to change in order to improve the outcome. Computing counterfactual explanations is challenging because of the inherent tension between a rich semantics of the domain and the need for real-time response. In this paper we present GeCo, the first system that can compute plausible and feasible counterfactual explanations in real time. At its core, GeCo relies on a genetic algorithm, which is customized to favor searching counterfactual explanations with the smallest number of changes. To achieve real-time performance, we introduce two novel optimizations: Δ-representation of candidate counterfactuals, and partial evaluation of the classifier. We compare GeCo empirically against four other systems described in the literature, and show that it is the only system that can achieve both high-quality explanations and real-time answers.
Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/mjschleich/GeCo.jl.
1 INTRODUCTION
Machine learning is increasingly applied in high-stakes decision making that directly affects people's lives. As a result, there is a huge need to ensure that the models and their predictions are interpretable by their human users. Motivated by this need, there has been a lot of recent interest within the machine learning community in techniques that can explain the outcomes of models. Explanations improve the transparency and interpretability of the underlying model, they increase users' trust in the model predictions, and they are a key facilitator in evaluating the fairness of the model for underrepresented demographics. The ability to explain is no longer a nice-to-have feature, but is increasingly required by law; for example, the GDPR regulations grant users the right to an explanation of automated decision algorithms [30]. In addition to supporting the end user, explanations can also be used by model developers to debug and monitor their ever more complex models.
In this paper we focus on local explanations, which provide post-hoc explanations for one single prediction, in contrast to global explanations, which aim to explain the entire model (e.g., for debugging purposes). In particular we study counterfactual explanations: given an instance x, on which the machine learning model predicts a negative, "bad" outcome, the explanation says what needs to change in order to get the positive, "good" outcome, usually represented by a counterfactual example x_cf. For example, a customer applies for a loan with a bank, the bank denies the loan application, and the customer asks for an explanation; the system responds by indicating what features need to change in order for the loan to be approved, see Fig. 1. In the AI literature, counterfactual explanations are currently considered the most attractive type of local explanation, because they offer actionable feedback to the customers, and have been deemed satisfactory by some legislative bodies; for example, they are deemed to satisfy the GDPR requirements [30].

[Figure 1 depicts a loan scenario: the factual instance x = (Age 28, Income 800, Debt 200, Accounts 5) is classified M(x) = no loan, while the counterfactual x_cf = (Age 28, Income 1000, Debt 200, Accounts 3) is classified M(x_cf) = loan. Explanation: "Your loan will be approved, if you increase income by $200 and close two accounts."]
Figure 1: Example of a counterfactual explanation scenario.
The major challenge in computing a counterfactual explanation is the tension between a rich semantics on one hand, and the need for real-time, interactive feedback on the other. The semantics needs to be rich in order to reflect the complexities of the real world. We want x_cf to be as close as possible to x, but we also want x_cf to be plausible, meaning that its features should make sense in the real world. We also want the transition from x to x_cf to be feasible; for example, age should only increase. The plausibility and feasibility constraints are dictated by laws, societal norms, and application-specific requirements, and may even change over time; an explanation system must be able to support constraints with a rich semantics. Moreover, the search space for counterfactuals is huge, because there are often hundreds of features, and each can take values from some large domain. On the other hand, the computation of counterfactuals needs to be done at interactive speed, because the explanation system is eventually incorporated in a user interface. Performance has been identified as the main challenge for the deployment of counterfactual explanations in industry [6, 31]. The gold standard for interactive response in data visualization
arXiv:2101.01292v1 [cs.LG] 5 Jan 2021
                 | Limited search space | Complete search space
 Non-interactive | CERTIFAI [25]        | MACE [12], DiCE [17]
 Interactive     | What-If [31]         | GeCo (this paper)

Figure 2: Taxonomy of Counterfactual Explanation Systems
is 500 ms [11, 15], and we believe that a similar bar should be set for explanations. The tension between performance and rich semantics is the main technical challenge in counterfactual explanations. Previous systems either explore a complete search space with rich semantics, or answer at interactive speed, but not both: see Fig. 2 and the discussion in Section 6. For example, on one extreme, MACE [12] enforces plausibility and feasibility by using a general-purpose constraint language, but the solver often takes many minutes to find a counterfactual. At the other extreme, Google's What-If Tool (WIT) [31] restricts the search space to a fixed dataset of examples, ensuring fast response time, but poor explanations.
In this paper we present GeCo, the first interactive system for counterfactual explanations that supports a complex, real-life semantics of counterfactuals, yet provides answers in real time. At its core, GeCo defines a search space of counterfactuals using a plausibility-feasibility constraint language, PLAF, and a database D. PLAF is used to define constraints like "age can only increase", while the database D is used to capture correlations, like "job-title and salary are correlated". By design, the search space of possible counterfactuals is huge. To search this space, we make a simple observation. A good explanation x_cf should differ from x by only a few features; counterfactual examples x_cf that require the customer to change too many features are of little interest. Based on this observation, we propose a genetic algorithm, which we customize to search the space of counterfactuals by prioritizing those that have fewer changes. Starting from a population consisting of just the given entity x, the algorithm repeatedly updates the population by applying the operations crossover and mutation, and then selecting the best counterfactuals for the new generation. It stops when it reaches a sufficient number of examples on which the classifier returns the "good" (desired) outcome.
The main performance limitation in GeCo is its innermost loop. By the nature of the genetic algorithm, GeCo needs to repeatedly add and remove counterfactuals to and from the current population, and ends up having to examine thousands of candidates and apply the classifier M(x') on each of them. We propose two novel optimizations to speed up the inner loop of GeCo: Δ-representation and classifier specialization via partial evaluation. In the Δ-representation, we group the current population by the set of features ΔF by which they differ from x, and represent each entire subpopulation in a single relation whose attributes are only ΔF; for example, all candidate examples x' that differ from x only by age are represented by a single-column table, storing only the modified age. This leads to huge memory savings over the naive representation (storing all features of all candidates x'), which, in turn, leads to performance improvements. Partial evaluation specializes the code of the classifier M to entities x' that differ from x only in ΔF; for example, if M is a decision tree, then normally M(x') inspects all features from the root to a leaf, but if ΔF is age, then after partial evaluation it only needs to inspect age (for example, M can reduce to the test age > 30, or perhaps 30 < age < 60), because all the other features are constants within this subpopulation. Δ-representation and partial evaluation work well together, and, when combined, allow GeCo to compute the counterfactual in interactive time. At a high level, our optimizations belong to a more general effort that uses database and program optimizations to improve the performance of machine learning tasks (e.g., [4, 9, 14, 19, 23]).
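To make the two optimizations concrete, here is a minimal sketch in Python (not GeCo's actual code; the decision tree and feature values are hypothetical toy examples):

```python
# Toy sketch of Delta-representation and partial evaluation.
x = {"age": 28, "income": 800, "debt": 200, "accounts": 5}

def classify(e):
    # full classifier: a toy decision tree inspecting several features
    if e["income"] <= 900:
        return 0.0
    return 1.0 if e["debt"] < 500 else 0.0

def specialize_income(x):
    # partial evaluation: fold every feature except `income` to its
    # constant value in x, leaving a residual test on income alone
    if x["debt"] < 500:
        return lambda income: 1.0 if income > 900 else 0.0
    return lambda income: 0.0

m_income = specialize_income(x)

# Delta-representation: candidates differing from x only in income are
# stored as a single column of income values, not full feature vectors
delta_income = [850, 950, 1000]
outcomes = [m_income(v) for v in delta_income]
```

The specialized classifier agrees with the full classifier on this subpopulation while touching only the one changed column, which is exactly why the two optimizations compose well.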
We benchmarked GeCo against four counterfactual explanation systems: MACE [12], DiCE [17], the What-If Tool [31], and CERTIFAI [25]. We show that in all cases GeCo produced the best quality explanation (using quality metrics defined in [12]), at interactive speed (under 500 ms). We also conduct micro-experiments showing that the two optimizations, Δ-representation and partial evaluation, account for a performance improvement of up to 5.2x, and thus are critical for an interactive deployment of GeCo.
Discussion. Early explanation systems were designed for a specific class of models, e.g., DeepLift [26] is specific to neural networks; hence they are called white-box. While many systems today claim to be black-box, there is yet no commonly agreed definition of what black-box means. In this paper we adopt a strict definition: the classifier M is black-box if it is available only as an oracle that, when given an input x', returns the outcome M(x'). A black-box model makes it much more difficult to explore the search space of counterfactuals, because we cannot examine the code of M for hints of how to quickly get from x to a counterfactual x_cf. A gray-box model allows us to examine the code of M. Unlike white-box, gray-box does not restrict what M does, but it gives us access to its source code. Some previous systems that were called black-box are, in fact, gray-box: MACE [12] translates both the classifier logic and the feasibility/plausibility constraints into a logical formula and then solves for counterfactuals via multiple calls to an SMT solver. The explanation systems based on gradient descent [18, 30] are also gray-box, since they need access to the code of M in order to compute its gradient. Fully black-box models are CERTIFAI [25] and Google's What-If Tool (WIT) [31]. CERTIFAI is based on a genetic algorithm, but the quality of its explanations is highly sensitive to the computation of the initial population; we discuss CERTIFAI in detail in Sec. 4.7. WIT severely limits its search space to a given dataset. A black-box classifier also makes it difficult for the system to compute an explanation that is consistent with the underlying classifier, meaning that M(x_cf) returns the desired, "good", outcome. As an example, LIME [21] uses a black-box classifier, learns a simple, interpretable model locally, around the input data point x, and uses it to obtain an explanation; however, its explanations are not always consistent with the predictions of the original classifier [22, 27]. Our system, GeCo, is black-box if we turn off the partial evaluation optimization. With the optimization enabled, GeCo is gray-box. Thus, GeCo differs from other systems in that it uses access to the code of M not to guide the search, but only to optimize the execution time.
Contributions. In summary, in this paper we make the following contributions.
• We describe GeCo, the first explanation system that computes feasible and plausible explanations in interactive response time.
• We describe the search space of GeCo, consisting of a database D and a constraint language PLAF (Section 3).
• We describe a custom genetic algorithm for exploring the space of candidate counterfactuals (Section 4).
• We describe two optimization techniques: Δ-representation and partial evaluation (Section 5).
• We conduct an extensive experimental evaluation of GeCo, and compare it to MACE, DiCE, Google's What-If Tool, and CERTIFAI (Section 6).
2 COUNTERFACTUAL EXPLANATIONS
We consider m features F_1, ..., F_m, with domains dom(F_1), ..., dom(F_m). We assume to have a black-box model M, i.e., an oracle that, when given a feature vector x = (x_1, ..., x_m), returns a prediction M(x). The prediction is a number between 0 and 1, where 1 is the desired, or good, outcome and 0 is the undesired outcome. For simplicity, we will assume that M(x) > 0.5 is "good", and everything else is "bad". If the classifier is categorical, then we simply replace its outcomes with the values {0, 1}. Given an instance x for which M(x) is "bad", the goal of the counterfactual explanation is to find a counterfactual instance x_cf such that (1) M(x_cf) is "good", (2) x_cf is close to x, and (3) x_cf is both feasible and plausible. Formally, a counterfactual explanation is a solution to the following optimization problem:

    argmin_{x_cf}  dist(x, x_cf)                       (1)
    s.t.  M(x_cf) > 0.5
          x_cf ∈ P        // x_cf is plausible
          x_cf ∈ F(x)     // x_cf is feasible

where dist is a distance function. The counterfactual explanation x_cf ranges over the space dom(F_1) × ... × dom(F_m), subject to the plausibility and feasibility constraints P and F(x), which we discuss in the next section. In practice, as we shall explain, GeCo returns not just one, but the top k best counterfactuals x_cf.
The role of the distance function is to ensure that GeCo finds the nearest counterfactual instance that satisfies the constraints. In particular, we are interested in counterfactuals that change the values of only a few features, which helps define concrete actions that the user can perform to achieve the desired outcome. For that purpose, we use the distance function dist(x, y) introduced by MACE [12], which we briefly review here.
We start by defining domain-specific distance functions δ_1, ..., δ_m for each of the m features, as follows. If dom(F_i) is categorical, then δ_i(x_i, y_i) := 0 if x_i = y_i and δ_i(x_i, y_i) := 1 otherwise. If dom(F_i) is a continuous or an ordinal domain, then δ_i(x_i, y_i) := |x_i − y_i| / w, where w is the range of the domain. We note that, when the range is unbounded, or unknown, alternative normalizations are possible, such as the Median Absolute Deviation (MAD) [30], or the standard deviation [31].
Next, we define the ℓ_p-distance between x, y as:

    dist_p(x, y) := ( Σ_i δ_i^p(x_i, y_i) )^{1/p}

and adopt the usual convention that the ℓ_0-distance is the number of distinct features: dist_0(x, y) := |{ i | δ_i(x_i, y_i) ≠ 0 }|. Finally, we define our distance function as a weighted combination of the ℓ_0, ℓ_1, and ℓ_∞ distances:

    dist(x, y) = α · dist_0(x, y)/m + β · dist_1(x, y)/m + γ · dist_∞(x, y)     (2)

where α, β, γ ≥ 0 are hyperparameters which must satisfy α + β + γ = 1. Notice that 0 ≤ dist(x, y) ≤ 1. The intuition is that the ℓ_0-norm restricts the number of features that are changed, the ℓ_1-norm accounts for the average change of distance between x and y, and the ℓ_∞-norm restricts the maximum change across all features. Although [12] defines the distance function (2), the MACE system hardwires the hyperparameters to α = 0, β = 1, γ = 0; we discuss this, and the setting of different hyperparameters, in Sec. 6.
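Equation (2) can be sketched directly in Python. The code below is illustrative (the hyperparameter values and feature ranges are assumptions for the example, not GeCo's defaults):

```python
def dist(x, y, ranges, alpha=0.25, beta=0.25, gamma=0.5):
    """Sketch of Eq. (2). ranges[i] is None for a categorical feature,
    otherwise the range w of feature i; alpha + beta + gamma = 1."""
    m = len(x)
    deltas = []
    for xi, yi, w in zip(x, y, ranges):
        if w is None:                       # categorical: 0/1 distance
            deltas.append(0.0 if xi == yi else 1.0)
        else:                               # continuous/ordinal: normalized
            deltas.append(abs(xi - yi) / w)
    d0 = sum(1 for d in deltas if d != 0) / m   # l0: number of changes
    d1 = sum(deltas) / m                        # l1: average change
    dinf = max(deltas)                          # l_inf: maximum change
    return alpha * d0 + beta * d1 + gamma * dinf

# Figure 1 instances; the feature ranges are made up for illustration
x = (28, 800, 200, 5)        # age, income, debt, accounts
x_cf = (28, 1000, 200, 3)
ranges = (50, 2000, 1000, 10)
d = dist(x, x_cf, ranges)    # changes income and accounts only
```

Since each δ_i and each of the three norms is normalized to [0, 1] and the weights sum to 1, the result always lies in [0, 1], as the text notes.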
3 THE SEARCH SPACE
The key to computing high quality explanations is to define a complete space of candidates that the system can explore. In GeCo the search space for the counterfactual explanations is defined by two components: a database of entities D = {e_1, e_2, ...} and a plausibility and feasibility constraint language called PLAF.

The database D can be the training set, a test set, or historical data of past customers for which the system has performed predictions. It is used in two ways. First, GeCo computes the domain of each feature as the active domain in D: dom(F_i) := π_{F_i}(D). Second, the data analyst can specify groups of features with the command:

    GROUP F_{i1}, F_{i2}, ...

and in that case the joint domain of these features is restricted to the combinations of values found in D, i.e., π_{F_{i1} F_{i2} ...}(D). Grouping is useful in several contexts. The first is when the attributes are correlated. For example, if we have a functional dependency like zip → city, then the data analyst would group zip, city. For another example of correlation, consider education and income. They are correlated without satisfying a strict functional dependency; by grouping them, the data analyst ensures that GeCo considers only combinations of values found in the dataset D. As a final example, consider attributes that are the result of one-hot encoding: e.g., color may be expanded into color_red, color_green, color_blue. By grouping them together, the analyst restricts GeCo to consider only the values (1, 0, 0), (0, 1, 0), (0, 0, 1) that actually occur in D.
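Both uses of the database amount to relational projection. Below is a minimal sketch (the tiny dataset D is made up for illustration; GeCo operates over its own relational representation):

```python
# Sketch of computing per-feature and per-group domains as projections over D.
D = [
    {"zip": "98105", "city": "Seattle",  "education": 3, "income": 80},
    {"zip": "98105", "city": "Seattle",  "education": 4, "income": 120},
    {"zip": "10001", "city": "New York", "education": 3, "income": 95},
]

def project(D, attrs):
    # distinct combinations of `attrs` found in D, i.e. pi_attrs(D)
    return {tuple(row[a] for a in attrs) for row in D}

dom_education = project(D, ["education"])       # active domain of one feature
dom_zip_city = project(D, ["zip", "city"])      # joint domain of a GROUP
```

Grouping zip with city means GeCo never proposes a (zip, city) pair that does not occur in D, which is exactly the plausibility guarantee described above.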
The constraint language PLAF allows the data analyst to specify which combinations of features of x_cf are plausible, and which can be feasibly reached from x. PLAF consists of statements of the form:

    PLAF IF Φ_1 AND Φ_2 AND ... THEN Φ_0          (3)

where each Φ_i is an atomic predicate of the form e_1 op e_2 for op ∈ {=, ≠, ≤, <, ≥, >}, and each expression e_1, e_2 is over the current features, denoted x.F_i, and/or counterfactual features, denoted x_cf.F_i. The IF part may be missing.
explain(instance x, classifier M, dataset D, PLAF (Γ, C))
  C_x = ground(x, C); DG = feasibleSpace(D, Γ, C_x)
  POP = [ (x, ∅) ]
  POP = mutate(POP, Γ, DG, C_x, m_init)   // initial population
  POP = selectFittest(x, POP, q, k)
  do {
    CAND = crossover(POP, C_x) ∪ mutate(POP, Γ, DG, C_x, m_mut)
    POP = selectFittest(x, POP ∪ CAND, q, k)
    TOPK = POP[1 : k]
  } until ( counterfactuals(TOPK, M) and TOPK ∩ CAND = ∅ )
  return TOPK

Algorithm 1: Pseudo-code of GeCo's custom genetic algorithm to generate counterfactual explanations.
Example 3.1. Consider the following PLAF specification:

    GROUP education, income                       (4)
    PLAF x_cf.gender = x.gender                   (5)
    PLAF x_cf.age >= x.age                        (6)
    PLAF IF x_cf.education > x.education
         THEN x_cf.age > x.age + 4                (7)

The first statement says that education and income are correlated: GeCo will consider only counterfactual values that occur together in the data. Rule (5) says that gender cannot change, rule (6) says that age can only increase, while rule (7) says that, if we ask the customer to get a higher education degree, then we should also increase the age by 4. The last rule (7) is adapted from [17], who have argued for the need to restrict counterfactuals to those that satisfy causal constraints.
PLAF has the following restrictions. (1) The groups have to be disjoint: if a feature F_i needs to be part of two groups, then they need to be unioned into a larger group. We denote by G(F_i) the unique group containing F_i (or G(F_i) = {F_i} if F_i is not explicitly included in any group). (2) The rules must be acyclic, in the following sense. Every consequent Φ_0 in (3) must be of the form x_cf.F_i op e, i.e., it must "define" a counterfactual feature F_i, and the following graph must be acyclic: the nodes are the groups G(F_1), G(F_2), ..., and the edges are (G(F_j), G(F_i)) whenever there is a rule (3) that defines x_cf.F_i and also contains x_cf.F_j. We briefly illustrate with Example 3.1. The three rules (5)-(7) "define" the features gender, age, and age respectively, and there is a single edge {education, income} → {age}, resulting from Rule (7). Therefore, the PLAF program is acyclic.
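The acyclicity condition is standard cycle detection over the group graph. Here is a minimal sketch (not GeCo's code); we assume each rule is given as a pair (defined group, groups mentioned in its atoms), with groups named by plain strings:

```python
def is_acyclic(rules):
    # edges G_j -> G_i whenever a rule defining a feature in G_i mentions G_j
    edges = {}
    for head, body in rules:
        for g in body:
            if g != head:
                edges.setdefault(g, set()).add(head)
    # DFS-based cycle detection (gray = on the current path)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def visit(u):
        color[u] = GRAY
        for v in edges.get(u, ()):
            c = color.get(v, WHITE)
            if c == GRAY or (c == WHITE and not visit(v)):
                return False
        color[u] = BLACK
        return True
    nodes = {g for h, b in rules for g in {h} | set(b)}
    return all(visit(u) for u in nodes if color.get(u, WHITE) == WHITE)

# Example 3.1: rule (7) defines `age` and mentions the {education, income} group
rules_ok = [("age", {"education_income"})]
rules_cyclic = [("A", {"B"}), ("B", {"A"})]
```

Acyclicity is what later allows actionCascade (Sec. 4.2) to enforce the rules in a single topological pass.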
4 GECO'S CUSTOM GENETIC ALGORITHM
In this section, we introduce the custom genetic algorithm that GeCo uses to efficiently explore the space of counterfactual explanations. A genetic algorithm is a meta-heuristic for constraint optimization that is based on the process of natural selection. There are four core operations. First, it defines an initial population of candidates. Then, it iteratively selects the fittest candidates in the population, and generates new candidates via mutate and crossover on the selected candidates, until convergence.
While there are many optimization algorithms that could be used to solve our optimization problem (defined in Eq. (1)), we chose a genetic algorithm for the following reasons: (1) the genetic algorithm is easily customizable to the problem of finding counterfactual explanations; (2) it seamlessly supports the rich semantics of PLAF constraints, which are necessary to ensure that the explanations are feasible and plausible; (3) it does not require any restrictions on the underlying classifier and data, and thus is able to provide black-box explanations; and (4) it returns a diverse set of explanations, which may provide different actions that can lead to the desired outcome.
In GeCo, we customized the core operations of the genetic algorithm based on the following key observation: a good explanation x_cf should differ from x by only a few features; counterfactual examples x_cf that require the customer to change too many features are of little interest. For this reason, GeCo first explores counterfactuals that change only a single feature group, before exploring increasingly more complex action spaces in subsequent generations.
In the following, we first give an overview of the genetic algorithm used by GeCo and then provide additional details for the core operations.
4.1 Overview
GeCo's pseudocode is shown in Algorithm 1. The inputs are: an instance x, the black-box classifier M, a dataset D, and a PLAF program (Γ, C). Here Γ = {G_1, G_2, ...} are the groups of features, and C = {C_1, C_2, ...} are the PLAF constraints, see Sec. 3. The algorithm has four integer hyperparameters q, m_init, m_mut, k > 0, with the following meaning: k represents the number of counterfactuals that the algorithm returns to the user; q is the size of the population that is retained from one generation to the next; and m_init, m_mut control the number of candidates that are generated for the initial population, and during mutation, respectively. We always set k < q.
As explained in Sec. 3, the active domain of each attribute consists of the values found in the database D. More generally, for each group G_j ∈ Γ, its values must be sampled together from those in the database D. The GeCo algorithm starts by grounding (specializing) the PLAF program C to the entity x (the ground function), then calls the feasibleSpace operator, which computes for each group G_j a relation D_{G_j} with attributes G_j representing the sample space for the group G_j; we give details in Sec. 4.2.
Next, GeCo computes the initial population. In our customized algorithm, the initial population is obtained simply by applying the mutate operator to the given entity x. Throughout the execution of the algorithm, the population is a set of pairs (x', Δ'), where x' is an entity and Δ' is the set of features that were changed from x. Throughout this section we call the entities x' in the population candidates, or examples, but we don't call them counterfactuals, unless M(x') is "good", i.e., M(x') > 0.5 (see Sec. 2). In fact, it is possible that none of the candidates in the initial population are classified as good. The goal of GeCo is to find at least k counterfactuals through a sequence of mutation and crossover operations.
The main loop of GeCo's genetic algorithm consists of extending the population with new candidates obtained by mutation and crossover, then keeping only the q fittest for the next generation. The operators selectFittest, mutate, and crossover are described in Sec. 4.3, 4.4, and 4.5. The algorithm stops when the top k (of the selected q fittest) candidates are all counterfactuals and are stable
from one generation to the next; the function counterfactuals simply tests that all candidates are counterfactuals, by checking¹ that M(x') is "good". We describe now the details of the algorithm.
4.2 Feasible Space Operators
Throughout the execution of the genetic algorithm, GeCo ensures that all candidates x' satisfy all PLAF constraints C. It achieves this efficiently through three functions: ground, feasibleSpace, and actionCascade. We describe these functions here, and omit their pseudocode (which is straightforward).
The function ground(x, C) simply instantiates all features of x with constants, and returns "grounded" constraints C_x. All candidates x' will need to satisfy these grounded constraints.
Example 4.1. Let C be the three constraints of the PLAF program in Example 3.1. Assume that the instance is:

    x = (gender = female, age = 22, education = 3, income = 80k)

Then the grounded set of rules C_x is obtained from the rules (5)-(7). For example, x_cf.age >= x.age becomes age ≥ 22. The three grounded rules are:

    gender = female                  (8)
    age ≥ 22                         (9)
    education > 3 ⇒ age > 26         (10)

Every candidate x' must satisfy all three rules.
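Grounding can be sketched as follows (an illustrative representation, not GeCo's: each grounded rule becomes a predicate over the counterfactual, with the x.* side already replaced by the instance's constants):

```python
# Instance from Example 4.1
x = {"gender": "female", "age": 22, "education": 3, "income": 80}

# Grounded rules (8)-(10); an implication A => B is encoded as (not A) or B
grounded = [
    lambda cf: cf["gender"] == "female",                 # rule (8)
    lambda cf: cf["age"] >= 22,                          # rule (9)
    lambda cf: cf["education"] <= 3 or cf["age"] > 26,   # rule (10)
]

def satisfies_all(cf):
    return all(rule(cf) for rule in grounded)
```

For instance, a candidate that raises education to 4 must also have age > 26; otherwise rule (10) is violated.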
A naive strategy to generate candidates x' that satisfy C_x is through rejection sampling: after each mutation and/or crossover, we verify C_x, and reject x' if it fails some constraint in C_x. GeCo improves over this naive approach in two ways. First, it computes for each group G_j ∈ Γ a set of values D_{G_j} that, in isolation, satisfy C_x; this is precomputed at the beginning of the algorithm by feasibleSpace. Second, once it generates candidates x' that differ in multiple feature groups G_{j1}, G_{j2}, ... from x, it enforces the constraints by possibly applying additional mutations, until all constraints hold: this is done by the function actionCascade.
The function feasibleSpace(D, Γ, C_x) computes, for each group G_j ∈ Γ, the relation:

    D_{G_j} := σ_{C_x^{(j)}}( π_{G_j}(D) )

where C_x^{(j)} consists of the conjunction of all rules in C_x that refer only to features in the group G_j. We call the relation D_{G_j} the sample space for the group G_j. Thus, the selection operator σ_{C_x^{(j)}} rules out values in the sample space that violate some PLAF rule.
Example 4.2. Continuing Example 4.1, there are three groups, G_1 = {gender}, G_2 = {education, income}, G_3 = {age}, and their sample spaces are computed as follows:

    D_{G_1} := σ_{gender=female}( π_{gender}(D) )
    D_{G_2} := π_{education,income}(D)
    D_{G_3} := σ_{age≥22}( π_{age}(D) )
¹In our implementation we store the value M(x') together with x', so we don't have to compute it repeatedly. We omit some details for simplicity of the presentation.
Notice that here we could only check the grounded rules (8) and (9). The rule (10) refers to two different groups, and can only be checked by examining features from both groups; this is done by the function actionCascade.
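The select-then-project computation of Example 4.2 can be sketched as follows (the dataset D and predicates are illustrative):

```python
# Sketch of feasibleSpace: D_Gj = select(C_x^(j), project(Gj, D)) per group.
D = [
    {"gender": "female", "age": 25, "education": 3, "income": 80},
    {"gender": "male",   "age": 40, "education": 4, "income": 120},
    {"gender": "female", "age": 19, "education": 2, "income": 30},
]

def feasible_space(D, group, pred):
    # project D onto the group's attributes, keeping only tuples that
    # satisfy the grounded single-group constraints
    return {tuple(r[a] for a in group) for r in D if pred(r)}

D_G1 = feasible_space(D, ["gender"], lambda r: r["gender"] == "female")
D_G2 = feasible_space(D, ["education", "income"], lambda r: True)
D_G3 = feasible_space(D, ["age"], lambda r: r["age"] >= 22)
```

Note that the cross-group rule (10) is not applied here; it is enforced later by actionCascade, exactly as the text describes.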
The function actionCascade(x', Δ', C_x) is called during mutation and crossover, and its role is to enforce the grounded constraints C_x on a candidate x' before it is added to the population. Recall from Sec. 3 that the PLAF rules are acyclic. actionCascade checks each rule, in topological order of the acyclic graph: if the rule is violated, then it changes the feature defined by that rule to a value that satisfies the condition, and it adds the updated feature (or group) to Δ'. The function returns the updated candidate x', which now satisfies all rules C_x, as well as its set of changed features Δ'. For a simple illustration, referring to Example 4.2, when GeCo considers a new candidate x', it checks the rule (10): if the rule is violated, then it replaces age with a new value from D_{G_3}, subject to the additional condition age > 26, and adds age to Δ'.
4.3 Selecting Fittest Candidates
At each iteration, GeCo's genetic algorithm extends the current population (through mutation and crossover), then retains only the "fittest" q candidates x' for the next generation using the selectFittest(x, POP, q, k) function. The function first evaluates the fitness of each candidate x' in POP; then sorts the examples x' by their fitness score; and returns the top q candidates.
We describe here how GeCo computes the fitness of each candidate x'. For that, GeCo takes into account two pieces of information: whether x' is a counterfactual, i.e., M(x') > 0.5, and the distance from x to x', which is given by the function dist(x, x') defined in Sec. 2 (see Eq. (2)). It combines these two pieces of information into a single numerical value, score(x'), defined as:
    score(x') = { dist(x, x')                      if M(x') > 0.5
                { dist(x, x') + 1 + (1 − M(x'))    otherwise          (11)
The rationale for this score is that we want every counterfactual candidate to be better than every non-counterfactual candidate. If M(x') > 0.5, then x' is a counterfactual, and in that case it remains to minimize dist(x, x'). But if M(x') ≤ 0.5, then x' is not a counterfactual, and we add 1 to dist(x, x'), ensuring that this term is larger than any distance dist(x, x'') for any counterfactual x'' (because dist ≤ 1, see Sec. 2); we also add a penalty 1 − M(x'), to favor candidates x' that are closer to becoming counterfactuals.
In summary, the function selectFittest computes the fitness score (Eq. (11)) for each candidate in the population, then sorts the population in increasing order of this score (lower score means fitter), and returns the q fittest candidates. Notice that all counterfactual examples x' will precede all non-counterfactuals in the sorted order.
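The scoring and selection step can be sketched as follows (an illustrative sketch, not GeCo's code; the candidates and their distances are made up):

```python
def score(dist_to_x, m_out):
    # Eq. (11): counterfactuals (m_out > 0.5) always score below (i.e.,
    # better than) non-counterfactuals, because dist <= 1
    if m_out > 0.5:
        return dist_to_x
    return dist_to_x + 1 + (1 - m_out)

def select_fittest(pop, q):
    # pop: list of (candidate, dist_to_x, m_out); lower score = fitter
    return sorted(pop, key=lambda t: score(t[1], t[2]))[:q]

pop = [
    ("far counterfactual",   0.9, 0.8),   # counterfactual, large distance
    ("close non-cf",         0.1, 0.4),   # not a counterfactual
    ("close counterfactual", 0.2, 0.7),   # counterfactual, small distance
]
fittest = select_fittest(pop, 2)
```

Even the distant counterfactual outranks the nearby non-counterfactual, which is exactly the ordering property Eq. (11) is designed to guarantee.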
4.4 Mutation Operator
Algorithm 2 provides the pseudo-code of the mutation operator. The operator mutate(POP, Γ, DG, C_x, m) takes as input the current population POP, the list of feature groups Γ = {G_1, G_2, ...}, their associated sample spaces DG = {D_{G_1}, D_{G_2}, ...}, the grounded constraints C_x, and an integer m > 0. The function generates, for each candidate x' in the population, m new mutations for each feature group G_j. We briefly explain the pseudo-code. For each candidate
mutate(POP, Γ, DG, C_x, m)
  CAND = ∅
  foreach (x*, Δ*) ∈ POP do {
    foreach G ∈ Γ s.t. G ∉ Δ* do {
      x' = x*; S = sample(DG[G], m)   // without replacement
      foreach v ∈ S do {
        x'.G = v; Δ' = Δ* ∪ {G}
        (x', Δ') = actionCascade(x', Δ', C_x)
        CAND ∪= {(x', Δ')}
      }
    }
  }
  return CAND

Algorithm 2: Pseudo-code for GeCo's mutate operator.
crossover(POP, C_x)
  CAND = ∅; {Δ_1, ..., Δ_n} = {Δ | (x, Δ) ∈ POP}
  foreach (Δ_i, Δ_j) ∈ {Δ_1, ..., Δ_n} s.t. i < j do {
    let (x_i, Δ_i) = best instance in POP with Δ = Δ_i
    let (x_j, Δ_j) = best instance in POP with Δ = Δ_j
    // combine actions from x_i and x_j
    x' = x_i; Δ' = Δ_i ∪ Δ_j
    foreach G ∈ Δ' do {
      if G ∈ Δ_i \ Δ_j then x'.G = x_i.G
      elseif G ∈ Δ_j \ Δ_i then x'.G = x_j.G
      else x'.G = rand(Bool) ? x_i.G : x_j.G
    }
    (x', Δ') = actionCascade(x', Δ', C_x)
    CAND ∪= {(x', Δ')}
  }
  return CAND

Algorithm 3: Pseudo-code for GeCo's crossover operator.
(x*, Δ*) ∈ POP, and for each feature group G ∈ Γ that has not been mutated yet (G ∉ Δ*), we sample m values without replacement from the sample space associated with G, and construct a new candidate x' obtained by changing G to v, for each value v in the sample. As explained earlier (Sec. 4.2), before we insert x' into the population, we must enforce all constraints in C_x, and this is done by the function actionCascade.

In GeCo, the mutation operator also generates the initial population. In this case, the current population POP contains only the original instance x with Δ = ∅. Thus, GeCo first explores candidates in the initial population that differ from the original instance x only in a change for one group G ∈ Γ. This ensures that we prioritize candidates that have few changes. Subsequent calls to the mutation operator then change one additional feature group for each candidate in the current population.
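Algorithm 2 can be sketched in Python as follows. This is a simplified illustration: the actionCascade step is omitted for brevity, and groups are modeled as tuples of feature names:

```python
import random

def mutate(pop, groups, sample_spaces, m):
    # Sketch of Algorithm 2 (actionCascade omitted).
    cand = []
    for entity, changed in pop:
        for gi, group in enumerate(groups):
            if gi in changed:        # mutate only not-yet-changed groups
                continue
            space = sample_spaces[gi]
            # sample m group values without replacement
            for values in random.sample(space, min(m, len(space))):
                e = dict(entity)
                e.update(zip(group, values))   # apply the group's new values
                cand.append((e, changed | {gi}))
    return cand

random.seed(0)
pop = [({"age": 22, "income": 80}, frozenset())]   # initial call: POP = [(x, {})]
groups = [("age",), ("income",)]
spaces = [[(25,), (30,)], [(100,), (120,)]]
cands = mutate(pop, groups, spaces, m=2)
```

Called on the singleton population [(x, ∅)], each generated candidate differs from x in exactly one group, matching the prioritization of few changes described above.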
4.5 Crossover Operator

The crossover operator generates new candidates by combining the actions of two instances x_i, x_j in POP. For example, consider candidates x_i and x_j that differ from x in the feature groups {age} and {education, income}, respectively. Then, the crossover operator generates a new candidate x' that changes all three features age, education, and income, such that x'.age = x_i.age and x'.{education, income} = x_j.{education, income}. The pseudo-code of the operator is given by Algorithm 3.
In a conventional genetic algorithm, the candidates for crossover are selected at random. In GeCo, however, we want to combine the candidates that (1) change a distinct set of features, and (2) are the best candidates amongst all candidates in POP that change the same set of features. Recall that, for each candidate x' in the population, we also store the set Δ' of feature groups where x' differs from x. To achieve our customized crossover, we first collect all distinct sets Δ in POP. Then, we consider any combination of two distinct sets (Δi, Δj), and find the candidates x_i and x_j which have the best fitness score in the subset of POP for which Δ = Δi and respectively Δ = Δj. The fitness score is given in Eq. (11) and, since POP is already sorted by the fitness score, the best candidate with Δ = Δi is also the first candidate in POP with Δ = Δi. The operator then combines the candidates x_i and x_j into a new candidate x', which inherits the mutations from both x_i and x_j; in case a feature group G was mutated in both x_i and x_j, we choose at random to assign to x'.G either x_i.G or x_j.G. Finally, we check if x' requires any cascading actions, and apply them in order to satisfy the constraints C_P. We add x' and the corresponding feature set Δ' to the collection of new candidates CAND.
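A minimal Python sketch of this customized crossover (illustrative; the actual implementation is in Julia, and constraint cascading is omitted here):

```python
import itertools, random

def crossover(pop):
    """Combine the fittest candidate of every pair of distinct change sets.

    pop: list of (candidate dict, changed frozenset), sorted by fitness
         descending, so the first entry per change set is its best one.
    """
    best = {}
    for x, delta in pop:
        best.setdefault(delta, x)          # keep the first (= fittest)
    cand = []
    for di, dj in itertools.combinations(best, 2):
        xi, xj = best[di], best[dj]
        x_new = dict(xi)                   # inherit unchanged features
        for g in di | dj:
            if g in di and g not in dj:
                x_new[g] = xi[g]           # mutated only in xi
            elif g in dj and g not in di:
                x_new[g] = xj[g]           # mutated only in xj
            else:
                x_new[g] = random.choice([xi[g], xj[g]])  # mutated in both
        cand.append((x_new, di | dj))
    return cand

pop = [  # already sorted by fitness, best first
    ({"age": 30, "income": 10_000}, frozenset({"age"})),
    ({"age": 28, "income": 10_000}, frozenset({"age"})),
    ({"age": 22, "income": 20_000}, frozenset({"income"})),
]
children = crossover(pop)
```

Because the population is kept sorted by fitness, picking the first candidate per change set (via `setdefault`) is exactly the "best instance with Δ = Δi" lookup in Algorithm 3.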
4.6 Discussion

In this section, we discuss implementation details and potential tradeoffs that we face for the performance optimization of GeCo.
Hyperparameters. We use the following defaults for the hyperparameters: q = 100, k = 5, m_init = 20, m_mut = 5. The first parameter means that each generation retains only the fittest q = 100 candidates. The value k = 5 means that the algorithm runs until the top 5 candidates of the population are all counterfactuals (i.e., M(x) > 0.5) and they remain in the top 5 during two consecutive generations. The initial population consists of 20 candidates x' per feature group, and, during mutation, we create 5 new candidates per feature group. These defaults ensure that the selected candidates of the initial population will change a minimum of five distinct feature groups. In subsequent generations, we then sample multiple actions for each feature group, in order to mitigate the potential issue that the initial samples did not capture the best action for each feature group.
Representation of Δ. For each candidate (x', Δ') ∈ POP, we represent the feature set Δ' as a compact bitset, which allows for efficient bitwise operations.
Sampling Operator. In the mutation operator, we use weighted sampling to sample actions from the sample space D_Gj associated to the group G_j. The weight is given by the frequency of each value v ∈ D_Gj in the input dataset D. As a result, GeCo is more likely to sample those actions that frequently occur in the dataset D, which helps to ensure that the generated counterfactual is plausible.
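Frequency-weighted sampling without replacement can be sketched as follows (a hypothetical helper, not GeCo's actual Julia code; it draws one value at a time with `random.choices` and then removes it):

```python
import collections, random

def build_sample_space(column):
    """Values of a feature group weighted by their frequency in dataset D."""
    counts = collections.Counter(column)
    values = list(counts)
    weights = [counts[v] for v in values]
    return values, weights

def weighted_sample(values, weights, m):
    """Sample up to m distinct values, biased toward frequent ones."""
    values, weights = list(values), list(weights)
    picked = []
    while values and len(picked) < m:
        i = random.choices(range(len(values)), weights=weights, k=1)[0]
        picked.append(values.pop(i))   # remove so it cannot be drawn again
        weights.pop(i)
    return picked

income_col = [10_000] * 7 + [15_000] * 2 + [20_000]
vals, w = build_sample_space(income_col)
sample = weighted_sample(vals, w, m=2)   # two distinct, frequency-biased values
```

Here 10,000 occurs in 70% of the rows, so it is far more likely to appear in the sample, which mirrors how GeCo favors plausible (frequent) actions.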
Number of Mutated Candidates. By default, the mutation operator mutates every candidate in the current population, which ensures that (1) we explore a large set of candidates, and (2) we sample many values from the feasible space for each group. The downside is that this operation can take a long time, in particular if we retain many selected candidates (q is large), and the mutation operator can become a performance bottleneck. In this case, it is possible to mutate only a selected number of candidates. We propose to group the candidates by the set Δ, and then to mutate only the top candidates in each group (just like we only do crossover on the top candidate in each group). This approach ensures that we still explore candidates with the same Δ sets as the default, but limits the total number of candidates that are generated.
Large Sample Spaces. If there is a sample space D_Gj that is very large, then it is possible that GeCo does not sample the best action from this space. For this case, we designed a variant of the mutate operator, which we call selectiveMutate. Whereas the normal mutation operator mutates candidates by changing feature groups that have not been changed, selective mutation mutates the feature groups that were previously changed, by sampling actions that decrease the distance between the candidate and the original instance. This additional operation ensures that GeCo is more likely to identify the best action in large sample spaces.
4.7 Comparison with CERTIFAI

CERTIFAI [25] also computes counterfactual explanations using a genetic algorithm. The main difference that distinguishes CERTIFAI from GeCo is that CERTIFAI assumes that the initial population is a random sample of only “good” counterfactuals. Once this initial population is computed, the goal of its genetic algorithm is to find better counterfactuals, whose distance to the original instance is smaller. However, it is unclear how to compute the initial population of counterfactuals, and the quality of the final answer of CERTIFAI depends heavily on the choice of this initial population. CERTIFAI also does not emphasize the exploration of counterfactuals that change only few features. In contrast, GeCo starts from only the original instance x as seed, whose outcome is “bad”, and relies on the hypothesis that some “good” counterfactual should be nearby, i.e., with few changed features.
Since CERTIFAI is not publicly available, we implemented our own variant based on the description in the paper. Since it is not clear how the initial population is computed, we consider all instances in the database D that are classified with the good outcome and satisfy all feasibility and plausibility constraints. This is the most efficient way to generate a sample of good counterfactuals, but it has the downside that the quality of the initial population, and, hence, of the final explanation, varies widely with the instance x. In fact, there are cases where D contains no counterfactual that satisfies the feasibility constraints w.r.t. x. In Section 6, we show that GeCo is able to compute explanations that are closer to the original instance significantly faster than CERTIFAI.
5 OPTIMIZATIONS

The main performance limitation in GeCo is the repeated calls to selectFittest and mutate. Between the two operations, GeCo repeatedly adds tens of thousands of candidates to the population, applies the classifier M to each of them, and then removes those that have a low fitness score. In Section 6, we show that selection and mutation account for over 95% of the overall runtime. In order to apply GeCo in interactive settings, it is thus important to optimize
Figure 3: Example candidate population for instance x = (22, HS, Service, 10000) using the naive listing representation (left) and the equivalent Δ-representation (right).

Naive listing representation:

  Age  Edu.  Occup.   Income
  28   HS    Service  10,000
  30   HS    Service  10,000
  32   HS    Service  10,000
  22   PhD   Service  10,000
  22   BSc   Service  10,000
  25   HS    Service  15,000
  23   HS    Service  20,000
  22   HS    Sales    10,000
  22   HS    Student  10,000

Δ-representation (one relation per Δ set):

  {Age}:          28; 30; 32
  {Edu.}:         PhD; BSc
  {Age, Income}:  (25, 15,000); (23, 20,000)
  {Occup.}:       Sales; Student
the performance of these two operators. In this section, we presenttwo optimizations which significantly improve their performance.
To optimize mutate, we present a lossless, compressed data representation, called the Δ-representation, for the candidate population that is generated during the genetic algorithm. In our experiments, the Δ-representation is up to 25× more compact than an equivalent naive listing representation, which translates to a performance improvement of up to 3.9× for mutate.
To optimize selectFittest, we draw on techniques from the PL community, in particular partial evaluation, to optimize the evaluation of the classifier. The optimizations exploit the fact that we know the values for subsets of the input features before the evaluation, which allows us to translate the model into a specialized equivalent model that pre-evaluates the static components. These optimizations improve the runtime for selectFittest by up to 3.2×.
The Δ-representation and partial evaluation complement each other, and together decrease the end-to-end runtime of GeCo by a factor of up to 5.2×. Next, we provide more details for our optimizations.
5.1 Δ-Representation

The naive representation for the candidate population of the genetic algorithm is a listing of the full feature vectors x' for all candidates (x', Δ'). This representation is highly redundant, because most values in x' are equal to the original instance x; only the features in Δ' are different. The Δ-representation represents each candidate (x', Δ') compactly by storing only the features in Δ'. This is achieved by grouping the candidate population by the set of features Δ', and then representing the entire subpopulation in a single relation R_Δ' whose attributes are only Δ'.
Example 5.1. Figure 3 presents a candidate population for the instance x = (Age=22, Edu=HS, Occup=Service, Income=10000) using (left) the naive listing representation and (right) the equivalent Δ-representation. For simplicity, we highlight the changed values in the listing representation, instead of enumerating all Δ sets.
Most values stored in the listing representation are values from x. In contrast, the Δ-representation only represents the values that are different from x. For instance, the first three candidates, which change only Age, are represented in a relation with the single attribute Age, and without repeating the values for Edu, Occup, and Income.
In our implementation, we represent the Δ sets as bitsets, and the Δ-representation is a hashmap that maps the distinct Δ sets to the corresponding relation R_Δ, which is represented as a DataFrame. We provide wrapper functions so that we can apply standard DataFrame operations directly on the Δ-representation.
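The structure can be sketched in Python with plain dicts and lists standing in for the Julia DataFrames (the helper names and bit layout are illustrative, not GeCo's actual code):

```python
FEATURES = ["age", "edu", "occup", "income"]

def delta_bits(changed):
    """Encode a set of changed features as an integer bitset."""
    bits = 0
    for f in changed:
        bits |= 1 << FEATURES.index(f)
    return bits

def add_candidate(store, x_orig, x_new):
    """Store only the values of x_new that differ from the original x."""
    changed = [f for f in FEATURES if x_new[f] != x_orig[f]]
    key = delta_bits(changed)
    rel = store.setdefault(key, {f: [] for f in changed})  # relation R_Δ
    for f in changed:
        rel[f].append(x_new[f])

def reconstruct(store, x_orig, key, row):
    """Materialize a full feature vector from the compact representation."""
    x = dict(x_orig)
    for f, col in store[key].items():
        x[f] = col[row]
    return x

store = {}
x = {"age": 22, "edu": "HS", "occup": "Service", "income": 10_000}
add_candidate(store, x, {**x, "age": 28})
add_candidate(store, x, {**x, "age": 30})
add_candidate(store, x, {**x, "age": 25, "income": 15_000})
# Candidates changing only Age share one single-column relation.
```

`reconstruct` corresponds to the materialization step discussed below, which is needed whenever the classifier expects a full feature vector.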
The Δ-representation has a significantly lower memory footprint, which can lead to significant performance improvements over the naive representation, because it is more efficient to add candidates to the smaller relations, and it simplifies garbage collection.
There is, however, a potential performance tradeoff for the selection operator, because the classifier M typically assumes as input the full feature vector x'. In this case, we copy the values in the Δ-representation to a full instantiation of x'. This transformation can be expensive, but, in our experiments, it does not outweigh the speedup for mutation. In the next section, we show how we can use partial evaluation to avoid the construction of the full x'.
5.2 Partial Evaluation for Classifiers

We show how to adapt PL techniques, in particular code specialization via partial evaluation, to optimize the evaluation of a given classifier M, and thus speed up the performance of selectFittest.
Consider a program p : X × Y → O which maps two inputs (x, y) to an output o. Assume that we know the input y at compile time. Partial evaluation takes the program p and the input y and generates a more efficient program p'⟨y⟩ : X → O, which precomputes the static components. Partial evaluation guarantees that p'⟨y⟩(x) = p(x, y) for all x ∈ X. See [10] for more details on partial evaluation.
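In its simplest form, this specialization can be mimicked with a closure that folds the static input into a precomputed constant (a toy example with hypothetical names, not a real partial evaluator):

```python
def price(quantity, unit_price, tax_rate):
    # p(x, y): x = quantity (dynamic), y = (unit_price, tax_rate) (static)
    return quantity * unit_price * (1.0 + tax_rate)

def specialize_price(unit_price, tax_rate):
    """Partial evaluation: fold the static inputs into one constant."""
    factor = unit_price * (1.0 + tax_rate)  # precomputed at "compile time"
    def price_specialized(quantity):
        return quantity * factor            # residual program p'<y>(x)
    return price_specialized

p_spec = specialize_price(unit_price=10.0, tax_rate=0.2)
assert abs(p_spec(3) - price(3, 10.0, 0.2)) < 1e-9   # p'<y>(x) == p(x, y)
```

A true partial evaluator performs this folding on the program text rather than at runtime, but the correctness guarantee is the same.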
We next overview how we use partial evaluation in GeCo. Consider a classifier M with features F. During the evaluation of a candidate (x', Δ'), we know that the values for all features F \ Δ' are constants taken from the original instance x. Thus, we can partially evaluate the classifier M to a simplified classifier M_Δ' that precomputes the static components related to the features F \ Δ'. Once M_Δ' is generated, we cache the model so that we can apply it to all candidates in the population that change the same feature set Δ'. Note that by using partial evaluation, GeCo no longer explains a black box, since it requires access to the code of the classifier.
Example 5.2. Consider a decision tree classifier M. For an instance (x', Δ'), M(x') typically evaluates the decisions for all features along a root-to-leaf path. Since we know the values for the features F \ Δ', we can precompute and fold all nodes in the tree that involve the features F \ Δ'. If Δ' is small, then partial evaluation can generate a very simple tree. For instance, if Δ' = {Age}, then M_Δ' only needs to evaluate decisions of the form age > 30 or 30 < age < 60 to classify the input.
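This tree folding can be sketched as follows (a minimal recursive sketch with an assumed tuple encoding of trees; GeCo's actual implementation works on the QuickScorer representation described below):

```python
def fold_tree(node, fixed):
    """Partially evaluate a decision tree: resolve nodes whose feature is
    fixed, keeping only decisions over the changed features.

    node:  leaf value, or (feature, threshold, left, right) with the
           convention "go left if value <= threshold".
    fixed: dict of feature -> known constant value
    """
    if not isinstance(node, tuple):
        return node                       # leaf: nothing to fold
    feat, thr, left, right = node
    if feat in fixed:                     # static decision: take one branch
        branch = left if fixed[feat] <= thr else right
        return fold_tree(branch, fixed)
    return (feat, thr, fold_tree(left, fixed), fold_tree(right, fixed))

def evaluate(node, x):
    if not isinstance(node, tuple):
        return node
    feat, thr, left, right = node
    return evaluate(left if x[feat] <= thr else right, x)

tree = ("age", 30, ("income", 12_000, 0, 1), ("edu_years", 16, 0, 1))
folded = fold_tree(tree, {"income": 10_000, "edu_years": 18})
# Only the decision on "age" remains in the residual tree.
```

For Δ' = {Age}, the residual tree depends on age alone, so classifying a whole batch of age mutations requires only one comparison per candidate.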
In addition to optimizing the evaluation of M, a partially evaluated classifier M_Δ can be directly evaluated over the partial relation R_Δ in the Δ-representation, and thus we mitigate the overhead resulting from the need to construct the full entity for the evaluation.
Partial evaluation has been studied and applied in various domains; e.g., in databases, it has been used to optimize query evaluation (see e.g., [24, 28]). We are, however, not aware of a general-purpose partial evaluator that can be applied in GeCo to optimize arbitrary classifiers. Thus, we implemented our own partial evaluator
Table 1: Key characteristics for each considered dataset.
                         Credit    Adult   Allstate    Yelp
Data Points                 30K      45K      13.2M   22.4M
Variables                    14       12         37      34
Features (one-hot enc.)      20       51        864    1064
for two model classes: (1) tree-based models, which includes de-cision trees, random forests, and gradient boosted trees, and (2)neural networks and multi-layered perceptrons.
In the following, we briefly introduce the partial evaluation we use for tree-based models and neural networks.

Tree-based models. We optimize the evaluation of tree-based models in two steps. First, we use existing techniques to turn the model into a more optimized representation for evaluation. Then, we apply partial evaluation on the optimized representation.
Tree-based models face performance bottlenecks during eval-uation because, by nature of their representation, they are proneto cache misses and branch misprediction. For this reason, the MLsystems community has studied how tree-based models can be rep-resented so that they can be evaluated without random lookupsin memory and repeated if-statements (see e.g., [5, 16, 32]). InGeCo, we use the representation proposed by the QuickScoreralgorithm [16], which we briefly overview next.
Instead of evaluating a decision tree t following a root-to-leaf path, QuickScorer evaluates all decision nodes in t and keeps track of which leaves cannot be reached whenever a decision fails. Once all decisions are evaluated, the prediction is guaranteed to be given by the first leaf that can be reached. All operations in QuickScorer use efficient, cache-conscious bitwise operations, and avoid branch mispredictions. This makes QuickScorer very efficient, even if it evaluates many more decisions than the naive evaluation.
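The core bitvector idea can be illustrated for a single tree as follows (a simplified sketch; the real QuickScorer [16] batches nodes by feature and uses carefully laid-out arrays, which we omit):

```python
def quickscorer_eval(nodes, leaf_values, x):
    """Evaluate a tree by visiting every decision node instead of a
    root-to-leaf path. Each node carries a bitmask with 0s for the leaves
    that become unreachable when its test is FALSE; AND-ing the masks of
    all failed nodes leaves the exit leaf as the lowest surviving bit.

    nodes: list of (feature, threshold, false_mask); the test is x[f] <= t
    """
    n_leaves = len(leaf_values)
    v = (1 << n_leaves) - 1                 # initially all leaves reachable
    for feat, thr, false_mask in nodes:
        if not (x[feat] <= thr):            # test fails: drop its subtree's
            v &= false_mask                 # left leaves from the bitvector
    exit_leaf = (v & -v).bit_length() - 1   # index of lowest set bit
    return leaf_values[exit_leaf]

# Depth-2 tree, leaves numbered 0..3 left to right (bit 0 = leaf 0):
# root: age <= 30 (false kills leaves 0,1); left child: income <= 12000
# (false kills leaf 0); right child: edu <= 16 (false kills leaf 2).
nodes = [("age", 30, 0b1100), ("income", 12_000, 0b1110), ("edu", 16, 0b1011)]
leaf_values = [10, 20, 30, 40]
```

Note that every node's test is evaluated regardless of the path, which is exactly the property the partial evaluation below exploits: decisions over fixed features can be resolved once, up front.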
For partial evaluation, we exploit the fact that the first phase in QuickScorer computes the decision nodes for each feature independently of all other features. Thus, given a set of fixed feature values, we can precompute all corresponding decisions, and significantly reduce the number of decisions that are evaluated at runtime.

Neural Networks and MLPs. We consider neural networks for structured, relational data (as opposed to images or text). In this setting, each hidden node h in the first layer of the network is typically a linear model [3, 17]. Given an input vector x, parameter vector w, and bias term b, the node h thus computes y = σ(w⊤x + b), where σ is an activation function. If we know that some values in x are static, then we can apply partial evaluation for h by precomputing the product between x and w for all static components and adding the partial product to the bias term b. During evaluation, we then only need to compute the product between x and w for the non-static components.
Typically, the first layer is fully connected, so we can apply partial evaluation only on the first layer of the network. If the layers are not fully connected, then we can also partially evaluate subsequent layers. Note that this framework for partial evaluation is also applicable to multi-layered perceptrons and linear models.
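Folding the static inputs into the bias can be verified numerically with a small sketch (illustrative helper names; a real implementation would use vectorized linear algebra):

```python
def specialize_layer(W, b, x_static, static_idx):
    """Fold the static input components into the bias: for each hidden
    node i, b'[i] = b[i] + sum_j W[i][j] * x[j] over static features j."""
    b_new = list(b)
    for i, row in enumerate(W):
        b_new[i] += sum(row[j] * x_static[j] for j in static_idx)
    dynamic_idx = [j for j in range(len(W[0])) if j not in static_idx]
    W_new = [[row[j] for j in dynamic_idx] for row in W]
    return W_new, b_new, dynamic_idx

def layer(W, b, x, act=lambda z: max(z, 0.0)):   # ReLU activation
    return [act(sum(w * v for w, v in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

W = [[0.5, -1.0, 2.0], [1.0, 0.0, -0.5]]
b = [0.1, -0.2]
x = [4.0, 3.0, 2.0]
# Features 1 and 2 are static; only feature 0 changes across candidates.
W_s, b_s, dyn = specialize_layer(W, b, x, static_idx={1, 2})
full = layer(W, b, x)
spec = layer(W_s, b_s, [x[j] for j in dyn])
assert all(abs(a - c) < 1e-9 for a, c in zip(full, spec))
```

The specialized layer multiplies over one dynamic feature instead of three, and the precomputed bias is reused for every candidate with the same Δ' set.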
6 EXPERIMENTS

In this section, we present the results for our experimental evaluation of GeCo on four real datasets. We conduct the following experiments:
(1) We investigate whether GeCo is able to compute counterfac-tual explanations for one end-to-end example, and comparethe explanation with four existing systems.
(2) We benchmark all considered systems on 5,000 instances,and investigate the tradeoff between the quality of the ex-planations and the runtime for each system.
(3) We conduct microbenchmarks for GeCo. In particular, webreakdown the runtime into individual components and in-vestigate the impact of the optimizations from Section 5.
The code and scripts to run the experiments are publicly availableon: https://github.com/mjschleich/GeCo.jl.
6.1 Experimental Setup

In this section, we present the considered datasets and systems, as well as the setup used for all our experiments.

Datasets. We consider four real datasets: (1) Credit [33] is used to predict customers' default on credit card payments in Taiwan; (2) Adult [13] is used to predict whether the income of adults exceeds $50K/year using US census data from 1994; (3) Allstate is a Kaggle dataset for the Allstate Claim Prediction Challenge [2], used to predict insurance claims based on the characteristics of the insured's vehicle; (4) Yelp is based on the public Yelp Dataset Challenge [34] and is used to predict review ratings that users give to businesses.
Table 1 presents key statistics for each dataset. Credit and Adultare from the UCI repository [8] and commonly used to evaluateexplanations (e.g., [12, 18, 29]). For all datasets, we one-hot encodethe categorical variables. For the evaluation with existing systems,we further apply the same preprocessing that was proposed by theexisting system, in order to ensure that our evaluation is fair.
For all datasets, we form one feature group for the features derived from one-hot encoding. In addition, we use the following PLAF constraints. For Adult, we enforce that Age and Education can only increase, and that MaritalStatus, Relationship, Gender, and NativeCountry cannot change. For Credit, we enforce that HasHistoryOfOverduePayments can only increase, while isMale and isMarried cannot change. We do not specify any constraints for Allstate and Yelp. We also do not use more complex constraints with implications, because they are not supported by all systems.

Considered Systems. We benchmark GeCo against four existing systems: (1) MACE [12] solves for counterfactuals using multiple runs of an SMT solver; (2) DiCE [17] generates counterfactual explanations with a variational auto-encoder; (3) WIT is our implementation of the counterfactual reasoning approach in Google's What-If Tool [31]. WIT looks up the closest counterfactual that satisfies the PLAF constraints in the database D. We implemented our own version, because the What-If Tool does not support feasibility constraints; (4) CERT is our implementation of the genetic algorithm that is used in CERTIFAI [25] (see Sec. 4.7 for details). We reimplemented the algorithm because CERTIFAI is not public.

Evaluation Metrics. We use the following three metrics to evaluate the quality of the explanation: (1) the consistency of the explanations, i.e., does the classifier return the good outcome for the
Table 2: Examples of counterfactual explanations by GeCo, MACE, and DiCE for one instance in Adult. We show selected features; all others are not changed. MACE and DiCE use different models, so we show GeCo's explanation for each model. The neural network does not use CapitalGain and CapitalLoss. (—) means the value is not changed.

                           Decision Tree          Neural Network
Attribute      Instance    GeCo       MACE        GeCo    DiCE
Age            49          —          —           53      54
Education      School      —          —           —       Doctorate
Occupation     Service     —          —           —       BlueCollar
CapitalGain    0.0         4,687.0    4,826.0
CapitalLoss    0.0         —          19.9
Hours/week     16          —          —           98      40
Race           Other       —          —           —       White
Gender         Female      —          —           —       Male
Income (>50K)  False       True       True        True    True
counterfactual x_cf; (2) the distance between x_cf and the original instance x; (3) the number of features changed in x_cf.
For the comparison with existing systems, we use the ℓ1 norm to aggregate the distances for each feature (i.e., β = 1, α = γ = 0 in Eq. (2)), because MACE and DiCE do not support combining norms. We examine other choices of these hyperparameters in the microbenchmarks. To compare runtimes, we report the average wall-clock time it takes to explain a single instance.
GeCo and CERT return multiple counterfactuals for each instance. We only consider the best counterfactual in this evaluation.

Classifiers. We benchmark the systems on tree-based models and multi-layered perceptrons (MLP).
Comparison with Existing Systems. For the comparison with existing systems, we use the classifiers proposed by MACE and DiCE for all systems to ensure a fair comparison. For tree-based models, we attempted to compute explanations for random forest classifiers, but MACE took on average 30 minutes to compute a single explanation, which made it infeasible to compute explanations for many instances. Thus, we consider a single decision tree for this evaluation (computed in scikit-learn with default parameters). DiCE does not support decision trees, since it requires a differentiable classifier. For the neural net, we use the classifier proposed by DiCE, which is a two-layered neural network with 20 (fully connected) hidden units and ReLU activation.
Microbenchmarks. In the microbenchmarks, we consider a random forest with 500 trees and maximum depth of 10 from the Julia MLJ library [7], and the MLPClassifier from the scikit-learn library [20] with the default settings (one hidden layer with 100 nodes).

Setup. We implemented GeCo, WIT, and CERT in Julia 1.5.2. All experiments are run on an Intel Xeon CPU E7-4890/2.80GHz/64bit with 108GB RAM, Linux 4.16.0, and Ubuntu 16.04.
We use the default hyperparameters for GeCo (cf. Sec. 4.6) and MACE (ε = 10^-3). CERT runs for 300 generations, as in the original CERTIFAI implementation. In GeCo, we precompute the active domain of each feature group, which is invariant for all explanations.
Figure 4: Comparison of the average distance (ℓ1 norm) (left) and average runtime (right) for the explanations by GeCo, WIT, CERT, and MACE for 5,000 instances on the Credit and Adult datasets. Error bars represent one standard deviation.
6.2 End-to-end Example

We consider one specific instance in the Adult dataset that is classified as “bad” (Income < $50K) and illustrate the differences between the explanations for each considered system.
Table 2 presents the instance x and the counterfactuals returned by GeCo, MACE, and DiCE. WIT and CERT fail to return an explanation, because the Adult dataset has no instance with income > $50K that also satisfies all PLAF constraints for x. MACE computes the explanation over a decision tree, and DiCE considers a neural network. We present GeCo's explanation for each model.
For the decision tree, GeCo is able to find a counterfactual that changes only Capital Gains. In contrast, the explanation by MACE requires a change in both Capital Gains and Capital Loss. Remarkably, the change in Capital Gains is larger than the one required by GeCo. Thus, GeCo is able to find better explanations than MACE.
The neural net used for the comparison with DiCE does not usethe features Capital Gains and Capital Loss. For this model, GeCorequires a change in Age and Hours/week to find a counterfac-tual explanation. In contrast, DiCE changes the values for all 8 (!)features, which is implausible.
In addition to the top explanation in Table 2, GeCo also returnsalternative counterfactuals that satisfy all constraints. For instance,for the neural net, GeCo also proposes (1) changing Age to 76 andHours/Week to 80, or (2) increasing the education level from aSchool to a Masters degree. These alternatives provide differentpathways for the user to achieve the good outcome. Only GeCoand CERT return multiple counterfactual explanations by default.
6.3 Quality and Runtime Tradeoff

In this section, we investigate the tradeoff between the quality and the runtime of the explanations for all considered systems on Credit
and Adult. As explained in Sec. 6.1, we evaluate the systems using a single decision tree and a neural network.

Takeaways for Evaluation with Decision Trees. Figures 4 and 5 show the results of our evaluation with decision trees on Credit (top) and Adult (bottom). For each dataset, we explain a sample of 5,000 instances for which the classifier returns the negative outcome. Figure 4 presents the average distance (left) and runtime (right) for each competitor. Figure 5 presents the distribution of the number of features changed by MACE, WIT, and CERT in comparison with the number of features changed by GeCo.
Quality comparison. GeCo and MACE are always able to find a feasible and plausible explanation. WIT and CERT, however, fail to find an explanation in 2.1% of the cases for Adult (represented by the failure bar in Figure 5). This is because the two techniques are restricted by the database D, which may not contain an instance that is classified as good and represents feasible and plausible actions.
GeCo's explanations are on average the closest to the original instance. This can be explained by the fact that GeCo is able to find these explanations by changing significantly fewer features. For the Credit dataset, for instance, GeCo can find explanations with 1.27 changes on average, while MACE (the best competitor) changes on average 3.39 features. WIT and CERT change on average 3.97 and respectively 3.15 features.
Like GeCo, CERT uses a genetic algorithm, but it takes over 3× longer to compute explanations that do not match the quality of GeCo's. Thus, GeCo's custom genetic algorithm, which is designed to explore counterfactuals with few changes, is very effective.
Runtime comparison. GeCo is able to compute each explanation in less than the 500ms required for interactive analysis. On average, GeCo is up to 23× faster than MACE. This performance is only matched by WIT, which performs simple look-ups in the database D and does not return explanations with the same quality as GeCo's.
Figure 5: Distribution of the number of features changed in the counterfactual for GeCo, MACE, WIT and CERT.
Figure 6: Comparison of GeCo and DiCE for the neural network on 5,000 instances from Adult. (left) Average distance of the counterfactual. (right) Distribution of the number of features changed. Error bars represent one standard deviation.
Table 3: Microbenchmark results for tree-based models.
                       Credit     Adult   Allstate     Yelp
Generations              4.22      4.66       5.26     3.26
Explored Candidates     18.1K     15.3K      62.8K    50.8K
Size of Naive Rep.     62.23K   117.41K      8.03M   12.74M
Size of Δ-Rep.         18.64K    21.88K       412K     502K
Compression              3.3×      5.4×      19.5×    25.4×
Takeaways for Evaluation with Neural Net. Figure 6 presentsthe key results for our evaluation on the MLP classifier. We onlyshow the comparison of GeCo and DiCE, because the comparisonwith WIT and CERT is similar to the one for the decision tree.MACE requires an extensive conversion of the classifier into alogical formula, which is not supported for the considered model.
Since DiCE computes the counterfactual explanation in one pass over a variational auto-encoder, it is able to compute the explanations very efficiently. In our experiments, DiCE was able to find an explanation in 5 milliseconds, whereas GeCo required 250 milliseconds on average. The counterfactuals that DiCE generates, however, have poor quality. Whereas GeCo is again able to find explanations with only few required changes, DiCE changes on average 5.3 features. As a result, GeCo's explanation is on average 4.4× closer to the original instance. Therefore, we consider GeCo's explanations much more suitable for real-world applications.
6.4 Microbenchmarks

In this section, we present the results for the microbenchmarks.

Breakdown of GeCo's Components. First, we analyze the runtime of each operator presented in Sec. 4, as well as the impact of the Δ-representation and partial evaluation (see Sec. 5), on two tree-based models and a multi-layered perceptron (MLP) over the Allstate and Yelp datasets. For each benchmark, we run GeCo for 5 generations on 1,000 instances that have been classified as bad.
Figure 7 presents the results for this benchmark. Init captures the time it takes to compute the feasible space, and to generate and select the fittest candidates for the initial population. The times for selection, crossover, and mutation are accumulated over all generations. For each considered scenario, we present the runtime (1) without the Δ-representation and partial evaluation, (2) with each optimization individually, and (3) with both.
The results show that the selection and mutation operators arethe most time consuming operations of the genetic algorithm. Thisis not surprising since they operate on tens of thousands of candi-dates, whereas crossover combines only a few selected candidates.
Partial evaluation of the classifier decreases the runtime of the selection operator by up to 3.2× (Allstate with random forest), which translates into an overall speedup of 1.7×. Partial evaluation is more effective for the random forest model, because we can only partially evaluate the first layer of the MLP.
The Δ-representation decreases the runtime of the mutation operator by 3.9× for Allstate and 3.7× for Yelp (with MLP). This speedup is due to the compression achieved by the Δ-representation (see below). If the classifier is not partially evaluated, then there is a tradeoff in the runtime for the selection operator, because it requires the materialization of the full feature vector. This materialization increases the runtime of select by up to 2.1× (Yelp, random forest).
The best performance is achieved if the Δ-representation and partial evaluation of the classifier are used together. In this case, there is a significant runtime speedup for both the mutation and selection operators. Overall, this can lead to a performance improvement of 5.2× (Yelp with random forest).

Number of Generations and Explored Candidates. Table 3 shows for each dataset how many generations GeCo needed on
Figure 7: Breakdown of GeCo's runtime into the main operators on Allstate and Yelp using random forest and MLP classifiers. We present the runtimes (1) without the Δ-representation and partial evaluation, (2) with partial evaluation, (3) with the Δ-representation, and (4) with both optimizations. The runtime is averaged over 1,000 instances.
average to converge, and how many candidates it explored. The majority (up to 97%) of the candidates were generated by mutate.
Compression by Δ-representation. Table 3 compares the sizes of the naive listing representation and the Δ-representation. We measure size in terms of the number of represented values, and take the average over all generations of the size of the candidate population after the mutate and crossover operations. Overall, the Δ-representation can represent the candidate population up to 25× more compactly than the naive listing representation.
Effect of Distance Function. Recall our distance function (Eq. (2)) and its parameters α, β, γ. In Sec. 6.3, we show the explanation quality for GeCo on Credit and Adult using the parameters (α = 0, β = 1, γ = 0). When setting (α = 0.5, β = 0.5, γ = 0) and (α = 0.33, β = 0.34, γ = 0.33), we observe that using the ℓ0 norm (α > 0) further reduces the number of features that GeCo changes. In this case, GeCo always returns explanations that change only a single feature, for both Credit and Adult. The ℓ∞-norm has no significant effect in our experiments, which may be because the ℓ1-norm already restricts the maximum change.
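A weighted combination of the ℓ0, ℓ1, and ℓ∞ norms as described above can be sketched as follows (a minimal Python sketch; the function name, the per-feature range normalization, and the averaging over the number of features are illustrative assumptions, while Eq. (2) gives the precise definition):

```python
import numpy as np

def counterfactual_distance(x, x_cf, ranges, alpha, beta, gamma):
    """Weighted sum of l0, l1, and l_inf norms over per-feature
    differences between instance x and counterfactual x_cf.
    Each difference is scaled by the feature's value range
    (a common normalization; Eq. (2) may differ in detail)."""
    diff = np.abs(np.asarray(x, float) - np.asarray(x_cf, float))
    diff = diff / np.asarray(ranges, float)  # scale changes to [0, 1]
    d = len(diff)
    l0 = np.count_nonzero(diff) / d          # fraction of changed features
    l1 = diff.sum() / d                      # average normalized change
    linf = diff.max()                        # largest single change
    return alpha * l0 + beta * l1 + gamma * linf
```

With α > 0, a candidate that changes a single feature is rewarded even when its total change is the same, which matches the observation that the ℓ0 term drives GeCo toward single-feature explanations.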
6.5 Summary
The experiments show that GeCo is the only system that can provide high-quality explanations in real time. On the Credit and Adult datasets, MACE is the best competitor in terms of quality, but it takes orders of magnitude longer to run. The explanations of the fastest competitors (WIT and DiCE) have poor quality.
The Δ-representation and partial evaluation of the classifier can significantly reduce GeCo's end-to-end runtime. For our evaluation on the real Allstate and Yelp datasets, the two optimizations improve the end-to-end runtime by a factor of up to 5.2×.
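To illustrate the intuition behind the Δ-representation (a hypothetical Python sketch; GeCo's actual Julia data structure differs), candidates that modify the same subset of features can share one schema and store only their changed values, materializing full feature vectors only when the classifier requires them:

```python
from dataclasses import dataclass

@dataclass
class DeltaGroup:
    # Names of the features that every candidate in this group modifies.
    changed: list
    # One row of replacement values per candidate; all other features
    # keep the values of the original instance x.
    values: list

def materialize(x, groups):
    """Expand the compact population into full feature vectors.
    Only needed when the classifier is not partially evaluated,
    which is exactly the tradeoff measured for the select operator."""
    out = []
    for g in groups:
        for row in g.values:
            cand = dict(x)                     # copy the original instance
            cand.update(zip(g.changed, row))   # apply only the delta
            out.append(cand)
    return out
```

Since most candidates change only a handful of features, each group stores its feature list once and a few values per candidate instead of a full d-dimensional vector, which is where compression factors such as 25× can come from.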
7 CONCLUSIONS
We described GeCo, the first interactive system for counterfactual explanations that supports a complex, real-life semantics of counterfactuals, yet provides answers in real time. GeCo defines a rich search space for counterfactuals by considering both a dataset of example instances and a general-purpose constraint language. It uses a genetic algorithm to search for counterfactuals, which is customized to favor counterfactuals that require the smallest number of changes. We described two powerful optimization techniques that speed up the inner loop of the genetic algorithm: the Δ-representation and partial evaluation. We demonstrated that, compared against four other systems reported in the literature, GeCo is the only one that can both compute quality explanations and find them in real time.
At a high level, our work belongs to the more general effort of applying database and compiler optimizations to improve the performance of machine learning tasks [1, 4, 9, 14, 23].
GeCo has two limitations that we plan to address in future work. The first is that we have implemented partial evaluation manually, and it is only supported for random forests and simple neural network classifiers. We plan to extend this optimization to other models. Ideally, we would like to leverage work from the compilers community and use an out-of-the-box partial compiler, but a major obstacle is that GeCo is written in Julia, for which no partial compilers exist.
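To convey the flavor of this optimization, here is a hand-coded sketch for a single decision tree (hypothetical Python; not GeCo's Julia implementation): splits on features that a candidate group never changes can be resolved once against the original instance, leaving a smaller residual tree to evaluate per candidate:

```python
# A tree is either a leaf value or a tuple (feature, threshold, left, right),
# with the left branch taken when x[feature] <= threshold.

def specialize(tree, fixed):
    """Partially evaluate the tree: resolve every split on a feature
    in `fixed` (features the candidate group never changes) and
    return the residual tree over the remaining features."""
    if not isinstance(tree, tuple):           # leaf: nothing to resolve
        return tree
    feat, thresh, left, right = tree
    if feat in fixed:                         # branch is statically known
        branch = left if fixed[feat] <= thresh else right
        return specialize(branch, fixed)
    return (feat, thresh, specialize(left, fixed), specialize(right, fixed))

def evaluate(tree, x):
    """Evaluate a (residual) tree on the features a candidate changes."""
    while isinstance(tree, tuple):
        feat, thresh, left, right = tree
        tree = left if x[feat] <= thresh else right
    return tree
```

For a random forest, every tree in the ensemble would be specialized once per candidate group, so per-candidate scoring touches only the splits that depend on changed features.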
The second limitation is related to privacy. GeCo uses a dataset to define its sample spaces, and that may leak private information about the users whose data is stored in that dataset. Notice that GeCo does not compute aggregates on the dataset, but uses values that exist in the data; hence techniques from differential privacy do not apply out of the box. A technical solution to this problem would be to generate a different instance, using a differentially private (and, hence, randomized) algorithm, then use that to compute the sample space; inevitably, this will lead to a degradation in the quality of the explanations. The solution that we favor is regulatory: certain combinations of attributes are not considered to be pseudo-identifiers, and, hence, their values are considered public. Explanation systems like GeCo are likely to be deployed in highly regulated settings that legislate what explanations the users are entitled to, and also likely to legislate what combinations of attributes are private.
ACKNOWLEDGMENTS
This work was partially supported by NSF IIS 1907997, NSF IIS 1954222, and a generous gift from RelationalAI.
REFERENCES
[1] Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. In-Database Learning with Sparse Tensors. In PODS. 325–340.
[2] Allstate. 2011. Allstate Claim Prediction Challenge. https://www.kaggle.com/c/ClaimPredictionChallenge
[3] Sercan Arik and Tomas Pfister. 2019. TabNet: Attentive interpretable tabular learning. arXiv preprint arXiv:1908.07442 (2019).
[4] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD. Association for Computing Machinery, New York, NY, USA, 1383–1394.
[5] Nima Asadi, Jimmy Lin, and Arjen P. De Vries. 2013. Runtime optimizations for tree-based machine learning models. IEEE Transactions on Knowledge and Data Engineering 26, 9 (2013), 2281–2292.
[6] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, José M. F. Moura, and Peter Eckersley. 2020. Explainable machine learning in deployment. In FAT*. 648–657.
[7] Anthony D. Blaom, Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J. Vollmer. 2020. MLJ: A Julia package for composable machine learning. arXiv:2007.12285 [cs.LG]
[8] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml
[9] Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J. Gao. 2019. Declarative Recursive Computation on an RDBMS: Or, Why You Should Use a Database for Distributed Machine Learning. PVLDB 12, 7 (March 2019), 822–835.
[10] Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. 1993. Partial Evaluation and Automatic Program Generation. Prentice Hall.
[11] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2012. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Trans. Vis. Comput. Graph. 18, 12 (2012), 2917–2926.
[12] Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. 2020. Model-agnostic counterfactual explanations for consequential decisions. In AISTATS. 895–905.
[13] Ron Kohavi. 1996. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, Vol. 96. 202–207.
[14] Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel. 2015. Learning generalized linear models over normalized data. In SIGMOD. 1969–1984.
[15] Zhicheng Liu and Jeffrey Heer. 2014. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 2122–2131.
[16] Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. 2015. QuickScorer: A fast algorithm to rank documents with additive ensembles of regression trees. In SIGIR. 73–82.
[17] Divyat Mahajan, Chenhao Tan, and Amit Sharma. 2019. Preserving Causal Constraints in Counterfactual Explanations for Machine Learning Classifiers. In CausalML @ NeurIPS.
[18] Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In FAT*. 607–617.
[19] Dan Olteanu and Maximilian Schleich. 2016. Factorized Databases. SIGMOD Rec. 45, 2 (2016), 5–16.
[20] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, et al. 2011. Scikit-learn: Machine Learning in Python. J. Machine Learning Research 12 (2011), 2825–2830.
[21] Marco Túlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In SIGKDD. 1135–1144.
[22] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[23] Maximilian Schleich, Dan Olteanu, Mahmoud Abo Khamis, Hung Q. Ngo, and XuanLong Nguyen. 2019. A Layered Aggregate Engine for Analytics Workloads. In SIGMOD. 1642–1659.
[24] Amir Shaikhha, Yannis Klonatos, and Christoph Koch. 2018. Building Efficient Query Engines in a High-Level Language. ACM Trans. Database Syst. 43, 1, Article 4 (2018), 45 pages.
[25] Shubham Sharma, Jette Henderson, and Joydeep Ghosh. 2020. CERTIFAI: A Common Framework to Provide Explanations and Analyse the Fairness and Robustness of Black-box Models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 166–172.
[26] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In ICML. 3145–3153.
[27] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In AIES. 180–186.
[28] Ruby Y. Tahboub, Grégory M. Essertel, and Tiark Rompf. 2018. How to Architect a Query Compiler, Revisited. In SIGMOD. 307–322.
[29] Berk Ustun, Alexander Spangher, and Yang Liu. 2019. Actionable recourse in linear classification. In FAT*. 10–19.
[30] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31 (2017), 841.
[31] James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda Viégas, and Jimbo Wilson. 2019. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 56–65.
[32] Ting Ye, Hucheng Zhou, Will Y. Zou, Bin Gao, and Ruofei Zhang. 2018. RapidScorer: Fast tree ensemble evaluation by maximizing compactness in data level parallelization. In SIGKDD. 941–950.
[33] I-Cheng Yeh and Che-hui Lien. 2009. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36, 2 (2009), 2473–2480.
[34] Yelp. 2017. Yelp Dataset Challenge. https://www.yelp.com/dataset/challenge/