
GeCo: Quality Counterfactual Explanations in Real Time

Maximilian Schleich (University of Washington)
Zixuan Geng (University of Washington)
Yihong Zhang (University of Washington)
Dan Suciu (University of Washington)

ABSTRACT

Machine learning is increasingly applied in high-stakes decision making that directly affects people's lives, and this leads to an increased demand for systems that explain their decisions. Explanations often take the form of counterfactuals, which consist of conveying to the end users what they need to change in order to improve the outcome. Computing counterfactual explanations is challenging, because of the inherent tension between a rich semantics of the domain and the need for real-time response. In this paper we present GeCo, the first system that can compute plausible and feasible counterfactual explanations in real time. At its core, GeCo relies on a genetic algorithm, which is customized to favor counterfactual explanations with the smallest number of changes. To achieve real-time performance, we introduce two novel optimizations: Ξ”-representation of candidate counterfactuals, and partial evaluation of the classifier. We compare GeCo empirically against four other systems described in the literature, and show that it is the only system that can achieve both high-quality explanations and real-time answers.

Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/mjschleich/GeCo.jl.

1 INTRODUCTION

Machine learning is increasingly applied in high-stakes decision making that directly affects people's lives. As a result, there is a huge need to ensure that the models and their predictions are interpretable by their human users. Motivated by this need, there has been a lot of recent interest within the machine learning community in techniques that can explain the outcomes of models. Explanations improve the transparency and interpretability of the underlying model, they increase users' trust in the model predictions, and they are a key facilitator for evaluating the fairness of the model for underrepresented demographics. The ability to explain is no longer a nice-to-have feature, but is increasingly required by law; for example, the GDPR regulations grant users the right to an explanation of automated decision algorithms [30]. In addition to supporting the end user, explanations can also be used by model developers to debug and monitor their ever more complex models.

In this paper we focus on local explanations, which provide post-hoc explanations for one single prediction, in contrast to global explanations, which aim to explain the entire model (e.g., for debugging purposes). In particular, we study counterfactual explanations: given an instance 𝒙 on which the machine learning model predicts a negative, "bad" outcome, the explanation says what needs

                     Age  Income  Debt  Accounts
Factual 𝒙            28   800     200   5          M(𝒙) = no loan
Counterfactual 𝒙_cf  28   1000    200   3          M(𝒙_cf) = loan

Explanation: Your loan will be approved if you increase income by $200 and close two accounts.

Figure 1: Example of a counterfactual explanation scenario.

to change in order to get the positive, "good" outcome, usually represented by a counterfactual example 𝒙_cf. For example, a customer applies for a loan with a bank, the bank denies the loan application, and the customer asks for an explanation; the system responds by indicating which features need to change in order for the loan to be approved; see Fig. 1. In the AI literature, counterfactual explanations are currently considered the most attractive type of local explanation, because they offer actionable feedback to the customers and have been deemed satisfactory by some legislative bodies; for example, they are deemed to satisfy the GDPR requirements [30].

The major challenge in computing a counterfactual explanation is the tension between a rich semantics on one hand, and the need for real-time, interactive feedback on the other. The semantics needs to be rich in order to reflect the complexities of the real world. We want 𝒙_cf to be as close as possible to 𝒙, but we also want 𝒙_cf to be plausible, meaning that its features should make sense in the real world. We also want the transition from 𝒙 to 𝒙_cf to be feasible; for example, age should only increase. The plausibility and feasibility constraints are dictated by laws, societal norms, and application-specific requirements, and may even change over time; an explanation system must be able to support constraints with a rich semantics. Moreover, the search space for counterfactuals is huge, because there are often hundreds of features, and each can take values from some large domain. On the other hand, the computation of counterfactuals needs to be done at interactive speed, because the explanation system is eventually incorporated in a user interface. Performance has been identified as the main challenge for the deployment of counterfactual explanations in industry [6, 31]. The gold standard for interactive response in data visualization

arXiv:2101.01292v1 [cs.LG] 5 Jan 2021

                 | Limited search space     | Complete search space
Non-interactive  | CERTIFAI [25]            | MACE [12]
Interactive      | What-If [31], DiCE [17]  | GeCo (this paper)

Figure 2: Taxonomy of Counterfactual Explanation Systems

is 500ms [11, 15], and we believe that a similar bar should be set for explanations. The tension between performance and rich semantics is the main technical challenge in counterfactual explanations. Previous systems either explore a complete search space with rich semantics, or answer at interactive speed, but not both: see Fig. 2 and the discussion in Section 6. For example, at one extreme, MACE [12] enforces plausibility and feasibility by using a general-purpose constraint language, but the solver often takes many minutes to find a counterfactual. At the other extreme, Google's What-If Tool (WIT) [31] restricts the search space to a fixed dataset of examples, ensuring fast response times but poor explanations.

In this paper we present GeCo, the first interactive system for counterfactual explanations that supports a complex, real-life semantics of counterfactuals, yet provides answers in real time. At its core, GeCo defines a search space of counterfactuals using a plausibility-feasibility constraint language, PLAF, and a database 𝐷. PLAF is used to define constraints like "age can only increase", while the database 𝐷 is used to capture correlations, like "job-title and salary are correlated". By design, the search space of possible counterfactuals is huge. To search this space, we make a simple observation: a good explanation 𝒙_cf should differ from 𝒙 by only a few features; counterfactual examples 𝒙_cf that require the customer to change too many features are of little interest. Based on this observation, we propose a genetic algorithm, which we customize to search the space of counterfactuals by prioritizing those that have fewer changes. Starting from a population consisting of just the given entity 𝒙, the algorithm repeatedly updates the population by applying the operations crossover and mutation, and then selects the best counterfactuals for the new generation. It stops when it reaches a sufficient number of examples on which the classifier returns the "good" (desired) outcome.

The main performance limitation in GeCo is its innermost loop. By the nature of the genetic algorithm, GeCo needs to repeatedly add and remove counterfactuals to and from the current population, and ends up having to examine thousands of candidates, applying the classifier 𝑀(𝒙′) to each of them. We propose two novel optimizations to speed up the inner loop of GeCo: Ξ”-representation and classifier specialization via partial evaluation. In the Ξ”-representation, we group the current population by the set of features Ξ”F by which the candidates differ from 𝒙, and represent each entire subpopulation in a single relation whose attributes are only Ξ”F; for example, all candidate examples 𝒙′ that differ from 𝒙 only by age are represented by a single-column table, storing only the modified age. This leads to huge memory savings over the naive representation (storing all features of all candidates 𝒙′), which, in turn, leads to performance improvements. Partial evaluation specializes the code of the classifier 𝑀 to entities 𝒙′ that differ from 𝒙 only in Ξ”F; for example, if 𝑀 is a decision tree, then normally 𝑀(𝒙′) inspects all features from the root to a leaf, but if Ξ”F is age, then after partial evaluation it only needs to inspect age; for example, 𝑀 can become age > 30, or perhaps 30 < age < 60, because all the other features are constants within this subpopulation. Ξ”-representation and partial evaluation work well together and, when combined, allow GeCo to compute the counterfactual in interactive time. At a high level, our optimizations belong to a more general effort that uses database and program optimizations to improve the performance of machine learning tasks (e.g., [4, 9, 14, 19, 23]).
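To make the partial-evaluation idea concrete, here is a minimal sketch (our own illustration, not GeCo's Julia implementation): a decision tree, encoded as nested tuples, is specialized for a subpopulation in which every feature except age is fixed to the factual values, so the residual classifier inspects only age. All feature names, thresholds, and scores here are made up for the example.

```python
def partial_eval(tree, fixed):
    """Specialize a decision tree to known (constant) feature values.

    `tree` is either a leaf (a float score) or a node
    (feature, threshold, left, right): go left if value <= threshold.
    `fixed` maps feature name -> constant value for this subpopulation.
    """
    if not isinstance(tree, tuple):      # leaf: nothing to simplify
        return tree
    feat, thresh, left, right = tree
    if feat in fixed:                    # branch decided by a constant
        branch = left if fixed[feat] <= thresh else right
        return partial_eval(branch, fixed)
    # feature still varies within the subpopulation: keep the node
    return (feat, thresh, partial_eval(left, fixed), partial_eval(right, fixed))

def evaluate(tree, entity):
    """Ordinary top-down evaluation of the (possibly specialized) tree."""
    while isinstance(tree, tuple):
        feat, thresh, left, right = tree
        tree = left if entity[feat] <= thresh else right
    return tree

# Factual entity; candidates in this subpopulation change only `age`.
x = {"age": 28, "income": 800, "debt": 200, "accounts": 5}
tree = ("age", 30.0,
        ("income", 900.0, 0.1, 0.6),   # age <= 30: income decides
        0.8)                           # age  > 30: "good"
fixed = {f: v for f, v in x.items() if f != "age"}
small = partial_eval(tree, fixed)      # the income test collapses away
```

After specialization, `small` is a tree over age alone, so scoring thousands of age-only candidates no longer touches the constant features.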

We benchmarked GeCo against four counterfactual explanation systems: MACE [12], DiCE [17], the What-If Tool [31], and CERTIFAI [25]. We show that in all cases GeCo produced the best-quality explanation (using quality metrics defined in [12]), at interactive speed (under 500ms). We also conduct micro-experiments showing that the two optimizations, Ξ”-representation and partial evaluation, account for a performance improvement of up to 5.2Γ—, and thus are critical for an interactive deployment of GeCo.

Discussion. Early explanation systems were designed for a specific class of models, e.g., DeepLift [26] is specific to neural networks; these are called white-box. While many systems today claim to be black-box, there is as yet no commonly agreed definition of what black-box means. In this paper we adopt a strict definition: the classifier 𝑀 is black-box if it is available only as an oracle that, when given an input 𝒙′, returns the outcome 𝑀(𝒙′). A black-box model makes it much more difficult to explore the search space of counterfactuals, because we cannot examine the code of 𝑀 for hints on how to quickly get from 𝒙 to a counterfactual 𝒙_cf. A gray-box model allows us to examine the code of 𝑀. Unlike white-box, gray-box does not restrict what 𝑀 does, but it gives us access to its source code. Some previous systems that were called black-box are, in fact, gray-box: MACE [12] translates both the classifier logic and the feasibility/plausibility constraints into a logical formula and then solves for counterfactuals via multiple calls to an SMT solver. The explanation systems based on gradient descent [18, 30] are also gray-box, since they need access to the code of 𝑀 in order to compute its gradient. Fully black-box models are CERTIFAI [25] and Google's What-If Tool (WIT) [31]. CERTIFAI is based on a genetic algorithm, but the quality of its explanations is highly sensitive to the computation of the initial population; we discuss CERTIFAI in detail in Sec. 4.7. WIT severely limits its search space to a given dataset. A black-box classifier also makes it difficult for the system to compute an explanation that is consistent with the underlying classifier, meaning that 𝑀(𝒙_cf) returns the desired, "good", outcome. As an example, LIME [21] uses a black-box classifier, learns a simple, interpretable model locally, around the input data point 𝒙, and uses it to obtain an explanation; however, its explanations are not always consistent with the predictions of the original classifier [22, 27]. Our system, GeCo, is black-box if we turn off the partial evaluation optimization. With the optimization enabled, GeCo is gray-box. Thus, GeCo differs from other systems in that it uses access to the code of 𝑀 not to guide the search, but only to optimize the execution time.

Contributions. In summary, in this paper we make the following contributions.

β€’ We describe GeCo, the first explanation system that computes feasible and plausible explanations in interactive response time.

β€’ We describe the search space of GeCo, consisting of a database 𝐷 and a constraint language, PLAF. Section 3.

β€’ We describe a custom genetic algorithm for exploring the space of candidate counterfactuals. Section 4.

β€’ We describe two optimization techniques: Ξ”-representation and partial evaluation. Section 5.

β€’ We conduct an extensive experimental evaluation of GeCo, and compare it to MACE, DiCE, Google's What-If Tool, and CERTIFAI. Section 6.

2 COUNTERFACTUAL EXPLANATIONS

We consider 𝑛 features 𝐹1, ..., 𝐹𝑛, with domains dom(𝐹1), ..., dom(𝐹𝑛). We assume to have a black-box model 𝑀, i.e., an oracle that, when given a feature vector 𝒙 = (π‘₯1, ..., π‘₯𝑛), returns a prediction 𝑀(𝒙). The prediction is a number between 0 and 1, where 1 is the desired, or good, outcome, and 0 is the undesired outcome. For simplicity, we will assume that 𝑀(𝒙) > 0.5 is "good", and everything else is "bad". If the classifier is categorical, then we simply replace its outcomes with the values {0, 1}. Given an instance 𝒙 for which 𝑀(𝒙) is "bad", the goal of the counterfactual explanation is to find a counterfactual instance 𝒙_cf such that (1) 𝑀(𝒙_cf) is "good", (2) 𝒙_cf is close to 𝒙, and (3) 𝒙_cf is both feasible and plausible. Formally, a counterfactual explanation is a solution to the following optimization problem:

argmin_{𝒙_cf}  dist(𝒙, 𝒙_cf)                        (1)
s.t.  𝑀(𝒙_cf) > 0.5
      𝒙_cf ∈ P         // 𝒙_cf is plausible
      𝒙_cf ∈ F(𝒙)      // 𝒙_cf is feasible

where dist is a distance function. The counterfactual explanation 𝒙_cf ranges over the space dom(𝐹1) Γ— Β·Β·Β· Γ— dom(𝐹𝑛), subject to the plausibility and feasibility constraints P and F(𝒙), which we discuss in the next section. In practice, as we shall explain, GeCo returns not just one, but the top π‘˜ best counterfactuals 𝒙_cf.

The role of the distance function is to ensure that GeCo finds the nearest counterfactual instance that satisfies the constraints. In particular, we are interested in counterfactuals that change the values of only a few features, which helps define concrete actions that the user can perform to achieve the desired outcome. For that purpose, we use the distance function dist(𝒙, π’š) introduced by MACE [12], which we briefly review here.

We start by defining domain-specific distance functions 𝛿1, ..., 𝛿𝑛 for each of the 𝑛 features, as follows. If dom(𝐹𝑖) is categorical, then 𝛿𝑖(π‘₯𝑖, 𝑦𝑖) := 0 if π‘₯𝑖 = 𝑦𝑖, and 𝛿𝑖(π‘₯𝑖, 𝑦𝑖) := 1 otherwise. If dom(𝐹𝑖) is a continuous or an ordinal domain, then 𝛿𝑖(π‘₯𝑖, 𝑦𝑖) := |π‘₯𝑖 βˆ’ 𝑦𝑖|/𝑀, where 𝑀 is the range of the domain. We note that, when the range is unbounded or unknown, alternative normalizations are possible, such as the Median Absolute Deviation (MAD) [30], or the standard deviation [31].

Next, we define the ℓ𝑝-distance between 𝒙 and π’š as:

dist_𝑝(𝒙, π’š) := ( Σ𝑖 𝛿𝑖(π‘₯𝑖, 𝑦𝑖)^𝑝 )^{1/𝑝}

and adopt the usual convention that the β„“0-distance is the number of distinct features: dist_0(𝒙, π’š) := |{𝑖 | 𝛿𝑖(π‘₯𝑖, 𝑦𝑖) β‰  0}|. Finally, we define our distance function as a weighted combination of the β„“0, β„“1, and β„“βˆž distances:

dist(𝒙, π’š) = Ξ± Β· dist_0(𝒙, π’š)/𝑛 + Ξ² Β· dist_1(𝒙, π’š)/𝑛 + Ξ³ Β· dist_∞(𝒙, π’š)      (2)

where Ξ±, Ξ², Ξ³ β‰₯ 0 are hyperparameters, which must satisfy Ξ± + Ξ² + Ξ³ = 1. Notice that 0 ≀ dist(𝒙, π’š) ≀ 1. The intuition is that the β„“0-norm restricts the number of features that are changed, the β„“1-norm accounts for the average change of distance between 𝒙 and π’š, and the β„“βˆž-norm restricts the maximum change across all features. Although [12] defines the distance function (2), the MACE system hardwires the hyperparameters to Ξ± = 0, Ξ² = 1, Ξ³ = 0; we discuss this, and the setting of different hyperparameters, in Sec. 6.
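The definitions above can be sketched directly in code. This is a minimal illustration of Eq. (2) applied to the Figure 1 entities; the feature kinds, the numeric ranges, and the weights Ξ± = 0.2, Ξ² = 0.6, Ξ³ = 0.2 are assumptions chosen only for the example.

```python
def feature_dist(xi, yi, kind, rng=1.0):
    """Per-feature distance delta_i: 0/1 for categorical features,
    |x - y| / range for continuous or ordinal ones."""
    if kind == "cat":
        return 0.0 if xi == yi else 1.0
    return abs(xi - yi) / rng

def dist(x, y, kinds, ranges, alpha=0.2, beta=0.6, gamma=0.2):
    """Eq. (2): alpha*l0/n + beta*l1/n + gamma*linf, alpha+beta+gamma = 1."""
    n = len(x)
    deltas = [feature_dist(x[i], y[i], kinds[i], ranges[i]) for i in range(n)]
    l0 = sum(1 for d in deltas if d != 0)    # number of changed features
    l1 = sum(deltas)                         # total normalized change
    linf = max(deltas)                       # largest single change
    return alpha * l0 / n + beta * l1 / n + gamma * linf

# Factual and counterfactual from Figure 1 (the ranges are assumptions):
x = [28, 800, 200, 5]            # age, income, debt, accounts
y = [28, 1000, 200, 3]
kinds  = ["num", "num", "num", "num"]
ranges = [100, 10_000, 10_000, 20]
score = dist(x, y, kinds, ranges)    # a value in [0, 1]
```

Because each 𝛿𝑖 lies in [0, 1] and the weights sum to 1, the result is always in [0, 1], as the text notes.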

3 THE SEARCH SPACE

The key to computing high-quality explanations is to define a complete space of candidates that the system can explore. In GeCo, the search space for the counterfactual explanations is defined by two components: a database of entities 𝐷 = {𝒙1, 𝒙2, ...} and a plausibility and feasibility constraint language called PLAF.

The database 𝐷 can be the training set, a test set, or historical data of past customers for which the system has performed predictions. It is used in two ways. First, GeCo computes the domain of each feature as the active domain in 𝐷: dom(𝐹𝑖) := Ξ _{𝐹𝑖}(𝐷). Second, the data analyst can specify groups of features with the command:

GROUP 𝐹_{𝑖1}, 𝐹_{𝑖2}, ...

and in that case the joint domain of these features is restricted to the combinations of values found in 𝐷, i.e., Ξ _{𝐹_{𝑖1} 𝐹_{𝑖2} Β·Β·Β·}(𝐷). Grouping is useful in several contexts. The first is when the attributes are correlated. For example, if we have a functional dependency like zip β†’ city, then the data analyst would group zip, city. For another example of correlation, consider education and income. They are correlated without satisfying a strict functional dependency; by grouping them, the data analyst ensures that GeCo considers only combinations of values found in the dataset 𝐷. As a final example, consider attributes that are the result of one-hot encoding: e.g., color may be expanded into color_red, color_green, color_blue. By grouping them together, the analyst restricts GeCo to consider only the values (1, 0, 0), (0, 1, 0), (0, 0, 1) that actually occur in 𝐷.
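The joint active domain of a group is just the set of distinct value combinations occurring together in 𝐷. A minimal sketch (the dataset rows are made up; GeCo.jl operates over Julia DataFrames, not Python dicts):

```python
# A toy database D of past entities.
D = [
    {"zip": "98105", "city": "Seattle",  "education": 2, "income": 60_000},
    {"zip": "98105", "city": "Seattle",  "education": 3, "income": 80_000},
    {"zip": "10001", "city": "New York", "education": 3, "income": 90_000},
]

def group_domain(D, group):
    """Joint active domain Pi_group(D): the distinct value combinations
    of the grouped features that actually occur together in D."""
    return sorted({tuple(row[f] for f in group) for row in D})

# GROUP zip, city -> only (zip, city) pairs present in D are allowed,
# so the functional dependency zip -> city is respected automatically.
pairs = group_domain(D, ("zip", "city"))

# GROUP education, income -> the correlated values are sampled together.
edu_inc = group_domain(D, ("education", "income"))
```

A mutation then draws a whole tuple from `pairs` or `edu_inc`, never an out-of-database combination like ("98105", "New York").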

The constraint language PLAF allows the data analyst to specify which combinations of features of 𝒙_cf are plausible, and which can be feasibly reached from 𝒙. PLAF consists of statements of the form:

PLAF IF Ξ¦1 and Ξ¦2 and Β·Β·Β· THEN Ξ¦0                        (3)

where each Φ𝑖 is an atomic predicate of the form 𝑒1 op 𝑒2 for op ∈ {=, β‰ , ≀, <, β‰₯, >}, and each expression 𝑒1, 𝑒2 is over the current features, denoted x.𝐹𝑖, and/or counterfactual features, denoted x_cf.𝐹𝑖. The IF part may be missing.


explain(instance 𝒙, classifier 𝑀, dataset 𝐷, PLAF (Ξ“, C))
  C𝒙 = ground(𝒙, C);  DG = feasibleSpace(𝐷, Ξ“, C𝒙)
  POP = [ (𝒙, βˆ…) ]
  POP = mutate(POP, Ξ“, DG, C𝒙, π‘š_init)     // initial population
  POP = selectFittest(𝒙, POP, 𝑀, π‘ž)
  do {
    CAND = crossover(POP, C𝒙) βˆͺ mutate(POP, Ξ“, DG, C𝒙, π‘š_mut)
    POP = selectFittest(𝒙, POP βˆͺ CAND, 𝑀, π‘ž)
    TOPK = POP[1 : π‘˜]
  } until ( counterfactuals(TOPK, 𝑀) and TOPK ∩ CAND = βˆ… )
  return TOPK

Algorithm 1: Pseudo-code of GeCo's custom genetic algorithm to generate counterfactual explanations.
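As a runnable illustration of the loop structure in Algorithm 1, here is a deliberately simplified skeleton on a toy one-feature problem. The classifier, distance, mutation operator, and fitness ordering are stand-ins invented for this sketch (GeCo's real operators are described in Sec. 4.3-4.5), crossover is omitted, and the stopping test is simplified to "top-π‘˜ unchanged between generations and all counterfactual".

```python
import random

random.seed(0)

def M(x):
    """Toy black-box classifier: 'good' (score 1.0) iff income > 900."""
    return 1.0 if x["income"] > 900 else 0.0

def distance(x, y):
    """Stand-in for the dist() of Eq. (2), over the single feature."""
    return abs(x["income"] - y["income"]) / 10_000

def mutate(pop, m):
    """Each candidate spawns m mutants with a re-sampled income."""
    out = []
    for cand, delta in pop:
        for _ in range(m):
            new = dict(cand)
            new["income"] = random.choice(range(0, 2001, 100))
            out.append((new, delta | {"income"}))
    return out

def select_fittest(x, pop, q):
    """Counterfactuals first (M 'good'), then by distance to x."""
    key = lambda cd: (M(cd[0]) <= 0.5, distance(x, cd[0]))
    return sorted(pop, key=key)[:q]

def explain(x, k=2, q=10, m_init=20, m_mut=5):
    pop = select_fittest(x, mutate([(x, frozenset())], m_init), q)
    prev = None
    while True:
        cand = mutate(pop, m_mut)        # (crossover omitted here)
        pop = select_fittest(x, pop + cand, q)
        topk = pop[:k]
        # stop when the top-k are all counterfactuals and are stable
        if prev == topk and all(M(c) > 0.5 for c, _ in topk):
            return topk
        prev = topk

x = {"income": 800}
topk = explain(x)    # k "good" candidates close to the factual x
```

Each population entry carries its changed-feature set Ξ”β€² alongside the entity, mirroring the (𝒙′, Ξ”β€²) pairs of Sec. 4.1.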

Example 3.1. Consider the following PLAF specification:

GROUP education, income                           (4)
PLAF x_cf.gender = x.gender                       (5)
PLAF x_cf.age >= x.age                            (6)
PLAF IF x_cf.education > x.education
     THEN x_cf.age > x.age + 4                    (7)

The first statement says that education and income are correlated: GeCo will consider only counterfactual values that occur together in the data. Rule (5) says that gender cannot change, rule (6) says that age can only increase, while rule (7) says that, if we ask the customer to get a higher education degree, then we should also increase the age by 4. The last rule (7) is adapted from [17], who have argued for the need to restrict counterfactuals to those that satisfy causal constraints.

PLAF has the following restrictions. (1) The groups have to be disjoint: if a feature 𝐹𝑖 needs to be part of two groups, then the groups need to be unioned into a larger group. We denote by 𝐺(𝐹𝑖) the unique group containing 𝐹𝑖 (or 𝐺(𝐹𝑖) = {𝐹𝑖} if 𝐹𝑖 is not explicitly included in any group). (2) The rules must be acyclic, in the following sense. Every consequent Ξ¦0 in (3) must be of the form x_cf.𝐹𝑖 op 𝑒, i.e., it must "define" a counterfactual feature 𝐹𝑖, and the following graph must be acyclic: the nodes are the groups 𝐺(𝐹1), 𝐺(𝐹2), ..., and the edges are (𝐺(𝐹𝑗), 𝐺(𝐹𝑖)) whenever there is a rule (3) that defines x_cf.𝐹𝑖 and that also contains x_cf.𝐹𝑗. We briefly illustrate with Example 3.1. The three rules (5)-(7) "define" the features gender, age, and age, respectively, and there is a single edge {education, income} β†’ {age}, resulting from rule (7). Therefore, the PLAF program is acyclic.

4 GECO'S CUSTOM GENETIC ALGORITHM

In this section, we introduce the custom genetic algorithm that GeCo uses to efficiently explore the space of counterfactual explanations. A genetic algorithm is a meta-heuristic for constraint optimization that is based on the process of natural selection. There are four core operations. First, it defines an initial population of candidates. Then, it iteratively selects the fittest candidates in the population, and generates new candidates via mutate and crossover on the selected candidates, until convergence.

While there are many optimization algorithms that could be used to solve our optimization problem (defined in Eq. (1)), we chose a genetic algorithm for the following reasons: (1) the genetic algorithm is easily customizable to the problem of finding counterfactual explanations; (2) it seamlessly supports the rich semantics of PLAF constraints, which are necessary to ensure that the explanations are feasible and plausible; (3) it does not require any restrictions on the underlying classifier and data, and thus is able to provide black-box explanations; and (4) it returns a diverse set of explanations, which may provide different actions that can lead to the desired outcome.

In GeCo, we customized the core operations of the genetic algorithm based on the following key observation: a good explanation 𝒙_cf should differ from 𝒙 by only a few features; counterfactual examples 𝒙_cf that require the customer to change too many features are of little interest. For this reason, GeCo first explores counterfactuals that change only a single feature group, before exploring increasingly more complex action spaces in subsequent generations.

In the following, we first give an overview of the genetic algorithm used by GeCo, and then provide additional details for the core operations.

4.1 Overview

GeCo's pseudocode is shown in Algorithm 1. The inputs are: an instance 𝒙, the black-box classifier 𝑀, a dataset 𝐷, and a PLAF program (Ξ“, 𝐢). Here Ξ“ = {𝐺1, 𝐺2, ...} are the groups of features, and 𝐢 = {𝐢1, 𝐢2, ...} are the PLAF constraints; see Sec. 3. The algorithm has four integer hyperparameters π‘˜, π‘š_init, π‘š_mut, π‘ž > 0, with the following meaning: π‘˜ is the number of counterfactuals that the algorithm returns to the user; π‘ž is the size of the population that is retained from one generation to the next; and π‘š_init, π‘š_mut control the number of candidates that are generated for the initial population and during mutation, respectively. We always set π‘˜ < π‘ž.

As explained in Sec. 3, the active domain of each attribute consists of the values found in the database 𝐷. More generally, for each group 𝐺𝑖 ∈ Ξ“, its values must be sampled together from those in the database 𝐷. The GeCo algorithm starts by grounding (specializing) the PLAF program 𝐢 to the entity 𝒙 (the ground function), then calls the feasibleSpace operator, which computes for each group 𝐺𝑖 a relation 𝐷_{𝐺𝑖} with attributes 𝐺𝑖 representing the sample space for the group 𝐺𝑖; we give details in Sec. 4.2.

Next, GeCo computes the initial population. In our customized algorithm, the initial population is obtained simply by applying the mutate operator to the given entity 𝒙. Throughout the execution of the algorithm, the population is a set of pairs (𝒙′, Ξ”β€²), where 𝒙′ is an entity and Ξ”β€² is the set of features that were changed from 𝒙. Throughout this section we call the entities 𝒙′ in the population candidates, or examples, but we don't call them counterfactuals, unless 𝑀(𝒙′) is "good", i.e., 𝑀(𝒙′) > 0.5 (see Sec. 2). In fact, it is possible that none of the candidates in the initial population are classified as good. The goal of GeCo is to find at least π‘˜ counterfactuals through a sequence of mutation and crossover operations.

The main loop of GeCo's genetic algorithm consists of extending the population with new candidates obtained by mutation and crossover, and then keeping only the π‘ž fittest for the next generation. The operators selectFittest, mutate, and crossover are described in Sec. 4.3, 4.4, and 4.5. The algorithm stops when the top π‘˜ (of the π‘ž fittest) candidates are all counterfactuals and are stable from one generation to the next; the function counterfactuals simply tests that all candidates are counterfactuals, by checkingΒΉ that 𝑀(𝒙′) is "good". We describe now the details of the algorithm.

4.2 Feasible Space Operators

Throughout the execution of the genetic algorithm, GeCo ensures that all candidates 𝒙′ satisfy all PLAF constraints 𝐢. It achieves this efficiently through three functions: ground, feasibleSpace, and actionCascade. We describe these functions here, and omit their pseudocode (which is straightforward).

The function ground(𝒙, 𝐢) simply instantiates all features of 𝒙 with constants, and returns "grounded" constraints 𝐢𝒙. All candidates 𝒙′ will need to satisfy these grounded constraints.

Example 4.1. Let 𝐢 be the three constraints of the PLAF program in Example 3.1. Assume that the instance is:

𝒙 = (gender = female, age = 22, education = 3, income = 80k)

Then the grounded set of rules 𝐢𝒙 is obtained from the rules (5)-(7). For example, x_cf.age >= x.age becomes age β‰₯ 22. The three grounded rules are:

gender = female                      (8)
age β‰₯ 22                             (9)
education > 3 β‡’ age > 26            (10)

Every candidate 𝒙′ must satisfy all three rules.
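The grounded rules (8)-(10) can be written directly as predicates over a candidate. A minimal sketch (our own encoding of the example; GeCo stores grounded constraints symbolically, not as Python lambdas):

```python
# Grounded rules C_x for the entity of Example 4.1.
grounded = [
    lambda c: c["gender"] == "female",                    # rule (8)
    lambda c: c["age"] >= 22,                             # rule (9)
    lambda c: not (c["education"] > 3) or c["age"] > 26,  # rule (10)
]

def satisfies(candidate):
    """True iff the candidate satisfies every grounded rule."""
    return all(rule(candidate) for rule in grounded)

# Raising education to 4 requires age > 26 (rule (10)):
ok  = satisfies({"gender": "female", "age": 27, "education": 4, "income": 90_000})
bad = satisfies({"gender": "female", "age": 24, "education": 4, "income": 90_000})
```

Note that the implication in rule (10) is encoded as `not antecedent or consequent`, so it holds vacuously whenever education is unchanged.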

A naive strategy to generate candidates 𝒙′ that satisfy 𝐢𝒙 is through rejection sampling: after each mutation and/or crossover, we verify 𝐢𝒙, and reject 𝒙′ if it fails some constraint in 𝐢𝒙. GeCo improves over this naive approach in two ways. First, it computes for each group 𝐺𝑖 ∈ Ξ“ a set of values 𝐷_{𝐺𝑖} that, in isolation, satisfy 𝐢𝒙; this is precomputed at the beginning of the algorithm by feasibleSpace. Second, once it generates candidates 𝒙′ that differ from 𝒙 in multiple feature groups 𝐺_{𝑖1}, 𝐺_{𝑖2}, ..., it enforces the constraints by possibly applying additional mutations, until all constraints hold: this is done by the function actionCascade.

The function feasibleSpace(𝐷, Ξ“, 𝐢𝒙) computes, for each group 𝐺𝑖 ∈ Ξ“, the relation:

𝐷_{𝐺𝑖} := Οƒ_{𝐢𝒙^{(𝑖)}}(Ξ _{𝐺𝑖}(𝐷))

where 𝐢𝒙^{(𝑖)} consists of the conjunction of all rules in 𝐢𝒙 that refer only to features in the group 𝐺𝑖. We call the relation 𝐷_{𝐺𝑖} the sample space for the group 𝐺𝑖. Thus, the selection operator Οƒ_{𝐢𝒙^{(𝑖)}} rules out values in the sample space that violate some PLAF rule.

Example 4.2. Continuing Example 4.1, there are three groups, 𝐺1 = {gender}, 𝐺2 = {education, income}, 𝐺3 = {age}, and their sample spaces are computed as follows:

𝐷𝐺1 := 𝜎_{gender=female} (Π_{gender} (𝐷))

𝐷𝐺2 := Π_{education,income} (𝐷)

𝐷𝐺3 := 𝜎_{age≥22} (Π_{age} (𝐷))

¹In our implementation we store the value 𝑀(𝒙′) together with 𝒙′, so we don’t have to compute it repeatedly. We omit some details for simplicity of the presentation.

Notice that feasibleSpace could only check the grounded rules (8) and (9). Rule (10) refers to two different groups, and can only be checked by examining features from both groups; this is done by the function actionCascade.
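To make the construction concrete, the per-group sample-space computation of feasibleSpace can be sketched in Python as follows. This is a minimal sketch under our own assumptions (the dataset is a list of dicts, and the grounded single-group rules are given as predicates); the names are illustrative, not GeCo's actual Julia API:

```python
# Sketch of feasibleSpace: for each feature group, project the dataset onto
# the group's features and keep only the tuples that satisfy the grounded
# rules referring solely to that group.
def feasible_space(D, groups, constraints):
    sample_spaces = {}
    for name, features in groups.items():
        order = sorted(features)
        # Rules whose referenced features all lie inside this group
        local_rules = [pred for refs, pred in constraints if refs <= features]
        # Projection Pi_G(D), deduplicated
        projected = {tuple(row[f] for f in order) for row in D}
        # Selection: drop projected tuples that violate a local rule
        sample_spaces[name] = [
            dict(zip(order, vals)) for vals in projected
            if all(pred(dict(zip(order, vals))) for pred in local_rules)
        ]
    return sample_spaces

D = [{"gender": "female", "age": 25, "education": 2, "income": 50},
     {"gender": "male",   "age": 30, "education": 4, "income": 90},
     {"gender": "female", "age": 20, "education": 3, "income": 60}]
groups = {"G1": {"gender"}, "G2": {"education", "income"}, "G3": {"age"}}
constraints = [({"gender"}, lambda t: t["gender"] == "female"),
               ({"age"},    lambda t: t["age"] >= 22)]
spaces = feasible_space(D, groups, constraints)
print(sorted(t["age"] for t in spaces["G3"]))  # → [25, 30]; age 20 is ruled out
```

The cross-group rule (10) is deliberately absent here: it cannot be checked per group, and is instead enforced by actionCascade.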

The function actionCascade(𝒙′, Δ′, 𝐶𝒙) is called during mutation and crossover, and its role is to enforce the grounded constraints 𝐶𝒙 on a candidate 𝒙′ before it is added to the population. Recall from Sec. 3 that the PLAF rules are acyclic. actionCascade checks each rule, in topological order of the acyclic graph: if the rule is violated, then it changes the feature defined by that rule to a value that satisfies the condition, and it adds the updated feature (or group) to Δ′. The function returns the updated candidate 𝒙′, which now satisfies all rules 𝐶𝒙, as well as its set of changed features Δ′. For a simple illustration, referring to Example 4.2, when GeCo considers a new candidate 𝒙′, it checks rule (10): if the rule is violated, then it replaces age with a new value from 𝐷𝐺3, subject to the additional condition age > 26, and adds age to Δ′.
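A minimal sketch of actionCascade, under our own encoding of the grounded rules (each rule carries the feature it defines, that feature's group, and a predicate over the candidate); again the names are illustrative, not GeCo's implementation:

```python
import random

# Sketch of actionCascade: visit rules in topological order; when a rule is
# violated, resample the feature it defines from the group's sample space,
# restricted to values that repair the rule, and record the group in delta.
def action_cascade(x, delta, rules, sample_spaces):
    x = dict(x)
    for feature, group, pred in rules:  # rules are in topological order
        if pred(x):
            continue  # rule already holds
        repairs = [t for t in sample_spaces[group]
                   if pred({**x, feature: t[feature]})]
        if repairs:
            x[feature] = random.choice(repairs)[feature]
            delta = delta | {group}
    return x, delta

# Grounded rule (10): education > 3 => age > 26
rules = [("age", "G3", lambda t: not t["education"] > 3 or t["age"] > 26)]
spaces = {"G3": [{"age": a} for a in range(22, 60)]}
x_prime = {"gender": "female", "age": 22, "education": 4, "income": 90}
x_fixed, delta = action_cascade(x_prime, frozenset({"G2"}), rules, spaces)
print(x_fixed["age"] > 26, sorted(delta))  # → True ['G2', 'G3']
```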

4.3 Selecting Fittest Candidates

At each iteration, GeCo’s genetic algorithm extends the current population (through mutation and crossover), then retains only the “fittest” 𝑞 candidates 𝒙′ for the next generation using the selectFittest(𝒙, POP, 𝑀, 𝑞) function. The function first evaluates the fitness of each candidate 𝒙′ in POP; then sorts the examples 𝒙′ by their fitness score; and returns the top 𝑞 candidates.

We describe here how GeCo computes the fitness of each candidate 𝒙′. For that, GeCo takes into account two pieces of information: whether 𝒙′ is counterfactual, i.e. 𝑀(𝒙′) > 0.5, and the distance from 𝒙 to 𝒙′, which is given by the function 𝑑𝑖𝑠𝑡(𝒙, 𝒙′) defined in Sec. 2 (see Eq. (2)). It combines these two pieces of information into a single numerical value, 𝑠𝑐𝑜𝑟𝑒(𝒙′), defined as:

π‘ π‘π‘œπ‘Ÿπ‘’ (𝒙 β€²) ={𝑑𝑖𝑠𝑑 (𝒙, 𝒙 β€²) if𝑀 (𝒙 β€²) > 0.5(𝑑𝑖𝑠𝑑 (𝒙, 𝒙 β€²) + 1)+(1 βˆ’π‘€ (𝒙 β€²)) otherwise

(11)

The rationale for this score is that we want every counterfactual candidate to be better than every non-counterfactual candidate. If 𝑀(𝒙′) > 0.5, then 𝒙′ is counterfactual, and in that case it remains to minimize 𝑑𝑖𝑠𝑡(𝒙, 𝒙′). But if 𝑀(𝒙′) ≤ 0.5, then 𝒙′ is not counterfactual, and we add 1 to 𝑑𝑖𝑠𝑡(𝒙, 𝒙′), ensuring that this term is larger than the distance 𝑑𝑖𝑠𝑡(𝒙, 𝒙′′) of any counterfactual 𝒙′′ (because 𝑑𝑖𝑠𝑡 ≤ 1, see Sec. 2); we also add a penalty 1 − 𝑀(𝒙′), to favor candidates 𝒙′ that are closer to becoming counterfactuals.

In summary, the function selectFittest computes the fitness score (Eq. (11)) for each candidate in the population, then sorts the population in increasing order of this score, and returns the 𝑞 fittest candidates. Notice that all counterfactual examples 𝒙′ will precede all non-counterfactuals in the sorted order.
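The score of Eq. (11) and selectFittest can be sketched as follows, with a toy one-dimensional model and distance standing in for 𝑀 and 𝑑𝑖𝑠𝑡 (both are our own illustrative assumptions; 𝑑𝑖𝑠𝑡 is assumed normalized to [0, 1] as in Sec. 2):

```python
# Sketch of the fitness score (Eq. (11)) and selectFittest.
def score(x, x_prime, M, dist):
    if M(x_prime) > 0.5:
        return dist(x, x_prime)          # counterfactual: minimize distance
    # non-counterfactual: the +1 makes it worse than any counterfactual
    # (since dist <= 1); 1 - M(x') favors near-boundary candidates
    return (dist(x, x_prime) + 1) + (1 - M(x_prime))

def select_fittest(x, pop, M, dist, q):
    return sorted(pop, key=lambda c: score(x, c, M, dist))[:q]

# Toy 1-D classifier and normalized distance
M = lambda v: 1.0 if v >= 10 else v / 20     # "good" outcome when v >= 10
dist = lambda a, b: min(abs(a - b) / 20, 1.0)
pop = [8, 12, 15, 9]
print(select_fittest(5, pop, M, dist, 2))    # → [12, 15]
```

Note that the counterfactuals 12 and 15 precede the non-counterfactuals 8 and 9 in the sorted order, exactly as required.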

4.4 Mutation Operator

Algorithm 2 provides the pseudo-code of the mutation operator. The operator mutate(POP, Γ, DG, 𝐶𝒙, 𝑚) takes as input the current population POP, the list of feature groups Γ = {𝐺1, 𝐺2, . . .}, their associated sample spaces DG = {𝐷𝐺1, 𝐷𝐺2, . . .}, the grounded constraints 𝐶𝒙, and an integer 𝑚 > 0. The function generates, for each candidate 𝒙′ in the population, 𝑚 new mutations for each feature group 𝐺𝑖. We briefly explain the pseudo-code. For each candidate


mutate(POP, Γ, DG, 𝐶𝒙, 𝑚)
  CAND = ∅
  foreach (𝒙∗, Δ∗) ∈ POP do {
    foreach 𝐺 ∈ Γ s.t. 𝐺 ∉ Δ∗ do {
      𝒙′ = 𝒙∗; 𝑆 = sample(DG[𝐺], 𝑚)  // without replacement
      foreach 𝑣 ∈ 𝑆 do {
        𝒙′.𝐺 = 𝑣; Δ′ = Δ∗ ∪ {𝐺}
        (𝒙′, Δ′) = actionCascade(𝒙′, Δ′, 𝐶𝒙)
        CAND ∪= {(𝒙′, Δ′)}
      }
    }
  }
  return CAND

Algorithm 2: Pseudo-code for GeCo’s mutate operator.

crossover(POP, 𝐶𝒙)
  CAND = ∅; {Δ1, . . . , Δ𝑟} = {Δ | (𝒙, Δ) ∈ POP}
  foreach (Δ𝑖, Δ𝑗) ∈ {Δ1, . . . , Δ𝑟} s.t. 𝑖 < 𝑗 do {
    let (𝒙𝑖, Δ𝑖) = best instance in POP with Δ = Δ𝑖
    let (𝒙𝑗, Δ𝑗) = best instance in POP with Δ = Δ𝑗
    // combine actions from 𝒙𝑖 and 𝒙𝑗
    𝒙′ = 𝒙𝑖; Δ′ = Δ𝑖 ∪ Δ𝑗
    foreach 𝐺 ∈ Δ′ do {
      if 𝐺 ∈ Δ𝑖 \ Δ𝑗 do 𝒙′.𝐺 = 𝒙𝑖.𝐺
      elseif 𝐺 ∈ Δ𝑗 \ Δ𝑖 do 𝒙′.𝐺 = 𝒙𝑗.𝐺
      else 𝒙′.𝐺 = rand(Bool) ? 𝒙𝑖.𝐺 : 𝒙𝑗.𝐺
    }
    (𝒙′, Δ′) = actionCascade(𝒙′, Δ′, 𝐶𝒙)
    CAND ∪= {(𝒙′, Δ′)}
  }
  return CAND

Algorithm 3: Pseudo-code for GeCo’s crossover operator.

(π’™βˆ—,Ξ”βˆ—) ∈ POP, and for each feature group 𝐺 ∈ Ξ“ that has notbeen mutated yet (𝐺 βˆ‰ Ξ”βˆ—), we sampleπ‘š values without replace-ment from the sample space associated to 𝐺 , and construct a newcandidate 𝒙 β€² obtained by changing𝐺 to 𝑣 , for each value 𝑣 in thesample. As explained earlier (Sec. 4.2), before we insert 𝒙 β€² in thepopulation, we must enforce all constraints in 𝐢𝒙 , and this is doneby the function actionCascade.

In GeCo, the mutation operator also generates the initial population. In this case, the current population POP contains only the original instance 𝒙 with Δ = ∅. Thus, GeCo first explores candidates in the initial population that differ from the original instance 𝒙 only in a change to one group 𝐺 ∈ Γ. This ensures that we prioritize candidates that have few changes. Subsequent calls to the mutation operator then change one additional feature group for each candidate in the current population.
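A minimal Python sketch of Algorithm 2, with actionCascade elided; the data layout (candidates as dicts, sample spaces as lists of partial assignments) is our own illustrative assumption:

```python
import random

# Sketch of mutate: for each candidate and each group not yet changed,
# sample m values without replacement and emit one new candidate per value.
def mutate(pop, groups, sample_spaces, m):
    cand = []
    for x_star, delta_star in pop:
        for g in groups:
            if g in delta_star:
                continue                  # only mutate unchanged groups
            space = sample_spaces[g]
            for assignment in random.sample(space, min(m, len(space))):
                cand.append(({**x_star, **assignment}, delta_star | {g}))
    return cand

x = {"age": 22, "education": 3}
spaces = {"age": [{"age": a} for a in (25, 30, 40)],
          "education": [{"education": e} for e in (4, 5)]}
# Initial population: only the original instance with an empty delta,
# so every generated candidate changes exactly one group
init = mutate([(x, frozenset())], ["age", "education"], spaces, m=2)
print(len(init))  # → 4  (2 samples x 2 groups)
```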

4.5 Crossover Operator

The crossover operator generates new candidates by combining the actions of two instances 𝒙𝑖, 𝒙𝑗 in POP. For example, consider candidates 𝒙𝑖 and 𝒙𝑗 that differ from 𝒙 in the feature groups {age} and {education, income}, respectively. Then, the crossover operator generates a new candidate 𝒙′ that changes all three features age, education, income, such that 𝒙′.age = 𝒙𝑖.age and 𝒙′.{education, income} = 𝒙𝑗.{education, income}. The pseudo-code of the operator is given by Algorithm 3.

In a conventional genetic algorithm, the candidates for crossover are selected at random. In GeCo, however, we want to combine the candidates that (1) change a distinct set of features, and (2) are the best candidates amongst all candidates in POP that change the same set of features. Recall that, for each candidate 𝒙′ in the population, we also store the set Δ′ of feature groups where 𝒙′ differs from 𝒙. To achieve our customized crossover, we first collect all distinct sets Δ in POP. Then, we consider any combination of two distinct sets (Δ𝑖, Δ𝑗), and find the candidates 𝒙𝑖 and 𝒙𝑗 which have the best fitness score in the subsets of POP for which Δ = Δ𝑖 and Δ = Δ𝑗, respectively. The fitness score is given in Eq. (11) and, since POP is already sorted by the fitness score, the best candidate with Δ = Δ𝑖 is also the first candidate in POP with Δ = Δ𝑖. The operator then combines the candidates 𝒙𝑖 and 𝒙𝑗 into a new candidate 𝒙′, which inherits the mutations from both 𝒙𝑖 and 𝒙𝑗; in case a feature group 𝐺 was mutated in both 𝒙𝑖 and 𝒙𝑗, we choose at random to assign to 𝒙′.𝐺 either 𝒙𝑖.𝐺 or 𝒙𝑗.𝐺. Finally, we check if 𝒙′ requires any cascading actions, and apply them in order to satisfy the constraints 𝐶𝒙. We add 𝒙′ and the corresponding feature set Δ′ to the collection of new candidates CAND.
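A sketch of the customized crossover in the same illustrative Python encoding, with actionCascade again elided; it assumes the population is sorted best-first, so the first candidate seen for each Δ is the best one:

```python
import random
from itertools import combinations

# Sketch of crossover: pair the best candidate of each distinct delta set;
# the child inherits each parent's exclusive changes, and overlapping
# groups are resolved by a coin flip.
def crossover(pop):
    best = {}
    for x, delta in pop:            # pop sorted by fitness: first hit wins
        best.setdefault(delta, x)
    cand = []
    for di, dj in combinations(best, 2):
        xi, xj = best[di], best[dj]
        child = dict(xi)            # groups in di \ dj stay from xi
        for g in di | dj:
            if g in dj - di:
                child[g] = xj[g]
            elif g in di & dj:
                child[g] = random.choice((xi[g], xj[g]))
        cand.append((child, di | dj))
    return cand

pop = [({"age": 30, "income": 80}, frozenset({"age"})),
       ({"age": 22, "income": 95}, frozenset({"income"}))]
out = crossover(pop)
print(out[0])  # child combines the age change and the income change
```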

4.6 Discussion

In this section, we discuss implementation details and potential tradeoffs that we face when optimizing the performance of GeCo.

Hyperparameters. We use the following defaults for the hyperparameters: 𝑞 = 100, 𝑘 = 5, 𝑚init = 20, 𝑚mut = 5. The first parameter means that each generation retains only the fittest 𝑞 = 100 candidates. The value 𝑘 = 5 means that the algorithm runs until the top 5 candidates of the population are all counterfactuals (i.e. 𝑀(𝒙) > 0.5) and they remain in the top 5 during two consecutive generations. The initial population consists of 𝑚init = 20 candidates 𝒙′ per feature group, and, during mutation, we create 𝑚mut = 5 new candidates per feature group. These defaults ensure that the candidates selected from the initial population cover a minimum of five distinct feature groups. In subsequent generations, we then sample multiple actions for each feature group, in order to mitigate the potential issue that the initial samples did not capture the best action for each feature group.

Representation of Ξ”. For each candidate (𝒙 β€²,Ξ”β€²) ∈ POP, we repre-sent the feature set Ξ”β€² as a compact bitset, which allows for efficientbitwise operations.

Sampling Operator. In the mutation operator, we use weighted sampling to sample actions from the sample space 𝐷𝐺𝑖 associated to the group 𝐺𝑖. The weight is given by the frequency of each value 𝑣 ∈ 𝐷𝐺𝑖 in the input dataset 𝐷. As a result, GeCo is more likely to sample those actions that frequently occur in the dataset 𝐷, which helps to ensure that the generated counterfactual is plausible.
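The frequency-weighted sampling can be sketched as follows (for brevity this sketch samples with replacement via random.choices, whereas GeCo's mutation samples without replacement; the names are illustrative):

```python
import random
from collections import Counter

# Sketch of frequency-weighted sampling: values that occur often in the
# dataset D are proposed more often, biasing the search toward plausible
# counterfactuals.
def weighted_sample(dataset_column, sample_space, m):
    freq = Counter(dataset_column)
    weights = [freq[v] for v in sample_space]
    return random.choices(sample_space, weights=weights, k=m)

incomes = [10_000] * 7 + [15_000] * 2 + [20_000]
sample = weighted_sample(incomes, [10_000, 15_000, 20_000], m=5)
print(sample)  # 10,000 is proposed most often in expectation
```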

Number of Mutated Candidates. By default the mutation operator mutates every candidate in the current population, which ensures that (1) we explore a large set of candidates, and (2) we sample many values from the feasible space for each group. The downside is that this operation can take a long time, in particular if we return


many selected candidates (𝑞 is large), and the mutation operator can become a performance bottleneck. In this case, it is possible to mutate only a selected number of candidates. We propose to group the candidates by the set Δ, and then to mutate only the top candidates in each group (just like we only do crossover on the top candidate in each group). This approach ensures that we still explore candidates with the same Δ sets as the default, but limits the total number of candidates that are generated.

Large Sample Spaces. If there is a sample space 𝐷𝐺𝑖 that is very large, then it is possible that GeCo does not sample the best action from this space. For this case, we designed a variant of the mutate operator, which we call selectiveMutate. Whereas the normal mutation operator mutates candidates by changing feature groups that have not been changed, selective mutation mutates the feature groups that were previously changed, by sampling actions that would decrease the distance between the candidate and the original instance. This additional operation ensures that GeCo is more likely to identify the best action in large sample spaces.

4.7 Comparison with CERTIFAI

CERTIFAI [25] also computes counterfactual explanations using a genetic algorithm. The main difference that distinguishes CERTIFAI from GeCo is that CERTIFAI assumes that the initial population is a random sample of only “good” counterfactuals. Once this initial population is computed, the goal of its genetic algorithm is to find better counterfactuals, whose distance to the original instance is smaller. However, it is unclear how to compute the initial population of counterfactuals, and the quality of the final answer of CERTIFAI depends heavily on the choice of this initial population. CERTIFAI also does not emphasize the exploration of counterfactuals that change only a few features. In contrast, GeCo starts from only the original instance 𝒙 as seed, whose outcome is “bad”, and relies on the hypothesis that some “good” counterfactual should be nearby, i.e. with few changed features.

Since CERTIFAI is not publicly available, we implemented our own variant based on the description in the paper. Since it is not clear how the initial population is computed, we consider all instances in the database 𝐷 that are classified with the good outcome and satisfy all feasibility and plausibility constraints. This is the most efficient way to generate a sample of good counterfactuals, but it has the downside that the quality of the initial population, and, hence, of the final explanation, varies widely with the instance 𝒙. In fact, there are cases where 𝐷 contains no counterfactual that satisfies the feasibility constraints w.r.t. 𝒙. In Section 6, we show that GeCo is able to compute explanations that are closer to the original instance significantly faster than CERTIFAI.

5 OPTIMIZATIONS

The main performance limitation in GeCo is the repeated calls to selectFittest and mutate. Between the two operations, GeCo repeatedly adds tens of thousands of candidates to the population, applies the classifier 𝑀 on each of them, and then removes those that have a low fitness score. In Section 6, we show that selection and mutation account for over 95% of the overall runtime. In order to apply GeCo in interactive settings, it is thus important to optimize

Naive listing representation (Age, Edu., Occup., Income):
28 HS Service 10,000
30 HS Service 10,000
32 HS Service 10,000
22 PhD Service 10,000
22 BSc Service 10,000
25 HS Service 15,000
23 HS Service 20,000
22 HS Sales 10,000
22 HS Student 10,000

Δ-representation:
{Age}: 28; 30; 32
{Edu.}: PhD; BSc
{Age, Income}: (25, 15,000); (23, 20,000)
{Occup.}: Sales; Student

Figure 3: Example candidate population for instance 𝒙 = (22, HS, Service, 10000) using the naive listing representation (left) and the equivalent Δ-representation (right).

the performance of these two operators. In this section, we present two optimizations which significantly improve their performance.

To optimize mutate, we present a lossless, compressed data representation, called Δ-representation, for the candidate population that is generated during the genetic algorithm. In our experiments, the Δ-representation is up to 25× more compact than an equivalent naive listing representation, which translates to a performance improvement of up to 3.9× for mutate.

To optimize selectFittest, we draw on techniques from the PL community, in particular partial evaluation, to optimize the evaluation of the classifier. The optimizations exploit the fact that we know the values for subsets of the input features before the evaluation, which allows us to translate the model into a specialized equivalent model that pre-evaluates the static components. These optimizations improve the runtime for selectFittest by up to 3.2×.

The Ξ”-representation and partial evaluation complement eachother, and together decrease the end-to-end runtime of GeCo by afactor of 5.2Γ—. Next, we provide more details for our optimizations.

5.1 Ξ”-RepresentationThe naive representation for the candidate population of the geneticalgorithm is a listing of the full feature vectors 𝒙 β€² for all candidates(𝒙 β€²,Ξ”β€²). This representation is highly redundant, because mostvalues in 𝒙 β€² are equal to the original instance 𝒙 ; only the featuresin Ξ”β€² are different. The Ξ”-representation can represent each thecandidate (𝒙 β€²,Ξ”β€²) compactly by storing only the features in Ξ”β€².This is achieve by grouping the candidate population by the set offeatures Ξ”β€², and then representing the entire subpopulation in asingle relation 𝑅Δ′ whose attributes are only Ξ”β€².

Example 5.1. Figure 3 presents a candidate population for the instance 𝒙 = (Age=22, Edu=HS, Occup=Service, Income=10000) using (left) the naive listing representation and (right) the equivalent Δ-representation. For simplicity, we highlight the changed values in the listing representation, instead of enumerating all Δ sets.

Most values stored in the listing representation are values from 𝒙. In contrast, the Δ-representation only represents the values that are different from 𝒙. For instance, the first three candidates, which change only Age, are represented in a relation with the single attribute Age, and without repeating the values for Edu, Occup, Income.
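The grouping behind the Δ-representation can be sketched with plain dictionaries (GeCo uses bitsets and DataFrames, as described below in Sec. 5.1; this Python layout is only illustrative):

```python
# Sketch of the delta-representation: group candidates by their set of
# changed features and store, per group, only the changed values.
def to_delta_representation(x, candidates):
    rep = {}
    for cand in candidates:
        changed = frozenset(f for f in cand if cand[f] != x[f])
        rep.setdefault(changed, []).append({f: cand[f] for f in changed})
    return rep

def expand(x, row):
    """Reconstruct the full candidate from a compact row."""
    return {**x, **row}

x = {"Age": 22, "Edu": "HS", "Occup": "Service", "Income": 10_000}
candidates = [{**x, "Age": 28}, {**x, "Age": 30}, {**x, "Edu": "PhD"},
              {**x, "Age": 25, "Income": 15_000}]
rep = to_delta_representation(x, candidates)
print(sorted(len(rows) for rows in rep.values()))  # → [1, 1, 2]
```

Here expand plays the role of the wrapper that materializes the full feature vector when the classifier needs one.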


In our implementation, we represent the Δ sets as bitsets, and the Δ-representation is a hashmap that maps the distinct Δ sets to the corresponding relation 𝑅Δ, which is represented as a DataFrame. We provide wrapper functions so that we can apply standard DataFrame operations directly on the Δ-representation.

TheΞ”-representation has a significantly lowermemory footprint,which can lead to significant performance improvements over thenaive representation, because it is more efficient to add candidatesto the smaller relations, and it simplifies garbage collection.

There is, however, a potential performance tradeoff for the selection operator, because the classifier 𝑀 typically assumes as input the full feature vector 𝒙′. In this case, we copy the values in the Δ-representation to a full instantiation of 𝒙′. This transformation can be expensive, but, in our experiments, it does not outweigh the speedup for mutation. In the next section, we show how we can use partial evaluation to avoid the construction of the full 𝒙′.

5.2 Partial Evaluation for Classifiers

We show how to adapt PL techniques, in particular code specialization via partial evaluation, to optimize the evaluation of a given classifier 𝑀, and thus speed up the performance of selectFittest.

Consider a program 𝑃 : 𝑋 × 𝑌 → 𝑂 which maps two inputs (𝑋, 𝑌) into output 𝑂. Assume that we know 𝑌 = 𝑦 at compile time. Partial evaluation takes program 𝑃 and input 𝑌 = 𝑦 and generates a more efficient program 𝑃′⟨𝑦⟩ : 𝑋 → 𝑂, which precomputes the static components. Partial evaluation guarantees that 𝑃′⟨𝑦⟩(𝑥) = 𝑃(𝑥, 𝑦) for all 𝑥 ∈ dom(𝑋). See [10] for more details on partial evaluation.

We next overview how we use partial evaluation in GeCo. Consider a classifier 𝑀 with features 𝐹. During the evaluation of candidate (𝒙′, Δ′), we know that the values for all features 𝐹 \ Δ′ are constants taken from the original instance 𝒙. Thus, we can partially evaluate the classifier 𝑀 to a simplified classifier 𝑀Δ′ that precomputes the static components related to the features 𝐹 \ Δ′. Once 𝑀Δ′ is generated, we cache the model so that we can apply it to all candidates in the population that change the same feature set Δ′. Note that by using partial evaluation, GeCo no longer explains a black box, since it requires access to the code of the classifier.

Example 5.2. Consider a decision tree classifier 𝑀. For an instance (𝒙′, Δ′), 𝑀(𝒙′) typically evaluates the decisions for all features along a root-to-leaf path. Since we know the values for the features 𝐹 \ Δ′, we can precompute and fold all nodes in the tree that involve the features 𝐹 \ Δ′. If Δ′ is small, then partial evaluation can generate a very simple tree. For instance, if Δ′ = {Age} then 𝑀Δ′ only needs to evaluate decisions of the form age > 30 or 30 < age < 60 to classify the input.
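The folding in Example 5.2 can be sketched on a toy tree encoding (the tuple layout, thresholds, and the ≤ convention are our own illustrative assumptions, not GeCo's QuickScorer-based implementation):

```python
# Sketch of partially evaluating a decision tree: any node testing a
# feature with a known (static) value is folded into the branch it takes.
def partial_eval(tree, fixed):
    """tree: ('leaf', value) or ('node', feature, threshold, left, right)."""
    if tree[0] == "leaf":
        return tree
    _, feat, thr, left, right = tree
    if feat in fixed:  # decision is static: fold the node away
        return partial_eval(left if fixed[feat] <= thr else right, fixed)
    return ("node", feat, thr,
            partial_eval(left, fixed), partial_eval(right, fixed))

def evaluate(tree, x):
    while tree[0] == "node":
        _, feat, thr, left, right = tree
        tree = left if x[feat] <= thr else right
    return tree[1]

tree = ("node", "income", 50,
        ("node", "age", 30, ("leaf", 0), ("leaf", 1)),
        ("leaf", 1))
specialized = partial_eval(tree, {"income": 40})  # only 'age' remains
print(specialized)  # → ('node', 'age', 30, ('leaf', 0), ('leaf', 1))
print(evaluate(specialized, {"age": 35}))  # → 1
```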

In addition to optimizing the evaluation of 𝑀, a partially evaluated classifier 𝑀Δ can be directly evaluated over the partial relation 𝑅Δ in the Δ-representation, and thus we mitigate the overhead resulting from the need to construct the full entity for the evaluation.

Partial evaluation has been studied and applied in various domains; e.g., in databases, it has been used to optimize query evaluation (see e.g., [24, 28]). We are, however, not aware of a general-purpose partial evaluator that can be applied in GeCo to optimize arbitrary classifiers. Thus, we implemented our own partial evaluator

Table 1: Key characteristics for each considered dataset.

                         Credit    Adult    Allstate    Yelp
Data Points                30K      45K      13.2M     22.4M
Variables                   14       12         37        34
Features (one-hot enc.)     20       51        864      1064

for two model classes: (1) tree-based models, which include decision trees, random forests, and gradient boosted trees, and (2) neural networks and multi-layered perceptrons.

In the following, we briefly introduce the partial evaluation we use for tree-based models and neural networks.

Tree-based models. We optimize the evaluation of tree-based models in two steps. First, we use existing techniques to turn the model into a more optimized representation for evaluation. Then, we apply partial evaluation on the optimized representation.

Tree-based models face performance bottlenecks during evaluation because, by the nature of their representation, they are prone to cache misses and branch mispredictions. For this reason, the ML systems community has studied how tree-based models can be represented so that they can be evaluated without random lookups in memory and repeated if-statements (see e.g., [5, 16, 32]). In GeCo, we use the representation proposed by the QuickScorer algorithm [16], which we briefly overview next.

Instead of evaluating a decision tree 𝑇 following a root-to-leaf path, QuickScorer evaluates all decision nodes in 𝑇 and keeps track of which leaves cannot be reached whenever a decision fails. Once all decisions are evaluated, the prediction is guaranteed to be given by the first leaf that can be reached. All operations in QuickScorer use efficient, cache-conscious bitwise operations, and avoid branch mispredictions. This makes QuickScorer very efficient, even if it evaluates many more decisions than the naive evaluation.

For partial evaluation, we exploit the fact that the first phase in QuickScorer computes the decision nodes for each feature independently of all other features. Thus, given a set of fixed feature values, we can precompute all corresponding decisions, and significantly reduce the number of decisions that are evaluated at runtime.

Neural Networks and MLPs. We consider neural networks for structured, relational data (as opposed to images or text). In this setting, each hidden node 𝑁 in the first layer of the network is typically a linear model [3, 17]. Given an input vector 𝒙, parameter vector 𝒘, and bias term 𝑏, the node 𝑁 thus computes: 𝑦 = 𝜎(𝒙⊤𝒘 + 𝑏), where 𝜎 is an activation function. If we know that some values in 𝒙 are static, then we can apply partial evaluation for 𝑁 by precomputing the product between 𝒙 and 𝒘 for all static components and adding the partial product to the bias term 𝑏. During evaluation, we then only need to compute the product between 𝒙 and 𝒘 for the non-static components.

Typically, the first layer is fully connected, in which case we can apply partial evaluation only on the first layer of the network. If the layers are not fully connected, then we can also partially evaluate subsequent layers. Note that this framework for partial evaluation is also applicable to multi-layered perceptrons and linear models.
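Folding the static part of the dot product into the bias can be sketched as follows (plain Python lists instead of tensors; the function names are illustrative):

```python
# Sketch of partially evaluating one fully connected unit: the products
# w[i] * x[i] over static features are folded into the bias once, so each
# later evaluation only touches the mutable features.
def specialize_unit(w, b, x, static_idx):
    b_spec = b + sum(w[i] * x[i] for i in static_idx)
    mutable_idx = [i for i in range(len(w)) if i not in static_idx]
    return b_spec, mutable_idx

def eval_specialized(w, b_spec, mutable_idx, x):
    return b_spec + sum(w[i] * x[i] for i in mutable_idx)

w, b = [0.5, -1.0, 2.0], 0.1
x = [4.0, 3.0, 1.0]
b_spec, idx = specialize_unit(w, b, x, static_idx={0, 1})
full = b + sum(wi * xi for wi, xi in zip(w, x))     # unspecialized result
print(eval_specialized(w, b_spec, idx, x) == full)  # → True
```

The activation 𝜎 is unaffected and would be applied to the returned pre-activation as usual.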


6 EXPERIMENTS

In this section, we present the results of our experimental evaluation of GeCo on four real datasets. We conduct the following experiments:

(1) We investigate whether GeCo is able to compute counterfactual explanations for one end-to-end example, and compare the explanation with four existing systems.

(2) We benchmark all considered systems on 5,000 instances, and investigate the tradeoff between the quality of the explanations and the runtime for each system.

(3) We conduct microbenchmarks for GeCo. In particular, we break down the runtime into individual components and investigate the impact of the optimizations from Section 5.

The code and scripts to run the experiments are publicly available at https://github.com/mjschleich/GeCo.jl.

6.1 Experimental Setup

In this section, we present the considered datasets and systems, as well as the setup used for all our experiments.

Datasets. We consider four real datasets: (1) Credit [33] is used to predict customers’ default on credit card payments in Taiwan; (2) Adult [13] is used to predict whether the income of adults exceeds $50K/year using US census data from 1994; (3) Allstate is a Kaggle dataset for the Allstate Claim Prediction Challenge [2], used to predict insurance claims based on the characteristics of the insured’s vehicle; (4) Yelp is based on the public Yelp Dataset Challenge [34] and is used to predict review ratings that users give to businesses.

Table 1 presents key statistics for each dataset. Credit and Adult are from the UCI repository [8] and commonly used to evaluate explanations (e.g., [12, 18, 29]). For all datasets, we one-hot encode the categorical variables. For the evaluation with existing systems, we further apply the same preprocessing that was proposed by the existing system, in order to ensure that our evaluation is fair.

For all datasets, we form one feature group for the features derived from one-hot encoding. In addition, we use the following PLAF constraints. For Adult, we enforce that Age and Education can only increase, and MaritalStatus, Relationship, Gender, and NativeCountry cannot change. For Credit, we enforce that HasHistoryOfOverduePayments can only increase, while isMale and isMarried cannot change. We do not specify any constraints for Allstate and Yelp. We also do not use more complex constraints with implications because they are not supported by all systems.

Considered Systems. We benchmark GeCo against four existing systems: (1) MACE [12] solves for counterfactuals using multiple runs of an SMT solver; (2) DiCE [17] generates counterfactual explanations with a variational auto-encoder; (3) WIT is our implementation of the counterfactual reasoning approach in Google’s What-if Tool [31]. WIT looks up the closest counterfactual that satisfies the PLAF constraints in the database 𝐷. We implemented our own version, because the What-if Tool does not support feasibility constraints; (4) CERT is our implementation of the genetic algorithm that is used in CERTIFAI [25] (see Sec. 4.7 for details). We reimplemented the algorithm because CERTIFAI is not public.

Evaluation Metrics. We use the following three metrics to evaluate the quality of the explanation: (1) The consistency of the explanations, i.e., does the classifier return the good outcome for the

Table 2: Examples of counterfactual explanations by GeCo, MACE, and DiCE for one instance in Adult. We show selected features; all others are not changed. MACE and DiCE use different models, so we show GeCo’s explanation for each model. The neural network does not use CapitalGain and CapitalLoss. (–) means the value is not changed.

                              Decision Tree            Neural Network
Attribute       Instance     GeCo       MACE          GeCo      DiCE
Age                49         –          –             53        54
Education        School       –          –             –      Doctorate
Occupation      Service       –          –             –      BlueCollar
CapitalGain       0.0      4,687.0    4,826.0
CapitalLoss       0.0         –         19.9
Hours/week         16         –          –             98        40
Race             Other        –          –             –       White
Gender           Female       –          –             –        Male
Income (>50K)    False       True       True          True      True

counterfactual 𝒙𝑐𝑓; (2) The distance between 𝒙𝑐𝑓 and the original instance 𝒙; (3) The number of features changed in 𝒙𝑐𝑓.

For the comparison with existing systems, we use the ℓ1 norm to aggregate the distances for each feature (i.e., 𝛽 = 1, 𝛼 = 𝛾 = 0 in Eq. (2)), because MACE and DiCE do not support combining norms. We examine other choices of these hyperparameters in the microbenchmarks. To compare runtimes, we report the average wall-clock time it takes to explain a single instance.

GeCo and CERT return multiple counterfactuals for each instance. We only consider the best counterfactual in this evaluation.

Classifiers. We benchmark the systems on tree-based models and multi-layered perceptrons (MLP).

Comparison with Existing Systems. For the comparison with existing systems we use the classifiers proposed by MACE and DiCE for all systems to ensure a fair comparison. For tree-based models, we attempted to compute explanations for random forest classifiers, but MACE took on average 30 minutes to compute a single explanation, which made it infeasible to compute explanations for many instances. Thus, we consider a single decision tree for this evaluation (computed in scikit-learn, default parameters). DiCE does not support decision trees, since it requires a differentiable classifier. For the neural net, we use the classifier proposed by DiCE, which is a two-layered neural network with 20 (fully connected) hidden units and ReLU activation.

Microbenchmarks. In the microbenchmarks, we consider a random forest with 500 trees and maximum depth of 10 from the Julia MLJ library [7], and the MLPClassifier from the scikit-learn library [20] with the default settings (one hidden layer with 100 nodes).

Setup. We implemented GeCo, WIT, and CERT in Julia 1.5.2. All experiments are run on an Intel Xeon CPU E7-4890/2.80GHz/64bit with 108GB RAM, Linux 4.16.0, and Ubuntu 16.04.

We use the default hyperparameters for GeCo (cf. Sec. 4.6) and MACE (𝜖 = 10⁻³). CERT runs for 300 generations, as in the original CERTIFAI implementation. In GeCo, we precompute the active domain of each feature group, which is invariant for all explanations.



Figure 4: Comparison of the average distance (ℓ1 norm) (left) and average runtime (right) of the explanations by GeCo, WIT, CERT, and MACE for 5,000 instances on the Credit and Adult datasets. Error bars represent one standard deviation.

6.2 End-to-end Example

We consider one specific instance in the Adult dataset that is classified as “bad” (Income <$50K) and illustrate the differences between the explanations for each considered system.

Table 2 presents the instance 𝒙 and the counterfactuals returned by GeCo, MACE, and DiCE. WIT and CERT fail to return an explanation, because the Adult dataset has no instance with income >$50K that also satisfies all PLAF constraints for 𝒙. MACE computes the explanation over a decision tree, and DiCE considers a neural network. We present GeCo’s explanation for each model.

For the decision tree, GeCo is able to find a counterfactual that changes only Capital Gains. In contrast, the explanation by MACE requires a change in both Capital Gains and Capital Loss. Remarkably, MACE’s change in Capital Gains is also larger than the one required by GeCo. Thus, GeCo is able to find better explanations than MACE.

The neural net used for the comparison with DiCE does not use the features Capital Gains and Capital Loss. For this model, GeCo requires a change in Age and Hours/week to find a counterfactual explanation. In contrast, DiCE changes the values for all 8 (!) features, which is implausible.

In addition to the top explanation in Table 2, GeCo also returns alternative counterfactuals that satisfy all constraints. For instance, for the neural net, GeCo also proposes (1) changing Age to 76 and Hours/Week to 80, or (2) increasing the education level from a School to a Masters degree. These alternatives provide different pathways for the user to achieve the good outcome. Only GeCo and CERT return multiple counterfactual explanations by default.

6.3 Quality and Runtime Tradeoff

In this section, we investigate the tradeoff between the quality and the runtime of the explanations for all considered systems on Credit and Adult. As explained in Sec. 6.1, we evaluate the systems using a single decision tree and a neural network.

Takeaways for Evaluation with Decision Trees. Figures 4 and 5 show the results of our evaluation with decision trees on Credit (top) and Adult (bottom). For each dataset, we explain a sample of 5,000 instances for which the classifier returns the negative outcome. Figure 4 presents the average distance (left) and runtime (right) for each competitor. Figure 5 presents the distribution of the number of features changed by MACE, WIT and CERT in comparison with the number of features changed by GeCo.

Quality comparison. GeCo and MACE are always able to find a feasible and plausible explanation. WIT and CERT, however, fail to find an explanation in 2.1% of the cases for Adult (represented by the ∞ bar in Figure 5). This is because the two techniques are restricted by the database 𝐷, which may not contain an instance that is classified as good and represents feasible and plausible actions.
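To make the feasibility and plausibility requirement concrete, constraints of this kind can be modeled as predicates over the factual and the candidate. The rules and the dict-based encoding below are our own illustration in the spirit of PLAF, not GeCo's actual constraint syntax:

```python
# Illustrative feasibility rules in the spirit of PLAF constraints; the
# predicate encoding is our own, not GeCo's actual PLAF syntax.

def feasible(candidate, factual, constraints):
    """A candidate counterfactual survives only if every constraint holds."""
    return all(rule(candidate, factual) for rule in constraints)

constraints = [
    lambda c, x: c["age"] >= x["age"],              # age can only increase
    lambda c, x: c["education"] >= x["education"],  # education never decreases
    lambda c, x: c["income"] >= 0,                  # income stays non-negative
]

factual = {"age": 28, "education": 2, "income": 800}
better  = {"age": 28, "education": 3, "income": 1000}
younger = {"age": 25, "education": 2, "income": 1000}   # infeasible: age decreased
```

A database-restricted method like WIT or CERT can only return instances of 𝐷 that pass such a filter, which is why no explanation may exist at all.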

GeCo's explanations are on average the closest to the original instance. This can be explained by the fact that GeCo is able to find these explanations by changing significantly fewer features. For the Credit dataset, for instance, GeCo can find explanations with 1.27 changes on average, while MACE (the best competitor) changes on average 3.39 features. WIT and CERT change on average 3.97 and 3.15 features, respectively.

Like GeCo, CERT uses a genetic algorithm, but it takes up to 3Γ— longer to compute explanations that do not match the quality of GeCo's. Thus, GeCo's custom genetic algorithm, which is designed to explore counterfactuals with few changes, is very effective.
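The bias toward few changes can be sketched in a few lines: seed the population with single-feature changes only, and let crossover and mutation be the sole ways to accumulate more changed features. The toy loan model, domains, and operator details below are illustrative assumptions, not GeCo's implementation:

```python
import random

def search(x, predict, domains, dist, generations=5, keep=50):
    """Genetic search biased toward few changes: the initial population
    alters exactly one feature, and only crossover grows that set."""
    random.seed(0)  # deterministic for the example
    population = [dict(x, **{f: v}) for f, dom in domains.items() for v in dom]
    rank = lambda c: (predict(c) != 1, dist(c, x))  # counterfactuals first, then closest
    for _ in range(generations):
        survivors = sorted(population, key=rank)[:keep]
        # crossover: merge the changed features of two fit candidates
        children = [dict(a, **{f: b[f] for f in b if b[f] != x[f]})
                    for a, b in zip(survivors, survivors[1:])]
        # mutation: resample one already-changed feature
        mutants = []
        for c in survivors:
            changed = [f for f in c if c[f] != x[f]]
            if changed:
                f = random.choice(changed)
                mutants.append(dict(c, **{f: random.choice(domains[f])}))
        population = survivors + children + mutants
    return min(population, key=rank)

# Toy "loan" model: approve (1) when income - debt >= 800.
x = {"income": 800, "debt": 200, "accounts": 5}
predict = lambda c: int(c["income"] - c["debt"] >= 800)
domains = {"income": [800, 900, 1000, 1100], "debt": [0, 100, 200], "accounts": [1, 3, 5]}
l1 = lambda a, b: sum(abs(a[f] - b[f]) for f in a)
cf = search(x, predict, domains, l1)
```

Because every candidate starts one change away from 𝒙, a counterfactual with a single changed feature is found whenever one exists, before multi-change candidates can dominate.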

Runtime comparison. GeCo is able to compute each explanation in less than the 500ms required for interactive analysis. On average, GeCo is up to 23Γ— faster than MACE. This performance is only matched by WIT, which performs simple look-ups in the database 𝐷 and does not return explanations with the same quality as GeCo's.


Figure 5: Distribution of the number of features changed in the counterfactual for GeCo, MACE, WIT and CERT, on the Credit dataset (top) and the Adult dataset (bottom).

Figure 6: Comparison of GeCo and DiCE for the neural network on 5,000 instances from Adult. (left) Average distance of the counterfactual. (right) Distribution of the number of features changed. Error bars represent one standard deviation.

Table 3: Microbenchmark results for tree-based models.

                         Credit    Adult      Allstate   Yelp
    Generations          4.22      4.66       5.26       3.26
    Explored Candidates  18.1K     15.3K      62.8K      50.8K
    Size of Naive Rep.   62.23K    117.41K    8.03M      12.74M
    Size of Ξ”-Rep.       18.64K    21.88K     412K       502K
    Compression          3.3Γ—      5.4Γ—       19.5Γ—      25.4Γ—

Takeaways for Evaluation with Neural Net. Figure 6 presents the key results for our evaluation on the MLP classifier. We only show the comparison of GeCo and DiCE, because the comparison with WIT and CERT is similar to the one for the decision tree. MACE requires an extensive conversion of the classifier into a logical formula, which is not supported for the considered model.

Since DiCE computes the counterfactual explanation in one pass over a variational auto-encoder, it is able to compute the explanations very efficiently. In our experiments, DiCE was able to find an explanation in 5 milliseconds, whereas GeCo required 250 milliseconds on average. The counterfactuals that DiCE generates, however, have poor quality. Whereas GeCo is again able to find explanations with only a few required changes, DiCE changes on average 5.3 features. As a result, GeCo's explanation is on average 4.4Γ— closer to the original instance. Therefore, we consider GeCo's explanations much more suitable for real-world applications.

6.4 Microbenchmarks

In this section, we present the results for the microbenchmarks.

Breakdown of GeCo's Components. First, we analyze the runtime of each operator presented in Sec. 4, as well as the impact of the Ξ”-representation and partial evaluation (see Sec. 5) on two tree-based models and a multi-layered perceptron (MLP) over the Allstate and Yelp datasets. For each benchmark, we run GeCo for 5 generations on 1,000 instances that have been classified as bad.

Figure 7 presents the results for this benchmark. Init captures the time it takes to compute the feasible space, and to generate and select the fittest candidates for the init population. The times for selection, crossover, and mutation are accumulated over all generations. For each considered scenario, we first present the runtime without the Ξ”-representation and partial evaluation, then with each optimization individually, and finally with both.

The results show that the selection and mutation operators are the most time-consuming operations of the genetic algorithm. This is not surprising, since they operate on tens of thousands of candidates, whereas crossover combines only a few selected candidates.

Partial evaluation of the classifier decreases the runtime of the selection operator by up to 3.2Γ— (Allstate with random forest), which translates into an overall speedup of 1.7Γ—. Partial evaluation is more effective for the random forest model, because we can only partially evaluate the first layer of the MLP.
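The intuition behind partial evaluation can be sketched on a single decision tree (the minimal `Node` encoding below is our own; GeCo applies the technique to full random forests): every branch that tests a feature the search never changes is resolved once, before the genetic loop runs, so each candidate only pays for tests on mutable features.

```python
class Node:
    """A decision-tree node: either an inner test or a leaf value."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature, self.threshold = feature, threshold
        self.left, self.right, self.value = left, right, value

def evaluate(node, instance):
    while node.value is None:
        node = node.left if instance[node.feature] <= node.threshold else node.right
    return node.value

def specialize(node, fixed):
    """Resolve every branch on a fixed feature once; the residual tree
    only tests mutable features, so per-candidate evaluation gets cheaper."""
    if node.value is not None:
        return node
    if node.feature in fixed:
        taken = node.left if fixed[node.feature] <= node.threshold else node.right
        return specialize(taken, fixed)
    return Node(node.feature, node.threshold,
                specialize(node.left, fixed),
                specialize(node.right, fixed))

# Toy tree: deny (0) only when age <= 30 and income <= 900.
tree = Node("age", 30,
            Node("income", 900, Node(value=0), Node(value=1)),
            Node(value=1))
residual = specialize(tree, {"age": 28})   # the age test disappears
```

The residual tree never consults the fixed features, which is also why the speedup is larger for forests of deep trees than for an MLP, where only the first layer can be specialized in this way.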

The Ξ”-representation decreases the runtime of the mutation operator by 3.9Γ— for Allstate and 3.7Γ— for Yelp (with MLP). This speedup is due to the compression achieved by the Ξ”-representation (see below). If the classifier is not partially evaluated, then there is a tradeoff in the runtime for the selection operator, because it requires the materialization of the full feature vector. This materialization increases the runtime of select by up to 2.1Γ— (Yelp, random forest).
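A minimal sketch of the Ξ”-representation idea (the class and its API are our own; Sec. 5 gives the actual design): candidates are grouped by the set of features they change, and only the changed values are stored, so a population where most candidates change one or two of many features shrinks dramatically.

```python
from collections import defaultdict

class DeltaPopulation:
    """Candidates grouped by the set of features they change; only the
    changed values are stored. A sketch of the idea, not GeCo's layout."""
    def __init__(self, factual):
        self.factual = factual
        self.groups = defaultdict(list)   # sorted feature tuple -> value rows

    def add(self, changes):
        key = tuple(sorted(changes))      # e.g. ("f1", "f2")
        self.groups[key].append(tuple(changes[f] for f in key))

    def candidates(self):
        # Full feature vectors are materialized lazily, e.g. for selection.
        for key, rows in self.groups.items():
            for row in rows:
                yield dict(self.factual, **dict(zip(key, row)))

    def stored_values(self):
        return sum(len(key) * len(rows) for key, rows in self.groups.items())

# 1,000 candidates over 10 features: each changes only one or two values.
x = {f"f{i}": 0 for i in range(10)}
pop = DeltaPopulation(x)
for v in range(500):
    pop.add({"f0": v + 1})
    pop.add({"f1": v + 1, "f2": v + 1})
naive_size = len(x) * 1000               # 10,000 values in a naive listing
delta_size = pop.stored_values()         # 500*1 + 500*2 = 1,500 values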

The best performance is achieved if the Ξ”-representation and partial evaluation of the classifier are used together. In this case, there is a significant runtime speedup for both the mutation and selection operators. Overall, this can lead to a performance improvement of 5.2Γ— (Yelp with random forest).

Number of Generations and Explored Candidates. Table 3 shows for each dataset how many generations GeCo needed on


Figure 7: Breakdown of GeCo's runtime into the main operators on Allstate and Yelp, using random forest and MLP classifiers. We present the runtimes (1) without the Ξ”-representation and partial evaluation, (2) with partial evaluation, (3) with the Ξ”-representation, and (4) with both optimizations. The runtime is averaged over 1,000 instances.

average to converge, and how many candidates it explored. The majority (up to 97%) of the candidates were generated by mutate.

Compression by Ξ”-representation. Table 3 compares the sizes of the naive listing representation and the Ξ”-representation. We measure size in terms of the number of represented values, and take the average over all generations of the size of the candidate population after the mutate and crossover operations. Overall, the Ξ”-representation can represent the candidate population up to 25Γ— more compactly than the naive listing representation.

Effect of Distance Function. Recall our distance function (Eq. (2)) and its parameters 𝛼, 𝛽, 𝛾. In Sec. 6.3, we show the explanation quality for GeCo on Credit and Adult using the parameters (𝛼 = 0, 𝛽 = 1, 𝛾 = 0). When setting (𝛼 = 0.5, 𝛽 = 0.5, 𝛾 = 0) and (𝛼 = 0.33, 𝛽 = 0.34, 𝛾 = 0.33), we observe that using the β„“0 norm (𝛼 > 0) further reduces the number of features that GeCo changes. In this case, GeCo always returns explanations that change only a single feature, for both Credit and Adult. The β„“βˆž norm has no significant effect in our experiments, which may be because the β„“1 norm already restricts the maximum change.
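The interplay of the three norms can be illustrated with a small sketch of a distance of this shape (the range-based normalization below is an assumption for the example; Eq. (2) gives the actual definition):

```python
def distance(c, x, ranges, alpha=0.0, beta=1.0, gamma=0.0):
    """alpha*l0 + beta*l1 + gamma*linf over per-feature deltas, normalized
    by each feature's range (the normalization is an assumption here)."""
    deltas = [abs(c[f] - x[f]) / ranges[f] for f in x]
    l0 = sum(d > 0 for d in deltas)   # how many features changed
    l1 = sum(deltas)                  # total amount of change
    linf = max(deltas)                # largest single change
    return alpha * l0 + beta * l1 + gamma * linf

x = {"income": 800, "debt": 200}
c = {"income": 1000, "debt": 200}
ranges = {"income": 1000, "debt": 500}
d_l1  = distance(c, x, ranges)                          # pure l1: 0.2
d_mix = distance(c, x, ranges, alpha=0.5, beta=0.5)     # 0.5*1 + 0.5*0.2 = 0.6
```

With 𝛼 > 0, every changed feature pays a fixed cost regardless of magnitude, which is why the β„“0 term pushes the search toward single-feature explanations.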

6.5 Summary

The experiments show that GeCo is the only system that can provide high-quality explanations in real time. On the Credit and Adult datasets, MACE is the best competitor in terms of quality, but takes orders of magnitude longer to run. The explanations for the fastest competitors (WIT and DiCE) have poor quality.

The Ξ”-representation and partial evaluation of the classifier can significantly reduce GeCo's end-to-end runtime. For our evaluation on the real Allstate and Yelp datasets, the two optimizations improve the end-to-end runtime by a factor of up to 5.2Γ—.

7 CONCLUSIONS

We described GeCo, the first interactive system for counterfactual explanations that supports a complex, real-life semantics of counterfactuals, yet provides answers in real time. GeCo defines a rich search space for counterfactuals by considering both a dataset of example instances and a general-purpose constraint language. It uses a genetic algorithm to search for counterfactuals, which is customized to favor counterfactuals that require the smallest number of changes. We described two powerful optimization techniques that speed up the inner loop of the genetic algorithm: Ξ”-representation and partial evaluation. We demonstrated that, among four other systems reported in the literature, GeCo is the only one that can both compute quality explanations and find them in real time.

At a high level, our work belongs to the more general effort of applying database and compiler optimizations to improve the performance of machine learning tasks [1, 4, 9, 14, 23].

GeCo has two limitations that we plan to address in future work. The first is that we have implemented partial evaluation manually, and it is only supported for random forests and simple neural network classifiers. We plan to extend this optimization to other models. Ideally, we would like to leverage work from the compilers community and use an out-of-the-box partial compiler, but a major obstacle is that GeCo is written in Julia, for which no partial compilers exist.

The second limitation is related to privacy. GeCo uses a dataset to define its sample spaces, and that may leak private information about the users whose data is stored in that dataset. Notice that GeCo does not compute aggregates on the dataset, but uses values that exist in the data; hence techniques from differential privacy do not apply out of the box. A technical solution to this problem would be to generate a different instance, using a differentially private (and, hence, randomized) algorithm, then use that to compute the sample space; inevitably, this will lead to a degradation in the quality of the explanations. The solution that we favor is regulatory: certain combinations of attributes are not considered to be pseudo-identifiers, and, hence, their values are considered public. Explanation systems like GeCo are likely to be deployed in highly regulated settings that legislate what explanations the users are entitled to, and are also likely to legislate what combinations of attributes are private.

ACKNOWLEDGMENTS

This work was partially supported by NSF IIS 1907997, NSF IIS 1954222, and a generous gift from RelationalAI.


REFERENCES

[1] Mahmoud Abo Khamis, Hung Q. Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. In-Database Learning with Sparse Tensors. In PODS. 325–340.

[2] Allstate. 2011. Allstate Claim Prediction Challenge. https://www.kaggle.com/c/ClaimPredictionChallenge

[3] Sercan Arik and Tomas Pfister. 2019. TabNet: Attentive interpretable tabular learning. arXiv preprint arXiv:1908.07442 (2019).

[4] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In SIGMOD. Association for Computing Machinery, New York, NY, USA, 1383–1394.

[5] Nima Asadi, Jimmy Lin, and Arjen P. De Vries. 2013. Runtime optimizations for tree-based machine learning models. IEEE Transactions on Knowledge and Data Engineering 26, 9 (2013), 2281–2292.

[6] Umang Bhatt, Alice Xiang, Shubham Sharma, Adrian Weller, Ankur Taly, Yunhan Jia, Joydeep Ghosh, Ruchir Puri, JosΓ© M. F. Moura, and Peter Eckersley. 2020. Explainable machine learning in deployment. In FAT*. 648–657.

[7] Anthony D. Blaom, Franz Kiraly, Thibaut Lienart, Yiannis Simillides, Diego Arenas, and Sebastian J. Vollmer. 2020. MLJ: A Julia package for composable machine learning. arXiv:2007.12285 [cs.LG]

[8] Dheeru Dua and Casey Graff. 2017. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml

[9] Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J. Gao. 2019. Declarative Recursive Computation on an RDBMS: Or, Why You Should Use a Database for Distributed Machine Learning. PVLDB 12, 7 (March 2019), 822–835.

[10] Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. 1993. Partial Evaluation and Automatic Program Generation. Prentice Hall International.

[11] Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. 2012. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Trans. Vis. Comput. Graph. 18, 12 (2012), 2917–2926.

[12] Amir-Hossein Karimi, Gilles Barthe, Borja Balle, and Isabel Valera. 2020. Model-agnostic counterfactual explanations for consequential decisions. In AISTATS. 895–905.

[13] Ron Kohavi. 1996. Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. In KDD, Vol. 96. 202–207.

[14] Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel. 2015. Learning generalized linear models over normalized data. In SIGMOD. 1969–1984.

[15] Zhicheng Liu and Jeffrey Heer. 2014. The effects of interactive latency on exploratory visual analysis. IEEE Transactions on Visualization and Computer Graphics 20, 12 (2014), 2122–2131.

[16] Claudio Lucchese, Franco Maria Nardini, Salvatore Orlando, Raffaele Perego, Nicola Tonellotto, and Rossano Venturini. 2015. QuickScorer: A fast algorithm to rank documents with additive ensembles of regression trees. In SIGIR. 73–82.

[17] Divyat Mahajan, Chenhao Tan, and Amit Sharma. 2019. Preserving Causal Constraints in Counterfactual Explanations for Machine Learning Classifiers. In CausalML @ NeurIPS.

[18] Ramaravind K. Mothilal, Amit Sharma, and Chenhao Tan. 2020. Explaining machine learning classifiers through diverse counterfactual explanations. In FAT*. 607–617.

[19] Dan Olteanu and Maximilian Schleich. 2016. Factorized Databases. SIGMOD Rec. 45, 2 (2016), 5–16.

[20] Fabian Pedregosa, GaΓ«l Varoquaux, Alexandre Gramfort, et al. 2011. Scikit-learn: Machine Learning in Python. J. Machine Learning Research 12 (2011), 2825–2830.

[21] Marco TΓΊlio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In SIGKDD. 1135–1144.

[22] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.

[23] Maximilian Schleich, Dan Olteanu, Mahmoud Abo Khamis, Hung Q. Ngo, and XuanLong Nguyen. 2019. A Layered Aggregate Engine for Analytics Workloads. In SIGMOD. 1642–1659.

[24] Amir Shaikhha, Yannis Klonatos, and Christoph Koch. 2018. Building Efficient Query Engines in a High-Level Language. ACM Trans. Database Syst. 43, 1, Article 4 (2018), 45 pages.

[25] Shubham Sharma, Jette Henderson, and Joydeep Ghosh. 2020. CERTIFAI: A Common Framework to Provide Explanations and Analyse the Fairness and Robustness of Black-box Models. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 166–172.

[26] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. 2017. Learning important features through propagating activation differences. In ICML. 3145–3153.

[27] Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. 2020. Fooling LIME and SHAP: Adversarial attacks on post hoc explanation methods. In AIES. 180–186.

[28] Ruby Y. Tahboub, GrΓ©gory M. Essertel, and Tiark Rompf. 2018. How to Architect a Query Compiler, Revisited. In SIGMOD. 307–322.

[29] Berk Ustun, Alexander Spangher, and Yang Liu. 2019. Actionable recourse in linear classification. In FAT*. 10–19.

[30] Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2017. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech. 31 (2017), 841.

[31] James Wexler, Mahima Pushkarna, Tolga Bolukbasi, Martin Wattenberg, Fernanda ViΓ©gas, and Jimbo Wilson. 2019. The What-If Tool: Interactive probing of machine learning models. IEEE Transactions on Visualization and Computer Graphics 26, 1 (2019), 56–65.

[32] Ting Ye, Hucheng Zhou, Will Y. Zou, Bin Gao, and Ruofei Zhang. 2018. RapidScorer: Fast tree ensemble evaluation by maximizing compactness in data level parallelization. In SIGKDD. 941–950.

[33] I-Cheng Yeh and Che-hui Lien. 2009. The comparisons of data mining techniques for the predictive accuracy of probability of default of credit card clients. Expert Systems with Applications 36, 2 (2009), 2473–2480.

[34] Yelp. 2017. Yelp Dataset Challenge. https://www.yelp.com/dataset/challenge/