
Explaining Neural Matrix Factorization with Gradient Rollback

Carolin Lawrence, Timo Sztyler, Mathias Niepert¹

¹ NEC Laboratories Europe, Kurfürsten-Anlage 36, D-69115 Heidelberg

{carolin.lawrence, timo.sztyler, mathias.niepert}@neclab.eu

Abstract

Explaining the predictions of neural black-box models is an important problem, especially when such models are used in applications where user trust is crucial. Estimating the influence of training examples on a learned neural model's behavior allows us to identify training examples most responsible for a given prediction and, therefore, to faithfully explain the output of a black-box model. The most generally applicable existing method is based on influence functions, which scale poorly for larger sample sizes and models. We propose gradient rollback, a general approach for influence estimation, applicable to neural models where each parameter update step during gradient descent touches a smaller number of parameters, even if the overall number of parameters is large. Neural matrix factorization models trained with gradient descent are part of this model class. These models are popular and have found a wide range of applications in industry. Especially knowledge graph embedding methods, which belong to this class, are used extensively. We show that gradient rollback is highly efficient at both training and test time. Moreover, we show theoretically that the difference between gradient rollback's influence approximation and the true influence on a model's behavior is smaller than known bounds on the stability of stochastic gradient descent. This establishes that gradient rollback is robustly estimating example influence. We also conduct experiments which show that gradient rollback provides faithful explanations for knowledge base completion and recommender datasets. An implementation is available.¹

Introduction

Estimating the influence a training sample (or a set of training samples) has on the behavior of a machine learning model is a problem with several useful applications. First, it can be used to interpret the behavior of the model by providing an explanation for its output in form of a set of training samples that impacted the output the most. In addition to providing a better understanding of model behavior, influence estimation has also been used to find adversarial examples, to uncover domain mismatch, and to determine incorrect or mislabeled examples (Koh and Liang 2017). Finally, it can also be used to estimate the uncertainty for a particular output by exploring the stability of the output probability before and after removing a small number of influential training samples.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹ https://github.com/carolinlawrence/gradient-rollback

We propose gradient rollback (GR), a novel approach for influence estimation. GR is applicable to neural models trained with gradient descent and is highly efficient especially when the number of parameters that are significantly changed during any update step is moderately sized. This is the case for neural matrix factorization methods. Here, we focus on neural link prediction models for multi-relational graphs (also known as knowledge base embedding models), as these models subsume several other matrix factorization models (Guo et al. 2020). They have found a wide range of industry applications such as in recommender and question answering systems. They are, however, black-box models whose predictions are not inherently interpretable. Other methods, such as rule-based methods, might be more interpretable, but typically have worse performance. Hence, neural matrix factorization methods would greatly benefit from being more interpretable (Bianchi et al. 2020). For an illustration of matrix factorization and GR see Figure 1.

We explore two crucial questions regarding the utility of GR. First, we show that GR is highly efficient at both training and test time, even for large datasets and models. Second, we show that its influence approximation error is smaller than known bounds on the stability of stochastic gradient descent for non-convex problems (Hardt, Recht, and Singer 2016). The stability of an ML model is defined as the maximum change of its output one can expect on any sample when retrained on a slightly different set of training samples. The relationship between uniform stability and generalization of a learning system is a seminal result (Bousquet and Elisseeff 2002). Here, we establish a close connection between the stability of a learning system and the challenge of estimating training sample influence and, therefore, explaining the model's behavior. Intuitively, the more stable a model, the more likely it is that we can estimate the influence of training samples well. We show theoretically that the difference between GR's influence approximation and the true influence on the model behavior is (strictly) smaller than known bounds on the stability of stochastic gradient descent.

We perform experiments on standard matrix factorization datasets including those for knowledge base completion and recommender systems. Concretely, GR can explain a prediction of a learnt model by producing a ranked list of training examples, where each instance of the list contributed to changing the likelihood of the prediction and the list is sorted from highest to lowest impact. To produce this list, we (1) estimate the influence of training examples during training and (2) use this estimation to determine the contribution each training example made to a particular prediction. To evaluate whether GR selected training examples relevant for a particular prediction, we remove the set of training examples, retrain the model from scratch and check if the likelihood for the particular prediction decreased. Compared to baselines, GR can identify subsets of training instances that are highly influential to the model's behavior. These can be represented as a graph to explain a prediction to a user.

Neural Matrix Factorization

We focus on neural matrix factorization models for link prediction in multi-relational graphs (also known as knowledge graph embedding models) for two reasons. First, these models are popular with a growing body of literature. There has been increasing interest in methods for explaining and debugging knowledge graph embedding methods (Bianchi et al. 2020). Second, matrix factorization methods for recommender systems can be seen as instances of neural matrix factorization where one uses pairs of entities instead of (entity, relation type, entity)-triples (Guo et al. 2020).

We consider the following general setting of representation learning for multi-relational graphs². There is a set of entities E and a set of relation types R. A knowledge base K consists of a set of triples d = (s, r, o) where s ∈ E is the subject (or head entity), r ∈ R the relation type, and o ∈ E the object (or tail entity) of the triple d. For each entity and relation type we learn a vector representation with hidden dimension h.³ For each entity e and relation type r we write e and r, respectively, for their vector representations. Hence, the set of parameters of a knowledge base embedding model consists of one matrix E ∈ R^{|E|×h} and one matrix R ∈ R^{|R|×h}. The rows of matrix E are the entity vector representations (embeddings) and the rows of R are the relation type vector representations (embeddings). To refer to the set of all parameters we often use the term w ∈ Ω to improve readability, with Ω denoting the parameter space. Hence, e = w[e], that is, the embedding of entity e is the part of w pertaining to entity e. Analogously for the relation type embeddings. Moreover, for every triple d = (s, r, o), we write w[d] = (w[s], w[r], w[o]) = (s, r, o) to denote the triple of parameter vectors associated with d for w ∈ Ω. We extend element-wise addition and multiplication to triples of vectors in the obvious way.

We can now define a scoring function φ(w; d) which, given the current set of parameters w, maps each triple d to a real-valued score. Note that, for a given triple d, the set of parameters involved in computing the value of φ(w; d) is a small subset of w. Usually, φ is a function of the vector representations of entities and relation types of the triple d = (s, r, o), and we write φ(s, r, o) to make this explicit.

² Our results with respect to 3-dimensional matrices apply to k-dimensional matrices with k ≥ 2.

³ Our results generalize to methods where relation types are represented with more than one vector such as in RESCAL (Nickel, Tresp, and Kriegel 2011) or COMPLEX (Trouillon et al. 2016).

Example 1. Given a triple d = (s, r, o), the scoring function of DISTMULT is defined as φ(w; d) = 〈s, r, o〉, that is, the inner product of the three embeddings.
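To make the scoring function concrete, the following is a minimal sketch (our own illustration, not the reference implementation) of the DISTMULT score from Example 1 using NumPy; the embedding values and the hidden dimension h are arbitrary.

```python
# Minimal sketch of the DISTMULT score of Example 1: the trilinear product
# <s, r, o> of the subject, relation and object embeddings.
import numpy as np

def distmult_score(s: np.ndarray, r: np.ndarray, o: np.ndarray) -> float:
    """Return <s, r, o> = sum_i s_i * r_i * o_i."""
    return float(np.sum(s * r * o))

# Toy usage with hidden dimension h = 4 and random embeddings.
rng = np.random.default_rng(0)
h = 4
s, r, o = rng.normal(size=h), rng.normal(size=h), rng.normal(size=h)
print(distmult_score(s, r, o))
```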

To train a neural matrix factorization model means running an iterative learning algorithm to find a set of parameter values w that minimize a loss function L(w; D), which combines the scores of a set of training triples D into a loss value. We consider general iterative update rules such as stochastic gradient descent (SGD) of the form G : Ω → Ω which map a point w in parameter space Ω to another point G(w) ∈ Ω. A typical loss function is L(w; D) = ∑_{d∈D} ℓ(w; d) with

ℓ(w; d = (s, r, o)) = − log Pr(w; o | s, r)   and   (1)

Pr(w; o | s, r) = exp(φ(w; (s, r, o))) / ∑_{o′} exp(φ(w; (s, r, o′))).   (2)

The set of o′ in the denominator is mostly a set of randomly sampled entities (negative sampling), but recent results have shown that computing the softmax over all entities can be advantageous. Finally, once a model is trained, Equation 2 can be used to determine for a test query (s, r, ?) how likely each entity o is by summing over all other entities o′.
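As an illustration of Equations 1 and 2, the following sketch (an assumed setup with NumPy arrays, not the paper's implementation) computes the softmax over candidate objects and the resulting negative log-likelihood for one triple.

```python
# Sketch of Equations 1 and 2 for DISTMULT: Pr(o | s, r) as a softmax over
# candidate objects and the per-triple cross-entropy loss.
import numpy as np

def object_distribution(E: np.ndarray, s: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Pr(o' | s, r) for every candidate object embedding (rows of E)."""
    scores = E @ (s * r)          # phi(w; (s, r, o')) = <s, r, o'> for all o'
    scores -= scores.max()        # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

def triple_loss(E: np.ndarray, s: np.ndarray, r: np.ndarray, o_idx: int) -> float:
    """Equation 1: -log Pr(o | s, r) for the gold object with row index o_idx."""
    return float(-np.log(object_distribution(E, s, r)[o_idx]))
```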

Gradient Rollback for Matrix Factorization

A natural way of explaining the model's output for a test triple d is to find the training triple d′ such that retraining the model without d′ changes (decreases or increases) f(w′; d) the most. The function f can be the loss function. Most often, however, it is the scoring function as a proxy of the loss.

d_expl = arg max_{d′∈D} [f(w; d) − f(w′; d)]   or
d_expl = arg min_{d′∈D} [f(w; d) − f(w′; d)],   (3)

where w′ are the parameters after retraining the model with D − d′. This can directly be extended to obtaining the top-k most influential triples. While this is an intuitive approach, solving the optimization problems above is too expensive in practice as it requires retraining the model |D| = n times for each triple one seeks to explain.⁴

Instead of retraining the model, we propose gradient rollback (GR), an approach for tracking the influence each training triple has on a model's parameters during training. Before training and for each triple d′ and the parameters it can influence w(d′), we initialize its influence values with zero: ∀d′, γ[d′, w(d′)] = 0. We now record the changes in parameter values during training that each triple causes. With stochastic gradient descent (SGD), for instance, we iterate through the training triples d′ ∈ D and record the changes each one makes at time t, by setting its influence value to

γ[d′, w_{t+1}(d′)] ← γ[d′, w_t(d′)] − α ∇f(w_t; d′).⁵

⁴ For example, FB15K-237 contains 270k training triples and training one model with TF2 on a RTX 2080 Ti GPU takes about 15 minutes. To explain one triple by retraining |D| = 270k times would take over 2 months.

⁵ For any iterative optimizer such as SGD and Adam, the parameter updates are readily available and can be obtained efficiently during training.
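The following sketch illustrates this bookkeeping for plain SGD. It is an assumed, simplified setup: embeddings live in a dictionary keyed by entity/relation name, and grads is a hypothetical helper returning the gradients of f(w_t; d′) for the three embeddings the triple touches.

```python
# Sketch of gradient rollback during training with plain SGD. gamma[d][name]
# accumulates the change that triple d caused to the embedding `name`.
from collections import defaultdict

def train_with_rollback(w, triples, grads, alpha=0.1, epochs=1):
    gamma = defaultdict(dict)                   # influence table, initially empty (= zero)
    for _ in range(epochs):
        for d in triples:                       # d = (s, r, o)
            g = grads(w, d)                     # {'s': grad_s, 'r': grad_r, 'o': grad_o}
            for key, name in zip(('s', 'r', 'o'), d):
                update = -alpha * g[key]        # SGD step on the touched embedding
                w[name] = w[name] + update
                gamma[d][name] = gamma[d].get(name, 0.0) + update  # record influence
    return w, gamma
```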


[Figure 1 depicts an example knowledge graph with the entities Seattle, USA, Space Needle, Volunteer Park and William H. Seward, connected by the relation types locatedIn, hasSculptureOf and bornIn, together with the factorization f(E, R) and the tracked updates γ.]

Figure 1: An illustration of the problem and the proposed method. A knowledge base of facts is represented as a sparse 3-dimensional matrix. Neural matrix factorization methods perform an implicit matrix decomposition by minimizing a loss function f operating on the entity and relation representations. Gradient rollback tracks the parameter changes γ caused by each sample during training. At test time, the aggregated updates are used to estimate triple influence without the need to retrain the model. Influence estimates are used to provide triples (blue and red) explaining the model's behavior for a test triple (dashed).

After training, γ can be utilised as a look-up table: for a triple d that we want to explain, we look up the influence each triple d′ had on the weights w(d) of d via γ[d′, w(d)] = (γ_s, γ_r, γ_o). Concretely, a triple d′ has an influence on d at any point where the two triples have a common element with respect to their subject, relation or object. Consequently, explanations can be any triple d′ that has either an entity or the relation in common with d.

Let w′ be the parameters of the model when trained on D − d′. At prediction time, we now approximate, for every test triple d, the difference f(w; d) − f(w′; d) with f(w; d) − f(w − γ[d′, w(d)]; d), that is, by rolling back the influence triple d′ had on the parameters of triple d during training. We then simply choose the triple d′ that maximizes this difference.
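A minimal test-time sketch of this rollback (reusing the data layout assumed in the training sketch above): for a candidate training triple d′, we subtract from the test triple's embeddings the recorded influence γ[d′, ·] wherever the two triples overlap and re-evaluate the probability.

```python
# Sketch of the test-time approximation f(w - gamma[d', w(d)]; d): roll back
# d_prime's influence on the embeddings of the test triple d and re-score.
import numpy as np

def prob_of_object(w, objects, d):
    """Pr(o | s, r) under parameters w; `objects` lists candidate object names."""
    s, r, o = d
    scores = np.array([np.sum(w[s] * w[r] * w[e]) for e in objects])
    scores -= scores.max()
    probs = np.exp(scores) / np.exp(scores).sum()
    return float(probs[objects.index(o)])

def rolled_back_prob(w, gamma, objects, d, d_prime):
    """Approximate Pr(o | s, r) as if the model had been trained without d_prime."""
    w_rb = dict(w)                      # shallow copy; only touched rows are replaced
    for name in set(d):                 # embeddings of the test triple d ...
        if name in gamma[d_prime]:      # ... that d_prime also influenced
            w_rb[name] = w[name] - gamma[d_prime][name]
    return prob_of_object(w_rb, objects, d)
```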

There are two crucial questions determining the utility of GR, which we address in the following:

1. The resource overhead of GR during training and at test time; and

2. How closely f(w − γ[d′, w(d)]; d) approximates f(w′; d).

Resource Overhead of Gradient Rollback

To address the first question, let us consider the computational and memory overhead for maintaining γ[d, w_t(d)] during training for each d. The loss function's normalization term is (in expectation) constant for all triples and can be ignored. We verified this empirically and it is indeed a reasonable assumption. Modern deep learning frameworks compute the gradients and make them accessible in each update step. Updating γ[d, w_t(d)] with the given gradients takes O(h), that is, it is possible in constant time. Hence, computationally, the overhead is minimal. Now, to store γ[d, w_t(d)] for every d ∈ D we need h|D| + 2h|D| floats. Hence, the memory overhead of gradient rollback is 3h|D|. This is about the same size as the parameters of the link prediction model itself. In a nutshell:

Gradient rollback has an O(h|D|) computational and 3h|D| memory overhead during training.
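A quick back-of-the-envelope check of this statement, with assumed values: hidden dimension h = 100, the roughly 270k training triples of FB15K-237 mentioned in footnote 4, and 4-byte floats.

```python
# Rough memory estimate for the influence table gamma: 3 * h * |D| floats.
h, num_triples, bytes_per_float = 100, 270_000, 4   # assumed values
overhead_gb = 3 * h * num_triples * bytes_per_float / 1e9
print(f"{overhead_gb:.2f} GB")                      # ~0.32 GB
```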

At test time, we want to understand the computational complexity of computing

d_expl = arg max_{d′∈D} f(w; d) − f(w − γ[d′, w(d)]; d)   or
d_expl = arg min_{d′∈D} f(w; d) − f(w − γ[d′, w(d)]; d).   (4)

We have γ[d′, w(d)] ≠ 0 only if d and d′ have at least one entity or relation type in common. Hence, to explain a triple d we only have to consider triples d′ adjacent to d in the knowledge graph, that is, triples where either of the arguments overlap. Let adj_max and adj_avg be the maximum and average number of adjacent triples in the knowledge graph. Then, we have the following:

Gradient rollback requires at most adj_max + 1 and on average adj_avg + 1 computations of the function f to explain a test triple d.
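A small sketch (assuming triples are plain (s, r, o) tuples) of how the candidate set can be restricted to adjacent triples with an inverted index, so that only the adjacent triples need to be scored per test triple:

```python
# Sketch of an inverted index from graph elements to the training triples that
# contain them; candidates for explaining d are the triples adjacent to d.
from collections import defaultdict

def build_adjacency_index(train_triples):
    index = defaultdict(set)
    for d in train_triples:
        for element in d:          # subject, relation type, object
            index[element].add(d)
    return index

def adjacent_triples(index, d):
    """All training triples sharing at least one element with d."""
    s, r, o = d
    return index[s] | index[r] | index[o]
```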

Approximation Error of Gradient Rollback

To address the second question above, we need, for every pair of triples d, d′, to bound the expression

E |f(w − γ[d′, w(d)]; d) − f(w′; d)|,

where w are the parameter values resulting from training f on all triples D and w′ are the parameter values resulting from training f on D − d′. If the above expression can be bounded and said bound is lower than what one would expect due to randomness of the iterative learning dynamics, the proposed gradient rollback approach would be highly useful. We use the standard notion of stability of learning algorithms (Hardt, Recht, and Singer 2016). In a nutshell, our theoretical results will establish that:

Gradient rollback can approximate, for any d′ ∈ D, the changes of a scoring/loss function one would observe if a model were to be retrained on D − d′. The approximation error is in expectation lower than known bounds on the stability of stochastic gradient descent.

Definition 1. A function f is L-Lipschitz if for all u, v in the domain of f we have ‖∇f(u)‖ ≤ L. This implies that |f(u) − f(v)| ≤ L ‖u − v‖.


We analyze the output of stochastic gradient descent on two data sets, D and D − d′, that differ in precisely one triple. If f is L-Lipschitz for every example d, we have

E |f(w; d) − f(w′; d)| ≤ L E ‖w − w′‖

for all w and w′. A vector norm in this paper is always the 2-norm. Hence, we have

E |f(w − γ[d′, w(d)]; d) − f(w′; d)| ≤ L E ‖w − γ[d′, w(d)] − w′‖

and we can assess the approximation error by tracking the extent to which the parameter values w and w′ from two coupled iterative learning dynamics diverge over time. Before we proceed, however, let us formally define some concepts and show that they apply to typical scoring functions of knowledge base embedding methods.

Definition 2. A function f : Ω → R is β-smooth if for all u, v in the domain of f we have ‖∇f(u) − ∇f(v)‖ ≤ β ‖u − v‖.

To prove Lipschitz and β-smoothness properties of a function f(w; d) for all w ∈ Ω and d ∈ D, we henceforth assume that the norm of the entity and relation type embeddings is bounded by a constant C > 0. That is, we assume max_{r∈R} ‖w[r]‖ ≤ C and max_{e∈E} ‖w[e]‖ ≤ C for all w ∈ Ω. This is a reasonable assumption for two reasons. First, several regularization techniques constrain the norm of embedding vectors. For instance, the unit norm constraint, which was used in the original DISTMULT paper (Yang et al. 2015), enforces that C = 1. Second, even in the absence of such constraints we can assume a bound on the embedding vectors' norms as we are interested in the approximation error for and the stability of a given model that was obtained from running SGD a finite number of steps using the same parameter initializations. When running SGD, for D and all D − d′, a finite number of steps with the same initialization, the encountered parameters w and w′ in each step of each run of SGD form a finite set. Hence, for our purposes, we can assume that f's domain Ω is compact. Given this assumption, we show that the inner product, which is used in several scoring functions, is L-Lipschitz and β-smooth on Ω. The proofs of all lemmas and theorems can be found in the appendix.

Lemma 2. Let φ be the scoring function of DISTMULT defined as φ(w; d = (s, r, o)) = 〈s, r, o〉 with w(d) = (s, r, o), and let C be the bound on the norm of the embedding vectors for all w ∈ Ω. For a given triple d = (s, r, o) and all w, w′ ∈ Ω, we have that

|φ(w; d) − φ(w′; d)| ≤ 2C² ‖w − w′‖.

Lemma 3. Let φ be the scoring function of DISTMULT defined as φ(w; d = (s, r, o)) = 〈s, r, o〉 with w(d) = (s, r, o), and let C be the bound on the norm of the embedding vectors for all w ∈ Ω. For a given triple d = (s, r, o) and all w, w′ ∈ Ω, we have that

‖∇φ(w; d) − ∇φ(w′; d)‖ ≤ 4C ‖w − w′‖.

Considering typical KG embedding loss functions and the softmax and sigmoid function being 1-Lipschitz, this implies that the following theoretical analysis of gradient rollback applies to a large class of neural matrix factorization models. Let us first define an additional property iterative learning methods can exhibit.

Definition 3. An update rule G : Ω → Ω is η-expansive if

sup_{u,v∈Ω} ‖G(u) − G(v)‖ / ‖u − v‖ ≤ η.

Consider the gradient updates G_1, ..., G_T and G′_1, ..., G′_T induced by running stochastic gradient descent on D and D − d′, respectively. Every gradient update changes the parameters w and w′ of the two coupled models. Due to the difference in size, there is one gradient update G_i whose corresponding update G′_i does not change the parameters w′. Again, note that there is always a finite set of parameters w and w′ encountered during training.

We can derive a stability bound for stochastic gradient descent run on the two sets, D and D − d′. The following theorem and its proof are an adaptation of Theorem 3.21 in Hardt, Recht, and Singer (2016) and the corresponding proof. There are two changes compared to the original theorem. First, f is not assumed to be a loss function but any function that is L-Lipschitz and β-smooth. For instance, it could be that the loss function of the model is defined as g(f(·)) for some Lipschitz and smooth function g. The proof of the original theorem does not rely on f being a loss function and it only requires that f(·; d) ∈ [0, 1] is an L-Lipschitz and β-smooth function for all inputs. Second, the original proof assumed two training datasets of the same size that differ in one of the samples. We assume two sets, D and D − d′, where D − d′ has exactly one sample less than D. The proof of the theorem is adapted to this setting.

Theorem 1. Let f(·; d) ∈ [0, 1] be an L-Lipschitz and β-smooth function for every possible triple d and let c be the initial learning rate. Suppose we run SGD for T steps with monotonically non-increasing step sizes α_t ≤ c/t on two different sets of triples D and D − d′. Then, for any d,

E |f(w_T; d) − f(w′_T; d)| ≤ ((1 + 1/(βc)) / (n − 1)) (cL²)^{1/(βc+1)} T^{βc/(βc+1)},

with w_T and w′_T the parameters of the two models after running SGD. We name the right term in the above inequality Λ_stab-nc.

The assumption f(·; d) ∈ [0, 1] of the theorem is fulfilled if we use the loss from Equation 1, as long as we are interested in the stability of the probability distribution of Equation 2. That is because the cross-entropy loss applied to a softmax distribution is 1-Lipschitz and 1-smooth where the derivative is taken with respect to the logits. Hence, the L-Lipschitz and β-smoothness assumption holds also for the probability distribution of Equation 2. In practice, estimating the influence on the probability distribution from Equation 2 is what one is interested in and what we evaluate in our experiments.

The following lemma generalizes the known (1 + αβ)-expansiveness property of the gradient update rule (Hardt, Recht, and Singer 2016). It establishes that the increase of the distance between w − γ and w′ after one step of gradient descent is at most as much as the increase in distance between two parameter vectors corresponding to D and D − d′ after one step of gradient descent.

Lemma 4. Let f : Ω → R be a function and let G(w) = w − α∇f(w) be the gradient update rule with step size α. Moreover, assume that f is β-smooth. Then, for every w, w′, γ ∈ Ω we have

‖G(w) − γ − G(w′)‖ ≤ ‖w − γ − w′‖ + αβ ‖w − w′‖.

Lemma 5. Let f(·; d) be an L-Lipschitz and β-smooth function. Suppose we run SGD for T steps on two sets of triples D and D − d′ for any d′ ∈ D and with learning rate α_t at time step t. Moreover, let

∆_t = E[ ‖w_t − w′_t‖ | ‖w_{t0} − w′_{t0}‖ = 0 ]   and
∆̄_t = E[ ‖w_t − γ[d′, w_t(d)] − w′_t‖ | ‖w_{t0} − w′_{t0}‖ = 0 ]

for some t_0 ∈ {1, ..., n}. Then, for all t ≥ t_0,

∆̄_{t+1} < (1 − 1/n)(1 + α_t β) ∆_t + (1/n)(∆_t + α_t L).

The lemma has broader implications since it can be extended to other update rules G as long as they fulfill an η-expansive property and their individual updates are bounded. For each of these update rules the above lemma holds.

The following theorem establishes that the approximation error of gradient rollback is smaller than a known stability bound of SGD. It uses Lemma 5 and the proof of Theorem 1.

Theorem 2. Let f(·; d) ∈ [0, 1] be an L-Lipschitz and β-smooth function. Suppose we run SGD for T steps with monotonically non-increasing step sizes α_t ≤ c/t on two sets of triples D and D − d′. Let w_T and w′_T, respectively, be the resulting parameters. Then, for any triple d that has at least one element in common with d′ we have

E |f(w_T − γ[d′, w_T(d)]; d) − f(w′_T; d)| < Λ_stab-nc.

The previous results establish a connection between estimating the influence of training triples on the model's behavior using GR and the stability of SGD when used to train the model. An interesting implication is that regularization approaches that improve the stability (by reducing the Lipschitz constant and/or the expansiveness properties of the learning dynamics, cf. Hardt, Recht, and Singer (2016)) also reduce the error bound of GR. We can indeed verify this empirically.

Related Work

The first neural link prediction method for multi-relational graphs performing an implicit matrix factorization is RESCAL (Nickel, Tresp, and Kriegel 2011). Numerous scoring functions have since been proposed. Popular examples are TRANSE (Bordes et al. 2013), DISTMULT (Yang et al. 2015), and COMPLEX (Trouillon et al. 2016). Knowledge graph embedding methods have been mainly evaluated through their accuracy on link prediction tasks. A number of papers have recently shown that with appropriate hyperparameter tuning, COMPLEX, DISTMULT, and RESCAL are highly competitive scoring functions, often achieving state-of-the-art results (Kadlec, Bajgar, and Kleindienst 2017; Ruffinelli, Broscheit, and Gemulla 2020; Jain et al. 2020). There are a number of proposals for combining rule-based and matrix factorization methods (Rocktäschel, Singh, and Riedel 2015; Guo et al. 2016; Minervini et al. 2017), which can make link predictions more interpretable. In contrast, we aim to generate faithful explanations for non-symbolic knowledge graph embedding methods.

There has been an increasing interest in understanding model behavior through adversarial attacks (Biggio, Fumera, and Roli 2014; Papernot et al. 2016; Dong et al. 2017; Ebrahimi et al. 2018). Most of these approaches are aimed at visual data. There are, however, several approaches that consider adversarial attacks on graphs (Dai et al. 2018; Zügner, Akbarnejad, and Günnemann 2018). While analyzing attacks can improve model interpretability, the authors focused on neural networks for single-relational graphs and the task of node classification. For a comprehensive discussion of adversarial attacks on graphs we refer the reader to a recent survey (Chen et al. 2020). There is prior work on adversarial samples for KGs but with the aim to improve accuracy and not model interpretability (Minervini et al. 2017; Cai and Wang 2018).

There are two recent papers that directly address the problem of explaining graph-based ML methods. First, GNNEXPLAINER (Ying et al. 2019) is a method for explaining the predictions of graph neural networks and, specifically, graph convolutional networks for node classification. Second, the work most related to ours proposes CRIAGE, which aims at estimating the influence of triples in KG embedding methods (Pezeshkpour, Tian, and Singh 2019). Given a triple, their method only considers a neighborhood to be the set of triples with the same object. Moreover, they derive a first-order approximation of the influence in line with work on influence functions (Koh and Liang 2017). In contrast, GR tracks the changes made to the parameters during training and uses the aggregated contributions to estimate influence. In addition, we establish a theoretical connection to the stability of learning systems. Influence functions, a concept from robust statistics, were applied to black-box models for assessing the changes in the loss caused by changes in the training data (Koh and Liang 2017). The paper also proposed several strategies to make influence functions more efficient. It was shown in prior work (Pezeshkpour, Tian, and Singh 2019), however, that influence functions are not usable for typical knowledge base embedding methods as they scale very poorly. Consequently, our paper is the first to offer an efficient and theoretically founded method of tracking influence in matrix factorization models for explaining predictions via providing the most influential training instances.

Experiments

Identifying explanations. We analyze the extent to which GR can approximate the true influence of a triple (or set of triples). For a given test triple d = (s, r, o) and a trained model with parameters w, we use GR to identify a set of training triples S ⊆ D that has the highest influence on d. To this end, we first identify the set of triples N adjacent to d (that is, triples that contain at least one of s, r or o) and compute ∆(d′, d) = Pr(w; o | s, r) − Pr(w − γ[d′, w(d)]; o | s, r) for each d′ ∈ N. We then let S be the set resulting from picking (a) exactly k triples d′ ∈ N with the k largest values for ∆(d′, d) or (b) all triples with a positive ∆(d′, d). We refer to the former as GR-k and the latter as GR-ALL.

                         PD%                                        TC%
          NATIONS      FB15K-237    MOVIELENS       NATIONS      FB15K-237    MOVIELENS
           1  10  ALL   1  10  ALL   1  10  ALL      1  10  ALL   1  10  ALL   1  10  ALL
  NH      54  66   82  59  67   83  53  52   92     18  36   70  35  45   59   3  14   72
  GR      93  97  100  77  82   96  68  82  100     38  83   97  38  58   85  20  38  100
  ∆ ↑     39  31   18  18  15   13  15  30    8     20  47   27   3  13   26  17  24   28
  NH      59  68   80  72  83   91  53  61   91     13  47   76  69  70   79   5  13   77
  GR      90  95  100  80  88   99  73  71  100     38  85   99  65  81   91  29  48  100
  ∆ ↑     31  27   20   8   5    8  20  10    9     25  38   23  -4  11   12  24  35   23

Table 1: Results (DISTMULT at the top, COMPLEX at the bottom) of removing a set of training triples (of size 1, 10, or ALL), randomly chosen from triples adjacent to the test triples (NH) or by using gradient rollback (GR). For ALL we delete on average (± standard deviation), NATIONS: 261±56, FB15K-237: 2.9k±2.3k and MOVIELENS: 16.7k±5k (DISTMULT). The average number of adjacent triples (± standard deviation) is, NATIONS: 508±101, FB15K-237: 5.5k±4.5k and MOVIELENS: 23.7k±7.8k. GR removes sets that lead to a larger change in probability and top-1 predictions (difference to NH is given in row ∆ ↑).

To evaluate if the set of chosen triples S are faithful explanations (Jacovi and Goldberg 2020) for the prediction, we follow the evaluation paradigm "RemOve And Retrain (ROAR)" of Hooker et al. (2019): we let D′ = D − S and retrain the model from scratch with training set D′, leading to a new model with parameters w′. After retraining, we can now observe Pr(w′; o | s, r), which is the true probability for d when removing the explanation set S from the training set, and we can use this to evaluate GR.⁶ Since it is expensive to retrain a model for each test triple, we restrict the analysis to explaining only the triple (s, r, o) with o = arg max_o Pr(w; o | s, r) for each test query (s, r, ?), that is, the triple with the highest score according to the model.

Evaluation metrics. We use two different metrics to evaluate GR. First, if the set S contains triples influential to the test triple d, then the probability of d under the new model (trained without S) should be smaller, that is, Pr(w′; o | s, r) < Pr(w; o | s, r). We measure the ability of GR to identify triples causing a Probability Drop and name this measure PD%. A PD of 100% would imply that each set S created with GR always caused a drop in probability for d after retraining with D′. In contrast, when removing random training triples, we would expect a PD% close to 50%. An additional way to evaluate GR is to measure whether the removal of S causes the newly trained model to predict a different top-1 triple, that is, arg max_o Pr(w; o | s, r) ≠ arg max_o Pr(w′; o | s, r). This measures the ability of GR to select triples causing the Top-1 prediction to Change and we name this TC%. If the removal of S causes a top-1 change, it suggests that the training samples most influential to the prediction of triple d have been removed. This also explores the ability of GR to identify triples for effective removal attacks on the model. We compare GR to two baselines: NH-k removes exactly k random triples adjacent to d, and NH-ALL randomly removes the same number of triples as GR-ALL adjacent to d.

⁶ We fix all random seeds and use the same set of negative samples during (re-)training to avoid additional randomization effects.

In an additional experiment, we also directly compare GR with CRIAGE (Pezeshkpour, Tian, and Singh 2019).
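Before turning to the datasets, the following sketch summarizes how a GR-k explanation set is selected for one test triple and which per-triple indicators feed into PD% and TC% (hypothetical helper names, reusing the functions from the earlier sketches; the retraining step itself is omitted):

```python
# Sketch of selecting a GR-k explanation set and of the per-triple indicators
# that are aggregated into PD% and TC% after retraining without that set.
def select_gr_k(w, gamma, index, objects, d, k):
    """Rank adjacent training triples by the estimated probability drop Delta(d', d)."""
    base = prob_of_object(w, objects, d)
    deltas = {d_p: base - rolled_back_prob(w, gamma, objects, d, d_p)
              for d_p in adjacent_triples(index, d)}
    return sorted(deltas, key=deltas.get, reverse=True)[:k]

def pd_tc_indicators(prob_before, prob_after, top1_before, top1_after):
    """True/False contributions of one test triple to PD% and TC%."""
    return prob_after < prob_before, top1_before != top1_after
```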

Datasets & Training. We use DISTMULT (Yang et al. 2015) and COMPLEX (Trouillon et al. 2016) as scoring functions since they are popular and competitive (Kadlec, Bajgar, and Kleindienst 2017). We report results on three datasets: two knowledge base completion datasets (NATIONS (Kok and Domingos 2007), FB15K-237 (Toutanova et al. 2015)) and one recommendation dataset (MOVIELENS (Harper and Konstan 2015)). MOVIELENS contains triples of 5-star ratings users have given to movies; as in prior work, the set of entities is the union of movies and users and the set of relations are the ratings (Pezeshkpour, Chen, and Singh 2018). When predicting movies for a user, we simply filter out other users. Statistics and hyperparameter settings are in the appendix. Since retraining is costly, we only explain the top-1 prediction for a set of 100 random test triples for both FB15K-237 and MOVIELENS. For NATIONS we use the entire test set.

We want to emphasize that we always retrain completely from scratch and use hyperparameter values typical for state-of-the-art KG completion models, leading to standard results for DISTMULT (see Table 4 in the appendix). Prior work either retrained from pretrained models (Koh and Liang 2017; Pezeshkpour, Tian, and Singh 2019) or used non-standard strategies and hyperparameters to ensure model convergence (Pezeshkpour, Tian, and Singh 2019).

Results. The top half of Table 1 lists the results using DISTMULT for GR and NH. For NATIONS and DISTMULT, removing 1 triple from the set of adjacent triples (NH-1) is close to random, causing a drop in probability (PD%) about half of the time. In contrast, removing the 1 training instance considered most influential by GR (GR-1) gives a PD of over 90%. GR-1 leads to a top-1 change (TC) in about 40% of the cases. This suggests that there is more than one influential triple per test triple. When removing all triples identified by GR (GR-ALL) we observe PD and TC values of nearly 100%. In contrast, removing the same number of adjacent triples randomly leads to a significantly lower PD and TC. In fact, removing the highest ranked triple (GR-1) impacts the model behavior more than deleting about half the adjacent triples at random (NH-ALL) in terms of PD%.


[Figure 2 consists of six scatter plots of the GR-estimated probability (x-axis, "GR [Pr]") against the true probability after retraining (y-axis, "True Retrain [Pr]"): Nations (143 test instances, r=0.81, L=6.93), FB15k-237 (100 test instances, r=0.94, L=28.24), Movielens (100 test instances, r=0.79, L=16.54), Nations Unit Norm (r=0.97, L=0.67), Nations Max Norm 2.0 (r=0.92, L=0.72), and Nations Max Norm 3.0 (r=0.85, L=4.19).]

Figure 2: Scatter plots illustrating the correlation between the probability values estimated by GR and after a retraining of the model, as well as the Pearson correlation value r and Lipschitz constant L. The high correlations indicate that GR is able to accurately approximate the probability and, therefore, the change in probability, of a triple after retraining the model without the triple that GR deemed to cause the highest change in probability. The bottom row shows results when imposing constraints on the weights for the NATIONS dataset: the stronger the constraint (unit norm > maximum norm 2.0 > 3.0 > None), the smaller the Lipschitz constant and the better the Pearson correlation.

                 PD%                     TC%
            1    3    5    10       1    3    5    10
  NH       49   50   51   60        3   13   14   20
  CRIAGE   91   93   94   95       16   36   48   68
  GR-O     92   94   97   97       25   48   62   68
  GR       92   96   97   98       27   51   66   73

Table 2: Results on NATIONS, using DISTMULT with a SIGMOID activation function and for k = 1, 3, 5, 10. Bold marks the best results; underlined results mark a statistical significance with regards to CRIAGE at p ≤ 0.01 using an approximate randomization test; all results are statistically significant with regards to NH. CRIAGE performs worse than both GR and GR-O, especially with regards to TC. Furthermore, CRIAGE considers only training triples with the same object as an explanation and is significantly slower.

For FB15K-237 we observe that GR is again better at identifying influential triples compared to the NH baselines. Moreover, only when the NH method deletes over half of the adjacent triples at random (NH-ALL) does it perform on par with GR-10 in terms of PD% and TC%. Crucially, deleting one instance at random (NH-1) causes a high TC% compared to the other two datasets. This suggests that the model is less stable for this dataset. As our theoretical results show, GR works better on more stable models, and this could explain why GR-ALL is further away from reaching a TC% value of 100 than on the other two datasets.

For MOVIELENS, GR is again able to outperform all respective NH baselines. Furthermore, on this dataset GR-ALL is able to achieve perfect PD% and TC% values. Additionally, GR shows a substantial increase for PD% and TC% when moving from k = 1 to k = 10, suggesting that more than one training triple is required for an explanation. At the same time, NH-10 still does not perform better than random in terms of PD%. The bottom half of Table 1 reports the results using COMPLEX, which show similar patterns to DISTMULT. In the appendix, Figure 4 depicts the differences between GR and NH: GR is always better at identifying highly influential triples.

Approximation Quality. In this experiment we select, for each test triple, the training triple d′ GR deemed most influential. We then retrain without d′ and compare the predicted probability of the test triple with the probability (Pr) after retraining the model. Figure 2 shows that GR is able to accurately estimate the probability of the test triple after retraining. For all datasets the correlation is rather high. We also confirm empirically that imposing a constraint on the weights (enforcing unit norm or a maximum norm) reduces the Lipschitz constant of the model and in turn reduces the error of GR, as evidenced by the increase in correlation.


Figure 3: Some example explanations generated using GR. The dashed line indicates the test triple. Depicted are triples deemed highly influential on model behavior by GR (i) with the same head and tail as the test triple (green), (ii) with the same relation type (red), and (iii) a pair of triples via a third entity (blue). [The depicted examples involve the entities China, Egypt, Netherlands, Cuba, Burma, USA, UK and Israel and the relation types treaties, embassy, relintergovorgs, reldiplomacy, intergovorgs3 and exports3.]

In addition to the quantitative results, we also provide some typical example explanations generated by GR and their interpretation in the appendix. Figure 3 also shows two qualitative examples.

Comparison to CRIAGE. Like GR, CRIAGE can be used to identify training samples that have a large influence on a prediction. However, CRIAGE has three disadvantages: (1) it only considers training instances with the same object as the prediction as possible explanations; (2) it only works with a SIGMOID activation function and not for SOFTMAX; (3) retrieving explanations is time consuming, e.g. on MOVIELENS GR can identify an explanation in 3 minutes whereas CRIAGE takes 3 hours. Consequently, we only run CRIAGE on the smaller NATIONS dataset. For a direct comparison, we introduce GR-O, which like CRIAGE only considers training instances as possible explanations if they have the same object as the prediction. Results for the top-k ∈ {1, 3, 5, 10} are reported in Table 2. Both GR and GR-O outperform CRIAGE and GR performs best overall for all k and both metrics.

Conclusion

Gradient rollback (GR) is a simple yet effective method to track the influence of training samples on the parameters of a model. Due to its efficiency it is suitable for large neural networks, where each training instance touches only a moderate number of parameters. Instances of this class of models are neural matrix factorization models. To showcase the utility of GR, we first established theoretically that the resource overhead is minimal. Second, we showed that the difference of GR's influence approximation and the true influence on the model behaviour is smaller than known bounds on the stability of stochastic gradient descent. This establishes a link between influence estimation and the stability of models and shows that, if a model is stable, GR's influence estimation error is small. We showed empirically that GR can successfully identify sets of training samples that cause a drop in performance if a model is retrained without this set. In the future we plan to apply GR to other models.

Acknowledgements

We thank Takuma Ebisu for his time and insightful feedback.

Broader Impact

This paper addresses the problem of explaining and analyzing the behavior of machine learning models. More specifically, we propose a method that is tailored to a class of ML models called matrix factorization methods. These have a wide range of applications and, therefore, also the potential to provide biased and otherwise inappropriate predictions in numerous use cases. For instance, the predictions of a recommender system might be overly gender or race specific. We hope that our proposed influence estimation approach can make these models more interpretable and can lead to improvements of production systems with respect to the aforementioned problems of bias, fairness, and transparency. At the same time, it might also provide a method for an adversary to find weaknesses of the machine learning system and to influence its predictions, making them more biased and less fair. In the end, we present a method that can be used in several different applications. We believe, however, that the proposed method is not inherently problematic as it is agnostic to use cases and does not introduce or discuss problematic applications.

References

Bianchi, F.; Rossiello, G.; Costabello, L.; Palmonari, M.; and Minervini, P. 2020. Knowledge Graph Embeddings and Explainable AI. In Ilaria Tiddi, Freddy Lecue, P. H., ed., Knowledge Graphs for eXplainable AI – Foundations, Applications and Challenges. Studies on the Semantic Web. Amsterdam: IOS Press.

Biggio, B.; Fumera, G.; and Roli, F. 2014. Security Evaluation of Pattern Classifiers under Attack. IEEE Transactions on Knowledge and Data Engineering 26(4): 984–996.

Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26, NeurIPS, 2787–2795.

Bousquet, O.; and Elisseeff, A. 2002. Stability and Generalization. Journal of Machine Learning Research 2: 499–526.

Cai, L.; and Wang, W. Y. 2018. KBGAN: Adversarial Learning for Knowledge Graph Embeddings. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 1470–1480.


Chen, L.; Li, J.; Peng, J.; Xie, T.; Cao, Z.; Xu, K.; He, X.; and Zheng, Z. 2020. A Survey of Adversarial Learning on Graphs. arXiv abs/2003.05730: 1–28.

Dai, H.; Li, H.; Tian, T.; Huang, X.; Wang, L.; Zhu, J.; and Song, L. 2018. Adversarial Attack on Graph Structured Data. In Proceedings of the 35th International Conference on Machine Learning, ICML, 1115–1124.

Dong, Y.; Su, H.; Zhu, J.; and Bao, F. 2017. Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples. arXiv abs/1708.05493: 1–12.

Ebrahimi, J.; Rao, A.; Lowd, D.; and Dou, D. 2018. HotFlip: White-Box Adversarial Examples for Text Classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL, 31–36.

Guo, Q.; Zhuang, F.; Qin, C.; Zhu, H.; Xie, X.; Xiong, H.; and He, Q. 2020. A Survey on Knowledge Graph-Based Recommender Systems. arXiv abs/2003.00911: 1–17.

Guo, S.; Wang, Q.; Wang, L.; Wang, B.; and Guo, L. 2016. Jointly Embedding Knowledge Graphs and Logical Rules. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP, 192–202.

Hardt, M.; Recht, B.; and Singer, Y. 2016. Train Faster, Generalize Better: Stability of Stochastic Gradient Descent. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 1225–1234.

Harper, F. M.; and Konstan, J. A. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems 5(4).

Hooker, S.; Erhan, D.; Kindermans, P.-J.; and Kim, B. 2019. A Benchmark for Interpretability Methods in Deep Neural Networks. In Advances in Neural Information Processing Systems 32, NeurIPS, 9737–9748.

Jacovi, A.; and Goldberg, Y. 2020. Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL, 4198–4205.

Jain, P.; Rathi, S.; Chakrabarti, S.; et al. 2020. Knowledge Base Completion: Baseline strikes back (Again). arXiv abs/2005.00804: 1–6.

Kadlec, R.; Bajgar, O.; and Kleindienst, J. 2017. Knowledge Base Completion: Baselines Strike Back. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 69–74.

Kingma, D. P.; and Ba, J. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR.

Koh, P. W.; and Liang, P. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, ICML, 1885–1894.

Kok, S.; and Domingos, P. 2007. Statistical Predicate Invention. In Proceedings of the 24th International Conference on Machine Learning, ICML, 433–440.

Minervini, P.; Demeester, T.; Rocktäschel, T.; and Riedel, S. 2017. Adversarial Sets for Regularising Neural Link Predictors. In Conference on Uncertainty in Artificial Intelligence, UAI, 9240–9251.

Nickel, M.; Tresp, V.; and Kriegel, H.-P. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. In Proceedings of the 28th International Conference on Machine Learning, ICML, 809–816.

Papernot, N.; McDaniel, P.; Jha, S.; Fredrikson, M.; Celik, Z. B.; and Swami, A. 2016. The Limitations of Deep Learning in Adversarial Settings. In 2016 IEEE European Symposium on Security and Privacy, EuroS&P, 372–387.

Pezeshkpour, P.; Chen, L.; and Singh, S. 2018. Embedding Multimodal Relational Data for Knowledge Base Completion. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP, 3208–3218.

Pezeshkpour, P.; Tian, Y.; and Singh, S. 2019. Investigating Robustness and Interpretability of Link Prediction via Adversarial Modifications. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 3336–3347.

Rocktäschel, T.; Singh, S.; and Riedel, S. 2015. Injecting Logical Background Knowledge into Embeddings for Relation Extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL, 1119–1129.

Ruffinelli, D.; Broscheit, S.; and Gemulla, R. 2020. You CAN Teach an Old Dog New Tricks! On Training Knowledge Graph Embeddings. In 8th International Conference on Learning Representations, ICLR.

Toutanova, K.; Chen, D.; Pantel, P.; Poon, H.; Choudhury, P.; and Gamon, M. 2015. Representing Text for Joint Embedding of Text and Knowledge Bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP, 1499–1509.

Trouillon, T.; Welbl, J.; Riedel, S.; Gaussier, E.; and Bouchard, G. 2016. Complex Embeddings for Simple Link Prediction. In International Conference on Machine Learning, ICML, 2071–2080.

Yang, B.; Yih, W.; He, X.; Gao, J.; and Deng, L. 2015. Embedding Entities and Relations for Learning and Inference in Knowledge Bases. In 3rd International Conference on Learning Representations, ICLR.

Ying, Z.; Bourgeois, D.; You, J.; Zitnik, M.; and Leskovec, J. 2019. GNNExplainer: Generating Explanations for Graph Neural Networks. In Advances in Neural Information Processing Systems 32, NeurIPS, 9240–9251.

Zügner, D.; Akbarnejad, A.; and Günnemann, S. 2018. Adversarial Attacks on Neural Networks for Graph Data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2847–2856.


Definitions, Lemmas, and Proofs

The following lemma simplifies the derivation of Lipschitz and smoothness bounds for tuple-based scoring functions.

Lemma 1. Let f : R^k × R^l × R^m → R^n : (x, y, z) → f(x, y, z) be Lipschitz continuous relative to, respectively, x, y, and z, with Lipschitz constants L_x, L_y, and L_z. Then f is Lipschitz continuous as a function R^{k+l+m} → R^n. More specifically, |f(u_x, u_y, u_z) − f(v_x, v_y, v_z)| ≤ 2 max{L_x, L_y, L_z} ‖(u_x, u_y, u_z) − (v_x, v_y, v_z)‖.

Proof.

|f(u_x, u_y, u_z) − f(v_x, v_y, v_z)|
= ‖f(u_x, u_y, u_z) − f(u_x, u_y, v_z) + f(u_x, u_y, v_z) − f(v_x, v_y, v_z)‖
≤ ‖f(u_x, u_y, u_z) − f(u_x, u_y, v_z)‖ + ‖f(u_x, u_y, v_z) − f(v_x, v_y, v_z)‖
= ‖f(u_x, u_y, u_z) − f(u_x, u_y, v_z)‖ + ‖f(u_x, u_y, v_z) − f(u_x, v_y, v_z) + f(u_x, v_y, v_z) − f(v_x, v_y, v_z)‖
≤ ‖f(u_x, u_y, u_z) − f(u_x, u_y, v_z)‖ + ‖f(u_x, u_y, v_z) − f(u_x, v_y, v_z)‖ + ‖f(u_x, v_y, v_z) − f(v_x, v_y, v_z)‖
≤ L_z ‖u_z − v_z‖ + L_y ‖u_y − v_y‖ + L_x ‖u_x − v_x‖
≤ max{L_x, L_y, L_z} (‖u_x − v_x‖ + ‖u_y − v_y‖ + ‖u_z − v_z‖)
= max{L_x, L_y, L_z} (‖(u_x, 0) − (v_x, 0)‖ + ‖(0, u_y) − (0, v_y)‖ + ‖u_z − v_z‖)
≤ max{L_x, L_y, L_z} (√2 ‖(u_x, u_y) − (v_x, v_y)‖ + ‖u_z − v_z‖)
= max{L_x, L_y, L_z} (√2 ‖(u_x, u_y, 0) − (v_x, v_y, 0)‖ + ‖(0, 0, u_z) − (0, 0, v_z)‖)
≤ max{L_x, L_y, L_z} (2 ‖(u_x, u_y, u_z) − (v_x, v_y, v_z)‖)
= 2 max{L_x, L_y, L_z} ‖(u_x, u_y, u_z) − (v_x, v_y, v_z)‖.

We can now use Lemma 1 to derive the Lipschitz constant for the DISTMULT scoring function and a bound on its smoothness.

Lemma 2. Let φ be the scoring function of DISTMULT defined as φ(w; d = (s, r, o)) = 〈s, r, o〉 with w(d) = (s, r, o), and let C be the bound on the norm of the embedding vectors for all w ∈ Ω. For a given triple d = (s, r, o) and all w, w′ ∈ Ω, we have that

|φ(w; d) − φ(w′; d)| ≤ 2C² ‖w − w′‖.

Proof. Let w, w′ ∈ Ω and let (s, r, o) = (w[s], w[r], w[o]) and (s′, r′, o′) = (w′[s], w′[r], w′[o]). Without loss of generality, assume that ‖sr‖ ≥ ‖s′r′‖. We first show that

|〈s, r, o〉 − 〈s, r, o′〉| = |〈sr, (o − o′)〉|
≤ ‖sr‖ ‖o − o′‖
= ‖sr‖ ‖(s, r, o) − (s, r, o′)‖
≤ ‖s‖ ‖r‖ ‖(s, r, o) − (s, r, o′)‖
≤ C² ‖(s, r, o) − (s, r, o′)‖.

The first inequality is by Cauchy-Schwarz. The last inequality is by the assumption that the norm of the embedding vectors for all w ∈ Ω is bounded by C. We can repeat the above derivation for, respectively, r′ and s′. Hence, we can apply Lemma 1 to conclude the proof.

Lemma 3. Let φ be the scoring function of DISTMULT defined as φ(w; d = (s, r, o)) = 〈s, r, o〉 with w(d) = (s, r, o), and let C be the bound on the norm of the embedding vectors for all w ∈ Ω. For a given triple d = (s, r, o) and all w, w′ ∈ Ω, we have that

‖∇φ(w; d) − ∇φ(w′; d)‖ ≤ 4C ‖w − w′‖.

Proof. Let $\mathbf{w}, \mathbf{w}' \in \Omega$ and let $(\mathbf{s}, \mathbf{r}, \mathbf{o}) = (\mathbf{w}[s], \mathbf{w}[r], \mathbf{w}[o])$ and $(\mathbf{s}', \mathbf{r}', \mathbf{o}') = (\mathbf{w}'[s], \mathbf{w}'[r], \mathbf{w}'[o])$. We first show that
\begin{align*}
\|\nabla \langle \mathbf{s}, \mathbf{r}, \mathbf{o} \rangle - \nabla \langle \mathbf{s}, \mathbf{r}, \mathbf{o}' \rangle\|
&= \|(\mathbf{r}\mathbf{o}, \mathbf{s}\mathbf{o}, \mathbf{s}\mathbf{r}) - (\mathbf{r}\mathbf{o}', \mathbf{s}\mathbf{o}', \mathbf{s}\mathbf{r})\| \\
&= \|(\mathbf{r}(\mathbf{o} - \mathbf{o}'), \mathbf{s}(\mathbf{o} - \mathbf{o}'), \mathbf{0})\| \\
&\leq \|\mathbf{r}(\mathbf{o} - \mathbf{o}')\| + \|\mathbf{s}(\mathbf{o} - \mathbf{o}')\| \\
&\leq \|\mathbf{r}\|\, \|\mathbf{o} - \mathbf{o}'\| + \|\mathbf{s}\|\, \|\mathbf{o} - \mathbf{o}'\| \\
&= (\|\mathbf{r}\| + \|\mathbf{s}\|)\, \|\mathbf{o} - \mathbf{o}'\| \\
&= (\|\mathbf{r}\| + \|\mathbf{s}\|)\, \|(\mathbf{s}, \mathbf{r}, \mathbf{o}) - (\mathbf{s}, \mathbf{r}, \mathbf{o}')\| \\
&\leq 2C\, \|(\mathbf{s}, \mathbf{r}, \mathbf{o}) - (\mathbf{s}, \mathbf{r}, \mathbf{o}')\|.
\end{align*}
Hence, we have that $\nabla \langle \mathbf{s}, \mathbf{r}, \mathbf{o} \rangle$ is $2C$-Lipschitz with respect to $\mathbf{o}$. We can repeat the above derivation for, respectively, $\mathbf{r}'$ and $\mathbf{s}'$. Hence, we can apply Lemma 1 to conclude the proof.
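The gradient bound of Lemma 3 can be checked numerically in the same spirit (again, not part of the original proof). The gradient of the DISTMULT score with respect to the concatenated parameters $(\mathbf{s}, \mathbf{r}, \mathbf{o})$ is the concatenation of the elementwise products $(\mathbf{r} \odot \mathbf{o}, \mathbf{s} \odot \mathbf{o}, \mathbf{s} \odot \mathbf{r})$; the constants below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
k, C = 10, 2.0  # embedding dimension and norm bound (illustrative values)

def clip_norm(v, C):
    n = np.linalg.norm(v)
    return v if n <= C else v * (C / n)

def grad_distmult(s, r, o):
    """Gradient of <s, r, o> w.r.t. the concatenated parameters (s, r, o)."""
    return np.concatenate([r * o, s * o, s * r])

for _ in range(1000):
    s, r, o = (clip_norm(rng.normal(size=k), C) for _ in range(3))
    s2, r2, o2 = (clip_norm(rng.normal(size=k), C) for _ in range(3))
    lhs = np.linalg.norm(grad_distmult(s, r, o) - grad_distmult(s2, r2, o2))
    rhs = 4 * C * np.linalg.norm(np.concatenate([s - s2, r - r2, o - o2]))
    assert lhs <= rhs + 1e-9, (lhs, rhs)
print("Lemma 3 bound held in all trials.")
```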

We first need another definition.

Definition 4. An update rule $G: \Omega \to \Omega$ is $\sigma$-bounded if
\[
\sup_{\mathbf{w} \in \Omega} \|\mathbf{w} - G(\mathbf{w})\| \;\leq\; \sigma.
\]
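As a brief side note (not spelled out in the original text), the gradient update rule $G(\mathbf{w}) = \mathbf{w} - \alpha \nabla f(\mathbf{w})$ is $\alpha L$-bounded whenever $f$ is $L$-Lipschitz, which is the property invoked below in the proofs of Theorem 1 and Lemma 5:
\[
\|\mathbf{w} - G(\mathbf{w})\| \;=\; \|\mathbf{w} - (\mathbf{w} - \alpha \nabla f(\mathbf{w}))\| \;=\; \alpha \|\nabla f(\mathbf{w})\| \;\leq\; \alpha L.
\]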

Stochastic gradient descent (SGD) is the algorithm resulting from performing stochastic gradient updates $T$ times where the indices of training examples are randomly chosen. There are two popular sampling approaches for gradient descent type algorithms. One is to choose an example uniformly at random at each step. The other is to choose a random permutation and cycle through the examples repeatedly. We focus on the latter sampling strategy but the presented results can also be extended to the former.

We can now prove the stability of SGD on two sets of training samples.


Theorem 1. Let $f(\cdot; d) \in [0, 1]$ be an $L$-Lipschitz and $\beta$-smooth function for every possible triple $d$ and let $c$ be the initial learning rate. Suppose we run SGD for $T$ steps with monotonically non-increasing step sizes $\alpha_t \leq c/t$ on two different sets of triples $D$ and $D \setminus \{d'\}$. Then, for any $d$,
\[
\mathbb{E}\,|f(\mathbf{w}_T; d) - f(\mathbf{w}'_T; d)| \;\leq\; \Lambda_{\text{stab-nc}} \;:=\; \frac{1 + 1/(\beta c)}{n - 1}\, (c L^2)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}},
\]
with $\mathbf{w}_T$ and $\mathbf{w}'_T$ the parameters of the two models after running SGD.

Proof. The proof of the theorem follows the notation and strategy of the proof of the uniform stability bound of SGD in (Hardt, Recht, and Singer 2016). We provide some of the details here as we will use a similar proof idea for the approximation bound for gradient rollback. By Lemma 3.11 in (Hardt, Recht, and Singer 2016), we have for every $t_0 \in \{1, \ldots, n\}$

\[
\mathbb{E}\,|f(\mathbf{w}_T; d) - f(\mathbf{w}'_T; d)| \;\leq\; \frac{t_0}{n} + L\,\mathbb{E}[\delta_T \mid \delta_{t_0} = 0],
\]
where $\delta_t = \|\mathbf{w}_t - \mathbf{w}'_t\|$. Let $\Delta_t = \mathbb{E}[\delta_t \mid \delta_{t_0} = 0]$. We will state a bound on $\Delta_t$ as a function of $t_0$ and then follow exactly the proof strategy of (Hardt, Recht, and Singer 2016).

At step $t$, with probability $1 - 1/n$, the triple selected by SGD exists in both sets $D$ and $D \setminus \{d'\}$. In this case we can use the $(1 + \alpha\beta)$-expansivity of the update rule (Hardt, Recht, and Singer 2016), which follows from our smoothness assumption. With probability $1/n$ the selected triple is $d'$. Since $d'$ is the triple missing from $D \setminus \{d'\}$, the weights $\mathbf{w}'$ are not updated at all. This is justified because we run SGD on both sets for the same number of epochs, each with $|D|$ steps, and therefore do not change $\mathbf{w}'$ with probability $1/n$ because the training set corresponding to $\mathbf{w}'$ has one example less. Hence, in this case we use the $\alpha L$-bounded property of the update, which is a consequence of the $\alpha L$-boundedness of the gradient descent update rule. We can now apply Lemma 2.5 from (Hardt, Recht, and Singer 2016) and linearity of expectation to conclude that for every $t \geq t_0$,

\[
\Delta_{t+1} \;\leq\; \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,\Delta_t + \frac{1}{n}\,\Delta_t + \frac{\alpha_t L}{n}.
\]

The theorem now follows from unwinding the recurrence relation and optimizing for $t_0$ in the exact same way as done in prior work (Hardt, Recht, and Singer 2016).
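To get a feel for the magnitude of the bound in Theorem 1, the following snippet (not from the paper; all constants are made-up illustrative values, not the constants of any dataset used in the experiments) evaluates $\Lambda_{\text{stab-nc}}$ for a few hypothetical settings of $L$, $\beta$, $c$, $n$, and $T$:

```python
def stability_bound(L, beta, c, n, T):
    """Non-convex SGD stability bound Lambda_stab-nc from Theorem 1."""
    exponent = 1.0 / (beta * c + 1.0)
    return ((1.0 + 1.0 / (beta * c)) / (n - 1)
            * (c * L**2) ** exponent
            * T ** (beta * c * exponent))

# Illustrative values only: two training-set sizes and T = 10 epochs of SGD steps.
for n in (1_592, 272_000):
    T = 10 * n
    print(n, round(stability_bound(L=1.0, beta=1.0, c=0.001, n=n, T=T), 6))
```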

Lemma 4. Let $f: \Omega \to \mathbb{R}$ be a function and let $G(\mathbf{w}) = \mathbf{w} - \alpha \nabla f(\mathbf{w})$ be the gradient update rule with step size $\alpha$. Moreover, assume that $f$ is $\beta$-smooth. Then, for every $\mathbf{w}, \mathbf{w}', \boldsymbol{\gamma} \in \Omega$ we have
\[
\|G(\mathbf{w}) - \boldsymbol{\gamma} - G(\mathbf{w}')\| \;\leq\; \|\mathbf{w} - \boldsymbol{\gamma} - \mathbf{w}'\| + \alpha\beta\, \|\mathbf{w} - \mathbf{w}'\|.
\]

Proof.
\begin{align*}
\|G(\mathbf{w}) - \boldsymbol{\gamma} - G(\mathbf{w}')\|
&= \|\mathbf{w} - \alpha \nabla f(\mathbf{w}) - \boldsymbol{\gamma} - (\mathbf{w}' - \alpha \nabla f(\mathbf{w}'))\| \\
&\leq \|\mathbf{w} - \boldsymbol{\gamma} - \mathbf{w}'\| + \alpha\, \|\nabla f(\mathbf{w}) - \nabla f(\mathbf{w}')\| \\
&\leq \|\mathbf{w} - \boldsymbol{\gamma} - \mathbf{w}'\| + \alpha\beta\, \|\mathbf{w} - \mathbf{w}'\|.
\end{align*}
The first equality follows from the definition of the gradient update rule. The first inequality follows from the triangle inequality. The second inequality follows from the definition of $\beta$-smoothness.
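For intuition, Lemma 4 can also be checked numerically on a simple smooth function. The sketch below (not part of the paper) uses the quadratic loss $f(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2$, which is $\beta$-smooth with $\beta = 1$; the step size and dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta, k = 0.1, 1.0, 5  # step size, smoothness constant, dimension

def grad_f(w):
    """Gradient of f(w) = 0.5 * ||w||^2, which is beta-smooth with beta = 1."""
    return w

def G(w):
    """Gradient update rule G(w) = w - alpha * grad f(w)."""
    return w - alpha * grad_f(w)

for _ in range(1000):
    w, w2, gamma = (rng.normal(size=k) for _ in range(3))
    lhs = np.linalg.norm(G(w) - gamma - G(w2))
    rhs = np.linalg.norm(w - gamma - w2) + alpha * beta * np.linalg.norm(w - w2)
    assert lhs <= rhs + 1e-9
print("Lemma 4 inequality held in all trials.")
```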

Lemma 5. Let $f(\cdot; d)$ be an $L$-Lipschitz and $\beta$-smooth function. Suppose we run SGD for $T$ steps on two sets of triples $D$ and $D \setminus \{d'\}$ for any $d' \in D$ and with learning rate $\alpha_t$ at time step $t$. Moreover, let $\Delta_t = \mathbb{E}\left[\|\mathbf{w}_t - \mathbf{w}'_t\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right]$ and $\bar{\Delta}_t = \mathbb{E}\left[\|\mathbf{w}_t - \boldsymbol{\gamma}[d', \mathbf{w}_t(d)] - \mathbf{w}'_t\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right]$ for some $t_0 \in \{1, \ldots, n\}$. Then, for all $t \geq t_0$,
\[
\bar{\Delta}_{t+1} \;<\; \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,\Delta_t + \frac{1}{n}\left(\Delta_t + \alpha_t L\right).
\]

Proof. We know from the proof of Theorem 1 that the following recurrence holds
\[
\Delta_{t+1} \;\leq\; \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,\Delta_t + \frac{1}{n}\left(\Delta_t + \alpha_t L\right). \tag{5}
\]
We first show that when using gradient rollback (GR), we have that
\[
\bar{\Delta}_{t+1} \;\leq\; \left(1 - \frac{1}{n}\right)\left(\bar{\Delta}_t + \alpha_t \beta\,\Delta_t\right) + \frac{1}{n}\,\bar{\Delta}_t. \tag{6}
\]
At step $t$ of SGD, with probability $1 - \frac{1}{n}$, the triple selected is in both $D$ and $D \setminus \{d'\}$. In this case, by Lemma 4, we have that $\bar{\Delta}_{t+1} \leq \bar{\Delta}_t + \alpha_t \beta\,\Delta_t$. With probability $\frac{1}{n}$ the selected example is $d'$, in which case we have, due to the behavior of GR, that
\begin{align*}
\bar{\Delta}_{t+1}
&= \mathbb{E}\left[\|\mathbf{w}_{t+1} - \boldsymbol{\gamma}[d', \mathbf{w}_{t+1}(d)] - \mathbf{w}'_{t+1}\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right] \\
&= \mathbb{E}\left[\|\mathbf{w}_{t+1} - \boldsymbol{\gamma}[d', \mathbf{w}_{t+1}(d)] - \mathbf{w}'_{t}\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right] \\
&= \mathbb{E}\left[\|\mathbf{w}_{t} - \alpha_t \nabla f(\mathbf{w}_t; d') - \left(\boldsymbol{\gamma}[d', \mathbf{w}_{t}(d)] - \alpha_t \nabla f(\mathbf{w}_t; d')\right) - \mathbf{w}'_{t}\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right] \\
&= \mathbb{E}\left[\|\mathbf{w}_{t} - \boldsymbol{\gamma}[d', \mathbf{w}_{t}(d)] - \mathbf{w}'_{t}\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right] \\
&= \bar{\Delta}_t.
\end{align*}
This proves Equation 6.

Let us now define the recurrences $A_{t+1}$ and $B_{t+1}$ for the upper bounds of $\Delta_t$ and $\bar{\Delta}_t$, respectively:
\begin{align}
A_{t+1} &= \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,A_t + \frac{1}{n}\left(A_t + \alpha_t L\right), \tag{7}\\
B_{t+1} &= \left(1 - \frac{1}{n}\right)\left(B_t + \alpha_t \beta\,A_t\right) + \frac{1}{n}\,B_t. \tag{8}
\end{align}


Now, let
\begin{align*}
C_{t+1} &= A_{t+1} - B_{t+1} \\
&= \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,A_t + \frac{1}{n}\left(A_t + \alpha_t L\right) - \left(1 - \frac{1}{n}\right)\left(B_t + \alpha_t \beta\,A_t\right) - \frac{1}{n}\,B_t \\
&= \left(1 - \frac{1}{n}\right)A_t + \frac{1}{n}\left(A_t + \alpha_t L\right) - B_t \\
&= (A_t - B_t) + \frac{1}{n}\,\alpha_t L \\
&= C_t + \frac{1}{n}\,\alpha_t L.
\end{align*}
Let us now derive the initial conditions of the recurrence relations. By the assumption $\|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0$, we have $\Delta_{t_0} = 0$, $\bar{\Delta}_{t_0} = 0$, $A_{t_0} = 0$, $B_{t_0} = 0$, and $C_{t_0} = 0$. Now, since $\alpha_t L > 0$ we have for all $t \geq t_0$ that $C_{t+1} = A_{t+1} - B_{t+1} > 0$ and, therefore, $B_{t+1} < A_{t+1}$. It follows that for all $t \geq t_0$ we have
\[
\bar{\Delta}_{t+1} \;\leq\; B_{t+1} \;<\; A_{t+1} \;=\; \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,\Delta_t + \frac{1}{n}\left(\Delta_t + \alpha_t L\right).
\]
This concludes the proof.

The following theorem establishes that the approximation error of gradient rollback is smaller than the stability bound of SGD. It uses Lemma 5 and the proof of Theorem 1.

Theorem 2. Let $f(\cdot; d) \in [0, 1]$ be an $L$-Lipschitz and $\beta$-smooth loss function. Suppose we run SGD for $T$ steps with monotonically non-increasing step sizes $\alpha_t \leq c/t$ on two sets of triples $D$ and $D \setminus \{d'\}$. Let $\mathbf{w}_T$ and $\mathbf{w}'_T$, respectively, be the resulting parameters. Then, for any triple $d$ that has at least one element in common with $d'$ we have

\[
\mathbb{E}\,|f(\mathbf{w}_T - \boldsymbol{\gamma}[d', \mathbf{w}_T(d)]; d) - f(\mathbf{w}'_T; d)| \;<\; \Lambda_{\text{stab-nc}}.
\]

Proof. The proof strategy follows that of Theorem 1 by analyzing how the parameter vectors from two different runs of SGD, on two different training sets, diverge.

By Lemma 3.11 in (Hardt, Recht, and Singer 2016), we have for every $t_0 \in \{1, \ldots, n\}$
\[
\mathbb{E}\,|f(\mathbf{w}_T; d) - f(\mathbf{w}'_T; d)| \;\leq\; \frac{t_0}{n} + L\,\mathbb{E}\left[\|\mathbf{w}_T - \mathbf{w}'_T\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right]. \tag{9}
\]

Let $\Delta_t = \mathbb{E}\left[\|\mathbf{w}_t - \mathbf{w}'_t\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right]$ and $\bar{\Delta}_t = \mathbb{E}\left[\|\mathbf{w}_t - \boldsymbol{\gamma}[d', \mathbf{w}_t(d)] - \mathbf{w}'_t\| \mid \|\mathbf{w}_{t_0} - \mathbf{w}'_{t_0}\| = 0\right]$.

We will state bounds on $\Delta_t$ and $\bar{\Delta}_t$ as a function of $t_0$ and then follow the proof strategy of (Hardt, Recht, and Singer 2016).

From Lemma 5 we know that
\[
\bar{\Delta}_{t+1} \;<\; \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,\Delta_t + \frac{1}{n}\left(\Delta_t + \alpha_t L\right).
\]

For the recurrence relation
\[
\Delta_{t+1} \;\leq\; \left(1 - \frac{1}{n}\right)(1 + \alpha_t \beta)\,\Delta_t + \frac{1}{n}\left(\Delta_t + \alpha_t L\right)
\]

we know from the proof of Theorem 3.12 in (Hardt, Recht, and Singer 2016) that
\[
\Delta_T \;\leq\; \frac{L}{\beta(n-1)} \left(\frac{T}{t_0}\right)^{\beta c}.
\]

Hence, we have by Lemma 5 that, for some $\varepsilon > 0$,
\[
\bar{\Delta}_T \;\leq\; \frac{L}{\beta(n-1)} \left(\frac{T}{t_0}\right)^{\beta c} - \varepsilon.
\]

Plugging this into Equation 9, we get
\[
\mathbb{E}\,|f(\mathbf{w}_T - \boldsymbol{\gamma}[d', \mathbf{w}_T(d)]; d) - f(\mathbf{w}'_T; d)| \;\leq\; \frac{t_0}{n} + \frac{L^2}{\beta(n-1)} \left(\frac{T}{t_0}\right)^{\beta c} - L\varepsilon.
\]

The theorem now follows from optimizing for $t_0$ in the exact same way as done in prior work (Hardt, Recht, and Singer 2016).
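Theorem 2 is what justifies treating the rolled-back parameters $\mathbf{w}_T - \boldsymbol{\gamma}[d', \mathbf{w}_T(d)]$ as a stand-in for actually retraining without $d'$. The following sketch is not the authors' released implementation; the function names (approx_influence, explain, score_fn) and the data layout of gamma are assumptions made for illustration of how stored influence values could be used to rank training triples by their approximated influence on a test triple.

```python
def approx_influence(score_fn, w, gamma, d_prime, d_test):
    """Approximate influence of training triple d_prime on test triple d_test:
    score under the trained parameters minus score under the rolled-back
    parameters (w minus the changes that d_prime caused to the parameters
    it touched during training)."""
    w_rolled = {name: w[name] - gamma[d_prime].get(name, 0.0) for name in w}
    return score_fn(w, d_test) - score_fn(w_rolled, d_test)

def explain(score_fn, w, gamma, train_triples, d_test, top_k=1):
    """Return the top_k training triples with the largest approximated influence."""
    scored = [(approx_influence(score_fn, w, gamma, d, d_test), d) for d in train_triples]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```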

Experimental Details & Qualitative Examples

Experimental Details. Dataset statistics and the number of negative samples and dimensions can be found in Table 3. The implementation uses Tensorflow 2. Furthermore, the optimizer used for all models is Adam (Kingma and Ba 2015) with an initial learning rate (LR) of 0.001 and a decaying LR wrt the number of total steps. To obtain exact values for GR, the batch size is set to 1. Because the evaluation requires the retraining of models, we set the number of training epochs for FB15K-237 and MOVIELENS to 1. For NATIONS, we set it to 10 and use the model from the epoch with the best MRR on the validation set, which was epoch 10. For NATIONS and FB15K-237, we use the official splits. MOVIELENS⁷ is the 100k split, where the file ua.base is the training set, the first 5k triples of ua.test are the validation set, and the remaining ones are the test set. We train NATIONS and FB15K-237 with softmax cross entropy with logits. For MOVIELENS we employ sigmoid cross entropy with logits in order to test whether GR can also be used under a different loss and because this loss is standardly used in recommender systems. MRR, Hits@1 and Hits@10 of the models can be found in Table 4. Experiments were run on CPUs for NATIONS and on GPUs (GeForce GTX 1080 Ti) for the other two datasets on a Debian GNU/Linux 9.11 machine with 126GB of RAM. Computing times for the different steps and datasets can be found in Table 5. Due to the expensive evaluation step, each experimental setup was run once. All seeds were fixed to 42.
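Since the text above only summarizes the training setup, the following is a minimal sketch of how gradient-rollback bookkeeping can be interleaved with training in TensorFlow 2: after every single-example step (batch size 1, as used here), the change to each embedding row touched by that example is added to its influence record. This is not the released implementation at the GitHub repository; the simplified DistMult model, the single-positive sigmoid loss (the experiments use softmax/sigmoid cross entropy with negative samples), plain SGD instead of Adam, and all sizes are assumptions for illustration. Recording the difference between parameters before and after the step makes the bookkeeping optimizer-agnostic.

```python
import tensorflow as tf
from collections import defaultdict

# Hypothetical sizes for illustration; the real settings are listed in Table 3.
num_entities, num_relations, dim, lr = 14, 55, 10, 0.001

E = tf.Variable(tf.random.normal([num_entities, dim]))   # entity embeddings
R = tf.Variable(tf.random.normal([num_relations, dim]))  # relation embeddings
opt = tf.keras.optimizers.SGD(learning_rate=lr)

# gamma[(s, r, o)] maps a parameter key such as ("E", s) to the accumulated
# change this training triple caused to that embedding row during training.
gamma = defaultdict(lambda: defaultdict(lambda: tf.zeros([dim])))

def loss_fn(s, r, o):
    # Sigmoid cross entropy on a single positive triple (a simplification; the
    # experiments use cross-entropy losses with negative samples, see Table 3).
    score = tf.reduce_sum(E[s] * R[r] * E[o])
    return tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.ones_like(score), logits=score)

def train_step(triple):
    s, r, o = triple
    touched = {("E", s): E, ("R", r): R, ("E", o): E}
    before = {key: tf.identity(var[key[1]]) for key, var in touched.items()}
    with tf.GradientTape() as tape:
        loss = loss_fn(s, r, o)
    grads = tape.gradient(loss, [E, R])
    opt.apply_gradients(zip(grads, [E, R]))
    # Record how much this example moved the embedding rows it touched.
    for key, var in touched.items():
        gamma[triple][key] += var[key[1]] - before[key]

# Batch size 1, so every parameter update is attributed to exactly one triple.
for triple in [(0, 3, 5), (1, 3, 5), (0, 7, 2)]:  # toy triples
    train_step(triple)
```

At explanation time, the stored entries of gamma would be combined with the trained parameters as sketched after Theorem 2 above.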

Qualitative Examples. Besides the question of whether GR is able to identify the most influential instances, it also matters whether those instances would be considered an explanation by a human.

7 https://grouplens.org/datasets/MovieLens/100k/; 27th May 2020


             Train    Dev     Test    #E     #R    #Neg  #Dim
NATIONS       1592    199      201     14     55     13    10
FB15K-237     272k    17.5k   20.5k    14k    237    500   100
MOVIELENS      90k    5k       4.4k    949    1.7k   500   200

Table 3: Dataset statistics, including number of entities (#E) and number of relations (#R), as well as hyperparameter settings: number of negative samples (#Neg) and number of dimensions (#Dim).

                        MRR    Hits@1  Hits@10
DISTMULT   NATIONS     58.88   38.31    97.51
           FB15K-237   25.84   19.16    40.18
           MOVIELENS   61.81   37.60     n/a
COMPLEX    NATIONS     60.41   39.80    97.01
           FB15K-237   25.11   18.15    40.05
           MOVIELENS   61.76   37.51     n/a

Table 4: MRR, Hits@1 and Hits@10 for the main models of the three datasets; DISTMULT at the top and COMPLEX at the bottom. (MOVIELENS has 5 ratings that can be predicted, hence Hits@10 is trivially 100.)

             Step 1   Step 2
NATIONS         1     0.07±0.02
FB15K-237      17     6±7
MOVIELENS       7     12±4

Table 5: Computing times in minutes for the different steps and datasets. Step 1: Train a main model. Step 2: Generate an explanation for one triple (the time for this step depends on the number of adjacent training triples; here averaged over 5 random triples). Step 3: Evaluating GR/NH takes the time of Step 1 times the test set size.

For that, we performed a qualitative analysis by inspecting the explanations picked by GR-1. We focused on FB15K-237 as the triples are human-readable and because they describe general domains like geography or people. Below we present three examples which reflect several other identified cases.

to-be-explained: "Lycoming County" - "has currency" - "United States Dollar"
explanation: "Delaware County" - "has currency" - "United States Dollar"

The model had to predict the currency of Lycoming County and GR picked as the most likely explanation that Delaware County has the US dollar as its currency. Indeed, both are located in Pennsylvania; thus, it seems natural that they also have the same currency.

to-be-explained: "Bandai" – "has industry" – "Video game"
explanation: "Midway Games" – "has industry" – "Video game"

Namco (to which Bandai belongs) initially distributed its games in Japan, while relying on third-party companies, such as Atari and Midway Manufacturing, to publish them internationally under their own brands.

to-be-explained: "Inception" - "released in region" - "Denmark"
explanation: "Bird" - "released in region" - "Denmark"

Both movies were distributed by Warner Bros. It is reasonable that a company usually distributes its movies in the same countries.

These and other identified cases show that the instances selected by GR are not only the most influential triples but also reasonable explanations for users.

[Figure 4: three boxplot panels (Nations, FB15k-237, Movielens); x-axis: GR vs. NH (GR-1/NH-1, GR-10/NH-10, GR-ALL/NH-ALL); left y-axis: Delta [Pr]; right y-axis: Deleted Triples [%], shown for GR and NH.]

Figure 4: The boxplots illustrate the difference between the change in probability of a test triple caused by removing the set of triples S selected by, respectively, GR and the baselines NH. The value is larger than zero when GR selects triples that change the probability more after retraining than those selected by the baselines. The boxplots also depict the standard deviations and the average number of deleted triples (on the right y-axis).