
  • Learning to Explain: An Information-Theoretic Perspective on Model Interpretation

    Jianbo Chen, Le Song, Martin J. Wainwright, Michael I. Jordan

    Institute of Information Science

    Academia Sinica

    Taiwan

    Aug 10, 2018

  • Motivation

    We are curious about which features are key for the model to make its prediction on a given instance.


  • Mutual information

    $$I(X; Y) = \mathbb{E}_{X,Y}\left[\log \frac{p_{XY}(X, Y)}{p_X(X)\, p_Y(Y)}\right],$$

    where $p_{XY}(X, Y)$ is the joint probability density and $p_X(X)$, $p_Y(Y)$ are the marginal probability densities. Intuitively, it measures how much knowledge of one random vector reduces the uncertainty about the other.
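    For intuition, here is a small numerical illustration (mine, not from the slides) that computes $I(X;Y)$ for two discrete variables directly from a made-up joint probability table:

```python
import numpy as np

# Hypothetical joint distribution p_XY of two binary variables (rows: X, cols: Y).
# The values are made up purely for illustration.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p_X
p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p_Y

# I(X; Y) = sum_{x,y} p(x,y) * log( p(x,y) / (p(x) p(y)) )
mi = np.sum(p_xy * np.log(p_xy / (p_x * p_y)))
print(f"I(X; Y) = {mi:.4f} nats")  # > 0 because X and Y are dependent here
```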

  • Notations

    Data: $(X, Y)$, where the response is drawn from the model to be explained, $Y \mid x \sim P_m(\cdot \mid x)$.

    $\mathcal{Q}_k = \{S \subseteq \{1, \dots, d\} : |S| = k\}$ is the collection of feature subsets of size $k$.

    Given the chosen subset $S = \mathcal{E}(x)$, we use $x_S$ to denote the sub-vector formed by the chosen features.

    Objective function:

    $$\max_{\mathcal{E}} \; I(X_S; Y) \quad \text{subject to} \quad S \sim \mathcal{E}(X) \tag{1}$$
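    One reason (1) is hard to attack by exhaustive search is that $|\mathcal{Q}_k| = \binom{d}{k}$ grows combinatorially in $d$ and $k$; a quick illustration (the sizes below are arbitrary examples, not from the slides):

```python
from math import comb

# Number of size-k feature subsets an exhaustive search over Q_k would have to score.
for d, k in [(20, 5), (100, 10), (10000, 10)]:  # 10000 ~ a modest text vocabulary
    print(f"d={d:>6}, k={k}: |Q_k| = C(d, k) = {comb(d, k):,}")
```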

  • flow

    It remains to solve (1) efficiently:

    $$\max_{\mathcal{E}} \; I(X_S; Y) \quad \text{subject to} \quad S \sim \mathcal{E}(X)$$

    1. Relax (1) by maximizing a variational lower bound, giving problem (2):

       $$\max_{\mathcal{E}, \mathbb{Q}} \; \mathbb{E}\left[\log Q_S(Y \mid X_S)\right] \quad \text{subject to} \quad S \sim \mathcal{E}(X)$$

    2. Approximate $\mathbb{Q}$ by a single neural network $g_\alpha$, giving problem (3):

       $$\max_{\mathcal{E}, \alpha} \; \mathbb{E}\left[\log g_\alpha(Y \mid X_S)\right] \quad \text{subject to} \quad S \sim \mathcal{E}(X)$$

    3. Approximate sampling from the categorical distribution by sampling from a continuous distribution: the Gumbel-softmax trick. The objective function therefore becomes

       $$\max_{\theta, \alpha} \; \mathbb{E}_{X, Y, \eta}\left[\log g_\alpha(V(\theta, \eta) \odot X, \, Y)\right]$$


  • 1. variational lower bound

    A direct solution to (1) is not possible, so we approach it by a variational approximation:

    $$I(X_S; Y) = H(Y) + \mathbb{E}_{P_m(X_S, Y)}\left[\log P_m(Y \mid X_S)\right] \geq H(Y) + \mathbb{E}_{P_m(X_S, Y)}\left[\log Q_S(Y \mid X_S)\right],$$

    where $Q_S(Y \mid X_S)$ is an arbitrary variational distribution. The bound becomes exact if $Q_S(Y \mid X_S) \equiv P_m(Y \mid X_S)$. Problem (1) can thus be relaxed to

    $$\max_{\mathcal{E}, \mathbb{Q}} \; \mathbb{E}\left[\log Q_S(Y \mid X_S)\right] \quad \text{subject to} \quad S \sim \mathcal{E}(X) \tag{2}$$

    where $\mathbb{Q}$ is a collection of conditional distributions,

    $$\mathbb{Q} = \{Q_S(\cdot \mid x_S) : S \in \mathcal{Q}_k\}.$$
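    The step behind the inequality is the non-negativity of the KL divergence (a standard derivation, not spelled out on the slide):

    \begin{align*}
    I(X_S; Y) &= H(Y) - H(Y \mid X_S)
               = H(Y) + \mathbb{E}_{P_m(X_S, Y)}\!\left[\log P_m(Y \mid X_S)\right] \\
              &= H(Y) + \mathbb{E}_{P_m(X_S, Y)}\!\left[\log Q_S(Y \mid X_S)\right]
                 + \mathbb{E}_{X_S}\!\left[\mathrm{KL}\!\left(P_m(\cdot \mid X_S) \,\|\, Q_S(\cdot \mid X_S)\right)\right] \\
              &\geq H(Y) + \mathbb{E}_{P_m(X_S, Y)}\!\left[\log Q_S(Y \mid X_S)\right],
    \end{align*}

    with equality iff $Q_S(\cdot \mid X_S) = P_m(\cdot \mid X_S)$ almost surely. Since $H(Y)$ depends on neither $\mathcal{E}$ nor $\mathbb{Q}$, maximizing the lower bound is exactly problem (2).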


  • approximate Q

    Recall that $\mathbb{Q}$ is a collection of conditional distributions, one per subset $S$. Instead, we introduce a single neural network function $g_\alpha : \mathbb{R}^d \to \Delta^{c-1}$ to parametrize $\mathbb{Q}$, where $\Delta^{c-1}$ is the $(c-1)$-simplex of distributions over the $c$ classes:

    $$\max_{\mathcal{E}, \alpha} \; \mathbb{E}\left[\log g_\alpha(Y \mid X_S)\right] \quad \text{subject to} \quad S \sim \mathcal{E}(X) \tag{3}$$
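    A minimal sketch of such a $g_\alpha$, assuming PyTorch and an arbitrary MLP architecture (the layer sizes are my own placeholder choices, not from the slides): the masked input, with unselected features zeroed, is mapped onto the class simplex by a final softmax.

```python
import torch
import torch.nn as nn

class GAlpha(nn.Module):
    """Maps a (masked) input in R^d to a distribution over c classes, a point on the simplex."""
    def __init__(self, d: int, c: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, c),
        )

    def forward(self, x_masked: torch.Tensor) -> torch.Tensor:
        # The softmax puts the output on the (c-1)-simplex.
        return torch.softmax(self.net(x_masked), dim=-1)
```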


  • approximate E

    Sampling $S \sim \mathcal{E}(X)$ is sampling from a categorical distribution, which is discrete, so we cannot backpropagate through it directly. ⇒ Use the Gumbel-softmax distribution to approximate the categorical distribution.

    We define $w_\theta : \mathbb{R}^d \to \mathbb{R}^d$, which maps the input to a $d$-dimensional vector whose $i$-th entry is the "importance score" of the $i$-th feature.

    We then sample a single feature out of the $d$ features, independently, $k$ times.
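    A sketch of this sampling step (my PyTorch rendering of the Gumbel-softmax trick described above, not the authors' code): draw Gumbel noise $\eta$, form $k$ independent relaxed one-hot samples over the $d$ scores, and combine them, here with an element-wise maximum, into a relaxed $k$-hot mask $V(\theta, \eta)$.

```python
import torch

def sample_relaxed_mask(scores: torch.Tensor, k: int, tau: float = 0.5) -> torch.Tensor:
    """Gumbel-softmax relaxation of drawing k features from a categorical over d features.

    scores: (batch, d) importance scores w_theta(X).
    Returns V(theta, eta): (batch, d), a soft approximation of a k-hot selection mask.
    """
    batch, d = scores.shape
    # Gumbel(0, 1) noise eta: one independent draw per feature and per sample j = 1..k.
    uniform = torch.rand(batch, k, d).clamp_(1e-10, 1.0)
    gumbel = -torch.log(-torch.log(uniform))
    # Each of the k rows is a relaxed one-hot sample over the d features.
    samples = torch.softmax((scores.unsqueeze(1) + gumbel) / tau, dim=-1)  # (batch, k, d)
    # Combine the k draws into one mask; as tau -> 0 it approaches an (at most) k-hot vector.
    return samples.max(dim=1).values

# Usage sketch: mask the input and feed it to g_alpha.
# scores = w_theta(X)              # hypothetical explainer network, (batch, d)
# V = sample_relaxed_mask(scores, k=10)
# probs = g_alpha(V * X)           # approximates g_alpha(Y | X_S)
```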

  • objective function

    The objective function then becomes:

    $$\max_{\theta, \alpha} \; \mathbb{E}_{X, Y, \eta}\left[\log g_\alpha(V(\theta, \eta) \odot X, \, Y)\right]$$

    where $g_\alpha$ denotes the neural network used to approximate the model's conditional distribution, and $\theta$ parametrizes the explainer.

    In each update, we sample a mini-batch of unlabeled data together with their class distributions from the model to be explained.

    During the explaining stage, the learned explainer maps each sample $X$ to a weight vector $w_\theta(X)$ of dimension $d$, each entry representing the importance of the corresponding feature for the specific sample $X$.
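    Putting the pieces together, a hedged sketch of one training update and of the explaining stage, assuming PyTorch, the sample_relaxed_mask helper sketched earlier, and hypothetical callables w_theta, g_alpha, and model_to_explain (the black-box model queried for its class distribution):

```python
import torch

def l2x_training_step(X, model_to_explain, w_theta, g_alpha, optimizer, k=10, tau=0.5):
    """One update on a mini-batch of *unlabeled* inputs X of shape (batch, d)."""
    with torch.no_grad():
        target = model_to_explain(X)                  # class distribution P_m(. | X), (batch, c)
    scores = w_theta(X)                               # importance scores, (batch, d)
    V = sample_relaxed_mask(scores, k=k, tau=tau)     # relaxed k-hot mask V(theta, eta)
    probs = g_alpha(V * X)                            # g_alpha evaluated on the masked input
    # Maximize E[log g_alpha(V . X, Y)] with Y ~ P_m(. | X):
    # cross-entropy against the model's own class distribution.
    loss = -(target * torch.log(probs + 1e-10)).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def explain(X, w_theta, k=10):
    """Explaining stage: rank features by the learned importance scores and keep the top k."""
    scores = w_theta(X)                               # (batch, d)
    return scores.topk(k, dim=-1).indices             # indices of the k most important features
```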

  • Experiment-IMDB

    Post-hoc accuracy: we feed the sample $X_S$ to the model with the unselected words masked by zero paddings, and then compute the accuracy of using $P_m(y \mid X_S)$ to predict samples in the test set labeled by $P_m(y \mid X)$.

    Human accuracy: we ask humans on Amazon Mechanical Turk (AMT) to infer the sentiment of a review given the ten key words selected by the explainer.
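    Finally, a minimal sketch of the post-hoc accuracy computation described above, assuming the explainer's selected word positions are given as index tensors and that zeroing an entry corresponds to zero padding (all names are placeholders):

```python
import torch

def post_hoc_accuracy(X, selected_idx, model_to_explain):
    """Fraction of samples where the model's prediction on the masked input X_S
    matches its own prediction on the full input X (the 'label' is P_m(y | X))."""
    # Keep only the selected words; everything else is replaced by zero padding.
    keep = torch.zeros(X.shape, dtype=torch.bool, device=X.device)
    keep.scatter_(-1, selected_idx, torch.ones_like(selected_idx, dtype=torch.bool))
    X_s = torch.where(keep, X, torch.zeros_like(X))

    full_pred = model_to_explain(X).argmax(dim=-1)      # labels induced by P_m(y | X)
    masked_pred = model_to_explain(X_s).argmax(dim=-1)  # predictions from P_m(y | X_S)
    return (masked_pred == full_pred).float().mean().item()
```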