TRANSCRIPT
1
Prof. Mark Whitehorn
Emeritus Professor of Analytics
Computing
University of Dundee
Consultant
Writer (author)
[email protected]
It’s all about us…
© Whitehorn and Bruner
2
I teach a Masters at Dundee in Data Science:
• Part time
• Distance learning – aimed at existing data professionals
© Whitehorn and Bruner
3
Giovanni Bruner
Data Scientist @ Nexi
www.nexi.it/en.html
It’s all about us…
Email: [email protected]
LinkedIn: https://www.linkedin.com/in/giovanni-bruner-22300937/
© Whitehorn and Bruner
4
Some machine learning algorithms not only work but the models they produce can readily be understood
by mere humans; decision trees are a wonderful example here. The same is not true of neural nets, which
conceal their decision-making process behind a massive smokescreen of numbers. But we live in an age
of accountability, where people have a right to know why their loan was refused or why their mother's hip
replacement was rescheduled for the fourth time.
This talk will outline (very briefly) why it is inherently difficult to understand how a given neural net came
to a given decision in a given case. Most of the talk will be spent looking at some of the work that is
going on to try to blow away the smokescreen. Please note that this is an introduction to the topic, which
means it will involve little to no maths.
An introduction to interpretability
LOCATION: GIELGUD
DATE: OCTOBER 1, 2019
TIME: 13:35 - 14:20
45 MINUTES
© Whitehorn and Bruner
5
Without necessarily knowing it, we normally use the von Neumann computational model*. This provides a
clear separation of the data from the instructions that manipulate it. NNs are different: in these, the flow
of the data itself changes the weightings, which are themselves part of the instructions for the
manipulation of the data.
*the incomplete First Draft of a Report on the EDVAC, John von Neumann, 1945
NN versus traditional programming
© Whitehorn and Bruner
6
Who is this
fictional character?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
7
OK it was an easy question.
For what is Sherlock Holmes
famous?
Who is this
fictional character?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
8
OK it was an easy question.
For what is Sherlock Holmes
famous?
Deduction. “the Science of Deduction and Analysis is one which can only be
acquired by long and patient study…”
The Sign of Four
Sir Arthur Conan Doyle
Who is this
fictional character?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
9
Perhaps we should let Dr. Watson
have the limelight just this once.
In 'A Study in Scarlet' he is asked
about some pills. He says :-
“From their lightness and
transparency, I should imagine that
they are soluble in water.”
But what is Deduction?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
10
Deduction is applying a rule that
you already know to a specific
situation.
Induction is creating the rule in the
first place.
Compare and contrast
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
11
So one way of looking at this movement away from the von Neumann architecture
is that the machines are now doing the induction, which means that there is no
human behind the code generating the rules.
How can we explain the result if we don't understand the rules?
NN versus traditional programming
© Whitehorn and Bruner
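To make the deduction/induction distinction concrete, here is a minimal Python sketch (not from the slides; the data and the rule are invented for illustration). A hand-written rule is deduction at work; a decision tree inducing the rule from examples is the machine doing the induction.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Deduction: a human already knows the rule and the program merely applies it
def soluble(lightness, transparency):
    return lightness > 0.5 and transparency > 0.5

# Induction: the machine derives the rule from examples (toy, invented data)
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.3], [0.1, 0.2]]   # [lightness, transparency]
y = [1, 1, 0, 0]                                        # 1 = soluble, 0 = not soluble
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# A decision tree's induced rule is at least still readable...
print(export_text(tree, feature_names=["lightness", "transparency"]))
# ...whereas a neural net would bury it in thousands of weights.
```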
12
I said that some ML algorithms are easy to understand and used decision trees as an example. But even
wonderful decision trees can become difficult to understand when, for example, we bundle them
together as random forests.
Neural nets are a good example where interpretability is almost always an issue, so I have used them as
the main example, but the problem is endemic in ML as a whole.
And it isn’t just NN
© Whitehorn and Bruner
13
Trusting the Black Box –
GDPR
GDPR is a set of EU data privacy regulations that is heavily impacting
data governance in many companies.
GDPR Article 22(1):
“The data subject shall have the right not to be subject to a decision based solely on automated
processing, including profiling, which produces legal effects concerning him or her or similarly
significantly affects him or her.”
Some commentators argue that GDPR therefore requires a “right to explanation”; however, there
isn’t a consensus on this interpretation.
© Whitehorn and Bruner
14
Irrespective of GDPR, interpretability is still important
• You may want to make sure that your model is not picking up a racial,
gender or religious bias. What if your model always refuses a loan to
people from a specific minority?
• Your model might be predicting the right thing, for the wrong
reasons. For example:
© Whitehorn and Bruner
15
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
The model predicts the right class but for the wrong reasons
© Whitehorn and Bruner
16
• In the Husky vs Wolves experiment*,
researchers built an image recognition
model that could classify Huskies and
Wolves very accurately.
• However, investigation revealed that
the recognition system was basing its
decisions on the snow in the background
of the image.
• Would you trust this model?
* Marco Tulio Ribeiro et al.
© Whitehorn and Bruner
Would you trust this model?
17
• By the same logic I can prove that
James Frost, one of the speakers at
this very conference is, in fact, …...
* Marco Tulio Ribeiro et al.
© Whitehorn and Bruner
Would you trust this model?
18
• By the same logic I can prove that
James Frost, one of the speakers at
this very conference is, in fact, a Wolf.
* Marco Tulio Ribeiro et al.
Would you trust this model?
Wolf
© Whitehorn and Bruner
19
• There are several definitions of interpretability in the context of a
Machine Learning model. Possibly the best is Interpretability as trust.
• Trust that the model is predicting a certain value for the “right reasons”.
• Interpretability is key to ensuring the social acceptance of Machine
Learning algorithms in our everyday lives (assuming that, as a society, we
actually want to use machine learning in this way).
Defining Interpretability
© Whitehorn and Bruner
20
Reference
https://arxiv.org/pdf/1602.04938.pdf
© Whitehorn and Bruner
21
Local Interpretable Model-Agnostic Explanations (LIME): An Introduction
A technique to explain the predictions of any machine learning classifier.
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
Adversarial Patch
Tom B. Brown, Dandelion Mané , Aurko Roy, Martín Abadi, Justin Gilmer
arxiv.org/pdf/1712.09665.pdf
Robust Physical-World Attacks on Deep Learning Visual Classification
Kevin Eykholt, Ivan Evtimov, et al.
arxiv.org/pdf/1707.08945.pdf
© Whitehorn and Bruner
22
Trusting the Black Box –
Adversarial Attacks
Deep Learning models, especially for image
recognition, are highly vulnerable to
adversarial attacks. Brown et al. recently
demonstrated that a patch which classifiers see
as a toaster, randomly applied to an image, could
dramatically affect the performance of a classifier.
More scarily, Evtimov et al. showed that
it's possible to fool a model built to
identify road signs by perturbing them
with stickers (and they provide a robust, general
attack algorithm).
Imagine the consequences for
autonomous vehicles!
© Whitehorn and Bruner
23
Trusting the Black Box – Here is a complex model
Convolutional Neural Network?
• This is Inception, one of the most commonly used architectures in image
recognition.
• It has 23,851,783 parameters across a total of 159 layers. Just the weights
and the architecture of the trained model are 92 MB.
© Whitehorn and Bruner
24
Trusting the Black Box – and another one!
• And what about this one?
• VGG16 is another frequently used architecture in image recognition.
• It has a mere 23 layers, but a total of 138,357,544 parameters and a size of 528 MB (a rough way to check such counts is sketched below).
© Whitehorn and Bruner
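If you want to check figures of this order yourself, a quick sketch (assuming TensorFlow/Keras is installed; exact counts vary slightly with the library version):

```python
from tensorflow.keras.applications import InceptionV3, VGG16

# weights=None builds the architecture without downloading the pre-trained weights
print(InceptionV3(weights=None).count_params())   # roughly 23.9 million parameters
print(VGG16(weights=None).count_params())         # roughly 138 million parameters
```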
25
Trusting the Black Box –
Stacking models
It has become common practice to train stacks of models to achieve minor improvements; this
is particularly common as a way to win Kaggle competitions (https://www.kaggle.com/).
• This is arguably not a very smart thing to do in a production system; there are too many
models to maintain. But it works! (A rough scikit-learn sketch follows the diagram below.)
[Diagram] Eight base models – Xgboost 1, Xgboost 2, Xgboost 3, Xgboost 4, Logistic Regression, Support Vector Machine, CNN 1 and CNN 2 – each produce their own predictions (Predictions 1–8), which are then combined by a final model or by simple averaging.
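A rough sketch of the same idea using scikit-learn's StackingClassifier (the base models and dataset here are stand-ins for the Xgboost/CNN mix on the slide):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base learner produces its own predictions; a final model then combines them
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```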
© Whitehorn and Bruner
26
Trusting the Black Box – Auto
ML
• Auto ML packages allow us to completely automate the ML pipeline (see the sketch below).
• They try many different models, tune them and then combine them.
• Examples are TPOT, DataRobot and H2O.ai.
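A minimal TPOT sketch (assuming the tpot package is installed; the dataset and parameters are purely illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# TPOT searches over many candidate pipelines (preprocessing + model) and tunes them
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=0)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")   # emits the winning pipeline as plain scikit-learn code
```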
© Whitehorn and Bruner
27
An Algorithmic Approach - Lime
© Whitehorn and Bruner
28
Trusting the Black Box – What
is Lime
• LIME stands for Local Interpretable Model-agnostic Explanations.
• Many algorithms allow you to inspect the globally most important features. With
LIME a user can identify the locally most important features that affected the model
output for a specific case.
• The beauty of LIME is that it's, as the name suggests, completely model-agnostic.
• For each case we want to explain, we build a dataset of perturbed instances and
we learn a simple, local model on this dataset. We then extract the n most important
features of this local model (a minimal usage sketch follows below).
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
© Whitehorn and Bruner
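A minimal sketch of the lime package on tabular data (the dataset and model are placeholders; the calls shown are LimeTabularExplainer's standard API):

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(data.data,
                                 feature_names=list(data.feature_names),
                                 class_names=list(data.target_names),
                                 discretize_continuous=True)

# Explain one case: perturb around it, query the black box, fit a simple local model
exp = explainer.explain_instance(data.data[0], model.predict_proba,
                                 num_features=5, num_samples=5000)
print(exp.as_list())   # the n locally most important features and their weights
```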
29
1. Get an instance you want to explain.
2. Split it into interpretable components.
3. Perturb a number of instances (the default is 5,000) and pass them through the original model to obtain a prediction value for each.
4. Create a local model of this newly generated dataset.
5. Extract the explanation.
Trusting the Black Box – How
does it work ?
© Whitehorn and Bruner
30
The original model's complex function (f) for identifying the image (which is unknown
to LIME) is represented by the blue/pink background.
The bold red cross is the instance being explained.
LIME creates perturbed instances and uses the original model (f) to provide
predictions. It then weights the instances by their proximity to the instance being
explained; that weight is represented here by size. The dashed line is the learned
explanation that is locally (but not globally) faithful.*
* Images and text adapted from Marco Tulio Ribeiro et al.
© Whitehorn and Bruner
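To show the mechanism rather than the library, here is a toy sketch of the same idea (the function name and parameters are ours, not LIME's): perturb around the instance, weight the perturbed points by proximity, fit a simple weighted linear model, and read the explanation off its coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_explanation(predict_fn, x, num_samples=5000, num_features=5, kernel_width=0.75):
    """Toy LIME-style explanation for a single instance x (a 1-D numpy array)."""
    rng = np.random.default_rng(0)
    # perturb around the instance and ask the original (black-box) model for predictions
    Z = x + rng.normal(size=(num_samples, x.shape[0]))
    y = predict_fn(Z)                                  # probability of the class of interest
    # weight the perturbed points by their proximity to x
    distances = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # fit a simple, local model to the perturbed dataset
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    # the largest coefficients are the local explanation
    top = np.argsort(-np.abs(surrogate.coef_))[:num_features]
    return list(zip(top, surrogate.coef_[top]))
```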
31
Imagine we have created a model to predict a Shakespeare play from a quote. We want to
explain the instance below:
“Now is the winter of our discontent made glorious summer by this son of York;” P(Richard III) = 0.7
We generate perturbed instances – each one has different words of the quote blanked out – and pass
each through the original model, obtaining probabilities such as 0.65, 0.72, 0.80, 0.45, 0.74 and 0.88.
Trusting the Black Box – How
does it work?
© Whitehorn and Bruner
32
We keep generating perturbed versions of the quote (each with different words removed) and collect
the model's probability for each: 0.65, 0.72, … 0.85. Fitting a simple local model to these perturbed
instances then tells us which words mattered most to the prediction:
Explanation
1. Summer
2. By
3. Discontent
4. Of
5. York
Trusting the Black Box – How
does it work?
© Whitehorn and Bruner
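For text, the same package offers a LimeTextExplainer; a hedged sketch of how the Shakespeare example might look (the training quotes and labels below are invented):

```python
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: quotes labelled with the play they come from
quotes = ["Now is the winter of our discontent",
          "A horse! a horse! my kingdom for a horse!",
          "To be, or not to be, that is the question",
          "Alas, poor Yorick! I knew him, Horatio"]
plays = ["Richard III", "Richard III", "Hamlet", "Hamlet"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(quotes, plays)

explainer = LimeTextExplainer(class_names=["Hamlet", "Richard III"])
exp = explainer.explain_instance(
    "Now is the winter of our discontent made glorious summer by this son of York;",
    model.predict_proba, num_features=5, num_samples=5000)
print(exp.as_list())   # the words that most pushed the prediction
```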
33
Trusting the Black Box – A few
drawbacks
• LIME works with tabular data and regression problems too, but the results
are harder to read than for images and text.
• For tabular data, continuous variables are discretized into quantiles
(rather than super-pixels or words).
• For image processing, depending on your choice of network
architecture, it can be slow – so it is not a good fit for a low-latency
production system, though it should really only be used for
internal analysis anyway.
© Whitehorn and Bruner
34
Interpretability in practice.
A Machine Learning model works with a set of features in a
multi-dimensional space, with the objective of minimizing a
function or maximizing a likelihood.
It's like a game, with a set of players (our inputs) trying to
reach an objective (a correct prediction). We need to be able to
understand which players contributed the most to the
objective.
© Whitehorn and Bruner
35
OK, OK, I've got this… in fact, whenever possible I plot
feature importances, to see which variables my model used the
most to issue predictions.
A possible solution…
Isn’t that enough?
© Whitehorn and Bruner
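For reference, this is the sort of plot being described: a global feature-importance ranking from a tree ensemble (the dataset and model are just placeholders). Note it gives one ranking for the whole model and says nothing about any individual prediction.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Global importances only: averaged over the whole model, not per prediction
order = model.feature_importances_.argsort()[::-1][:10]
plt.barh([data.feature_names[i] for i in order][::-1],
         model.feature_importances_[order][::-1])
plt.xlabel("Gini importance")
plt.tight_layout()
plt.show()
```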
36
Nope
A possible solution…
Well, maybe sometimes
© Whitehorn and Bruner
37
This is where SHAP can come to our aid
A possible solution…
© Whitehorn and Bruner
38
• SHAP stands for Shapley Additive Explanations.
It's an efficient, model-agnostic algorithm for
computing feature contributions to a model's
output.
• With non-linear black-box models SHAP
provides accurate and consistent feature
importance values.
• It allows meaningful, local explanations of
individual predictions.
• SHAP borrows concepts from cooperative
game theory: the Shapley values (a minimal usage sketch follows this slide).
SHAP? What is it?
It was developed by Scott
Lundberg and Su-In Lee at the
University of Washington*
https://arxiv.org/pdf/1705.07874.pdf
© Whitehorn and Bruner
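A minimal sketch of the shap package on a tree model (the dataset and model are placeholders; note that for a two-class classifier older shap releases return one array of SHAP values per class, which is what is assumed here):

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # contribution of every feature to every prediction

shap.summary_plot(shap_values[1], X)     # global view for the positive class
# Local view: how each feature pushed one single prediction away from the mean
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X.iloc[0], matplotlib=True)
```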
39
• Shapley values are a concept in cooperative game theory. They were
introduced in 1953 by the Nobel Prize winner Lloyd Shapley, one of the fathers of
Game Theory*.
• The overall intuition behind the concept is that sometimes a player's value in a team
can be greater than their value on their own.
• In a Machine Learning setting a Shapley value is “the contribution of a feature
value to the difference between the actual prediction and the mean prediction”…
• …which is equivalent to answering this question: “Given that without any features we
would just predict an average value, once we bring a feature in, how much does
our prediction change compared to the average?”
Shapley Values
https://en.wikipedia.org/wiki/Shapley_value
© Whitehorn and Bruner
40
1) Given a set N of players i, each coalition of players can be attributed a value v.
2) We calculate the set of permutations (orderings) R of N.
3) We then calculate the marginal contribution given by that feature in an ordering R as
$v(P_i^R \cup \{i\}) - v(P_i^R)$
i.e. the value of the set of features preceding i in the order R, including i, minus the value of the set of features preceding and excluding i.
4) Where $P_i^R$ is the set of players preceding i in the order R. Averaging this marginal contribution over all orderings gives the Shapley value:
$\phi_i = \frac{1}{|N|!} \sum_{R} \left[ v(P_i^R \cup \{i\}) - v(P_i^R) \right]$
Let’s start with the Math
© Whitehorn and Bruner
41
A moment of Calm
This is way easier than it looks, really.
© Whitehorn and Bruner
42
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
Algorithm:
i. Calculate all possible coalition permutations (orderings).
ii. For each permutation, take the set of players preceding our target Jedi.
iii. Include the target Jedi in this subset.
iv. Then subtract the contribution of the subset excluding the target Jedi.
© Whitehorn and Bruner
43
V(Yoda) = 10
V(Yoda + Luke) = 27
V(Yoda + Obi-Wan) = 35
V(Obi-Wan + Luke) = 25
V(Obi-Wan) = 9
V(Luke) = 8
V(Yoda + Obi-Wan + Luke) = 45
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
44
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L
Y, L, O
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
Algorithm:
i. Calculate all possible coalition permutations (orderings).
ii. For each permutation, take the set of players preceding our target Jedi.
iii. Include the target Jedi in this subset.
iv. Then subtract the contribution of the subset excluding the target Jedi.
© Whitehorn and Bruner
45
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10
Y, L, O
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
46
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25
Y, L, O
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
47
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
48
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | — | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
49
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
50
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | — | V(O) = 9
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
51
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | V(Y,O) – V(O) = 35 – 9 = 26 | V(O) = 9
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
52
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | V(Y,O) – V(O) = 35 – 9 = 26 | V(O) = 9 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
53
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | V(Y,O) – V(O) = 35 – 9 = 26 | V(O) = 9 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
O, L, Y | V(Y,L,O) – V(L,O) = 45 – 25 = 20 | V(O) = 9 | V(L,O) – V(O) = 25 – 9 = 16
L, Y, O | V(L,Y) – V(L) = 27 – 8 = 19 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L) = 8
L, O, Y | V(Y,L,O) – V(L,O) = 45 – 25 = 20 | V(O,L) – V(L) = 25 – 8 = 17 | V(L) = 8
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
54
Jedi | Initial Value | Payout (SHAP Value)
Yoda | 10 | (10 + 10 + 26 + 20 + 19 + 20) / 6 = 17.5
Obi-Wan | 9 | (25 + 18 + 9 + 9 + 18 + 17) / 6 = 16
Luke | 8 | (10 + 17 + 10 + 16 + 8 + 8) / 6 = 11.5
After calculating each player's marginal contributions*, we realize that although Luke is 20% weaker than
Yoda, he contributed 34% less than Yoda. Obi, in terms of contribution, is much closer to Yoda!
*”The Shapley value can be misinterpreted. The Shapley value of a feature value is not the difference of the predicted value after removing the feature from the model training. The
interpretation of the Shapley value is: Given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean
prediction is the estimated Shapley value” (https://christophm.github.io/interpretable-ml-book/shapley.html#general-idea)
Now we can calculate the payout for each Jedi
© Whitehorn and Bruner
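The whole worked example fits in a few lines of Python; this sketch brute-forces the same permutations and reproduces the payouts above (the coalition values are taken straight from the slides):

```python
from itertools import permutations
from math import factorial

# Coalition values from the slides: Y = Yoda, O = Obi-Wan, L = Luke
v = {frozenset(): 0,
     frozenset("Y"): 10, frozenset("O"): 9, frozenset("L"): 8,
     frozenset("YL"): 27, frozenset("YO"): 35, frozenset("OL"): 25,
     frozenset("YOL"): 45}

players = ["Y", "O", "L"]
payout = {p: 0.0 for p in players}
for order in permutations(players):
    seen = set()
    for p in order:
        # marginal contribution of p in this ordering
        payout[p] += v[frozenset(seen | {p})] - v[frozenset(seen)]
        seen.add(p)

for p in players:
    payout[p] /= factorial(len(players))   # average over the 3! = 6 orderings

print(payout)   # {'Y': 17.5, 'O': 16.0, 'L': 11.5}
```

Note that the three payouts sum to 45, the value of the full coalition – one of the defining properties of Shapley values.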
55
Of course, we won't really use Jedi knights.
We will be interested in the inputs to a Machine Learning algorithm, for example:
Q1 Time of accident?
Q2 Location?
Q3 Police informed?
© Whitehorn and Bruner
56
The great news is that we have yet to sort this problem out
completely; this is ongoing research.
You can come up with your own contributions.
Summary
© Whitehorn and Bruner
57
References
• SHAP paper: https://arxiv.org/pdf/1705.07874.pdf
• Article by Scott Lundberg presenting SHAP: https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27
• Article by Edward Ma on Shapley values: https://towardsdatascience.com/interpreting-your-deep-learning-model-by-shap-e69be2b47893
• Book on model interpretability: https://christophm.github.io/interpretable-ml-book/shapley.html#general-idea
• SHAP GitHub page: https://github.com/slundberg/shap/tree/master/shap/plots
• Wikipedia on Shapley values: https://en.wikipedia.org/wiki/Shapley_value
© Whitehorn and Bruner