TRANSCRIPT
1
Prof. Mark Whitehorn
Emeritus Professor of Analytics
Computing
University of Dundee
Consultant
Writer (author)
[email protected]
It’s all about us…
© Whitehorn and Bruner
2
I teach a Masters at Dundee in Data Science:
• Part time
• Distance learning – aimed at existing data professionals
© Whitehorn and Bruner
3
Giovanni Bruner
Data Scientist @ Nexi
www.nexi.it/en.html
It’s all about us…
Email: [email protected]
LinkedIn: https://www.linkedin.com/in/giovanni-bruner-22300937/
© Whitehorn and Bruner
4
Some machine learning algorithms not only work but the models they produce can readily be understood
by mere humans; decision trees are a wonderful example here. The same is not true of neural nets, which
conceal their decision-making process behind a massive smokescreen of numbers. But we live in an age
of accountability, where people have a right to know why their loan was refused or why their mother's hip
replacement was rescheduled for the fourth time.
This talk will outline (very briefly) why it is inherently difficult to understand how a given neural net came
to a given decision in a given case. Most of the talk will be spent looking at some of the work that is
going on to try to blow away the smokescreen. Please note that this is an introduction to the topic, which
means it will involve little to no maths.
An introduction to interpretability
LOCATION: GIELGUD
DATE: OCTOBER 1, 2019
TIME: 13:35 - 14:20
45 MINUTES
© Whitehorn and Bruner
5
Without necessarily knowing it, we normally use the von Neumann computational model*. This provides a
clear separation of the data from the instructions that manipulate it. NNs are different: in these, the flow
of the data itself changes the weightings, which are themselves part of the instructions for the
manipulation of the data.
*the incomplete First Draft of a Report on the EDVAC, John von Neumann, 1945
NN versus traditional programming
© Whitehorn and Bruner
6
Who is this
fictional character?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
7
OK it was an easy question.
For what is Sherlock Holmes
famous?
Who is this
fictional character?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
8
OK it was an easy question.
For what is Sherlock Holmes
famous?
Deduction. “the Science of Deduction and Analysis is one which can only be
acquired by long and patient study…”
The Sign of Four
Sir Arthur Conan Doyle
Who is this
fictional character?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
9
Perhaps we should let Dr. Watson
have the limelight just this once.
In 'A Study in Scarlet' he is asked
about some pills. He says :-
“From their lightness and
transparency, I should imagine that
they are soluble in water.”
But what is Deduction?
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
10
Deduction is applying a rule that
you already know to a specific
situation.
Induction is creating the rule in the
first place.
Compare and contrast
http://wwws.warnerbros.co.uk
© Whitehorn and Bruner
11
So one way of looking at this movement away from the von Neumann architecture
is that the machines are now doing the induction, which means that there is no
human behind the code generating the rules.
How can we explain the result if we don't understand the rules?
NN versus traditional programming
© Whitehorn and Bruner
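To make the deduction/induction distinction concrete, here is a minimal Python sketch (not from the slides; the data and the rule are invented for illustration). A hand-written rule is deduction at work; a decision tree inducing the rule from examples is the machine doing the induction.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Deduction: a human already knows the rule and the program merely applies it
def soluble(lightness, transparency):
    return lightness > 0.5 and transparency > 0.5

# Induction: the machine derives the rule from examples (toy, invented data)
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.3], [0.1, 0.2]]   # [lightness, transparency]
y = [1, 1, 0, 0]                                        # 1 = soluble, 0 = not soluble
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# A decision tree's induced rule is at least still readable...
print(export_text(tree, feature_names=["lightness", "transparency"]))
# ...whereas a neural net would bury it in thousands of weights.
```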
12
I said that some ML algorithms are easy to understand and used decision trees as an example. But even
wonderful decision trees can become difficult to understand when, for example, we bundle them
together as random forests.
Neural nets are a good example where interpretability is almost always an issue, so I have used them as
the main example, but the problem is endemic in ML as a whole.
And it isn’t just NN
© Whitehorn and Bruner
13
Trusting the Black Box –
GDPR
GDPR is a set of EU data privacy regulations that is heavily impacting
data governance in many companies.
GDPR Article 22(1):
“The data subject shall have the right not to be subject to a decision based solely on automated
processing, including profiling, which produces legal effects concerning him or her or similarly
significantly affects him or her.”
Some commentators argue that GDPR therefore requires a “right to explanation”; however, there
isn’t a consensus on this interpretation.
© Whitehorn and Bruner
14
Irrespective of GDPR, interpretability is still important
• You may want to make sure that your model is not picking up a racial,
gender or religious bias. What if your model always refuses a loan to
people from a specific minority?
• Your model might be predicting the right thing, for the wrong
reasons. For example:
© Whitehorn and Bruner
15
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
The model predicts the right class but for the wrong reasons
© Whitehorn and Bruner
16
• In the Husky vs Wolves experiment*,
researchers built an image recognition
model that could classify Huskies and
Wolves very accurately.
• However, investigation revealed that
the recognition system was basing its
decisions on the snow in the background
of the image.
• Would you trust this model?
* Marco Tulio Ribeiro et al.
© Whitehorn and Bruner
Would you trust this model?
17
• By the same logic I can prove that
James Frost, one of the speakers at
this very conference is, in fact, …...
* Marco Tulio Ribeiro et al.
© Whitehorn and Bruner
Would you trust this model?
18
• By the same logic I can prove that
James Frost, one of the speakers at
this very conference is, in fact, a Wolf.
* Marco Tulio Ribeiro et al.
Would you trust this model?
Wolf
© Whitehorn and Bruner
19
• There are several definitions of interpretability in the context of a
Machine Learning model. Possibly the best is Interpretability as trust.
• Trust that the model is predicting a certain value for the “right reasons”.
• Interpretability is key to ensuring the social acceptance of Machine
Learning algorithms in our everyday lives (assuming that, as a society, we
actually want to use machine learning in this way).
Defining Interpretability
© Whitehorn and Bruner
20
Reference
https://arxiv.org/pdf/1602.04938.pdf
© Whitehorn and Bruner
21
Local Interpretable Model-Agnostic Explanations (LIME): An Introduction
A technique to explain the predictions of any machine learning classifier.
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin
www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
Adversarial Patch
Tom B. Brown, Dandelion Mané , Aurko Roy, Martín Abadi, Justin Gilmer
arxiv.org/pdf/1712.09665.pdf
Robust Physical-World Attacks on Deep Learning Visual Classification
Kevin Eykholt, Ivan Evtimov, et al.
arxiv.org/pdf/1707.08945.pdf
© Whitehorn and Bruner
22
Trusting the Black Box –
Adversarial Attacks
Deep Learning models, especially for image
recognition, are highly vulnerable to
adversarial attacks. Brown et al. recently
demonstrated that a patch which classifiers see
as a toaster, randomly applied to an image, could
dramatically affect the performance of a classifier.
More scarily, Evtimov et al. showed that
it's possible to fool a model built to
identify road signs by perturbing them
with stickers (and they provide a robust, general
attack algorithm).
Imagine the consequences for
autonomous vehicles!
© Whitehorn and Bruner
23
Trusting the Black Box – Here is a complex model
Convolutional Neural Network?
• This is Inception, one of the most commonly used architectures in image
recognition.
• It has 23,851,783 parameters across a total of 159 layers. Just the weights
and the architecture of the trained model are 92 MB.
© Whitehorn and Bruner
24
Trusting the Black Box – and another one!
• And what about this one?
• VGG16 is another frequently used architecture in image recognition.
• It has a mere 23 layers, but a total of 138,357,544 parameters and a size of 528 MB (a rough way to check such counts is sketched below).
© Whitehorn and Bruner
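If you want to check figures of this order yourself, a quick sketch (assuming TensorFlow/Keras is installed; exact counts vary slightly with the library version):

```python
from tensorflow.keras.applications import InceptionV3, VGG16

# weights=None builds the architecture without downloading the pre-trained weights
print(InceptionV3(weights=None).count_params())   # roughly 23.9 million parameters
print(VGG16(weights=None).count_params())         # roughly 138 million parameters
```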
25
Trusting the Black Box –
Stacking models
It has become common practice to train stacks of models to achieve minor improvements; this
is particularly common as a way to win Kaggle competitions (https://www.kaggle.com/).
• This is arguably not a very smart thing to do in a production system; there are too many
models to maintain. But it works! (A rough scikit-learn sketch follows the diagram below.)
[Diagram] Eight base models – Xgboost 1, Xgboost 2, Xgboost 3, Xgboost 4, Logistic Regression, Support Vector Machine, CNN 1 and CNN 2 – each produce their own predictions (Predictions 1–8), which are then combined by a final model or by simple averaging.
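A rough sketch of the same idea using scikit-learn's StackingClassifier (the base models and dataset here are stand-ins for the Xgboost/CNN mix on the slide):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each base learner produces its own predictions; a final model then combines them
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("svm", SVC(probability=True))],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```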
© Whitehorn and Bruner
26
Trusting the Black Box – Auto
ML
• Auto ML packages allow us to completely automate the ML pipeline (see the sketch below).
• They try many different models, tune them and then combine them.
• Examples are TPOT, DataRobot and H2O.ai.
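A minimal TPOT sketch (assuming the tpot package is installed; the dataset and parameters are purely illustrative):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# TPOT searches over many candidate pipelines (preprocessing + model) and tunes them
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=0)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")   # emits the winning pipeline as plain scikit-learn code
```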
© Whitehorn and Bruner
27
An Algorithmic Approach - Lime
© Whitehorn and Bruner
28
Trusting the Black Box – What
is Lime
• LIME stands for Local Interpretable Model-agnostic Explanations.
• Many algorithms allow you to inspect the globally most important features. With
LIME a user can identify the locally most important features that affected the model
output for a specific case.
• The beauty of LIME is that it's, as the name suggests, completely model-agnostic.
• For each case we want to explain, we build a dataset of perturbed instances and
we learn a simple, local model on this dataset. We then extract the n most important
features of this local model (a minimal usage sketch follows below).
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
© Whitehorn and Bruner
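A minimal sketch of the lime package on tabular data (the dataset and model are placeholders; the calls shown are LimeTabularExplainer's standard API):

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(data.data,
                                 feature_names=list(data.feature_names),
                                 class_names=list(data.target_names),
                                 discretize_continuous=True)

# Explain one case: perturb around it, query the black box, fit a simple local model
exp = explainer.explain_instance(data.data[0], model.predict_proba,
                                 num_features=5, num_samples=5000)
print(exp.as_list())   # the n locally most important features and their weights
```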
29
1. Get an instance you want to explain.
2. Split it into interpretable components.
3. Perturb a number of instances (the default is 5,000) and pass them through the original model to obtain a prediction value for each.
4. Create a local model of this newly generated dataset.
5. Extract the explanation.
Trusting the Black Box – How
does it work ?
© Whitehorn and Bruner
30
The original model's complex function (f) for identifying the image (which is unknown
to LIME) is represented by the blue/pink background.
The bold red cross is the instance being explained.
LIME creates perturbed instances and uses the original model (f) to provide
predictions. It then weights the instances by their proximity to the instance being
explained; that weight is represented here by size. The dashed line is the learned
explanation that is locally (but not globally) faithful.*
* Images and text adapted from Marco Tulio Ribeiro et al.
© Whitehorn and Bruner
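To show the mechanism rather than the library, here is a toy sketch of the same idea (the function name and parameters are ours, not LIME's): perturb around the instance, weight the perturbed points by proximity, fit a simple weighted linear model, and read the explanation off its coefficients.

```python
import numpy as np
from sklearn.linear_model import Ridge

def local_explanation(predict_fn, x, num_samples=5000, num_features=5, kernel_width=0.75):
    """Toy LIME-style explanation for a single instance x (a 1-D numpy array)."""
    rng = np.random.default_rng(0)
    # perturb around the instance and ask the original (black-box) model for predictions
    Z = x + rng.normal(size=(num_samples, x.shape[0]))
    y = predict_fn(Z)                                  # probability of the class of interest
    # weight the perturbed points by their proximity to x
    distances = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # fit a simple, local model to the perturbed dataset
    surrogate = Ridge(alpha=1.0).fit(Z, y, sample_weight=weights)
    # the largest coefficients are the local explanation
    top = np.argsort(-np.abs(surrogate.coef_))[:num_features]
    return list(zip(top, surrogate.coef_[top]))
```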
31
Imagine we have created a model to predict a Shakespeare play from a quote. We want to
explain the instance below:
“Now is the winter of our discontent made glorious summer by this son of York;” P(Richard III) = 0.7
We generate perturbed instances – each one has different words of the quote blanked out – and pass
each through the original model, obtaining probabilities such as 0.65, 0.72, 0.80, 0.45, 0.74 and 0.88.
Trusting the Black Box – How
does it work?
© Whitehorn and Bruner
32
We keep generating perturbed versions of the quote (each with different words removed) and collect
the model's probability for each: 0.65, 0.72, … 0.85. Fitting a simple local model to these perturbed
instances then tells us which words mattered most to the prediction:
Explanation
1. Summer
2. By
3. Discontent
4. Of
5. York
Trusting the Black Box – How
does it work?
© Whitehorn and Bruner
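For text, the same package offers a LimeTextExplainer; a hedged sketch of how the Shakespeare example might look (the training quotes and labels below are invented):

```python
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented corpus: quotes labelled with the play they come from
quotes = ["Now is the winter of our discontent",
          "A horse! a horse! my kingdom for a horse!",
          "To be, or not to be, that is the question",
          "Alas, poor Yorick! I knew him, Horatio"]
plays = ["Richard III", "Richard III", "Hamlet", "Hamlet"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(quotes, plays)

explainer = LimeTextExplainer(class_names=["Hamlet", "Richard III"])
exp = explainer.explain_instance(
    "Now is the winter of our discontent made glorious summer by this son of York;",
    model.predict_proba, num_features=5, num_samples=5000)
print(exp.as_list())   # the words that most pushed the prediction
```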
33
Trusting the Black Box – A few
drawbacks
• LIME works with tabular data and regression problems too, but the results
are harder to read than for images and text.
• For tabular data, continuous variables are discretized into quantiles
(rather than super-pixels or words).
• For image processing, depending on your choice of network
architecture, it can be slow – so it is not a good fit for a low-latency
production system, though it should really only be used for
internal analysis anyway.
© Whitehorn and Bruner
34
Interpretability in practice.
A Machine Learning model works with a set of features in a
multi-dimensional space, with the objective of minimizing a
function or maximizing a likelihood.
It's like a game, with a set of players (our inputs) trying to
reach an objective (a correct prediction). We need to be able to
understand which players contributed the most to the
objective.
© Whitehorn and Bruner
35
OK, OK, I've got this… in fact, whenever possible I plot
feature importances, to see which variables my model used the
most to issue predictions.
A possible solution…
Isn’t that enough?
© Whitehorn and Bruner
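For reference, this is the sort of plot being described: a global feature-importance ranking from a tree ensemble (the dataset and model are just placeholders). Note it gives one ranking for the whole model and says nothing about any individual prediction.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

# Global importances only: averaged over the whole model, not per prediction
order = model.feature_importances_.argsort()[::-1][:10]
plt.barh([data.feature_names[i] for i in order][::-1],
         model.feature_importances_[order][::-1])
plt.xlabel("Gini importance")
plt.tight_layout()
plt.show()
```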
36
Nope
A possible solution…
Well, maybe sometimes
© Whitehorn and Bruner
37
This is where SHAP can come to our aid
A possible solution…
© Whitehorn and Bruner
38
• SHAP stands for Shapley Additive Explanations.
It's an efficient, model-agnostic algorithm for
computing feature contributions to a model's
output.
• With non-linear black-box models SHAP
provides accurate and consistent feature
importance values.
• It allows meaningful, local explanations of
individual predictions.
• SHAP borrows concepts from cooperative
game theory: the Shapley values (a minimal usage sketch follows this slide).
SHAP? What is it?
It was developed by Scott
Lundberg and Su-In Lee at the
University of Washington*
https://arxiv.org/pdf/1705.07874.pdf
© Whitehorn and Bruner
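A minimal sketch of the shap package on a tree model (the dataset and model are placeholders; note that for a two-class classifier older shap releases return one array of SHAP values per class, which is what is assumed here):

```python
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # contribution of every feature to every prediction

shap.summary_plot(shap_values[1], X)     # global view for the positive class
# Local view: how each feature pushed one single prediction away from the mean
shap.force_plot(explainer.expected_value[1], shap_values[1][0], X.iloc[0], matplotlib=True)
```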
39
• Shapley values are a concept in cooperative game theory. They were
introduced in 1953 by the Nobel Prize winner Lloyd Shapley, one of the fathers of
Game Theory*.
• The overall intuition behind the concept is that sometimes a player's value in a team
can be greater than their value on their own.
• In a Machine Learning setting a Shapley value is “the contribution of a feature
value to the difference between the actual prediction and the mean prediction”…
• …which is equivalent to answering this question: “Given that without any features we
would just predict an average value, once we bring a feature in, how much does
our prediction change compared to the average?”
Shapley Values
https://en.wikipedia.org/wiki/Shapley_value
© Whitehorn and Bruner
40
1) Given a set N of players i, each coalition of players can be attributed a value v.
2) We calculate the set of permutations (orderings) R of N.
3) We then calculate the marginal contribution given by that feature in an ordering R as
$v(P_i^R \cup \{i\}) - v(P_i^R)$
i.e. the value of the set of features preceding i in the order R, including i, minus the value of the set of features preceding and excluding i.
4) Where $P_i^R$ is the set of players preceding i in the order R. Averaging this marginal contribution over all orderings gives the Shapley value:
$\phi_i = \frac{1}{|N|!} \sum_{R} \left[ v(P_i^R \cup \{i\}) - v(P_i^R) \right]$
Let’s start with the Math
© Whitehorn and Bruner
41
A moment of Calm
This is way easier than it looks, really.
© Whitehorn and Bruner
42
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
Algorithm:
i. Calculate all possible coalition permutations (orderings).
ii. For each permutation, take the set of players preceding our target Jedi.
iii. Include the target Jedi in this subset.
iv. Then subtract the contribution of the subset excluding the target Jedi.
© Whitehorn and Bruner
43
V(Yoda) = 10
V(Yoda + Luke) = 27
V(Yoda + Obi-Wan) = 35
V(Obi-Wan + Luke) = 25
V(Obi-Wan) = 9
V(Luke) = 8
V(Yoda + Obi-Wan + Luke) = 45
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
44
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L
Y, L, O
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
Algorithm:
i. Calculate all possible coalition permutations (orderings).
ii. For each permutation, take the set of players preceding our target Jedi.
iii. Include the target Jedi in this subset.
iv. Then subtract the contribution of the subset excluding the target Jedi.
© Whitehorn and Bruner
45
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10
Y, L, O
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
46
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25
Y, L, O
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
47
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
48
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | — | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
49
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
50
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | — | V(O) = 9
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
51
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | V(Y,O) – V(O) = 35 – 9 = 26 | V(O) = 9
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
52
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | V(Y,O) – V(O) = 35 – 9 = 26 | V(O) = 9 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
O, L, Y
L, Y, O
L, O, Y
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
53
Order R | Yoda Contribution* | Obi Contribution* | Luke Contribution*
Y, O, L | V(Y) = 10 | V(O,Y) – V(Y) = 35 – 10 = 25 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
Y, L, O | V(Y) = 10 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L,Y) – V(Y) = 27 – 10 = 17
O, Y, L | V(Y,O) – V(O) = 35 – 9 = 26 | V(O) = 9 | V(L,O,Y) – V(O,Y) = 45 – 35 = 10
O, L, Y | V(Y,L,O) – V(L,O) = 45 – 25 = 20 | V(O) = 9 | V(L,O) – V(O) = 25 – 9 = 16
L, Y, O | V(L,Y) – V(L) = 27 – 8 = 19 | V(O,L,Y) – V(L,Y) = 45 – 27 = 18 | V(L) = 8
L, O, Y | V(Y,L,O) – V(L,O) = 45 – 25 = 20 | V(O,L) – V(L) = 25 – 8 = 17 | V(L) = 8
* Marginal Contributions
Some friends may help explain this…
Our Coalition
Our Objective: Kill Vader
© Whitehorn and Bruner
54
Jedi | Initial Value | Payout (SHAP Value)
Yoda | 10 | (10 + 10 + 26 + 20 + 19 + 20) / 6 = 17.5
Obi-Wan | 9 | (25 + 18 + 9 + 9 + 18 + 17) / 6 = 16
Luke | 8 | (10 + 17 + 10 + 16 + 8 + 8) / 6 = 11.5
After calculating each player's marginal contributions*, we realize that although Luke is 20% weaker than
Yoda, he contributed 34% less than Yoda. Obi, in terms of contribution, is much closer to Yoda!
*”The Shapley value can be misinterpreted. The Shapley value of a feature value is not the difference of the predicted value after removing the feature from the model training. The
interpretation of the Shapley value is: Given the current set of feature values, the contribution of a feature value to the difference between the actual prediction and the mean
prediction is the estimated Shapley value” (https://christophm.github.io/interpretable-ml-book/shapley.html#general-idea)
Now we can calculate the payout for each Jedi
© Whitehorn and Bruner
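The whole worked example fits in a few lines of Python; this sketch brute-forces the same permutations and reproduces the payouts above (the coalition values are taken straight from the slides):

```python
from itertools import permutations
from math import factorial

# Coalition values from the slides: Y = Yoda, O = Obi-Wan, L = Luke
v = {frozenset(): 0,
     frozenset("Y"): 10, frozenset("O"): 9, frozenset("L"): 8,
     frozenset("YL"): 27, frozenset("YO"): 35, frozenset("OL"): 25,
     frozenset("YOL"): 45}

players = ["Y", "O", "L"]
payout = {p: 0.0 for p in players}
for order in permutations(players):
    seen = set()
    for p in order:
        # marginal contribution of p in this ordering
        payout[p] += v[frozenset(seen | {p})] - v[frozenset(seen)]
        seen.add(p)

for p in players:
    payout[p] /= factorial(len(players))   # average over the 3! = 6 orderings

print(payout)   # {'Y': 17.5, 'O': 16.0, 'L': 11.5}
```

Note that the three payouts sum to 45, the value of the full coalition – one of the defining properties of Shapley values.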
55
Of course, we won't really use Jedi knights.
We will be interested in the inputs to a Machine Learning algorithm, for example:
Q1 Time of accident?
Q2 Location?
Q3 Police informed?
© Whitehorn and Bruner
56
The great news is that we have yet to sort this problem out
completely; this is ongoing research.
You can come up with your own contributions.
Summary
© Whitehorn and Bruner
57
References
• SHAP paper: https://arxiv.org/pdf/1705.07874.pdf
• Article by Scott Lundberg presenting SHAP: https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27
• Article by Edward Ma on Shapley values: https://towardsdatascience.com/interpreting-your-deep-learning-model-by-shap-e69be2b47893
• Book on model interpretability: https://christophm.github.io/interpretable-ml-book/shapley.html#general-idea
• SHAP GitHub page: https://github.com/slundberg/shap/tree/master/shap/plots
• Wikipedia on Shapley values: https://en.wikipedia.org/wiki/Shapley_value
© Whitehorn and Bruner