Making & Breaking Machine Learning Anomaly Detectors in Real Life, by Clarence Chio (CODE BLUE)
TRANSCRIPT
My goal
• give an overview of Machine Learning Anomaly Detectors
• spark discussions on when/where/how to create these
• explore how “safe” these systems are
• discuss where we go from here
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0
205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985
how to find anomalies in these???
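Before any detection can happen, raw lines like the ones above have to become structured records. A minimal sketch of parsing the Apache Common Log Format shown above; the regex, function name, and field names are my own, not from the talk:

```python
import re
from datetime import datetime

# Apache Common Log Format, as in the NASA access logs above
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Turn one raw access-log line into a feature-friendly dict."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    d = m.groupdict()
    d['ts'] = datetime.strptime(d['ts'], '%d/%b/%Y:%H:%M:%S %z')
    d['status'] = int(d['status'])
    d['size'] = 0 if d['size'] == '-' else int(d['size'])
    return d

line = ('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
        '"GET /history/apollo/ HTTP/1.0" 200 6245')
rec = parse_line(line)
```

From records like `rec`, you can aggregate per-host request rates, error ratios, or byte counts, which are the kinds of time-series the rest of the talk builds on.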
Why are ML-based techniques attractive compared to threshold/rule-based heuristics?
• adaptive • dynamic • minimal human intervention (theoretically)
Why are threshold/rule-based heuristics good?
• easy to reason about • simple & understandable • can also be dynamic/adaptive
The big ML + Anomaly Detection Problem
a lot of machine learning + anomaly detection research, but not a lot of successful systems in the real world.
WHY?
The big ML + Anomaly Detection Problem
Anomaly Detection: find novel attacks, identify never-seen-before things
Traditional Machine Learning: learn patterns, identify similar things
What makes Anomaly Detection so different?
fundamentally different from other ML problems:
• very high cost of errors • lack of training data • “semantic gap” • difficulties in evaluation • adversarial setting
Really bad if the system is wrong…
• compared to other learning applications, very intolerant to errors
• what happens if we have a high false positive rate?
• high false negative rate?
Hard to interpret the results/alerts…
the “semantic gap”: ok… I got the alert… why did I get the alert…?
The evaluation problem
• devising a sound evaluation scheme is even more difficult than building the system itself
• problems with relying on ML Anomaly Detection evaluations in academic research papers
How have real world AD systems failed?
• many false positives • hard to find attack-free training data • used without deep understanding • model-poisoning
Doing it!
• generate time-series • select representative features • train/tune model of ‘normality’ • alert if incoming points deviate from model
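The loop above (model 'normality', alert on deviation) can be sketched as a simple rolling-window detector; the window size, the 3-sigma threshold, and all names here are illustrative choices of mine, not the talk's:

```python
from collections import deque
from statistics import mean, stdev

def make_detector(window=60, threshold=3.0):
    """Flag points that deviate more than `threshold` standard
    deviations from the rolling mean of recent observations."""
    history = deque(maxlen=window)

    def observe(x):
        alert = False
        if len(history) >= 2:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) > threshold * sigma:
                alert = True
        history.append(x)  # update the model of 'normality'
        return alert

    return observe

detect = make_detector(window=10, threshold=3.0)
normal = [detect(v) for v in [10, 11, 9, 10, 12, 10, 11, 10, 9, 10]]
spike = detect(100)  # far outside the rolling distribution
```

This is essentially a threshold heuristic with an adaptive threshold, which is also the simplest illustration of why such systems can be poisoned: the attacker influences `history`.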
Common Techniques
• density-based • subspace/correlation-based • support vector machines • clustering • neural networks
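As one concrete instance of the density-based family listed above, a k-nearest-neighbor distance score: points in sparse regions get high scores. This is a simplified illustration of mine, not an implementation from the talk:

```python
import math

def knn_score(points, query, k=3):
    """Anomaly score = mean distance to the k nearest training points.
    Larger scores mean the query sits in a sparser (more anomalous) region."""
    dists = sorted(math.dist(query, p) for p in points)
    return sum(dists[:k]) / k

# Dense cluster of 'normal' points, plus one far-away query
train = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
normal_score = knn_score(train, (0.5, 0.4))
outlier_score = knn_score(train, (10, 10))
```

Thresholding this score gives a detector; the other families (clustering, one-class SVMs, neural networks) differ mainly in how the model of 'normal' density is built.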
How to select features?
• often ends up being the most challenging piece of the problem
• isn’t it just a parameter optimization problem?
How to select features?
Difficulties: • too many possible combinations to iterate! • hard to evaluate • frequently changing “optimal”
• prediction accuracy is not the only criterion • improved model interpretability • shorter training times • enhanced generalization / reduced overfitting
Principal Component Analysis
• common statistical method to automatically select features
How?
• transform data into different dimensions
• returns an ordered list of dimensions (principal components) that best represent the data’s variance
http://setosa.io/ev/principal-component-analysis/
[Figure: projecting the data onto a candidate axis vs. the true PCA result, which maximizes the variance captured]
choose principal components that cover 80-90% of the dataset's variance
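The 80-90% rule above can be applied directly to the eigenvalue spectrum of the data's covariance matrix. A numpy sketch (function name and the synthetic dataset are mine; the 0.9 cutoff is from the slide's suggested range):

```python
import numpy as np

def n_components_for_variance(X, target=0.9):
    """Return how many principal components are needed to capture
    `target` fraction of the dataset's variance."""
    Xc = X - X.mean(axis=0)                      # center the data
    cov = np.cov(Xc, rowvar=False)               # covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigenvalues, descending
    ratios = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance captured
    return int(np.searchsorted(ratios, target) + 1)

rng = np.random.default_rng(0)
# 3 informative dimensions embedded in 10, plus a little noise:
# most variance should be captured by very few components
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 10))
X += 0.01 * rng.normal(size=(500, 10))
k = n_components_for_variance(X, target=0.9)
```

The `ratios` array is exactly what a scree plot draws; `k` is where the curve crosses the chosen cutoff.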
[“Scree” plot: cumulative % variance captured (0-100) vs. number of principal components used (10^0 to 10^3, log scale); a sharp early rise means PCA is more effective, a slow rise means PCA is less effective]
How to avoid common pitfalls?
• Understand your threat model well • Keep the detection scope narrow • Reduce the cost of false negatives/positives
How do we attack this?
• manipulate the learning system to permit a specific attack
• degrade the performance of the system to compromise its reliability
Attacking PCA-based systems
[Figure: injecting “chaff” along the attack direction shifts the model's center (center before attack vs. center after attack), dragging the decision boundary until the attack direction falls inside it]
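The effect in the figure can be reproduced numerically: chaff injected along the attacker's target direction drags both the center and the principal subspace toward it. A toy illustration with made-up numbers (all names and magnitudes are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
# Normal traffic: tight isotropic cluster around the origin
normal = rng.normal(0, 1, size=(1000, 2))

# Attacker's target direction, and chaff injected along it
attack_dir = np.array([1.0, 1.0]) / np.sqrt(2)
chaff = 8.0 * attack_dir + rng.normal(0, 0.5, size=(100, 2))

poisoned = np.vstack([normal, chaff])

# The center shifts toward the attack direction...
shift = poisoned.mean(axis=0) - normal.mean(axis=0)

# ...and the top principal component rotates toward it too
def top_pc(X):
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

align_after = abs(top_pc(poisoned) @ attack_dir)  # ~1.0 means aligned
```

Once the subspace aligns with the attack direction, traffic along that direction no longer looks anomalous to the detector, which is the whole point of the attack.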
Can Machine Learning be secure?
not easy to achieve for unsupervised, online learning
slowing adversaries down gives you time to detect when you’re being targeted
Robust statistics
use median instead of mean: PCA’s ‘variance’ maximization vs. Antidote’s ‘median absolute deviation’
find an appropriate distribution that models your dataset
normal/Gaussian vs. Laplacian distributions
use robust PCA
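The median/MAD substitution above can be seen on a toy sample: a few injected chaff points drag the mean and standard deviation far off, while the median and median absolute deviation barely move. The numbers here are illustrative, mine:

```python
from statistics import mean, median, stdev

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9]
poisoned = clean + [50.0, 55.0]          # attacker-injected "chaff"

def mad(xs):
    """Median absolute deviation: a robust estimate of spread."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

# How much does each statistic move under poisoning?
mean_shift = abs(mean(poisoned) - mean(clean))       # large
median_shift = abs(median(poisoned) - median(clean)) # tiny
std_shift = abs(stdev(poisoned) - stdev(clean))      # large
mad_shift = abs(mad(poisoned) - mad(clean))          # tiny
```

This is the intuition behind Antidote's robust PCA: an attacker needs to control a large fraction of the data to move a median-based statistic, which is why the evasion numbers below drop so sharply.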
[Figure: scatter of traffic, projection onto the 1st principal component (x-axis) vs. projection onto the “target flow” (y-axis)]
by the way, generating this chaff is hard
[Figure: projection onto the 1st principal component vs. projection onto the “target flow”, shown across training periods 2, 4, 6, 8, and 10]
[Figure: ROC curves, poisoning detection rate (true positive rate) vs. false alarm rate (false positive rate), both from 0 to 1.0; comparing a Random Detector, RPCA with no poisoning, RPCA under a Boiling Frog attack (50% chaff over 10 training periods), RPCA with 30% chaff, and RPCA with 50% chaff]
[Figure: evasion success rate (false negative rate, 0 to 1.0) vs. attack duration (0 to 10 training periods), with chaff injected at 0%, 10%, 20%, 30%, 40%, and 50%; comparing Boiling Frog injection (RPCA, 50% chaff spread over x periods) against RPCA under naive injection]
Evasion success rates
• Naive chaff injection (50% injection, single training period): ~76% evasion success against naive PCA, ~14% against robust(er) PCA
• Boiling Frog injection (10 training periods): ~87% evasion success against naive PCA, ~38% against robust(er) PCA
Anomaly detection systems today
• not so good, but improving…
• pure ML-based anomaly detectors are still vulnerable to compromise
• use ML to find features and thresholds, then run streaming anomaly detection using static rules
What next?
• do more tests on AD systems that others have created
• other defenses against poisoning techniques
• experiment with more resilient ML models