
Making & Breaking Machine Learning Anomaly Detectors in Real Life

Clarence Chio, CODE BLUE

Machine Learning

My goal

• give an overview of Machine Learning Anomaly Detectors

• spark discussions on when/where/how to create these

• explore how “safe” these systems are

• discuss where we go from here

Anomaly Detection vs. Machine Learning

Taxonomy

[Taxonomy diagram: Anomaly Detection vs. Machine Learning, spanning Heuristics/Rule-based to Predictive ML]

Intrusion detection?

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985

199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085

burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0

199.120.110.21 - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0" 200 4179

burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0

burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0

205.212.115.106 - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/countdown.html HTTP/1.0" 200 3985

how do we find anomalies in these?
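To make this concrete, here is a minimal sketch that parses Common Log Format lines like the ones above into fields and counts requests per host, one candidate feature (the regex and feature choice are my own illustration, not from any particular system):

```python
import re
from collections import Counter

# Common Log Format: host - - [timestamp] "request" status bytes
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

lines = [
    '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245',
    'burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0',
    'burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0',
]

parsed = [m.groupdict() for m in map(LOG_RE.match, lines) if m]
requests_per_host = Counter(p['host'] for p in parsed)  # one candidate feature
print(requests_per_host)
```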

Why are ML-based techniques attractive compared to threshold/rule-based heuristics?

• adaptive
• dynamic
• minimal human intervention (theoretically)

but…

Why are threshold/rule-based heuristics good?

• easy to reason about
• simple & understandable
• can also be dynamic/adaptive

Successful ML Applications

Setting Expectations

The big ML + Anomaly Detection Problem

a lot of machine learning + anomaly detection research, but not a lot of successful systems in the real world.

WHY?

The big ML + Anomaly Detection Problem

Anomaly Detection: find novel attacks, identify never-before-seen things

Traditional Machine Learning: learn patterns, identify similar things

What makes Anomaly Detection so different?

fundamentally different from other ML problems:
• very high cost of errors
• lack of training data
• “semantic gap”
• difficulties in evaluation
• adversarial setting

Really bad if the system is wrong…

• compared to other learning applications, very intolerant to errors

•what happens if we have a high false positive rate?

•high false negative rate?

Lack of training data…

• what data do we train the model on?
• input data is so hard to clean!

Hard to interpret the results/alerts…

the “semantic gap”: ok… I got the alert… but why did I get the alert?

The evaluation problem

• devising a sound evaluation scheme is even more difficult than building the system itself

• problems with relying on ML Anomaly Detection evaluations in academic research papers

Adversarial impact

advanced actors can (and will) spend the time and effort to bypass the system

How have real world AD systems failed?

• many false positives
• hard to find attack-free training data
• used without deep understanding
• model poisoning

So… is it hopeless?

Doing it!

• generate a time series
• select representative features
• train/tune a model of “normality”
• alert if incoming points deviate from the model (sketched below)
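A toy end-to-end version of this loop, assuming a simple Gaussian model of “normality” over a single feature (the data, names, and the 3-sigma threshold are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. generate a time series (synthetic requests-per-minute counts)
train = rng.poisson(lam=100, size=500).astype(float)

# 2. the "feature" here is just the raw count (real systems use many)
# 3. train a model of normality: mean and standard deviation
mu, sigma = train.mean(), train.std()

# 4. alert if an incoming point deviates too far from the model
THRESHOLD = 3.0  # illustrative: 3 standard deviations

def is_anomalous(x):
    return abs(x - mu) / sigma > THRESHOLD

for x in [98, 112, 400]:
    print(x, "ANOMALY" if is_anomalous(x) else "ok")
```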

Example infrastructure

Sensitivity of PCA for Traffic Anomaly Detection, Ringberg et al.

Common Techniques

• density-based
• subspace/correlation-based
• support vector machines
• clustering
• neural networks

“Model”? Clusters

• centroid clusters
• good for “online learning” (sketched below)
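A minimal sketch of a centroid cluster with an incremental mean update, where the distance to the centroid serves as the anomaly score (the class, the data, and the scoring choice are illustrative assumptions):

```python
import numpy as np

class OnlineCentroid:
    """A single centroid cluster with an incremental mean update."""
    def __init__(self, dim):
        self.center = np.zeros(dim)
        self.n = 0

    def score(self, x):
        # distance to the centroid serves as the anomaly score
        return float(np.linalg.norm(x - self.center))

    def update(self, x):
        # running-mean update: no need to store past points (online learning)
        self.n += 1
        self.center += (x - self.center) / self.n

rng = np.random.default_rng(1)
model = OnlineCentroid(dim=2)
for point in rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2)):
    model.update(point)

print(model.score(np.array([5.1, 4.9])))    # small: near the cluster
print(model.score(np.array([20.0, 20.0])))  # large: anomalous
```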

How to select features?

• often ends up being the most challenging piece of the problem

• isn’t it just a parameter optimization problem?

How to select features?

Difficulties:
• too many possible combinations to iterate!
• hard to evaluate
• frequently changing “optimal”

• prediction accuracy is not the only criterion
• improved model interpretability
• shorter training times
• enhanced generalization / reduced overfitting

Principal Component Analysis

• a common statistical method to automatically select features

How?
• transform the data into a different set of dimensions
• returns an ordered list of dimensions (principal components) that best represent the data’s variance

Principal Component Analysis

http://setosa.io/ev/principal-component-analysis/

[Interactive demo slides: projecting data onto an arbitrary axis vs. the true PCA result, which maximizes the variance captured]

Principal Component Analysis

choose principal components that cover 80-90% of the dataset's variance

[“Scree” plot: cumulative % variance captured vs. number of principal components used (log scale); PCA is more effective when few components capture most of the variance]
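Using scikit-learn, the 80-90% rule of thumb above can be applied directly by passing a variance fraction as n_components (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# synthetic 10-dimensional features with heavy redundancy across columns
base = rng.normal(size=(500, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])
X += 0.01 * rng.normal(size=X.shape)

# a float n_components asks scikit-learn for just enough principal
# components to cover that fraction of the variance
pca = PCA(n_components=0.90).fit(X)
print("components kept:", pca.n_components_)
print("cumulative variance:", pca.explained_variance_ratio_.cumsum())
```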

How to avoid common pitfalls?

• understand your threat model well
• keep the detection scope narrow
• reduce the cost of false negatives/positives

CLOSE THE SEMANTIC GAP

How good is my anomaly detector?

how easily can you filter out false positives?

How good is my anomaly detector?

evaluating true positives?

how do we attack this?

the most important question…

How do we attack this?

manipulate learning system to permit a specific attack

degrade performance of the system to compromise its reliability

Chaff

Attacking PCA-based systems

[Diagram: cluster center before vs. after attack; “chaff” points injected along the attack direction shift the decision boundary]

Attacking PCA-based systems

[Diagram: before vs. after attack, with “chaff” but no clear attack direction]

BOILING FROG ATTACK

Attacking PCA-based systems

trade-off: chaff volume vs. injection period

to avoid detection, go slow! (toy simulation below)
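A toy simulation of the boiling frog idea against a PCA subspace detector (the thresholds, sizes, and 10-period schedule are illustrative assumptions, not the parameters from the research): chaff is grown slowly along the attacker's target direction, and anything the current model accepts gets folded into the next training period.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
normal = rng.normal(size=(1000, 5))              # benign traffic features
target = np.full(5, 4.0)                         # direction the attacker needs

def residual(model, x):
    # distance from x to its reconstruction in the PCA subspace
    z = model.inverse_transform(model.transform(x[None, :]))
    return float(np.linalg.norm(x - z))

train = normal.copy()
for period in range(10):                         # boiling frog: 10 periods
    model = PCA(n_components=2).fit(train)
    chaff = normal[:50] + (period + 1) / 10 * target   # grow chaff slowly
    accepted = [c for c in chaff if residual(model, c) < 2.0]  # toy threshold
    if accepted:
        train = np.vstack([train, np.array(accepted)])  # poison next period

print("final residual of attack point:", residual(model, target))
```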

How do you defend against this?

maintain a calibration test set
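One possible shape for this defense, as a hedged sketch: keep a fixed, trusted calibration set, measure the alarm rate on it after every retraining, and investigate if it drifts from the known-good baseline (BASELINE, TOLERANCE, and the scoring function are illustrative assumptions):

```python
import numpy as np

BASELINE = 0.05   # alarm rate measured when the model was known-good
TOLERANCE = 0.05  # how much drift we tolerate before investigating

def alarm_rate(score_fn, calibration, threshold):
    scores = np.array([score_fn(x) for x in calibration])
    return float((scores > threshold).mean())

def check_poisoning(score_fn, calibration, threshold):
    rate = alarm_rate(score_fn, calibration, threshold)
    drifted = abs(rate - BASELINE) > TOLERANCE
    print(f"calibration alarm rate {rate:.2f}:",
          "possible poisoning" if drifted else "looks stable")

rng = np.random.default_rng(4)
calib = rng.normal(size=(200, 3))                      # fixed, trusted set
check_poisoning(lambda x: float(np.linalg.norm(x)), calib, threshold=3.0)
```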


How do you defend against this?

decision-boundary ratio detection

Attacking PCA-based systems

[Diagram: decision boundary region before vs. after attack]

Can Machine Learning be secure?

not easy to achieve for unsupervised, online learning

slowing adversaries down gives you time to detect when you’re being targeted

How do you defend against this?

Improved PCA

• Antidote
• principal component pursuit
• robust PCA

Robust statistics

use the median instead of the mean: PCA’s variance maximization vs. Antidote’s median absolute deviation (see the sketch below)

find an appropriate distribution that models your dataset

normal/Gaussian vs. Laplacian distributions

use robust PCA
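A small illustration of why robust statistics help: injected chaff drags the mean and standard deviation, but barely moves the median and the median absolute deviation (the numbers are synthetic, purely for illustration):

```python
import numpy as np

clean = np.random.default_rng(5).normal(loc=10, scale=1, size=1000)
chaff = np.full(300, 50.0)                  # attacker-injected points
poisoned = np.concatenate([clean, chaff])

for name, data in [("clean", clean), ("poisoned", poisoned)]:
    median = np.median(data)
    mad = np.median(np.abs(data - median))  # median absolute deviation
    print(f"{name:9s} mean={data.mean():6.2f} std={data.std():6.2f} "
          f"median={median:6.2f} mad={mad:5.2f}")
```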

My own tests.

I ran my own simulations with some real data… why did I do this?

[Scatter plots: projection onto “Target Flow” vs. projection on 1st principal component, comparing Naive PCA and Robust PCA under chaff injection]

by the way, generating this chaff is hard

[Series of plots: projection onto “Target Flow” vs. projection on 1st principal component for Naive PCA and Robust PCA across training periods (2, 4, 6, 8, 10)]

[ROC curves: Poisoning Detection Rate (true positive rate) vs. False Alarm Rate (false positive rate) for: Random Detector; RPCA with no poisoning; RPCA with 30% chaff; RPCA with 50% chaff; RPCA with boiling frog, 50% chaff over 10 training periods]

[Plot: Evasion Success Rate (false negative rate) vs. Attack Duration (# of training periods), with 0-50% chaff injected: RPCA under naive injection vs. RPCA under boiling frog, 50% chaff spread over x periods]

Evasion success rates

                                                      Naive PCA   Robust(er) PCA
Naive chaff injection (50%, single training period)     ~76%          ~14%
Boiling frog injection (10 training periods)            ~87%          ~38%

Anomaly detection systems today

• not so good, but improving…
• pure ML-based anomaly detectors are still vulnerable to compromise
• use ML to find features and thresholds, then run streaming anomaly detection using static rules (sketched below)
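A sketch of that hybrid pattern (the Poisson data and the 99.9th-percentile threshold are illustrative choices): learn the threshold offline, then freeze it as a static streaming rule, so online traffic can no longer poison the model.

```python
import numpy as np

# ---- offline: use statistics/ML to pick the threshold ----
rng = np.random.default_rng(6)
historical = rng.poisson(lam=100, size=10_000).astype(float)
THRESHOLD = float(np.quantile(historical, 0.999))  # learned once, then frozen

# ---- online: streaming detection with a static, auditable rule ----
def alert(value):
    # no online retraining, so chaff cannot slowly shift the boundary
    return value > THRESHOLD

for v in [97, 130, 260]:
    print(v, "ALERT" if alert(v) else "ok")
```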

What next?

• do more tests on AD systems that others have created

• other defenses against poisoning techniques

• experiment with more resilient ML models