A Sober Look at Machine Learning


Upload: sven-krasser

Post on 15-Apr-2017


TRANSCRIPT

Page 1: A Sober Look at Machine Learning

2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.

A SOBER LOOK AT MACHINE LEARNING

DR. SVEN KRASSER, CHIEF SCIENTIST, @SVENKRASSER

Page 2: A Sober Look at Machine Learning

Distinguishing Science…

Source: CERN, http://home.cern/sites/home.web.cern.ch/files/image/experiment/2013/01/cms_0.jpeg

Page 3: A Sober Look at Machine Learning

…from Fiction

Source: “Chain Reaction,” 20th Century Fox

Page 4: A Sober Look at Machine Learning

MACHINE LEARNING 101

Page 5: A Sober Look at Machine Learning

EXAMPLES OF MACHINE LEARNING

SPAM FILTERING

MOVIE RECOMMENDATIONS

SIRI (iPHONE)

Page 6: A Sober Look at Machine Learning

TODAY’S FOCUS: SUPERVISED LEARNING

Page 7: A Sober Look at Machine Learning

TODAY’S FOCUS: GEOMETRIC MODELS

Page 8: A Sober Look at Machine Learning

EVERYTHING YOU WILL SEE TODAY IS REAL WORLD DATA

Page 9: A Sober Look at Machine Learning

Some Data to Get Started: 1988 ANTHROPOMETRIC SURVEY OF ARMY PERSONNEL

Source: http://mreed.umtri.umich.edu/mreed/downloads.html#anthro

Page 10: A Sober Look at Machine Learning

Data

• Over 4000 soldiers surveyed

• Over 100 measurements

• Reported by gender

Selection Bias

• Test subjects are in better shape than the rest of us...

Page 11: A Sober Look at Machine Learning

FIRST LOOK

(Figure: Height [mm] on the x-axis, Density on the y-axis)

• Difference in distribution

• Significant overlap

Page 12: A Sober Look at Machine Learning

SECOND DIMENSION

(Figure: Height [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

• Correlation

• Overlap

Page 13: A Sober Look at Machine Learning

FEATURE SELECTION

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

• Correlation

• Gender-specific slope

• Reduced overlap

• Selection of features matters

• How to make a prediction?

Page 14: A Sober Look at Machine Learning

K-NEAREST NEIGHBOR

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis; legend: m, f)
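A minimal sketch of k-nearest-neighbor classification on two features, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the sample points are hypothetical and are not taken from the survey data.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distance is Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Made-up (circumference [mm], weight [10^-1 kg]) points, for illustration only.
train = [
    ((950, 700), "f"), ((980, 720), "f"), ((1000, 760), "f"),
    ((960, 820), "m"), ((990, 860), "m"), ((1010, 900), "m"),
]
print(knn_predict(train, (985, 850)))
```

Because the two axes use different units, a real application would usually rescale the features before measuring distances.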

Page 15: A Sober Look at Machine Learning

SUPPORT VECTOR MACHINE

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

Page 16: A Sober Look at Machine Learning

SUPPORT VECTOR MACHINE

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

• Overfitting

• Classifier does not generalize

• Let’s take a closer look…

Page 17: A Sober Look at Machine Learning

CROSSVALIDATION

TRAIN TRAIN TRAIN TEST

TRAIN TRAIN TEST TRAIN

TRAIN TEST TRAIN TRAIN

TEST TRAIN TRAIN TRAIN

• Divide data into k folds

• Train on k-1 folds, test on the remaining one

• Repeat k times for all folds
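The three steps above can be sketched as a small index generator. This is an illustration only: the name `k_fold_indices` is hypothetical, folds are contiguous, and data is assumed to be shuffled beforehand. The order in which folds serve as the test set does not matter.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(8, 4))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test set exactly once, so every prediction used for evaluation comes from a model that never saw that sample during training.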

Page 18: A Sober Look at Machine Learning

LET’S CLASSIFY

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

• Classifier generalizes

• Note some misclassifications

• Let’s assume we want to detect males (blue)

    § I.e. “blue” is our positive class

Page 19: A Sober Look at Machine Learning

LET’S CLASSIFY

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

Page 20: A Sober Look at Machine Learning

LET’S CLASSIFY

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

Page 21: A Sober Look at Machine Learning

LET’S CLASSIFY

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

Page 22: A Sober Look at Machine Learning

LET’S CLASSIFY

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

Page 23: A Sober Look at Machine Learning

LET’S CLASSIFY

(Figure: “Buttock Circumference” [mm] on the x-axis, Weight [10^-1 kg] on the y-axis)

• Get more “blue” right (true positives)

• Get more “red” wrong (false positives)

Page 24: A Sober Look at Machine Learning

RECEIVER OPERATING CHARACTERISTICS CURVE

(Figure: ROC curve with False Positive Rate on the x-axis, True Positive Rate on the y-axis)

Detect more by accepting more false positives
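A rough sketch of how the points of such a curve can be computed: sweep a threshold over the classifier's decision values and record the false positive rate and true positive rate at each step. The name `roc_points`, the scores, and the labels are hypothetical illustrations.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping a threshold over the scores.

    `labels` holds 1 for the positive class and 0 for the negative class;
    a sample is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical decision values and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold never reduces either rate, which is the trade-off the slide describes: more detections come at the price of more false positives.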

Page 25: A Sober Look at Machine Learning

THREE DIMENSIONS

Page 26: A Sober Look at Machine Learning

MORE DIMENSIONS

(Figure: Decision Value on the x-axis, Density on the y-axis)

• Linear model in ~160 dimensions

• Linearly separable

Page 27: A Sober Look at Machine Learning

Source: http://playground.tensorflow.org/

Page 28: A Sober Look at Machine Learning

TREES AND TREE ENSEMBLES

Page 29: A Sober Look at Machine Learning

SPARSE FEATURES

400 401 402 403 404 405 406 407 408 409 410 411 412 413 414

area codes

0 0 0 1 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0 0 0 0 0 0 0 0

0 1 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
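The table above is a one-hot encoding: each row has a single 1 in the column of its area code. A minimal sketch that reproduces those rows, assuming the six samples are the area codes 403, 404, 401, 411, 407, and 404 (read off the positions of the 1s); the helper name `one_hot` is hypothetical.

```python
def one_hot(value, categories):
    """Encode `value` as a list with a single 1 at its category position."""
    return [1 if value == category else 0 for category in categories]

area_codes = list(range(400, 415))        # the 15 columns shown on the slide
samples = [403, 404, 401, 411, 407, 404]  # one area code per row

rows = [one_hot(code, area_codes) for code in samples]
for row in rows:
    print(" ".join(str(bit) for bit in row))

# A sparse representation stores only the position of the 1 in each row.
sparse = [area_codes.index(code) for code in samples]
print(sparse)
```

With thousands of possible categories almost every entry is 0, which is why such features are usually stored in sparse form.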

Page 30: A Sober Look at Machine Learning

N-GRAMS

43 72 6F 77 64 53 74 72 69 6B 65

43726F 776453 747269

726F77 645374 72696B

6F7764 537472 696B65
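The hex bytes on this slide are the ASCII encoding of "CrowdStrike", and the nine values below them are its overlapping 3-byte windows. A short sketch that derives them; the function name `byte_ngrams` is a hypothetical choice.

```python
def byte_ngrams(data, n=3):
    """Return all overlapping n-byte windows of `data` as uppercase hex strings."""
    return [data[i:i + n].hex().upper() for i in range(len(data) - n + 1)]

grams = byte_ngrams(b"CrowdStrike")
print(grams)
```

A sequence of 11 bytes yields 9 trigrams; since each trigram can take 256^3 distinct values, n-gram features make the feature space very large and very sparse.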

Page 31: A Sober Look at Machine Learning

MISSION ACCOMPLISHED: WE JUST ADD MORE DIMENSIONS… RIGHT?

Page 32: A Sober Look at Machine Learning

CURSE OF DIMENSIONALITY

• REDUCED predictive performance

• INCREASED training time

• SLOWER classification

• LARGER memory footprint

Page 33: A Sober Look at Machine Learning

Source: https://commons.wikimedia.org/w/index.php?curid=2257082

Page 34: A Sober Look at Machine Learning

Source: https://commons.wikimedia.org/w/index.php?curid=2257082

Page 35: A Sober Look at Machine Learning
Page 36: A Sober Look at Machine Learning

DIMENSIONALITY AND SPARSENESS

(Figure: Height (mm) on the x-axis, Weight [10^-1 kg] on the y-axis)

Page 37: A Sober Look at Machine Learning

DIMENSIONALITY AND SPARSENESS

(Figure: Height (mm) on the x-axis, Weight [10^-1 kg] on the y-axis)

Page 38: A Sober Look at Machine Learning

MANAGING DIMENSIONALITY

• FEATURE ELIMINATION

– Feature ranking

– Stop words

• FEATURE REDUCTION

– Principal Component Analysis

– Autoencoders

– Points on lower-dimensional manifold

– Stemming

• ENSEMBLE METHODS

– Classifier of classifiers, e.g. stacking

– Bagging and subspace sampling, e.g. Random Forests

• And much, much more…
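The bagging and subspace sampling bullet can be illustrated with a small sketch: each ensemble member gets a bootstrap sample of the rows and a random subset of the features. The function name `sampling_plans` and all parameter values are hypothetical. Note that Random Forests proper typically draw a fresh feature subset at each tree split rather than once per model.

```python
import random

def sampling_plans(n_rows, n_features, n_models, features_per_model, seed=0):
    """Return one (row_indices, feature_indices) plan per ensemble member."""
    rng = random.Random(seed)
    plans = []
    for _ in range(n_models):
        # Bagging: draw n_rows row indices with replacement.
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        # Subspace sampling: draw a feature subset without replacement.
        features = sorted(rng.sample(range(n_features), features_per_model))
        plans.append((rows, features))
    return plans

plans = sampling_plans(n_rows=10, n_features=160, n_models=3, features_per_model=12)
for rows, features in plans:
    print(len(set(rows)), "distinct rows; features:", features)
```

Each member then trains in a much lower-dimensional space, and the ensemble combines their votes.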

Page 39: A Sober Look at Machine Learning

SECURITY APPLICATIONS

Page 40: A Sober Look at Machine Learning

FILE ANALYSIS

AKA Static Analysis

• THE GOOD

– Relatively fast

– Scalable

– No need to detonate

– Platform independent, can be done at gateway

– Can support file similarity analysis

• THE BAD

– Limited insight due to narrow view

– Different file types require different techniques

– Different subtypes need special consideration

    – Packed files

    – .Net

    – Installers

    – EXEs vs DLLs

– Obfuscations (yet good if detectable)

– Ineffective against exploitation and malware-less attacks

– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker

Page 41: A Sober Look at Machine Learning

EXAMPLE FEATURES

• 32/64 BIT EXECUTABLE

• GUI SUBSYSTEM

• COMMAND LINE SUBSYSTEM

• FILE SIZE

• TIMESTAMP

• DEBUG INFORMATION PRESENT

• PACKER TYPE

• FILE ENTROPY

• NUMBER OF SECTIONS

• NUMBER WRITABLE

• NUMBER READABLE

• NUMBER EXECUTABLE

• DISTRIBUTION OF SECTION ENTROPY

• IMPORTED DLL NAMES

• IMPORTED FUNCTION NAMES

• COMPILER ARTIFACTS

• LINKER ARTIFACTS

• RESOURCE DATA

• EMBEDDED PROTOCOL STRINGS

• EMBEDDED IPS/DOMAINS

• EMBEDDED PATHS

• EMBEDDED PRODUCT META DATA

• DIGITAL SIGNATURE

• ICON CONTENT

• …
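As one concrete example from this list, file entropy can be computed as the Shannon entropy of the byte distribution. This is a sketch under that assumption; the slides do not say which entropy definition is used, and the name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum p * log2(1/p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(shannon_entropy(b"AAAAAAAA"))        # a single repeated byte
print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is one reason entropy is a useful hint for packed files.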

Page 42: A Sober Look at Machine Learning

COMBINING FEATURES

• Projection to show clusters

• For illustration only; this is not the space in which we classify

Page 43: A Sober Look at Machine Learning

EXECUTION ANALYSIS

AKA Dynamic Analysis

• THE GOOD

– Captures actual behavior of file

– Obfuscating behavior is hard

– Effective against exploitation

– Effective against malware-less attacks

– Not dependent on awareness of specific file types

• THE BAD

– File needs to be executed

– Takes additional time to observe execution

– Execution depends on environment (e.g. sandbox vs real world)

Page 44: A Sober Look at Machine Learning

EXAMPLE: GLOBAL BEHAVIOR

§ Behavior across many executions of a file

§ Conducted on event data centrally located in the cloud

Krasser, S., Meyer, B., & Crenshaw, P. (2015). Valkyrie: Behavioral Malware Detection using Global Kernel-level Telemetry Data. In Proceedings of the 2015 IEEE International Workshop on Machine Learning for Signal Processing.

Page 45: A Sober Look at Machine Learning

ML VS OTHER TECHNIQUES

§ ML output is probabilistic

§ Use other techniques where appropriate

§ Most ML-based engines use standard hashes or fuzzy hashes on top of a model

§ Example: credentials theft IoA
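One way to picture hashes combined with a model is a lookup that consults exact-hash lists first and only falls back to the probabilistic model for unknown files. This is a hypothetical sketch: the function `classify`, the ordering of the checks, the threshold, and the stand-in `model_score` callables are all assumptions, not a description of any particular engine.

```python
import hashlib

def classify(file_bytes, known_bad, known_good, model_score, threshold=0.5):
    """Check exact-hash lists first, then fall back to a probabilistic model."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    if digest in known_bad:
        return "malicious", "hash"
    if digest in known_good:
        return "clean", "hash"
    score = model_score(file_bytes)  # probability-like value in [0, 1]
    return ("malicious" if score >= threshold else "clean"), "model"

# Hypothetical inputs for illustration only.
sample = b"example file contents"
known_bad = {hashlib.sha256(sample).hexdigest()}
print(classify(sample, known_bad, set(), model_score=lambda data: 0.1))
print(classify(b"another file", known_bad, set(), model_score=lambda data: 0.9))
```

The hash match is deterministic, while the model verdict depends on a threshold and can therefore be wrong in either direction.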

Page 46: A Sober Look at Machine Learning

EVALUATING ML SOLUTIONS

Page 47: A Sober Look at Machine Learning

PRELIMINARIES

§ ML is not a feature, it is an implementation detail

§ Every solution must make trade-offs of conflicting objectives

    § FP vs TP

    § Speed vs accuracy

    § Memory footprint vs accuracy

    § Expressiveness vs explainability

§ Benchmarks under different assumptions are very hard to compare, even internally

§ Marchitecture

§ Looking at the right data: 60% of intrusions do not involve malware

Page 48: A Sober Look at Machine Learning

How much data is there to train on?

SCOPE: SCALE

§ Volume of data generated by sources used

§ Aperture: footprint of deployment

§ Data collection

§ Point of analysis (endpoint, on-prem, cloud)

Page 49: A Sober Look at Machine Learning

How many data sources are used?

SCOPE: BREADTH

§ Varied sources and techniques

    § Static analysis

    § Behavioral analysis

    § Proliferation

    § Indicators from other techniques

§ Access to historical data

    § Baseline

    § Process lineage

§ “Number of characteristics” is not a useful metric

Page 50: A Sober Look at Machine Learning

DETECTION RATE

§ Detection rate without false positive rate is meaningless

§ Considering the base rate is important

    § System

        § 100k clean files, 1 malware file

        § 99% TPR at 0.1% FPR → 100 FPs, 1 TP

    § Downloads

        § 1k clean files, 1 malware file

        § 99% TPR at 0.1% FPR → 1 FP, 1 TP

§ Sourcing of test files skews results

§ Number of samples used to measure (often too small)

(Figure: False Positive Rate on the x-axis, True Positive Rate on the y-axis)
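The base-rate arithmetic on this slide can be checked with a few lines. The sketch below multiplies file counts by the stated rates to get expected counts; the function name `expected_detections` and the added precision figure (the share of alerts that are real malware) are illustrative additions, not part of the slide.

```python
def expected_detections(clean_files, malware_files, tpr, fpr):
    """Expected false positives, true positives, and precision for given rates."""
    false_positives = clean_files * fpr
    true_positives = malware_files * tpr
    precision = true_positives / (true_positives + false_positives)
    return false_positives, true_positives, precision

# The two scenarios from the slide: same classifier, different base rates.
for name, clean in [("System", 100_000), ("Downloads", 1_000)]:
    fp, tp, precision = expected_detections(clean, 1, tpr=0.99, fpr=0.001)
    print(f"{name}: {fp:.0f} FPs, {tp:.2f} TPs, precision {precision:.1%}")
```

With identical TPR and FPR, roughly 1 alert in 100 is real in the system-wide scenario versus roughly 1 in 2 for downloads, which is why a detection rate quoted without the false positive rate and the base rate says little.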

Page 51: A Sober Look at Machine Learning

APTS & 99% OF MALWARE DETECTED…

Page 52: A Sober Look at Machine Learning

APTS (CONT.)

§ Combine techniques to offset tradeoffs

    § Static and behavioral

    § ML and non-ML

    § Lean local techniques and heavy-weight cloud techniques

§ Avoid silent failure: what happens once the adversary has made it onto the system?

§ Avoid brittle techniques: does the solution depend on the attacker not having access to detection results?

Page 53: A Sober Look at Machine Learning

KEY POINTS

• Machine Learning is an important part of the security tool chest

• Hidden untapped structure in your data

• Various trade-offs, most importantly between true and false positives

• Dimensionality is good…until it’s not

• Not all dimensions are created equal

• Comprehensive coverage by combining techniques

Page 54: A Sober Look at Machine Learning