a sober look at machine learning
TRANSCRIPT
2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
A SOBER LOOK AT MACHINE LEARNING
DR. SVEN KRASSER, CHIEF SCIENTIST, @SVENKRASSER
Distinguishing Science…
Source: CERN, http://home.cern/sites/home.web.cern.ch/files/image/experiment/2013/01/cms_0.jpeg
…from Fiction
Source: “Chain Reaction,” 20th Century Fox
MACHINE LEARNING 101
EXAMPLES OF MACHINE LEARNING
SPAM FILTERING
MOVIE RECOMMENDATIONS
SIRI (iPHONE)
TODAY’S FOCUS: SUPERVISED LEARNING
TODAY’S FOCUS: GEOMETRIC MODELS
EVERYTHING YOU WILL SEE TODAY IS REAL WORLD DATA
Some Data to Get Started: 1988 ANTHROPOMETRIC SURVEY OF ARMY PERSONNEL
Source: http://mreed.umtri.umich.edu/mreed/downloads.html#anthro
• Over 4000 soldiers surveyed
• Over 100 measurements
• Reported by gender
Test subjects are in better shape than the rest of us...
Data
Selection Bias
FIRST LOOK
Plot: Height [mm] (x-axis), Density (y-axis)
• Difference in distribution
• Significant overlap
SECOND DIMENSION
Plot: Height [mm] (x-axis), Weight [10^-1 kg] (y-axis)
• Correlation
• Overlap
FEATURE SELECTION
Plot: “Buttock Circumference” [mm] (x-axis), Weight [10^-1 kg] (y-axis)
• Correlation
• Gender-specific slope
• Reduced overlap
• Selection of features matters
• How to make a prediction?
K-NEAREST NEIGHBOR
Plot: “Buttock Circumference” [mm] (x-axis), Weight [10^-1 kg] (y-axis); legend: m, f
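A minimal sketch of the k-nearest-neighbor idea on this slide, assuming Euclidean distance and a simple majority vote. The function name knn_predict and the data points are hypothetical and are not taken from the survey.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of ((x, y), label) pairs; distances are Euclidean.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical (circumference [mm], weight [10^-1 kg]) points, not survey data.
train = [
    ((950, 700), "f"), ((980, 720), "f"), ((1000, 690), "f"),
    ((960, 820), "m"), ((990, 850), "m"), ((1010, 880), "m"),
]
print(knn_predict(train, (970, 710)))
print(knn_predict(train, (1000, 860)))
```

With real measurements on different scales, the features would usually be normalized first so that no single axis dominates the distance.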
SUPPORT VECTOR MACHINE
Plot: “Buttock Circumference” [mm] (x-axis), Weight [10^-1 kg] (y-axis)
SUPPORT VECTOR MACHINE
Plot: “Buttock Circumference” [mm] (x-axis), Weight [10^-1 kg] (y-axis)
• Overfitting
• Classifier does not generalize
• Let’s take a closer look…
CROSSVALIDATION
TRAIN TRAIN TRAIN TEST
TRAIN TRAIN TEST TRAIN
TRAIN TEST TRAIN TRAIN
TEST TRAIN TRAIN TRAIN
• Divide data into k folds
• Train on k-1 folds, test on the remaining one
• Repeat k times for all folds
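The three steps above can be sketched as an index generator. This is a minimal illustration with a hypothetical function name, k_fold_splits; it assumes the data has already been shuffled and only produces index lists, leaving training and scoring to the caller.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

for train_idx, test_idx in k_fold_splits(8, 4):
    print("train:", train_idx, "test:", test_idx)
```

Every sample lands in the test fold exactly once, so the k test scores together cover the whole data set.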
LET’S CLASSIFY
Plot: “Buttock Circumference” [mm] (x-axis), Weight [10^-1 kg] (y-axis)
• Classifier generalizes
• Note some misclassifications
• Let’s assume we want to detect males (blue)
– I.e. “blue” is our positive class
LET’S CLASSIFY
Plot: “Buttock Circumference” [mm] (x-axis), Weight [10^-1 kg] (y-axis)
• Get more “blue” right (true positives)
• Get more “red” wrong (false positives)
RECEIVER OPERATING CHARACTERISTIC CURVE
Plot: False Positive Rate (x-axis), True Positive Rate (y-axis)
Detect more by accepting more false positives
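A rough sketch of how the points on such a curve can be computed: sweep a threshold over the classifier scores and record the false and true positive rates at each step. The function name roc_points, the scores, and the labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold.

    `labels` are 1 for the positive class and 0 for the negative class;
    a sample is predicted positive when its score is >= the threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing detected
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

# Hypothetical classifier scores and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Lowering the threshold never reduces either rate, which is the trade-off the slide describes.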
THREE DIMENSIONS
MORE DIMENSIONS
Plot: Decision Value (x-axis), Density (y-axis)
• Linear model in ~160 dimensions
• Linearly separable
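To make "decision value" concrete: for a linear model it is the signed score w·x + b, and its sign gives the predicted class. The sketch below fits such a model with the perceptron rule, which is a simpler stand-in and not the support vector machine shown on the earlier slides. The three-dimensional toy data and the function names are hypothetical.

```python
def decision_value(weights, bias, x):
    """Signed score of a linear model: positive on one side of the hyperplane."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_perceptron(samples, labels, epochs=100):
    """Fit a separating hyperplane with the perceptron rule (labels are +1/-1)."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * decision_value(weights, bias, x) <= 0:
                # Misclassified (or on the boundary): move the plane toward x.
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # every training sample is on the correct side
            break
    return weights, bias

# Hypothetical, linearly separable toy data in three dimensions.
samples = [(2.0, 1.0, 0.5), (1.5, 2.0, 1.0), (-1.0, -0.5, -2.0), (-2.0, -1.5, -0.5)]
labels = [1, 1, -1, -1]
weights, bias = train_perceptron(samples, labels)
for x in samples:
    print(x, round(decision_value(weights, bias, x), 2))
```

When the classes are linearly separable, as on this slide, the decision values of the two classes fall on opposite sides of zero.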
Source: http://playground.tensorflow.org/
TREES AND TREE ENSEMBLES
SPARSE FEATURES
area codes
400 401 402 403 404 405 406 407 408 409 410 411 412 413 414
  0   0   0   1   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
  0   1   0   0   0   0   0   0   0   0   0   0   0   0   0
  0   0   0   0   0   0   0   0   0   0   0   1   0   0   0
  0   0   0   0   0   0   0   1   0   0   0   0   0   0   0
  0   0   0   0   1   0   0   0   0   0   0   0   0   0   0
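The table above is a one-hot encoding: each row has a single 1 in the column of its area code. A small sketch, with hypothetical helper names, of the dense form next to a sparse form that stores only the position of the nonzero entry:

```python
def one_hot(value, categories):
    """Dense one-hot vector: a single 1 at the position of `value`."""
    return [1 if c == value else 0 for c in categories]

def sparse_one_hot(value, categories):
    """Sparse form of the same vector: only the index of the nonzero entry."""
    return {categories.index(value): 1}

area_codes = list(range(400, 415))  # the 15 columns shown in the table
print(one_hot(403, area_codes))
print(sparse_one_hot(403, area_codes))
```

The dense vector grows with the number of categories, while the sparse form stays the same size, which is why sparse storage is the usual choice for features like this.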
N-GRAMS
43 72 6F 77 64 53 74 72 69 6B 65
43726F 776453 747269
726F77 645374 72696B
6F7764 537472 696B65
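The byte sequence above spells "CrowdStrike" in ASCII, and the nine values below it are its overlapping 3-grams. A minimal sketch of that sliding window, with byte_ngrams as a hypothetical helper name:

```python
def byte_ngrams(data, n=3):
    """Return every run of n consecutive bytes as an uppercase hex string."""
    return [data[i:i + n].hex().upper() for i in range(len(data) - n + 1)]

grams = byte_ngrams(b"CrowdStrike")
print(grams)
```

Each distinct n-gram can then serve as one sparse feature dimension, so the feature space grows quickly with n.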
MISSION ACCOMPLISHED: WE JUST ADD MORE DIMENSIONS…
RIGHT?
CURSE OF DIMENSIONALITY
REDUCED predictive performance
INCREASED training time
SLOWER classification
LARGER memory footprint
Source: https://commons.wikimedia.org/w/index.php?curid=2257082
DIMENSIONALITY AND SPARSENESS
Plot: Height (mm) (x-axis), Weight [10^-1 kg] (y-axis)
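One way to see the sparseness effect numerically: keep the number of samples fixed, cut every axis into the same number of bins, and count how many grid cells contain at least one point as dimensions are added. The sketch below assumes uniformly random points rather than the survey data, and occupied_fraction is a hypothetical helper.

```python
import random

def occupied_fraction(n_points, bins, dims, seed=0):
    """Fraction of grid cells containing at least one uniformly random point."""
    rng = random.Random(seed)
    occupied = {
        tuple(rng.randrange(bins) for _ in range(dims))
        for _ in range(n_points)
    }
    return len(occupied) / bins ** dims

for dims in (1, 2, 3, 4):
    print(dims, occupied_fraction(n_points=1000, bins=10, dims=dims))
```

The number of cells grows as bins to the power of dims, so with 1000 points and 10 bins per axis, at most 10% of the cells can be occupied in four dimensions.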
MANAGING DIMENSIONALITY
• FEATURE ELIMINATION
– Feature ranking
– Stop words
• FEATURE REDUCTION
– Principal Component Analysis
– Autoencoders
– Points on lower-dimensional manifold
– Stemming
• ENSEMBLE METHODS
– Classifier of classifiers, e.g. stacking
– Bagging and subspace sampling, e.g. Random Forests
• And much, much more…
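As an illustration of the "bagging and subspace sampling" item, the sketch below draws, for each ensemble member, a bootstrap sample of rows and a random subset of features. It is a simplification under stated assumptions: Random Forests typically draw the feature subset at each tree split rather than once per model, and the function name and parameters here are hypothetical.

```python
import random

def bagged_subspace_samples(n_rows, n_features, n_models, features_per_model, seed=0):
    """Draw the training rows and feature subset for each ensemble member.

    Rows are sampled with replacement (bagging); features are sampled
    without replacement (random subspace method).
    """
    rng = random.Random(seed)
    plans = []
    for _ in range(n_models):
        rows = [rng.randrange(n_rows) for _ in range(n_rows)]
        features = sorted(rng.sample(range(n_features), features_per_model))
        plans.append((rows, features))
    return plans

plans = bagged_subspace_samples(n_rows=6, n_features=10, n_models=3, features_per_model=4)
for rows, features in plans:
    print("rows:", rows, "features:", features)
```

Each member then trains on a lower-dimensional view of the data, and the ensemble combines their votes.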
SECURITY APPLICATIONS
FILE ANALYSIS AKA Static Analysis
• THE GOOD
– Relatively fast
– Scalable
– No need to detonate
– Platform independent, can be done at gateway
– Can support file similarity analysis
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
– Packed files
– .NET
– Installers
– EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
EXAMPLE FEATURES
32/64 BIT EXECUTABLE
GUI SUBSYSTEM
COMMAND LINE SUBSYSTEM
FILE SIZE
TIMESTAMP
DEBUG INFORMATION PRESENT
PACKER TYPE
FILE ENTROPY
NUMBER OF SECTIONS
NUMBER WRITABLE
NUMBER READABLE
NUMBER EXECUTABLE
DISTRIBUTION OF SECTION ENTROPY
IMPORTED DLL NAMES
IMPORTED FUNCTION NAMES
COMPILER ARTIFACTS
LINKER ARTIFACTS
RESOURCE DATA
EMBEDDED PROTOCOL STRINGS
EMBEDDED IPS/DOMAINS
EMBEDDED PATHS
EMBEDDED PRODUCT META DATA
DIGITAL SIGNATURE
ICON CONTENT …
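As one concrete example from this list, file entropy is commonly computed as the Shannon entropy of the byte histogram. The sketch below assumes that definition; the slides do not say how the feature is actually calculated, and shannon_entropy is a hypothetical helper.

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum p * log2(1/p) over the byte values that occur.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

print(shannon_entropy(b"AAAAAAAA"))        # one repeated byte value
print(shannon_entropy(bytes(range(256))))  # every byte value exactly once
```

Values near 8 bits per byte usually indicate compressed or encrypted content, which is one reason entropy relates to the packer-type feature.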
COMBINING FEATURES
• Projection to show clusters
• For illustration, not the space in which we classify
EXECUTION ANALYSIS AKA Dynamic Analysis
• THE GOOD
– Captures actual behavior of file
– Obfuscating behavior is hard
– Effective against exploitation
– Effective against malware-less attacks
– Not dependent on awareness of specific file types
• THE BAD
– File needs to be executed
– Takes additional time to observe execution
– Execution depends on environment (e.g. sandbox vs real world)
EXAMPLE: GLOBAL BEHAVIOR
• Behavior across many executions of a file
• Conducted on event data centrally located in the cloud
Krasser, S., Meyer, B., & Crenshaw, P. (2015). Valkyrie: Behavioral Malware Detection using Global Kernel-level Telemetry Data. In Proceedings of the 2015 IEEE International Workshop on Machine Learning for Signal Processing.
ML VS OTHER TECHNIQUES
• ML output is probabilistic
• Use other techniques where appropriate
• Most ML-based engines use standard hashes or fuzzy hashes on top of a model
• Example: credentials theft IoA
EVALUATING ML SOLUTIONS
PRELIMINARIES
• ML is not a feature, it is an implementation detail
• Every solution must make trade-offs of conflicting objectives
– FP vs TP
– Speed vs accuracy
– Memory footprint vs accuracy
– Expressiveness vs explainability
• Benchmarks under different assumptions are very hard to compare, even internally
• Marchitecture
• Looking at the right data: 60% of intrusions do not involve malware
How much data is there to train on?
SCOPE: SCALE
• Volume of data generated by sources used
• Aperture: footprint of deployment
• Data collection
• Point of analysis (endpoint, on-prem, cloud)
How many data sources are used?
SCOPE: BREADTH
• Varied sources and techniques
– Static analysis
– Behavioral analysis
– Proliferation
– Indicators from other techniques
• Access to historical data
– Baseline
– Process lineage
• “Number of characteristics” is not a useful metric
DETECTION RATE
• Detection rate w/o false positive rate is meaningless
• Considering the base rate is important
– System: 100k clean files, 1 malware file; 99% TPR at 0.1% FPR → 100 FPs, 1 TP
– Downloads: 1k clean files, 1 malware file; 99% TPR at 0.1% FPR → 1 FP, 1 TP
• Sourcing of test files skews results
• Number of samples used to measure (often too small)
Plot: False Positive Rate (x-axis), True Positive Rate (y-axis)
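The base-rate arithmetic on this slide can be reproduced in a few lines. The sketch below uses the slide's numbers (99% TPR, 0.1% FPR, one malware file); the function name expected_alerts and the added precision figure are illustrative and not part of the original deck.

```python
def expected_alerts(clean_files, malware_files, tpr, fpr):
    """Expected false positives, true positives, and resulting precision."""
    false_positives = clean_files * fpr
    true_positives = malware_files * tpr
    precision = true_positives / (true_positives + false_positives)
    return false_positives, true_positives, precision

for name, clean in (("system", 100_000), ("downloads", 1_000)):
    fp, tp, precision = expected_alerts(clean, 1, tpr=0.99, fpr=0.001)
    print(f"{name}: FPs={fp:.0f}, TPs={tp:.2f}, precision={precision:.1%}")
```

With the same classifier, roughly 1 in 100 alerts is real in the system scenario and roughly 1 in 2 in the downloads scenario, which is why a detection rate quoted without the false positive rate and the base rate says little.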
APTS & 99% OF MALWARE DETECTED…
APTS (CONT.)
• Combine techniques to offset tradeoffs
– Static and behavioral
– ML and non-ML
– Lean local techniques and heavy-weight cloud techniques
• Avoid silent failure: what happens when the adversary made it onto the system?
• Avoid brittle techniques: does the solution depend on the attacker not having access to detection results?
KEY POINTS
• Machine Learning is an important part of the security tool chest
• Hidden untapped structure in your data
• Various trade-offs, most importantly between true and false positives
• Dimensionality is good…until it’s not
• Not all dimensions are created equal
• Comprehensive coverage by combining techniques