applying machine learning to network security monitoring - baythreat 2013
Post on 20-May-2015
6.938 Views
Preview:
DESCRIPTION
TRANSCRIPT
Applying Machine Learning to Network Security Monitoring
Alexandre Pinto Chief Data Scien4st | MLSec Project
@alexcpsec @MLSecProject!
• This is a talk about BUILDING not breaking – NO systems were harmed on the development of this talk. – This is NOT about 1337 Android Malware
• Only thing we are likely to break here is the 4me limit on the talk
• This talk includes more MATH than the daily recommended
intake by the FDA.
• All stunts described in this talk were performed by trained professionals.!
WARNING!
• 13 years in Informa4on Security, done a liRle bit of everything. • Past 7 or so years leading security consultancy and monitoring
teams in Brazil, London and the US. – If there is any way a SIEM can hurt you, it did to me.
• Researching machine learning and data science in general for the past year or so and presen4ng about the intersec4on of it and Infosec throughout the year.
• Created MLSec Project in July 2013 to give structure to the research being done.
Who's Alex?
• Defini4ons • Big Data • Data Science • Machine Learning
• Y U DO DIS? • Network Security Monitoring • PoC || GTFO • Feature Intui4on • How to get started?
Agenda
Big Data + Machine Learning + Data Science
Big Data + Machine Learning + Data Science
Big Data
(Security) Data ScienEst
Data Science Venn Diagram by Drew Conway!
• “Data Scien4st (n.): Person who is beRer at sta4s4cs than any so`ware engineer and beRer at so`ware engineering than any sta4s4cian.”
-‐-‐ Josh Willis, Cloudera
• “Machine learning systems automa4cally learn programs from data” (*)
• You don’t really code the program, but it is inferred from data.
• Intui4on of trying to mimic the way the brain learns: that's where terms like ar#ficial intelligence come from.!
Enter Machine Learning
(*) CACM 55(10) -‐ A Few Useful Things to Know about Machine Learning (Domingos 2012)
• Supervised Learning: – Classifica4on (NN, SVM, Naïve Bayes)
– Regression (linear, logis4c)!
Kinds of Machine Learning
Source – scikit-‐learn.github.io/scikit-‐learn-‐tutorial/general_concepts.html
• Unsupervised Learning : – Clustering (k-‐means) – Decomposi4on (PCA, SVD)
ClassificaEon Example
VS!
Regression Example
ConsideraEons on Data Gathering • Models will (generally) get beRer with more data
– But we always have to consider bias and variance as we select our data points
– Also adversaries – we may be force fed “bad data”, find signal in weird noise or design bad (or exploitable) features
• “I’ve got 99 problems, but data ain’t one”!
Domingos, 2012 Abu-‐Mostafa, Caltech, 2012
• Sales!
ApplicaEons of Machine Learning
• Trading
• Image and Voice Recogni4on
• Common reac4ons from Security Professionals: • “Eh, cool…” *blank stare* *walks away* • “Are you high, bro?”
Y U DO DIS?
• “Why aren’t you doing some cool research like Android Malware?”
Math is HARD
• Fraud detec4on systems: – Is what he just did consistent with past behavior?
• Network anomaly detec4on (?): – More like bad sta4s4cal analysis – Did not advance a lot, IMO
• Predic4ng likelihood of aRack actors – Create different predic4ve models and chain them to gain more confidence in each step.!
Security ApplicaEons of ML
• SPAM filters
• Adversaries -‐ Exploi4ng the learning process • Understand the model, understand the machine, and you can circumvent it
• Something InfoSec community knows very well • Any predic4ve model on InfoSec will be pushed to the limit
• Again, think back on the way SPAM engines evolved.!
ConsideraEons on Data Gathering
Network Security Monitoring
• Rules in a SIEM solu4on invariably are: – “Something” has happened “x” 4mes; – “Something” has happened and other “something2” has happened, with some rela4onship (4me, same fields, etc) between them.
• Configuring SIEM = iterate on combina4ons un4l: – Customer or management is foole.. I mean sa4sfied; – Consul4ng money runs out
• Behavioral rules (anomaly detec4on) helps a bit with the “x”s, but s4ll, very laborious and 4me consuming.!
CorrelaEon Rules: A Primer
• Alert-‐based: – “Tradi4onal” log management – SIEM – Using “Threat Intelligence” (i.e blacklists) for about a year or so
– Lack of context – Low effec4veness – You get the results handed over to you
Kinds of Network Security Monitoring
• Explora4on-‐based: – Network Forensics tools (2/3 years ago)
– Elas4c Search based LM systems
– High effec4veness – Lots of people necessary – Lots of HIGHLY trained people
• Big Data Security Analy4cs (BDSA): – Run explora4on-‐based monitoring on Hadoop – More like Big Data Security Monitoring (BDSM)
Alert-‐based + ExploraEon-‐based
A wild army of robots appears
Using robots to catch bad guys
• We developed a set of algorithms to detect malicious behavior from log entries of firewall blocks
• Over 6 months of data from SANS DShield (thanks, guys!) • A`er a lot of sta4s4cal-‐based math (true posi4ve ra4o, true nega4ve ra4o, odds likelihood), it could pinpoint actors that would be 13x-‐18x more likely to aRack you.
• Today more like 30x on the SANS data, and finding around 80% of “badness” in par4cipant deployments.!
PoC || GTFO
• Assump4ons to aggregate the data • Correla4on / proximity / similarity BY BEHAVIOR • “Bad Neighborhoods” concept: – Spamhaus x CyberBunker – Google Report (June 2013) – Moura 2013
• Group by Geoloca4on • Group by Netblock (/16, /24) • Group by ASN – (thanks, Team Cymru)!
Feature IntuiEon: IP Proximity
Map of the Internet
• (Hilbert Curve) • Block port 22 • 2013-‐07-‐20
0
10
127
MULTICAST AND FRIENDS
CN
RU
CN, BR, TH
You are here!
• Even bad neighborhoods renovate: – ARackers may change ISPs/proxies – Botnets may be shut down / relocate – A liRle paranoia is Ok, but not EVERYONE is out to get you (at least not all at once)!
Feature IntuiEon: Temporal Decay
• As days pass, let's forget, bit by bit, who aRacked
• Last 4me I saw this actor, and how o`en did I see them!
• Behavior: block on port 22
• Trial inference on 100k IP addresses per Class A subnet
• Logarithm scale: brightest 4les are 10 to 1000 4mes more likely to aRack.
MLSec Project
• Who resolves to this IP address? • Number of domains that resolve to the IP address • Distribu4on of their life4me • Entropy, size, ccTLDs • Registrar informa4on
• Reverse DNS informa4on… • History of DNS registra4on… • (Thanks, DNSDB!)
Feature IntuiEon: DNS features
• YAY! We have a bunch of numbers per IP address/domain! • How do you define what is malicious or not?
• “Advanced exper4se in both informa4on security and data science will be a necessary ingredient in enabling accurate discrimina4on between malicious and benign ac4vity. “
-‐ Anton Chuvakin, Gartner
• Kinda easy for security tools (if you trust them) • Web applica4on logs need deeper sta4s4cal analysis • Not normal / standard devia4on thing
!
Training the Model
• Programming is a must (Python / R) • Sta4s4cal knowledge keeps you from making dumb mistakes
• Specific machine learning courses and books: – Coursera (ML/ Data Analysis / Data Science)
• Prac4ce, Prac4ce, Prac4ce: – Explore your data! – (Security Onion) – Kaggle – KDD, VAST, VizSec!
How do I get started on this?
MLSec Project
• Sign up, send logs, receive reports generated by machine learning models!
• Working with several companies on trying out these models on their environment with their data
• We are hiring (KINDA)
• Visit h]ps://www.mlsecproject.org , message @MLSecProject or just e-‐mail me.!
• Inbound aRacks on exposed services (DEFCON/BH 2013): – Informa4on from inbound connec4ons on firewalls, IPS, WAFs – Feature extrac4on and supervised learning
• Malware Distribu4on and Botnets: – Informa4on from outbound connec4ons on firewalls, DNS and Web Proxy
– Ini4al labeling provided by intelligence feeds and AV/an4-‐malware – Semi-‐supervised learning involved
• Kill-‐chain Ensemble Models: – Increased precision by composing different behaviors – Web server path -‐> go through Firewall, then IPS, then WAF – Early confirma4on of aRack failure or success
MLSec Project -‐ Current Research
Thanks! • Q&A? • Feedback?
Alexandre Pinto @alexcpsec
@MLSecProject hRps://www.mlsecproject.org/
" Essen4ally, all models are wrong, but some are useful." -‐ George E. P. Box
top related