Network Intrusion Detection
By: Jack Song, Julina Zhang, Kerry Jones
Advisors: Dr. Don Brown, Dr. Hyojung Kang, Dr. Malathi Veeraraghavan
Client: UVA Information Security, Policy, and Records Office (ISPRO)
Sponsors: UVA SEAS / Leidos
Agenda
● Team Members
● Project Objectives
● Progress to Date
● Deliverables
● Potential Sponsors
Team Members - Data Science Institute
Jack Song
● Majored in Computer Science at UVA
Julina Zhang
● Majored in Statistics and Economics at UVA
Kerry Jones
● Majored in Government and Geography at UMD
Team Members - Advisors
Dr. Donald E. Brown
● Director of the Data Science Institute
● Dept. of Systems and Information Engineering
Dr. Malathi Veeraraghavan
● Dept. of Electrical & Computer Engineering
Dr. Hyojung Kang
● Dept. of Systems and Information Engineering
Team Members
Jason Belford
● Chief Information Security Officer
Jeff Collyer
● Information Security Engineer
Team Members
Sourav Maji
● Third-year PhD student in Computer Engineering
Ron Hutchins
● Vice President for Information Technology
Objectives
● Detect anomalous traffic leaving the UVA network using machine learning and data mining.
● Develop a network intrusion detection prototype.
Background - Approaches
● Lancope StealthWatch
● Previous approaches
○ Density-based Spatial Clustering of Applications with Noise (Erman, Arlitt, Mahanti)
○ K-Means Clustering (Erman, Arlitt, Mahanti)
○ One-class Support Vector Machine (Locke, Wang, Paschalidis)
○ Neural Network (Locke, Wang, Paschalidis)
○ Hierarchical Clustering (Ling, Rosti, Swanson)
○ Isolation Forest (Liu, Ting, Zhou)
● Our approach
○ Isolation Forest - An unsupervised learning method that utilizes a tree structure to isolate anomalies.
Our progress, at a glance
[Timeline figure: progress over the course of time, in three phases]
● Initial data: ISPRO, preprocessing, Wireshark, filtering
● Filtered data: unsupervised methods, Isolation Forest (didn't work out well)
● NetFlow data: collection server (Power Edge), TShark, conversation data, better 'unit', preliminary results
Initial data phase
Data from ISPRO
+ Data preprocessing
+ Data filtering by source IPs within UVA network
Result: a subset of the packet capture data containing all connections initiated within the UVA network
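The source-IP filtering step could look roughly like the sketch below. It is an illustration only, not the team's script: the network range is a documentation placeholder (the real UVA prefixes are not given in these slides), and the packet record layout with `src`, `dst`, and `length` keys is hypothetical.

```python
import ipaddress

# Placeholder range (RFC 5737 documentation block); the real campus
# prefixes are an assumption and are not stated in the slides.
CAMPUS_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def is_campus_source(ip_text, networks=CAMPUS_NETWORKS):
    """Return True if the address falls inside any campus network."""
    address = ipaddress.ip_address(ip_text)
    return any(address in network for network in networks)

def filter_outbound(packets, networks=CAMPUS_NETWORKS):
    """Keep only packets whose source IP lies inside the campus network."""
    return [p for p in packets if is_campus_source(p["src"], networks)]

# Hypothetical packet records, one dict per captured packet.
packets = [
    {"src": "192.0.2.10", "dst": "198.51.100.7", "length": 60},
    {"src": "203.0.113.5", "dst": "192.0.2.10", "length": 1500},
]
outbound = filter_outbound(packets)
```

Only the first record survives the filter, because its source address lies inside the placeholder campus range.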
Initial data phase, Data Preprocessing
● ISPRO data: 1 TB
● One pcap file: 50 GB / 6 min
● Wireshark/TShark: 50 GB → 5 GB; .pcap → .csv
● Python script: summary statistics; algorithms
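As a rough sketch of the .pcap → .csv conversion, a Python script can call TShark with its field-export options. The chosen columns and the file name `capture.pcap` are assumptions for illustration; the slides do not say which fields were exported.

```python
import subprocess

# Hypothetical choice of columns; the slides do not list the fields used.
FIELDS = ["frame.time_epoch", "ip.src", "ip.dst", "frame.len"]

def build_tshark_command(pcap_path, fields=FIELDS):
    """Build a tshark command that prints the chosen fields as CSV rows."""
    command = ["tshark", "-r", pcap_path, "-T", "fields"]
    for field in fields:
        command += ["-e", field]
    # Header row, comma separator, double-quoted values.
    command += ["-E", "header=y", "-E", "separator=,", "-E", "quote=d"]
    return command

def pcap_to_csv(pcap_path, csv_path, fields=FIELDS):
    """Run tshark and write its CSV output to csv_path (requires tshark)."""
    with open(csv_path, "w") as out:
        subprocess.run(build_tshark_command(pcap_path, fields), stdout=out, check=True)

# Only builds the command here; pcap_to_csv would actually run it.
command = build_tshark_command("capture.pcap")
```

Keeping the command construction separate from the `subprocess.run` call makes the export easy to inspect before it is run on a 50 GB file.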
Filtered data phase
Result from last phase
Created source-destination IP pairs
Calculated frequency and mean length for each pair
+ Isolation Forest
Provided an initial view, but more is needed.
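The per-pair features can be sketched as below. This assumes "length" means packet length in bytes and reuses a hypothetical record layout with `src`, `dst`, and `length` keys; the resulting (frequency, mean length) values are the kind of feature vector an Isolation Forest could then score.

```python
from collections import defaultdict

def pair_features(packets):
    """Return {(src, dst): (frequency, mean_length)} for each IP pair."""
    counts = defaultdict(int)
    total_length = defaultdict(int)
    for p in packets:
        pair = (p["src"], p["dst"])
        counts[pair] += 1
        total_length[pair] += p["length"]
    return {pair: (counts[pair], total_length[pair] / counts[pair]) for pair in counts}

# Hypothetical filtered packets from the previous phase.
packets = [
    {"src": "192.0.2.10", "dst": "198.51.100.7", "length": 60},
    {"src": "192.0.2.10", "dst": "198.51.100.7", "length": 100},
    {"src": "192.0.2.11", "dst": "203.0.113.5", "length": 1500},
]
features = pair_features(packets)
```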
Filtered data phase, what we’ve learned
Packet capture data ONLY captures packets
+ Need to capture the entire user session
Need NetFlow records
NetFlow data phase -- Now
● Setting up a collection server
○ Power Edge
● Conversation data & TShark
● Better ‘Unit of comparison’
○ include port number
● Preliminary analysis
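To illustrate why the port number gives a better unit of comparison, the sketch below groups conversations with and without it. It assumes the port in question is the destination port and uses hypothetical conversation records; neither detail is stated in the slides.

```python
from collections import Counter

def unit_key(conversation, include_port=True):
    """Build the comparison unit for one conversation record."""
    key = (conversation["src"], conversation["dst"])
    if include_port:
        # Assumption: the destination port is the one added to the unit.
        key += (conversation["dst_port"],)
    return key

# Hypothetical conversations between the same two hosts.
conversations = [
    {"src": "192.0.2.10", "dst": "198.51.100.7", "dst_port": 80},
    {"src": "192.0.2.10", "dst": "198.51.100.7", "dst_port": 53},
    {"src": "192.0.2.10", "dst": "198.51.100.7", "dst_port": 80},
]
by_pair = Counter(unit_key(c, include_port=False) for c in conversations)
by_pair_and_port = Counter(unit_key(c) for c in conversations)
```

Without the port, all three conversations collapse into one unit; with it, web and DNS traffic between the same hosts are compared separately.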
Summary Statistics
Count                                    157,313
Unique Source IP                         11,514
Unique Destination IP                    13,113
Unique Destination Ports                 1,631
Unique Source Ports                      48,925
Average Duration                         31 seconds
Average Packets Source to Destination    34 packets
Average Packets Destination to Source    31 packets
Average Bytes Source to Destination      10,172 bytes
Average Bytes Destination to Source      58,134 bytes
Top Five Most Frequently Used Destination Ports
Destination Port    Count     Number of Unique Source IP pairs
80 (HTTP)           66,390    11,238
443 (HTTPS)         38,422    954
25 (SMTP)           24,277    39
6                   20,387    1
3                   957       2
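A table like the one above could be produced along the following lines. This is a sketch under assumptions: the record layout is hypothetical, and the last column is interpreted here as the number of distinct source IPs seen per destination port.

```python
from collections import Counter, defaultdict

def top_destination_ports(conversations, n=5):
    """Return [(port, count, unique_source_ips)] for the n busiest ports."""
    counts = Counter()
    sources = defaultdict(set)
    for c in conversations:
        counts[c["dst_port"]] += 1
        sources[c["dst_port"]].add(c["src"])
    return [(port, count, len(sources[port])) for port, count in counts.most_common(n)]

# Hypothetical conversation records.
conversations = [
    {"src": "192.0.2.10", "dst_port": 80},
    {"src": "192.0.2.11", "dst_port": 80},
    {"src": "192.0.2.10", "dst_port": 80},
    {"src": "192.0.2.10", "dst_port": 443},
]
table = top_destination_ports(conversations)
```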
NetFlow data phase, next steps
● Finish setting up Power Edge
○ Shell script
○ Cron job
■ Automation of daily data collection
● Go into specifics, “symptoms”
○ DNS tunneling
○ Phishing
Identified Cyber Security Needs
● Identifying anomalous behavior in traffic leaving the UVA network
○ Source data: NetFlow records
○ Traffic from hosts with static public IP addresses
● DNS Tunneling
○ Data theft using port 53 as a pathway
● Phishing Attack
○ Obtain sensitive information by disguising and baiting.
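One very simple "symptom" check for DNS tunneling is to total the bytes each host sends to port 53 and flag unusually heavy senders. The sketch below is only a starting point under stated assumptions: the threshold is a made-up value, the `bytes_out` field name is hypothetical, and real tunneling detection would need more than a volume cut-off.

```python
from collections import defaultdict

DNS_PORT = 53
# Hypothetical cut-off; a real threshold would need tuning on campus data.
BYTES_THRESHOLD = 1_000_000

def heavy_dns_senders(conversations, threshold=BYTES_THRESHOLD):
    """Return source IPs whose total bytes sent to port 53 exceed threshold."""
    sent = defaultdict(int)
    for c in conversations:
        if c["dst_port"] == DNS_PORT:
            sent[c["src"]] += c["bytes_out"]
    return sorted(src for src, total in sent.items() if total > threshold)

# Hypothetical NetFlow-style records (source to destination byte counts).
conversations = [
    {"src": "192.0.2.10", "dst_port": 53, "bytes_out": 400},
    {"src": "192.0.2.11", "dst_port": 53, "bytes_out": 900_000},
    {"src": "192.0.2.11", "dst_port": 53, "bytes_out": 300_000},
    {"src": "192.0.2.12", "dst_port": 443, "bytes_out": 5_000_000},
]
suspects = heavy_dns_senders(conversations)
```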
Challenges
1. Domain knowledge
2. Size of data
a. 36 min of data, approx. 270 GB
3. IP addresses
a. Dynamic vs. Static
b. Private vs. Public
4. Unlabeled data → unsupervised learning
Deliverables
● Paper
● Network intrusion detection prototype
● Shell script
Potential Sponsors
● NSF Cybersecurity Innovation for Cyberinfrastructure (CICI)
● NSF Secure and Trustworthy Cyberspace (SaTC) programs
● DHS CyberSecurity Division programs
● DOE Cybersecurity for Energy program
● Industry, specifically NTT Labs and Cisco
References
1. Ashfaq, Rana Aamir Raza, et al. "Fuzziness Based Semi-Supervised Learning Approach for Intrusion Detection System." Information Sciences (2016).
2. Fung, Carol, and Raouf Boutaba. Intrusion Detection Networks. CRC Press, 2013.
3. Fung, Carol, and Raouf Boutaba. Intrusion Detection Networks: A Key to Distributed Security. CRC Press, 2013.
4. Erman, Jeffrey, Martin Arlitt, and Anirban Mahanti. "Traffic Classification using Clustering Algorithms." Proceedings of the 2006 SIGCOMM workshop on Mining network data. ACM, 2006. 281-286.
5. Farnham, Greg. "Detecting DNS Tunneling." SANS Institute InfoSec Reading Room. 2013.
6. Grimes, Robert. Detect network anomalies with StealthWatch. 2014. IDG. 2016. <http://www.infoworld.com/article/2848768/security/detect-network-anomalies-with-stealthwatch.html>.
7. Locke, R., J. Wang, and I. Paschalidis. "Anomaly Detection Techniques for Data Exfiltration Attempts." Boston University Center for Information and Systems Engineering, 2012.
8. Sommer, Robin, and Vern Paxson. "Outside the Closed World: On using Machine Learning for Network Intrusion Detection." 2010 IEEE symposium on security and privacy (2010).
9. Ling, Yuning, Marcus Rosti, and Gregory Swanson. "A Hands-off Approach to Network Intrusion Detection." IEEE Systems and Information Engineering Design Conference (SIEDS). Charlottesville: IEEE, 2016. 216-220.
10. Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD) 6.1 (2012): 3.
Isolation Forest
• Unsupervised learning method
• Builds an ensemble of iTrees for a given data set.
• The anomalies are those observations with the shortest average path length from the root node.
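The idea can be sketched in plain Python as follows. This is a minimal teaching version loosely following the description in Liu, Ting, and Zhou, not the implementation used in this project; the function names, the defaults (100 trees, subsample of 256), and the synthetic data are all assumptions for illustration.

```python
import math
import random

def c_factor(n):
    """Average path length of an unsuccessful search in a tree of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split points on a random feature at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return {"size": len(points)}
    # Only features that still vary can separate the remaining points.
    candidates = [f for f in range(len(points[0]))
                  if min(p[f] for p in points) < max(p[f] for p in points)]
    if not candidates:
        return {"size": len(points)}
    feature = rng.choice(candidates)
    low = min(p[feature] for p in points)
    high = max(p[feature] for p in points)
    split = rng.uniform(low, high)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node, depth=0):
    """Depth at which point lands, plus c(size) for points left unsplit."""
    if "feature" not in node:
        return depth + c_factor(node["size"])
    branch = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[branch], depth + 1)

def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Score each point in (0, 1]; values near 1 suggest anomalies."""
    if len(data) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(size))
    trees = [build_tree(rng.sample(data, size), 0, max_depth, rng)
             for _ in range(n_trees)]
    scores = []
    for point in data:
        mean_path = sum(path_length(point, t) for t in trees) / n_trees
        # Shorter average paths give scores closer to 1.
        scores.append(2.0 ** (-mean_path / c_factor(size)))
    return scores

# Synthetic data: a tight cluster plus one far-away point at index 200.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((10.0, 10.0))
scores = anomaly_scores(data)
most_anomalous = max(range(len(data)), key=lambda i: scores[i])
```

The far-away point is isolated after very few random splits, so its average path from the root is short and its score is the highest in the set.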
Preliminary Results of iForest