TRANSCRIPT
Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence
Liao, Xiaojing, Kan Yuan, XiaoFeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah
Presented by Octavian Suciu
Octavian Suciu:: Acing the IOC Game
What are Indicators of Compromise (IOCs)?
• Forensic artifacts of intrusion
  – virus signatures
  – IPs/domains used by botnets
  – MD5s of malware
• Shared across the community
  – Threat Intelligence platforms
• Used as input to security products
  – IDS, AVs
The OpenIOC Format
Context = IOC Category [write registry key]
Content = Specific artifact [file being modified]
How is security information disseminated?
• Blogs
• Forums
• Social Networks
• Blacklists & Databases
• Papers & Technical reports [FeatureSmith!]
• Underground markets
Problem
Natural Language vs Machine-readable Format
Problem (2)
• Volume & velocity make manual conversion hard
Problem (3)
• Information Extraction tools are ineffective
  – domain-specific
  – high false positive rate
✓ The Trojan downloads a file ok.zip from the server
X It’s available as a Free 30 day trial download.
iACE Solution
• Automated IOC extractor from technical blogs
• Key observation:
  – Discourse in technical blogs is consistent and stable
• iACE in a nutshell:
  – discover an IOC token (ok.zip)
  – identify context (downloads)
  – analyze their grammatical relation
  – classify relation based on similarity with others
The Trojan downloads a file ok.zip from the server
Outline
• System Design
• Datasets
• Evaluation
• Security Findings
• Discussion
iACE Architecture
Continuously download websites
Filter out non-technical pages (e.g. login pages)
Get sentences likely containing IOCs
Check if the extracted relations are IOCs
Generate the OpenIOC format
Blog Scraper (BS)
• Download complete websites
• Monitor them for new posts
Blog Preprocessor (BP)
• Remove template from webpages
  – Retain only user-generated content
• Convert content to text
  – OCR on images, PDF to text
• Filter pages based on topic
  – topic words
  – article length
  – density of dictionary words
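The three topic-filter signals listed above can be computed along these lines. This is only a rough sketch: the word lists are hypothetical placeholders, the tokenization is deliberately naive, and the slides do not say how iACE combines the features into a decision.

```python
# Hypothetical word lists, for illustration only (not from the talk).
TOPIC_WORDS = {"malware", "trojan", "botnet", "exploit", "ransomware"}
DICTIONARY = {"the", "a", "file", "from", "server", "downloads", "trojan",
              "is", "to", "your", "please", "log", "in", "account"}


def page_features(text):
    """Return the three features named on the slide for one page of text."""
    # Naive tokenization: split on whitespace, strip surrounding punctuation.
    tokens = [t.strip(".,;:!?()\"'") for t in text.lower().split()]
    tokens = [t for t in tokens if t]
    in_dictionary = sum(1 for t in tokens if t in DICTIONARY)
    return {
        "topic_word_count": sum(1 for t in tokens if t in TOPIC_WORDS),
        "length": len(tokens),
        "dictionary_density": in_dictionary / len(tokens) if tokens else 0.0,
    }
```

A classifier (or a simple threshold) would then consume these features; which one is used is left open here.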
Relevant Content Picker (RCP)
• Split text into sentences
• Determine IOC tokens (ok.zip)
  – regex matching
• Identify context terms (downloads)
  – dictionary of relevant terms
The Trojan downloads a file ok.zip from the server
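A minimal sketch of the three RCP steps on the example sentence might look as follows. The regular expressions and the context-term dictionary are small hypothetical stand-ins, not the ones iACE uses, and the sentence splitter is a simplification.

```python
import re

# Hypothetical patterns and terms, chosen only to cover the running example.
IOC_PATTERNS = {
    "file": re.compile(r"\b[\w-]+\.(?:zip|exe|dll)\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}
CONTEXT_TERMS = {"downloads", "drops", "connects", "modifies"}


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def pick_candidates(text):
    """Keep sentences that hold both an IOC-like token and a context term."""
    candidates = []
    for sentence in split_sentences(text):
        words = {w.lower() for w in re.findall(r"[A-Za-z]+", sentence)}
        context = sorted(words & CONTEXT_TERMS)
        tokens = [(kind, m.group(0))
                  for kind, pattern in IOC_PATTERNS.items()
                  for m in pattern.finditer(sentence)]
        if tokens and context:
            candidates.append(
                {"sentence": sentence, "tokens": tokens, "context": context})
    return candidates
```

Sentences that pass this filter are only candidates; deciding whether the token and the context term are actually related is the job of the Relation Checker.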
Relation Checker (RC)
• Graph Mining
  – Similarity metric for directed graphs
  – Compute the number of identical random walks occurring in both graphs
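One common way to count walks shared by two labeled directed graphs is to walk their direct product graph. The sketch below assumes that approach and a hypothetical graph encoding (a dict of node labels plus a list of labeled edges); the slides do not state the exact kernel, weighting, or normalization iACE uses.

```python
def count_common_walks(g1, g2, max_len=3):
    """Count pairs of identically labeled walks with 1..max_len edges.

    Each graph is (nodes, edges): nodes maps node id -> label,
    edges is a list of (source id, target id, edge label).
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2

    # Product edge: one edge from each graph with equal edge label and
    # equal labels on both endpoints.
    product_edges = {}
    for u1, v1, r1 in edges1:
        for u2, v2, r2 in edges2:
            if (r1 == r2 and nodes1[u1] == nodes2[u2]
                    and nodes1[v1] == nodes2[v2]):
                product_edges.setdefault((u1, u2), []).append((v1, v2))

    # Walks may start at any pair of nodes carrying the same label.
    current = {(a, b): 1 for a in nodes1 for b in nodes2
               if nodes1[a] == nodes2[b]}
    total = 0
    for _ in range(max_len):
        following = {}
        for pair, count in current.items():
            for nxt in product_edges.get(pair, []):
                following[nxt] = following.get(nxt, 0) + count
        total += sum(following.values())
        current = following
    return total
```

For dependency graphs, node labels would be words (or word classes) and edge labels grammatical relations; a higher count means the two sentences relate the IOC token and context term in a more similar way.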
• Train classifier on relations from ground truth
• Classify new relations based on their similarity to ground truth
IOC Generator (IG)
• Generate Definition and Header in OpenIOC format
  – map context & IOC terms to XML
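The mapping from context and IOC terms to XML could be sketched as below. The element and attribute names follow the general shape of an OpenIOC document (Indicator, IndicatorItem, Context, Content) but are simplified: namespaces, ids, and most header metadata are omitted, and `build_ioc` is a hypothetical helper, not part of iACE.

```python
import xml.etree.ElementTree as ET


def build_ioc(description, items):
    """Build a simplified OpenIOC-style document.

    items is a list of (search path, content type, value) triples,
    e.g. ("FileItem/FileName", "string", "ok.zip").
    """
    ioc = ET.Element("ioc")
    # Header part: only a short description in this sketch.
    ET.SubElement(ioc, "short_description").text = description
    # Definition part: one OR-ed indicator holding all items.
    definition = ET.SubElement(ioc, "definition")
    indicator = ET.SubElement(definition, "Indicator", operator="OR")
    for search, content_type, value in items:
        item = ET.SubElement(indicator, "IndicatorItem", condition="is")
        # Context = IOC category, Content = the specific artifact.
        ET.SubElement(item, "Context", document=search.split("/")[0],
                      search=search, type="mir")
        ET.SubElement(item, "Content", type=content_type).text = value
    return ET.tostring(ioc, encoding="unicode")
```

Calling `build_ioc("Trojan downloader", [("FileItem/FileName", "string", "ok.zip")])` yields one IndicatorItem whose Context names the category and whose Content carries `ok.zip`.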
Datasets
• DS-Labeled (used for training)
  – 450 articles
  – 1,500 true IOC sentences
  – 3,000 false IOC sentences
• DS-Unknown (used for testing)
  – 45 blogs
  – 71,000 articles
• Training sample size is small
Performance of iACE
• Precision = fraction of identified IOCs that are truly IOCs
• Recall = fraction of IOCs that are identified

                     Precision  Recall
Topic Classifier     98%        100%
iACE on DS-Labeled   98%        92%
iACE on DS-Unknown   95%        90%
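The two definitions above translate directly into code. The sketch below treats the extractor output and the ground truth as sets of IOC strings; the sample values in the usage note are made up for illustration and are not from the evaluation.

```python
def precision_recall(identified, true_iocs):
    """Return (precision, recall) for a set of identified IOCs."""
    identified, true_iocs = set(identified), set(true_iocs)
    correct = len(identified & true_iocs)
    # Precision: fraction of identified IOCs that are truly IOCs.
    precision = correct / len(identified) if identified else 0.0
    # Recall: fraction of true IOCs that were identified.
    recall = correct / len(true_iocs) if true_iocs else 0.0
    return precision, recall
```

For example, identifying `{"ok.zip", "1.2.3.4", "trial"}` against a ground truth of four IOCs that includes the first two gives a precision of 2/3 and a recall of 1/2.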
Performance of Existing Systems
• Precision = fraction of identified IOCs that are truly IOCs
• Recall = fraction of IOCs that are identified

                 Precision  Recall
iACE             98%        93%
AlienVault OTX   72%        56%
Stanford NER     71%        47%
Discovered IOCs
• 45 blogs
• 71,000 articles (DS-Unknown)
• 20,000 identified as containing IOCs
• 900,000 IOCs identified
How are IOCs related to each other?
• Cluster articles on infrastructure-related IOCs
  – IPs, domains, email addresses
  – 527 clusters likely corresponding to campaigns
  – Little cross-reference between articles in same cluster
  – This allowed the discovery of new campaigns
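The slide does not say how the clustering is done. One simple assumption is that two articles belong to the same cluster when they share at least one infrastructure IOC, which reduces to finding connected components. The sketch below uses union-find under that assumption; the function name and input layout are hypothetical.

```python
def cluster_articles(article_iocs):
    """Group articles that share at least one IOC, directly or transitively.

    article_iocs maps an article id to the set of IOCs it mentions.
    Returns a sorted list of clusters, each a sorted list of article ids.
    """
    parent = {article: article for article in article_iocs}

    def find(article):
        # Follow parent links to the cluster root, compressing the path.
        while parent[article] != article:
            parent[article] = parent[parent[article]]
            article = parent[article]
        return article

    first_seen = {}  # IOC -> first article that mentioned it
    for article, iocs in article_iocs.items():
        for ioc in iocs:
            if ioc in first_seen:
                parent[find(article)] = find(first_seen[ioc])
            else:
                first_seen[ioc] = article

    clusters = {}
    for article in article_iocs:
        clusters.setdefault(find(article), set()).add(article)
    return sorted(sorted(members) for members in clusters.values())
```

Under this assumption, articles that never cite each other can still land in one cluster, which matches the observation that cross-references within a cluster are rare.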
How do IOCs evolve over time?
• Cluster articles on attack vector IOCs
  – malware hashes, CVEs
  – measure decay time = # of consecutive months during which an IOC was mentioned
  – most attack vectors are short lived
  – long lasting attacks pointed to small set of C&C servers that were not taken down
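The decay-time measure can be read in more than one way. The sketch below takes it as the longest run of consecutive calendar months in which an IOC is mentioned at least once; that reading is an assumption, since the slide gives only the one-line definition.

```python
def decay_time(months):
    """Longest run of consecutive months with at least one mention.

    months is an iterable of (year, month) pairs, month in 1..12.
    """
    # Map each (year, month) to a single increasing index.
    indices = sorted({year * 12 + (month - 1) for year, month in months})
    longest = run = 0
    previous = None
    for index in indices:
        if previous is not None and index == previous + 1:
            run += 1
        else:
            run = 1
        longest = max(longest, run)
        previous = index
    return longest
```

An IOC mentioned in November 2015, December 2015, January 2016, and March 2016 would have a decay time of 3 months under this reading.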
What is the impact of IOCs on defenses?
• How fast are IOCs adopted by the industry?
  – Measure the time difference between when IOCs are blogged about and when they are detected on VirusTotal
  – 47% of IOCs were detected before being blogged about
  – AVs respond much slower to domains & IPs than to hashes
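The time-difference measurement can be sketched as follows, assuming each IOC already has a blog date and a first-detection date. The record layout is hypothetical, and the lookup of detection dates on VirusTotal is not shown.

```python
from datetime import date


def detection_lags(records):
    """Days from blog post to first detection, per IOC.

    records maps an IOC to (blogged date, detected date).
    A negative lag means the IOC was detected before it was blogged about.
    """
    return {ioc: (detected - blogged).days
            for ioc, (blogged, detected) in records.items()}


def fraction_detected_first(records):
    """Fraction of IOCs detected before being blogged about."""
    lags = detection_lags(records)
    if not lags:
        return 0.0
    return sum(1 for days in lags.values() if days < 0) / len(lags)
```

Splitting the records by IOC type (hash, domain, IP) before calling these functions would support the comparison of response times made on the slide.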
What is the quality of the 45 blogs?
• Timeliness = % of the time a blog is the first to report on IOCs
  – 10 blogs report first on 60% of campaigns
  – a blog with 13% timeliness has 84% exclusive IOCs
• Completeness = % of IOCs reported and how diverse they are (different types)
  – 6 blogs reported 40% of IOCs
  – 9 blogs reported 50% of IOC types
• Robustness = % of robust IOCs that are reported (those that remain unchanged during campaigns)
  – C&C servers, registry email are robust during campaign
  – one blog reports 87% of the robust IOCs
Discussion
• iACE automates IOC generation and has good performance
• Allows an analysis of impact, evolution and relations between IOCs from technical blogs
• Limitations
  – Errors due to natural language ambiguity:
    • e.g. masking http as hxxp in URLs
  – Other intelligence sources are also valuable
    • iACE assumptions might not hold
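The hxxp example hints at one possible mitigation: undoing such masking before IOC tokens are matched. The sketch below is an illustration of that idea, not something the talk describes. It handles the `hxxp` scheme mentioned on the slide and, as an extra assumption, the `[.]` dot masking that often accompanies it.

```python
import re


def refang(text):
    """Undo two common URL-masking conventions in security write-ups."""
    # hxxp:// or hxxps:// -> http:// or https://
    text = re.sub(r"\bhxxp(s?)://", r"http\1://", text, flags=re.IGNORECASE)
    # evil[.]example -> evil.example (assumption, not from the slide)
    return text.replace("[.]", ".")
```

Running this ahead of regex matching would let an ordinary URL pattern pick up `hxxp://evil[.]example/ok.zip` as `http://evil.example/ok.zip`.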