politecnico di milano necstconference.hitb.org/hitbsecconf2014kul/materials/d1t2...%dqordgb di gh de...
TRANSCRIPT
JackdawAutomatic, unsupervised, scalable extraction and
semantic tagging of (interesting) behaviors
Mario Polino, Andrea Scorti, Federico Maggi, Stefano Zanero
Politecnico di MilanoDipartimento di Eletronica, Informazione e Bioingegneria
NECSTLab NElaboratoryCST
HITB Kuala Lumpur, October 15th, 2014
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Malware Analysis at Scale
NECSTLab Stefano Zanero Jackdaw 2
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Need to bridge a gap
Static AnalysisPros: high code
coverage,scalability
Cons: obfuscation,labor-intensive,skill-intensive
Dynamic AnalysisPros: no semantic gap to
be reversed withhuman skills
Cons: low codecoverage, can bedetected byadversary
NECSTLab Stefano Zanero Jackdaw 3
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Hybrid Analysis
Hybrid Analysis techniques• Developed by several groups independently• Try to leverage scalability and completeness of static techniques,combined with semantic insights from dynamic techniques
NECSTLab Stefano Zanero Jackdaw 4
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Example of Hybrid Analysis - Reanimator
Goal: identify “dormant” behavior in a program
Behavior
e.g. Spam
Behavior
Phenotype (manual spec.) Genotype (colored CFG)
dynamic analysis: behavior identificationstatic analysis: signatures of implementing code on (colored) CFG
See: P. Milani Comparetti, G. Salvaneschi, E. Kirda, C. Kolbitsch, C. Kruegel and S. Zanero. Identifying Dormant
Functionality in Malware Programs. IEEE Security and Privacy symposium 2010.
NECSTLab Stefano Zanero Jackdaw 5
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Example of Hybrid Analysis - Beagle
Goal: analyzing evolution of malware families
ExecutionMonitoring 1
2
3
x
Binary Comparison011000010110001001100011
011000010110001001100011
011000010110001001100011
Behavior Extraction
Semantic-Aware
Comparison
Code ChangesUnpackedMalwareVariants
System-LevelActivity
Behaviors
Evolutionarychanges
UpdateServer
dynamic analysis: identification of labeled/unlabeled behaviors assequences of system calls
static analysis: mapping of behaviors on implementing code to checkdifferences by behavior vs. overall
See: M. Lindorfer, A. Di Federico, F. Maggi, P. Milani Comparetti, S. Zanero. Lines of Malicious Code: Insights
Into the Malicious Software Industry. ACSAC 2012
NECSTLab Stefano Zanero Jackdaw 6
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Pivot concept: behavior
behavior =̃ sequence of actions =̃ sequence of API calls (on Win binaries)
download_execute
recv
WriteFile
CreateProcess
Receives data from a network socket
Write a file
Create a new process
NECSTLab Stefano Zanero Jackdaw 7
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Defining Behaviors
Previous work: manual specification of behaviors
• Labor-intensive• Only a small subset of behaviors can be defined manually• Biased by previous experience of experts
ObjectiveExtract (interesting) behavior specifications in an automatic way from alarge collection of (untagged) malware
Why?Support the analyst by providing a list of important behaviors, with a roughexplanation, to prioritize the analysis.
NECSTLab Stefano Zanero Jackdaw 8
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Our Approach: Jackdaw
NECSTLab Stefano Zanero Jackdaw 9
Malware Analysis at Scale Jackdaw Evaluation Conclusions
System Architecture
Step 3: Behavior Extraction
Step 4: Semantic Tagging
Behaviors
Malware Families
Step 1: Data Collection
Step 2: Clustering
Malware FamiliesMalware Instances
NECSTLab Stefano Zanero Jackdaw 10
Malware Analysis at Scale Jackdaw Evaluation Conclusions
First step: Data gathering
1 Dynamic Analysis: data flowanalysis
• API functions name• Parameters of API functions
2 Static Analysis: fingerprint ofcode associated to data flow
• sub-graphs of the CFG• can be hashed and matched• reasonably resilient to
polymorphism
<TAINT_DEPENDENCY id="19">
<call api_function="RegOpenKeyExW" caller_address="0x0065006e,0x00b0df8f" id="1" objects="ObjectAttributes='hku\s-1-5-21-842925246-1425521274-308236825-500\software\microsoft\internet explorer\privacy',
SubKey='software\microsoft\internet explorer\privacy'"/>
<call api_function="RegQueryValueExW" caller_address="0x0065006e,0x00b1cb90" id="2"objects="ValueName='cleancookies'"/>
<call api_function="RegOpenKeyExW" caller_address="0x0065006e,0x00b0dfb5,0x00b1ca3a" id="3" objects="ObjectAttributes='hku\s-1-5-21-842925246-1425521274-308236825-500\software\microsoft\internet explorer\privacy',
SubKey='software\microsoft\internet explorer\privacy'"/>
<call api_function="GetProcAddress", caller_address="0x0065006e,0x00b0dfb5,0x00b1ca3a">
<function_boundaries begin="0x00b0de6f" blocks="0x00b0dfb5,0x00b0df8f" end="0x00b0e06f" function_name="sub_B0DE6F"/>
<function_boundaries begin="0x00b1cb7d" blocks="0x00b1cb90" end="0x00b1cba6" function_name="sub_B1CB7D"/>
<function_boundaries begin="0x00b1ca1c" blocks="0x00b1ca3a" end="0x00b1ca6f" function_name="sub_B1CA1C"/>
<path begin="0x00b0de6f" end="0x00b0e06f" function_name="sub_B0DE6F"><\TAINT_DEPENDENCY>
Parameter Propagation
Caller Addresses
NECSTLab Stefano Zanero Jackdaw 11
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Control Flow Graph Fingerprinting
Static Analysis.Identify portions of CFG likely tocome from the same source code.Properties:
• Unique• Robust to insertion / deletion• Robust to modification Both CFGs share a subgraph of given order
NECSTLab Stefano Zanero Jackdaw 12
Malware Analysis at Scale Jackdaw Evaluation Conclusions
First Step: Data Gathering - Data cleaning
• Static data cleaning: remove the fingerprints of benign binaries (e.g.,Windows libraries and exes)
• Dynamic data cleaning (Windows API name Normalization):
Prefixes WSASocket → socket
SuffixesCreateEventA, CreateEventW →
CreateEvent
NECSTLab Stefano Zanero Jackdaw 13
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Second Step: Clustering
Goal: Build clusters of similar data flows
Sample
...
DFA
DFA
DFA... ...
...
DFA
DFA
DFA... ...
...
DFA
DFA
DFA... ...
Step 2: Clustering of data-flow information based on features of the CFG
CFG
CFG
CFG
NECSTLab Stefano Zanero Jackdaw 14
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Second step: Clustering
• Clustering of data flow inmalware
• Feature: fingerprint• simple one pass algorithm• Threshold• Similarity metrics
Jaccard similarity
J(A,B) = |A∩B||A∪B|
Malware instances
Taint con fingerprint
Cluster
NECSTLab Stefano Zanero Jackdaw 15
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Third Step: Behavior Extraction
Goal: find API functions that represent each cluster (behavior model).
Step 3: Behavior generation based on representative A
PIs
Behavior model 1Behavior model 1
Behavior model 2 Behavior model 2
Behavior model 2
Implementations of behavior 2
NECSTLab Stefano Zanero Jackdaw 16
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Third step: Behavior Extraction - MFR Heuristic
MFR Heuristic (Most Frequent Rule)Model = API functions that appear often. How often? We set a threshold.
Cluster| API NtClearEvent CreateEvent NtSetEventData Flow 1 T T TData Flow 2 T T TData Flow 3 T T T
... ... ... ...Data Flow 13 T T FData Flow 14 T T F
Behavior Specification: NtClearEvent ∧ CreateEvent
NECSTLab Stefano Zanero Jackdaw 17
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Fourth Step: Semantic Tagger
Use Crawler to get knowledge and build significant tag for behaviorsFor Each behavior:
Stackoverflow crawler
Posts related to a specific API
Posts related to a specific API
Posts related to a specific API
Posts related to a specific API
Posts related to a specific API
Posts related to a specific API
Tags extraction
Whitelist / BL
Based on semantic, for example we are interested in posts in which there are tags like "windows", "winapi"We are not interested in some tags like "python","php" etc.
We weight the importance of posts according to those wl/bl tags
API Tag : importance measure
NECSTLab Stefano Zanero Jackdaw 18
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Fourth step: Semantic Tagger
BehaviorAPI 1 API 2 API 3 API 4
Tags Tags Tags Tags
U (tags) = semantic (hints)
• We look for tags searching API function name, each* element ofpowerset of API function in a model.
• Compute a score for each tag (based on post relevance andfrequency of tags in post related to the search).
• Build a ranking of tags.
NECSTLab Stefano Zanero Jackdaw 19
Malware Analysis at Scale Jackdaw Evaluation Conclusions
System Evaluation
NECSTLab Stefano Zanero Jackdaw 20
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Dataset
The dataset:• 1,272 samples from 17 malware families
NECSTLab Stefano Zanero Jackdaw 21
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Evaluation of behavior extraction: approach
• Unsupervised learning (no ground truth)• We built a pseudo ground truth, asking experts to manually describea model.
• We compare these manually defined behavior models withbehavior models automatically identified by Jackdaw.
NECSTLab Stefano Zanero Jackdaw 22
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Evaluation of behavior extraction: results
Correctness:
Ground-truth Behavior:firewall_settings
ShellExecute (advfirewall firewall add rule name: 1)
Automatic Behavior
RegOpenKey (hku\s-1-5-21-842925246-1425521274-308236825-
500\software\microsoft\internet explorer\main)
GetProcAddress
ShellExecute (advfirewall firewall add rule name: 1)
Completeness:34 over 45 behavior models manually created by experts have beenidentified also by Jackdaw.
NECSTLab Stefano Zanero Jackdaw 23
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Empirical evaluation.Example of behavior HTTP connection:
InternetOpen (szAgent: atlsys13.exe: 1) InternetOpenUrl,MapMemRegion,connect,recv,send
(szUrl: http://robertokunihira.sites.uol.com.br/nordeste.jpg, ForeignPort: ['80']: 1, LocalAddress: (['tcp'], public, ['1029']): 1, ForeignIP: public: 1])
'Banload_09af6de40ab414f41ba48b447345e75d'
Position Tag (hint) Score1 http 182 proxy 13.53 ftp 84 file 6.85 mfc 6.86 post 6.27 internet 5.6= upload 5.69 file-download 4.5
10 arrays 411 download 3.912 rich-internet-application 3.3= networking 3.314 httprequest 2.815 httpwebrequest 2.216 internet-explorer 1.7.. .. ..
NECSTLab Stefano Zanero Jackdaw 24
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Recognizing behaviors in unknown malware
Banlo
ad_0
9a
Cycb
ot_
b5f
Dapato
_fd1
Gam
aru
e_d
f6G
eneri
cDow
nlo
ader_
e4c
Generi
cTro
jan_0
84
Gra
ftor_
50a
Kelih
os_
b21
Llac_
a02
Onlin
eG
am
es_
4d6
ZangoH
otb
ar_
507
Zeus_
1be
Zeus_
992
Zeus_
c96
Zeus_
dbe
Zeus_
e77
Zeus_
f57
0.0
0.2
0.4
0.6
0.8
1.0
found
NECSTLab Stefano Zanero Jackdaw 25
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Conclusions
NECSTLab Stefano Zanero Jackdaw 26
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Limitations and Future works
Limitations:• needs buckets of variants of each malware family• analyzed malware needs to be unpacked
Future Works:• Introduce sequence/time concept in behavior models• NLP to improve semantic tagging
NECSTLab Stefano Zanero Jackdaw 27
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Third step: Behavior Extraction - PLR Heuristic
PLR Heuristic (Propositional Logic Rule)Let T be a set of elements; given a set of elements L ⊆ P(T ), the solutionis all sets Q ⊆ P(T ) such that:• ∀l ∈ L,∀qi , qj ∈ Q with qi 6= qj , if qi ⊂ l then qj ∩ l = ∅• ∀l ∈ L,∃!q ∈ Q, q ⊂ l
Cluster| API ...Atom API1 ...* ...Atom API2... RegCloseKey ...NtKey API1... ...NtKey API2...Taint 1 T T F F F
... ... ... ... ... ...Taint 5 T T F F FTaint 6 F F T T T
... ... ... ... ... ...Taint 8 F F T T T
Behavior Specification:(AtomAPI1 ∧ AtomAPI2)⊕ (RegCloseKey ∧ NtKeyAPI1 ∧ NtKeyAPI2)
NECSTLab Stefano Zanero Jackdaw 28
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Conclusions
Jackdaw:• Automatically extracts behavior models of widespread behaviors,exploiting both dynamic and static analysis.
• Assigns a set of semantic tags to each model to help analyst• Maps behavior model on binary code, building a catalog ofimplementations of same behavior which can be used to attribute tofamily/group
NECSTLab Stefano Zanero Jackdaw 29
Malware Analysis at Scale Jackdaw Evaluation Conclusions
Thanks!
• Thanks to you for your attention, please get in touch:[email protected] or @raistolo
• Thanks to my student Mario who is the main author and coder ofJackDaw, and to my colleague Federico for 10 years of work together.
• Thanks to my brother Dhillon and the awesome HITB crew for one ofthe most enjoyable events in the scene worldwide. I’ve spoken inMalaysia in 2006, then I was back in 2012 for the ten yearscelebration, in 2013 and this year, and I look forward to whatever youguys do next. Kudos!
NECSTLab Stefano Zanero Jackdaw 30