politecnico di milano necstconference.hitb.org/hitbsecconf2014kul/materials/d1t2...%dqordgb di gh de...

30
Jackdaw Automatic, unsupervised, scalable extraction and semantic tagging of (interesting) behaviors Mario Polino, Andrea Scorti, Federico Maggi, Stefano Zanero Politecnico di Milano Dipartimento di Eletronica, Informazione e Bioingegneria NECSTLab NElaboratory CST HITB Kuala Lumpur, October 15th, 2014

Upload: others

Post on 27-Apr-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

JackdawAutomatic, unsupervised, scalable extraction and

semantic tagging of (interesting) behaviors

Mario Polino, Andrea Scorti, Federico Maggi, Stefano Zanero

Politecnico di MilanoDipartimento di Eletronica, Informazione e Bioingegneria

NECSTLab NElaboratoryCST

HITB Kuala Lumpur, October 15th, 2014

Page 2: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Malware Analysis at Scale

NECSTLab Stefano Zanero Jackdaw 2

Page 3: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Need to bridge a gap

Static AnalysisPros: high code

coverage,scalability

Cons: obfuscation,labor-intensive,skill-intensive

Dynamic AnalysisPros: no semantic gap to

be reversed withhuman skills

Cons: low codecoverage, can bedetected byadversary

NECSTLab Stefano Zanero Jackdaw 3

Page 4: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Hybrid Analysis

Hybrid Analysis techniques• Developed by several groups independently• Try to leverage scalability and completeness of static techniques,combined with semantic insights from dynamic techniques

NECSTLab Stefano Zanero Jackdaw 4

Page 5: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Example of Hybrid Analysis - Reanimator

Goal: identify “dormant” behavior in a program

Behavior

e.g. Spam

Behavior

Phenotype (manual spec.) Genotype (colored CFG)

dynamic analysis: behavior identificationstatic analysis: signatures of implementing code on (colored) CFG

See: P. Milani Comparetti, G. Salvaneschi, E. Kirda, C. Kolbitsch, C. Kruegel and S. Zanero. Identifying Dormant

Functionality in Malware Programs. IEEE Security and Privacy symposium 2010.

NECSTLab Stefano Zanero Jackdaw 5

Page 6: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Example of Hybrid Analysis - Beagle

Goal: analyzing evolution of malware families

ExecutionMonitoring 1

2

3

x

Binary Comparison011000010110001001100011

011000010110001001100011

011000010110001001100011

Behavior Extraction

Semantic-Aware

Comparison

Code ChangesUnpackedMalwareVariants

System-LevelActivity

Behaviors

Evolutionarychanges

UpdateServer

dynamic analysis: identification of labeled/unlabeled behaviors assequences of system calls

static analysis: mapping of behaviors on implementing code to checkdifferences by behavior vs. overall

See: M. Lindorfer, A. Di Federico, F. Maggi, P. Milani Comparetti, S. Zanero. Lines of Malicious Code: Insights

Into the Malicious Software Industry. ACSAC 2012

NECSTLab Stefano Zanero Jackdaw 6

Page 7: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Pivot concept: behavior

behavior =̃ sequence of actions =̃ sequence of API calls (on Win binaries)

download_execute

recv

WriteFile

CreateProcess

Receives data from a network socket

Write a file

Create a new process

NECSTLab Stefano Zanero Jackdaw 7

Page 8: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Defining Behaviors

Previous work: manual specification of behaviors

• Labor-intensive• Only a small subset of behaviors can be defined manually• Biased by previous experience of experts

ObjectiveExtract (interesting) behavior specifications in an automatic way from alarge collection of (untagged) malware

Why?Support the analyst by providing a list of important behaviors, with a roughexplanation, to prioritize the analysis.

NECSTLab Stefano Zanero Jackdaw 8

Page 9: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Our Approach: Jackdaw

NECSTLab Stefano Zanero Jackdaw 9

Page 10: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

System Architecture

Step 3: Behavior Extraction

Step 4: Semantic Tagging

Behaviors

Malware Families

Step 1: Data Collection

Step 2: Clustering

Malware FamiliesMalware Instances

NECSTLab Stefano Zanero Jackdaw 10

Page 11: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

First step: Data gathering

1 Dynamic Analysis: data flowanalysis

• API functions name• Parameters of API functions

2 Static Analysis: fingerprint ofcode associated to data flow

• sub-graphs of the CFG• can be hashed and matched• reasonably resilient to

polymorphism

<TAINT_DEPENDENCY id="19">

<call api_function="RegOpenKeyExW" caller_address="0x0065006e,0x00b0df8f" id="1" objects="ObjectAttributes='hku\s-1-5-21-842925246-1425521274-308236825-500\software\microsoft\internet explorer\privacy',

SubKey='software\microsoft\internet explorer\privacy'"/>

<call api_function="RegQueryValueExW" caller_address="0x0065006e,0x00b1cb90" id="2"objects="ValueName='cleancookies'"/>

<call api_function="RegOpenKeyExW" caller_address="0x0065006e,0x00b0dfb5,0x00b1ca3a" id="3" objects="ObjectAttributes='hku\s-1-5-21-842925246-1425521274-308236825-500\software\microsoft\internet explorer\privacy',

SubKey='software\microsoft\internet explorer\privacy'"/>

<call api_function="GetProcAddress", caller_address="0x0065006e,0x00b0dfb5,0x00b1ca3a">

<function_boundaries begin="0x00b0de6f" blocks="0x00b0dfb5,0x00b0df8f" end="0x00b0e06f" function_name="sub_B0DE6F"/>

<function_boundaries begin="0x00b1cb7d" blocks="0x00b1cb90" end="0x00b1cba6" function_name="sub_B1CB7D"/>

<function_boundaries begin="0x00b1ca1c" blocks="0x00b1ca3a" end="0x00b1ca6f" function_name="sub_B1CA1C"/>

<path begin="0x00b0de6f" end="0x00b0e06f" function_name="sub_B0DE6F"><\TAINT_DEPENDENCY>

Parameter Propagation

Caller Addresses

NECSTLab Stefano Zanero Jackdaw 11

Page 12: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Control Flow Graph Fingerprinting

Static Analysis.Identify portions of CFG likely tocome from the same source code.Properties:

• Unique• Robust to insertion / deletion• Robust to modification Both CFGs share a subgraph of given order

NECSTLab Stefano Zanero Jackdaw 12

Page 13: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

First Step: Data Gathering - Data cleaning

• Static data cleaning: remove the fingerprints of benign binaries (e.g.,Windows libraries and exes)

• Dynamic data cleaning (Windows API name Normalization):

Prefixes WSASocket → socket

SuffixesCreateEventA, CreateEventW →

CreateEvent

NECSTLab Stefano Zanero Jackdaw 13

Page 14: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Second Step: Clustering

Goal: Build clusters of similar data flows

Sample

...

DFA

DFA

DFA... ...

...

DFA

DFA

DFA... ...

...

DFA

DFA

DFA... ...

Step 2: Clustering of data-flow information based on features of the CFG

CFG

CFG

CFG

NECSTLab Stefano Zanero Jackdaw 14

Page 15: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Second step: Clustering

• Clustering of data flow inmalware

• Feature: fingerprint• simple one pass algorithm• Threshold• Similarity metrics

Jaccard similarity

J(A,B) = |A∩B||A∪B|

Malware instances

Taint con fingerprint

Cluster

NECSTLab Stefano Zanero Jackdaw 15

Page 16: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Third Step: Behavior Extraction

Goal: find API functions that represent each cluster (behavior model).

Step 3: Behavior generation based on representative A

PIs

Behavior model 1Behavior model 1

Behavior model 2 Behavior model 2

Behavior model 2

Implementations of behavior 2

NECSTLab Stefano Zanero Jackdaw 16

Page 17: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Third step: Behavior Extraction - MFR Heuristic

MFR Heuristic (Most Frequent Rule)Model = API functions that appear often. How often? We set a threshold.

Cluster| API NtClearEvent CreateEvent NtSetEventData Flow 1 T T TData Flow 2 T T TData Flow 3 T T T

... ... ... ...Data Flow 13 T T FData Flow 14 T T F

Behavior Specification: NtClearEvent ∧ CreateEvent

NECSTLab Stefano Zanero Jackdaw 17

Page 18: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Fourth Step: Semantic Tagger

Use Crawler to get knowledge and build significant tag for behaviorsFor Each behavior:

Stackoverflow crawler

Posts related to a specific API

Posts related to a specific API

Posts related to a specific API

Posts related to a specific API

Posts related to a specific API

Posts related to a specific API

Tags extraction

Whitelist / BL

Based on semantic, for example we are interested in posts in which there are tags like "windows", "winapi"We are not interested in some tags like "python","php" etc.

We weight the importance of posts according to those wl/bl tags

API Tag : importance measure

NECSTLab Stefano Zanero Jackdaw 18

Page 19: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Fourth step: Semantic Tagger

BehaviorAPI 1 API 2 API 3 API 4

Tags Tags Tags Tags

U (tags) = semantic (hints)

• We look for tags searching API function name, each* element ofpowerset of API function in a model.

• Compute a score for each tag (based on post relevance andfrequency of tags in post related to the search).

• Build a ranking of tags.

NECSTLab Stefano Zanero Jackdaw 19

Page 20: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

System Evaluation

NECSTLab Stefano Zanero Jackdaw 20

Page 21: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Dataset

The dataset:• 1,272 samples from 17 malware families

NECSTLab Stefano Zanero Jackdaw 21

Page 22: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Evaluation of behavior extraction: approach

• Unsupervised learning (no ground truth)• We built a pseudo ground truth, asking experts to manually describea model.

• We compare these manually defined behavior models withbehavior models automatically identified by Jackdaw.

NECSTLab Stefano Zanero Jackdaw 22

Page 23: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Evaluation of behavior extraction: results

Correctness:

Ground-truth Behavior:firewall_settings

ShellExecute (advfirewall firewall add rule name: 1)

Automatic Behavior

RegOpenKey (hku\s-1-5-21-842925246-1425521274-308236825-

500\software\microsoft\internet explorer\main)

GetProcAddress

ShellExecute (advfirewall firewall add rule name: 1)

Completeness:34 over 45 behavior models manually created by experts have beenidentified also by Jackdaw.

NECSTLab Stefano Zanero Jackdaw 23

Page 24: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Empirical evaluation.Example of behavior HTTP connection:

InternetOpen (szAgent: atlsys13.exe: 1) InternetOpenUrl,MapMemRegion,connect,recv,send

(szUrl: http://robertokunihira.sites.uol.com.br/nordeste.jpg, ForeignPort: ['80']: 1, LocalAddress: (['tcp'], public, ['1029']): 1, ForeignIP: public: 1])

'Banload_09af6de40ab414f41ba48b447345e75d'

Position Tag (hint) Score1 http 182 proxy 13.53 ftp 84 file 6.85 mfc 6.86 post 6.27 internet 5.6= upload 5.69 file-download 4.5

10 arrays 411 download 3.912 rich-internet-application 3.3= networking 3.314 httprequest 2.815 httpwebrequest 2.216 internet-explorer 1.7.. .. ..

NECSTLab Stefano Zanero Jackdaw 24

Page 25: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Recognizing behaviors in unknown malware

Banlo

ad_0

9a

Cycb

ot_

b5f

Dapato

_fd1

Gam

aru

e_d

f6G

eneri

cDow

nlo

ader_

e4c

Generi

cTro

jan_0

84

Gra

ftor_

50a

Kelih

os_

b21

Llac_

a02

Onlin

eG

am

es_

4d6

ZangoH

otb

ar_

507

Zeus_

1be

Zeus_

992

Zeus_

c96

Zeus_

dbe

Zeus_

e77

Zeus_

f57

0.0

0.2

0.4

0.6

0.8

1.0

found

NECSTLab Stefano Zanero Jackdaw 25

Page 26: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Conclusions

NECSTLab Stefano Zanero Jackdaw 26

Page 27: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Limitations and Future works

Limitations:• needs buckets of variants of each malware family• analyzed malware needs to be unpacked

Future Works:• Introduce sequence/time concept in behavior models• NLP to improve semantic tagging

NECSTLab Stefano Zanero Jackdaw 27

Page 28: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Third step: Behavior Extraction - PLR Heuristic

PLR Heuristic (Propositional Logic Rule)Let T be a set of elements; given a set of elements L ⊆ P(T ), the solutionis all sets Q ⊆ P(T ) such that:• ∀l ∈ L,∀qi , qj ∈ Q with qi 6= qj , if qi ⊂ l then qj ∩ l = ∅• ∀l ∈ L,∃!q ∈ Q, q ⊂ l

Cluster| API ...Atom API1 ...* ...Atom API2... RegCloseKey ...NtKey API1... ...NtKey API2...Taint 1 T T F F F

... ... ... ... ... ...Taint 5 T T F F FTaint 6 F F T T T

... ... ... ... ... ...Taint 8 F F T T T

Behavior Specification:(AtomAPI1 ∧ AtomAPI2)⊕ (RegCloseKey ∧ NtKeyAPI1 ∧ NtKeyAPI2)

NECSTLab Stefano Zanero Jackdaw 28

Page 29: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Conclusions

Jackdaw:• Automatically extracts behavior models of widespread behaviors,exploiting both dynamic and static analysis.

• Assigns a set of semantic tags to each model to help analyst• Maps behavior model on binary code, building a catalog ofimplementations of same behavior which can be used to attribute tofamily/group

NECSTLab Stefano Zanero Jackdaw 29

Page 30: Politecnico di Milano NECSTconference.hitb.org/hitbsecconf2014kul/materials/D1T2...%DQORDGB DI GH DE I ED E H G Position Tag(hint) Score 1 http 18 2 proxy 13.5 3 ftp 8 4 file 6.8

Malware Analysis at Scale Jackdaw Evaluation Conclusions

Thanks!

• Thanks to you for your attention, please get in touch:[email protected] or @raistolo

• Thanks to my student Mario who is the main author and coder ofJackDaw, and to my colleague Federico for 10 years of work together.

• Thanks to my brother Dhillon and the awesome HITB crew for one ofthe most enjoyable events in the scene worldwide. I’ve spoken inMalaysia in 2006, then I was back in 2012 for the ten yearscelebration, in 2013 and this year, and I look forward to whatever youguys do next. Kudos!

NECSTLab Stefano Zanero Jackdaw 30