tivit interactive: malware detection through call graphs comparison future internet program, wp6

Malware Detection through Call Graphs Comparison

Future Internet Program, WP6

WEBINAR

Orestis Kostakis

F-Secure Corporation

5.10.2011

Outline

• Introduction and problem statement

• Approach and main ingredients

• Data preparation and structural comparison techniques

• Clustering and classification

• Future work

• The team

Problem Statement

One of the major Data Security areas is recognition and blocking of computer viruses and other malicious programs.

• There exist millions of malicious files and there are many groups of similar ones.

• Anti-malware companies receive tens of thousands of unknown files every day.

• Manual analysis is prohibitively slow and expensive.

• A pressing need to identify malicious files automatically. An important step in that direction is finding groups of similar files.

Main Challenges and Approach

The main focus of this work is Windows applications, that is, files in the PE format. Looking at their binary images, it is hard to see similarities or to detect their malicious nature:

• Many samples, including benign ones, are heavily obfuscated

• Small changes in the code or compiler/linker options may lead to significant changes in PE files

• Virus writers have tools for mutating their malware, making it “personal”

To abstract out inessential discrepancies, we study structural properties in the form of call graphs.

A small call graph: Bifrose variant

Ingredients: Making Sense of Call Graphs

To use call graphs for identifying malware or detecting groups of similar files, we have to develop methods for:

• Removing obfuscation and extracting call graphs from PE files.

• Defining what “similarity” means for call graphs and computing it efficiently for any pair of graphs.

• Clustering and classifying executables based on the similarity of their call graphs, in a way that scales to high volumes of incoming files.

Pipeline: Unpacking & call graph extraction → Computing similarity → Clustering → Classification

Recovering Call Graphs from PE Files

• Many executables are packed. Thus, obfuscation must first be removed.

– Static unpacking: re-implement the packer code. Accurately restores the image and is fast to run, but slow to develop.

– Dynamic unpacking: let the executable run and unpack itself. The unpacked image is not always complete. Easier to develop but slower to run.

• Object-oriented code (Delphi, VB, C++, …) is linked through data references, which results in disconnected subgraphs.

• F-Secure’s “Unpacker” and IDA Pro tools are the main components for extracting call graphs.
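The effect of data-reference linkage on graph connectivity can be illustrated with a minimal sketch. It assumes a call graph is available as a list of (caller, callee) pairs; the function names in the example are hypothetical and do not come from any real sample, and the extraction step itself (Unpacker, IDA Pro) is not modelled.

```python
from collections import defaultdict, deque


def weak_components(calls):
    """Group functions into weakly connected components of the call graph."""
    neighbours = defaultdict(set)
    for caller, callee in calls:
        # Ignore edge direction: we only care whether functions are linked at all.
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


# Hypothetical call pairs: the event handler is reached only through a data
# reference, so no call edge connects it to the entry point.
calls = [
    ("start", "sub_401000"),
    ("sub_401000", "CreateFileA"),
    ("TForm1_Click", "sub_402000"),
]
components = weak_components(calls)
print(len(components), "disconnected subgraphs")
```

A graph with more than one component is still usable for comparison, but the similarity measure has to cope with subgraphs that have no edges between them.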

Figure panels: Unpacked, Packed

Similarity Definition for Call Graphs

• Intuitively, it should measure how easy it is to transform one graph into another.

• Basic transformation operations: addition and removal of vertices and edges.

• Define “distance” between two graphs as the length of the shortest sequence of basic operations transforming one graph into another: Graph Edit Distance (GED).

• A shorter distance means a higher similarity.

• Computing GED is, predictably, an NP-hard problem. Exact efficient solutions are unlikely to exist.
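To make the definition concrete, here is a minimal brute-force sketch. It assumes unit cost for every operation, unlabelled vertices, and directed edges given as sets of (caller, callee) pairs; these are illustrative assumptions, not necessarily the cost model used in the project. It tries every vertex mapping, so it is only usable for a handful of vertices, which is exactly why exact computation does not scale.

```python
from itertools import permutations


def exact_ged(v1, e1, v2, e2):
    """Exact graph edit distance by brute force over all vertex mappings."""
    n = max(len(v1), len(v2))
    # Pad the smaller vertex set with dummy vertices so a bijection exists.
    p1 = list(v1) + [("pad1", i) for i in range(n - len(v1))]
    p2 = list(v2) + [("pad2", i) for i in range(n - len(v2))]
    vertex_cost = abs(len(v1) - len(v2))  # vertices that must be added or removed

    best = None
    for perm in permutations(p2):
        phi = dict(zip(p1, perm))
        # Edges of graph 1 whose image under phi is also an edge of graph 2.
        preserved = sum(1 for a, b in e1 if (phi[a], phi[b]) in e2)
        # Every non-preserved edge is either removed from graph 1 or added for graph 2.
        cost = vertex_cost + len(e1) + len(e2) - 2 * preserved
        if best is None or cost < best:
            best = cost
    return best


# Two tiny hypothetical call graphs: the second has one extra function and call.
v_a = ["main", "init", "send"]
e_a = {("main", "init"), ("main", "send")}
v_b = ["start", "setup", "send", "log"]
e_b = {("start", "setup"), ("start", "send"), ("start", "log")}

print(exact_ged(v_a, e_a, v_b, e_b))  # one vertex and one edge added
```

The loop visits n! mappings for graphs with n vertices, so even a call graph with 20 functions is out of reach for this approach.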

Simulated Annealing for Computing GED

As our problem is NP-hard, we have to resort to approximation algorithms. One of the earlier approaches is Bipartite Matching. We selected a different approach, Simulated Annealing, and found it faster and more accurate. It is a local search algorithm in the solution space:

• From a given point, select a neighboring one at random.

• If the selected solution is better, move to it. Otherwise, move to it anyway with a specified probability.
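The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: a solution is a vertex mapping between the two graphs, a neighbor is obtained by swapping the images of two vertices, worse moves are accepted with the Metropolis probability exp(-Δ/T), and the temperature follows a geometric cooling schedule. The slides do not specify any of these choices, and the parameter values are arbitrary.

```python
import math
import random


def anneal_ged(v1, e1, v2, e2, steps=5000, t0=2.0, cooling=0.999, seed=0):
    """Estimate an upper bound on unit-cost GED by simulated annealing."""
    rng = random.Random(seed)
    n = max(len(v1), len(v2))
    # Pad the smaller vertex set with dummies so every vertex has an image.
    p1 = list(v1) + [("pad1", i) for i in range(n - len(v1))]
    image = list(v2) + [("pad2", i) for i in range(n - len(v2))]
    rng.shuffle(image)  # image[i] is the vertex that p1[i] is mapped to
    vertex_cost = abs(len(v1) - len(v2))

    def cost():
        phi = dict(zip(p1, image))
        preserved = sum(1 for a, b in e1 if (phi[a], phi[b]) in e2)
        return vertex_cost + len(e1) + len(e2) - 2 * preserved

    current = best = cost()
    temperature = t0
    for _ in range(steps):
        if n < 2:
            break  # nothing to swap
        # Neighboring solution: swap the images of two random vertices.
        i, j = rng.sample(range(n), 2)
        image[i], image[j] = image[j], image[i]
        candidate = cost()
        delta = candidate - current
        # Always accept improvements; accept worse moves with probability exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / max(temperature, 1e-9)):
            current = candidate
            best = min(best, current)
        else:
            image[i], image[j] = image[j], image[i]  # undo the swap
        temperature *= cooling
    return best


v_a = ["main", "init", "send"]
e_a = {("main", "init"), ("main", "send")}
v_b = ["start", "setup", "send", "log"]
e_b = {("start", "setup"), ("start", "send"), ("start", "log")}
estimate = anneal_ged(v_a, e_a, v_b, e_b)
print(estimate)
```

Because every visited mapping corresponds to a valid edit sequence, the returned value never underestimates the true distance; how close it gets depends on the number of steps and the cooling schedule.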

A comparison between Bipartite Matching and Simulated Annealing for 1000 random pairs of call graphs:

Plot: GED per graph pair

Clustering Approach and Current Results

• Past results given by offline clustering algorithms were highly promising.

• In the online setup:

– Calculating full distance matrices is prohibitive.

– Need to keep graph comparisons to a minimum.

• Our approach:

– For each cluster we keep a short list of reference samples.

– For each incoming sample, identify a list of candidate clusters.

– Assign to first suitable candidate cluster, once found.

• Results:

– Able to cluster over 4000 samples/day.

– On average, 5 new clusters of >100 samples appear daily.

– Current heuristic is dependent on the time-of-arrival of each sample.
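The online approach listed above can be sketched as follows. All names, the threshold, and the candidate filter are hypothetical: the slides do not say how candidate clusters are identified, so this sketch assumes a cheap lower bound on the distance (difference in edge count) as the filter, and uses a toy distance (size of the symmetric difference of edge sets) in place of an approximate GED.

```python
class OnlineClusterer:
    """Assign each incoming sample to the first suitable cluster, or open a new one."""

    def __init__(self, distance, lower_bound, threshold, max_refs=3):
        self.distance = distance        # expensive comparison (stand-in for GED)
        self.lower_bound = lower_bound  # cheap filter, never exceeds the distance
        self.threshold = threshold
        self.max_refs = max_refs
        self.clusters = []              # each cluster is a short list of reference samples

    def assign(self, sample):
        for cluster_id, refs in enumerate(self.clusters):
            # Candidate selection: skip references that cannot be close enough.
            candidates = [r for r in refs
                          if self.lower_bound(sample, r) <= self.threshold]
            if any(self.distance(sample, r) <= self.threshold for r in candidates):
                if len(refs) < self.max_refs:
                    refs.append(sample)  # keep the reference list short
                return cluster_id        # first suitable cluster wins
        self.clusters.append([sample])
        return len(self.clusters) - 1


def toy_distance(a, b):
    return len(a ^ b)  # edges present in exactly one of the two graphs


def size_bound(a, b):
    return abs(len(a) - len(b))  # never larger than toy_distance(a, b)


clusterer = OnlineClusterer(toy_distance, size_bound, threshold=1)
sample_a = frozenset({(1, 2), (2, 3)})
sample_b = frozenset({(1, 2), (2, 3), (3, 4)})
sample_c = frozenset({(7, 8), (8, 9), (9, 7)})
labels = [clusterer.assign(s) for s in (sample_a, sample_b, sample_c)]
print(labels)
```

Since a sample joins the first suitable cluster and may become a reference for later samples, the outcome depends on arrival order, which matches the time-of-arrival dependence noted in the results.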

Future Work

• Optimizing heuristic parts of the method.

• Improving the call graph extraction procedure.

• Developing methods for turning the current technology into tools for automated classification of executable files.

• Studying time evolution of executable call graphs and their clusters.

The Project Team

The work was done in WP6 (the “Security” WP) of Tivit’s Future Internet program, in a partnership between:

• Department of Information and Computer Science, Aalto University

• F-Secure Corporation

• Nokia Research Center

Thank you!

[email protected]



