On Challenges in Evaluating Malware Clustering
Peng Li (University of North Carolina at Chapel Hill, NC, USA) · Limin Liu (State Key Lab of Information Security, Graduate School of Chinese Academy of Sciences) · Debin Gao (School of Information Systems, Singapore Management University, Singapore) · Mike Reiter (University of North Carolina at Chapel Hill, NC, USA)
UNC.edu 2
Malware Clustering
Given a collection of malware instances (executables), how is the distance between them defined?
Static vs. Dynamic
Static: [Dullien et al., SSTIC 2005] [Zhang et al., ACSAC 2007] [Briones et al., VB 2008] [Griffin et al., RAID 2009]; challenged by packers
Dynamic: run instances in a dynamic analysis system and cluster the resulting traces (API calls, system calls, etc.): [Gheorghescu et al., VB 2005] [Rieck et al., DIMVA 2008] [Martignoni et al., RAID 2008] [Bayer et al., NDSS 2009]
Ground truth? Single anti-virus scanner
[Lee et al., EICAR 2006] [Rieck et al., DIMVA 2008] [Hu et al., CCS 2009]
Ground truth? Single anti-virus scanner
Inconsistency issue [Bailey et al., RAID 2007]
Ground truth? Multiple anti-virus scanners
[Bayer et al., NDSS 2009] [Perdisci et al., NSDI 2010] [Rieck et al., TR18-2009]
Our Work
Proposed a conjecture: the "multiple-anti-virus-scanner voting" method of selecting ground-truth data biases clustering results toward high accuracy
Designed experiments to test this conjecture; the results give conflicting signals
Revealed the effect of the cluster-size distribution on the significance of malware clustering results
To Test Our Conjecture
Given a dataset "D" generated via multiple-anti-virus-scanner voting: can we always get high accuracy, using a variety of techniques to do the clustering?
Dataset “D1”
2,658 malware instances [Bayer et al., NDSS 2009], a subset of 14,212 malware instances, selected by majority voting among 6 different anti-virus programs
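As an illustration only (not code from the paper), the majority-voting selection step can be sketched as follows; the `majority_label` helper and the quorum of 4 out of 6 scanners are assumptions:

```python
from collections import Counter

def majority_label(labels, quorum=4):
    """Keep an instance only if at least `quorum` scanners agree on its
    family label; otherwise drop it from the ground-truth dataset."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= quorum else None

# 4 of 6 scanners agree, so the instance is kept with label "allaple"
print(majority_label(["allaple", "allaple", "allaple", "allaple", "virut", "?"]))
```

Instances on which the scanners disagree are discarded, which is exactly why the resulting dataset may be biased toward easy cases.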
A Variety of Techniques
MC1 (Malware Clustering #1) [Bayer et al., NDSS 2009]: monitors the execution of a program and creates its behavioral profile, abstracting system calls, their dependences, and network activities into a generalized representation consisting of OS objects and OS operations
PD1–PD3 are plagiarism detectors, which also attempt to detect some degree of similarity among a large number of software programs
PD1: similarity (string matching) of the sequences of API calls [Tamada et al., ISFST 2004]
PD2: Jaccard similarity of short sequences of system calls [Wang et al., ACSAC 2009]
PD3: Jaccard similarity of short sequences of API calls
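A minimal sketch of the Jaccard-over-short-sequences distance that PD2/PD3 use; the helper names and the window length n=4 are my assumptions, not the paper's exact parameters:

```python
def ngrams(seq, n=4):
    """Set of contiguous length-n subsequences (n-grams) of a call trace."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def jaccard_distance(trace_a, trace_b, n=4):
    """1 - Jaccard similarity of the n-gram sets of two call traces."""
    a, b = ngrams(trace_a, n), ngrams(trace_b, n)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# identical traces are at distance 0; traces sharing no n-gram at distance 1
print(jaccard_distance(list("ABCDEF"), list("ABCDEF")))  # 0.0
```

The pairwise distances over all instances form the distance matrix handed to clustering.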
Clustering on D1
(Pipeline figure) Dynamic traces of D1 are fed to MC1, PD1, PD2, and PD3; each produces a distance matrix, which is passed to hierarchical clustering, with a reference distribution used for comparison
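The hierarchical-clustering step can be sketched as single-linkage clustering at a fixed distance cutoff, which reduces to finding connected components of the "close pairs" graph; this is a simplification of the pipeline above, and the function names are mine:

```python
def single_linkage(dist, threshold):
    """Cluster items 0..n-1 given a symmetric distance matrix `dist`:
    single-linkage at a fixed cutoff equals the connected components of
    the graph with an edge wherever dist[i][j] <= threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

For example, a 4-instance matrix with two tight pairs yields two clusters at a cutoff between the within-pair and across-pair distances.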
Precision and Recall
Let the test clustering C have clusters C_1, …, C_c and the reference clustering R have clusters R_1, …, R_r, over n instances in total.

Precision: Prec(C, R) = (1/n) · Σ_{i=1}^{c} max_j |C_i ∩ R_j|
Recall: Recall(C, R) = (1/n) · Σ_{j=1}^{r} max_i |C_i ∩ R_j|
F-measure: F(C, R) = 2 · Prec(C, R) · Recall(C, R) / (Prec(C, R) + Recall(C, R))
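The precision/recall/F-measure definitions translate directly to code; here a clustering is a list of sets of instance ids (a sketch, not the authors' implementation):

```python
def precision(test, ref):
    """Prec(C, R): each test cluster is credited with its largest
    overlap with any single reference cluster, normalized by n."""
    n = sum(len(c) for c in test)
    return sum(max(len(c & r) for r in ref) for c in test) / n

def recall(test, ref):
    """Recall(C, R): each reference cluster is credited with its largest
    overlap with any single test cluster, normalized by n."""
    n = sum(len(r) for r in ref)
    return sum(max(len(c & r) for c in test) for r in ref) / n

def f_measure(test, ref):
    """Harmonic mean of precision and recall."""
    p, r = precision(test, ref), recall(test, ref)
    return 2 * p * r / (p + r)

# a test clustering identical to the reference scores a perfect 1.0
ref = [{1, 2, 3}, {4, 5}]
print(f_measure([{1, 2, 3}, {4, 5}], ref))  # 1.0
```

Note the asymmetry: lumping everything into one test cluster gives recall 1.0 but precision equal to the largest reference cluster's share of the data.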
Results on D1
Against the AV reference clustering:
MC1: Prec 0.984, Recall 0.930, F-measure 0.956
PD1: Prec 0.965, Recall 0.922, F-measure 0.943
PD2: Prec 0.978, Recall 0.922, F-measure 0.952
PD3: Prec 0.982, Recall 0.938, F-measure 0.960

With MC1 as the reference clustering:
PD1: Prec 0.988, Recall 0.939, F-measure 0.963
PD2: Prec 0.989, Recall 0.941, F-measure 0.964
PD3: Prec 0.988, Recall 0.938, F-measure 0.963

Both MC1 and the PDs perform well, which supports our conjecture
Is this the case for any collection of malware to be analyzed?
Dataset D2 and Ground-truth
1,114 instances, randomly sampled from a VXH selection of 5,121 and run through the dynamic analysis system
Ground truth selected by a more conservative criterion; clustered with MC1, PD1, and PD3
Results on D2
Against the AV reference clustering:
MC1: Prec 0.604, Recall 0.659, F-measure 0.630
PD1: Prec 0.704, Recall 0.536, F-measure 0.609
PD3: Prec 0.788, Recall 0.502, F-measure 0.613

With MC1 as the reference clustering:
PD1: Prec 0.770, Recall 0.798, F-measure 0.784
PD3: Prec 0.790, Recall 0.826, F-measure 0.808

Both perform more poorly on D2 than they did on D1
This does not support our conjecture
Differences Between D1 and D2
(Figure: CDF of reference cluster sizes for datasets D1 and D2) Dataset D1 is highly biased, with two large clusters comprising 48.5% and 27% of the malware instances, respectively, and the remaining clusters each of size at most 6.7%. For dataset D2, the largest cluster comprises only 14% of the instances. Other investigations (length of API call sequences, detailed behaviors, etc.) are in the paper.
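One quick way to see why a biased cluster-size distribution inflates accuracy numbers (an illustration of mine, not an experiment from the paper): the degenerate test clustering that lumps every instance into one cluster always has recall 1.0 and precision equal to the largest reference cluster's fraction of the data, so its F-measure is already sizable on a D1-like distribution (largest cluster 48.5%) but not on a D2-like one (largest cluster 14%):

```python
def trivial_baseline_f(largest_fraction):
    """F-measure of the one-big-cluster test clustering: recall is 1.0,
    precision is the largest reference cluster's share of the instances."""
    p, r = largest_fraction, 1.0
    return 2 * p * r / (p + r)

print(round(trivial_baseline_f(0.485), 3))  # 0.653 for a D1-like bias
print(round(trivial_baseline_f(0.14), 3))   # 0.246 for a D2-like distribution
```

A fixed F-measure is therefore far less impressive over D1's distribution than over D2's.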
The Significance of the Precision and Recall
Case one: biased ground truth
(Figure) Two different test clusterings of the same eight instances each achieve Prec = 7/8, Recall = 7/8
The Significance of the Precision and Recall
Case two: unbiased ground truth
(Figure) Two different test clusterings each achieve only Prec = 4/8, Recall = 4/8
With eight instances in two reference clusters of four, a random 4/4 test split takes exactly two instances from each reference cluster in C(4,2) · C(4,2) = 36 of the C(8,4) = 70 possible ways
Considerably “harder” to produce a clustering yielding good precision and recall in the latter case
A good precision and recall in the latter case is thus much more significant than in the former.
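The count behind the unbiased case can be checked directly with `math.comb`: of the C(8,4) ways to draw a 4-instance test cluster, C(4,2) · C(4,2) take two instances from each reference cluster of four.

```python
from math import comb

# eight instances, two reference clusters of four each
ways_even_split = comb(4, 2) * comb(4, 2)  # splits taking 2 from each cluster
total_splits = comb(8, 4)                  # all possible 4/4 test splits
print(ways_even_split, total_splits)  # 36 70
```

So a 36/70 majority of random splits lands on the worst-case Prec = Recall = 4/8, making a good score over an unbiased reference genuinely informative.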
Perturbation Test
(Figure: perturbation results for MC1 on D1 vs. MC1 on D2)
Results of Perturbation Test
The cluster-size distribution characteristic of D2 is more sensitive to perturbations in the underlying data
Other experiments to show the effect of cluster-size distribution are in the paper
(Figure: perturbation sensitivity for D1 vs. D2)
Summary
Conjectured that relying on the concurrence of multiple anti-virus tools to classify malware instances may bias the dataset toward easy-to-cluster instances
Our tests using plagiarism detectors on two datasets arguably leave our conjecture unresolved, but we believe highlighting this possibility is important
Examined the impact of the ground-truth cluster-size distribution on the significance of results suggesting high accuracy