
Page 1: On Challenges in  Evaluating Malware Clustering

On Challenges in Evaluating Malware Clustering

Peng Li, University of North Carolina at Chapel Hill, NC, USA
Limin Liu, State Key Lab of Information Security, Graduate School of Chinese Academy of Sciences
Debin Gao, School of Information Systems, Singapore Management University, Singapore
Mike Reiter, University of North Carolina at Chapel Hill, NC, USA

Page 2: On Challenges in  Evaluating Malware Clustering


Malware Clustering

Malware instances (executables)

How is the distance between instances defined?

Page 3: On Challenges in  Evaluating Malware Clustering


Static vs. Dynamic

Static: [Dullien et al., SSTIC 2005] [Zhang et al., ACSAC 2007] [Briones et al., VB 2008] [Griffin et al., RAID 2009]
• Challenged by packers

Dynamic: [Gheorghescu et al., VB 2005] [Rieck et al., DIMVA 2008] [Martignoni et al., RAID 2008] [Bayer et al., NDSS 2009]
• A dynamic analysis system produces traces (API calls, system calls, etc.)

Page 4: On Challenges in  Evaluating Malware Clustering


Ground-truth? Single Anti-virus Scanner

[Lee et al., EICAR 2006] [Rieck et al., DIMVA 2008] [Hu et al., CCS 2009]

Page 5: On Challenges in  Evaluating Malware Clustering


Ground-truth? Single Anti-virus Scanner

Inconsistency issue [Bailey et al., RAID 2007]

Page 6: On Challenges in  Evaluating Malware Clustering


Ground-truth? Multiple Anti-virus Scanners

[Bayer et al., NDSS 2009] [Perdisci et al., NSDI 2010] [Rieck et al., TR18-2009]

Page 7: On Challenges in  Evaluating Malware Clustering


Our Work

Proposed a conjecture: the "multiple-anti-virus-scanner voting" method of selecting ground-truth data biases results toward high accuracy

Designed experiments to test this conjecture, which produced conflicting signals

Revealed the effect of cluster-size distribution on the significance of malware clustering results

Page 8: On Challenges in  Evaluating Malware Clustering


To Test Our Conjecture

A dataset "D" generated via "multiple-anti-virus-scanner voting"

Can we always get high accuracy using a variety of clustering techniques?

Page 9: On Challenges in  Evaluating Malware Clustering


Dataset “D1”

[Bayer et al., NDSS 2009]: 2,658 malware instances, a subset of 14,212 malware instances labeled by majority vote of 6 different anti-virus programs
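A minimal sketch of the majority-voting idea behind this dataset; the normalized family labels and the 4-of-6 quorum are illustrative assumptions, not the exact rule used by [Bayer et al., NDSS 2009]:

```python
# Hedged sketch of majority voting over anti-virus family labels.
# The labels and the 4-of-6 quorum below are illustrative assumptions.
from collections import Counter

def majority_family(labels, quorum=4):
    """Return the most common family label if at least `quorum` scanners
    agree on it; otherwise None (instance excluded from the dataset)."""
    family, count = Counter(labels).most_common(1)[0]
    return family if count >= quorum else None

# Hypothetical labels for one instance from six scanners.
scans = ["allaple", "allaple", "allaple", "allaple", "rahack", "virut"]
print(majority_family(scans))  # -> "allaple"
```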

Page 10: On Challenges in  Evaluating Malware Clustering


A Variety of Techniques

MC1 (Malware Clustering #1) [Bayer et al., NDSS 2009]
• Monitors the execution of a program and creates its behavioral profile
• Abstracts system calls, their dependences, and network activities into a generalized representation consisting of OS objects and OS operations

PD1–PD3 are plagiarism detectors, which also attempt to detect some degree of similarity among a large number of software programs
• PD1: similarity (string matching) of the sequences of API calls [Tamada et al., ISFST 2004]
• PD2: Jaccard similarity of short sequences of system calls [Wang et al., ACSAC 2009] (see the sketch below)
• PD3: Jaccard similarity of short sequences of API calls
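A minimal sketch of the short-sequence (n-gram) Jaccard similarity behind PD2 and PD3, assuming traces are lists of call names and n = 3 (both illustrative choices, not the papers' exact parameters):

```python
# Hedged sketch: Jaccard similarity over n-grams of a call trace,
# as in PD2 (system calls) and PD3 (API calls). n = 3 is illustrative.

def ngrams(trace, n=3):
    """Set of length-n subsequences (n-grams) of a call trace."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}

def jaccard(trace_a, trace_b, n=3):
    """Intersection over union of the two traces' n-gram sets."""
    a, b = ngrams(trace_a, n), ngrams(trace_b, n)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical system-call traces of two malware instances.
t1 = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtClose"]
t2 = ["NtOpenFile", "NtReadFile", "NtWriteFile", "NtCreateKey"]
print(jaccard(t1, t2))  # 2 n-grams each, 1 shared -> 0.333...
```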

Page 11: On Challenges in  Evaluating Malware Clustering


Clustering on D1

Dynamic traces of D1 are fed to MC1, PD1, PD2, and PD3 to produce distance matrices; hierarchical clustering is then applied to each matrix (together with a reference distribution)
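A minimal sketch of the distance-matrix-to-clusters step, assuming SciPy; single linkage and the 0.3 distance cut are illustrative choices, not the configuration used in the paper:

```python
# Hedged sketch: hierarchical clustering on a precomputed distance matrix.
# Single linkage and the 0.3 distance cut are illustrative choices.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise distances for 4 instances, e.g. 1 - similarity
# from MC1 or one of the plagiarism detectors.
D = np.array([[0.0, 0.1, 0.8, 0.9],
              [0.1, 0.0, 0.7, 0.8],
              [0.8, 0.7, 0.0, 0.2],
              [0.9, 0.8, 0.2, 0.0]])

Z = linkage(squareform(D), method="single")  # linkage needs condensed form
labels = fcluster(Z, t=0.3, criterion="distance")
print(labels)  # e.g. [1 1 2 2]: two clusters
```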

Page 12: On Challenges in  Evaluating Malware Clustering


Precision and Recall

Let the reference clustering R = {R_1, ..., R_r} and the test clustering C = {C_1, ..., C_c} partition the same n instances.

Precision: $\mathrm{Prec}(C, R) = \frac{1}{n} \sum_{i=1}^{c} \max_{j} |C_i \cap R_j|$

Recall: $\mathrm{Recall}(C, R) = \frac{1}{n} \sum_{j=1}^{r} \max_{i} |C_i \cap R_j|$

F-measure: $\mathrm{F}(C, R) = \frac{2 \cdot \mathrm{Prec}(C, R) \cdot \mathrm{Recall}(C, R)}{\mathrm{Prec}(C, R) + \mathrm{Recall}(C, R)}$
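A minimal sketch of these definitions, encoding each clustering as a list of sets of instance IDs (an assumed representation):

```python
# Hedged sketch of the precision/recall/F-measure defined above.
# Clusterings are lists of sets of instance IDs (an assumed encoding).

def precision(C, R, n):
    return sum(max(len(c & r) for r in R) for c in C) / n

def recall(C, R, n):
    return sum(max(len(c & r) for c in C) for r in R) / n

def f_measure(C, R, n):
    p, q = precision(C, R, n), recall(C, R, n)
    return 2 * p * q / (p + q)

# Hypothetical test and reference clusterings over n = 6 instances.
C = [{1, 2, 3}, {4, 5, 6}]
R = [{1, 2}, {3, 4, 5, 6}]
print(precision(C, R, 6), recall(C, R, 6), f_measure(C, R, 6))
```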

Page 13: On Challenges in  Evaluating Malware Clustering


Results on D1

Reference   Test   Prec(C, R)   Recall(C, R)   F-measure(C, R)
AV          MC1    0.984        0.930          0.956
AV          PD1    0.965        0.922          0.943
AV          PD2    0.978        0.922          0.952
AV          PD3    0.982        0.938          0.960
MC1         PD1    0.988        0.939          0.963
MC1         PD2    0.989        0.941          0.964
MC1         PD3    0.988        0.938          0.963

Both MC1 and the PDs perform well, which supports our conjecture

Is this the case for any collection of malware to be analyzed?

Page 14: On Challenges in  Evaluating Malware Clustering


Dataset D2 and Ground-truth

Samples randomly chosen from a VXH selection (5,121 instances) → dynamic analysis system → traces for MC1, PD1, and PD3

A more conservative ground truth: 1,114 instances

Page 15: On Challenges in  Evaluating Malware Clustering


Results on D2

Reference   Test   Prec(C, R)   Recall(C, R)   F-measure(C, R)
AV          MC1    0.604        0.659          0.630
AV          PD1    0.704        0.536          0.609
AV          PD3    0.788        0.502          0.613
MC1         PD1    0.770        0.798          0.784
MC1         PD3    0.790        0.826          0.808

Both perform more poorly on D2 than they did on D1

This does not support our conjecture

Page 16: On Challenges in  Evaluating Malware Clustering


Differences Between D1 and D2

[Figure: CDFs of reference cluster sizes for datasets D1 and D2]

Dataset D1 is highly biased: two large clusters comprise 48.5% and 27% of the malware instances, respectively, and each remaining cluster contains at most 6.7%

In dataset D2, the largest cluster comprises only 14% of the instances

Other investigations (the length of API call sequences, detailed behaviors, etc.) are in the paper

Page 17: On Challenges in  Evaluating Malware Clustering


The Significance of the Precision and Recall

Case one: biased ground truth

[Figure: two different test clusterings of the same instances; each achieves Prec = 7/8 and Recall = 7/8]

Page 18: On Challenges in  Evaluating Malware Clustering


The Significance of the Precision and Recall

Case two: unbiased ground truth

[Figure: two different test clusterings of the same instances; each achieves Prec = 4/8 and Recall = 4/8]

$\binom{4}{2}\binom{4}{2} \big/ \binom{8}{4} = 36/70$

It is considerably "harder" to produce a clustering yielding good precision and recall in the latter case; good precision and recall are thus much more significant in the latter case than in the former.
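A quick check of the counting above using the standard library; reading 36/70 as the fraction of random balanced splits that draw exactly two instances from each reference cluster is our interpretation of the slide:

```python
# With two reference clusters of 4 instances each, C(4,2)*C(4,2)/C(8,4)
# counts the random 4+4 test splits that take exactly two instances from
# each reference cluster (yielding Prec = Recall = 4/8).
import math

favorable = math.comb(4, 2) * math.comb(4, 2)  # 36
total = math.comb(8, 4)                        # 70
print(favorable, total, favorable / total)     # 36 70 0.514...
```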

Page 19: On Challenges in  Evaluating Malware Clustering


Perturbation Test

[Figure: perturbation-test results for MC1 on D1 and MC1 on D2]
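A generic sketch of a perturbation test in the spirit of this slide: randomly move a fraction of instances between reference clusters and score the result against the unperturbed clustering. The move-based scheme and its parameters are assumptions; the paper describes the exact procedure:

```python
# Hedged sketch of a perturbation test: randomly move a fraction of
# instances between clusters, then score the perturbed clustering against
# the original. The random-move scheme is an assumption, not the paper's
# exact procedure.
import random

def precision(C, R, n):
    return sum(max(len(c & r) for r in R) for c in C) / n

def perturb(clusters, fraction, rng):
    """Move `fraction` of all instances to a randomly chosen other cluster."""
    clusters = [set(c) for c in clusters]
    members = [(i, x) for i, c in enumerate(clusters) for x in c]
    for i, x in rng.sample(members, int(fraction * len(members))):
        j = rng.choice([k for k in range(len(clusters)) if k != i])
        clusters[i].discard(x)
        clusters[j].add(x)
    return [c for c in clusters if c]

rng = random.Random(0)
R = [set(range(0, 50)), set(range(50, 100))]  # hypothetical reference
for frac in (0.05, 0.2, 0.5):
    C = perturb(R, frac, rng)
    print(frac, precision(C, R, 100))  # precision drops as frac grows
```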

Page 20: On Challenges in  Evaluating Malware Clustering


Results of Perturbation Test

Precision and recall are more sensitive to perturbations of the underlying data under the cluster-size distribution characteristic of D2

Other experiments to show the effect of cluster-size distribution are in the paper


Page 21: On Challenges in  Evaluating Malware Clustering


Summary

Conjectured that utilizing the concurrence of multiple anti-virus tools in classifying malware instances may bias the dataset towards easy-to-cluster instances

Our tests using plagiarism detectors on two datasets arguably leave our conjecture unresolved, but we believe highlighting this possibility is important

Examined the impact of the ground-truth cluster-size distribution on the significance of results suggesting high accuracy