On Challenges in Evaluating Malware Clustering
Peng Li (University of North Carolina at Chapel Hill, NC, USA) · Limin Liu (State Key Lab of Information Security, Graduate School of Chinese Academy of Sciences) · Debin Gao (School of Information Systems, Singapore Management University, Singapore) · Mike Reiter (University of North Carolina at Chapel Hill, NC, USA)
UNC.edu 2
Malware Clustering
Given a collection of malware instances (executables), how is the distance between them defined?
Static vs. Dynamic
Static: [Dullien et al., SSTIC 2005] [Zhang et al., ACSAC 2007] [Briones et al., VB 2008] [Griffin et al., RAID 2009]; challenged by packers
Dynamic: run instances in a dynamic analysis system and cluster the resulting traces (API calls, system calls, etc.): [Gheorghescu et al., VB 2005] [Rieck et al., DIMVA 2008] [Martignoni et al., RAID 2008] [Bayer et al., NDSS 2009]
Ground truth? Single anti-virus scanner
[Lee et al., EICAR 2006] [Rieck et al., DIMVA 2008] [Hu et al., CCS 2009]
Ground truth? Single anti-virus scanner
Inconsistency issue [Bailey et al., RAID 2007]
Ground truth? Multiple anti-virus scanners
[Bayer et al., NDSS 2009] [Perdisci et al., NSDI 2010] [Rieck et al., TR18-2009]
Our Work
Proposed a conjecture: the "multiple-anti-virus-scanner voting" method of selecting ground-truth data biases clustering results toward high accuracy
Designed experiments to test this conjecture; the results give conflicting signals
Revealed the effect of the cluster-size distribution on the significance of malware clustering results
To Test Our Conjecture
Given a dataset "D" generated via multiple-anti-virus-scanner voting: can we always get high accuracy, using a variety of techniques to do the clustering?
Dataset “D1”
2,658 malware instances [Bayer et al., NDSS 2009], a subset of 14,212 malware instances, selected by majority voting among 6 different anti-virus programs
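As an illustration only (not code from the paper), the majority-voting selection step can be sketched as follows; the `majority_label` helper and the quorum of 4 out of 6 scanners are assumptions:

```python
from collections import Counter

def majority_label(labels, quorum=4):
    """Keep an instance only if at least `quorum` scanners agree on its
    family label; otherwise drop it from the ground-truth dataset."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= quorum else None

# 4 of 6 scanners agree, so the instance is kept with label "allaple"
print(majority_label(["allaple", "allaple", "allaple", "allaple", "virut", "?"]))
```

Instances on which the scanners disagree are discarded, which is exactly why the resulting dataset may be biased toward easy cases.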
A Variety of Techniques
MC1 (Malware Clustering #1) [Bayer et al., NDSS 2009]: monitors the execution of a program and creates its behavioral profile, abstracting system calls, their dependences, and network activities into a generalized representation consisting of OS objects and OS operations
PD1–PD3 are plagiarism detectors, which also attempt to detect some degree of similarity among a large number of software programs
PD1: similarity (string matching) of the sequences of API calls [Tamada et al., ISFST 2004]
PD2: Jaccard similarity of short sequences of system calls [Wang et al., ACSAC 2009]
PD3: Jaccard similarity of short sequences of API calls
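A minimal sketch of the Jaccard-over-short-sequences distance that PD2/PD3 use; the helper names and the window length n=4 are my assumptions, not the paper's exact parameters:

```python
def ngrams(seq, n=4):
    """Set of contiguous length-n subsequences (n-grams) of a call trace."""
    return {tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)}

def jaccard_distance(trace_a, trace_b, n=4):
    """1 - Jaccard similarity of the n-gram sets of two call traces."""
    a, b = ngrams(trace_a, n), ngrams(trace_b, n)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# identical traces are at distance 0; traces sharing no n-gram at distance 1
print(jaccard_distance(list("ABCDEF"), list("ABCDEF")))  # 0.0
```

The pairwise distances over all instances form the distance matrix handed to clustering.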
Clustering on D1
(Pipeline figure) Dynamic traces of D1 are fed to MC1, PD1, PD2, and PD3; each produces a distance matrix, which is passed to hierarchical clustering, with a reference distribution used for comparison
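The hierarchical-clustering step can be sketched as single-linkage clustering at a fixed distance cutoff, which reduces to finding connected components of the "close pairs" graph; this is a simplification of the pipeline above, and the function names are mine:

```python
def single_linkage(dist, threshold):
    """Cluster items 0..n-1 given a symmetric distance matrix `dist`:
    single-linkage at a fixed cutoff equals the connected components of
    the graph with an edge wherever dist[i][j] <= threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

For example, a 4-instance matrix with two tight pairs yields two clusters at a cutoff between the within-pair and across-pair distances.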
Precision and Recall
Let the test clustering C have clusters C_1, …, C_c and the reference clustering R have clusters R_1, …, R_r, over n instances in total.

Precision: Prec(C, R) = (1/n) · Σ_{i=1}^{c} max_j |C_i ∩ R_j|
Recall: Recall(C, R) = (1/n) · Σ_{j=1}^{r} max_i |C_i ∩ R_j|
F-measure: F(C, R) = 2 · Prec(C, R) · Recall(C, R) / (Prec(C, R) + Recall(C, R))
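The precision/recall/F-measure definitions translate directly to code; here a clustering is a list of sets of instance ids (a sketch, not the authors' implementation):

```python
def precision(test, ref):
    """Prec(C, R): each test cluster is credited with its largest
    overlap with any single reference cluster, normalized by n."""
    n = sum(len(c) for c in test)
    return sum(max(len(c & r) for r in ref) for c in test) / n

def recall(test, ref):
    """Recall(C, R): each reference cluster is credited with its largest
    overlap with any single test cluster, normalized by n."""
    n = sum(len(r) for r in ref)
    return sum(max(len(c & r) for c in test) for r in ref) / n

def f_measure(test, ref):
    """Harmonic mean of precision and recall."""
    p, r = precision(test, ref), recall(test, ref)
    return 2 * p * r / (p + r)

# a test clustering identical to the reference scores a perfect 1.0
ref = [{1, 2, 3}, {4, 5}]
print(f_measure([{1, 2, 3}, {4, 5}], ref))  # 1.0
```

Note the asymmetry: lumping everything into one test cluster gives recall 1.0 but precision equal to the largest reference cluster's share of the data.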
Results on D1
Against the AV reference clustering:
MC1: Prec 0.984, Recall 0.930, F-measure 0.956
PD1: Prec 0.965, Recall 0.922, F-measure 0.943
PD2: Prec 0.978, Recall 0.922, F-measure 0.952
PD3: Prec 0.982, Recall 0.938, F-measure 0.960

With MC1 as the reference clustering:
PD1: Prec 0.988, Recall 0.939, F-measure 0.963
PD2: Prec 0.989, Recall 0.941, F-measure 0.964
PD3: Prec 0.988, Recall 0.938, F-measure 0.963

Both MC1 and the PDs perform well, which supports our conjecture
Is this the case for any collection of malware to be analyzed?
Dataset D2 and Ground-truth
1,114 instances, randomly sampled from a VXH selection of 5,121 and run through the dynamic analysis system
Ground truth selected by a more conservative criterion; clustered with MC1, PD1, and PD3
Results on D2
Against the AV reference clustering:
MC1: Prec 0.604, Recall 0.659, F-measure 0.630
PD1: Prec 0.704, Recall 0.536, F-measure 0.609
PD3: Prec 0.788, Recall 0.502, F-measure 0.613

With MC1 as the reference clustering:
PD1: Prec 0.770, Recall 0.798, F-measure 0.784
PD3: Prec 0.790, Recall 0.826, F-measure 0.808

Both perform more poorly on D2 than they did on D1
This does not support our conjecture
Differences Between D1 and D2
(Figure: CDF of reference cluster sizes for datasets D1 and D2) Dataset D1 is highly biased, with two large clusters comprising 48.5% and 27% of the malware instances, respectively, and the remaining clusters each of size at most 6.7%. For dataset D2, the largest cluster comprises only 14% of the instances. Other investigations (length of API call sequences, detailed behaviors, etc.) are in the paper.
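One quick way to see why a biased cluster-size distribution inflates accuracy numbers (an illustration of mine, not an experiment from the paper): the degenerate test clustering that lumps every instance into one cluster always has recall 1.0 and precision equal to the largest reference cluster's fraction of the data, so its F-measure is already sizable on a D1-like distribution (largest cluster 48.5%) but not on a D2-like one (largest cluster 14%):

```python
def trivial_baseline_f(largest_fraction):
    """F-measure of the one-big-cluster test clustering: recall is 1.0,
    precision is the largest reference cluster's share of the instances."""
    p, r = largest_fraction, 1.0
    return 2 * p * r / (p + r)

print(round(trivial_baseline_f(0.485), 3))  # 0.653 for a D1-like bias
print(round(trivial_baseline_f(0.14), 3))   # 0.246 for a D2-like distribution
```

A fixed F-measure is therefore far less impressive over D1's distribution than over D2's.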
The Significance of the Precision and Recall
Case one: biased ground truth
(Figure) Two different test clusterings of the same eight instances each achieve Prec = 7/8, Recall = 7/8
The Significance of the Precision and Recall
Case two: unbiased ground truth
(Figure) Two different test clusterings each achieve only Prec = 4/8, Recall = 4/8
With eight instances in two reference clusters of four, a random 4/4 test split takes exactly two instances from each reference cluster in C(4,2) · C(4,2) = 36 of the C(8,4) = 70 possible ways
Considerably “harder” to produce a clustering yielding good precision and recall in the latter case
A good precision and recall in the latter case is thus much more significant than in the former.
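The count behind the unbiased case can be checked directly with `math.comb`: of the C(8,4) ways to draw a 4-instance test cluster, C(4,2) · C(4,2) take two instances from each reference cluster of four.

```python
from math import comb

# eight instances, two reference clusters of four each
ways_even_split = comb(4, 2) * comb(4, 2)  # splits taking 2 from each cluster
total_splits = comb(8, 4)                  # all possible 4/4 test splits
print(ways_even_split, total_splits)  # 36 70
```

So a 36/70 majority of random splits lands on the worst-case Prec = Recall = 4/8, making a good score over an unbiased reference genuinely informative.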
Perturbation Test
(Figure: perturbation results for MC1 on D1 vs. MC1 on D2)
Results of Perturbation Test
The cluster-size distribution characteristic of D2 is more sensitive to perturbations in the underlying data
Other experiments to show the effect of cluster-size distribution are in the paper
(Figure: perturbation sensitivity for D1 vs. D2)
Summary
Conjectured that relying on the concurrence of multiple anti-virus tools to classify malware instances may bias the dataset toward easy-to-cluster instances
Our tests using plagiarism detectors on two datasets arguably leave our conjecture unresolved, but we believe highlighting this possibility is important
Examined the impact of the ground-truth cluster-size distribution on the significance of results suggesting high accuracy