
2011 IEEE 11th International Conference on Data Mining (ICDM), Vancouver, BC, Canada (2011.12.11-2011.12.14)

Modeling High-Level Behavior Patterns for Precise Similarity Analysis of Software

Taeho Kwon Zhendong Su

Department of Computer Science
University of California, Davis

{kwon,su}@cs.ucdavis.edu

Abstract: The analysis of software similarity has many applications such as detecting code clones, software plagiarism, code theft, and polymorphic malware. Because source code is often unavailable and code obfuscation is used to avoid detection, there has been much research on developing effective models to capture runtime behavior to aid detection. Existing models focus on low-level information such as dependencies or the mere occurrence of function calls, and suffer from poor precision, poor scalability, or both. To overcome the limitations of existing models, this paper introduces a precise and succinct behavior representation that characterizes high-level object-accessing patterns as regular expressions. We first distill a set of high-level patterns (the alphabet Σ of the regular language) based on two pieces of information: function call patterns to access objects and typestate information of the objects. Then we abstract a runtime trace of a program P into a regular expression e over the pattern alphabet Σ to produce P's behavior signature. We show that software instances derived from the same code exhibit similar behavior signatures and develop effective algorithms to cluster and match behavior signatures. To evaluate the effectiveness of our behavior model, we have applied it to the similarity analysis of polymorphic malware. Our results on a large malware collection demonstrate that our model is both precise and succinct for effective and scalable matching and detection of polymorphic malware.

Keywords: sequence clustering; software behavior model; malware analysis and clustering

I. INTRODUCTION

Detecting similar software is an important problem with many software engineering and security applications, ranging from detecting code clones, plagiarism and code theft, to the analysis of polymorphic malware. To detect similar software based on runtime behavior, the first and the most important step is to design precise models of execution behavior. Previously proposed models [1]–[5], [7], [11], [15], [19] focus on low-level function call and dependency information to approximate a runtime trace. Examples include the use of control and data dependencies among function calls [19], call graphs [7], and occurrence counts of call patterns [1], [2], [15]. Similarity analysis based on these models has either limited scalability or precision, or both: limited scalability because they capture detailed, low-level dependencies and do not capture high-level patterns; limited precision because they do not model temporal dependency and repetition of function calls.

Recognizing the aforementioned limitations of existing work, we present a novel behavior model that is both precise and succinct: we characterize high-level semantic-aware behavior patterns to reduce model complexity (for scalability) and capture the patterns' temporal dependency and repetition (for precision). At a high level, we model the behavior signature of a program as a regular expression e: 1) the behavior patterns form the alphabet Σ; and 2) the concatenation and Kleene star (*) operators naturally and succinctly model temporal dependency and repetition of behavior patterns. We also use a special meta-symbol '|' in Σ to delimit thread boundaries. For example, the behavior signature of the malware Mytob.af is given as the regular expression |UH|VB*T|P*(|T)*|DT|B(|T)* (we will explain the semantics of the letters in Section II).

Besides being precise and succinct, our proposed model has additional benefits for software similarity analysis. First, software derived from the same code is likely to have similar behavior signatures. Second, our model reduces the problem of similarity analysis of program behavior to similarity analysis of regular strings, which is more amenable to effective algorithms. We present such an algorithm based on a string similarity metric utilizing prediction probability-based similarity [21] and the Jaccard index [8]. The algorithm can be used to cluster and match software exhibiting similar behavior. It can also be used to generate behavior signatures.

We have implemented our technique and evaluated its effectiveness in the context of similarity analysis of polymorphic malware. We have evaluated our model's abstraction capability and the effectiveness of our clustering algorithm for malware clustering and signature generation/matching. Results on analyzing a large malware collection are promising: 1) our model precisely characterizes malware behavior with small regular expressions; and 2) our clustering algorithm produces precise clusters and behavior signatures.

To summarize, we make the following main contributions:

• We introduce a precise and succinct software behavior model that extracts high-level, semantic-aware behavior patterns and represents their temporal and dependency relationship using concise regular expressions.

• We introduce an effective clustering algorithm based on our model and describe how it can be applied for generating behavior signatures and matching similar software.

• We show empirically that our technique is effective for software similarity analysis by applying it to the analysis of behavior similarity of polymorphic malware.

The remainder of this paper is structured as follows. Section II presents our regular expression-based behavior model. We present an algorithm for behavior-based software clustering and its extension to behavior signature generation/matching of software in Section III. Section IV describes our implementation and evaluation results on behavior similarity analysis of polymorphic malware. Finally, we survey related work (Section V) and conclude with a discussion of future work (Section VI).

II. MODEL FORMULATION

This section illustrates the key idea behind our regular expression-based behavior signature model. Our technical report [12] provides a running example and more details.

2011 11th IEEE International Conference on Data Mining

1550-4786/11 $26.00 © 2011 IEEE

DOI 10.1109/ICDM.2011.104


Object         | a Type                   | b Type                                | c Type
File           | NtCreateFile, NtOpenFile | NtReadFile, NtWriteFile, NtDeleteFile | NtClose
Mutex          | NtCreateMutant           |                                       | NtClose
Registry Key   | NtCreateKey, NtOpenKey   | NtSetValueKey, NtDeleteValueKey       | NtClose
Network Device | NtCreateFile             | NtDeviceIoControlFile                 | NtClose

Table I. EXAMPLES OF API-ALPHABET MAPPING.

     1 FILE *pFile;
     2 char mystring[100];
     3 pFile = fopen("newfile.txt", "w");
     4 fputs("create a new file", pFile);
     5 fclose(pFile);
     6 ...
     7 pFile = fopen("newfile.txt", "r");
     8 // pFile = fopen("existingfile.txt", "r");
     9 fgets(mystring, 100, pFile);
    10 fclose(pFile);

Figure 1. Sample C code to read a created file.

Extracting Behavior Patterns Dynamic techniques based on analyzing runtime traces have been well studied in modeling software behavior. We follow this general approach, but instead of working directly at the level of low-level API calls, we extract API usage patterns to work at a higher level of abstraction. Because how objects are created, modified, and accessed is central to the understanding of runtime behavior, we focus on analyzing object-accessing behavior. In particular, we analyze resource ownership patterns captured by the regular expression (ab*c?), where a, b, and c represent API calls for object acquisition, usage, and release, respectively. Instances of such patterns provide a higher-level understanding of program behavior. For example, the pattern "(CreateFile WriteFile* CloseFile)", where all APIs are invoked on a particular file object, can be mapped to the higher-level behavior file-writing. Table I shows examples of the API-alphabet mapping we use to specify the behavior of accessing files, registry keys, mutexes, and network devices in Microsoft Windows.
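The (ab*c?) extraction can be illustrated with a minimal Python sketch. It assumes a hypothetical trace format of (API name, object handle) pairs, which the paper does not specify; only the role mapping for file objects is taken from Table I.

```python
import re
from collections import defaultdict

# Role of each API for file objects, following Table I:
# a = acquisition, b = usage, c = release.
ROLE = {
    "NtCreateFile": "a", "NtOpenFile": "a",
    "NtReadFile": "b", "NtWriteFile": "b", "NtDeleteFile": "b",
    "NtClose": "c",
}
OWNERSHIP = re.compile(r"ab*c?")  # the resource ownership pattern

def ownership_patterns(trace):
    """Group calls by object handle and return the (ab*c?) instances per handle.

    `trace` is an iterable of (api_name, handle) pairs (assumed format).
    """
    roles = defaultdict(str)
    for api, handle in trace:
        if api in ROLE:
            roles[handle] += ROLE[api]
    return {handle: OWNERSHIP.findall(s) for handle, s in roles.items()}

trace = [
    ("NtCreateFile", 1), ("NtWriteFile", 1), ("NtWriteFile", 1),  # never closed
    ("NtOpenFile", 2), ("NtReadFile", 2), ("NtClose", 2),
]
patterns = ownership_patterns(trace)
print(patterns)
```

Handle 1 is written twice and never closed, so it yields the instance abb; because the release call is optional in the pattern, such incomplete traces are still captured.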

The (ab*c?) pattern is quite flexible and can capture certain important patterns for imperfect (i.e., incomplete or buggy) traces. Suppose that a program writes a log file periodically without closing the file opened earlier. This behavior is described by the regular expression ab+, which is captured by (ab*c?).

Adding Typestate Information The same API usage pattern may operate on objects in different states at different points of program execution. Figure 1 displays a sample C code snippet that creates a file newfile.txt and then reads from it. We can extract from lines 3–5 and 7–10 the patterns to create a file and to read from a file, respectively. Because the file is created first and then read from, we know that the file-read pattern reads from a file created by the program. If we comment out line 7 and uncomment line 8, the file-read pattern reads from an existing file. Although the two file-read patterns are both about reading a file, the accessed objects have different states. Based on the above observation, we introduce semantic-aware behavior to specify not only the behavior type, such as reading a file, but also additional semantic information of the associated object, such as reading a newly created or existing file.

To accomplish this, we record the access history of each object using a special finite automaton for each type of object. Each state and each transition in such an automaton correspond to the object's typestate and its access type, respectively. The object's typestate is determined by its access history and describes specific state information of the object (e.g., created or existing file). Figure 2 shows the automata that we use for file, registry key, and mutex objects.

Figure 2. Example automata for analyzing object access history. Panels: (a) File object; (b) Registry key object; (c) Mutex object.

Based on the automaton for each object, we determine semantic-aware behaviors by combining the object access type and the current typestate in the corresponding automaton. In particular, when accessing a particular object, the current state of the automaton for the object is retrieved to obtain the current typestate of the accessed object, and the state of the automaton is changed based on the object access type. For the initial access to an object, the object's typestate is updated accordingly from its initial state w.r.t. the given object access pattern. Table II shows semantic-aware behaviors corresponding to object-accessing behaviors specified by the API-alphabet mapping in Table I. In addition to the typestates in Figure 2, we consider two special typestates, SELF and CONST. The SELF state corresponds to the file object for the running program. For example, when a program reads its own binary executable, the semantic-aware behavior is a combination of FILE_READ and SELF. The CONST state represents a constant value that is not related to access to a system resource (e.g., a Boolean or constant value). Although our current setting only considers the semantic-aware behaviors in Table II, they can be extended by modifying the API-alphabet mapping.
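As a rough illustration of combining an access type with the current typestate, the sketch below tracks file objects with a tiny transition table. The transitions are hypothetical (the automata of Figure 2 are not reproduced here), as are the `FileTracker` class and the caller-supplied initial typestate; only the letters are taken from the file rows of Table II.

```python
# Hypothetical transitions: (current typestate, access type) -> next typestate.
# Pairs that are not listed leave the typestate unchanged.
FILE_TRANSITIONS = {
    ("SYSFILE", "FILE_WRITE"): "MODIFIED SYSFILE",
}

# Letters for (access type, typestate) pairs, from the file rows of Table II.
ALPHABET = {
    ("FILE_READ", "SYSFILE"): "B",
    ("FILE_READ", "MODIFIED SYSFILE"): "C",
    ("FILE_READ", "USERFILE"): "D",
    ("FILE_WRITE", "SYSFILE"): "F",
    ("FILE_WRITE", "MODIFIED SYSFILE"): "G",
    ("FILE_WRITE", "USERFILE"): "H",
}

class FileTracker:
    """Keeps one typestate per file object (illustrative helper, not the authors' code)."""

    def __init__(self):
        self.state = {}  # object name -> current typestate

    def access(self, name, access_type, initial="USERFILE"):
        """Return the letter for this access, then advance the automaton."""
        current = self.state.setdefault(name, initial)
        letter = ALPHABET[(access_type, current)]
        self.state[name] = FILE_TRANSITIONS.get((current, access_type), current)
        return letter

tracker = FileTracker()
letters = [
    tracker.access("sys.dll", "FILE_WRITE", initial="SYSFILE"),  # write to a system file
    tracker.access("sys.dll", "FILE_READ"),                      # read the now-modified file
    tracker.access("notes.txt", "FILE_READ"),                    # read a user file
]
print("".join(letters))
```

The same FILE_READ access maps to different letters depending on the object's history, which is the point of adding typestate information.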

Dealing with Threads As many applications are multi-threaded, it is important to consider threads; otherwise, the extracted behavior information may be polluted by thread interleavings. In addition, parent/child dependencies among threads can provide useful behavior information. Due to these considerations, we separate the runtime trace into per-thread sub-traces. We also extract parent/child dependencies among threads. In order to represent the complete trace as a single regular expression, we force a total order on the threads such that a parent thread will be placed before its children, and a thread created by the same parent will be placed before all its siblings that are created at later times (e.g., a pre-order traversal satisfies these requirements). Based on such a total ordering of the threads, we analyze semantic-aware patterns (i.e., API usage patterns with object typestate information).
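The total order on threads can be sketched as a pre-order traversal of the thread creation tree. The thread identifiers, the `children` map (each parent's children listed in creation order), and the per-thread letter strings below are illustrative assumptions.

```python
def thread_order(root, children):
    """Pre-order traversal: a parent precedes its children, and earlier-created
    siblings precede later-created ones.

    `children` maps a thread id to its child ids in creation order (assumed input).
    """
    order, stack = [], [root]
    while stack:
        tid = stack.pop()
        order.append(tid)
        # Push children in reverse so the earliest-created child is popped first.
        stack.extend(reversed(children.get(tid, [])))
    return order

def join_threads(order, subtraces):
    """Concatenate per-thread behavior letters, prefixing each thread with '|'."""
    return "".join("|" + subtraces[tid] for tid in order)

children = {0: [1, 2], 1: [3]}                 # thread 0 spawns 1 then 2; thread 1 spawns 3
subtraces = {0: "UH", 1: "VB", 2: "P", 3: "T"}  # hypothetical per-thread letters
order = thread_order(0, children)
behavior_string = join_threads(order, subtraces)
print(order, behavior_string)
```

Because the order is derived from the creation tree rather than from timestamps of individual calls, interleavings between threads do not affect the resulting string.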

Object Access Type | Object Typestate | Mapping Alphabet
FILE_READ          | SELF, SYSFILE    | A, B
                   | MODIFIED SYSFILE | C
                   | USERFILE         | D
FILE_WRITE         | SELF, SYSFILE    | E, F
                   | MODIFIED SYSFILE | G
                   | USERFILE         | H
FILE_DEL           | SELF, SYSFILE    | I, J
                   | MODIFIED SYSFILE | K
                   | USERFILE         | L
REG_SET            | SELF, SYSFILE    | M, N
                   | MODIFIED SYSFILE | O
                   | USERFILE, CONST  | P, Q
REG_DEL            | SYS REG KEY      | R
                   | USER REG KEY     | S
NETWORK_ACCESS     |                  | T
MUTEX_CREATE       | USER, EXISTING USER | U, V

Table II. EXAMPLES OF SEMANTICS-AWARE BEHAVIORS AND THE CORRESPONDING MAPPING ALPHABET.

Generating Behavior Signatures Once we have a sequence of semantic-aware behavior patterns, we find repetitions in the sequence to reduce the size of the model and to characterize behavior repetition. For example, an application can create multiple threads that perform the same task. In this case, repeated thread creation is one high-level behavior pattern. To search for behavior repetitions in the sequence, we transform the behavior sequence into a special string such that a letter in the string represents a semantic-aware behavior, and a special character '|' delimits the substrings corresponding to different threads' semantic behavior. We then use a string algorithm [17] to find repeated substrings. We use the Kleene star (*) to represent such repetitions. This process transforms a sequence into a regular expression e, which denotes the program's behavior signature. This model effectively reduces various software behavior analyses to sequence mining problems.
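The repetition step can be approximated with a short sketch. It does not use the string algorithm of [17]; instead it swaps in a simple greedy stand-in: a backreference regex collapses adjacent repeats inside each thread, and `itertools.groupby` collapses runs of identical consecutive threads. The input string is invented for illustration and is not a real trace.

```python
import re
from itertools import groupby

TANDEM = re.compile(r"(.+?)\1+")  # a unit followed by one or more copies of itself

def star(unit):
    """Render a repeated unit with a Kleene star, parenthesizing multi-letter units."""
    return unit + "*" if len(unit) == 1 else "(" + unit + ")*"

def behavior_signature(behavior_string):
    """Collapse adjacent repetitions in a '|'-delimited behavior string.

    Greedy simplification: it only detects immediately adjacent repeats and
    runs of identical single threads, not repeating groups of several threads.
    """
    threads = behavior_string.split("|")[1:]  # the leading '|' yields an empty first field
    collapsed = [TANDEM.sub(lambda m: star(m.group(1)), t) for t in threads]
    parts = []
    for segment, run in groupby(collapsed):
        count = len(list(run))
        parts.append("|" + segment if count == 1 else "(|" + segment + ")*")
    return "".join(parts)

# Hypothetical uncollapsed behavior string, chosen so that the result has the
# same shape as the Mytob.af signature quoted in the introduction.
signature = behavior_signature("|UH|VBBBT|PPP|T|T|T|DT|B|T|T")
print(signature)
```

A run of a single thread is left as `|T`, while two or more identical consecutive threads become `(|T)*`, mirroring how repeated thread creation is treated as one high-level pattern.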

III. SOFTWARE SIMILARITY ANALYSIS

Our string-based model has the following important property for software behavior similarity analysis: programs with similar behavior patterns are modeled by similar strings. Based on this observation, we investigate the effectiveness of our behavior model in analyzing software behavior similarity. This section defines a similarity metric between a string and a cluster, and describes algorithms for clustering/signature generation based on the defined metric. Throughout this section, we use behavior string to refer to an instance of our regular expression-based behavior model.

A. String Similarity Metric

To define a metric for string similarity, we utilize a novel sequence similarity metric proposed by Yang and Wang [21]. The metric is based on $P_S(\sigma)$ and $P^r(\sigma)$:

• $P_S(\sigma)$, the prediction probability of $\sigma$ based on $S$, is the probability that a string $\sigma$ is predicted under a conditional probability distribution modeling a cluster $S$. For a string $\sigma = s_1 s_2 \ldots s_l$, $P_S(\sigma) = P_S(s_1) \times P_S(s_2 \mid s_1) \times \cdots \times P_S(s_l \mid s_1 s_2 \ldots s_{l-1})$, where $P_S(s_i \mid s_1 \ldots s_{i-1})$ is the conditional probability that the symbol $s_i$ follows the substring $s_1 s_2 \ldots s_{i-1}$ in the cluster $S$.

• $P^r(\sigma)$ is the probability that a string $\sigma$ is obtained by random selection. For a string $\sigma$, $P^r(\sigma) = \prod_{i=1}^{l} p(s_i)$, where $p(s_i)$ is the probability that symbol $s_i$ is observed in the set of strings to be clustered.

The similarity metric

$$SIM_S(\sigma) = \max_{1 \le j \le i \le l} \frac{P_S(s_j \ldots s_i)}{P^r(s_j \ldots s_i)}$$

captures the maximum similarity between any substring of $\sigma$ and the cluster $S$.

Although this metric is suitable for other sequence data, it is not appropriate in our setting. Compared with other sequence data, a behavior string is typically much shorter, which can introduce noise into the estimation of the prediction probability. To address this issue, we utilize two effective string similarity measures, the Max similarity ratio and the Jaccard index [8]. The Max similarity ratio is defined as $|\sigma'|^2 / (|\sigma| \times |S'|)$, where $\sigma$, $\sigma'$, and $S'$ represent an input string, its substring with the maximum similarity, and the seed behavior string of the cluster $S$, respectively. This ratio represents how large a portion of $\sigma$ and $S'$ is matched with high similarity. The Jaccard index, $J(A, B) = |A \cap B| / |A \cup B|$, is a metric for similarity between two sets [8]. We compute the Jaccard index by treating a string as a multiset. Based on $SIM_S(\sigma)$ and these two string similarity measures, we define the similarity metric we use: $SIM'_S(\sigma) = SIM_S(\sigma) \times \mathit{MaxSimilarityRatio}(S.seed, \sigma) \times J(S.seed, \sigma)$, where $S.seed$ is the initial string of $S$. This similarity metric not only retains the advantages of the original $SIM_S(\sigma)$, but also reduces the impact of noise on the prediction probability estimation by considering other string similarity metrics with the seed of $S$. Although other similarity metrics combining these measures are possible, our results in Section IV indicate that $SIM'_S(\sigma)$ is very effective in practice.
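The two auxiliary measures and their combination can be sketched as follows. The function names are illustrative; the value of $SIM_S(\sigma)$ and the maximally similar substring $\sigma'$ are taken as inputs here rather than computed, and counting the thread delimiter '|' as a multiset element is an assumption.

```python
from collections import Counter

def multiset_jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| with both strings treated as multisets of symbols."""
    ca, cb = Counter(a), Counter(b)
    union = sum((ca | cb).values())       # element-wise maximum of counts
    intersection = sum((ca & cb).values())  # element-wise minimum of counts
    return intersection / union if union else 1.0

def max_similarity_ratio(seed, sigma, best_substring):
    """|sigma'|^2 / (|sigma| * |S'|), where sigma' is the maximally similar substring."""
    return len(best_substring) ** 2 / (len(sigma) * len(seed))

def combined_similarity(sim_s, seed, sigma, best_substring):
    """SIM'_S(sigma) = SIM_S(sigma) * MaxSimilarityRatio * Jaccard."""
    return (sim_s
            * max_similarity_ratio(seed, sigma, best_substring)
            * multiset_jaccard(seed, sigma))

seed = "|HUH|VPBT"
sigma = "|HUH|VPBT|T"
jaccard = multiset_jaccard(seed, sigma)                   # 9 shared symbols out of 11
ratio = max_similarity_ratio(seed, sigma, "|HUH|VPBT")    # 81 / (11 * 9)
score = combined_similarity(2.0, seed, sigma, "|HUH|VPBT")  # 2.0 is a made-up SIM_S value
print(jaccard, ratio, score)
```

Both factors lie in [0, 1] when $\sigma'$ is a substring of $\sigma$ no longer than the seed, so they can only scale the probabilistic score down for strings that differ from the seed.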

B. Behavior-based Software Clustering

We present an agglomerative hierarchical clustering algorithm [9] for behavior strings. Our algorithm is designed to show good performance when there exist many identical or similar sequences in the string database. This requirement is appropriate in our problem setting because the behavior strings of software derived from the same origin tend to be identical or similar.

Our clustering algorithm models each cluster as a triple 〈seed, members, CPD〉, where seed is a seed string, members denotes member strings, and CPD is the conditional probability distribution obtained from members. Based on this model, we define a Merge Index that represents the validity of merging a cluster X into a cluster Y w.r.t. our similarity measure:

$$MergeIndex(X, Y) = \min_{\sigma \in X.members} SIM'_Y(\sigma).$$

This index is used to select a pair of clusters to merge in our clustering algorithm.

Our algorithm is composed of two phases: Cluster Initialization and Cluster Consolidation. For more details, refer to our technical report [12].

Cluster Initialization In the initialization phase, we group identical strings in the string database and initialize a cluster for each group. For each cluster, we assign the group's string as its seed, set its members to the group, and compute its CPD based on members.
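The initialization phase can be sketched as below. The `Cluster` dataclass mirrors the 〈seed, members, CPD〉 triple; `estimate_cpd` is a deliberately simplified first-order stand-in (probability of the next symbol given only the previous one) and is an assumption, not the full-history distribution the metric of [21] relies on.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class Cluster:
    seed: str
    members: list
    cpd: dict = field(default_factory=dict)

def estimate_cpd(members):
    """First-order approximation: P(next symbol | previous symbol) from adjacent pairs."""
    counts = defaultdict(Counter)
    for s in members:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for prev, following in counts.items()
    }

def initialize_clusters(strings):
    """Group identical behavior strings and create one cluster per group."""
    groups = defaultdict(list)
    for s in strings:
        groups[s].append(s)
    return [Cluster(seed, members, estimate_cpd(members))
            for seed, members in groups.items()]

clusters = initialize_clusters(["|HUH|VPBT", "|H|U", "|HUH|VPBT"])
for c in clusters:
    print(c.seed, len(c.members))
```

Grouping identical strings first keeps the later consolidation phase small when many samples share exactly the same behavior string, which is the situation the algorithm is designed for.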

Cluster Consolidation Because our similarity metric is based on the CPD of the members in a cluster, the similarity computed on a larger cluster can be considered more reliable. Based on this intuition, we adopt a policy of merging the smallest cluster $X$ into the largest cluster $Y$ that satisfies $MergeIndex(X, Y) \ge t \times SIM'_Y(Y.seed)$, where $t$ is a given threshold and $SIM'_Y(Y.seed)$ can be considered the maximum similarity to the cluster $Y$. With this merging policy, we can improve the similarity computation during clustering and control the intra-cluster similarity between a cluster's seed and its members through the threshold $t$. The cluster with the least reliable CPD is removed, and its members are merged into the similar cluster whose similarity is computed from the most reliable CPD that satisfies the threshold. When we merge clusters, instead of using each $\sigma \in X.members$ in full, we update $Y.CPD$ using the substring of $\sigma$ that induces the maximum similarity to $Y.seed$, which reduces outliers in $Y.CPD$.

C. Signature Generation and Matching

Our cluster model, 〈seed, members, CPD〉, can be extended to behavioral signatures that detect members in the cluster. Because a behavioral signature can detect multiple programs that perform similar behavior, this can reduce the number of signatures. Based on the cluster model, we define the behavior signature BehaviorSig(C) of a cluster C as follows:

$$BehaviorSig(C) = \langle C.seed,\; C.CPD,\; min\_sim = \min_{\sigma \in C.members} SIM'_C(\sigma) \rangle.$$

To match the behavior signature against a given behavior string $\sigma$, we evaluate the following signature matching condition: $SIM'_C(\sigma) \ge BehaviorSig(C).min\_sim$. If the condition holds, we consider the string $\sigma$ matched, because the similarity of $\sigma$ to C.seed is at least that of the least similar string in C.members. Note that we can evaluate the condition from BehaviorSig(C) alone, because C.members is not necessary for computing $SIM'_C(\sigma)$.
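Signature generation and matching reduce to storing a minimum similarity and comparing against it, as the following sketch shows. The similarity function is passed in as a parameter; `toy_sim` is a placeholder (a multiset Jaccard index against the seed that ignores the CPD) used only to make the example runnable, not the $SIM'_C$ metric itself.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class BehaviorSig:
    seed: str
    cpd: dict
    min_sim: float

def make_signature(seed, cpd, members, sim):
    """Build <seed, CPD, min_sim>, where min_sim is the lowest member similarity."""
    return BehaviorSig(seed, cpd, min(sim(seed, cpd, m) for m in members))

def matches(sig, sigma, sim):
    """Signature matching condition: sim(sigma) >= min_sim. Members are not needed."""
    return sim(sig.seed, sig.cpd, sigma) >= sig.min_sim

def toy_sim(seed, cpd, sigma):
    """Placeholder similarity (assumption): multiset Jaccard with the seed, CPD unused."""
    a, b = Counter(seed), Counter(sigma)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

sig = make_signature("|HUH|VPBT", {}, ["|HUH|VPBT", "|HUH|VPBT|T"], toy_sim)
hit = matches(sig, "|HUH|VPBT|P", toy_sim)   # as close to the seed as the farthest member
miss = matches(sig, "|B|T", toy_sim)         # shares only a few symbols with the seed
print(sig.min_sim, hit, miss)
```

Because `matches` reads only the seed, the CPD, and `min_sim`, a signature can be shipped without the member strings it was built from.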

IV. EMPIRICAL EVALUATION

This section describes the evaluation of our behavior model and its effectiveness for software similarity analysis.

We implemented our technique mainly in Python. To collect dynamic traces, we developed a driver that performs SSDT hooking [6] to track Win32 Native API calls. For file and registry key objects, a number of system calls are gathered, including benign operations. We focus on accesses to 1) files within the local filesystem and 2) security-related registry keys¹.

In our evaluation, we focus on polymorphic malware behavior and apply our technique to behavioral malware clustering and behavioral signature generation/matching, which are important problems in software similarity analysis. For the evaluation, we collected 5,419 malware samples from VX Heavens [18] and Offensive Computing [13] and transformed them into their malware behavior strings. We use the term malware behavior string to refer to an instance of our string-based behavior model obtained from a malware sample. To collect its dynamic trace, we monitored each malware instance on Microsoft Windows XP SP2 running on QEMU [14] for 30 seconds.

Abstraction Capability We first evaluate our model's abstraction power. In particular, we measure the amount of length reduction of malware behavior strings over the corresponding traces. This evaluation shows how effective our model is for abstracting the dynamic traces because our model characterizes important high-level behavioral properties.

Figure 3 depicts the distributions of the length of malware behavior strings and the reduction ratio for the collected malware samples.

¹ We define security-related keys as those monitored by the Registry Guard component of Kaspersky Anti-Virus 8.0 [10].

Figure 3. Empirical evaluation on abstraction capability. Panels: (a) String length; (b) Ratio of string length to trace size.

The bin at length 37 in Figure 3(a) counts the malware samples whose string lengths are greater than 36, and the reduction ratios in Figure 3(b) are rounded to four decimal places. The results show that our model can represent malware behavior with relatively short strings even for long sequences of API invocations.

Cluster Quality and Quantity We evaluate the clustering technique proposed in Section III in terms of both cluster quality and quantity. We then evaluate the precision of our technique by comparing our results to clusters obtained by a single-linkage hierarchical clustering algorithm [9] using the Jaccard index [8], which was shown to be effective for clustering malware behavior profiles [2]. Because the similarity metrics are different, we only compare the clusters obtained by applying the maximum similarity threshold, i.e., MergeIndex(Ci, Cj) = ComputeSim(Cj, Cj.seed) and Jaccard index = 1.0. The Jaccard index-based similarity metric measures similarity in terms of the frequency of each object-accessing pattern, the number of threads a malware sample creates, and the number of behavior repetitions.

To construct the reference clusters, we first grouped all malware samples with identical behavior strings. For every pair of samples, we partitioned their behavior strings into thread-level substrings. We placed the pair in the same reference cluster if all the corresponding thread-level substrings are similar, which we define as having at least a 50% common prefix. For example, we would assign |BUH|DVPHBT|P* and |BUH|DVPBRUT|P* to two different clusters because the maximal shared prefix DVP of DVPHBT and DVPBRUT is not long enough. From this extensive analysis, we obtained 1,302 reference clusters.

Figure 4. Empirical evaluation on clustering and signature matching. Panels: (a) Cluster quantity; (b) Cluster quality; (c) Signature error rates.
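The pairwise test used to build the reference clusters can be sketched as follows. Two details are assumptions, since the text does not state them: the 50% ratio is measured against the longer of the two thread-level substrings, and two strings with different numbers of threads are never considered similar.

```python
from os.path import commonprefix  # character-wise common prefix of a list of strings

def threads_similar(a, b, min_ratio=0.5):
    """True if the shared prefix covers at least `min_ratio` of the longer substring."""
    shared = len(commonprefix([a, b]))
    return shared >= min_ratio * max(len(a), len(b))

def same_reference_cluster(s1, s2):
    """Compare two behavior strings thread by thread."""
    t1, t2 = s1.split("|")[1:], s2.split("|")[1:]
    return len(t1) == len(t2) and all(threads_similar(a, b) for a, b in zip(t1, t2))

# The example from the text: DVP is only 3 of the 7 letters in DVPBRUT.
different = same_reference_cluster("|BUH|DVPHBT|P*", "|BUH|DVPBRUT|P*")
similar = same_reference_cluster("|BUH|DVPHBT", "|BUH|DVPHB")
print(different, similar)
```

Under these assumptions the sketch reproduces the paper's example, placing the two strings in different clusters.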

Clustering Result We apply our technique to malware clustering on a machine with an Intel Xeon 5160 CPU and 16GB RAM. Figure 4(a) and Figure 4(b) show the quantity and the quality information (i.e., precision and recall) of clusters w.r.t. various threshold values. We observe that our algorithm significantly reduced the number of clusters compared to the number of original malware samples. The reduction rate ranged from about 70% to 90% across the thresholds. Our algorithm also produces optimal clusters in terms of precision/recall for threshold 0.2 and shows overall high precision/recall.

Our algorithm can effectively detect polymorphic or slightly modified malware. For example, 33 malware samples labeled as belonging to three malware families (RBot, SDBot, and Agobot) having the following malware behavior strings are clustered for threshold 0.2: |HUH|VPBT, |HUH|VPBT|T, |HUH|VPBT(|T)*, |HUH|VPBT|P|T, |HUH|VPBT|P*. These malware samples commonly perform the following tasks, represented by "|HUH|VPBT": they spawn a hack tool SVKP.sys, create a mutex, replicate themselves to the system directory, create the previous mutex, install themselves by modifying a startup system registry key, read c:\autoexec.bat, and connect to the remote host. Despite the shared behavior pattern, their concrete instances of malware behavior are different. For example, Rbot.abw and Agobot.afb create different files (pnpsrv.exe and nlsmon.exe) and mutexes (SKY2K4 and bkaaslc), respectively. However, in spite of these differences, our model can easily capture the invariant behavior sequence patterns. In addition, some added features can be easily identified using our model. For example, the "P*" represents the repetitive registry key modification for malware installation, and the "(|T)*" specifies the repetitive thread creation that exploits remote code execution vulnerabilities for malware propagation.

Comparison According to Figure 4(a) and Figure 4(b), compared to the Jaccard index-based approach, our algorithm produces more clusters, but it is more precise when we consider the maximum similarity. This is because the Jaccard similarity ignores the sequence information of behavior, which causes different behavior patterns to be misclassified into the same cluster. For example, malware samples that have different string-based behavior profiles (e.g., "|H|PBT(|T)*", "|HP|BT(|T)*", "|BH|PT(|T)*", and "|PH(|T)*|BT") are merged into a single cluster even when we apply the maximum Jaccard similarity for clustering. This merge is inappropriate because their behavior sequence patterns differ. From our evaluation, we believe that considering the temporal behavior sequence is necessary to specify detailed malware behavior models and to precisely analyze malware behavior similarity.
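The following sketch illustrates why a set-based measure cannot separate the four profiles above. It assumes that the Jaccard features are simply the set of symbols occurring in each string, and it uses plain edit distance as a stand-in for a sequence-aware measure; neither is claimed to be the exact feature set or similarity function used in our evaluation.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index over the sets of symbols in two strings (assumed features)."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def edit_distance(a, b):
    """Levenshtein distance, used here only as a sequence-aware stand-in."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

profiles = ["|H|PBT(|T)*", "|HP|BT(|T)*", "|BH|PT(|T)*", "|PH(|T)*|BT"]

# Every pair has Jaccard index 1.0 (identical symbol sets), while the
# edit distance is nonzero because the symbol order differs.
for a, b in combinations(profiles, 2):
    print(a, b, jaccard(a, b), edit_distance(a, b))
```

Because all four strings use the same symbols, any threshold on the Jaccard index, including the maximum, places them in one cluster; an order-sensitive measure keeps them apart.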

Signature Matching Errors As we mentioned in Section III-C, our cluster model can be used as behavioral signatures to detect malware. To evaluate the reliability of the generated signatures, we conducted the following analysis: 1) we collected 307 default programs of Microsoft Windows XP SP2 and 15 popular freeware programs from the top 20 on download.com; 2) we generated their malware behavior strings as before and matched them against the cluster models obtained from the evaluation.

Figure 4(c) shows the results of matching the signatures with benign samples in terms of false positive and false negative rates. These results indicate that the proposed cluster model can produce reliable behavioral malware signatures. First, even for a small threshold value of 0.01, our signatures produce a low false positive rate (9/322). The main reason for these false positives is that some benign behavior strings match clusters whose seeding strings are too short to represent meaningful malicious behavior. In particular, the following five seeding strings cause false positives: |BHU, |BUB, |H|U, |B|T, and |HUH. After a detailed analysis, we found that such short strings are caused by crashes of malware executables or by the fixed time duration used to collect the system call traces. We believe these problems can be solved straightforwardly by collecting correctly behaving malware samples and increasing the duration. Second, we did not observe any false negatives during our evaluation, and our signature model can precisely characterize all malware samples used in building the signatures.
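For concreteness, the reported false positive count can be worked out as a rate. The sketch below only restates the figures given above (307 + 15 benign programs, 9 false positives at threshold 0.01) and lists the lengths of the five problematic seeding strings; the variable names are illustrative.

```python
# Benign corpus: 307 default Windows XP SP2 programs plus 15 freeware programs.
benign_total = 307 + 15
false_positives = 9  # reported at threshold 0.01

fp_rate = false_positives / benign_total
print(f"{false_positives}/{benign_total} = {fp_rate:.1%}")  # 9/322 = 2.8%

# The seeding strings responsible for the false positives are all very short.
short_seeds = ["|BHU", "|BUB", "|H|U", "|B|T", "|HUH"]
print({seed: len(seed) for seed in short_seeds})
```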

Performance To evaluate the performance of our algorithm, we measured its execution time for clustering. Despite the pairwise comparisons for cluster consolidation, our algorithm shows good performance. Even for higher threshold values, the clusters can be collected within 4 minutes (e.g., it took 187.22 seconds to cluster 5,419 strings with a threshold value of 0.9). There are two reasons for this practical usability of our algorithm. First, we efficiently group identical behavior strings into a cluster, which greatly reduces the number of input strings to cluster. This initial cluster generation proves promising in practice for malware clustering, because polymorphic malware derived from the same source code tends to generate the same behavior strings. In our evaluation, 1,656 initial clusters were generated from 5,419 malware samples. Second, we only perform pairwise comparisons with a seeding string in each cluster to find the largest similar cluster. In comparison with standard agglomerative hierarchical algorithms, the number of comparisons in our algorithm is much smaller.
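The two cost-saving steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the similarity function is a stand-in (Python's difflib ratio), the threshold is treated as a minimum similarity, and a string joins the cluster whose seeding string is most similar to it. All names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity in [0, 1]; not the measure defined in the paper."""
    return SequenceMatcher(None, a, b).ratio()

def cluster(samples, threshold):
    """samples: iterable of (sample name, behavior string) pairs."""
    # Step 1: identical behavior strings form one initial cluster.
    initial = defaultdict(list)
    for name, behavior in samples:
        initial[behavior].append(name)

    # Step 2: consolidate by comparing each distinct string only with
    # the seeding string of every existing cluster.
    clusters = []
    for behavior, names in initial.items():
        best, best_sim = None, threshold
        for candidate in clusters:
            sim = similarity(behavior, candidate["seed"])
            if sim >= best_sim:
                best, best_sim = candidate, sim
        if best is None:
            clusters.append({"seed": behavior, "members": list(names)})
        else:
            best["members"].extend(names)
    return len(initial), clusters

# Hypothetical input: four samples, three distinct behavior strings.
samples = [("s1", "|HUH|VPBT"), ("s2", "|HUH|VPBT"),
           ("s3", "|HUH|VPBT|T"), ("s4", "|B|T")]
n_initial, result = cluster(samples, threshold=0.8)
print(n_initial, [(c["seed"], c["members"]) for c in result])
```

With d distinct strings and k clusters, step 2 performs at most d·k comparisons, whereas a standard agglomerative scheme starts from all pairwise distances among the inputs.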

V. RELATED WORK

This section discusses the two threads of closely related work.

Software Behavior Specification To specify software behavior, Christodorescu et al. [3] construct, from dynamic traces, a directed acyclic graph (DAG) where each node corresponds to a collected system call and each edge to a dependency among the system calls. Kolbitsch et al. [11] improve this DAG-based behavior model by considering dataflow dependency determined by system call arguments. Schuler et al. [16] present the dynamic birthmark, a set of short call sequences received by objects of the Java API, to characterize program execution. Wang et al. [20] propose birthmarks of a program, which are based on system call sequences that are not commonly seen in executions of the program with respect to different inputs or executions exhibited by many other programs. Compared with these approaches, our model specifies software behavior more abstractly and at a higher level. We represent each behavior based not on a single system call but on a particular API usage pattern. Furthermore, our model specifies a sequence of software behaviors as a special regular expression to capture high-level behavior patterns of the target program.

Software Behavior Similarity To detect similar behavior patterns, prior research [1], [2], [7], [15], [19] works by grouping similar behavior models obtained from captured behavior properties of software samples. Although following this general approach, our technique has several advantages. First, while previous research has focused on graph similarity [7], [19] or numerical vector similarity [1], [2], [15], our technique detects sequence similarity, which provides precise similarity analysis of software behavior. Second, our model can effectively characterize high-level abstractions of software behavior such as semantic-aware and repetitive behaviors, while previous approaches do not model such high-level properties of software execution. Ideally, we would have liked to empirically compare our technique with the existing work, but neither their implementations nor their evaluation subjects are made available, which precludes such a comparison. Nevertheless, the conceptual advantages of our model over existing ones are quite evident, and we have demonstrated empirically its effectiveness.

Previous work on dynamic signature generation/matching has focused on generating malware signatures [4], [5], [11], [23]. These approaches determine malware signatures as part of the executed instructions or control- and data-flow dependency information from dynamic traces. However, the precision of these signature models has not yet been demonstrated.

VI. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a novel behavior model that uses regular expressions to precisely and succinctly specify high-level behavior patterns. Based on the model, we have developed an effective clustering algorithm and applied it to malware clustering and signature generation/matching. Evaluation results on a large malware collection show that our technique is effective: 1) it is concise, capable of representing long traces with short signatures; and 2) it can precisely cluster malware with similar behavior patterns and generate reliable behavior-based malware signatures. Our model is general and applicable to other software similarity analysis problems.

For future work, we plan to consider various software similarity analyses based on sequence data mining, which has been widely studied in other research fields such as biological computation and bioinformatics. For example, analysis of software evolution can be conducted by adapting algorithms for analyzing DNA sequence evolution [22]. Our model opens the door for interesting cross-fertilizations between the two areas.

ACKNOWLEDGEMENTS

This research was supported in part by NSF grants CCF-0546844, CNS-0627749, CCF-0702622, CNS-0917392, CCF-1117603, and US Air Force under grant FA9550-07-1-0532. The information presented here does not necessarily reflect the position or the policy of the Government and no official endorsement should be inferred.

REFERENCES

[1] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario. Automated Classification and Analysis of Internet Malware. In Proc. RAID, 2007.

[2] U. Bayer, P. M. Comparetti, C. Hlauscheck, C. Kruegel, and E. Kirda. Scalable, Behavior-Based Malware Clustering. In Proc. NDSS, 2009.

[3] M. Christodorescu, S. Jha, and C. Kruegel. Mining specifications of malicious behavior. In Proc. ESEC/FSE, 2007.

[4] M. Christodorescu, S. Seshia, S. Jha, D. Song, and R. E. Bryant. Semantics-aware malware detection. In Proc. S&P, 2005.

[5] M. Feng and R. Gupta. Detecting virus mutations via dynamic matching. In Proc. ICSM, 2009.

[6] G. Hoglund and J. Butler. Rootkits: Subverting The Windows Kernel. Addison Wesley, 2006.

[7] X. Hu, T.-c. Chiueh, and K. G. Shin. Large-scale malware indexing using function-call graphs. In Proc. CCS, 2009.

[8] Jaccard index. http://en.wikipedia.org/wiki/Jaccard_index.

[9] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Comput. Surv., 31(3):264–323, 1999.

[10] Kaspersky Anti-Virus 8.0. http://usa.kaspersky.com/products_services/anti-virus.php.

[11] C. Kolbitsch, P. M. Comparetti, C. Kruegel, E. Kirda, X. Zhou, and X. Wang. Effective and efficient malware detection at the end host. In Proc. 18th Usenix Security Symposium, 2009.

[12] T. Kwon and Z. Su. Modeling high-level behavior patterns for precise similarity analysis of software. UC Davis technical report CSE-2010-16, 2010.

[13] Offensive Computing. http://www.offensivecomputing.net.

[14] QEMU. http://bellard.org/qemu/.

[15] K. Rieck, T. Holz, C. Willems, P. Dussel, and P. Laskov. Learning and Classification of Malware Behavior. In Proc. DIMVA, 2008.

[16] D. Schuler, V. Dallmeier, and C. Lindig. A dynamic birthmark for java. In Proc. ASE, 2007.

[17] J. Stoye and D. Gusfield. Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science, 270(1-2):843–856, 2002.

[18] VX Heavens. http://vx.netlux.org.

[19] X. Wang, Y.-C. Jhi, S. Zhu, and P. Liu. Behavior based software theft detection. In Proc. CCS, 2009.

[20] X. Wang, Y.-C. Jhi, S. Zhu, and P. Liu. Detecting software theft via system call based birthmarks. In Proc. ACSAC, 2009.

[21] J. Yang and W. Wang. CLUSEQ: Efficient and Effective Sequence Clustering. In Proc. ICDE, 2003.

[22] Z. Yang. Paml 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol, May 2007.

[23] Q. Zhang and D. S. Reeves. MetaAware: Identifying Metamorphic Malware. In Proc. ACSAC, 2007.
