arXiv:1808.01186v1 [cs.CR] 3 Aug 2018

Stimulation and Detection of Android Repackaged Malware with Active Learning

Aleieldin Salem
Technische Universität München

Garching bei München
[email protected]

Abstract

Repackaging is a technique that has been increasingly adopted by authors of Android malware. The main problem facing the research community working on devising techniques to detect this breed of malware is the lack of ground truth that pinpoints the malicious segments grafted within benign apps. Without this crucial knowledge, it is difficult to train reliable classifiers able to effectively classify novel, out-of-sample repackaged malware. To circumvent this problem, we argue that reliable classifiers can be trained to detect repackaged malware if they are allowed to request new, more accurate representations of an app's behavior. This learning technique is referred to as active learning.

In this paper, we propose the usage of active learning to train classifiers able to cope with the ambiguous nature of repackaged malware. We implemented an architecture, Aion, that connects the processes of stimulating and detecting repackaged malware using a feedback loop depicting active learning. Our evaluation of a sample implementation of Aion using two malware datasets (Malgenome and Piggybacking) shows that active learning can outperform conventional detection techniques and, hence, has great potential to detect Android repackaged malware.

1 Introduction

Malware authors targeting Android have recently been adopting a more sophisticated technique to develop and distribute their malicious instances. In essence, they leverage the ease of decompiling, modifying the source code of, and recompiling Android applications (hereafter apps) to repackage legitimate, trusted apps with malicious payloads [29, 13, 17, 26, 35]. This breed of malware is often referred to as either piggybacked apps [13, 14] or repackaged malware [34, 35]. In this paper, we use the two terms interchangeably to refer to the same type of Android malware: legitimate apps that have been grafted with malicious payloads.

Repackaged malware can be thought of as an evolution of Trojan horses. The two malware breeds differ in one key aspect, viz. the originality of the app's benign functionality. On one hand, Trojan horses usually comprise benign functionalities that have been developed from scratch by the malware author for the sole purpose of presenting the app as a benign one. Furthermore, authors of Trojan horses seldom invest adequate amounts of time and effort in developing such a fake, benign facade. The resulting benign segment is, therefore, of low quality and limited functionality (e.g., a Sudoku app with five puzzles). On the other hand, repackaged malware comprises pre-existing legitimate apps (e.g., Angry Birds) that have been amalgamated with a newly injected malicious payload (e.g., sending SMS messages to premium rate numbers [9]). The primary threat that repackaged malware poses is undermining user trust in legitimate apps, their developers, and the app distribution infrastructure, which can potentially have devastating effects on the entire Android ecosystem. Unfortunately, malware authors have been increasingly adopting repackaging, particularly targeting third-party marketplaces as their distribution platform. In [35], Zhou et al. studied 1,260 Android malware instances and concluded that more than 86% of them were repackaged. In 2015, TrendMicro reported that 77% of the top 50 free apps on the Google Play marketplace had fake versions, with a total of 890,482 fake apps being discovered, 51% of which were found to exhibit unwanted and/or malicious behaviors [15]. More recently, Li et al. managed to gather piggybacked versions of around 1,400 legitimate apps [13].

Consequently, the research community has been working towards devising methods to detect Android repackaged malware. To the best of our knowledge, the majority of such methods analyze, and perhaps execute, apps belonging to a dataset (e.g., Malgenome [35]) to extract numerical features used to train a machine learning classifier. The trained classifier is used to classify out-of-sample (hereafter test) apps as malicious or benign [26, 17, 6]. One fundamental problem facing the research community working on detecting repackaged malware is the quality of the ground truth of the currently available datasets. That is to say, a repackaged app is merely labeled as malicious without knowledge of which paths depict the grafted malicious behaviors, whether there are triggers that control the execution of such behaviors, and how sophisticated those triggers are. Lacking such knowledge hinders training reliable classifiers based on the apps' dynamic behaviors, which is necessary given that malware instances have increasingly been adopting techniques (e.g., obfuscation, dynamic code loading, usage of triggers, etc.) to evade detection by conventional static-based methods [29, 12, 18, 33].

We argue that, in essence, the problem of stimulating, analyzing, and detecting Android repackaged malware is a search problem. During the phase of training a classifier, we are in pursuit of the segments in the training apps' source code (e.g., paths in a CFG) that depict or include the injected malicious code. Finding and executing those segments facilitates the extraction of features that resemble their runtime behavior, and enables the classifier to have a collective, accurate view of the malicious behaviors exhibited by Android repackaged malware. Similarly, to reliably classify a test app, we need to execute multiple paths to increase the probability of finding (a representation of) the injected code prior to deciding upon the app's malignance.

Needless to say, the nature and locations of the grafted malicious code in repackaged apps might vary. Given that we lack such ground truth, the training and test processes need to adopt a trial-and-error technique. In the training phase, for example, if a feature vector (xi) representing the runtime behavior of a training app (ai) is misclassified, we assume that such feature vector represents an execution path that does not reflect the app's true nature (i.e., malicious or benign). Thus, the app (ai) needs to be re-executed to explore another path that yields a feature vector (x′i) that might help the classifier assign the app to its proper class. This process can be iterated until the maximum training accuracy is achieved, which signals the best possible representation of benign and malicious behaviors embedded in the training apps. The number of iterations needed to achieve such maximum accuracy depicts the number of times an app needs to be executed in order to have a comprehensive overview of its behavior. For repackaged malware, it can be interpreted as the number of executions within which the malicious code is likely to execute. The feature vectors corresponding to the different executions can be used to classify the app (e.g., using majority votes) as malicious or benign.

Within the context of machine learning, the aforementioned trial-and-error technique is referred to as active learning. In this paper, we propose, implement, and evaluate an active learning architecture, called Aion¹ [?], to stimulate, analyze, and detect Android repackaged malware.

Our findings show that, even with primitive stimulation and classification techniques, training a classifier using active learning improves the performance of conventional dynamic-based detection methods on test datasets and can outperform their static-based counterparts. Furthermore, the conducted experiments helped us answer questions about the number of iterations needed to train the best performing classifier, the most suitable kind of features, and the type of the best performing classifier (e.g., Random Forests, K-Nearest Neighbors, Ensemble, etc.). We believe that our experiments provide insights on how to stimulate and detect Android repackaged malware using active learning, and a benchmark against which further enhancements of the architecture (e.g., via using more sophisticated stimulation techniques) can be compared.

The contributions of this paper are:

• We propose a novel architecture and approach to stimulate, analyze, and detect Android repackaged malware using active learning, and investigate its applicability and performance.

• We implemented a framework that utilizes the proposed architecture, and made the source code, the data (i.e., gathered during evaluation), and a multitude of interactive figures plotting our results available online [?].

• We evaluate Aion on the recently analyzed and released Piggybacking dataset instead of its obsolete Malgenome counterpart, and use the former to set a classification benchmark of 72%, similar to Zhou et al.'s 79.6% benchmark on the latter in November 2011.

Organization The rest of the paper is organized as follows. In section 2, we briefly discuss fundamental concepts on which our research is built. In section 3, the methodology adopted during the implementation of Aion, its architecture, and its internal structure are introduced. In section 4, we discuss how Aion was evaluated by introducing the datasets we used and the experiments conducted, and we discuss the observed results in section 5. Research efforts related to our work are enumerated in section 6. Lastly, section 7 concludes the paper.

¹Aion is a Greek deity associated with eternity and everlasting time. His orb, depicting repetition, resembles the behavior of our proposed active learning approach, where a feedback loop closes a circle between the processes of stimulation and detection of Android repackaged malware.

2 Background

In this section, we briefly discuss concepts that are fundamental to following the remainder of the paper. We discuss our definition of repackaged malware and what it comprises, and introduce the notion of active learning, relating it to our work.

2.1 Repackaged Malware

Android apps can be easily disassembled, reverse engineered, modified, and reassembled [7]. Consequently, repackaging benign apps with malicious code segments is a straightforward process that typically involves the following activities. Firstly, a reverse engineering tool, such as Apktool [1], is used to disassemble an app into a readable, intermediate representation (i.e., Smali). Secondly, a segment of Smali code that delivers some malicious functionality is added and merged with the original Smali code. Lastly, the app is reassembled, re-signed, and uploaded to an app marketplace, possibly with a different name.

The malicious code segments injected into benign apps usually comprise two parts, viz. a malicious payload and a trigger (or a hook [13]). The former part depicts the malicious intents of the malware author (e.g., sending SMS messages to premium numbers, deleting the user's contacts, leaking the current user's GPS location, etc.). In [9], Fratantonio et al. argue that a malicious payload need not be disconnected from the original app logic; it can distort the actual app functionality, in what they referred to as logic bombs.

The trigger is meant to transfer control to the grafted malicious payload. We can define the trigger as a set of (boolean) conditions that need to hold in order for the malicious payload to execute. For authors of repackaged malware, designing a trigger is a compromise between the likelihood of execution and the stealthiness of the injected malicious payload [13]. On one hand, some malware authors may elect to maximize the likelihood of the malicious payload being executed by unconditionally calling/branching to the malicious code (e.g., deployed in a separate method or component) [13]. In this case, the trigger is a null condition. On the other hand, some authors may give stealthiness more priority, and develop more complex triggers that activate the grafted malicious functionality depending on specific system properties (e.g., date, time, GPS location, etc.), app/system intents and notifications (e.g., android.intent.action.BOOT_COMPLETED), custom values forwarded to the app by its authors (e.g., via SMS messages [19]), or app logic (e.g., if (sum_of_calculation == 50)). This trend of scheduling the triggering of malicious segments has been found to be adopted by the majority of Android malware [29].

2.2 Active Learning

To the best of our knowledge, the majority of Android malware detection techniques adopt the following process. Firstly, a dataset of benign and malicious apps is gathered. Secondly, the apps in the dataset are analyzed to extract numerical features from each app. That is to say, each app is represented by a vector of features (xi) and a label (yi) depicting its nature (i.e., malicious or benign). The extracted feature vectors, often organized in a matrix representation (X), are used to train a machine learning classifier (C) (e.g., a support vector machine). Given the feature vector of an app (x∗) that has not been used during the training phase (i.e., a test app), the classifier (C) attempts to predict its corresponding label (y∗), effectively classifying the app as malicious or benign. To assess the prediction capability of a classifier, the original labels of the test apps, known as ground truth, are compared to those predicted by the classifier. The percentage of correctly predicted labels is known as the classification accuracy.
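The following sketch makes this conventional pipeline concrete. It assumes scikit-learn (the text does not name a library), and X_train/y_train and X_test/y_test are illustrative placeholders for the extracted feature matrices and labels.

    from sklearn.metrics import accuracy_score
    from sklearn.svm import SVC

    # A minimal sketch of the detection pipeline described above, assuming
    # scikit-learn and pre-extracted features (1 = malicious, 0 = benign).
    def passive_detection(X_train, y_train, X_test, y_test):
        clf = SVC(kernel="linear")    # the classifier (C), e.g., an SVM
        clf.fit(X_train, y_train)     # training phase
        y_pred = clf.predict(X_test)  # predicted labels (y*) of test apps
        return accuracy_score(y_test, y_pred)  # classification accuracy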

Needless to say, a number of the test apps' labels are incorrectly predicted (i.e., the corresponding apps are misclassified), which negatively affects the classification accuracy. If the classification accuracy is implausible, other classifiers can be probed, the existing features extracted from the apps can be further processed (e.g., to eliminate noisy, irrelevant features), a new method of feature extraction can be adopted, or a combination of those methods can be applied. This setting is referred to as passive learning [28]. The passiveness dwells in the inability of the trained classifier to ask for more accurate representations of the misclassified apps in the form of different feature vectors.

In an active learning setting, the classifier is allowed to query the sources from whence the feature vectors have been extracted for alternative representations of the same entities [28]. Within the context of Android malware detection, if the feature vector of a test app (x∗) is misclassified, the classifier (C) is allowed to instruct the feature extraction mechanism to generate another feature vector (x∗new) for the same app. This process can be repeated until either the misclassified app is correctly classified, or the overall classification accuracy of (C) has converged to a maximum value.
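A conceptual sketch of this query loop follows. Here clf is an already trained classifier, and re_extract_features() is a hypothetical stand-in for whatever mechanism produces a fresh feature vector for an app (in our case, re-stimulating it, as described in section 3).

    # A conceptual sketch of active learning as defined above: the
    # classifier repeatedly queries the (hypothetical) extraction
    # mechanism, re_extract_features(), for new representations of
    # misclassified apps, then re-trains on the updated vectors.
    def active_refinement(clf, apps, X, y, max_queries=10):
        for _ in range(max_queries):
            wrong = [i for i in range(len(apps))
                     if clf.predict([X[i]])[0] != y[i]]
            if not wrong:
                break                      # every app correctly classified
            for i in wrong:
                X[i] = re_extract_features(apps[i])  # query a new vector
            clf.fit(X, y)                  # re-train on the new vectors
        return clf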


3 Design and Implementation

3.1 Overview

The typical malware detection process introduced in the previous section can be grouped into two main sub-processes, viz. Data Generation and Data Inference, as seen in figure 1, which depicts our proposed architecture, Aion.

The former process commences with the acquisition of a dataset of malicious and benign Android apps. Using a stimulation method, the collected apps are analyzed statically and executed within a controlled, virtualized environment to gather information about the apps' runtime behavior. Within the domain of malware analysis, there are different invasive and non-invasive approaches to stimulating an app [31]. For example, Abraham et al. [4] implemented an approach meant to force the execution of particular branches within an app that contain suspicious API calls, whereas Rasthofer et al. [18] utilized symbolic execution to devise environments (i.e., sets of values) required to execute branches containing specific types of API calls. The Monitoring component depicted in the figure is responsible for recording the stimulated behavior for further analysis. It can be deployed on a remote server, on the environment within which an app is executed, embedded within the app itself (e.g., as logging statements), or a combination of those techniques. Lastly, the monitored behavior may need to be further processed to represent it in a parseable, understandable format. For example, the API calls issued by an app during runtime are usually reorganized in the format of comma-separated strings (i.e., traces).

The stimulated, monitored, and reconstructed representations of the apps' behaviors are usually raw, and need to be further processed before they are used for training a classifier. Data Inference is concerned with inferring, from the raw data received from its Data Generation counterpart, relevant information that might facilitate segregating the two classes of apps. The Projection component is responsible for processing the raw data and projecting it into a different representation or dimensionality. For example, a trace of API calls can be processed to omit those calls that do not manipulate sensitive system resources (e.g., the camera) or to extract numerical features (e.g., counts of different API calls). To further remove noise from the projected data, the Inference component attempts to extract patterns and information that might facilitate the classification of apps. Depending on the format of the projected data, this component can either apply feature selection techniques to rule out irrelevant features, or infer rules that depict characteristics unique to a class of apps. For instance, by studying the API call traces of apps, a rule can be inferred that malicious apps usually use a particular set/sequence of API calls. Lastly, the Training component handles training a classifier using the processed raw data, validates the results (e.g., via a test dataset), and reports classification results.

Figure 1: An architecture that uses active learning to stimulate, analyze, and detect Android repackaged malware. (The figure shows the Data Generation process, comprising 1.1 Collection, 1.2 Stimulation, 1.3 Monitoring, and 1.4 Reconstruction, feeding raw/noisy data to the Data Inference process, comprising 2.1 Projection, 2.2 Inference, and 2.3 Training, with training gaps fed back to Data Generation.)

The two dashed arrows looping out of and into the two processes imply that the operations within such processes can be performed multiple times before the process reports its final output. For example, the Data Generation process can continue to add new apps to its repository, stimulate them multiple times, monitor their runtime behaviors, and report an augmented version of the different stimulation sessions. Similarly, the Data Inference process is allowed to consult various combinations of projection, inference, and classification techniques in pursuit of the best classification accuracy.

Lastly, the dashed arrow extending from the Data Inference process to its Data Generation counterpart depicts the active learning aspect of our proposed approach. We illustrate the functionality and significance of such mechanism using the control flow graph (CFG) in figure 2. Consider such CFG as that of a benign app (ai) that has been injected with malicious segments (colored blocks). Regardless of the technique it employs to target and execute specific branches, the stimulation component should devise test cases that execute the path (P2), which includes the malicious segments. The feature vectors extracted from the representation of (P2) (e.g., an API call trace) should depict the malicious behavior grafted into the benign app and, in turn, help the trained classifier assign this app to the malicious class. Nevertheless, since the location of the injected malicious payload is unknown, and since the malicious payload is assumed to be smaller than the original benign code, the stimulation component is likely to target other branches/statements within the CFG, effectively executing other paths (i.e., P1 or P3). Feature vectors extracted from such paths are expected to represent benign behaviors, leading to the misclassification of the app. The feedback mechanism reports the misclassification of (ai) to the Data Generation process, particularly the stimulation component. Using this information, the latter attempts to target different branches/statements to execute different paths within the app's CFG, and forwards the reconstructed behaviors of such newly-executed paths to the Data Inference process for re-training. In theory, (ai) will continue to be misclassified until (P2) is executed, which is expected to help the trained classifier deem the app malicious. Consequently, the feedback mechanism will continue to instruct the stimulation component to target different code segments until such path is executed.

Figure 2: An example of a repackaged malware's control flow graph (CFG). The colored segment depicts the malicious payload injected into the app, whereas the dashed arrows represent different paths (P1, P2, P3) through the CFG.

Given a dataset of training apps, the feedback mechanism is meant to provide the Data Generation module with the best representation of each app, i.e., one that resembles its nature (malicious or benign). A classifier trained with such representations is expected to possess a comprehensive view of the malicious and benign behaviors exhibited by the apps within the training dataset. However, the large combination of different locations of the injected malicious payloads, their contents and behaviors, and their triggers significantly impedes the process of executing multiple paths per app. In other words, it can take an unforeseeable amount of time to find the best representation of each app in the training dataset. Furthermore, achieving a perfect training accuracy is unlikely, especially since the behaviors of some apps can mimic those of apps belonging to the other class. That is to say, classification errors are inevitable. In this context, we design the feedback process to terminate once the best possible training accuracy is achieved, which is subject to two constraints. Firstly, the feedback process continues as long as the performance of the classifier increases across two iterations or decreases by only a small percentage (e.g., 1%). Secondly, we specify a maximum number of iterations, after which the training process is terminated.

3.2 Implementation

The exact methods, techniques, and algorithms used by different components in the two processes of Data Generation and Data Inference can vary. In other words, the proposed architecture is completely agnostic to the specific methods utilized by different components. However, the use of different stimulation approaches might affect the implementation of the Data Inference components and the feedback mechanism. In this paper, we focus on demonstrating the proposed architecture and investigating the applicability of active learning to the problem at hand. Consequently, we adopt an example in which we use basic techniques for the stimulation, detection, and feedback mechanisms, and leave the utilization of more sophisticated stimulation/detection methods for future work.

The stimulation technique used in this paper is based on a non-invasive, UI-based tool, called Droidutan [2], which is designed to emulate user-app interaction. The tool starts the main activity of an app, retrieves its UI elements, chooses a random element out of the retrieved ones, and interacts with it. The interaction hinges on the class of the chosen UI element. For example, if the UI element is a Button, Droidutan will tap it; if it is a TextField, random text will be typed into it; and so forth. The tool replicates the same behavior with any activity that may start during the stimulation period. To simulate the occurrence of system notifications or inter-app communication, the tool randomly broadcasts intents declared and registered for by the app in its manifest file.

Interacting with an app generates a runtime behavior that we define in terms of API calls. We use droidmon as our monitoring component to intercept and record a particular set of API calls that interact with system resources such as the device's camera, the user's contacts, the package installer, the GPS module, etc. The tool records the intercepted calls along with their arguments and return values, and writes them to the system log. For each app, we retrieve the API calls from the log and represent the app's behavior as a series of calls, each having the format [package.module.class].[method].

The APK archives of the apps under test and their corresponding traces retrieved from droidmon's output are used to extract static and dynamic features, respectively. After surveying the literature, we concluded that static features can be categorized into basic metadata features extracted from the AndroidManifest.xml file [21], features related to the permissions required by the app [20, 11], and features related to the API calls found within the app's code [30]. For each app, we extract static features depicting the aforementioned three categories. We refer to those features as basic, permission, and api features, and we list them in appendix A. The dynamic features extracted from each app's API call trace form a feature vector of 37 attributes, each of which depicts the number of times a method belonging to one of the API packages hooked by droidmon (listed in appendix B) has been encountered in the trace.
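As an illustration, the dynamic feature extraction amounts to counting calls per hooked package. The sketch below assumes the trace format described above; the two entries in HOOKED_PACKAGES are illustrative stand-ins for the 37 packages/classes of appendix B.

    from collections import Counter

    # A sketch of the dynamic feature extraction described above.
    HOOKED_PACKAGES = [
        "android.telephony.TelephonyManager",
        "android.location.LocationManager",
        # ... the remaining hooked packages/classes would be listed here
    ]

    def extract_dynamic_features(trace):
        """Map a trace (a list of "[package.module.class].[method]"
        strings) to a vector of per-package call counts."""
        counts = Counter()
        for call in trace:
            cls, _, _method = call.rpartition(".")
            cls = cls.strip("[]")  # drop the bracketed trace notation
            if cls in HOOKED_PACKAGES:
                counts[cls] += 1
        return [counts[p] for p in HOOKED_PACKAGES]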

The static and dynamic feature vectors extracted from the apps' APKs and their traces are used to train a majority vote (i.e., Ensemble) classifier using a total of 12 classifiers. The classifiers used to train the Ensemble classifier are K-Nearest Neighbors (KNN) with values of K = {10, 25, 50, 100, 250, 500}, random forests with numbers of estimators E = {10, 25, 50, 75, 100}, and a Support Vector Machine (SVM) with the linear kernel.
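A sketch of this 12-classifier ensemble, again assuming scikit-learn:

    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    # A sketch of the majority vote ensemble described above: 6 KNNs,
    # 5 random forests, and 1 linear SVM, combined by hard voting.
    def build_ensemble():
        estimators = [(f"knn{k}", KNeighborsClassifier(n_neighbors=k))
                      for k in (10, 25, 50, 100, 250, 500)]
        estimators += [(f"forest{e}", RandomForestClassifier(n_estimators=e))
                       for e in (10, 25, 50, 75, 100)]
        estimators.append(("svm", SVC(kernel="linear")))
        return VotingClassifier(estimators=estimators, voting="hard")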

We assess the performance of different classifiers according to two metrics, viz. the F1 score and specificity. The first metric is defined as F1 = 2 × (precision × recall) / (precision + recall), such that precision = TP / (TP + FP) and recall = TP / (TP + FN). True positives (TP) denote malicious apps correctly classified as malicious, false positives (FP) denote benign apps mistakenly classified as malicious, and false negatives (FN) denote malicious apps mistakenly classified as benign. Thus, the F1 score is meant to assess the ability of a classifier to recognize malicious apps and correctly classify them. We also keep track of the classifier's performance on benign apps (i.e., whether it classifies them correctly as benign) via specificity = TN / (TN + FP), where true negatives (TN) denote benign apps correctly classified as benign. We keep track of these two metrics to assess whether a classifier is biased towards one class in favor of the other. For example, if a classifier scores a high F1 score and a low specificity score, this signals its bias towards classifying the majority of apps as malicious. Detecting a classifier's bias can be used to amend the methods adopted to train it (e.g., the type of extracted features).
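Concretely, the two metrics can be computed from the confusion matrix counts as in the following minimal sketch (labels are 1 = malicious, 0 = benign):

    # A sketch of the two metrics defined above, computed from true and
    # predicted labels (1 = malicious, 0 = benign).
    def f1_and_specificity(y_true, y_pred):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        specificity = tn / (tn + fp) if tn + fp else 0.0
        return f1, specificity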

The apps misclassified by the Ensemble classifier should be, according to our proposed approach, re-stimulated to execute different segments within their code, yielding different paths, different traces of API calls and, consequently, different feature vectors. Given that Droidutan is a random-based testing tool, each run of an app is likely to execute a different path. Consequently, the feedback mechanism in this implementation of Aion comprises merely re-running an app. To ensure that random stimulation results in the execution of different paths, we kept track of the API traces recorded by droidmon for each app. We averaged the number of API calls that change in an app's trace across two consecutive stimulations, and found that an average of 60 API calls (disregarding their arguments and return values) change with every stimulation.
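One way to measure such change between two runs is a multiset difference over call names, as in the sketch below (our exact counting procedure may differ):

    from collections import Counter

    # A sketch of measuring how much an app's trace changes across two
    # consecutive stimulations: count API calls (names only, disregarding
    # arguments and return values) that differ between the two traces.
    def trace_change(trace_a, trace_b):
        diff = Counter(trace_a)
        diff.subtract(Counter(trace_b))
        return sum(abs(c) for c in diff.values())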

4 Evaluation

In this section, we evaluate the proposed active learning based approach using two datasets of real-world Android malware. The main hypothesis of this paper is that active learning enhances the performance of classifiers trained to effectively detect Android repackaged malware. Consequently, we aspire to answer the following research questions:

Q1: How effective and efficient are the classifiers trained using active learning in comparison to conventional analysis and detection techniques?

Q2: How reliable are the results achieved by such classifiers?

Q3: How well do active learning classifiers generalize on test datasets?

Q4: What type of features (e.g., static versus dynamic) and classifiers yield the best detection accuracies on test datasets?

4.1 Datasets

We used two datasets of malicious and benign Android apps to evaluate our approach. The malicious and benign apps of the first dataset were gathered separately. The malicious apps of such dataset belong to the Malgenome dataset [35]. Malgenome originally comprised more than 1,200 malicious apps, almost 86% of which were found to be repackaged malware instances. Prior to being discontinued in 2015, Malgenome was considered the de facto dataset of Android repackaged malware. Fortunately, the repackaged malware belonging to this dataset continues to exist within the Drebin dataset [5], from whence we acquired it. To complement the first dataset with benign apps, we randomly selected and downloaded 1,800 apps from the Google Play store in December 2016. Consequently, we refer to this dataset as Malgenome+GPlay in the following sections.


The second dataset, Piggybacking [13], is more recent and comprises around 1,300 pairs of original apps along with their repackaged versions. The process of gathering the apps, labeling them as malicious and benign, and matching apps to their repackaged counterparts was carried out between 2014 and 2017. The dataset we used contains 1,355 original, benign apps and 1,399 repackaged, malicious apps. The reason behind such imbalance is that some original apps have more than one repackaged version.

Lastly, the two datasets were used separately during evaluation. Consequently, some of the experiments we conducted have two variants (i.e., one for each dataset).

4.2 Experiments

To answer the aforementioned research questions, we designed two sets of experiments. The first set of experiments is meant to decide upon the datasets to use in evaluating the proposed active learning approach, and to set a performance benchmark against which the proposed approach is to be compared. The benchmark is set using results achieved by conventional analysis and detection methods that have previously been utilized to detect Android (repackaged) malware. We refer to this set of experiments as preliminary experiments. The second set of experiments evaluates the performance of our proposed active learning approach.

Both experiment sets share a common workflow, seen in figure 3. All experiments commence with randomly splitting a dataset of APKs into a training dataset (TR) comprising two thirds of the total number of APKs and a test dataset (TE) comprising the remaining one third. The second stage initializes different variables, including an initial F1 score of −1 and a variable to keep track of the current iteration's number (i.e., t). The third stage is responsible for extracting a vector of numerical features (xi) for each APK (ai) in both the training and test datasets (TR ∪ TE). The method used to extract such numerical features hinges on the type of the experiment run. That is to say, if the experiment is dynamic, then the APKs will be installed on an Android Virtual Device (AVD), run using Droidutan, and monitored using droidmon. However, if the experiment is static, the APKs will be statically analyzed using androguard. Note that for dynamic experiments, droidmon may fail to produce API traces for an unforeseeable number of APKs due to either (a) the app crashing during runtime, or (b) the app not using any of the API calls monitored by the tool. The third phase concludes with producing feature vectors of numerical features for the APKs in (TR) and (TE). The feature vectors are stored in two datasets (|XR| ≤ |TR|) and (|XE| ≤ |TE|). Stage four uses the feature vectors in (XR) to train a majority vote classifier (Et: the Ensemble classifier trained at iteration t), as discussed in section 3.2. In stage five, the trained classifier is validated using the vectors from (XR) and tested using the vectors from (XE), yielding a corresponding training F1 score (F1Rt) and test F1 score (F1Et).

Figure 3: The workflow of our experiments. Ellipses depict inputs and outputs, rectangles depict operations performed on such inputs and outputs, and rhombi depict boolean decisions. The edges between different nodes indicate the directionality of data flow. Lastly, the blue numbers are used as references to different stages of the experiment. Preliminary experiments, whether static or dynamic, conclude after stage (5) and are, hence, highlighted using the dashed gray box.

Preliminary experiments conclude after stage five, storing the results in a SQLite database for further study. Active learning experiments compare the training accuracy (F1Rt) scored at the current iteration (t) to that scored at the previous iteration (F1Rt−1). Test accuracies are not used in such comparison to prevent the test dataset (i.e., containing out-of-sample apps) from affecting the training process. This comparison always evaluates to true after the first iteration, especially since the training F1 score is initialized to −1. For values of (t > 1), if the absolute difference between (F1Rt) and (F1Rt−1) is less than or equal to a threshold percentage (E) (default is 1%), the apps corresponding to misclassified feature vectors will be re-run in a manner similar to stage three. Intuitively, we continue running the experiment as long as the current training F1 score (F1Rt) is increasing. In other words, there is potential for achieving better training, and perhaps test, scores. Furthermore, in order to accommodate fluctuations in training accuracies, we allow the current training F1 score (F1Rt) to drop by a maximum of (E). This means, however, that the experiment could go on for a prolonged period of time. To avoid this behavior, we set an upper bound (tMAX) on the number of iterations the training phase is allowed to run. In evaluating our approach, we use the values of 10 and 1% for (tMAX) and (E), respectively.
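The termination criterion can be summarized in a few lines. The sketch below reflects the constraints above (continue while the training F1 score rises, or drops by at most E, up to tMAX iterations), with the F1 history seeded at −1 so the first comparison always passes.

    # A sketch of the termination check used by our active learning
    # experiments (tMAX = 10, E = 1%). f1_history holds the training F1
    # scores across iterations, seeded with the initial value of -1.
    def should_continue(f1_history, t, t_max=10, epsilon=0.01):
        if t > t_max:
            return False  # upper bound on iterations reached
        return f1_history[-1] - f1_history[-2] >= -epsilon

    # Example: should_continue([-1, 0.64], t=2)  -> True  (score rose)
    #          should_continue([0.70, 0.64], t=3) -> False (dropped > 1%)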

Regardless of the experiment set, a dataset is randomly split into training and test datasets at the beginning of each experiment. The training and test datasets persist throughout the experiment (i.e., through all iterations). Consequently, we ran all experiments at least 25 times to increase the likelihood of different segments of the datasets being used as both training and test samples. Throughout implementing and evaluating Aion, we generated a multitude of figures, which we could not all include in this paper. We host and make available all our (interactive) figures, data, and source code on [?].

4.2.1 Preliminary Static Experiments

In this set of experiments, only the three categories of static features, introduced in section 3.2, are used to classify apps as malicious or benign. That is to say, each app is represented by three feature vectors depicting such categories. Moreover, we combine the three feature vectors to come up with a fourth category of static features that we refer to as All.

The scores in figure 4 depict the median F1 scores recorded by different classifiers using all four categories of static features on the Malgenome+GPlay dataset and its more recent counterpart, Piggybacking. To preserve space for a more thorough discussion, we opted to move the specificity scores to appendix C. We speculated that solely using static features to classify repackaged malware would yield mediocre classification accuracies, especially since repackaged malware usually poses as benign apps. Nevertheless, as seen in figure 4a, most classifiers perform well on Malgenome+GPlay using static features, apart from permission-based features, which apparently are not informative enough to separate malicious and benign apps in this dataset. We noticed, however, that the classifiers comparably underperform, in terms of both the F1 and specificity scores, upon applying the same method to classify apps in the Piggybacking dataset.

Figure 4: The median (after 25 runs) F1 scores (Y-axis) recorded by 13 classifiers (X-axis) using static features on the test datasets: (a) Malgenome+GPlay median F1 scores; (b) Piggybacking median F1 scores. The specificity scores are plotted in appendix C.

We believe there are two reasons behind such a noticeable difference. Firstly, the malicious apps in our hybrid Malgenome+GPlay dataset were gathered between August 2010 and October 2011 [35], whereas their benign counterparts were gathered in December 2016. Given that the Android ecosystem is continuously evolving, any two apps developed 5 years apart might look very different courtesy of newly-introduced technologies, programming interfaces, or even hardware capabilities. In fact, we noticed that there is a large difference between the sizes of APK archives belonging to apps in the Malgenome dataset (average size: 1.25MB) and those belonging to the benign apps we gathered from Google Play (average size: 16.35MB). This difference implies, we believe, the existence of more components (e.g., activities and their classes) in recent Android apps. Static features are built on counts and ratios of those components, the permissions they request, and the API calls they issue. Needless to say, the feature vectors extracted from apps developed in 2011 would look entirely different from those extracted from apps developed in 2015 or 2016, effectively facilitating segregating them into two classes which happen to be malicious and benign.

Secondly, despite being defined as repackaged malware, the malicious apps in Malgenome do not comply with the more recent definition of repackaged/piggybacked apps. In other words, we believe that the 86% of apps comprising repackaged malware in the Malgenome dataset are, in fact, closer to being classic Trojan horses than to being repackaged malware. The malicious apps in the Piggybacking dataset, however, comply with the definition of repackaged malware (i.e., standalone benign apps grafted with malicious payloads [13]). To empirically verify this claim, we used a compiler fingerprinting tool, APKiD, to identify the compilers used to compile the apps in both datasets, Malgenome+GPlay and Piggybacking, along with a dataset of malicious apps called Drebin [5]. The distribution of compilers utilized by apps in such datasets is tabulated in table 1. Using the list of compilers, we rely on the following hypothesis, originally made in [25]: if the compiler used to compile an APK is not the standard Android SDK compiler (dx), but rather a compiler used by reverse engineering tools (e.g., Apktool), such as dexlib, then the app was probably repackaged by a party other than the legitimate developers in possession of the app's original source code.

Table 1: Percentages of compilers used to compile APKs in different malicious/benign datasets as fingerprinted by APKiD. The dexmerge compiler is used by IDEs after dx for incremental builds.

Dataset                   | dx   | dexmerge | dexlib 1.X/2.X | Total
Drebin                    | 84%  | –        | 16%            | 4326
Malgenome                 | 52%  | –        | 48%            | 1234
Google Play               | 61%  | 34%      | 5%             | 1882
Piggybacking (malicious)  | 22%  | 6%       | 72%            | 1399
Piggybacking (original)   | 61%  | 22%      | 17%            | 1355

As seen in table 1, the majority of benign apps gathered from Google Play and the original apps in the Piggybacking dataset have been compiled using a compiler usually used within IDEs. This implies that the developers were likely in possession of the apps' source code. The same applies to the Drebin dataset, which mainly comprises malicious apps developed from scratch. However, the datasets that presumably comprise repackaged malware indicate a major difference. More than half of the Malgenome dataset comprises apps that, according to the definitions in sections 1 and 2, are Trojans, whereas the majority of malicious apps in the Piggybacking dataset comply with the notion of a benign app being grafted with a malicious payload and recompiled using non-standard compilers.

Such findings led us to deem the Malgenome dataset obsolete and inaccurate. In fact, during their analysis of a dataset of 24,650 Android malware samples, Wei et al. also deemed the Malgenome dataset outdated and not representative of the current Android malware landscape [29]. Thus, devising methods to analyze and detect malicious apps in the Malgenome dataset has little benefit to the community. Consequently, we opted to conduct the remainder of our experiments on the Piggybacking dataset, and to only consider the static scores recorded on such dataset as our benchmark.

4.2.2 Preliminary Dynamic Experiments

The preliminary dynamic experiments are meant to resemble conventional dynamic approaches to stimulating and detecting Android repackaged malware. That is to say, apps are stimulated once prior to being processed for feature extraction and classification. In essence, such experiments depict the first iteration of our active learning experiments (i.e., without the feedback loop). The features used during such experiments are the dynamic features introduced in section 3.2, along with a combination of the All static and dynamic features that we refer to as hybrid features.
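In the sketch below, the hybrid combination is assumed to be a simple concatenation of the All static vector and the 37-attribute dynamic vector; this is an assumption, since the text does not specify the combination operator.

    import numpy as np

    # A sketch of the hybrid feature construction, under the assumption
    # that "combination" means concatenating the All static vector with
    # the 37-attribute dynamic vector.
    def hybrid_features(static_all, dynamic):
        return np.concatenate([static_all, dynamic])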

Figure 5 depicts the median F1 scores achieved by different classifiers on the Piggybacking dataset after 25 runs using dynamic and hybrid features. As a reference to the performance of static features, we also plot the scores achieved by such classifiers using permission-based static features, which achieved the highest scores on the Piggybacking dataset. In figure 5, dynamic features appear to be incapable of matching the detection ability of static features. One can also speculate that combining dynamic features with static features (i.e., hybrid features) decreases the latter's detection capability. Nevertheless, dynamic features perform better than, or only slightly worse than, other features in correctly classifying benign apps. To conclude, the preliminary dynamic experiments imply that dynamic features extracted after running the apps only once are initially biased towards classifying all apps as benign. We argue that stimulating a repackaged malware once is unlikely to execute the paths under which the grafted malicious payload resides, ultimately leading to the misclassification of such app. In the next set of experiments, we attempt to verify whether re-stimulating the misclassified apps yields different paths that enhance the performance of the detection module and its classifiers.

Figure 5: The median (after 25 runs) F1 scores (Y-axis) recorded by 13 classifiers (X-axis) using dynamic and hybrid features versus the best scoring static features (permission-based) on the test datasets of Piggybacking. The specificity scores are plotted in appendix C.

4.2.3 Active Learning Experiments

Using the Piggybacking dataset, we ran a total of 25 active learning experiments, during which 4 AVDs were simultaneously used to stimulate apps using Droidutan, each for 60 seconds. The misclassified apps were allowed to be re-stimulated until either the F1 score of the Ensemble classifier dropped by more than a threshold of 1% between two iterations, or an upper bound of re-stimulations was reached. We experimented with such upper bound, and found that with 10 iterations, an experiment takes on average 26 hours to complete. Given that increasing such number did not have a noticeable effect on the achieved scores, and that it substantially increased the time taken to complete one run of the experiment, we adopted an upper bound of 10 iterations.

The scores of the lowest scoring classifier, 500-Nearest Neighbors, the highest scoring classifier, a Random Forest of 100 trees, and the Ensemble classifier are plotted in figure 6. As in the preliminary dynamic experiments, the figures depict the median F1 and specificity scores achieved by the classifiers at every iteration using dynamic, hybrid, and static permission-based features. The static scores are plotted as straight lines, because static experiments did not comprise any iterations. Thus, the scores achieved by a classifier persist across all iterations.

We made the following observations from the plotted figures. Firstly, regardless of the used classifier, dynamic features fail to match or outperform their static counterparts. However, combining static and dynamic features boosts the performance of the latter by around 7%. In other words, dynamic features are incapable of matching static features on their own.

Secondly, the iteration at which the maximum F1 score is achieved depends on the utilized classifier and feature type. Furthermore, both the F1 and specificity scores fluctuate across iterations, and do not maintain any specific ascending or descending pattern.

Lastly, regardless of the utilized feature type, some classifiers maintain a more stable performance on both the benign and malicious apps than others. For example, for the majority of iterations, the difference between the F1 and specificity scores achieved by the K-Nearest Neighbors classifier in figure 6a is around 20% and 25% for the dynamic and hybrid features, respectively. This behavior indicates bias towards one class in favor of the other. The Ensemble and Random Forest classifiers maintain more balanced performances on malicious and benign apps, which manifests in closer scores across different iterations.

5 Discussion

In this section, we attempt to draw conclusions from the conducted experiments to answer the research questions posed earlier.

Figure 6: The median (after 25 runs) F1 and specificity scores (Y-axis) recorded at each iteration (X-axis) using dynamic and hybrid features versus the best scoring static features (permission-based) on the test datasets of Piggybacking: (a) K-NN (K=500); (b) Ensemble; (c) Random Forest (T=100).

With regard to research question Q1, we define the effectiveness of an active learning classifier in terms of the F1 and specificity scores it achieves using dynamic and hybrid features on the training datasets, in comparison to the scores achieved by the same classifier during the static and dynamic preliminary experiments. We also consider the classifier's ability to avoid bias towards a specific class as a measure of its effectiveness. We noticed that the aforementioned scores differ from one classifier to another. For example, using dynamic features, the active-learning-trained 500-Nearest Neighbors classifier scored F1 and specificity scores significantly lower than what it scored during the static preliminary experiments. Furthermore, it was biased towards classifying training apps as benign, which is noticeable via its specificity score in comparison to its F1 score. However, the Ensemble and Random Forest classifiers managed to record F1 and specificity scores that are (a) greater than the scores they recorded during the static and dynamic preliminary experiments, and (b) indicative of low bias, given how close the two scores are. The scores achieved by all classifiers during the different types of experiments can be found on Aion's website [?].

Similarly, the efficiency of active learning classifiers differed from one classifier to another. We define efficiency in terms of the number of iterations it takes a classifier to record its highest F1 and specificity scores on the training dataset. The 500-Nearest Neighbors classifier, for instance, achieved its highest scores at the ninth iteration, whereas the Ensemble and Random Forest (with 100 trees) classifiers recorded their highest scores at the second and third iterations, respectively. We averaged this number across all classifiers, and noticed that it takes around five iterations for an active learning classifier to achieve its maximum training accuracy. As discussed earlier, we consider this number to be the number of iterations required by a classifier to have a comprehensive view of the malicious and benign behaviors within the training dataset.

The reliability of the training scores achieved by active learning classifiers, which is the concern of Q2, refers to the quality of the training data used to train such classifiers. In other words, we want to make sure that the classifiers are being trained using traces and features that reflect the malicious and benign behaviors exhibited by current Android repackaged malware and apps, respectively. Otherwise, the classifiers will be trained using inaccurate data that might yield misleading results.

To illustrate this concern, consider the CFG in figure 2. We discussed that, ideally, a repackaged malware (ai) should be classified as malicious if and only if the malicious path (P2) was executed. With this research question, we reason about whether the high F1 scores achieved during training by classifiers such as Random Forests honor this condition. In other words, how likely is it that a classifier deems the app (ai) benign, given that a path other than (P2) has executed? Unfortunately, as discussed earlier, we do not possess any ground truth about the paths that reveal the malicious behaviors injected within apps labeled as repackaged malware. Consequently, we can only speculate about the reliability of the results. However, we reasoned about the scenarios under which this behavior could be observed (i.e., ai is classified as malicious based on benign paths, such as P1). Firstly, and presumably less likely, the original labels of the apps in the dataset may turn out to be wrong (i.e., ai is a benign or harmless app mistakenly labeled as malicious). Secondly, (P1) could in fact turn out to be a malicious behavior that we were not aware of and asserted to be benign. Thirdly, within (ai), the path (P1) could be benign; however, it may have been utilized within a malicious context in other apps in the training dataset, which obliges a classifier to deem it malicious. In the fourth scenario, the path (P1) could be a benign path that is rarely used by benign apps (i.e., anomalous), which encourages the classifier to deem it malicious as well. Lastly, there is the possibility of utilizing poor classifiers, features, or learning processes, which yields mediocre, unreliable classification results.


Considering an extreme scenario, assume that all repackaged malware instances in a training dataset have been correctly classified despite the fact that none of their malicious paths has executed. For this condition to hold, we need to assume that all the benign paths deemed malicious by a classifier are anomalous (i.e., seldom encountered in benign apps), which is highly unlikely given the limited number of ways in which a task (e.g., sending a text message) can be achieved. Moreover, if a classifier mistakenly deems the majority of benign paths malicious, this behavior should be traceable via the specificity scores, which should plummet to minimum values.

To empirically complement the assumptions above, we manually went through the repackaged malware test traces generated by droidmon during the 25th run of the active learning experiments. Around 75% of such test traces were correctly classified using the Random Forest (T=100) classifier trained using active learning. During our analysis, we found evidence that the majority of the behaviors based on which the apps were classified as malicious are, in fact, intrinsically malicious. For example, we found that a repackaged version of a simulation gaming app (i.e., com.happylabs.happymall) was classified as malicious based on a trace that included retrieving the device's IMEI, IMSI, and network operator. Furthermore, the same trace included the decryption of a URL that pointed to http://csapi.adfeiwo.com:9999. Needless to say, such behavior is not expected from a gaming app.

We also examined traces of repackaged malware misclassified as benign apps. We found that the majority of such misclassified traces exhibited benign behaviors. The trained classifier, thus, reasonably assigned them to the benign class. This observation is expected, given that the test apps were stimulated using the primitive tool Droidutan, which is unlikely to execute the malicious behaviors within repackaged malware after one execution. Nevertheless, upon submitting the hashes of the apps corresponding to the examined traces to VirusTotal [3], we noticed that a noticeable number of the malicious apps classified as benign were also deemed benign by the majority of antiviral software. For example, the repackaged apps com.sojunsoju.go.launcherex.theme.greenlantern and com.gametowin.save_monster_Google were labeled as potentially unwanted by only two out of more than 50 antiviral software, whereas a repackaged version of com.gccstudio.DiamondCaveLite was labeled as benign by all antiviral software on VirusTotal. The latter observation raises the question about the accuracy of the labels in current Android repackaged malware datasets, as discussed earlier.
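Such a label check can be automated. The sketch below queries VirusTotal's public v2 file-report endpoint for a sample's detection ratio; the API key is a placeholder, and the public API's rate limits are ignored for brevity.

import requests

API_KEY = "YOUR_VT_API_KEY"  # placeholder; a real key is required

def vt_detection_ratio(file_hash):
    # Returns (positives, total) for a known sample, or None if
    # VirusTotal has no report for this hash.
    resp = requests.get(
        "https://www.virustotal.com/vtapi/v2/file/report",
        params={"apikey": API_KEY, "resource": file_hash},
    )
    report = resp.json()
    if report.get("response_code") != 1:
        return None
    return report["positives"], report["total"]

# e.g., vt_detection_ratio("<sha256 of a repackaged APK>") -> (2, 55)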

In the absence of the ground truth necessary to verify the reliability of the achieved scores, and based on our previous discussion and observations, our answer to Q2 is that we believe the scores achieved by the active learning classifiers are likely to be reliable.

The active learning process we adopt to train classifiers might lead to overfitting. In other words, the trained classifier would not be able to correctly classify a large number of test apps (i.e., apps not used during the training phase). Research question Q3 focuses on this issue. Similar to the training phase, we noticed that some classifiers underperform in comparison to conventional training methods, while others can match and outperform them. For example, using hybrid features, Random Forest classifiers trained using active learning manage to outperform conventional training methods using static, dynamic, and hybrid features. Furthermore, they avoid bias towards a certain class by maintaining a good balance between the F1 and specificity scores. Observing the performances of different active learning classifiers, we conclude that Random Forest classifiers are able to achieve the best F1 and specificity scores during both training and test phases, specifically using hybrid features, which is the answer to Q4.
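For concreteness, the best-performing configuration can be sketched as follows. The data and feature dimensions are placeholders, and "hybrid" is taken here to mean concatenating each app's static and dynamic feature vectors, which is our reading of the setup rather than a detail stated in the paper.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_static = rng.random((200, 40))   # placeholder static features
X_dynamic = rng.random((200, 37))  # placeholder dynamic features
y = rng.integers(0, 2, 200)        # 1 = malicious, 0 = benign

# Hybrid features: static and dynamic vectors concatenated per app.
X_hybrid = np.hstack([X_static, X_dynamic])

clf = RandomForestClassifier(n_estimators=100)  # the T=100 configuration
clf.fit(X_hybrid, y)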

6 Related Work

To the best of our knowledge, there are no research efforts that attempt to apply active learning to the problem of stimulating and detecting Android repackaged malware. However, there are indeed efforts that study repackaged malware/piggybacked apps and attempt to stimulate and detect them. Moreover, we managed to identify research efforts that apply active learning to detecting Android malware in general. Consequently, we divide our related work section into two subsections that discuss those two dimensions.

6.1 Repackaged Malware Analysis and Detection

There are different efforts that attempt to detect repackaging in Android apps, including [23, 10, 34, 8]. Repackaged apps need not necessarily be malicious, however. An app could be repackaged for the sole purpose of injecting advertisement modules, or for translation purposes. In this paper, we focus on detecting repackaged malware, or piggybacked apps, that conform with Li et al.'s definition: benign apps that have been grafted with malicious payloads that are possibly protected with trigger conditions [13].

In [17], Pan et al. developed a program analysis technique that decides whether an app withholds hidden sensitive operations (HSO) (i.e., malicious payloads), effectively deeming it repackaged malware. Their technique, called HSOMiner, gathers static features from the app's source code that capture the code segments' relationships with their surrounding segments, the type of inputs they tend to check, and the operations they usually perform. The extracted features are used to train and validate an SVM that decides whether an app contains an HSO. Needless to say, if an app contains an HSO, it probably is malicious. Tian et al. implemented another static approach to detect Android repackaged malware based on code heterogeneity analysis [27]. The approach is based on extracting features that depict the dependence of code segments on one another. In essence, injected code segments should exhibit more heterogeneity with respect to the rest of the segments. That is to say, the malicious payloads injected into benign apps should be logically and semantically independent of the original app portions. Similar to our approach, they used the extracted features to train and compare the performance of four different classifiers viz., K-Nearest Neighbors, Support Vector Machine, Decision Tree, and Random Forest. Lastly, Shahriar et al. use a metric called Kullback-Leibler Divergence (KLD) to detect Android repackaged malware [22]. KLD is a metric depicting the difference between two probability distributions. In their paper, Shahriar et al. analyzed the Smali code of benign and malicious apps from the Malgenome dataset to build probability distributions of different Smali instructions. Their detection approach is based on the assumption that repackaged malware would have different distributions for different instructions than its benign counterpart.
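For reference, the KLD between an app's instruction distribution P and a reference distribution Q is the standard quantity

D_KL(P || Q) = Σ_x P(x) log (P(x) / Q(x)),

where x ranges over the instruction types; larger values indicate that P diverges more from Q. (The notation here is ours; Shahriar et al.'s exact formulation may differ in details such as the smoothing of zero counts.)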

6.2 Detection with Active Learning

Active learning has been applied to the problem of malware detection on the Windows operating system [16] and within the network intrusion detection domain [24]. In [32], Zhao et al. attempt to apply the same technique to the problem of Android malware detection. Despite adopting a similar architecture that comprises a characteristic (monitoring and extraction) module and a detection module, their work differs from ours in two main aspects. Firstly, Zhao et al. do not consider repackaged malware, but rather Android malware in general. Secondly, and more importantly, their definition and utilization of active learning is different from ours. In [32], active learning is defined as a technique to reduce the size of the training samples used to train an SVM classifier that deems an Android app, represented by a behavioral signature (i.e., an API call trace), malicious or benign. This reduction is carried out as follows. A pool of pre-generated and labeled behavioral signatures is built as a source of training data. Given an unlabeled, out-of-sample signature, a classifier is trained using the subset of the training pool that guarantees maximum classification accuracy. The main difference between this method and ours is that the classification accuracy in [32] solely depends on the combination of the training samples, whereas ours depends on the content of such samples, especially since the training vectors usually differ from one iteration to another. Furthermore, this approach to active learning is very likely to yield small training datasets that hinder the trained classifiers from achieving good classification accuracies on test datasets (i.e., generalization).

7 Conclusion

The lack of ground truth about the nature and location of malicious payloads injected into benign apps continues to hinder the development of effective methods to stimulate, analyze, and detect Android repackaged malware. We argue that training an effective classifier can be achieved if the classifier is allowed to request better representations of the training apps (i.e., via active learning).

Using a sample implementation of our proposed active learning architecture, Aion, we used active learning to stimulate, analyze, and detect Android repackaged malware. In [34], Zhou et al. managed to detect malware in the Malgenome dataset with a best case of 79.6% accuracy, highlighting the difficulty of detecting this breed of malware. In this paper, our active learning classifiers managed to achieve test F1 scores of 72% using the more recent Piggybacking dataset [13], which we consider a benchmark for future comparison.

We consider this effort the first step towards designing better app stimulation, analysis, and detection techniques that focus on Android repackaged malware, which continues to pose a serious threat [29]. Our future plan to enhance Aion has three dimensions. Firstly, we plan on using the number of iterations it takes to train a classifier and verifying whether it transfers to the number of executions needed to effectively classify a test app. Secondly, we plan to use more sophisticated stimulation engines and compare their effect on the training and test scores against the random, primitive technique used in this paper. Thirdly, we plan to utilize features and classifiers that capture the semantics of the runtime behaviors exhibited by apps under test.

References

[1] Apktool: https://goo.gl/eaewql.

[2] Droidutan: https://goo.gl/7xsczq.

[3] Virustotal: https://www.virustotal.com/en/search.

[4] ABRAHAM, A., ANDRIATSIMANDEFITRA, R., BRUNELAT, A., LALANDE, J.-F., AND VIET TRIEM TONG, V. GroddDroid: a Gorilla for Triggering Malicious Behaviors. In 10th International Conference on Malicious and Unwanted Software (Fajardo, Puerto Rico, 2015), IEEE Computer Society.


[5] ARP, D., SPREITZENBARTH, M., HUBNER, M., GASCON, H., AND RIECK, K. Drebin: Effective and explainable detection of android malware in your pocket. In NDSS (2014).

[6] ARSHAD, S., SHAH, M. A., KHAN, A., AND AHMED, M. Android malware detection & protection: a survey. International Journal of Advanced Computer Science and Applications 7 (2016), 463–475.

[7] BARTEL, A., KLEIN, J., MONPERRUS, M., ALLIX, K., AND LE TRAON, Y. Improving privacy on android smartphones through in-vivo bytecode instrumentation.

[8] DESHOTELS, L., NOTANI, V., AND LAKHOTIA, A. Droidlegacy: Automated familial classification of android malware. In Proceedings of ACM SIGPLAN on Program Protection and Reverse Engineering Workshop 2014 (2014), ACM, p. 3.

[9] FRATANTONIO, Y., BIANCHI, A., ROBERTSON, W., KIRDA, E., KRUEGEL, C., AND VIGNA, G. Triggerscope: Towards detecting logic bombs in android applications. In Security and Privacy (SP), 2016 IEEE Symposium on (2016), IEEE, pp. 377–396.

[10] HANNA, S., HUANG, L., WU, E., LI, S., CHEN, C., AND SONG, D. Juxtapp: A scalable system for detecting code reuse among android applications. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (2012), Springer, pp. 62–81.

[11] HUANG, C.-Y., TSAI, Y.-T., AND HSU, C.-H. Performance evaluation on permission-based detection for android malware. In Advances in Intelligent Systems and Applications-Volume 2. Springer, 2013, pp. 111–120.

[12] LI, L., BISSYANDE, T. F., PAPADAKIS, M., RASTHOFER, S., BARTEL, A., OCTEAU, D., KLEIN, J., AND LE TRAON, Y. Static analysis of android apps: A systematic literature review. Information and Software Technology (2017).

[13] LI, L., LI, D., BISSYANDE, T. F., KLEIN, J., LE TRAON, Y., LO, D., AND CAVALLARO, L. Understanding android app piggybacking: A systematic study of malicious code grafting. IEEE Transactions on Information Forensics and Security 12, 6 (2017), 1269–1284.

[14] LI, L., LI, D., BISSYANDE, T. F. D. A., KLEIN, J., CAI, H., LO, D., AND LE TRAON, Y. Automatically locating malicious packages in piggybacked android apps. In 4th IEEE/ACM International Conference on Mobile Software Engineering and Systems (2017).

[15] LUO, S., AND YAN, P. Fake apps: Feigning legitimacy, 2014.

[16] NISSIM, N., MOSKOVITCH, R., ROKACH, L., AND ELOVICI, Y. Novel active learning methods for enhanced pc malware detection in windows os. Expert Systems with Applications 41 (2014), 5843–5857.

[17] PAN, X., WANG, X., DUAN, Y., WANG, X., AND YIN, H. Dark hazard: Learning-based, large-scale discovery of hidden sensitive operations in android apps.

[18] RASTHOFER, S., ARZT, S., TRILLER, S., AND PRADEL, M. Making malory behave maliciously: Targeted fuzzing of android execution environments. In Proceedings of the 39th International Conference on Software Engineering (2017), IEEE Press, pp. 300–311.

[19] RASTHOFER, S., ASRAR, I., HUBER, S., AND BODDEN, E. How current android malware seeks to evade automated code analysis. In IFIP International Conference on Information Security Theory and Practice (2015), Springer, pp. 187–202.

[20] SANZ, B., SANTOS, I., LAORDEN, C., UGARTE-PEDRERO, X., BRINGAS, P. G., AND ALVAREZ, G. Puma: Permission usage to detect malware in android. In International Joint Conference CISIS'12-ICEUTE'12-SOCO'12 Special Sessions (2013), Springer, pp. 289–298.

[21] SATO, R., CHIBA, D., AND GOTO, S. Detecting android malware by analyzing manifest files. Proceedings of the Asia-Pacific Advanced Network 36 (2013), 23–31.

[22] SHAHRIAR, H., AND CLINCY, V. Kullback-leibler divergence based detection of repackaged android malware.

[23] SHAO, Y., LUO, X., QIAN, C., ZHU, P., AND ZHANG, L. Towards a scalable resource-driven approach for detecting repackaged android applications. In Proceedings of the 30th Annual Computer Security Applications Conference (2014), ACM, pp. 56–65.

[24] STOKES, J. W., PLATT, J. C., KRAVIS, J., AND SHILMAN, M. Aladin: Active learning of anomalies to detect intrusion, 2008.

[25] STRAZZERE, T. Detecting pirated and malicious android apps with apkid, 2016.

[26] TAM, K., FEIZOLLAH, A., ANUAR, N. B., SALLEH, R., AND CAVALLARO, L. The evolution of android malware and android analysis techniques. ACM Computing Surveys (CSUR) 49, 4 (2017), 76.

[27] TIAN, K., YAO, D., RYDER, B. G., AND TAN, G. Analysis of code heterogeneity for high-precision classification of repackaged malware. In Security and Privacy Workshops (SPW), 2016 IEEE (2016), IEEE, pp. 262–271.

[28] TONG, S. Active learning: theory and applications. Stanford University, 2001.

[29] WEI, F., LI, Y., ROY, S., OU, X., AND ZHOU, W. Deep ground truth analysis of current android malware. In International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (2017), Springer, pp. 252–276.

[30] WU, D.-J., MAO, C.-H., WEI, T.-E., LEE, H.-M., AND WU, K.-P. Droidmat: Android malware detection through manifest and api calls tracing. In Information Security (Asia JCIS), 2012 Seventh Asia Joint Conference on (2012), IEEE, pp. 62–69.

[31] XUE, L., ZHOU, Y., CHEN, T., LUO, X., AND GU, G. Malton: Towards on-device non-invasive mobile malware analysis for art.

[32] ZHAO, M., ZHANG, T., GE, F., AND YUAN, Z. Robotdroid: A lightweight malware detection framework on smartphones. Journal of Networks 7, 4 (2012), 715–722.

[33] ZHAUNIAROVICH, Y., AHMAD, M., GADYATSKAYA, O., CRISPO, B., AND MASSACCI, F. StaDynA: Addressing the Problem of Dynamic Code Updates in the Security Analysis of Android Applications. In Proceedings of the 5th ACM Conference on Data and Application Security and Privacy (2015), CODASPY '15, ACM, pp. 37–48.

[34] ZHOU, W., ZHOU, Y., JIANG, X., AND NING, P. Detecting repackaged smartphone applications in third-party android marketplaces. In Proceedings of the second ACM conference on Data and Application Security and Privacy (2012), ACM, pp. 317–326.

[35] ZHOU, Y., AND JIANG, X. Dissecting android malware: Characterization and evolution. In Security and Privacy (SP), 2012 IEEE Symposium on (2012), IEEE, pp. 95–109.


Appendix: Features

A Static Features

In this section, we list the categories of static features extracted from the apps' APKs and utilized in both the preliminary and active learning experiments.

Table 2: The static features used during the preliminarystatic experiments.

Category     Feature
basic        Min. SDK version
             Max. SDK version
             Total # of activities
             Total # of services
             Total # of broadcast receivers
             Total # of content providers
permission   Total requested permissions
             Android permissions ÷ Total permissions
             Custom permissions ÷ Total permissions
             Dangerous permissions ÷ Total permissions
API          Counts of API categories listed here (Total: 27)
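The basic and permission feature groups above can be extracted directly from an app's package. The sketch below is hedged: the paper does not name the APK parser Aion uses, so androguard is assumed purely for illustration, and the dangerous-permission ratio is omitted because it additionally requires a fixed list of dangerous permissions.

from androguard.core.bytecodes.apk import APK

def basic_and_permission_features(apk_path):
    apk = APK(apk_path)
    perms = apk.get_permissions()
    android_perms = [p for p in perms if p.startswith("android.permission.")]
    total = max(len(perms), 1)  # guard against division by zero
    return [
        int(apk.get_min_sdk_version() or 0),
        int(apk.get_max_sdk_version() or 0),
        len(apk.get_activities()),
        len(apk.get_services()),
        len(apk.get_receivers()),
        len(apk.get_providers()),
        len(perms),                                 # total requested permissions
        len(android_perms) / total,                 # Android / total
        (len(perms) - len(android_perms)) / total,  # custom / total
    ]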

B Dynamic Features

The following table lists the API categories used to extract dynamic features. The total number of hooked API methods is 71, too many to list in this paper. The exact list of API methods we hook can be found here.

API Category                                       Total Hooked
android.accounts.AccountManager                    2
android.app.Activity                               1
android.app.ActivityManager                        1
android.app.ActivityThread                         1
android.app.ApplicationPackageManager              2
android.app.ContextImpl                            1
android.app.NotificationManager                    1
android.app.SharedPreferencesImpl$EditorImpl       5
android.content.BroadcastReceiver                  1
android.content.ContentResolver                    4
android.content.ContentValues                      1
android.location.Location                          2
android.media.AudioRecord                          1
android.media.MediaRecorder                        1
android.net.ConnectivityManager                    1
android.net.wifi.WifiInfo                          1
android.os.Debug                                   1
android.os.Process                                 1
android.os.SystemProperties                        1
android.telephony.SmsManager                       2
android.telephony.TelephonyManager                 11
android.util.Base64                                3
android.system.BaseDexClassLoader                  3
android.system.DexClassLoader                      3
android.system.DexFile                             2
android.system.PathClassLoader                     3
java.io.FileInputStream                            1
java.io.FileOutputStream                           1
java.lang.ProcessBuilder                           1
java.lang.Runtime                                  1
java.lang.reflect.Method                           1
java.net.URL                                       1
javax.crypto.Cipher                                1
javax.crypto.Mac                                   1
javax.crypto.spec.SecretKeySpec                    5
libcore.io.IoBridge                                1
org.apache.http.impl.client.AbstractHttpClient     1

Table 3: The API categories that comprise the dynamic features used in the preliminary dynamic and active learning experiments.
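As an illustration of how a droidmon trace could be mapped onto the per-category counts of Table 3 above, consider the minimal sketch below. The trace format is an assumption: one JSON object per hooked call, with the hooked class recorded under a "class" key, which may differ from droidmon's actual output.

import json
from collections import Counter

# The 37 API categories of Table 3, in a fixed order.
API_CATEGORIES = [
    "android.accounts.AccountManager",
    "android.telephony.TelephonyManager",
    # ... the remaining categories of Table 3
]

def dynamic_feature_vector(trace_path):
    counts = Counter()
    with open(trace_path) as fh:
        for line in fh:
            call = json.loads(line)
            counts[call.get("class", "")] += 1
    # One count per API category, in the fixed order above.
    return [counts[c] for c in API_CATEGORIES]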

C Figures

This section contains the plots of the specificity scores achieved by various classifiers during the static and dynamic variations of the preliminary experiments.


[Plots omitted. Figure 7: The median (after 25 runs) specificity scores (Y-axis) recorded by 13 classifiers (X-axis: K10–K500, SVM, T10–T100, and En) using static features (Basic, Perm, API, and All specificity) on the test datasets. (a) Malgenome median specificity; (b) Piggybacking median specificity.]

[Plot omitted. Figure 8: The median (after 25 runs) specificity scores (Y-axis) recorded by 13 classifiers (X-axis) using dynamic and hybrid features versus the best scoring static features (permission-based) on the test datasets of Piggybacking.]
