delving into android malware families with a novel...
TRANSCRIPT
Research ArticleDelving into Android Malware Families with a NovelNeural Projection Method
Rafael Vega Vega 1 Heacutector Quintiaacuten1 Carlos Cambra 2 Nuntildeo Basurto 2
Aacutelvaro Herrero 2 and Joseacute Luis Calvo-Rolle 13
1University of A Coruna Departamento de Ingenierıa Industrial Avda 19 de febrero sn 15495 Ferrol A Coruna Spain2Grupo de Inteligencia Computacional Aplicada (GICAP) Departamento de Ingenierıa Civil Escuela Politecnica SuperiorUniversidad de Burgos Av Cantabria sn 09006 Burgos Spain3Research Institute of Applied Sciences in Cybersecurity (RIASC) Spain
Correspondence should be addressed to Rafael Vega Vega rafaelalejandrovegavegaudces
Received 5 December 2018 Accepted 23 January 2019 Published 2 June 2019
Guest Editor Alicja Krzemien
Copyright copy 2019 RafaelVegaVega et alThis is an open access article distributed under theCreativeCommonsAttribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited
Present research proposes the application of unsupervised and supervised machine-learning techniques to characterize Androidmalware families More precisely a novel unsupervised neural-projection method for dimensionality-reduction namely BetaHebbian Learning (BHL) is applied to visually analyze such malware Additionally well-known supervised Decision Trees (DTs)are also applied for the first time in order to improve characterization of such families and compare the original features that areidentified as the most important onesThe proposed techniques are validated when facing real-life Androidmalware data bymeansof the well-known and publicly available Malgenome dataset Obtained results support the proposed approach confirming thevalidity of BHL and DTs to gain deep knowledge on Android malware
1 Introduction and Previous Work
Undoubtedly smartphones are one of the emerging tech-nologies that have revolutionized the use of computingsystems From the very beginning (late 1990s) more andmore smartphones are sold every year and it is expectedthat the number of smartphone users passes the 27 billionmark by 2019 [1] Although there is a variety of operatingsystems for such devices Googlersquos Android is themost widelyused one [1] and consequently the number of Android usershas permanently increased Concurrently the number ofAndroid apps strongly increased in the last years but it startedto decline from 36 million in March 2017 (highest value) to26 million in September 2018 [2]
From the security standpoint one of the main problemsof smartphone apps is malware that is included in software ingeneral and in these apps in particular Furthermore ldquousers ofmobile devices are increasingly subject to malicious activitypushing malware appsrdquo [3] It is true that some effort hasbeen devoted by Google to remove and prevent malicious
apps from Google Play Market but malware is still there [3]Moreover malware Android apps are increasing in the thirdtrimester of 2018 there has been an increase of 17 milliondetections [4]
As it can be seen privacy and security of smartphones stillare open challenges [5] and many researchers are workingon this topic To better fight against malware and be ableto develop an effective solution understanding it and itsnature is required [6] In keeping with this idea presentpaper proposes getting deeper knowledge about Androidmalware for its better detection More precisely both super-vised (Decision Trees) and unsupervised (Neural ProjectionMethod)machine-learning techniques are applied to increaseour knowledge about the main families of Android malwareIn order to validate the proposed techniques they are appliedto the well-known Malgenome dataset [7] that is open andreal-life
This pioneering work on collecting Android malwarefound some interesting statistics [6]motivating further analy-sis of malware 367 of the collected apps leverage root-level
HindawiComplexityVolume 2019 Article ID 6101697 10 pageshttpsdoiorg10115520196101697
2 Complexity
exploits to fully compromise the security of the smartphoneand more than 90 of the apps tried to turn the smartphoneinto a botnet controlled through network or short messages
To improve present knowledge of Android malwarefamilies a novel neural-projection technique from the familyof Exploratory Projection Pursuit (EPP) techniques namedBeta Hebbian Learning (BHL) [8] is applied Obtainedresults are then compared to those from several differentDecision Trees (DTs) [9] when trying to predict the malwarefamily from apps features
Each app (data sample) that was collected for theMalgenome dataset is defined as a set of certain featuresusing a binary representation Apps were grouped accordingto the family they belong to and features were recalculatedfor the whole family taking into account which features werepresent in the given apps The generated high-dimensionalspace is then analysed by means of BHL in order to revealthe inner structure of the dataset Obtained projections areconsequently scrutinized to get further knowledge aboutthe app features that define the organization of the data indifferent groups and subgroups For comparison purposesDTs have been additionally generated on the same featuresset in order to know the features that better discriminatebetween the different malware families
A variety of problems have been addressed by artificialneural networks in recent decades [10ndash14] More preciselyneural projection models have been previously applied to awide variety of security datasets including network traffic[15 16] SQL code [17 18] and HTTP traffic [19] Similarlyfrom a more general perspective different machine learningsolutions have been proposed to differentiate between legiti-mate and malicious apps [20ndash22]
Visualization techniques have been previously appliedto this problem of analyzing malware [23ndash29] Howeverfew dimensionality-reduction techniques have been appliedto Android apps in order to detect malware Pythagorastree fractal visualization is proposed in [25] being all appsscattered as leaves in the tree Graphs for deciding aboutmalicious apps by depicting lists of malicious methodsneedless permissions and malicious strings were proposedin [26] Biclustering on permission information was used togenerate visualization in [27] while behavior-related dendro-grams are generated out of malware traces in [28] In thelater different pieces of information are analysed includingnodes related to the package name of the application theAndroid components that has called the API call and thenames of functions and methods invoked by the applicationDifferentiating from previous work in present paper a novelneural projection technique is applied for the first time to thecharacterization of Androidmalware [8 24 30] Apps are notanalysed one by one but family-level is considered insteadAdditionally DTs are applied for the first time in order toimprove characterization of such families
The rest of this paper is organized as follows initially BHLand DTs are presented and the analyzed dataset is describedin the following section Then the proposed experimentsare introduced and the obtained results are analyzed inSection 3 Finally conclusions and future work are presentedin Section 4 of the paper
2 Materials and Methods
In present research the EPP BHL algorithm [8] has beenapplied to a dataset of malware families with the aim ofidentifying the internal structure of such dataset and findingfamilies of malware with similar characteristicsThe obtainedresults have been compared with a well-known predictionalgorithm (DT) [9] to validate the BHL results regardingthe most relevant features to briefly characterize Malwarefamilies
21 Beta Hebbian Learning The Beta Hebbian Learningtechnique (BHL) [8] is an unsupervised neural network fromthe family of EPP that employs the Beta distribution to updateits learning rule and fit the Probability Density Function(PDF) of the residual with the distribution of a given dataset
Thus if the PDF of the residuals is known the optimalcost function can be determined By using119861(120572 120573) parametersof the Beta distribution the residual (e) can be drawnwith thefollowing PDF
119901 (119890) = 119890120572minus1 (1 minus 119890)120573minus1 = (119909 minus 119882119910)120572minus1 (1 minus 119909 + 119882119910)120573minus1 (1)
where 120572 and 120573 are used to adjust the shape of the PDF ofthe Beta distribution 119909 is the input of the network 119890 is theresidual 119882 is the weight matrix and 119910 is the output of thenetwork
Then by using the following gradient descent is per-formed to maximize the likelihood of the weights
120597119901120597119882 = (119890120572minus2119895 (1 minus 119890119895)120573minus2 (minus (120572 minus 1) (1 minus 119890119895) + 119890119895 (120573 minus 1)))= (119890120572minus2119895 (1 minus 119890119895)120573minus2 (1 minus 120572 + 119890119895 (120572 + 120573 minus 2)))
(2)
In the case of BHL the learning rule allows for fitting thePDF of the residual by maximizing the likelihood of suchresidual with the current distribution
Therefore the neural architecture for BHL is defined asfollows
119865119890119890119889119891119900119903119908119886119903119889 119910119894 = 119873sum119895=1
119882119894119895119909119895 forall119894 (3)
119865119890119890119889119887119886119888119896 119890119895 = 119909119895 minus 119872sum119894=1
119882119894119895119910119894 (4)
119882119890119894119892ℎ119905119904119906119901119889119886119905119890 Δ119882119894119895= 120578 (119890120572minus2119895 (1 minus 119890119895)120573minus2 (1 minus 120572 + 119890119895 (120572 + 120573 minus 2))) 119910119894 (5)
22 Decision Trees Decision Trees (DTs) [9] are machine-learning algorithms widely used for prediction that haveproved their benefits in several real applications They can becategorized as supervised nonparametric inductive learningtechniques They are based on the construction of diagramsfrom a dataset in a similar way to prediction systems based
Complexity 3
Root node
Decision node
Decision nodeLeaf node
Leaf node
Branches Branches
BranchesBranches
Leaf node
BranchesBranches
Leaf node
Figure 1 Structure of decision trees
on rules which serve to represent and categorize a series ofconditions that occur repeatedly for the solution of a problem
The main objective of a classification DT is to dividea dataset into groups of samples as similar as possible inrelation to one of the features They are made of three mainelements root node (contains all samples of the dataset)decision nodes (represent a decision or rule) and leaf nodes(final label) A dataset is then classified based on subdivisionsof the DT nodes to reach one of the final (leaf) nodes whoselabel corresponds to a class (Figure 1)
Several algorithms have been proposed so far to buildDTs and their efficiency has been proved The most notableones [31] are ID3 (Iterative Dichotomiser 3) C45 (successorof ID3) CART (Classification and Regression Tree) CHAID(CHi-squared Automatic Interaction Detector) MARS andConditional Inference Trees Among all of them CART hasbeen selected in present work due to two main reasons thebinary nature of the dataset and the main objective of thestudy (to identify the most relevant features of the dataset)[31]
221 CART TheClassification and Regression Tree (CART)[9] is a binary tree so each decision node has two binarybranches determined by a splinting function obtained byprocessing variance function In order to build the tree thisCART algorithm takes 4 main steps [9]
(1) Build the decision tree splitting nodes according to agiven function
(2) Finish tree construction once the learning fits the stopcriteria
(3) Pruning the tree to avoid overfitting
(4) Select the best tree after pruning process
Originally the splitting function used by CART is theGini Index
119866119894119899119894 (119878) = 1 minus 119899sum119894=1
1199012119894 (6)
where 119878 is the dataset 119899 is the number of classes in thedataset and p is the probability of different classesThereforea Gini index of 0 means a 100 accuracy in predicting theclass
For comparison purposes two other splitting functionshave been applied in present paper Deviance (7) and Twoing(8)
119863119890V119894119886119899119888119890 (119878) = minus 119899sum119894=1
119901119894 log2 119901119894 (7)
Twoing is a splitting function different from Gini andDeviance Being 119871 119894 and 119877119894 there is fraction of membersof class 119894 in the left and right child nodes after a splitrespectively 119875119871 and 119875119877 are the fractions of observations thatsplit to the left and right respectivelyTherefore the functionto be maximized is the one in
119875119871119875119877( 119899sum119894=1
1003816100381610038161003816119871 119894 minus 1198771198941003816100381610038161003816)2
(8)
On the other hand in standard CART algorithm thesplit feature that is selected for a decision node is the onethat maximizes the split-criterion gain Once again for amore comprehensive comparison two other criteria havebeen applied for selecting split features curvature [32] andinteraction [33] These criteria can be defined as follows
(i) Curvature it is based on the null hypothesis of unas-sociated two features With these criteria the bestsplit predictor feature is the one that minimizes the
4 Complexity
significant p-values of curvature tests between eachfeature and the response variable Such a selection isrobust to the number of levels in individual features
(ii) Interaction it is based on the null hypothesis ofno interaction between the label and the predictorfeatures Therefore for deep decision trees standardCART tends to miss important interactions betweenpairs of features when there are also many other lessimportant features By means of this criterion thedetection of such important interactions is improved
23 Malgenome Dataset The dataset used in this researchhas been obtained from the Android Malware GenomeProject [7] which consists on 1260 Androidmalware samplesgrouped in a total of 49 malware families It was collectedfrom August 2010 to October 2011 and still is a standardbenchmark dataset for Android Malware
This dataset contains malware apps installed in userphones and based on 3 main attack strategies repackagingupdate attack and drive-by download Samples of this datasetwere manually classified based on different aspects suchas installation and activation mechanisms and maliciouspayloads nature Collected malware was split in families thatwere obtained ldquoby carefully examining the related securityannouncements threat reports and blog contents fromexisting mobile antivirus companies and active researchersas exhaustively as possible and diligently requesting malwaresamples from them or actively crawling from existing officialand alternative Android Marketsrdquo [6]
The different families present in the dataset areADRD AnserverBot Asroot BaseBridge BeanBotBgServ CoinPirate Crusewin DogWars DroidCouponDroidDeluxe DroidDream DroidDreamLightDroidKungFu1 DroidKungFu2 DroidKungFu3DroidKungFu4 DroidKungFuSapp DoidKungFuUpdateEndofday FakeNetflix FakePlayer GamblerSMSGeinimi GGTracker GingerMaster GoldDream Gone60GPSSMSSpy HippoSMS Jifake jSMSHider Kmin LovetrapNickyBot Nickyspy Pjapps Plankton RogueLemonRogueSPPush SMSReplicator SndApps Spitmo TapSnakeWalkinwat YZHC zHash Zitmo and Zsone
Therefore the final dataset is made of a total of 49samples one for each family of malware defined by a total of26 binary features divided in 6 categories (Table 1) A detaileddescription of each feature can be found in the original paper[6] and some previous work where this dataset is used can befound in [34ndash36]
3 Experiments and Results
This section presents the experiments performed and theresults obtained in the validation process of the proposedsolution
Both BHL (Section 21) and DT (Section 22) algorithmshave been applied to the previously described dataset (Sec-tion 23) in order to identify the features that define theinternal structure of the data and that support the groupingof the different families of malware attacks In the conducted
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
Figure 2 BHL Projection of malware families
experiments firstly BHL is applied to identify groups ofmalware families with similar behaviour This is done byvisually inspecting the obtained BHL projections and themost relevant features are consequently identified Then thedataset is analyzed by means of DTs to determine the level ofimportance of each feature considering as the most relevantfeatures those that are used in the decision nodes at lowestdepth
In Figure 2 it is shown the best projection obtained byBHL using the following parameter values 120572 = 3 120573 = 4119899119906119898119887119890119903119900119891119894119905119890119903119886119905119894119900119899119904 = 1000 and 119897119890119886119903119899119894119899119892119903119886119905119890 = 005 Theseparameter values were chosen in an experimental processof trial and error As parameter tuning is a task that isvery dependent on the dataset to be used several initialexperiments were conducted with a range of combinations ofthese parameter values
Based on such projection samples are grouped in 2main clusters G1 and G2 (Figure 3) Additionally severalsubgroups (at a 3 level depth ie G1 997888rarr G1A 997888rarr G1A1G1A2 G1A3 and G1A4) can be defined in these maingroups
Figure 4 presents the split of family groups in a schemathat shows the results of thoroughly analyzing the allocationof families in groups The most relevant features that havebeen identified varying from one cluster to another can beseen As an example data are split in G1 and G2 based on thefeatures ldquoRepackagingrdquo and ldquoStandalonerdquo The complete listsof families assigned to each one of the groups are presentedin Figures 5 and 6 Malware families are allocated in the samegroup as they are associated to similar characteristics andbehaviour and therefore there could be similar ways to dealwith them
Based on the analysis of BHL results the most relevantfeatures in decreasing order of importance are ldquoRepackag-ingrdquo and ldquoStandalonerdquo ldquoBootrdquo and ldquoActivation SMSrdquo andldquoFinancial Charges SMSrdquo
BHL clearly outperforms other algorithms used in pre-vious works [24 29] providing a clearer visualization
Complexity 5
Table 1 Features in the Malgenome Dataset
Category 1 Installation 1Repackaging 2Update 3Drive-by download 4StandaloneCategory 2 Activation 5Boot 6SMS 7Net 8Call 9USB 10PKG 11Batt 12SYS 13MainCategory 3 Privilege escalation 14exploid 15RATCzimperlich 16ginger break 17asroot 18encryptedCategory 4 Remote control 19NET 20SMSCategory 5 Financial charges 21phone call 22SMS 23block SMSCategory 6 Personal information stealing 24SMS 25phone number 26user account
Table 2 Summary table of DT results minimum depth of decision nodes for each one of the original features
Deviance Gini Twoing
ID Feature Standard Curvature Interactioncurvature Standard Curvature Interaction
curvature Standard Curvature Interactioncurvature Average
1 Repackaging 1 2 4 1 2 6 1 2 4 2565 BOOT 2 3 3 4 3 2 2 3 3 27818 Encrypted 4 2 4 4 2 3209 USB 6 3 5 3 4 6 3 429
3 Drive-byDownload 5 5 4 2 5 6 5 5 4 456
24 SMS 4 3 5 10 3 6 3 3 5 46726 User Account 6 1 8 3 1 8 6 1 8 4672 Update 5 7 4 2 6 3 5 7 4 47819 NET 6 2 8 9 2 3 4 2 8 4896 SMS 3 6 5 8 7 3 3 6 4 50010 PKG 6 5 4 5 4 6 5 50022 SMS 3 6 4 10 6 4 3 6 4 5114 Standalone 5 10 3 3 9 3 5 10 3 5678 CALL 4 7 4 8 4 7 56711 BATT 4 9 5 4 5 4 9 57116 Ginger Break 6 60015 RATCZimperlich 6 8 1 9 9 8 7 8 1 6337 NET 5 8 6 10 9 2 5 8 6 65614 Exploid 5 8 6 9 6 6 5 8 6 65617 Asroot 7 4 7 11 4 9 8 4 7 67823 Block SMS 3 8 5 9 9 9 4 8 6 67825 Phone Number 2 11 2 12 12 11 2 11 2 72212 SYS 5 10 6 10 11 7 5 10 6 77821 Phone Call 4 9 12 13 7 4 9 5 78813 MAIN 6 11 7 10 12 1 6 11 7 78920 SMS 10 8 900
of the internal structure of the dataset Groups obtainedby BHL are more compact and well defined than thegroups generated by other EPP techniques in the previouswork
In addition to the BHL experiments experiments withDTs were additionally conducted in order to compare andvalidate the obtained results As it has been previouslymentioned 3 different splitting functions have been appliedin present paper Gini Deviance and Twoing In addition 3different criteria for selecting split features have been appliedStandard Curvature and Interaction
As an example one of the obtained DTs is shown inFigure 7 This is the tree generated from the Malgenomedataset when applying the standard CART split criteria andtheDeviance function It has been selected as it is the onewithlowest depth In the leaf nodes labels refer to family numbers(alphabetically ordered as presented in Section 23)
To show the most interesting results from the differentalternatives to build DT information has been summarizedin Table 2 For each combination of splitting function andselecting criteria the minimum depth of decision nodeslinked to each one of the original features is shown That is
6 Complexity
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
G1
G2
G1a G1b
G1a1
G1a2
G1a3
G1a4
G1b1
G1b2
G1b3
G1b4
G2a G2b
G2a1
G2a2
G2a3
G2a4
G2b1
G2b2
G2b3
Figure 3 BHL Labelling of clusters
G1
G2
G1A
G1A1 amp G1A2
G1A3 amp G1A4
G1B
G1B1 amp G1B2
G1B3 amp G1B4
G2B
G2B1
G2B2 amp G2B3
G2A1 G2A2 amp G2A3
G2A
G2A4
G1B1
G1B2
G2A1
G2A2 amp G2A3
Repa
ckag
ing=
0St
anda
lone
=1
Repackaging=1
Standalone=0
BOOT=1
BOOT=0
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
BOOT=1
BOOT=0
Figure 4 Schematic clustering and relevant features from BHL projection
when the same feature appears in more than one node theminimum depth of all these nodes is the one selected for thegiven feature In the case a certain feature was not included inthe DT there is no value
In this table it can be seen that results (slightly orsignificantly) vary when comparing the obtained results (bydifferent splitting function and selecting criteria) for a certainfeature As general conclusions cannot be derived and to sum
up all figures the average depth value is calculated for eachfeature that is further analyzed
When analyzing Figure 4 and Table 2 it can be seenthat results from BHL and DT are coherent In the case ofBHL it can be seen that Repackaging is identified as themost discriminative feature because the two main groupsin the dataset (G1 and G2) take complementary values forsuch feature Coherently Repackaging is the feature with the
Complexity 7
G1
Gone60 DroidDeluxe FakeNetflix Asroot Plankton zHash SndApps TapSnake GamblerSMS Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A
Gone60 DroidDeluxe FakeNetflix Asroot Plankton Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo
G1B
zHash SndApps TapSnake GamblerSMS GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A1
Gone60 DroidDeluxeFakeNetflix Asroot Plankton
G1A2
Zitmo Walkinwat FakePlayer
G1A3
GPSSMSSpy SMSReplicator
G1A4
RogueLemon Spitmo RogueSPPush
G1B1 amp G1B2
zHash SndApps TapSnake GamblerSMS Nickyspy
G1B2
zHash SndApps TapSnake GamblerSMS
G1B3
NickyBot GoldDream Kmin YZHC
G1B4
GGTrackerLovetrap Crusewin
BOOT=0
BOOT=1
G1B2
Nickyspy
Financial ChargesSMS=0
Financial ChargesSMS=0
Activation SMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Financial ChargesSMS=1
Activation SMS=1
Figure 5 Families allocation in Group 1 and relevant features identified in BHL projection
G2
DoidKungFuUpdateDroidDream DogWarsJifake DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRDjSMSHiderZsoneBeanBot AnserverBot Endofday BaseBridge CoinPirate
BgServPjapps
G2A
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHiderZsoneBeanBot
G2B
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMasterADRDAnserverBot Endofday BaseBridge CoinPirate GeinimiBgServPjappsHippoSMS
G2A1 G2A2 amp G2A3
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHider
G2A4
Zsone BeanBot
G2B1
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRD
G2B2
AnserverBot Endofday
G2B3
BaseBridge CoinPirate Geinimi BgServ Pjapps HippoSMS
BOOT
=0
BOOT=1
G2A1
DoidKungFuUpdateDroidDream
G2A2 amp G2A3
DogWars Jifake jSMSHider
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=0
Activation SMS=0
Geinimi
HippoSMS
Figure 6 Families allocation in Group 2 and relevant features identified in BHL projection
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
2 Complexity
exploits to fully compromise the security of the smartphoneand more than 90 of the apps tried to turn the smartphoneinto a botnet controlled through network or short messages
To improve present knowledge of Android malwarefamilies a novel neural-projection technique from the familyof Exploratory Projection Pursuit (EPP) techniques namedBeta Hebbian Learning (BHL) [8] is applied Obtainedresults are then compared to those from several differentDecision Trees (DTs) [9] when trying to predict the malwarefamily from apps features
Each app (data sample) that was collected for theMalgenome dataset is defined as a set of certain featuresusing a binary representation Apps were grouped accordingto the family they belong to and features were recalculatedfor the whole family taking into account which features werepresent in the given apps The generated high-dimensionalspace is then analysed by means of BHL in order to revealthe inner structure of the dataset Obtained projections areconsequently scrutinized to get further knowledge aboutthe app features that define the organization of the data indifferent groups and subgroups For comparison purposesDTs have been additionally generated on the same featuresset in order to know the features that better discriminatebetween the different malware families
A variety of problems have been addressed by artificialneural networks in recent decades [10ndash14] More preciselyneural projection models have been previously applied to awide variety of security datasets including network traffic[15 16] SQL code [17 18] and HTTP traffic [19] Similarlyfrom a more general perspective different machine learningsolutions have been proposed to differentiate between legiti-mate and malicious apps [20ndash22]
Visualization techniques have been previously appliedto this problem of analyzing malware [23ndash29] Howeverfew dimensionality-reduction techniques have been appliedto Android apps in order to detect malware Pythagorastree fractal visualization is proposed in [25] being all appsscattered as leaves in the tree Graphs for deciding aboutmalicious apps by depicting lists of malicious methodsneedless permissions and malicious strings were proposedin [26] Biclustering on permission information was used togenerate visualization in [27] while behavior-related dendro-grams are generated out of malware traces in [28] In thelater different pieces of information are analysed includingnodes related to the package name of the application theAndroid components that has called the API call and thenames of functions and methods invoked by the applicationDifferentiating from previous work in present paper a novelneural projection technique is applied for the first time to thecharacterization of Androidmalware [8 24 30] Apps are notanalysed one by one but family-level is considered insteadAdditionally DTs are applied for the first time in order toimprove characterization of such families
The rest of this paper is organized as follows initially BHLand DTs are presented and the analyzed dataset is describedin the following section Then the proposed experimentsare introduced and the obtained results are analyzed inSection 3 Finally conclusions and future work are presentedin Section 4 of the paper
2 Materials and Methods
In present research the EPP BHL algorithm [8] has beenapplied to a dataset of malware families with the aim ofidentifying the internal structure of such dataset and findingfamilies of malware with similar characteristicsThe obtainedresults have been compared with a well-known predictionalgorithm (DT) [9] to validate the BHL results regardingthe most relevant features to briefly characterize Malwarefamilies
21 Beta Hebbian Learning The Beta Hebbian Learningtechnique (BHL) [8] is an unsupervised neural network fromthe family of EPP that employs the Beta distribution to updateits learning rule and fit the Probability Density Function(PDF) of the residual with the distribution of a given dataset
Thus if the PDF of the residuals is known the optimalcost function can be determined By using119861(120572 120573) parametersof the Beta distribution the residual (e) can be drawnwith thefollowing PDF
119901 (119890) = 119890120572minus1 (1 minus 119890)120573minus1 = (119909 minus 119882119910)120572minus1 (1 minus 119909 + 119882119910)120573minus1 (1)
where 120572 and 120573 are used to adjust the shape of the PDF ofthe Beta distribution 119909 is the input of the network 119890 is theresidual 119882 is the weight matrix and 119910 is the output of thenetwork
Then by using the following gradient descent is per-formed to maximize the likelihood of the weights
120597119901120597119882 = (119890120572minus2119895 (1 minus 119890119895)120573minus2 (minus (120572 minus 1) (1 minus 119890119895) + 119890119895 (120573 minus 1)))= (119890120572minus2119895 (1 minus 119890119895)120573minus2 (1 minus 120572 + 119890119895 (120572 + 120573 minus 2)))
(2)
In the case of BHL the learning rule allows for fitting thePDF of the residual by maximizing the likelihood of suchresidual with the current distribution
Therefore the neural architecture for BHL is defined asfollows
119865119890119890119889119891119900119903119908119886119903119889 119910119894 = 119873sum119895=1
119882119894119895119909119895 forall119894 (3)
119865119890119890119889119887119886119888119896 119890119895 = 119909119895 minus 119872sum119894=1
119882119894119895119910119894 (4)
119882119890119894119892ℎ119905119904119906119901119889119886119905119890 Δ119882119894119895= 120578 (119890120572minus2119895 (1 minus 119890119895)120573minus2 (1 minus 120572 + 119890119895 (120572 + 120573 minus 2))) 119910119894 (5)
22 Decision Trees Decision Trees (DTs) [9] are machine-learning algorithms widely used for prediction that haveproved their benefits in several real applications They can becategorized as supervised nonparametric inductive learningtechniques They are based on the construction of diagramsfrom a dataset in a similar way to prediction systems based
Complexity 3
Root node
Decision node
Decision nodeLeaf node
Leaf node
Branches Branches
BranchesBranches
Leaf node
BranchesBranches
Leaf node
Figure 1 Structure of decision trees
on rules which serve to represent and categorize a series ofconditions that occur repeatedly for the solution of a problem
The main objective of a classification DT is to dividea dataset into groups of samples as similar as possible inrelation to one of the features They are made of three mainelements root node (contains all samples of the dataset)decision nodes (represent a decision or rule) and leaf nodes(final label) A dataset is then classified based on subdivisionsof the DT nodes to reach one of the final (leaf) nodes whoselabel corresponds to a class (Figure 1)
Several algorithms have been proposed so far to buildDTs and their efficiency has been proved The most notableones [31] are ID3 (Iterative Dichotomiser 3) C45 (successorof ID3) CART (Classification and Regression Tree) CHAID(CHi-squared Automatic Interaction Detector) MARS andConditional Inference Trees Among all of them CART hasbeen selected in present work due to two main reasons thebinary nature of the dataset and the main objective of thestudy (to identify the most relevant features of the dataset)[31]
221 CART TheClassification and Regression Tree (CART)[9] is a binary tree so each decision node has two binarybranches determined by a splinting function obtained byprocessing variance function In order to build the tree thisCART algorithm takes 4 main steps [9]
(1) Build the decision tree splitting nodes according to agiven function
(2) Finish tree construction once the learning fits the stopcriteria
(3) Pruning the tree to avoid overfitting
(4) Select the best tree after pruning process
Originally the splitting function used by CART is theGini Index
119866119894119899119894 (119878) = 1 minus 119899sum119894=1
1199012119894 (6)
where 119878 is the dataset 119899 is the number of classes in thedataset and p is the probability of different classesThereforea Gini index of 0 means a 100 accuracy in predicting theclass
For comparison purposes two other splitting functionshave been applied in present paper Deviance (7) and Twoing(8)
119863119890V119894119886119899119888119890 (119878) = minus 119899sum119894=1
119901119894 log2 119901119894 (7)
Twoing is a splitting function different from Gini andDeviance Being 119871 119894 and 119877119894 there is fraction of membersof class 119894 in the left and right child nodes after a splitrespectively 119875119871 and 119875119877 are the fractions of observations thatsplit to the left and right respectivelyTherefore the functionto be maximized is the one in
119875119871119875119877( 119899sum119894=1
1003816100381610038161003816119871 119894 minus 1198771198941003816100381610038161003816)2
(8)
On the other hand in standard CART algorithm thesplit feature that is selected for a decision node is the onethat maximizes the split-criterion gain Once again for amore comprehensive comparison two other criteria havebeen applied for selecting split features curvature [32] andinteraction [33] These criteria can be defined as follows
(i) Curvature it is based on the null hypothesis of unas-sociated two features With these criteria the bestsplit predictor feature is the one that minimizes the
4 Complexity
significant p-values of curvature tests between eachfeature and the response variable Such a selection isrobust to the number of levels in individual features
(ii) Interaction it is based on the null hypothesis ofno interaction between the label and the predictorfeatures Therefore for deep decision trees standardCART tends to miss important interactions betweenpairs of features when there are also many other lessimportant features By means of this criterion thedetection of such important interactions is improved
23 Malgenome Dataset The dataset used in this researchhas been obtained from the Android Malware GenomeProject [7] which consists on 1260 Androidmalware samplesgrouped in a total of 49 malware families It was collectedfrom August 2010 to October 2011 and still is a standardbenchmark dataset for Android Malware
This dataset contains malware apps installed in userphones and based on 3 main attack strategies repackagingupdate attack and drive-by download Samples of this datasetwere manually classified based on different aspects suchas installation and activation mechanisms and maliciouspayloads nature Collected malware was split in families thatwere obtained ldquoby carefully examining the related securityannouncements threat reports and blog contents fromexisting mobile antivirus companies and active researchersas exhaustively as possible and diligently requesting malwaresamples from them or actively crawling from existing officialand alternative Android Marketsrdquo [6]
The different families present in the dataset areADRD AnserverBot Asroot BaseBridge BeanBotBgServ CoinPirate Crusewin DogWars DroidCouponDroidDeluxe DroidDream DroidDreamLightDroidKungFu1 DroidKungFu2 DroidKungFu3DroidKungFu4 DroidKungFuSapp DoidKungFuUpdateEndofday FakeNetflix FakePlayer GamblerSMSGeinimi GGTracker GingerMaster GoldDream Gone60GPSSMSSpy HippoSMS Jifake jSMSHider Kmin LovetrapNickyBot Nickyspy Pjapps Plankton RogueLemonRogueSPPush SMSReplicator SndApps Spitmo TapSnakeWalkinwat YZHC zHash Zitmo and Zsone
Therefore the final dataset is made of a total of 49samples one for each family of malware defined by a total of26 binary features divided in 6 categories (Table 1) A detaileddescription of each feature can be found in the original paper[6] and some previous work where this dataset is used can befound in [34ndash36]
3 Experiments and Results
This section presents the experiments performed and theresults obtained in the validation process of the proposedsolution
Both BHL (Section 21) and DT (Section 22) algorithmshave been applied to the previously described dataset (Sec-tion 23) in order to identify the features that define theinternal structure of the data and that support the groupingof the different families of malware attacks In the conducted
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
Figure 2 BHL Projection of malware families
experiments firstly BHL is applied to identify groups ofmalware families with similar behaviour This is done byvisually inspecting the obtained BHL projections and themost relevant features are consequently identified Then thedataset is analyzed by means of DTs to determine the level ofimportance of each feature considering as the most relevantfeatures those that are used in the decision nodes at lowestdepth
In Figure 2 it is shown the best projection obtained byBHL using the following parameter values 120572 = 3 120573 = 4119899119906119898119887119890119903119900119891119894119905119890119903119886119905119894119900119899119904 = 1000 and 119897119890119886119903119899119894119899119892119903119886119905119890 = 005 Theseparameter values were chosen in an experimental processof trial and error As parameter tuning is a task that isvery dependent on the dataset to be used several initialexperiments were conducted with a range of combinations ofthese parameter values
Based on such projection samples are grouped in 2main clusters G1 and G2 (Figure 3) Additionally severalsubgroups (at a 3 level depth ie G1 997888rarr G1A 997888rarr G1A1G1A2 G1A3 and G1A4) can be defined in these maingroups
Figure 4 presents the split of family groups in a schemathat shows the results of thoroughly analyzing the allocationof families in groups The most relevant features that havebeen identified varying from one cluster to another can beseen As an example data are split in G1 and G2 based on thefeatures ldquoRepackagingrdquo and ldquoStandalonerdquo The complete listsof families assigned to each one of the groups are presentedin Figures 5 and 6 Malware families are allocated in the samegroup as they are associated to similar characteristics andbehaviour and therefore there could be similar ways to dealwith them
Based on the analysis of BHL results the most relevantfeatures in decreasing order of importance are ldquoRepackag-ingrdquo and ldquoStandalonerdquo ldquoBootrdquo and ldquoActivation SMSrdquo andldquoFinancial Charges SMSrdquo
BHL clearly outperforms other algorithms used in pre-vious works [24 29] providing a clearer visualization
Complexity 5
Table 1 Features in the Malgenome Dataset
Category 1 Installation 1Repackaging 2Update 3Drive-by download 4StandaloneCategory 2 Activation 5Boot 6SMS 7Net 8Call 9USB 10PKG 11Batt 12SYS 13MainCategory 3 Privilege escalation 14exploid 15RATCzimperlich 16ginger break 17asroot 18encryptedCategory 4 Remote control 19NET 20SMSCategory 5 Financial charges 21phone call 22SMS 23block SMSCategory 6 Personal information stealing 24SMS 25phone number 26user account
Table 2 Summary table of DT results minimum depth of decision nodes for each one of the original features
Deviance Gini Twoing
ID Feature Standard Curvature Interactioncurvature Standard Curvature Interaction
curvature Standard Curvature Interactioncurvature Average
1 Repackaging 1 2 4 1 2 6 1 2 4 2565 BOOT 2 3 3 4 3 2 2 3 3 27818 Encrypted 4 2 4 4 2 3209 USB 6 3 5 3 4 6 3 429
3 Drive-byDownload 5 5 4 2 5 6 5 5 4 456
24 SMS 4 3 5 10 3 6 3 3 5 46726 User Account 6 1 8 3 1 8 6 1 8 4672 Update 5 7 4 2 6 3 5 7 4 47819 NET 6 2 8 9 2 3 4 2 8 4896 SMS 3 6 5 8 7 3 3 6 4 50010 PKG 6 5 4 5 4 6 5 50022 SMS 3 6 4 10 6 4 3 6 4 5114 Standalone 5 10 3 3 9 3 5 10 3 5678 CALL 4 7 4 8 4 7 56711 BATT 4 9 5 4 5 4 9 57116 Ginger Break 6 60015 RATCZimperlich 6 8 1 9 9 8 7 8 1 6337 NET 5 8 6 10 9 2 5 8 6 65614 Exploid 5 8 6 9 6 6 5 8 6 65617 Asroot 7 4 7 11 4 9 8 4 7 67823 Block SMS 3 8 5 9 9 9 4 8 6 67825 Phone Number 2 11 2 12 12 11 2 11 2 72212 SYS 5 10 6 10 11 7 5 10 6 77821 Phone Call 4 9 12 13 7 4 9 5 78813 MAIN 6 11 7 10 12 1 6 11 7 78920 SMS 10 8 900
of the internal structure of the dataset Groups obtainedby BHL are more compact and well defined than thegroups generated by other EPP techniques in the previouswork
In addition to the BHL experiments experiments withDTs were additionally conducted in order to compare andvalidate the obtained results As it has been previouslymentioned 3 different splitting functions have been appliedin present paper Gini Deviance and Twoing In addition 3different criteria for selecting split features have been appliedStandard Curvature and Interaction
As an example one of the obtained DTs is shown inFigure 7 This is the tree generated from the Malgenomedataset when applying the standard CART split criteria andtheDeviance function It has been selected as it is the onewithlowest depth In the leaf nodes labels refer to family numbers(alphabetically ordered as presented in Section 23)
To show the most interesting results from the differentalternatives to build DT information has been summarizedin Table 2 For each combination of splitting function andselecting criteria the minimum depth of decision nodeslinked to each one of the original features is shown That is
6 Complexity
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
G1
G2
G1a G1b
G1a1
G1a2
G1a3
G1a4
G1b1
G1b2
G1b3
G1b4
G2a G2b
G2a1
G2a2
G2a3
G2a4
G2b1
G2b2
G2b3
Figure 3 BHL Labelling of clusters
G1
G2
G1A
G1A1 amp G1A2
G1A3 amp G1A4
G1B
G1B1 amp G1B2
G1B3 amp G1B4
G2B
G2B1
G2B2 amp G2B3
G2A1 G2A2 amp G2A3
G2A
G2A4
G1B1
G1B2
G2A1
G2A2 amp G2A3
Repa
ckag
ing=
0St
anda
lone
=1
Repackaging=1
Standalone=0
BOOT=1
BOOT=0
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
BOOT=1
BOOT=0
Figure 4 Schematic clustering and relevant features from BHL projection
when the same feature appears in more than one node theminimum depth of all these nodes is the one selected for thegiven feature In the case a certain feature was not included inthe DT there is no value
In this table it can be seen that results (slightly orsignificantly) vary when comparing the obtained results (bydifferent splitting function and selecting criteria) for a certainfeature As general conclusions cannot be derived and to sum
up all figures the average depth value is calculated for eachfeature that is further analyzed
When analyzing Figure 4 and Table 2 it can be seenthat results from BHL and DT are coherent In the case ofBHL it can be seen that Repackaging is identified as themost discriminative feature because the two main groupsin the dataset (G1 and G2) take complementary values forsuch feature Coherently Repackaging is the feature with the
Complexity 7
G1
Gone60 DroidDeluxe FakeNetflix Asroot Plankton zHash SndApps TapSnake GamblerSMS Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A
Gone60 DroidDeluxe FakeNetflix Asroot Plankton Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo
G1B
zHash SndApps TapSnake GamblerSMS GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A1
Gone60 DroidDeluxeFakeNetflix Asroot Plankton
G1A2
Zitmo Walkinwat FakePlayer
G1A3
GPSSMSSpy SMSReplicator
G1A4
RogueLemon Spitmo RogueSPPush
G1B1 amp G1B2
zHash SndApps TapSnake GamblerSMS Nickyspy
G1B2
zHash SndApps TapSnake GamblerSMS
G1B3
NickyBot GoldDream Kmin YZHC
G1B4
GGTrackerLovetrap Crusewin
BOOT=0
BOOT=1
G1B2
Nickyspy
Financial ChargesSMS=0
Financial ChargesSMS=0
Activation SMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Financial ChargesSMS=1
Activation SMS=1
Figure 5 Families allocation in Group 1 and relevant features identified in BHL projection
G2
DoidKungFuUpdateDroidDream DogWarsJifake DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRDjSMSHiderZsoneBeanBot AnserverBot Endofday BaseBridge CoinPirate
BgServPjapps
G2A
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHiderZsoneBeanBot
G2B
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMasterADRDAnserverBot Endofday BaseBridge CoinPirate GeinimiBgServPjappsHippoSMS
G2A1 G2A2 amp G2A3
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHider
G2A4
Zsone BeanBot
G2B1
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRD
G2B2
AnserverBot Endofday
G2B3
BaseBridge CoinPirate Geinimi BgServ Pjapps HippoSMS
BOOT
=0
BOOT=1
G2A1
DoidKungFuUpdateDroidDream
G2A2 amp G2A3
DogWars Jifake jSMSHider
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=0
Activation SMS=0
Geinimi
HippoSMS
Figure 6 Families allocation in Group 2 and relevant features identified in BHL projection
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
Complexity 3
Root node
Decision node
Decision nodeLeaf node
Leaf node
Branches Branches
BranchesBranches
Leaf node
BranchesBranches
Leaf node
Figure 1 Structure of decision trees
on rules which serve to represent and categorize a series ofconditions that occur repeatedly for the solution of a problem
The main objective of a classification DT is to dividea dataset into groups of samples as similar as possible inrelation to one of the features They are made of three mainelements root node (contains all samples of the dataset)decision nodes (represent a decision or rule) and leaf nodes(final label) A dataset is then classified based on subdivisionsof the DT nodes to reach one of the final (leaf) nodes whoselabel corresponds to a class (Figure 1)
Several algorithms have been proposed so far to buildDTs and their efficiency has been proved The most notableones [31] are ID3 (Iterative Dichotomiser 3) C45 (successorof ID3) CART (Classification and Regression Tree) CHAID(CHi-squared Automatic Interaction Detector) MARS andConditional Inference Trees Among all of them CART hasbeen selected in present work due to two main reasons thebinary nature of the dataset and the main objective of thestudy (to identify the most relevant features of the dataset)[31]
221 CART TheClassification and Regression Tree (CART)[9] is a binary tree so each decision node has two binarybranches determined by a splinting function obtained byprocessing variance function In order to build the tree thisCART algorithm takes 4 main steps [9]
(1) Build the decision tree splitting nodes according to agiven function
(2) Finish tree construction once the learning fits the stopcriteria
(3) Pruning the tree to avoid overfitting
(4) Select the best tree after pruning process
Originally the splitting function used by CART is theGini Index
119866119894119899119894 (119878) = 1 minus 119899sum119894=1
1199012119894 (6)
where 119878 is the dataset 119899 is the number of classes in thedataset and p is the probability of different classesThereforea Gini index of 0 means a 100 accuracy in predicting theclass
For comparison purposes two other splitting functionshave been applied in present paper Deviance (7) and Twoing(8)
119863119890V119894119886119899119888119890 (119878) = minus 119899sum119894=1
119901119894 log2 119901119894 (7)
Twoing is a splitting function different from Gini andDeviance Being 119871 119894 and 119877119894 there is fraction of membersof class 119894 in the left and right child nodes after a splitrespectively 119875119871 and 119875119877 are the fractions of observations thatsplit to the left and right respectivelyTherefore the functionto be maximized is the one in
119875119871119875119877( 119899sum119894=1
1003816100381610038161003816119871 119894 minus 1198771198941003816100381610038161003816)2
(8)
On the other hand in standard CART algorithm thesplit feature that is selected for a decision node is the onethat maximizes the split-criterion gain Once again for amore comprehensive comparison two other criteria havebeen applied for selecting split features curvature [32] andinteraction [33] These criteria can be defined as follows
(i) Curvature it is based on the null hypothesis of unas-sociated two features With these criteria the bestsplit predictor feature is the one that minimizes the
4 Complexity
significant p-values of curvature tests between eachfeature and the response variable Such a selection isrobust to the number of levels in individual features
(ii) Interaction it is based on the null hypothesis ofno interaction between the label and the predictorfeatures Therefore for deep decision trees standardCART tends to miss important interactions betweenpairs of features when there are also many other lessimportant features By means of this criterion thedetection of such important interactions is improved
23 Malgenome Dataset The dataset used in this researchhas been obtained from the Android Malware GenomeProject [7] which consists on 1260 Androidmalware samplesgrouped in a total of 49 malware families It was collectedfrom August 2010 to October 2011 and still is a standardbenchmark dataset for Android Malware
This dataset contains malware apps installed in userphones and based on 3 main attack strategies repackagingupdate attack and drive-by download Samples of this datasetwere manually classified based on different aspects suchas installation and activation mechanisms and maliciouspayloads nature Collected malware was split in families thatwere obtained ldquoby carefully examining the related securityannouncements threat reports and blog contents fromexisting mobile antivirus companies and active researchersas exhaustively as possible and diligently requesting malwaresamples from them or actively crawling from existing officialand alternative Android Marketsrdquo [6]
The different families present in the dataset areADRD AnserverBot Asroot BaseBridge BeanBotBgServ CoinPirate Crusewin DogWars DroidCouponDroidDeluxe DroidDream DroidDreamLightDroidKungFu1 DroidKungFu2 DroidKungFu3DroidKungFu4 DroidKungFuSapp DoidKungFuUpdateEndofday FakeNetflix FakePlayer GamblerSMSGeinimi GGTracker GingerMaster GoldDream Gone60GPSSMSSpy HippoSMS Jifake jSMSHider Kmin LovetrapNickyBot Nickyspy Pjapps Plankton RogueLemonRogueSPPush SMSReplicator SndApps Spitmo TapSnakeWalkinwat YZHC zHash Zitmo and Zsone
Therefore the final dataset is made of a total of 49samples one for each family of malware defined by a total of26 binary features divided in 6 categories (Table 1) A detaileddescription of each feature can be found in the original paper[6] and some previous work where this dataset is used can befound in [34ndash36]
3 Experiments and Results
This section presents the experiments performed and theresults obtained in the validation process of the proposedsolution
Both BHL (Section 21) and DT (Section 22) algorithmshave been applied to the previously described dataset (Sec-tion 23) in order to identify the features that define theinternal structure of the data and that support the groupingof the different families of malware attacks In the conducted
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
Figure 2 BHL Projection of malware families
experiments firstly BHL is applied to identify groups ofmalware families with similar behaviour This is done byvisually inspecting the obtained BHL projections and themost relevant features are consequently identified Then thedataset is analyzed by means of DTs to determine the level ofimportance of each feature considering as the most relevantfeatures those that are used in the decision nodes at lowestdepth
In Figure 2 it is shown the best projection obtained byBHL using the following parameter values 120572 = 3 120573 = 4119899119906119898119887119890119903119900119891119894119905119890119903119886119905119894119900119899119904 = 1000 and 119897119890119886119903119899119894119899119892119903119886119905119890 = 005 Theseparameter values were chosen in an experimental processof trial and error As parameter tuning is a task that isvery dependent on the dataset to be used several initialexperiments were conducted with a range of combinations ofthese parameter values
Based on such projection samples are grouped in 2main clusters G1 and G2 (Figure 3) Additionally severalsubgroups (at a 3 level depth ie G1 997888rarr G1A 997888rarr G1A1G1A2 G1A3 and G1A4) can be defined in these maingroups
Figure 4 presents the split of family groups in a schemathat shows the results of thoroughly analyzing the allocationof families in groups The most relevant features that havebeen identified varying from one cluster to another can beseen As an example data are split in G1 and G2 based on thefeatures ldquoRepackagingrdquo and ldquoStandalonerdquo The complete listsof families assigned to each one of the groups are presentedin Figures 5 and 6 Malware families are allocated in the samegroup as they are associated to similar characteristics andbehaviour and therefore there could be similar ways to dealwith them
Based on the analysis of BHL results the most relevantfeatures in decreasing order of importance are ldquoRepackag-ingrdquo and ldquoStandalonerdquo ldquoBootrdquo and ldquoActivation SMSrdquo andldquoFinancial Charges SMSrdquo
BHL clearly outperforms other algorithms used in pre-vious works [24 29] providing a clearer visualization
Complexity 5
Table 1 Features in the Malgenome Dataset
Category 1 Installation 1Repackaging 2Update 3Drive-by download 4StandaloneCategory 2 Activation 5Boot 6SMS 7Net 8Call 9USB 10PKG 11Batt 12SYS 13MainCategory 3 Privilege escalation 14exploid 15RATCzimperlich 16ginger break 17asroot 18encryptedCategory 4 Remote control 19NET 20SMSCategory 5 Financial charges 21phone call 22SMS 23block SMSCategory 6 Personal information stealing 24SMS 25phone number 26user account
Table 2 Summary table of DT results minimum depth of decision nodes for each one of the original features
Deviance Gini Twoing
ID Feature Standard Curvature Interactioncurvature Standard Curvature Interaction
curvature Standard Curvature Interactioncurvature Average
1 Repackaging 1 2 4 1 2 6 1 2 4 2565 BOOT 2 3 3 4 3 2 2 3 3 27818 Encrypted 4 2 4 4 2 3209 USB 6 3 5 3 4 6 3 429
3 Drive-byDownload 5 5 4 2 5 6 5 5 4 456
24 SMS 4 3 5 10 3 6 3 3 5 46726 User Account 6 1 8 3 1 8 6 1 8 4672 Update 5 7 4 2 6 3 5 7 4 47819 NET 6 2 8 9 2 3 4 2 8 4896 SMS 3 6 5 8 7 3 3 6 4 50010 PKG 6 5 4 5 4 6 5 50022 SMS 3 6 4 10 6 4 3 6 4 5114 Standalone 5 10 3 3 9 3 5 10 3 5678 CALL 4 7 4 8 4 7 56711 BATT 4 9 5 4 5 4 9 57116 Ginger Break 6 60015 RATCZimperlich 6 8 1 9 9 8 7 8 1 6337 NET 5 8 6 10 9 2 5 8 6 65614 Exploid 5 8 6 9 6 6 5 8 6 65617 Asroot 7 4 7 11 4 9 8 4 7 67823 Block SMS 3 8 5 9 9 9 4 8 6 67825 Phone Number 2 11 2 12 12 11 2 11 2 72212 SYS 5 10 6 10 11 7 5 10 6 77821 Phone Call 4 9 12 13 7 4 9 5 78813 MAIN 6 11 7 10 12 1 6 11 7 78920 SMS 10 8 900
of the internal structure of the dataset Groups obtainedby BHL are more compact and well defined than thegroups generated by other EPP techniques in the previouswork
In addition to the BHL experiments experiments withDTs were additionally conducted in order to compare andvalidate the obtained results As it has been previouslymentioned 3 different splitting functions have been appliedin present paper Gini Deviance and Twoing In addition 3different criteria for selecting split features have been appliedStandard Curvature and Interaction
As an example one of the obtained DTs is shown inFigure 7 This is the tree generated from the Malgenomedataset when applying the standard CART split criteria andtheDeviance function It has been selected as it is the onewithlowest depth In the leaf nodes labels refer to family numbers(alphabetically ordered as presented in Section 23)
To show the most interesting results from the differentalternatives to build DT information has been summarizedin Table 2 For each combination of splitting function andselecting criteria the minimum depth of decision nodeslinked to each one of the original features is shown That is
6 Complexity
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
G1
G2
G1a G1b
G1a1
G1a2
G1a3
G1a4
G1b1
G1b2
G1b3
G1b4
G2a G2b
G2a1
G2a2
G2a3
G2a4
G2b1
G2b2
G2b3
Figure 3 BHL Labelling of clusters
G1
G2
G1A
G1A1 amp G1A2
G1A3 amp G1A4
G1B
G1B1 amp G1B2
G1B3 amp G1B4
G2B
G2B1
G2B2 amp G2B3
G2A1 G2A2 amp G2A3
G2A
G2A4
G1B1
G1B2
G2A1
G2A2 amp G2A3
Repa
ckag
ing=
0St
anda
lone
=1
Repackaging=1
Standalone=0
BOOT=1
BOOT=0
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
BOOT=1
BOOT=0
Figure 4 Schematic clustering and relevant features from BHL projection
when the same feature appears in more than one node theminimum depth of all these nodes is the one selected for thegiven feature In the case a certain feature was not included inthe DT there is no value
In this table it can be seen that results (slightly orsignificantly) vary when comparing the obtained results (bydifferent splitting function and selecting criteria) for a certainfeature As general conclusions cannot be derived and to sum
up all figures the average depth value is calculated for eachfeature that is further analyzed
When analyzing Figure 4 and Table 2 it can be seenthat results from BHL and DT are coherent In the case ofBHL it can be seen that Repackaging is identified as themost discriminative feature because the two main groupsin the dataset (G1 and G2) take complementary values forsuch feature Coherently Repackaging is the feature with the
Complexity 7
G1
Gone60 DroidDeluxe FakeNetflix Asroot Plankton zHash SndApps TapSnake GamblerSMS Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A
Gone60 DroidDeluxe FakeNetflix Asroot Plankton Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo
G1B
zHash SndApps TapSnake GamblerSMS GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A1
Gone60 DroidDeluxeFakeNetflix Asroot Plankton
G1A2
Zitmo Walkinwat FakePlayer
G1A3
GPSSMSSpy SMSReplicator
G1A4
RogueLemon Spitmo RogueSPPush
G1B1 amp G1B2
zHash SndApps TapSnake GamblerSMS Nickyspy
G1B2
zHash SndApps TapSnake GamblerSMS
G1B3
NickyBot GoldDream Kmin YZHC
G1B4
GGTrackerLovetrap Crusewin
BOOT=0
BOOT=1
G1B2
Nickyspy
Financial ChargesSMS=0
Financial ChargesSMS=0
Activation SMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Financial ChargesSMS=1
Activation SMS=1
Figure 5 Families allocation in Group 1 and relevant features identified in BHL projection
G2
DoidKungFuUpdateDroidDream DogWarsJifake DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRDjSMSHiderZsoneBeanBot AnserverBot Endofday BaseBridge CoinPirate
BgServPjapps
G2A
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHiderZsoneBeanBot
G2B
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMasterADRDAnserverBot Endofday BaseBridge CoinPirate GeinimiBgServPjappsHippoSMS
G2A1 G2A2 amp G2A3
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHider
G2A4
Zsone BeanBot
G2B1
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRD
G2B2
AnserverBot Endofday
G2B3
BaseBridge CoinPirate Geinimi BgServ Pjapps HippoSMS
BOOT
=0
BOOT=1
G2A1
DoidKungFuUpdateDroidDream
G2A2 amp G2A3
DogWars Jifake jSMSHider
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=0
Activation SMS=0
Geinimi
HippoSMS
Figure 6 Families allocation in Group 2 and relevant features identified in BHL projection
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
4 Complexity
significant p-values of curvature tests between eachfeature and the response variable Such a selection isrobust to the number of levels in individual features
(ii) Interaction it is based on the null hypothesis ofno interaction between the label and the predictorfeatures Therefore for deep decision trees standardCART tends to miss important interactions betweenpairs of features when there are also many other lessimportant features By means of this criterion thedetection of such important interactions is improved
23 Malgenome Dataset The dataset used in this researchhas been obtained from the Android Malware GenomeProject [7] which consists on 1260 Androidmalware samplesgrouped in a total of 49 malware families It was collectedfrom August 2010 to October 2011 and still is a standardbenchmark dataset for Android Malware
This dataset contains malware apps installed in userphones and based on 3 main attack strategies repackagingupdate attack and drive-by download Samples of this datasetwere manually classified based on different aspects suchas installation and activation mechanisms and maliciouspayloads nature Collected malware was split in families thatwere obtained ldquoby carefully examining the related securityannouncements threat reports and blog contents fromexisting mobile antivirus companies and active researchersas exhaustively as possible and diligently requesting malwaresamples from them or actively crawling from existing officialand alternative Android Marketsrdquo [6]
The different families present in the dataset areADRD AnserverBot Asroot BaseBridge BeanBotBgServ CoinPirate Crusewin DogWars DroidCouponDroidDeluxe DroidDream DroidDreamLightDroidKungFu1 DroidKungFu2 DroidKungFu3DroidKungFu4 DroidKungFuSapp DoidKungFuUpdateEndofday FakeNetflix FakePlayer GamblerSMSGeinimi GGTracker GingerMaster GoldDream Gone60GPSSMSSpy HippoSMS Jifake jSMSHider Kmin LovetrapNickyBot Nickyspy Pjapps Plankton RogueLemonRogueSPPush SMSReplicator SndApps Spitmo TapSnakeWalkinwat YZHC zHash Zitmo and Zsone
Therefore the final dataset is made of a total of 49samples one for each family of malware defined by a total of26 binary features divided in 6 categories (Table 1) A detaileddescription of each feature can be found in the original paper[6] and some previous work where this dataset is used can befound in [34ndash36]
3 Experiments and Results
This section presents the experiments performed and theresults obtained in the validation process of the proposedsolution
Both BHL (Section 21) and DT (Section 22) algorithmshave been applied to the previously described dataset (Sec-tion 23) in order to identify the features that define theinternal structure of the data and that support the groupingof the different families of malware attacks In the conducted
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
Figure 2 BHL Projection of malware families
experiments firstly BHL is applied to identify groups ofmalware families with similar behaviour This is done byvisually inspecting the obtained BHL projections and themost relevant features are consequently identified Then thedataset is analyzed by means of DTs to determine the level ofimportance of each feature considering as the most relevantfeatures those that are used in the decision nodes at lowestdepth
In Figure 2 it is shown the best projection obtained byBHL using the following parameter values 120572 = 3 120573 = 4119899119906119898119887119890119903119900119891119894119905119890119903119886119905119894119900119899119904 = 1000 and 119897119890119886119903119899119894119899119892119903119886119905119890 = 005 Theseparameter values were chosen in an experimental processof trial and error As parameter tuning is a task that isvery dependent on the dataset to be used several initialexperiments were conducted with a range of combinations ofthese parameter values
Based on such projection samples are grouped in 2main clusters G1 and G2 (Figure 3) Additionally severalsubgroups (at a 3 level depth ie G1 997888rarr G1A 997888rarr G1A1G1A2 G1A3 and G1A4) can be defined in these maingroups
Figure 4 presents the split of family groups in a schemathat shows the results of thoroughly analyzing the allocationof families in groups The most relevant features that havebeen identified varying from one cluster to another can beseen As an example data are split in G1 and G2 based on thefeatures ldquoRepackagingrdquo and ldquoStandalonerdquo The complete listsof families assigned to each one of the groups are presentedin Figures 5 and 6 Malware families are allocated in the samegroup as they are associated to similar characteristics andbehaviour and therefore there could be similar ways to dealwith them
Based on the analysis of BHL results the most relevantfeatures in decreasing order of importance are ldquoRepackag-ingrdquo and ldquoStandalonerdquo ldquoBootrdquo and ldquoActivation SMSrdquo andldquoFinancial Charges SMSrdquo
BHL clearly outperforms other algorithms used in pre-vious works [24 29] providing a clearer visualization
Complexity 5
Table 1 Features in the Malgenome Dataset
Category 1 Installation 1Repackaging 2Update 3Drive-by download 4StandaloneCategory 2 Activation 5Boot 6SMS 7Net 8Call 9USB 10PKG 11Batt 12SYS 13MainCategory 3 Privilege escalation 14exploid 15RATCzimperlich 16ginger break 17asroot 18encryptedCategory 4 Remote control 19NET 20SMSCategory 5 Financial charges 21phone call 22SMS 23block SMSCategory 6 Personal information stealing 24SMS 25phone number 26user account
Table 2 Summary table of DT results minimum depth of decision nodes for each one of the original features
Deviance Gini Twoing
ID Feature Standard Curvature Interactioncurvature Standard Curvature Interaction
curvature Standard Curvature Interactioncurvature Average
1 Repackaging 1 2 4 1 2 6 1 2 4 2565 BOOT 2 3 3 4 3 2 2 3 3 27818 Encrypted 4 2 4 4 2 3209 USB 6 3 5 3 4 6 3 429
3 Drive-byDownload 5 5 4 2 5 6 5 5 4 456
24 SMS 4 3 5 10 3 6 3 3 5 46726 User Account 6 1 8 3 1 8 6 1 8 4672 Update 5 7 4 2 6 3 5 7 4 47819 NET 6 2 8 9 2 3 4 2 8 4896 SMS 3 6 5 8 7 3 3 6 4 50010 PKG 6 5 4 5 4 6 5 50022 SMS 3 6 4 10 6 4 3 6 4 5114 Standalone 5 10 3 3 9 3 5 10 3 5678 CALL 4 7 4 8 4 7 56711 BATT 4 9 5 4 5 4 9 57116 Ginger Break 6 60015 RATCZimperlich 6 8 1 9 9 8 7 8 1 6337 NET 5 8 6 10 9 2 5 8 6 65614 Exploid 5 8 6 9 6 6 5 8 6 65617 Asroot 7 4 7 11 4 9 8 4 7 67823 Block SMS 3 8 5 9 9 9 4 8 6 67825 Phone Number 2 11 2 12 12 11 2 11 2 72212 SYS 5 10 6 10 11 7 5 10 6 77821 Phone Call 4 9 12 13 7 4 9 5 78813 MAIN 6 11 7 10 12 1 6 11 7 78920 SMS 10 8 900
of the internal structure of the dataset Groups obtainedby BHL are more compact and well defined than thegroups generated by other EPP techniques in the previouswork
In addition to the BHL experiments experiments withDTs were additionally conducted in order to compare andvalidate the obtained results As it has been previouslymentioned 3 different splitting functions have been appliedin present paper Gini Deviance and Twoing In addition 3different criteria for selecting split features have been appliedStandard Curvature and Interaction
As an example one of the obtained DTs is shown inFigure 7 This is the tree generated from the Malgenomedataset when applying the standard CART split criteria andtheDeviance function It has been selected as it is the onewithlowest depth In the leaf nodes labels refer to family numbers(alphabetically ordered as presented in Section 23)
To show the most interesting results from the differentalternatives to build DT information has been summarizedin Table 2 For each combination of splitting function andselecting criteria the minimum depth of decision nodeslinked to each one of the original features is shown That is
6 Complexity
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
G1
G2
G1a G1b
G1a1
G1a2
G1a3
G1a4
G1b1
G1b2
G1b3
G1b4
G2a G2b
G2a1
G2a2
G2a3
G2a4
G2b1
G2b2
G2b3
Figure 3 BHL Labelling of clusters
G1
G2
G1A
G1A1 amp G1A2
G1A3 amp G1A4
G1B
G1B1 amp G1B2
G1B3 amp G1B4
G2B
G2B1
G2B2 amp G2B3
G2A1 G2A2 amp G2A3
G2A
G2A4
G1B1
G1B2
G2A1
G2A2 amp G2A3
Repa
ckag
ing=
0St
anda
lone
=1
Repackaging=1
Standalone=0
BOOT=1
BOOT=0
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
BOOT=1
BOOT=0
Figure 4 Schematic clustering and relevant features from BHL projection
when the same feature appears in more than one node theminimum depth of all these nodes is the one selected for thegiven feature In the case a certain feature was not included inthe DT there is no value
In this table it can be seen that results (slightly orsignificantly) vary when comparing the obtained results (bydifferent splitting function and selecting criteria) for a certainfeature As general conclusions cannot be derived and to sum
up all figures the average depth value is calculated for eachfeature that is further analyzed
When analyzing Figure 4 and Table 2 it can be seenthat results from BHL and DT are coherent In the case ofBHL it can be seen that Repackaging is identified as themost discriminative feature because the two main groupsin the dataset (G1 and G2) take complementary values forsuch feature Coherently Repackaging is the feature with the
Complexity 7
G1
Gone60 DroidDeluxe FakeNetflix Asroot Plankton zHash SndApps TapSnake GamblerSMS Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A
Gone60 DroidDeluxe FakeNetflix Asroot Plankton Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo
G1B
zHash SndApps TapSnake GamblerSMS GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A1
Gone60 DroidDeluxeFakeNetflix Asroot Plankton
G1A2
Zitmo Walkinwat FakePlayer
G1A3
GPSSMSSpy SMSReplicator
G1A4
RogueLemon Spitmo RogueSPPush
G1B1 amp G1B2
zHash SndApps TapSnake GamblerSMS Nickyspy
G1B2
zHash SndApps TapSnake GamblerSMS
G1B3
NickyBot GoldDream Kmin YZHC
G1B4
GGTrackerLovetrap Crusewin
BOOT=0
BOOT=1
G1B2
Nickyspy
Financial ChargesSMS=0
Financial ChargesSMS=0
Activation SMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Financial ChargesSMS=1
Activation SMS=1
Figure 5 Families allocation in Group 1 and relevant features identified in BHL projection
G2
DoidKungFuUpdateDroidDream DogWarsJifake DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRDjSMSHiderZsoneBeanBot AnserverBot Endofday BaseBridge CoinPirate
BgServPjapps
G2A
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHiderZsoneBeanBot
G2B
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMasterADRDAnserverBot Endofday BaseBridge CoinPirate GeinimiBgServPjappsHippoSMS
G2A1 G2A2 amp G2A3
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHider
G2A4
Zsone BeanBot
G2B1
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRD
G2B2
AnserverBot Endofday
G2B3
BaseBridge CoinPirate Geinimi BgServ Pjapps HippoSMS
BOOT
=0
BOOT=1
G2A1
DoidKungFuUpdateDroidDream
G2A2 amp G2A3
DogWars Jifake jSMSHider
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=0
Activation SMS=0
Geinimi
HippoSMS
Figure 6 Families allocation in Group 2 and relevant features identified in BHL projection
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
Complexity 5
Table 1 Features in the Malgenome Dataset
Category 1 Installation 1Repackaging 2Update 3Drive-by download 4StandaloneCategory 2 Activation 5Boot 6SMS 7Net 8Call 9USB 10PKG 11Batt 12SYS 13MainCategory 3 Privilege escalation 14exploid 15RATCzimperlich 16ginger break 17asroot 18encryptedCategory 4 Remote control 19NET 20SMSCategory 5 Financial charges 21phone call 22SMS 23block SMSCategory 6 Personal information stealing 24SMS 25phone number 26user account
Table 2 Summary table of DT results minimum depth of decision nodes for each one of the original features
Deviance Gini Twoing
ID Feature Standard Curvature Interactioncurvature Standard Curvature Interaction
curvature Standard Curvature Interactioncurvature Average
1 Repackaging 1 2 4 1 2 6 1 2 4 2565 BOOT 2 3 3 4 3 2 2 3 3 27818 Encrypted 4 2 4 4 2 3209 USB 6 3 5 3 4 6 3 429
3 Drive-byDownload 5 5 4 2 5 6 5 5 4 456
24 SMS 4 3 5 10 3 6 3 3 5 46726 User Account 6 1 8 3 1 8 6 1 8 4672 Update 5 7 4 2 6 3 5 7 4 47819 NET 6 2 8 9 2 3 4 2 8 4896 SMS 3 6 5 8 7 3 3 6 4 50010 PKG 6 5 4 5 4 6 5 50022 SMS 3 6 4 10 6 4 3 6 4 5114 Standalone 5 10 3 3 9 3 5 10 3 5678 CALL 4 7 4 8 4 7 56711 BATT 4 9 5 4 5 4 9 57116 Ginger Break 6 60015 RATCZimperlich 6 8 1 9 9 8 7 8 1 6337 NET 5 8 6 10 9 2 5 8 6 65614 Exploid 5 8 6 9 6 6 5 8 6 65617 Asroot 7 4 7 11 4 9 8 4 7 67823 Block SMS 3 8 5 9 9 9 4 8 6 67825 Phone Number 2 11 2 12 12 11 2 11 2 72212 SYS 5 10 6 10 11 7 5 10 6 77821 Phone Call 4 9 12 13 7 4 9 5 78813 MAIN 6 11 7 10 12 1 6 11 7 78920 SMS 10 8 900
of the internal structure of the dataset Groups obtainedby BHL are more compact and well defined than thegroups generated by other EPP techniques in the previouswork
In addition to the BHL experiments experiments withDTs were additionally conducted in order to compare andvalidate the obtained results As it has been previouslymentioned 3 different splitting functions have been appliedin present paper Gini Deviance and Twoing In addition 3different criteria for selecting split features have been appliedStandard Curvature and Interaction
As an example one of the obtained DTs is shown inFigure 7 This is the tree generated from the Malgenomedataset when applying the standard CART split criteria andtheDeviance function It has been selected as it is the onewithlowest depth In the leaf nodes labels refer to family numbers(alphabetically ordered as presented in Section 23)
To show the most interesting results from the differentalternatives to build DT information has been summarizedin Table 2 For each combination of splitting function andselecting criteria the minimum depth of decision nodeslinked to each one of the original features is shown That is
6 Complexity
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
G1
G2
G1a G1b
G1a1
G1a2
G1a3
G1a4
G1b1
G1b2
G1b3
G1b4
G2a G2b
G2a1
G2a2
G2a3
G2a4
G2b1
G2b2
G2b3
Figure 3 BHL Labelling of clusters
G1
G2
G1A
G1A1 amp G1A2
G1A3 amp G1A4
G1B
G1B1 amp G1B2
G1B3 amp G1B4
G2B
G2B1
G2B2 amp G2B3
G2A1 G2A2 amp G2A3
G2A
G2A4
G1B1
G1B2
G2A1
G2A2 amp G2A3
Repa
ckag
ing=
0St
anda
lone
=1
Repackaging=1
Standalone=0
BOOT=1
BOOT=0
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
BOOT=1
BOOT=0
Figure 4 Schematic clustering and relevant features from BHL projection
when the same feature appears in more than one node theminimum depth of all these nodes is the one selected for thegiven feature In the case a certain feature was not included inthe DT there is no value
In this table it can be seen that results (slightly orsignificantly) vary when comparing the obtained results (bydifferent splitting function and selecting criteria) for a certainfeature As general conclusions cannot be derived and to sum
up all figures the average depth value is calculated for eachfeature that is further analyzed
When analyzing Figure 4 and Table 2 it can be seenthat results from BHL and DT are coherent In the case ofBHL it can be seen that Repackaging is identified as themost discriminative feature because the two main groupsin the dataset (G1 and G2) take complementary values forsuch feature Coherently Repackaging is the feature with the
Complexity 7
G1
Gone60 DroidDeluxe FakeNetflix Asroot Plankton zHash SndApps TapSnake GamblerSMS Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A
Gone60 DroidDeluxe FakeNetflix Asroot Plankton Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo
G1B
zHash SndApps TapSnake GamblerSMS GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A1
Gone60 DroidDeluxeFakeNetflix Asroot Plankton
G1A2
Zitmo Walkinwat FakePlayer
G1A3
GPSSMSSpy SMSReplicator
G1A4
RogueLemon Spitmo RogueSPPush
G1B1 amp G1B2
zHash SndApps TapSnake GamblerSMS Nickyspy
G1B2
zHash SndApps TapSnake GamblerSMS
G1B3
NickyBot GoldDream Kmin YZHC
G1B4
GGTrackerLovetrap Crusewin
BOOT=0
BOOT=1
G1B2
Nickyspy
Financial ChargesSMS=0
Financial ChargesSMS=0
Activation SMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Financial ChargesSMS=1
Activation SMS=1
Figure 5 Families allocation in Group 1 and relevant features identified in BHL projection
G2
DoidKungFuUpdateDroidDream DogWarsJifake DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRDjSMSHiderZsoneBeanBot AnserverBot Endofday BaseBridge CoinPirate
BgServPjapps
G2A
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHiderZsoneBeanBot
G2B
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMasterADRDAnserverBot Endofday BaseBridge CoinPirate GeinimiBgServPjappsHippoSMS
G2A1 G2A2 amp G2A3
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHider
G2A4
Zsone BeanBot
G2B1
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRD
G2B2
AnserverBot Endofday
G2B3
BaseBridge CoinPirate Geinimi BgServ Pjapps HippoSMS
BOOT
=0
BOOT=1
G2A1
DoidKungFuUpdateDroidDream
G2A2 amp G2A3
DogWars Jifake jSMSHider
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=0
Activation SMS=0
Geinimi
HippoSMS
Figure 6 Families allocation in Group 2 and relevant features identified in BHL projection
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
6 Complexity
minus1 minus05 0 05 1 15
BHL
minus25
minus2
minus15
minus1
minus05
0
05
G1
G2
G1a G1b
G1a1
G1a2
G1a3
G1a4
G1b1
G1b2
G1b3
G1b4
G2a G2b
G2a1
G2a2
G2a3
G2a4
G2b1
G2b2
G2b3
Figure 3 BHL Labelling of clusters
G1
G2
G1A
G1A1 amp G1A2
G1A3 amp G1A4
G1B
G1B1 amp G1B2
G1B3 amp G1B4
G2B
G2B1
G2B2 amp G2B3
G2A1 G2A2 amp G2A3
G2A
G2A4
G1B1
G1B2
G2A1
G2A2 amp G2A3
Repa
ckag
ing=
0St
anda
lone
=1
Repackaging=1
Standalone=0
BOOT=1
BOOT=0
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Financial ChargesSMS=1Activation SMS=1
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Activation SMS =0Financial ChargesSMS=0
Financial ChargesSMS=1
BOOT=1
BOOT=0
Figure 4 Schematic clustering and relevant features from BHL projection
when the same feature appears in more than one node theminimum depth of all these nodes is the one selected for thegiven feature In the case a certain feature was not included inthe DT there is no value
In this table it can be seen that results (slightly orsignificantly) vary when comparing the obtained results (bydifferent splitting function and selecting criteria) for a certainfeature As general conclusions cannot be derived and to sum
up all figures the average depth value is calculated for eachfeature that is further analyzed
When analyzing Figure 4 and Table 2 it can be seenthat results from BHL and DT are coherent In the case ofBHL it can be seen that Repackaging is identified as themost discriminative feature because the two main groupsin the dataset (G1 and G2) take complementary values forsuch feature Coherently Repackaging is the feature with the
Complexity 7
G1
Gone60 DroidDeluxe FakeNetflix Asroot Plankton zHash SndApps TapSnake GamblerSMS Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A
Gone60 DroidDeluxe FakeNetflix Asroot Plankton Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo
G1B
zHash SndApps TapSnake GamblerSMS GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A1
Gone60 DroidDeluxeFakeNetflix Asroot Plankton
G1A2
Zitmo Walkinwat FakePlayer
G1A3
GPSSMSSpy SMSReplicator
G1A4
RogueLemon Spitmo RogueSPPush
G1B1 amp G1B2
zHash SndApps TapSnake GamblerSMS Nickyspy
G1B2
zHash SndApps TapSnake GamblerSMS
G1B3
NickyBot GoldDream Kmin YZHC
G1B4
GGTrackerLovetrap Crusewin
BOOT=0
BOOT=1
G1B2
Nickyspy
Financial ChargesSMS=0
Financial ChargesSMS=0
Activation SMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Financial ChargesSMS=1
Activation SMS=1
Figure 5 Families allocation in Group 1 and relevant features identified in BHL projection
G2
DoidKungFuUpdateDroidDream DogWarsJifake DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRDjSMSHiderZsoneBeanBot AnserverBot Endofday BaseBridge CoinPirate
BgServPjapps
G2A
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHiderZsoneBeanBot
G2B
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMasterADRDAnserverBot Endofday BaseBridge CoinPirate GeinimiBgServPjappsHippoSMS
G2A1 G2A2 amp G2A3
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHider
G2A4
Zsone BeanBot
G2B1
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRD
G2B2
AnserverBot Endofday
G2B3
BaseBridge CoinPirate Geinimi BgServ Pjapps HippoSMS
BOOT
=0
BOOT=1
G2A1
DoidKungFuUpdateDroidDream
G2A2 amp G2A3
DogWars Jifake jSMSHider
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=0
Activation SMS=0
Geinimi
HippoSMS
Figure 6 Families allocation in Group 2 and relevant features identified in BHL projection
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
Complexity 7
G1
Gone60 DroidDeluxe FakeNetflix Asroot Plankton zHash SndApps TapSnake GamblerSMS Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A
Gone60 DroidDeluxe FakeNetflix Asroot Plankton Zitmo Walkinwat FakePlayer GPSSMSSpy SMSReplicator RogueSPPush RogueLemon Spitmo
G1B
zHash SndApps TapSnake GamblerSMS GGTracker Lovetrap Nickyspy NickyBot GoldDream KminYZHC Crusewin
G1A1
Gone60 DroidDeluxeFakeNetflix Asroot Plankton
G1A2
Zitmo Walkinwat FakePlayer
G1A3
GPSSMSSpy SMSReplicator
G1A4
RogueLemon Spitmo RogueSPPush
G1B1 amp G1B2
zHash SndApps TapSnake GamblerSMS Nickyspy
G1B2
zHash SndApps TapSnake GamblerSMS
G1B3
NickyBot GoldDream Kmin YZHC
G1B4
GGTrackerLovetrap Crusewin
BOOT=0
BOOT=1
G1B2
Nickyspy
Financial ChargesSMS=0
Financial ChargesSMS=0
Activation SMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Financial ChargesSMS=1
Activation SMS=1
Figure 5 Families allocation in Group 1 and relevant features identified in BHL projection
G2
DoidKungFuUpdateDroidDream DogWarsJifake DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRDjSMSHiderZsoneBeanBot AnserverBot Endofday BaseBridge CoinPirate
BgServPjapps
G2A
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHiderZsoneBeanBot
G2B
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMasterADRDAnserverBot Endofday BaseBridge CoinPirate GeinimiBgServPjappsHippoSMS
G2A1 G2A2 amp G2A3
DoidKungFuUpdateDroidDream DogWarsJifakejSMSHider
G2A4
Zsone BeanBot
G2B1
DroidKungFu1 DroidKungFu2 DroidKungFu3 DroidKungFu4 DroidKungFuSapp DroidDreamLight DroidCoupon GingerMaster ADRD
G2B2
AnserverBot Endofday
G2B3
BaseBridge CoinPirate Geinimi BgServ Pjapps HippoSMS
BOOT
=0
BOOT=1
G2A1
DoidKungFuUpdateDroidDream
G2A2 amp G2A3
DogWars Jifake jSMSHider
Financial ChargesSMS=1
Financial ChargesSMS=0
Activation SMS=0
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=1
Activation SMS=1
Financial ChargesSMS=0
Activation SMS=0
Geinimi
HippoSMS
Figure 6 Families allocation in Group 2 and relevant features identified in BHL projection
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
8 Complexity
Figure 7 DT obtained with standard CART split criteria and Deviance function
lowest mean depth being included in all the generated treesFurthermore it was selected for the root node of 3DTsWhenanalyzing subgroups in BHL projection (Figure 3) it can beseen that BOOT is the feature that drive the split in 1st-levelsubgroups (subgroups G1A and G1B in the case of groupG1 and subgroups G2A and G2B in the case of group G2)In keeping with this idea according to DTs results Boot isthe second feature with the lowest mean depth For the nextlevel of importance the BHL projection identifies FinancialCharges SMS and Activation SMS as the features that bestexplain the split in different subgroups The two of them are
also selected by DTs as ranked in the first half of features witha lowestmean depth although some other features take lowervalues
Additionally from the DTs results (Table 2) Privilegeescalation-Ginger Break and Remote control-SMS can beidentified as the least relevant features The former was notincluded in 7 (out of 9) DTs while the latter was not includedin 6 Furthermore Remote control-SMS is the feature witha highest value of the average depth taking a value of9 It means that these features are almost useless whencharacterizing malware families
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
Complexity 9
Results from present paper are consistent with thoseobtained in previous work [30] when applying FeatureSelection (FS) to the same dataset Installation-RepackagingActivation-SMS Activation-Boot Remote Control-NET andFinancial Charges-SMS were identified as the 5 most rel-evant features in order to characterize malware familiesaccording to a given method of filter-based FS Minimum-Redundancy Maximum Relevance This method is intendedat obtaining the maximum relevance to the output whilekeeping redundancy of selected features to lowest levelsComplementarily two evolutionary approaches to FS (GA-ICC-W and GA-I-W) identified Installation-RepackagingInstallation-Standalone Activation-SMS Remote Control-NET and Financial Charges-SMS as the 5most relevant onesThese methods perform the selection of features accordingto the Information Correlation Coefficient and the MutualInformation respectively
4 Conclusions and Future Work
In this paper some machine learning techniques have beenapplied to Android malware data in order to analyse thefeatures of such apps and subsequently identify the ones thatbetter define the organization ofmalware families As a resultdetection and categorization of malware could be improvedand spedup at the same time Furthermore by knowing aboutthese features malware apps could be identifiedmore quicklyand precisely and then removed from the official Androidmarket
From the obtained results some conclusions can bederived first of all the proposed machine-learning tech-niques probed to successfully address the given challengeBHL has outperformed previous neural projection tech-niques that have been applied to the same data in clearlyrevealing the structure of the Malgenome dataset Addition-ally features identified as the most important ones by suchEPP technique are also highlighted by DTs as being relevantto better differentiate between malware families
Obtained results are consistent with those obtained byFS and hence validate present proposal Future work willfocus on the development of a Hybrid Intelligent Systemto integrate results from the previously validated machine-learning techniques In addition it will be applied to up-to-datemalware datasets in order to check its performancewhenfacing 0-day malware
Data Availability
Dataset used in this research is available in [7] Bibliog-raphy 7 (2010) Malgenome Project [Online] available athttpwwwmalgenomeprojectorg
Conflicts of Interest
The authors declare that there are no conflicts of interestregarding the publication of this paper
Acknowledgments
This work is partially supported by Instituto Nacional deCiberseguridad (INCIBE) and developed by Research Insti-tute of Applied Sciences in Cybersecurity (RIASC)
References
[1] Gartner Global smartphone sales to end users from 1stquarter 2009 2018 httpswwwstatistacomstatistics266219global-smartphone-sales-since-1st-quarter-2009-by-operating-system
[2] AppBrain Android and google play statistics httpswwwappbraincomstatsstats-index
[3] SOPHOSLABS ldquoLtd s sophoslabs 2019 threat reportrdquo TechRep 2019
[4] M Labs ldquoCybercrime tactics and techniques Q3 2018rdquo TechRep 2018
[5] S Arshad M A Shah A Khan andM Ahmed ldquoAndroid mal-ware detection amp protection A surveyrdquo International Journal ofAdvanced Computer Science and Applications vol 7 no 2 2016
[6] Y Zhou and X Jiang ldquoDissecting android malware char-acterization and evolutionrdquo in Proceedings of the 2012 IEEESymposium on Security and Privacy pp 95ndash109 San FranciscoCalif USA May 2012
[7] Y Zhou Malgenome project 2010 httpwwwmalgenome-projectorg
[8] H Quintıan and E Corchado ldquoBeta hebbian learning as anew method for exploratory projection pursuitrdquo InternationalJournal of Neural Systems vol 27 no 6 Article ID 1750024 2017
[9] L Breiman Classification and regression trees Routledge Lon-don UK 2017
[10] P J Garcıa Nieto J Martınez Torres F J De Cos Juez andF Sanchez Lasheras ldquoUsing multivariate adaptive regressionsplines and multilayer perceptron networks to evaluate papermanufactured using Eucalyptus globulusrdquoAppliedMathematicsand Computation vol 219 no 2 pp 755ndash763 2012
[11] M Paliwal and U A Kumar ldquoNeural networks and statisticaltechniques a review of applicationsrdquo Expert Systems withApplications vol 36 no 1 pp 2ndash17 2009
[12] R F Garcia J L C Rolle M R Gomez and A D CatoiraldquoExpert condition monitoring on hydrostatic self-levitatingbearingsrdquo Expert Systems with Applications vol 40 no 8 pp2975ndash2984 2013
[13] C C TurradoMD CM Lopez F S Lasheras B A R GomezJ L C Rolle and F J D C Juez ldquoMissing data imputationof solar radiation data under different atmospheric conditionsrdquoSensors vol 14 no 11 pp 20382ndash20399 2014
[14] J L Calvo Rolle I Machon Gonzalez and H Lopez GarcıaldquoNeuro-robust controller for non-linear systemsrdquoDyna vol 86no 3 pp 308ndash317 2011
[15] A Herrero E Corchado M A Pellicer and A AbrahamldquoHybrid multi agent-neural network intrusion detection withmobile visualizationrdquo in Innovations in Hybrid Intelligent Sys-tems pp 320ndash328 2008
[16] R Sanchez A Herrero and E Corchado ldquoVisualization andclustering for SNMP intrusion detectionrdquo Cybernetics andSystems vol 44 no 6-7 pp 505ndash532 2013
[17] C Pinzon A Herrero J F De Paz E Corchado and J BajoldquoCBRid4SQL A CBR intrusion detector for SQL injection
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
10 Complexity
attacksrdquo in Proceedings of the 5th International Conference onHybrid Artificial Intelligence Systems HAIS 2010 - Part II vol6077 of Lecture Notes in Computer Science pp 510ndash519 SpringerBerlin Heidelberg 2010
[18] C Pinzon J F De Paz J Bajo A Herrero and E CorchadoldquoAIIDA-SQL An adaptive intelligent intrusion detector agentfor detecting SQL injection attacksrdquo in Proceedings of the 201010th International Conference on Hybrid Intelligent Systems HIS2010 pp 73ndash78 USA August 2010
[19] D Atienza A Herrero and E Corchado ldquoNeural Analysis ofHTTP Traffic for Web Attack Detectionrdquo in Proceedings of the8th International Conference on Computational Intelligence inSecurity for Information SystemsCISIS 2015 vol 369 ofAdvancesin Intelligent Systems and Computing pp 201ndash212 SpringerInternational Publishing 2015
[20] L Cen C S Gates L Si and N Li ldquoA probabilistic discrimi-native model for android malware detection with decompiledsource coderdquo IEEE Transactions on Dependable and SecureComputing vol 12 no 4 pp 400ndash412 2015
[21] H-J Zhu Z-H You Z-X Zhu W-L Shi X Chen and LCheng ldquoDroidDet effective and robust detection of androidmalware using static analysis along with rotation forest modelrdquoNeurocomputing vol 272 pp 638ndash646 2018
[22] F A Narudin A Feizollah N B Anuar and A GanildquoEvaluation of machine learning classifiers for mobile malwaredetectionrdquo Soft Computing vol 20 no 1 pp 343ndash357 2016
[23] M Wagner F Fischer R Luh et al ldquoA Survey of VisualizationSystems forMalware Analysisrdquo in Proceedings of the Eurograph-ics Conference on Visualization (EuroVis) - STARs 2015
[24] A Gonzalez A Herrero and E Corchado ldquoNeural visual-ization of android malware familiesrdquo in International JointConference SOCOrsquo16-CISISrsquo16-ICEUTErsquo16 vol 527 of Advancesin Intelligent Systems and Computing pp 574ndash583 SpringerInternational Publishing Cham 2017
[25] A Paturi M Cherukuri J Donahue and S MukkamalaldquoMobile malware visual analytics and similarities of AttackToolkits (Malware gene analysis)rdquo in Proceedings of the 2013International Conference on Collaboration Technologies andSystems CTS 2013 pp 149ndash154 USA May 2013
[26] W Park K Lee K Cho and W Ryu ldquoAnalyzing and detectingmethod of Android malware via disassembling and visual-izationrdquo in Proceedings of the 2014 International Conferenceon Information and Communication Technology Convergence(ICTC) pp 817-818 Busan South Korea October 2014
[27] V Moonsamy J Rong and S Liu ldquoMining permission patternsfor contrasting clean and malicious android applicationsrdquoFuture Generation Computer Systems vol 36 pp 122ndash132 2014
[28] O Somarriba U Zurutuza R Uribeetxeberria L Delosieresand S Nadjm-Tehrani ldquoDetection and visualization of androidmalware behaviorrdquo Journal of Electrical and Computer Engineer-ing vol 2016 Article ID 8034967 p 17 2016
[29] R Vega Vega H Quintian J L Calvo-Rolle A Herrero andE Corchado ldquoGaining deep knowledge of Android malwarefamilies through dimensionality reduction techniquesrdquo LogicJournal of the IGPL 2018
[30] J Sedano S Gonzalez C Chira A Herrero E Corchado andJ R Villar ldquoKey features for the characterization of Androidmalware familiesrdquo Logic Journal of the IGPL vol 25 no 1 pp54ndash66 2017
[31] S Singh and P Gupta ldquoComparative study id3 cart and c45 decision tree algorithm a surveyrdquo International Journal of
Advanced Information Science and Technology vol 27 no 7 pp97ndash103 2014
[32] W-Y Loh and Y-S Shih ldquoSplit selectionmethods for classifica-tion treesrdquo Statistica Sinica vol 7 no 4 pp 815ndash840 1997
[33] W-Y Loh ldquoRegression trees with unbiased variable selectionand interaction detectionrdquo Statistica Sinica vol 12 no 2 pp361ndash386 2002
[34] L Li A Bartel T F Bissyande et al ldquoIccTA Detecting Inter-Component Privacy Leaks in Android Appsrdquo in Proceedingsof the 2015 IEEEACM 37th IEEE International Conference onSoftware Engineering (ICSE) vol 1 pp 280ndash291 Florence ItalyMay 2015
[35] D Arp M Spreitzenbarth M Hubner H Gascon and KRieck ldquoDrebin effective and explainable detection of androidmalware in your pocketrdquo in Proceedings of the 2014 Network andDistributed System Security (NDSS) Symposium vol 14 pp 23ndash26 2014
[36] G Suarez-Tangil S K DashM Ahmadi J Kinder G Giacintoand L Cavallaro ldquoDroidsieve Fast and accurate classificationof obfuscated android malwarerdquo in Proceedings of the SeventhACM on Conference on Data and Application Security andPrivacy (CODASPY) pp 309ndash320 Scottsdale Arizona USA2017
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom
Hindawiwwwhindawicom Volume 2018
MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Mathematical Problems in Engineering
Applied MathematicsJournal of
Hindawiwwwhindawicom Volume 2018
Probability and StatisticsHindawiwwwhindawicom Volume 2018
Journal of
Hindawiwwwhindawicom Volume 2018
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawiwwwhindawicom Volume 2018
OptimizationJournal of
Hindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom Volume 2018
Engineering Mathematics
International Journal of
Hindawiwwwhindawicom Volume 2018
Operations ResearchAdvances in
Journal of
Hindawiwwwhindawicom Volume 2018
Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018
International Journal of Mathematics and Mathematical Sciences
Hindawiwwwhindawicom Volume 2018
Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom
The Scientific World Journal
Volume 2018
Hindawiwwwhindawicom Volume 2018Volume 2018
Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in
Nature and SocietyHindawiwwwhindawicom Volume 2018
Hindawiwwwhindawicom
Dierential EquationsInternational Journal of
Volume 2018
Hindawiwwwhindawicom Volume 2018
Decision SciencesAdvances in
Hindawiwwwhindawicom Volume 2018
AnalysisInternational Journal of
Hindawiwwwhindawicom Volume 2018
Stochastic AnalysisInternational Journal of
Submit your manuscripts atwwwhindawicom