
J Comput Virol Hack Tech
DOI 10.1007/s11416-014-0217-8

ORIGINAL PAPER

Longitudinal analysis of a large corpus of cyber threat descriptions

Ghita Mezzour · L. Richard Carley · Kathleen M. Carley

Received: 6 February 2014 / Accepted: 9 June 2014
© Springer-Verlag France 2014

Abstract Online cyber threat descriptions are rich, but little research has attempted to systematically analyze these descriptions. In this paper, we process and analyze two of Symantec’s online threat description corpora. The Anti-Virus (AV) corpus contains descriptions of more than 12,400 threats detected by Symantec’s AV, and the Intrusion Prevention System (IPS) corpus contains descriptions of more than 2,700 attacks detected by Symantec’s IPS. In our analysis, we quantify the over-time evolution of threat severity and type in the corpora. We also assess the amount of time Symantec takes to release signatures for newly discovered threats. Our analysis indicates that a very small minority of threats in the AV corpus are high-severity, whereas the majority of attacks in the IPS corpus are high-severity. Moreover, we find that the prevalence of different threat types such as worms and viruses in the corpora varies considerably over time. Finally, we find that Symantec prioritizes releasing signatures for fast-propagating threats.

This work is supported in part by the Defense Threat Reduction Agency (DTRA) under grant number HDTRA11010102, the Army Research Office (ARO) under grants W911NF1310154 and W911NF0910273, and the Center for Computational Analysis of Social and Organizational Systems (CASOS). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of DTRA, ARO or the US government.

G. Mezzour (B) · L. R. Carley
Electrical and Computer Engineering Department, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
e-mail: [email protected]

L. R. Carley
e-mail: [email protected]

K. M. Carley
Institute for Software Research, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA
e-mail: [email protected]

1 Introduction

Security companies with anti-virus products have first-hand, daily experience with cyber threats. These companies typically provide rich online cyber threat descriptions. Although these descriptions are frequently used by security researchers and practitioners, little has been done to systematically analyze such descriptions. Such analysis can provide historical insight into which threat types were a major concern to the security community throughout the years. That analysis can also provide a better understanding of cyber threat severity. Finally, such analysis can shed light on some of anti-virus companies’ policies. Such policies are important because anti-viruses are often the main barrier between end users and cyber threats.

In this paper, we process two online threat description corpora [36,37] in order to build and populate two threat catalogs. The corpora consist of semi-structured descriptions provided by Symantec, a major security vendor providing end-host security products. Our catalogs consist of structured descriptions that are easy to analyze using automated techniques. More specifically, we build the AV catalog based on the anti-virus (AV) corpus [37] and the Intrusion Prevention System (IPS) catalog based on the IPS corpus [36]. The AV corpus contains descriptions of threats detected by Symantec’s commercial AV and the IPS corpus contains descriptions of attacks detected by Symantec’s commercial IPS. The AV and IPS are two end-host systems that often run side-by-side, but do not interact. The AV examines the end-host’s files, whereas the IPS examines the end-host’s network activity. The AV corpus contains more than 12,400 threat descriptions and the IPS corpus contains more than 2,700 attack descriptions.

The AV catalog contains, for a given threat, the threat name, threat family name, type, discovery date, signature release date and severity measures. Threat severity measures [38] are distribution, damage, threat containment and removal levels. The values of these measures belong to three categories: high, medium and low. The IPS catalog contains, for a given attack, the attack name, attack family name, type and severity. The AV catalog contains more attributes than the IPS catalog because the AV corpus contains more detailed information than the IPS corpus.

In our analysis, we study the prevalence of different types (e.g. worms, viruses and trojans) in the corpora over time. Symantec usually prioritizes including severe and/or emergent threats that affect home users in the corpora. We find that different threat types reach their peak during different time periods. For example, worms reach their peak in 2003, and fake applications in 2007. In our analysis, we also quantify the over-time evolution of threat severity. We find that high-severity threats are consistently a minority in the AV catalog. On the other hand, the majority of attacks in the IPS corpus are high-severity. This is because IPS signatures are mainly created for high-severity attacks that can be detected with low false-positive probability. Finally, we also estimate the amount of time Symantec takes to release signatures for newly discovered threats, and compare this amount of time for different severity levels. We find that Symantec prioritizes creating and releasing signatures for fast-propagating threats. However, we do not find evidence that Symantec prioritizes releasing signatures for high-damage threats.

Symantec’s threat corpora contain rich information, but extracting information from these corpora to populate our catalogs is challenging. Symantec provides these corpora in order to help threat victims remove these threats from their machines. Symantec does not provide these corpora in order to help researchers perform a systematic analysis. Threat descriptions are created by different people and contain large amounts of unstructured text. While we used a semi-automated process to extract some attributes from these descriptions, we had to rely on a semi-manual process to extract other attributes. We also often contacted Symantec to inquire about their internal conventions, and to validate our approaches.

We intend to share our catalogs with researchers. In addition to the insight that can be obtained directly from our catalogs, our catalogs facilitate mining the Worldwide Intelligence Network Environment (WINE) [9,15] telemetry data sets. WINE is a platform for repeatable experimental research through which researchers can access security data used at Symantec Research Labs. These data sets consist of threat detection reports collected from millions of machines worldwide during the period November 2009–September 2011. A threat report contains the geolocation of the machine sending the report, the detection time, and the identifier of the threat detected. Researchers then need to obtain these threats’ characteristics from Symantec’s online threat description corpora [36,37]. We significantly help researchers interested in WINE by presenting information in these corpora in a structured and uniform form.

We describe the AV catalog in Sect. 2 and the IPS catalog in Sect. 3. We present our analysis in Sect. 4. We discuss related work in Sect. 5, limitations in Sect. 6 and future work in Sect. 7, before concluding in Sect. 8.

Non-goal. We note that our study does not aim at analyzing the prevalence of different threats among infected machines. In order to clarify that point, we provide an analogy based on the over-time evolution of car horsepower. In a study similar to ours, a researcher would study such evolution based on brochures of major new car models released in the market over the years. In a different study, the researcher would contact a random sample of people every year and ask them about their car’s horsepower. Both studies would likely find that horsepower has increased over the past 20 years, but the rate would probably differ. The first study focuses on manufacturers, while the second focuses on customers.

2 AV catalog

We build and populate the AV catalog based on information available in the Symantec AV online corpus [37]. In order to extract information from the corpus, we mainly look for relevant keywords in the threat descriptions. Table 1 presents two examples of AV catalog entries.

2.1 Catalog attributes

The catalog contains the following attributes:

Threat name. The unique name Symantec gives to the threat.

Specific/generic. This variable indicates whether this is a specific or a generic threat. A specific threat is a particular threat variant, whereas a generic threat may correspond to multiple threat variants that either belong to the same family or that have a common characteristic such as the packer software. The first entry in Table 1 corresponds to a specific threat, whereas the second entry corresponds to a generic threat.

Threat family name. A generalized name that we derive from the threat name. We are always able to extract a threat family name for specific threats. On the other hand, we are mostly unable to associate a threat family name with generic threats.

Type. The type, such as worm or virus. Some threats have more than one type. The main types are trojan, worm, virus, macro, adware/spyware and fake application.


Table 1 Examples of AV catalog entries

Entry  Field                            Value
1      Threat name                      W32.Aimdes.A@mm
       Specific/generic                 Specific
       Threat family name               Aimdes
       Type                             Worm
       Discovered                       February 11, 2005
       Initial rapid release            February 11, 2005
       Initial daily certified version  February 11, 2005
       Discovery year                   2005
       Distribution                     High
       Damage                           Medium
       Threat containment               Easy
       Removal                          Moderate
2      Threat name                      Backdoor.Trojan
       Specific/generic                 Generic
       Threat family name               Not available
       Type                             Backdoor, trojan
       Discovered                       February 11, 1999
       Initial rapid release            February 11, 1999
       Initial daily certified version  February 11, 1999
       Discovery year                   1999
       Distribution                     Low
       Damage                           Medium
       Threat containment               Easy
       Removal                          Easy

Discovered. The date Symantec finds out about the threat.

Initial rapid release. The date the rapid release virus definition is released. Rapid release signatures are subject to basic testing before their release. These signatures defend against newly emerging threats, but are more susceptible to false positives [39].

Initial daily certified version. The date the threat definition is included in the daily release. Threat signatures undergo thorough testing before being included in the daily release [39].

Discovery year. The year the threat is discovered. In most cases, the discovery year is simply the year in the discovered attribute. However, when the discovered attribute is missing, we use the year in the initial rapid release or the initial daily certified version attributes when available.

Distribution level. A measure of the aggressiveness of the threat propagation mechanism [38]. There are three distribution levels: high, medium and low, according to the guidelines in Table 2. It is worth noting that the distribution level is not always a perfect indicator of the number of computers infected by the threat. This is due to multiple reasons, such as the fact that the threat may target an old software vulnerability that is currently patched in most systems.
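The discovery-year fallback described above amounts to a simple priority chain over the three date attributes. A minimal sketch in Python (the function and argument names are ours, not part of Symantec's schema):

```python
def discovery_year(discovered, rapid_release, daily_certified):
    """Return the discovery year, falling back to the signature
    release dates when the discovered attribute is missing.
    Each argument is a datetime.date or None."""
    # Prefer the discovered date, then the initial rapid release,
    # then the initial daily certified version.
    for d in (discovered, rapid_release, daily_certified):
        if d is not None:
            return d.year
    return None  # discovery year unavailable (under 2% of threats)
```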

Table 2 AV. Guidelines on threat distribution and damage measures [38]

Distribution
  High    Some worms, network-aware executables, uncontainable threats (due to high virus complexity or low AV ability to combat)
  Medium  Most viruses, some worms
  Low     Most trojans

Damage
  High    File destruction/modification, very high server traffic, large-scale non-repairable damage, large security breaches, destructive triggers
  Medium  Non-critical settings altered, buggy routines, easily repairable damage, non-destructive triggers
  Low     No intentionally destructive behavior

Damage level. A measure of the damage that an infection is capable of causing [38]. There are also three damage levels: high, medium and low, as described in Table 2.

Threat containment. A measure of the difficulty of containing the threat. This measure takes three possible values: easy, moderate and difficult.

Removal. A measure of the difficulty of removing the threat from a machine. This measure also takes three possible values: easy, moderate and difficult. Easy removal may only require running a full system scan and deleting detected malicious files. Difficult removal may require starting the machine in troubleshooting mode and following detailed instructions.

The AV corpus also contains the number of infections, number of sites and wild level, but we choose not to include these severity measures in the catalog because drawing conclusions based on these measures is difficult. These measures vary over time, and are reported at different times and different life stages for different threats.

2.2 Attribute extraction methodology

We automatically extract the values of some attributes because these values immediately follow a fixed keyword in the threat descriptions. These attributes are the threat name, type, initial rapid release, initial daily certified version, discovered, distribution level, damage level, threat containment and removal. Later, we remove inconsistencies from these values. More specifically, we merge the types “trojan horse” and “trojan” into “trojan”. We also merge the types “adware”, “spyware” and “trackware” into “adware/spyware”.
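The type-merging step can be expressed as a small alias table. A sketch (the table and function names are ours):

```python
# Map inconsistent type labels found in the corpus onto the
# merged labels used in the catalog, as described above.
TYPE_ALIASES = {
    "trojan horse": "trojan",
    "adware": "adware/spyware",
    "spyware": "adware/spyware",
    "trackware": "adware/spyware",
}

def normalize_type(raw_type):
    """Normalize an extracted type string to its catalog label."""
    t = raw_type.strip().lower()
    return TYPE_ALIASES.get(t, t)
```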

For threats that do not have the threat type listed as an attribute in the threat description, we leverage Symantec’s virus naming conventions [40] in order to identify the type. Leveraging naming conventions allows us to determine the type of about 5 % of the threats we use in our analysis. We determine that the type is “worm” if the name contains “worm” or ends with “@m” or “@mm”, that the type is “adware/spyware” if the name contains “adware”, “spyware” or “infostealer”, and that the type is “trojan” if the name contains “trojan” or “backdoor”. We decide to assign type “trojan” when the threat name contains “backdoor” because in more than 98 % of the cases where the threat name contains “backdoor” and the threat type is given in the corpus, the threat type is “trojan”. We do not use the other naming conventions because these conventions do not match the usage in the AV corpus, probably because the convention description is imprecise. For example, one convention says that a “W32” prefix indicates that the threat is a virus that can infect Windows 32 platforms. However, worms are the majority of threats that have a “W32” prefix and that have a type listed in the corpus. The “W32” prefix probably indicates the platform targeted by the threat, but not necessarily the threat type.
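The naming-convention rules above translate directly into string tests on the threat name. A minimal sketch (the function name is ours; only the conventions stated above are implemented):

```python
def type_from_name(name):
    """Infer the threat type from a Symantec threat name using the
    subset of naming conventions described above; return None when
    no convention applies."""
    n = name.lower()
    if "worm" in n or n.endswith("@m") or n.endswith("@mm"):
        return "worm"
    if "adware" in n or "spyware" in n or "infostealer" in n:
        return "adware/spyware"
    if "trojan" in n or "backdoor" in n:
        return "trojan"
    return None
```

For example, the first entry of Table 1, W32.Aimdes.A@mm, is classified as a worm because its name ends with “@mm”.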

The specific/generic attribute is not explicitly given in the corpus, but can be inferred from the threat name or description. We consider a threat to be generic when the threat name ends with “!gen” or “Family” [40], or when the description contains one of the keywords “generic signature”, “detection name”, “detection technology”, “cloud-based detection”, “heuristic”, “without traditional signatures”, and “new malware threats”. We determine the keywords to search for by reading a large number of threat descriptions, and manually extracting the relevant keywords.
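The specific/generic inference can likewise be sketched as a name-suffix test plus a keyword scan over the description (an illustration; the keyword list is the one given above):

```python
GENERIC_KEYWORDS = (
    "generic signature", "detection name", "detection technology",
    "cloud-based detection", "heuristic",
    "without traditional signatures", "new malware threats",
)

def is_generic(name, description):
    """Return True when the threat should be classified as generic."""
    if name.endswith("!gen") or name.endswith("Family"):
        return True
    desc = description.lower()
    return any(kw in desc for kw in GENERIC_KEYWORDS)
```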

We extract the threat family name from the threat name by taking advantage of the threat name structure. According to Symantec’s virus naming conventions [40], a threat variant name consists of a prefix that designates the platform targeted by the threat or the threat type, the family name, and a suffix that differentiates among variants within the family. We automatically search for and remove the prefixes and suffixes listed in the naming convention. We manually review and correct the resulting family names, as some threat variant names do not strictly follow the naming convention.
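The family-name extraction can be sketched as prefix and suffix stripping. The lists below are a small illustrative subset of our own choosing; the paper uses the full prefix and suffix lists from the naming convention [40], followed by manual review:

```python
import re

# Illustrative subset of naming-convention prefixes (platform or type).
PREFIXES = ("W32.", "W97M.", "Trojan.", "Backdoor.", "OSX.")
# Variant suffixes such as ".A", "@mm", or ".A@mm".
SUFFIX_PATTERN = re.compile(r"(\.[A-Z](@mm?)?|@mm?)$")

def family_name(threat_name):
    """Strip the prefix and variant suffix to obtain the family name."""
    name = threat_name
    for prefix in PREFIXES:
        if name.startswith(prefix):
            name = name[len(prefix):]
            break
    return SUFFIX_PATTERN.sub("", name)
```

On the first entry of Table 1 this maps W32.Aimdes.A@mm to the family name Aimdes.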

3 IPS catalog

We build and populate the IPS catalog based on descriptions in the IPS corpus [36]. The IPS corpus contains descriptions of more than 2,700 attacks. The majority of attacks in the IPS corpus are attempts to exploit software vulnerabilities. However, in the IPS corpus, we also find threats such as worms and trojans. Some of these threats may also appear in the AV corpus.

Descriptions in the IPS corpus are less rich than descriptions in the AV corpus, and therefore the IPS catalog contains fewer attributes than the AV catalog. Table 3 presents two examples of IPS catalog entries.

Table 3 Examples of IPS catalog entries

Entry  Field        Value
1      Attack name  Attack: Apache Struts CVE-2013-2251 Code Execution 2
       Family name  Attack: Apache Struts CVE-2013-2251 Code Execution
       Type         Exploit
       Severity     High
2      Attack name  BugBear B Worm FileShare Propagation
       Family name  BugBear
       Type         Worm
       Severity     High

3.1 Catalog attributes

The catalog contains the following attributes:

Attack name. The unique name given by Symantec to the attack. The name may contain the CVE code of the corresponding exploit [27], as in the first entry in Table 3, or the name of the corresponding threat, as in the second entry in the table.

Family name. This is a generalization of the attack name. We obtain the family name by removing the exploit variant identifier, as in the first entry in Table 3, or by extracting the threat family name, as in the second entry in the table.

Type. The attack type. Some attacks have more than one type. The main types are web attack, exploit, worm, adware/spyware, trojan, backdoor and OS exploit.

Severity. The severity level of the attack is assigned based on a combination of the attack prevalence among users and the potential malicious impact of the attack. The Symantec guidelines for assigning the severity level, used for informational purposes, are given in Table 4.

3.2 Attribute extraction methodology

Attack name. We extract this name automatically as it is the title of the online attack description.

Family name. We extract this name manually from the attack name or description. Although IPS attack names follow some patterns, these names lack a precise structure. It is worth noting that we are unable to associate an accurate family name with exploits, OS exploits and web attacks. For these attacks, the family name is equal to the signature name after removing the variant identifier, as illustrated in the first entry in Table 3.

Type. We extract the type based on keywords that appear in the attack name or description. More specifically, we assign type “worm”, “virus”, “rootkit”, “trojan” or “backdoor” when these exact keywords appear in the attack name or description. We assign the type “exploit” when the attack name starts with “Attack:”1, and type “OS exploit” when the attack name starts with “OS Attack:”. We assign type “adware/spyware” when the attack name or description contains “adware” or “spyware”. We assign type “web attack” when the attack name starts with “Web Attack:” or contains “MS IE” or “MSIE” (for Microsoft Internet Explorer). We also assign type “web attack” when the attack name contains “HTTP” and is not already assigned another type such as worm or adware. Some worm or adware attack names contain “HTTP” because machines infected with these threats communicate over HTTP.

Table 4 IPS. Guidelines on attack severity levels

Level   Interpretation
High    Widespread worms or viruses; or arbitrary code execution as superuser; or high-impact denial of service; or some backdoors
Medium  Arbitrary code execution not as superuser; or write access to important or arbitrary data; or medium-impact denial of service; or some backdoors; or most invasive scanning tools
Low     Reconnaissance tools; or policy violation such as P2P networks and instant messengers; or troubleshooting signatures; or authorized activity
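The keyword rules for IPS attack types can be sketched as follows (a simplified illustration; the function name is ours, and the HTTP fallback is checked last so it only fires when no other type applies):

```python
def ips_types(name, description):
    """Assign IPS attack types per the keyword rules described above."""
    text = (name + " " + description).lower()
    types = set()
    for kw in ("worm", "virus", "rootkit", "trojan", "backdoor"):
        if kw in text:
            types.add(kw)
    if name.startswith("OS Attack:"):
        types.add("OS exploit")
    elif name.startswith("Attack:"):
        types.add("exploit")
    if "adware" in text or "spyware" in text:
        types.add("adware/spyware")
    # "HTTP" only indicates a web attack when no other type applies.
    if (name.startswith("Web Attack:") or "MS IE" in name
            or "MSIE" in name or ("HTTP" in name and not types)):
        types.add("web attack")
    return types
```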

Severity. We extract this attribute automatically as it always follows the keyword “severity” in the descriptions.

4 Analysis

In this section, we analyze the AV and IPS catalogs. We analyze the threat type and severity based on the two catalogs. We analyze other aspects based only on the AV catalog, as the AV catalog contains more attributes than the IPS catalog.

We distinguish between specific and generic threats when analyzing the AV catalog, as specific and generic threats correspond to different granularity levels. When analyzing attacks in the IPS catalog and specific threats in the AV catalog, the unit of analysis is a family. For example, when computing the distribution of threat types, we count the number of families in each type, and not the number of threats. We choose to focus on families because it is relatively easy to create a large number of threat variants. When threats from the same family have different attribute values, we consider that the family has a weighted association with each of these values. We compute the weight as the ratio of threats that have a given attribute value out of all threats of that family. For example, if two threats are worms and one threat is a virus, we consider that the threat family is a worm with weight 2/3 and a virus with weight 1/3. We also consider that a family is discovered the first time a threat belonging to that family is discovered. Unfortunately, in the case of generic threats in the AV catalog, we are mostly unable to identify the threat family detected. For generic threats, our unit of analysis is a threat. For example, when computing the distribution of threat types detected, we count the number of such threats belonging to each type. Finally, we note that although we are able to associate an accurate family name with some generic threats, in the rest of the paper, threat families in the AV catalog always refer to threat families associated with specific threats.

1 We choose to assign type “exploit” instead of assigning type “attack” because assigning type “attack” would be confusing given that all entries in the IPS corpus are considered attacks.
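The weighted family attribution described above reduces to the relative frequency of each attribute value among a family's members. A minimal sketch:

```python
from collections import Counter

def family_type_weights(member_types):
    """Return each type's weight, i.e. the fraction of the family's
    threats that carry that type (worm 2/3, virus 1/3 in the
    example above)."""
    counts = Counter(member_types)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}
```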

We discuss the availability of attribute values in the catalogs in Sect. 4.1. In Sects. 4.2 and 4.3, we discuss threat type and severity, respectively. Finally, we examine the amount of time Symantec takes to release a signature for a newly discovered threat in Sect. 4.4.

4.1 Data availability

All attacks listed in the IPS online corpus have descriptions, and we are able to extract all the attributes presented in Sect. 3 from these descriptions. On the other hand, the AV corpus contains descriptions for 12,398 threats, but lists another 28,132 threats without providing any description for these threats. The analysis in this paper exclusively takes into account threats that have descriptions. In general, Symantec prioritizes providing descriptions for severe and/or emergent threats that affect home users.

Within the 12,398 threats studied in this paper, Table 5 presents the percentage of threat families and generic threats with available attribute values. We see that the threat type is available for more than 98 % of threat families and generic threats. The discovery date is available for 70 % of threat families, and 96.56 % of generic threats. Fortunately, since the signature release dates are more often available, the discovery year is available for more than 98 % of threat families and generic threats. In our over-time analysis, we use a yearly granularity, and disregard the less than 2 % of threats for which the discovery year is missing. Severity measures are available for 63.22 % of threat families and 95.77 % of generic threats.

Table 5 AV. Availability of attribute values in the AV catalog

Attribute                                      Threat families  Generic threats
Type                                           98.93 %          99.03 %
Discovery                                      70.30 %          96.56 %
Initial rapid release                          95.30 %          94.35 %
Initial daily certified version                95.78 %          95.90 %
Discovery year                                 98.67 %          99.96 %
Severity measures (damage, distribution,       63.22 %          95.77 %
removal, threat containment)

Fig. 1 AV. Number of threat families and generic threats over time

Figure 1 presents the number of threat families and generic threats over time. We see that the number of threat families is very small prior to year 2000. This is probably due to the fact that the AV online corpus has very low coverage of these old threats. We only consider the period later than 2000 in our over-time analysis of threat families. From the figure, we also see that the number of generic threats is very small prior to 2008, but becomes very significant after that year. Starting from 2008, Symantec starts extensively using signatures able to detect multiple threats. We only consider the period later than 2008 in the over-time analysis of generic threats, given that the number of such threats only becomes significant in the corpus during that period.

Figure 2 presents the over-time availability of the threat type for threat families and generic threats.2 We see that the availability of the threat type is always above 97 %. Figure 3 presents the over-time availability of the threat severity measures. Unfortunately, we see that the availability of the severity measures varies between 45 % and 93 % for threat families. This means that we need to be careful when interpreting the evolution over time of threat severity in the catalog. The availability of the severity measures for generic threats is always above 93 %.

We discuss the implications of data availability on our findings throughout the remainder of Sect. 4.

2 We only consider the period later than 2008 for generic threats. Prior to 2008, the number of generic threats is small, and thus the analysis is not meaningful.

Fig. 2 AV. Percentage of threats for which type is available

Fig. 3 AV. Percentage of threats for which severity measures are available

4.2 Types

Figure 4 presents the distribution of threat types in the AV catalog and Fig. 5 presents the distribution of attack types in the IPS catalog. From the figures, we see that the distribution differs substantially across the two catalogs, as the AV and IPS systems operate differently. More specifically, the AV examines files and the IPS examines network activity. From Fig. 4, we see that in the AV catalog, trojans, worms, viruses and adware/spyware account for the vast majority of threats. From Fig. 5, we see that in the IPS catalog, web attacks constitute the vast majority of attacks, followed by exploits. Worms, adware/spyware and trojans each account for less than 5 % of attacks.

Figure 6 presents the evolution over time of threat types inthe AV catalog. More specifically, in a given year, we countthe number of threat families and generic threats from dif-ferent types included in the corpus. In Fig. 6a, we find a verylarge number of worms during the period 2001–2004. Duringthat period, worms were probably the most serious concern tothe security community as many major worms were releasedduring that period. For example, Code Red and SQL Slammerwere released in 2001 and 2003 respectively. We find almostno adware/spyware in the catalog prior to 2002, but manyadware/spyware families during the period 2003–2006. Sim-ilarly, the catalog contains almost no fake application prior to2005, but a considerable number of such families during theperiod 2006–2009. Adware/spyware and fake applications


Longitudinal analysis of a large corpus


Fig. 4 AV. Threat type empirical distribution

Fig. 5 IPS. Attack type empirical distribution

e.g., fake anti-viruses, are mainly profit-oriented. The number of macros decreases over time and reaches zero starting from 2006. However, this may be due to macros being labeled viruses in recent years.

We discuss the over-time evolution of viruses and trojans based on Fig. 6a and b, since there is a substantial number of trojans and viruses among both threat families and generic threats. The number of trojans among threat families is always large, except during the period 2007–2010. However, we find a considerable number of trojans among generic threats starting from 2008. The number of virus families decreases over time in Fig. 6a. However, we find a sizable number of viruses among generic threats starting from 2008.

The fact that the AV corpus only contains descriptions for about 30 % of threats means that our results need to be


Fig. 6 Evolution of threat types over time

interpreted carefully. More precisely, the number of threats of a given type in the AV corpus discovered in a given year is a measure of both the number of discovered threats of that type and the extent to which Symantec considers that threat type severe and/or emerging during that year.

4.3 Severity

Table 6 presents the distribution of threat severity measures in the AV catalog. From the table, we see that high-severity threats account for a minority of threats and that low-severity threats account for the vast majority, and this according to all severity measures. From the table, we also see that generic threats mostly have lower severity than threat families. Symantec mainly uses generic detection for low-severity threats. If a threat later turns out to have high severity, Symantec uses specific detection for that threat.

Given that only about 30 % of threats have descriptions in the AV online threat corpus and that high-severity threats are more likely to have descriptions, the table likely overestimates the percentage of high-severity threats among all cyber threats that affect home users. In other words, the percentage of high-severity threats would have been lower if severity measures for all threats were available. This finding does not indicate that cyber threats are not a major


G. Mezzour et al.

Table 6 AV. Threat severity assessment

Threat families
  Distribution level:  High 9.32 %        Medium 10.76 %     Low 79.91 %
  Damage level:        High 2.62 %        Medium 29.16 %     Low 68.20 %
  Threat containment:  Difficult 0.16 %   Moderate 1.72 %    Easy 98.11 %
  Removal:             Difficult 1.01 %   Moderate 21.58 %   Easy 77.40 %

Generic threats
  Distribution level:  High 0.78 %        Medium 4.42 %      Low 94.79 %
  Damage level:        High 0.27 %        Medium 6.36 %      Low 93.36 %
  Threat containment:  Difficult 0 %      Moderate 0.78 %    Easy 99.21 %
  Removal:             Difficult 0.27 %   Moderate 0.96 %    Easy 98.75 %


Fig. 7 AV. Evolution over time of the damage levels

problem. Rather, it may indicate that if we had a more fine-grained severity measure, we would find that threat severity follows a heavy-tailed distribution [7] such as a power-law distribution. It would be interesting to test this hypothesis in future work based on other data sources. For example, the WINE data [15] can help test whether the number of computers that encounter different threats follows a heavy-tailed distribution.
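As a sketch of how such a heavy-tail hypothesis could be tested, the exponent of a continuous power-law tail can be estimated by maximum likelihood, alpha = 1 + n / sum ln(x/xmin), the standard estimator of Clauset, Shalizi, and Newman. The per-threat counts below are invented for illustration:

```python
import math

def powerlaw_alpha(samples, xmin):
    """Continuous power-law MLE: alpha = 1 + n / sum(ln(x / xmin)),
    fitted over the tail x >= xmin."""
    tail = [x for x in samples if x >= xmin]
    n = len(tail)
    return 1.0 + n / sum(math.log(x / xmin) for x in tail)

# Hypothetical per-threat counts (e.g. number of machines encountering each threat).
counts = [1, 1, 2, 2, 3, 5, 8, 20, 90, 400]
print(round(powerlaw_alpha(counts, xmin=1), 2))  # 1.51
```

A goodness-of-fit test (e.g. KS distance between the fitted and empirical tail) would then decide whether the power law is plausible.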

We now examine how two of these severity measures, namely the damage and distribution levels, evolve over time based on Figs. 7 and 8. Figure 7 shows, for any given year, the percentage of low-, medium-, and high-damage threats among all threats discovered that year. From the figure, we see that the percentage of high-damage threats is consistently under 5 %. Unfortunately, meaningfully drawing conclusions from the over-time evolution of the percentage of medium- and low-damage threats is difficult. This is because the availability of threat severity measures varies considerably over time, as shown in Fig. 3. Figure 8 shows the percentage of low-, medium-, and high-distribution threats over time. From the figure, we see that the percentage of high-distribution threats ranges between 11 % and 20 % during the period 2001–2005. During that time period, we find a very large number of worms, as explained in Sect. 4.2. Outside this time period, the percentage of high-distribution threats is under 5 %.

In the IPS catalog, 78.42 % of attacks are high severity, 21.11 % are medium severity, and 0.40 % are low severity. One possible reason why the majority of attacks in the IPS catalog have high severity is that distinguishing malicious network traffic from benign traffic is difficult and prone to false positives. Unless an attack is severe and distinguishable with low false-positive probability, it is relatively unlikely that a signature is created for that attack.

4.4 Duration between threat discovery and signature release

In this section, we examine the duration between threat discovery and signature release3. In our analysis, we distinguish between rapid release signatures and daily release signatures. Rapid release signatures are released after basic testing, while daily release signatures are released after thorough testing. In our analysis, we only examine specific threats. Moreover, our unit of analysis is a threat, not a threat family, unlike

3 A related duration that we do not study in this paper is the duration between threat release in the wild and threat discovery. We refer readers interested in that duration to a study on zero-day attacks [9].



Fig. 8 AV. Evolution over time of the distribution levels

the analysis in prior sections. We disregard generic threats in this analysis because a generic threat consists of an aggregation of multiple threats, and interpreting the duration between threat discovery and signature release for generic threats is not straightforward.

Figure 9 presents the distribution of this duration. From the figure, we see that Symantec releases both kinds of signatures for more than 50 % of threats on the same day these threats are discovered. We also see that signatures for more than 90 % of threats are released within one day of discovery. This is good news given the large number of threats discovered daily. However, signatures for some threats are released months or even years after discovery. During that time period, customers' machines are not necessarily vulnerable, as generic signatures may be able to detect these threats. We note that the cumulative distribution function in Fig. 9 is estimated only from threats that have descriptions. If threats that have descriptions are also threats for which signatures are created promptly, the empirical distribution given in Fig. 9 underestimates the actual distribution.
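The same-day and within-one-day percentages come directly from the empirical CDF of the discovery-to-release duration. A minimal sketch, with invented durations in days:

```python
def ecdf_at(durations_days, t):
    """Empirical CDF: fraction of threats whose signature was released
    within t days of the threat's discovery."""
    return sum(d <= t for d in durations_days) / len(durations_days)

# Hypothetical durations (days between discovery and signature release).
durations = [0, 0, 0, 0, 0, 0, 1, 1, 1, 30]
print(ecdf_at(durations, 0), ecdf_at(durations, 1))  # 0.6 0.9
```

Evaluating `ecdf_at` over a grid of `t` values yields a curve like the one in Fig. 9.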

Figure 10 presents the amount of time Symantec takes to release a daily release signature for different distribution

Fig. 9 AV. Empirical cumulative probability distribution function of the duration between threat discovery and signature release

Fig. 10 AV. Empirical cumulative probability distribution function of the duration between the release of a daily release signature and the discovery of a threat, for different distribution levels

levels. From the figure, we see that signatures are released faster for high-spreading and medium-spreading threats than for low-spreading threats. In order to test the statistical significance of this finding, we use the Kolmogorov-Smirnov (KS) test. The KS test is a pairwise non-parametric test that does not assume the underlying distribution to be normal. We find that the difference between high-spreading and low-spreading threats, and the difference between medium-spreading and low-spreading threats, are significant at the 5 % level. The difference between high-spreading and medium-spreading threats is not significant at the 5 % level. This result suggests that Symantec prioritizes releasing signatures for fast-spreading threats, as these threats are likely to affect a large number of computers.
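A self-contained sketch of the two-sample KS test computes the maximum gap between the two empirical CDFs and compares it against the standard asymptotic 5 % critical value c(0.05) * sqrt((n + m) / (n * m)), with c(0.05) ≈ 1.358. The duration samples below are invented; in practice one would use something like SciPy's `ks_2samp`:

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(s, t):
        return sum(x <= t for x in s) / len(s)

    return max(abs(ecdf(a, t) - ecdf(b, t)) for t in a + b)

def ks_reject_5pct(a, b):
    """Reject equality of distributions at the 5 % level (asymptotic critical value)."""
    n, m = len(a), len(b)
    return ks_statistic(a, b) > 1.358 * ((n + m) / (n * m)) ** 0.5

# Hypothetical signature-release durations (days) for two distribution levels.
high = [0, 0, 0, 1, 1]
low = [2, 5, 9, 14, 30]
print(ks_statistic(high, low))  # 1.0 (the two samples do not overlap)
```

With such small samples the asymptotic threshold is only an approximation; the paper's comparisons rest on far larger groups of threats.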

Figure 11 presents the duration to release signatures for different damage levels. The duration is smaller for medium-damage than for low-damage threats, and this difference is statistically significant at the 5 % level. From the figure, it appears that the duration for high-damage threats is smaller than the duration for low-damage threats and larger than the duration for medium-damage threats. These differences are, however, not statistically significant.


Fig. 11 AV. Empirical cumulative probability distribution function of the duration between the release of a daily release signature and the discovery of a threat, for different damage levels

5 Related work

The most closely related work is a study of worm characteristics [21] based on threat description corpora by Network Associates, the former parent company of McAfee. That work, however, only focuses on worms and uses a smaller data set than the one we use. We also find reviews of particular threat types such as smartphone malware [42] and Java viruses [31].

Previous research studying threat characteristics based on field data provides interesting insights, but usually has a different focus than our work. Field data typically covers a period of a few months or 2–3 years, and is thus unable to provide the same historical perspective as our analysis. Moreover, field data typically focuses on one attack or type of attack, and is thus unsuitable for a comprehensive analysis across all cyber threats. For example, previous field-data work studies spam [20], spyware [29], botnets [1], Android malware [45], zero-day attacks [9], targeted attacks [41], the SQL Slammer worm [28], exploit kits [22], and malware that propagates using server-side injection [23].

The analysis of software vulnerabilities and exploits has attracted some attention [5,11,12,16,17,33]. While vulnerabilities and cyber threats are related, the two are different. First, there is no one-to-one correspondence between cyber threats and vulnerabilities. A cyber threat may exploit zero, one, or multiple software vulnerabilities. Similarly, one software vulnerability may be exploited by zero, one, or multiple threats. Moreover, beyond the exploited software vulnerability, cyber threats have other attributes such as the propagation mechanism.

There are other threat description corpora [25,43] besides Symantec's corpora [36,37]. We choose to analyze Symantec's corpora because Symantec is a key player in the cyber security industry [30] and because we intend to use our catalogs to mine the WINE data sets [15]. Merging multiple threat corpora would be interesting, but is complicated

due to the absence of a standard threat taxonomy. There have been attempts [3,10,14,18,24,44] to develop a malware taxonomy, but such taxonomies are not used by major security players. Finally, many security vendors publish yearly threat reports [26,34,35]. However, these reports tend to focus on the top threats observed during a given year.

Anderson et al. [4] performed a systematic assessment of the monetary cost of cybercrime. The work concludes that society should spend less money on defensive security measures and more on responsive ones, such as arresting cybercriminals. While this research is interesting, it has a different focus and goal than our work. First, that work mainly relies on monetary loss estimates collected from various sources such as banks. Moreover, the work only considers cyber attacks that cost more than $10m per year worldwide.

6 Limitations

Our analysis is based on threat description corpora from a single vendor. While Symantec is a key anti-virus vendor [30], the analysis in this paper strongly relies on Symantec's perspective. That perspective affects which threats Symantec considers worthy of being described in the corpora, and how threats are labeled. Researchers [6,13,28] have pointed out that threat labeling is not always consistent across anti-virus vendors. However, this does not necessarily imply that anti-virus labels from a particular vendor are wrong. This is rather a consequence of the lack of unified guidelines and taxonomy. It is worth noting that anti-virus labels are often used as ground truth for evaluating new research techniques [2,8,19,32].

Finally, we inevitably introduced errors when extracting attributes from the online descriptions. This is particularly the case for the IPS corpus, for which we had to rely on a semi-manual process. However, we believe that our results remain relevant, since we use a very large number of threats and are only interested in their overall characteristics.

7 Future work

While our catalogs contain important information extracted from the threat corpora, they do not contain all information available in these corpora. For example, our catalogs do not contain the type of exploited vulnerability, such as buffer overflow or SQL injection, or the distribution mechanism, such as email propagation or network probe. We leave extracting that information for future work. Another future work direction is to investigate whether studying threat description corpora from other vendors [25,43] would yield findings similar to those presented in this paper. Along the same lines, it would be interesting to merge corpora from


different vendors in order to gain a more comprehensive perspective. It is worth noting that the Symantec AV corpus contains threat names from other vendors for about 20 % of threats, which can help start the merger.
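Such a merger could start from an alias index built from the other-vendor names listed in the Symantec descriptions. A minimal sketch, with hypothetical field names (`name`, `other_vendor_names`) rather than the actual catalog schema:

```python
def merge_on_aliases(symantec_catalog, other_catalog):
    """Match another vendor's catalog entries to Symantec entries via the
    other-vendor names listed in the Symantec descriptions."""
    alias_index = {}
    for entry in symantec_catalog:
        for alias in entry.get("other_vendor_names", []):
            alias_index[alias.lower()] = entry["name"]
    # Map each other-vendor entry to the matching Symantec name, or None.
    return {e["name"]: alias_index.get(e["name"].lower()) for e in other_catalog}

# Toy example with invented names.
symantec = [{"name": "W32.Sality", "other_vendor_names": ["Win32/Sality"]}]
other = [{"name": "Win32/Sality"}, {"name": "SomethingElse"}]
print(merge_on_aliases(symantec, other))
```

Exact alias matching only covers the roughly 20 % of threats with cross-vendor names; the remainder would need fuzzier matching or a shared taxonomy.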

It would also be interesting to use machine learning and natural language processing algorithms to automate the extraction of attribute values from threat descriptions. This is especially relevant as new threats are continuously added to the corpora. Our catalogs can serve as labeled data for training machine learning algorithms.

Finally, it would be interesting to study threat type and severity while taking into account the prevalence of different threats among infected computers. The WINE data [15] and the catalogs presented in this paper can help perform such an analysis. Unfortunately, such a study would only cover the period November 2009 - September 2011, and would not provide the same historical perspective as this paper. It would still be interesting to compare the results of that study with the findings of this paper during the time period in which the two data sets overlap.

8 Conclusion

In this paper, we process and analyze Symantec's AV and IPS threat description corpora. The AV corpus contains descriptions of more than 12,400 threats, and the IPS corpus contains descriptions of more than 2,700 attacks.

We find that the prevalence of different threat types in the corpora varies considerably over time. In a given year, the prevalence of a threat type in the corpus reflects both the number of discovered threats of that type and the extent to which that type is serious and/or emerging during that year. Worms are very prevalent in the AV catalog during the period 2001–2004. Later on, adware/spyware and fake applications become prevalent. Viruses and trojans are prevalent throughout almost the entire time period. Finally, web attacks account for the vast majority of attacks in the IPS catalog.

Our analysis also indicates that high-damage threats constitute a very small minority of threats in the AV catalog. High-damage threats cause file destruction or modification, very high server traffic, large-scale non-repairable damage, large security breaches, or destructive triggers [38]. Attackers nowadays are mainly interested in monetary gains, and high-damage threats are not necessarily the most lucrative threats. Moreover, threat writers may refrain from creating high-damage threats because these threats are more likely to be noticed and removed.

In the IPS catalog, the majority of attacks have high severity. Severity levels in the IPS catalog capture both attack prevalence and potential attack damage. One possible explanation for this finding is that network-level detection is difficult and error prone. IPS attack signatures are mainly created for high-severity attacks that can be accurately distinguished from benign network traffic.

Finally, we find that Symantec releases signatures for the vast majority of threats in the catalog within one day of their discovery. In some cases, Symantec releases signatures months or even years after discovery. Because the analysis in this paper only covers threats with descriptions, the durations we report may be smaller than those we would find if we could take all threats into account. It is, however, worth noting that during that time period customers are not necessarily vulnerable: the threat may fall under a generic threat and be detected, for example, using heuristics.

Acknowledgments The authors would like to thank Matthew Elder and Tudor Dumitras for their excellent feedback and support.

References

1. Abu Rajab, M., Zarfoss, J., Monrose, F., Terzis, A.: A multifaceted approach to understanding the botnet phenomenon. In: Internet Measurement Conference (IMC), pp. 41–52 (2006)

2. Allodi, L., Massacci, F.: A preliminary analysis of vulnerability scores for attacks in wild: the ekits and sym datasets. In: Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), pp. 17–24. Raleigh, NC (2012)

3. Alvarez, G., Petrovic, S.: A new taxonomy of web attacks suitable for efficient encoding. Comput. Secur. 22, 435–449 (2003)

4. Anderson, R., Barton, C., Böhme, R., Clayton, R., van Eeten, M.J.G., Levi, M., Moore, T., Savage, S.: Measuring the cost of cybercrime. In: Workshop on the Economics of Information Security (WEIS), pp. 1–31. Berlin, Germany (2012)

5. Arbaugh, W.A., Fithen, W.L., McHugh, J.: Windows of vulnerability: a case study analysis. Computer 33(12), 52–59 (2000)

6. Bailey, M., Oberheide, J., Andersen, J., Mao, Z.M., Jahanian, F., Nazario, J.: Automated classification and analysis of internet malware. In: International Symposium on Recent Advances in Intrusion Detection (RAID), pp. 178–197 (2007)

7. Barabási, A.L.: The origin of bursts and heavy tails in human dynamics. Nature 435, 207–211 (2005)

8. Bayer, U., Comparetti, P.M., Hlauschek, C., Kruegel, C., Kirda, E.: Scalable, behavior-based malware clustering. In: Network and Distributed System Security Symposium (NDSS). San Diego, CA (2009)

9. Bilge, L., Dumitras, T.: Before we knew it: an empirical study of zero-day attacks in the real world. In: Computer and Communications Security Conference (CCS), pp. 833–844. Raleigh, NC (2012)

10. Bishop, M.: A taxonomy of (Unix) system and network vulnerabilities. Tech. Rep. CSE-95-10, Department of Computer Science, University of California Davis (1995)

11. Bozorgi, M., Saul, L.K., Savage, S., Voelker, G.M.: Beyond heuristics: learning to classify vulnerabilities and predict exploits. In: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 105–114 (2010)

12. Browne, H., Arbaugh, W., McHugh, J., Fithen, W.: A trend analysis of exploitations. In: IEEE Symposium on Security and Privacy. Oakland, CA (2001)

13. Canto, J., Dacier, M., Kirda, E., Leita, C.: Large scale malware collection: lessons learned. In: IEEE Workshop on Sharing Field


Data and Experiment Measurements on Resilience of Distributed Computing Systems (SRDS) (2008)

14. Cohen, F.: Information system attacks: a preliminary classification scheme. Comput. Secur. 16, 29–46 (1997)

15. Dumitras, T., Shou, D.: Toward a standard benchmark for computer security research: the Worldwide Intelligence Network Environment (WINE). In: Workshop on Building Analysis Datasets and Gathering Experience Returns for Security (BADGERS), pp. 89–96. Salzburg, Austria (2011)

16. Frei, S., May, M., Fiedler, U., Plattner, B.: Large-scale vulnerability analysis. In: SIGCOMM Workshop on Large-Scale Attack Defense, pp. 131–138 (2006)

17. Frei, S., Tellenbach, B., Plattner, B.: 0-day patch: exposing vendors' (in)security performance. In: Black Hat Technical Security Conference (2008)

18. Hansman, S., Hunt, R.: A taxonomy of network and computer attacks. Comput. Secur. 24, 31–43 (2005)

19. Hu, X., Chiueh, T., Shin, K.G.: Large-scale malware indexing using function-call graphs. In: Computer and Communications Security Conference (CCS). Chicago, IL (2009)

20. Kanich, C., Kreibich, C., Levchenko, K., Enright, B., Voelker, G.M., Paxson, V.: Spamalytics: an empirical analysis of spam marketing conversion. In: Computer and Communications Security Conference (CCS), pp. 3–14. Alexandria, VA (2008)

21. Kienzle, D.M., Elder, M.: Recent worms: a survey and trends. In: Proceedings of the ACM Workshop on Rapid Malcode (WORM), pp. 1–10. Washington, DC (2003)

22. Kotov, V., Massacci, F.: Anatomy of exploit kits: preliminary analysis of exploit kits as software artefacts. In: 5th International Conference on Engineering Secure Software and Systems (ESSoS), pp. 181–196. Paris, France (2013)

23. Leita, C., Bayer, U., Kirda, E.: Exploiting diverse observation perspectives to get insights on the malware landscape. In: IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 393–402 (2010)

24. Lough, D.L.: A taxonomy of computer attacks with applications to wireless networks. PhD thesis, Virginia Polytechnic Institute and State University (2001)

25. McAfee: McAfee threats report: third quarter 2012. http://www.mcafee.com/us/resources/reports/rp-quarterly-threat-q3-2012.pdf (2012). Last accessed: February 2013

26. McAfee: McAfee threats report: third quarter 2012. http://www.mcafee.com/au/resources/reports/rp-quarterly-threat-q3-2012.pdf (2012). Last accessed: December 2012

27. MITRE: CVE-Common Vulnerabilities and Exposures. http://cve.mitre.org/ (2012). Last accessed: October 2012

28. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., Weaver, N.: Inside the Slammer worm. IEEE Secur. Priv. 1(4), 33–39 (2003)

29. Moshchuk, A., Bragin, T., Gribble, S.D., Levy, H.M.: A crawler-based study of spyware on the web. In: Symposium on Network and Distributed System Security (NDSS). San Diego, CA (2006)

30. OPSWAT: Market share report. http://www.opswat.com/about/media/reports/antivirus-september-2012 (2013). Last accessed: March 2013

31. Reynaud-Plantey, D.: New threats of Java viruses. J. Comput. Virol. 1(1–2), 32–43 (2005)

32. Rieck, K., Holz, T., Willems, C., Düssel, P., Laskov, P.: Learning and classification of malware behavior. In: Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA), pp. 108–125. Paris, France (2008)

33. Shahzad, M., Shafiq, M.Z., Liu, A.X.: A large scale exploratory analysis of software vulnerability life cycles. In: International Conference on Software Engineering (ICSE), pp. 771–781 (2012)

34. Sophos: Security threat report. http://www.sophos.com/en-us/security-news-trends/reports/security-threat-report.aspx (2012). Last accessed: December 2012

35. Symantec: Internet security threat report. http://www.symantec.com/content/en/us/enterprise/other_resources/b-istr_main_report_2011_21239364.en-us.pdf (2011). Last accessed: October 2012

36. Symantec: Symantec attack signatures. http://www.symantec.com/security_response/attacksignatures/ (2012). Last accessed: October 2012

37. Symantec: Symantec threat explorer. http://www.symantec.com/security_response/landing/azlisting.jsp (2012). Last accessed: October 2012

38. Symantec: Threat severity assessment. http://www.symantec.com/security_response/severityassessment.jsp (2012). Last accessed: October 2012

39. Symantec: Types of virus definitions available for download. http://www.symantec.com/popup.jsp?popupid=sr_help_popup (2012). Last accessed: October 2012

40. Symantec: Symantec naming conventions. http://www.symantec.com/security_response/virusnaming.jsp (2013). Last accessed: June 2013

41. Thonnard, O., Bilge, L., O'Gorman, G., Kiernan, S., Lee, M.: Industrial espionage and targeted attacks: understanding the characteristics of an escalating threat. In: International Symposium on Research in Attacks, Intrusions and Defenses (RAID), pp. 64–85 (2012)

42. Töyssy, S., Helenius, M.: About malicious software in smartphones. J. Comput. Virol. 2(2), 109–119 (2006)

43. TrendMicro: Threat encyclopedia. http://about-threats.trendmicro.com/us/threatencyclopedia (2012). Last accessed: October 2012

44. Weaver, N., Paxson, V., Staniford, S., Cunningham, R.: A taxonomy of computer worms. In: ACM Workshop on Rapid Malcode (WORM), pp. 11–18 (2003)

45. Zhou, Y., Jiang, X.: Dissecting Android malware: characterization and evolution. In: IEEE Symposium on Security and Privacy, pp. 95–109. Oakland, CA (2012)
