

arXiv:cs.CR/0504045 v1 12 Apr 2005

Analyzing Worms and Network Traffic using Compression

Stephanie Wehner

Centrum voor Wiskunde en Informatica

Kruislaan 413, 1098 SJ Amsterdam, Netherlands

[email protected]

July 31, 2006

Abstract

Internet worms have become a widespread threat to system and network operations. In order to fight them more efficiently, it is necessary to analyze newly discovered worms and attack patterns. This paper shows how techniques based on Kolmogorov Complexity can help in the analysis of internet worms and network traffic. Using compression, different species of worms can be clustered by type. This allows us to determine whether an unknown worm binary could in fact be a later version of an existing worm in an extremely simple, automated manner. This may become a useful tool in the initial analysis of malicious binaries. Furthermore, compression can also be useful to distinguish different types of network traffic and can thus help to detect traffic anomalies: certain anomalies may be detected by looking at the compressibility of a network session alone. We furthermore show how to use compression to detect malicious network sessions that are very similar to known intrusion attempts. This technique could become a useful tool to detect new variations of an attack and thus help to prevent IDS evasion. We provide two new plugins for Snort which demonstrate both approaches.

1 Introduction

Recent years have seen a sharp increase in internet worms, such as Sasser and Msblaster, causing damage to millions of systems worldwide. Worms are automated programs that exploit vulnerabilities of computers connected to the network in order to take control of them. Once they have successfully infected a system, they continue to search for new victims. When a worm finds a new vulnerable computer it will infect it as well. Worms thus spread through the network on their own.

Given the enormous problems caused by worms, it is important to develop defenses against them. For this we need to analyze them in detail. Once a system is found to be infected, we would like to know how this happened, what actions the worm might perform and how we can prevent infection in the future. This paper is concerned with one of the first steps in such a diagnosis: we want to determine whether we have seen this worm before, or whether it is very similar to a known worm. To achieve this, we analyze the binary executable of the worm. Many known worms, such as Sasser, come in a number of variants. These are usually successive versions released by the same author to extend the worm's functionality. All such versions together form a family of related worms. Using compression, we first cluster the different versions of worms by family. We then compare an unknown worm to a number of known worms using compression, which often allows us to guess its family and thus simplifies further analysis. This very simple approach does not make use of any manual comparison of text or of especially selected patterns contained in the worm, as used in conventional analysis. Many binaries found in the wild are also compressed using UPX, which is then modified to prevent decompression. This makes it very difficult to analyze the binary using approaches which rely on disassembling it and comparing the flow of execution of two programs. Surprisingly, our approach still works even if UPX is used, although it becomes less accurate. Compression may therefore be a useful tool in the initial analysis of newly captured worms.

Infected systems and network attacks often cause variations in network traffic. In order to recognize and prevent such attacks one would like to detect these variations. Typically, anomalies are detected by searching for specific patterns within the network traffic. For example, Snort [11] can trigger an alarm when the string "msblast.exe" is found in a TCP session. This works very well once it is known what patterns to look for. Here, we are interested in recognizing anomalies even if we don't know what to look for. First of all, we explore the complexity of different types of network traffic using compression. The differences in complexity allow us to make a rough guess which class of application protocols gave rise to a certain network interaction. Intuitively, the complexity of an object is determined by how well it compresses. If something compresses well, we say it has low complexity. If, on the other hand, it can hardly be compressed at all, we say it has high complexity. For example, the complexity of SSH sessions is very high. If we observe TCP sessions on the same port with much lower complexity, we suspect an anomaly. This requires no additional knowledge of the anomaly or a search for specific traffic patterns. The compress Snort plugin allows Snort to trigger an alarm if the complexity of the observed traffic does not fall within a specified range.

Finally, we make use of compression to compare two different sessions. An attacker or a new version of a worm may generate slightly different network traffic, which no longer contains the specific pattern we were looking for, and thus escape detection. Here, we are interested in recognizing patterns which are very similar to existing ones. Informally, we say that two protocol sessions are close together if their combined complexity is not much larger than their individual complexities. This is the case if two sessions are very similar and thus describe each other very well. We can use this to determine whether an observed TCP session is similar to a prerecorded session corresponding to a known attack. The NCD Snort plugin allows Snort to trigger an alarm if an observed session is within a certain distance of a given session. Even though compression is a relatively expensive operation to perform on network traffic, we believe this approach could be useful to detect new types of attacks. Once such attacks have been recognized, specific patterns can be constructed and used in conventional intrusion detection.

1.1 Related Work

Graph-based methods have been used with great success to compare executable objects by Halvar Flake [5] as well as Carrera and Erdelyi [1]. Recently, Halvar Flake has also applied this to the analysis of malware [3]. Using these methods it is possible to gain information about the actual security problems exploited by worms, by comparing different versions of the same executable. This requires disassembly of the binary. Whereas this approach has the advantage of disclosing actual information about the nature of the security vulnerability, our focus here is very different. We provide a fast method for guessing the family of an observed worm without the need for disassembly, which can be used in the initial analysis of the executable. Based on this analysis, other methods can be applied more accurately.

Much work has been done to determine the nature of network traffic. What makes our approach fundamentally different is that we do not make any preselection of patterns or specific characteristics. We leave it entirely to the compressor to figure out any similarities.

Kulkarni and Bush [6] have previously considered methods based on Kolmogorov complexity to monitor network traffic. Their approach, however, is different, as they do not use compression to estimate Kolmogorov complexity. Furthermore, they argue that anomalies in network traffic stand out in that their complexity is lower than that of 'good' network traffic. Whereas we find this to be true for network problems caused by, for example, faulty hardware prompting a lot of retransmissions, we argue that anomalies are generally characterized by a different complexity than we would normally expect for a certain protocol. For example, if we normally expect only outgoing HTTP traffic on port 80 and we now find sessions which have a much higher complexity, this may equally well be caused by a malicious intruder who is running an encrypted session to the outside. In this specific example, the complexity has increased for malicious traffic.

Evans and Barnett [4] compare the complexity of legal FTP traffic to the complexity of attacks against FTP servers. To achieve this they analyzed the headers of legal and illegal FTP traffic. For this they gathered several hundred bytes of good and bad traffic and compressed it using compress. Our approach differs in that we use the entire packet or even entire TCP sessions. We do so because we believe that in the real world, it is hard to collect several hundred bytes of bad traffic from a single attack session using headers alone. Attacks exploiting vulnerabilities in a server are often very short and will not cause any other malicious traffic on the same port. This is especially the case in non-interactive protocols such as HTTP, where all interactions consist of a request and reply only.

Kulkarni, Evans and Barnett [7] also try to track down denial of service attacks using Kolmogorov complexity. They estimate the Kolmogorov complexity by computing an estimate of the entropy of the 1's contained in the packet. They then track the complexity over time using the method of a complexity differential. For this they sample certain packets from a single flow and then compute the complexity differential once. Here, we always use compression and do not aim to detect DDoS attacks.

1.2 Contribution

• We propose compression as a new tool in the initial analysis of internet worms.

• We show that differences in the compression ratio of network sessions can help to distinguish different types of traffic. This leads to a simple form of anomaly detection which does not assume prior knowledge of the exact nature of the anomalies.

• Finally, we demonstrate that compression can help to discover novel attacks which are very similar to existing ones.

1.3 Outline

In Section 2, we take a brief look at the concept of normalized compression distance, which we will use throughout this paper. Section 3 then shows how we can apply this concept to the analysis of internet worms. In Section 4 we examine how compression can be useful in the analysis of internet traffic. Finally, in Section 4.1.2 and Section 4.2 we use these observations to construct two Snort plugins to help recognize network anomalies.

2 Preliminaries

In this paper we use the notion of normalized compression distance (NCD) based on Kolmogorov complexity [8]. Informally, the Kolmogorov complexity C(x) of a finite binary string x is equal to the length of the shortest program which outputs x. The Kolmogorov complexity is in fact machine independent up to an additive constant. This means that it does not matter which type of programming language is used, as long as one such language is fixed. This is all we will need here. More information can be found in the book by Li and Vitanyi [8].

What does this have to do with compression? We can regard the compressed version of a string x as a program to generate x running on our "decompressor machine". Thus the Kolmogorov complexity with respect to this machine cannot be any larger than the length of the compressed string. We therefore estimate C(x) by compression: we take a standard compressor (here bzip2 or zlib) and compress x, which gives us x_compressed. We then use length(x_compressed) in place of C(x). In our case, the string x will be the executable of a worm or the traffic data.

The normalized compression distance (NCD) of two strings x and y is given by

NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}.

More details about the NCD can be found in [13]. The NCD is a very intuitive notion: if two strings A and B are very similar, one "says" a lot about the other. Thus given string A, we can compress a string B very well, as we only need to encode the differences between A and B. Then the compressed length of string A alone will not be very different from the compressed length of strings A and B together. However, if A and B are rather different, the compressed length of the strings together will be significantly larger than the compressed length of one of the strings by itself. For clustering a set of objects we compute the pairwise NCD of all objects in the set. This gives us a distance matrix for all objects. We then fit a tree to this matrix using the quartet method described in [13]. Objects which appear close to each other in the tree are close to each other in terms of the NCD.
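To make the definitions above concrete, the complexity estimate C(x) and the NCD can be sketched in a few lines of Python. This is our own illustrative sketch, not the paper's implementation; we use Python's bz2 module since the paper estimates C(x) with bzip2 (or zlib):

```python
import bz2
import os

def C(x: bytes) -> int:
    # Estimate the Kolmogorov complexity C(x) by the length of the
    # bzip2-compressed string, as described above.
    return len(bz2.compress(x))

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)}
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

# Two similar strings should be close (NCD near 0), while a string
# and unrelated random data should be far apart (NCD near 1).
s1 = b"GET /index.html HTTP/1.0\r\n" * 40
s2 = b"GET /index.htm HTTP/1.0\r\n" * 40
rnd = os.urandom(1000)
print(ncd(s1, s2) < ncd(s1, rnd))
```

Note that because real compressors have fixed overheads, the estimated NCD of even identical strings is slightly above zero, and the NCD of unrelated strings can slightly exceed one.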

To construct the trees displayed throughout this paper we make use of the CompLearn toolkit [2]. For information about networking protocols we refer to [10]. Information about the different worms we analyzed can be found in the Virus Encyclopedia [12].

3 Analyzing Worms

We now apply the notion of NCD to analyze the binary executables of internet worms. Once we have identified a malicious program on our computer, we try to determine its relation to existing worms. Many worms exist in a variety of different versions. Often the author of a worm later makes improvements to his creation, thereby introducing new versions. This was also the case with the recent Sasser worm. We call such a collection of different versions a family. The question we are asking now is: given a worm, are there more like it, and which family does it belong to?

For our tests, we collected about 790 different worms of various kinds: IRC worms, VB worms, Linux worms and executable Windows worms. Roughly 290 of those worms are available in more than one version.

3.1 Clustering Worms

First of all, we examine the relations among different families of worms using clustering. Our method works by computing the NCD between all worms used in our experiment. This gives us a distance matrix to which a tree is fit for visualization. In this section, we use bzip2 as our compressor.
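The distance-matrix step of this method can be sketched as follows. This is a hedged illustration with our own helper names; the subsequent tree fitting is done with CompLearn's quartet method and is not shown here:

```python
import bz2
from itertools import combinations

def C(x: bytes) -> int:
    # Complexity estimate: length of the bzip2-compressed string.
    return len(bz2.compress(x))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def distance_matrix(samples: dict) -> dict:
    # Pairwise NCD between all samples (name -> raw bytes). The
    # resulting symmetric matrix is what a tree-fitting tool such
    # as CompLearn takes as input for visualization.
    d = {(name, name): 0.0 for name in samples}
    for a, b in combinations(samples, 2):
        d[(a, b)] = d[(b, a)] = ncd(samples[a], samples[b])
    return d

# Toy stand-ins for worm binaries: two variants of one "family"
# and one unrelated sample.
samples = {
    "WormVa": b"\x90\x90push ebp; mov ebp, esp\n" * 50,
    "WormVb": b"\x90\x90push ebp; mov ebp, esp\n" * 50 + b"new payload",
    "Other":  bytes(range(256)) * 8,
}
d = distance_matrix(samples)
# Variants of the same family end up closer than unrelated binaries.
print(d[("WormVa", "WormVb")] < d[("WormVa", "Other")])
```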

3.1.1 Different Classes of Worms

To gain some insight into the relations among several families of worms, we first cluster worms of completely different types. For this we make use of text based worms (such as IRC worms, labeled mIRC and IRC) and worms whose names contain PIF, targeted at Windows systems. Worms prefixed with Linux are targeted at the Linux operating system. All other worms are aimed at Windows systems and are binary.

Looking at Figure 1 we see that text based worms have been clustered well together. We also see a split between binary and text based worms, as expected. This in itself is of course not very interesting. Nevertheless, this naive application of our method yielded some interesting surprises which caused us to completely reexamine our worm data. What is the LinuxGodog worm doing among the text based worms? Upon closer inspection we discovered that, unlike the other Linux worms, which were all binary executables, LinuxGodog was a shell script, thus being too similar to the script based IRC worms. A similar discovery was made when looking at the two Windows worms Fozer and Netsp, which are both Windows batch scripts. We also found out that SpthJsgVa was mislabeled, and is an IRC worm very closely related to the IRCSpth series. In fact, manual inspection revealed that it must be another variant. Similarly, IRCDreamircVi was clustered with the other IRC worms. It appeared that this was the only non-binary member of the IRCDreamirc series. Another interesting fact was revealed by clustering the Alcaul worm. A large number of different Alcaul worms were available, all in binary format. However, during this test we accidentally picked the only one which was mime encoded, and it thus appears closer to the text worms than to the binary worms. Our test also puts the binary Linux worms closer together than the binary worms for other systems. Only LinuxAdm appears further apart. We attribute this to the fact that Linux and Windows binaries are still not too far apart even if they are of different executable formats. Worms of the same family are put together. We can for example see a clear separation between the IRCSpth and the IRCSimpsalapim series of text based worms. The separation of the Mimail and Netres families shows that this approach is also applicable to binaries. The MyDoom worm was initially thought to be an exception. The two versions of MyDoom we obtained were not as similar as other worms of equal families. Closer examination, however, revealed that the MyDoom.a we used was in fact a truncated version of the binary. In general, our initial overview clusters the different worm types well.

3.1.2 Windows Worms

As text based worms are easy to analyze manually, we concentrate on binary executables. We first clustered a variety of Windows worms which were not UPX compressed. Surprisingly, even binary worms cluster quite well, which indicates that they are indeed very similar. The compressor seems to find a lot of similarities that are not immediately obvious from manual inspection. Figure 2 shows that the two available Sasser variants were clustered nicely together. The same goes for the two versions of Nimda. This shows that classification by compression might be a useful approach.

3.1.3 The Effect of UPX Compression

Many worms found in the wild are compressed executables, where the compression is often modified to prevent decompression. This makes it much harder to obtain the actual binary of the worm. But without this binary, we cannot compare text strings embedded in the executable or disassemble the worm to apply an analysis of the program flow. If a string is already compressed very well, it is almost impossible to compress it much more. Thus we initially expected that we would be unable to apply our compression based method with any success. Surprisingly, however, we are still able to cluster UPX compressed worms reasonably well. This indicates that UPX, which allows the binary to remain executable, does not achieve the same strength of compression as bzip2.

Figure 1: Different Classes of Worms

Figure 3 shows a clustering experiment with worms which we found UPX compressed initially. It shows that even though our method becomes less accurate, we still observe clustering, as for example in the case of the Batzback series. For other worms, such as Alcaul.q, clustering worked less well. This indicates that even if worms are UPX compressed, we still stand a chance of identifying their family by simple compression. Figure 4 shows the result of the clustering process after removing the UPX compression. We now observe better results.

However, it is not always possible to simply remove UPX compression before the analysis of the worm. UPX compression can be modified in such a way that the worm still executes, but decompression fails. This makes an analysis of the worm a lot more difficult. We therefore investigated how UPX compression scrambling influences the clustering process. All worms in our next experiment were selected to be UPX compressed and scrambled in the wild. Of the Alcaul series, only Alcaul.r was scrambled. We included the other UPX compressed Alcaul worms to determine whether having part of the family UPX compressed and scrambled, and part of it just plain UPX compressed, would defeat the clustering process.

As Figure 5 shows, our method can be used to give a first indication of the family of a worm, even if it cannot be UPX decompressed without great manual effort. Our Alcaul.r worm, which was scrambled, clusters nicely with the unscrambled Alcaul worms. The Lentin series is also put together.

Figure 2: Windows Worms

Figure 3: UPX Compressed Windows Worms

Figure 4: UPX Compressed Windows Worms after Decompression

Figure 5: UPX Compressed and Scrambled Windows Worms

3.1.4 Worms and Legitimate Windows Programs

Finally, we examine existing Windows programs and their relation to internet worms. For this we collected a number of small executables from a standard Windows installation and clustered them together with various different types of worms. Figure 6 depicts the outcome of our test. The legal Windows programs used were: net, netsh, cacls, cmd, netstat, cscript, debug.exe and command.com. Surprisingly, our experiment places most of the Windows programs closely together. It even appears that programs with similar functionality appear closely related. Only debug.exe and command.com are placed elsewhere in the tree, which we attribute to the fact that they have quite different functionality compared to the other programs. It would, however, be premature to conclude from this experiment that compression can always be used to distinguish worms from legal Windows programs. Our test included only a very limited selection of programs. However, it seems to suggest that programs with similar functionality are closely related. For example, CodeGreen is very close to CodeRed, which may be due to the fact that it exploits the same vulnerability.

3.2 Classifying Worms

We now try to guess the family of an unknown worm w. Determining the family of a worm can make further analysis a lot easier. Again, we do not search for specifically selected patterns in the executables or apply an analysis of the actual program. Instead, we simply use the NCD. To find the family of w, we compute the NCD between w and all available binary worms. The best match t is the worm which is closest to w in terms of NCD; we then set the family of w to t's family. In short, we assign the new worm w to the family F for which ∃t ∈ F such that ∀k ∈ W, NCD(w, t) ≤ NCD(w, k), where W is the set of all test worms. To avoid bad matches, we assign the label "unknown" to a worm if NCD(w, t) ≥ 0.65. This value was determined by experiment from our data set and may need to be adjusted in a different situation.
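This classification rule amounts to a nearest-neighbour search under the NCD with a cutoff. A minimal sketch, with our own helper names and toy stand-ins for worm binaries (the 0.65 cutoff is the paper's experimentally determined value):

```python
import bz2
import os

def C(x: bytes) -> int:
    # Complexity estimate: length of the bzip2-compressed string.
    return len(bz2.compress(x))

def ncd(x: bytes, y: bytes) -> float:
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

UNKNOWN_CUTOFF = 0.65  # experimentally determined value from the paper

def classify(w: bytes, known: list) -> str:
    # `known` is a list of (family, binary) pairs. The unknown worm w
    # is assigned the family of its closest match t under the NCD,
    # unless even the best match has NCD(w, t) >= 0.65.
    best_family, best_d = "unknown", float("inf")
    for family, binary in known:
        d = ncd(w, binary)
        if d < best_d:
            best_family, best_d = family, d
    return best_family if best_d < UNKNOWN_CUTOFF else "unknown"

# Toy data: a new variant of a known "family" versus random bytes.
known = [("Netres", b"scan 139; copy self to admin$\n" * 80),
         ("Mimail", b"MAIL FROM; RCPT TO; DATA; base64\n" * 80)]
variant = known[0][1] + b"extra propagation code"
rnd = os.urandom(3000)
print(classify(variant, known))
print(classify(rnd, known))
```

The variant lands close to its family and is labeled accordingly, while the random data is too far from every known worm and falls under the cutoff label "unknown".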

In our first experiment we used 719 different worms, binary and non-binary. For 284 of these worms, we had more than one representative of their family available. For every worm, we determined its closest match among the remaining worms. In 179 of the 284 cases where matching a worm to its family would have been possible, we succeeded. We obtained 39 bad matches. This may seem to be a very low number of correct assignments. Note, however, that we here matched all types of worms against each other and used a very large set of worms which occurred in only one version. In practice, one could restrict the search for the family of a worm to a more selected data set to obtain better results. For example, consider only worms which propagate via email. Restricting our attention to worms labeled "I-Worm" alone, we obtain a better classification, as shown in Table 1. We used 454 such worms, of which 184 had more than one representative of their family. Now only 13 worms were classified incorrectly.


Figure 6: Worms and Legitimate Windows Programs

Match             454 Worms (184 in Family)   Average NCD
Good Family       106                         0.68
Bad Family        13                          0.43
No Family Found   335                         0.84

Table 1: Classifying Worms


Our experiment makes us quite hopeful that this simple method could be of practical use in the future.

4 Analyzing Network Traffic

A lot of effort has gone into detecting attackers on a network. Despite the large number of available intrusion detection systems (IDS), intrusion detection is still a very active area of research. Intrusion detection can be split into two categories: host based intrusion detection, which focuses on detecting attackers on a certain machine itself, and network based intrusion detection, which aims to detect attackers by observing network traffic. Many IDS use a combination of both. In this section, however, we are only concerned with network based intrusion detection.

The most commonly used approach in existing systems is signature matching. Once a known attack pattern has been identified, a signature is created for it. Later network traffic is then searched for all possible signatures. The popular open source IDS Snort [11], for example, offers this approach. Clearly this works very well to identify the exact nature of the attack. However, it first of all requires a lot of manual effort: prior identification of malicious data and signature creation become necessary. Secondly, if the number of signatures is large, it is hard to check them all on a high volume link. In Section 4.1 we present a very simple compression based approach which does not suffer from these problems. Finally, if a worm slightly alters its behavior it will no longer correspond to the signature and will escape detection. In Section 4.2 we examine an approach which is able to recognize attack patterns which are similar to existing ones.

Signature matching is sometimes also referred to as misuse detection, as it looks for specific patterns of abuse. We speak of anomaly detection if we are interested in identifying all kinds of anomalies, even those caused by new and unknown forms of attacks. Signature based schemes can also be used for anomaly detection. For example, an IDS may keep a profile of the typical behavior of each user. It then tries to match observed behavior to the profile. If the deviation is too large, an alarm is triggered. Various efforts have also been made to apply neural networks to anomaly detection. Neural networks are often used to perform off-line analysis of data, as they are computationally very expensive. This has kept approaches based on neural networks from becoming widespread so far. In the following, we examine the use of compression as a tool to detect anomalies in network traffic.

4.1 The Complexity of Network Traffic

4.1.1 The Complexity of Protocols

The simplest approach we use is to estimate the complexity of different types of traffic. We thereby try to answer the following question: examining sessions on a specific port, can we determine whether one kind of traffic has been replaced by another? This may be useful to detect the success of certain remote exploits. For example, we may normally observe only SSH traffic on port 22. After an attacker has successfully exploited the SSH server, he may obtain a remote shell on the same port. His session is then very different from a normal SSH session.

The method we use here is again extremely simple. For different protocols we identify the average compression ratio of a session S: we take the payload of a session S and compress it using bzip2. The length of the compressed session S_compressed gives us an estimate of the complexity of S, C(S). We then calculate the compression ratio R as

R = ℓ(S_compressed) / ℓ(S).

Intuitively, R tells us how well S compresses. If R is small, S compresses well and thus has low complexity. If R is close to 1, S can hardly be compressed at all and its complexity is very high.
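This ratio is easy to reproduce. The following sketch (our own illustration, not code from the paper) uses Python's bz2 module to compute R for a fabricated repetitive text payload and for random bytes standing in for an encrypted session:

```python
import bz2
import os

def compression_ratio(payload: bytes) -> float:
    """Estimate the complexity of a session payload as the ratio
    R = len(compressed) / len(original), using bzip2."""
    if not payload:
        raise ValueError("empty payload")
    return len(bz2.compress(payload)) / len(payload)

# A repetitive text payload compresses well: R is small.
text_like = b"GET /index.html HTTP/1.0\r\nHost: example.org\r\n\r\n" * 100
# Random bytes stand in for encrypted traffic: R is close to 1 or above,
# since the compressor cannot shrink high-entropy data.
random_like = os.urandom(4096)

r_text = compression_ratio(text_like)
r_random = compression_ratio(random_like)
```

As the intuition above predicts, r_text comes out far below r_random.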

In order to apply this method in practice, the average compression ratio of good sessions on a certain port needs to be determined. To find anomalies in a new session, we extract the session from network traffic and compress it. If its compression ratio deviates significantly from the average compression ratio of the expected traffic, we conclude that an anomaly has occurred.

For the following comparison, traffic was collected on a small site providing web, email and shell services to about 10 users, with an ADSL uplink. The tag “nc” is used for terminal interactions. The payload of each session was extracted into a separate file using chaosreader and compressed using bzip2. We then calculated the compression ratio of each file. From this we computed the average compression ratio, and the standard deviation of this ratio, over all files belonging to a certain protocol.

As expected, the purely text-based protocols, such as IRC and nc, compress very well. Protocol sessions that carry encrypted data, or binary data which is usually already compressed, on the other hand, cannot be compressed very well. For example, SSH and IMAPS sessions, carrying an encrypted payload, hardly compress at all. Neither do HTTP and torrent sessions which carry compressed files. Why does SMTP traffic compress so badly? Closer inspection revealed that most captured SMTP sessions carried compressed viruses instead of text messages. It may therefore be desirable to examine only the non-data part of an SMTP session. We see that the standard deviation of SMTP traffic is quite high, which is caused by the fact that emails containing compressed attachments compress much less well than plain text messages.

[Figure 7: Average compressibility and standard deviation of the compression ratio per protocol (daa, dns, ftp, imaps, irc, nc, ntp, smtp, ssh, torrent, http); compression ratio shown on a scale from 0 to 1.2.]

This initial test shows that this method may well be applicable to separating protocols which differ greatly in their complexity. For example, we could tell an IRC session from binary web traffic. Likewise, we could identify a terminal interaction against a background of SSH traffic. However, this method is unsuited for distinguishing between file-transfer protocols, or between protocols carrying encrypted data. We furthermore note that averages for web and SMTP traffic may vary greatly depending on the web server accessed. Therefore averages would need to be determined for every site individually. We also hope to use this method to detect new and unknown protocols an attacker may use, as long as their compression ratio differs from the average compression ratio we expect.

4.1.2 Detecting complexity differences using Snort

In order to test for complexity differences, we provide the compress plugin for the popular open-source IDS Snort [11]. We can use it to specify a lower and an upper limit on the allowed compression ratio. If the observed traffic falls outside this window, an alarm is triggered. The following rule triggers an alarm if the compressibility of traffic on port 22 drops below 0.9, using zlib for compression.

alert tcp $HOME_NET any -> any 22 (msg:"Low complexity on port 22"; compress:moreThan 0.9,zlib; flow:to_client,established; classtype:bad-unknown; sid:2565; rev:5;)

In this example no explicit upper limit is given, so the default value of 2 will be used. The compression ratio can become larger than 1 if the input is completely incompressible and is actually enlarged by the compressor. The compress plugin takes the following comma-delimited arguments:

• moreThan <x>: legal traffic has a compression ratio of more than x, where x is a number between 0 and 1.

• lessThan <x>: legal traffic has a compression ratio of less than x.

• <compressor>: the type of compressor to use. Current values are zlib for gzip-style compression and rle for a simple simulation of run-length encoding.
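The window check performed by the plugin amounts to the following logic. This Python sketch is our own illustration of the argument semantics above (the function name and example payloads are ours), using zlib and the default upper limit of 2:

```python
import os
import zlib

def compress_alert(payload: bytes, more_than: float = 0.0,
                   less_than: float = 2.0) -> bool:
    """Return True (raise an alert) when the zlib compression ratio of
    the payload falls outside the legal window (more_than, less_than)."""
    if not payload:
        return False
    r = len(zlib.compress(payload)) / len(payload)
    return not (more_than < r < less_than)

# With moreThan 0.9 on port 22: an interactive shell session (highly
# compressible text) alerts, while encrypted SSH-like data does not.
shell_like = b"$ ls -la\r\n$ cd /tmp\r\n$ uname -a\r\n" * 50
encrypted_like = os.urandom(2048)
```

Here compress_alert(shell_like, more_than=0.9) alerts, because the plain-text shell session compresses far below the 0.9 threshold, while the random payload stays inside the window.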

4.2 Identifying similar attacks using Snort

In order to determine whether observed network traffic is similar to an existing attack pattern, we again make use of the normalized compression distance (NCD) from Section 2. If the NCD between a known attack pattern and the observed traffic is small, we conclude that a similar attack is in progress. This method is used by the NCD plugin. Since single packets can be very small, it makes sense to use the NCD plugin in combination with stream4 TCP stream reassembly and only examine fully reassembled TCP sessions.
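Recall that the NCD of two strings x and y is (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C denotes compressed length [13]. A minimal sketch of the comparison the plugin performs, with zlib as the compressor and fabricated session contents:

```python
import zlib

def C(x: bytes) -> int:
    """Approximate the complexity of x by its zlib-compressed length."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance of Cilibrasi and Vitanyi [13]."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

# Fabricated stand-ins: a recorded attack session, a slight variant
# of it, and an unrelated session.
recorded = b"POST /cgi-bin/vuln.php HTTP/1.0\r\nContent-Length: 4096\r\n" * 30
variant = recorded.replace(b"HTTP/1.0", b"HTTP/1.1")
unrelated = bytes(range(256)) * 8

close = ncd(recorded, variant)   # small: the sessions share most content
far = ncd(recorded, unrelated)   # large: little shared content
```

Because compressing the concatenation of two similar sessions costs little more than compressing one of them, close comes out well below far.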

In the following simple example, we have recorded a session of an attack using an older vulnerability of Apache with PHP 3.0.15 on FreeBSD. This used to be a widely available exploit under the name phploit.c [9]. The exploit comes with different options, as to whether a shell should be opened on the vulnerable host on a different port following the attack, or whether a shell should be executed directly. In our recorded session, we used the option of binding a shell to another port. We then made the following entry in our Snort configuration:

alert tcp $HOME_NET any -> webserver 80 (msg:"Close to phploit attack."; flow:only_stream; ncd:dist 0.8, file recordedSession; classtype:bad-unknown; sid:2566; rev:5;)

This entry tells Snort to trigger an alarm on all TCP sessions which have an NCD of less than 0.8 to the data contained in the file recordedSession. A new attack using the same options triggered this rule with an NCD of 0.608 to the recorded session. Executing the attack with the option to execute a shell directly also triggered this rule, with an NCD of 0.638. Mirroring the entire website on the test server did not result in any false positives. We have thus succeeded in recognizing the new, but slightly different, attack. Whereas the above variation on the attack could easily have been recognized by selecting a different pattern to look for initially, it nevertheless illustrates that the NCD may indeed be very useful in the detection of new attacks. Selected patterns are typically very short, which can make it much easier to escape detection by creating a small variation of the attack.

The plugin currently makes use of the zlib library for compression; other compression methods would be possible. It currently takes a single argument: dist <x>, where x is the maximum safe distance before the rule is triggered.

5 Conclusion and Open Questions

We analyzed the binary executables of worms of different families and clustered them by family. Our analysis shows that many worms cluster well. Our method even performs reasonably well if the worms are UPX compressed and the compression has been scrambled, which makes conventional analysis much harder. This indicates that our method may be a very useful tool in the initial analysis of newly found worms in the future. Many improvements are possible. First, different compression methods may give better results. Secondly, one could pre-classify worms into email viruses, internet worms, IRC worms and VB scripts. In Section 3.2, we showed that restricting ourselves to worms labeled “I-Worm” already gave better results. For now, our aim was to provide a tool for the analysis of worms. It may be interesting to see how this method can be incorporated into an automated virus scanner for email.

We showed that different types of traffic have different complexity. This might be a simple and effective tool to distinguish between legal and illegal traffic on the same port, provided the complexity of the traffic differs sufficiently. It does not require a priori knowledge of specific patterns which occur only in specific types of traffic. The compress plugin for Snort allows the detection of potentially malicious packets and sessions which fall outside the specified compressibility window. It would be interesting to see how this method performs under high load. Our simple approach also compresses traffic only once. Better performance may be achieved by compressing traffic in a continuous fashion. Some initial experiments which did not make use of Snort showed this to be rather promising. Compression is an expensive operation, but so is pattern matching. Using efficient and simulated compression may outperform matching against a large pattern list.

Using the normalized compression distance (NCD), we can detect anomalies similar to existing ones. The NCD plugin for Snort can sound an alarm when the observed session is too close to a given attack session in terms of the NCD. This comparison is expensive. Nevertheless, we believe that our approach could be useful for initially detecting new variations of attacks. Once such a variation has been recognized, new patterns can be extracted to be used in conventional intrusion detection systems. Our method seems promising and leaves room for further exploration: in this paper, we have only made use of standard compression programs. This also limits us to comparing packets and sessions exceeding a certain size, so that they can still be compressed to a smaller size by these programs. Perhaps a compression program specifically targeted at compressing small packets would give better results. It would also be possible to create a faster version of an existing compressor by merely simulating compression, because we never need to decompress again.

Finally, our method can be defeated by an attacker who intentionally alters the complexity of the traffic he generates. For example, he may interleave random data with the actual data to artificially increase the complexity of his session. Nevertheless, we believe that there are many scenarios where this is not possible, or where it makes an attack considerably more difficult.
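This evasion is easy to demonstrate: interleaving random bytes with a compressible shell session drives its compression ratio up toward that of normal encrypted traffic. A sketch of our own (the payloads and chunk size are fabricated for illustration):

```python
import bz2
import os

def ratio(p: bytes) -> float:
    return len(bz2.compress(p)) / len(p)

def pad_with_noise(data: bytes, chunk: int = 32) -> bytes:
    """Interleave a chunk of random bytes after every chunk of real
    data, artificially raising the session's apparent complexity."""
    out = bytearray()
    for i in range(0, len(data), chunk):
        out += data[i:i + chunk]
        out += os.urandom(chunk)
    return bytes(out)

shell = b"$ id\r\nuid=0(root) gid=0(root)\r\n$ cat /etc/passwd\r\n" * 40
padded = pad_with_noise(shell)
# Roughly half of the padded stream is incompressible noise, so
# ratio(padded) lands far above ratio(shell).
```

The receiving end of such a channel would of course have to strip the noise again, which is what makes this evasion impractical in many scenarios.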


6 Acknowledgments

We thank Scott A. McIntyre and Alex Le Heux for valuable advice. Thanks also to the people who helped to collect worm data: Anthony Aykut, Lukas Hey, Jorrit Kronjee, Laurent Oudot, Steven de Rooij and Donnie Werner. The author acknowledges support from the EU fifth framework project RESQ IST-2001-37559 and the NWO vici project 2004-2009.

References

[1] E. Carrera and G. Erdelyi. Digital genome mapping - advanced binary malware analysis. In Virus Bulletin Conference, September 2004.

[2] R. Cilibrasi, A. Cruz, Julio B., and S. Wehner. CompLearn toolkit. http://complearn.sourceforge.net/index.html.

[3] T. Dullien. Personal communication.

[4] S. Evans and B. Barnett. Network security through conservation of complexity. MILCOM 2002.

[5] H. Flake. Structural comparison of executable objects. In DIMVA, pages 161-173, 2004.

[6] A. Kulkarni and S. Bush. Active network management and Kolmogorov complexity. OpenArch 2001, Anchorage, Alaska, 2001.

[7] A. Kulkarni, S. Bush, and S. Evans. Detecting distributed denial-of-service attacks using Kolmogorov complexity metrics. GE CRD Technical Report 2001CRD176, 2001.

[8] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications (2nd Edition). Springer Verlag, 1997.

[9] Security.is. Apache/PHP exploit. http://packetstormsecurity.org/0010-exploits/phploit.c.

[10] A. S. Tanenbaum. Computer Networks, 3rd edition. Prentice-Hall, 1996.

[11] Snort Development Team. Snort: the open source network intrusion detection system. http://www.snort.org.

[12] Viruslist.com. Virus encyclopedia. http://www.viruslist.com/eng/viruslist.html.

[13] P. Vitanyi and R. Cilibrasi. Clustering by compression, 2003. http://arxiv.org/abs/cs.CV/0312044.
