high-speed detection of unsolicited bulk email › ancs › 2007 › slides › lin-tan...

25
1 High-Speed Detection of Unsolicited Bulk Email Sheng-Ya Lin, Cheng-Chung Tan, Jyh-Charn (Steve) Liu, Computer Science Department, Texas A&M University Michael Oehler National Security Agency Dec, 4, 2007

Upload: others

Post on 04-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

1

High-Speed Detection of Unsolicited Bulk EmailSheng-Ya Lin, Cheng-Chung Tan, Jyh-Charn (Steve) Liu,

Computer Science Department, Texas A&M University

Michael OehlerNational Security Agency

Dec, 4, 2007

Page 2: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

2

Outline

• Motivation• Progressive Email Classifier (PEC) system

architecture• Experimental results

Page 3: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

3

Email Spamming No Longer Just a Nuisance

• Some Facts:– Botnet farms can hit any target (> 106)– bandwidth waste (3:1 or higher)– Network resource exploit & information stealing (malware planting)– Highly effective hit and run strategy (BGP, DNS, domain name, credit

card fraud)• Existing anti-spamming ware

– Large number of software copies and signatures to maintain– Comprehensive detection rules, but slow to respond

• Signatures management a major bottleneck– Acquisition and the deployment of signatures to numerous

machines– A small variation in the known signatures can easily defeat

a signature based filter– Spammers can test their designs with anti-spamming ware

before starting the (hit and run) campaign

Page 4: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

4

Spamming Behavior at a Glance

• Spammers do not have full freedom in launching spamming. – Follow the transport protocols to deliver messages– Messages must be perceivable and appealing to human users – Expensive to compose and personalize spamming messages:

• interactive (click my URL links) or passive• Low yield rate combined with greed lead to high spamming

volumes• Cheap to launch spamming: millions of zombie machines

each send a few copies – Any “hit back, interactive” method could cause severe harm to the

innocents • Summary

– Very difficult for spammers to achieve financial goals without leaving noticeable signatures, i.e. feature instances

– A challenge is how to keep up with their speed, volume, and diversity

Page 5: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

5

Our Approach• Lossy detection:

– focused mainly on the major offenders– Avoid false positive

• Timely acquisition of instances of selected features: – Position the detector at the Network Access Points (NAP)

• Highest concentration of samples for an enterprise network• Detect them before the flood already enters the network

• Work on the algorithm & data structure level, rather than any hardware platform– Broad spectrum of computing resources/constraints

• Regular emails are expected to have random distributions of strings that happen to fall into the spamming feature space– Moderated delivery of bulk, legitimate email

• A spamming stream: Invariant and variant parts– An invariant that also appears in regular emails cannot be used for filtering – For the first cut effort: URL (over 95% spamming have them)

Page 6: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

6

Competitive Aging-Scoring Scheme (CASS)

• A spamming invariant (string) is called its feature instance (FI). The essence of our technique:

“Extract FIs of emails and keep track of their occurrences. If exceeding a threshold: an UNBE stream”

• In a naïve approach, it takes O(1) to update the score of an FI, but O(N) to age all other FIs– A major computing cost

• CASS: a constant time algorithm– The time-to-live of an FI is reset each time when its score is

increased by one (when a new copy arrives)– The time-to-live of all other FIs is reduced by one– New complexity: O(1) for both scoring and aging– Exceeding a threshold: move it to the blacklist– No further copies in a time-out period: discard it

• It may not be a fixed physical time

Page 7: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

7

PEC Architecture

32bit

Hash table of Known strings

New stringidentified

Birth&Death Of strings

Hash vsstring

Aging and scoring of unknown strings

Email flow

Berkley DB

Sendmail

Feature instanceextraction

Page 8: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

8

Data Structure of Scoreboard

URL1, URL1 URL1AddressHash

FunctionH1(URL)20 bits

H2(URL) ++Score Index of SMT

Data Structure of a Cell

76 bits

m

m+1

m-1

HURL1Miss Count

n-1

0

Index

0

1

n

HURL1

Update

HURL2

H1(URL).H2(URL)

Data Structure of HURL32 bits

URL2

Remove

Δt

Entries for feature instances

Entries for feature instances

Scoreboard Table

Age Table

Exceeds UNBE threshold (S)?

Exceeds age threshold (M)

(hash_low, score, age_table location)

(score_table location)

Page 9: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

9

A Snapshot

S =10, M =20

HashURL : (414738(20-bit)+3724(12-bit))

Current feature being processed

Next feature instance

Active featuresArranged in their ages (mod N)

The current time location

MOD queue Placement

HashURL : (124489(20-bit)+176(12-bit))

The current time location

The entry [862 1822] is purged

time

history

newestoldest

Entry moved to blacklist

Queue size = 20

Page 10: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

10

Testbed EnvironmentThree Modules included:

1. Email generation2. PEC (Blacklist and scoreboard):3. Control and visualization console

Page 11: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

11

Experimental Configuration

• Email generator: Intel P4-3.0 Windows XP• Email Server: Xeon 3.0GHz, two single core

CPUs, Linux, Sendmail 8.14.1• Within a batch, the sender sends 2000 copies of

emails (uniformly mixed UNBEs and regulars).– S: 50– M: 2048– The average mail size: 1.5K bytes– One mail per 0.088 seconds on average.

Page 12: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

12

Workflow of Email Generation

`

Windows Control Console

Linux Email Server (Sendmail)

simulation parameters

Random Text

MIME structures

Feature Dictionary

Emails (bulk/regular)

BulkURL

Image Src

Regular

Bulk Regular

U R U U ….. R

MessageComposer

Subject Generation

“From” Generation

SMTP ProtocolDensity Generation

(uniform dist.)

Spamming Keyword selection

Page 13: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

13

UNBE Generation• Both UNBE and regular copies are injected with

URL links or remote image sources – Can adjust density, locations of variants and

invariants in the body of each copy to generate MIME messages.

– UNBE features extracted from 2005 TREC Public Spam Corpus, http://plg.uwaterloo.ca/~gvcormac/treccorpus/about.html

– Variants: random text taken from web sites– Keywords: User defined (not tested in this report)

• The message composer calls an SMTP library to send the generated emails to Sendmail

Page 14: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

14

Detection Latency of Single UNBE source

0

500

1000

1500

2000

2500

50 100 150 200 250 300

Number of messages in a bin

Dete

ctio

n La

tenc

y

Experimental ValueExpected Value

•Fix threshold and age table length under different densities.•Test six different UNBE densities (50, 100, 150, 200 …, 300 UNBE messages/bin)

Unit: Virtual clock

Page 15: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

15

Effects of Multiple UNBE Sources

0

500

1000

1500

2000

2500

50 100 150 200 250 300

Number of messages in a bin for each non-A UNBE

Det

ectio

n la

tenc

y

test 1test 2test 3test 4test 5test 6other sources

• Given an UNBE source A, six tests were made where one addition UNBE source is added to the experiment at a time. – The six lines marked as test[1-6]

• The density (instances/batch)– A: is fixed at 100 – Other UNBE sources: increased

from 50 to 300

• Result: – The detection latency of an UNBE

decreases with the number of UNBE sources

• When a source is captured, it is blocked form the scorebaord. The density measure in VC for others increases

Page 16: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

16

Throughput of URL Parser

0

5

10

15

20

25

30

1.5K 3.0K 4.5K 6.0K 7.5K

Size of Mial Body (K Bytes)

Thro

ughp

ut (1

000

Bod

ys/s

ec

The average Email size is from 1.5 KB to 7.5 KB, and each email has 2 URLs.

Page 17: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

17

Throughput of Scoreboard and Blacklist

0

100

200

300

400

500

600

700

800

900

1000

30 60 90 120 150

URL length (bytes)

Thro

ughp

ut (

K U

RLs

/sec

•Scoreboard: 1.2M transactions

•Blacklist: 0.9M (avg. 30 B) URLs, without including database access

Page 18: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

18

Pointer Table: reduce memory need (at a small cost of delay)

•In the detection window, a limited number of hashed values need to be tracked•Full table for 32-bit hash system takes too much space• Higher order bits used as the index, and the rest, and the rest bits maintained by a linked list (for each entry)

•If pointer table uses 20 bits for indexing, that means it has 1M entries, and age table length is 20K~70K, the maximum depth of linked list pointed by pointer table is 2.

Page 19: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

19

Threshold Setting

• Q1: “What is the minimum value of M to detect an UNBE attack (of known density) with a success probability of higher than α?”– (smaller M retire sooner)

• Q2: “For a given M, what is the maximum value of S to guarantee that the probability of the detection latency < ς is greater than α?”– (large S less likely false positive, but more enter

network before detection)

Page 20: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

20

Compute M and S

/( )f b fλ μ μ μ= +

(2, ) 1Mλ αΓ = −

1[ ]/SE Hς λ+=

2 ( 1)1

1

( ) 1/ 1/ ... 1/ 2 /

(1 2 ) /( 1)

S SS

S S

E H α α α α

α α α

−+

− − +

= + + + +

= − + −

Get M

Get S

Page 21: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

21

Detection latency when S=24, M=55

The prediction model is conservative

Page 22: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

22

Sensitivity of TR vs. S

Page 23: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

23

Sensitivity of TR vs. M (S=24)

Page 24: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

24

Summary

• PEC demonstrates the feasibility of high speed UNBE filtering at the network vantage points

• The method is not meant to replace existing solutions, but to defeat major offenders– 80-20 rule

• Expansion of the techniques to handle multiple features (bad words, dirty subnets, black lists, etc)– Integration/interface with existing tools

Page 25: High-Speed Detection of Unsolicited Bulk Email › ANCS › 2007 › slides › lin-tan ancs07...Unit: Virtual clock 15 Effects of Multiple UNBE Sources 0 500 1000 1500 2000 2500 50

25

Thank You!