Stream Mode Algorithms and Architecture for Line Speed Traffic Analysis
Steve Liu
Computer Science Department
Texas A&M University [email protected]
March 7, 2008
Background
• Network security solutions have broad presence in every network point
– Antivirus scanner, network intrusion detection systems, spamming filters
– Most solutions designed to operate at desktops or servers serve the intended purposes very well, but they are not perfect, nothing is perfect
• A DoD doctrine of defense-in-depth makes sense
– Use layers of (different) protection tools to make intrusion very inconvenient and very expensive
• Our interest: enhance network security via a stream mode traffic analysis approach at Network Access Point (NAP) of an enterprise network
Stream Mode Traffic Analysis
• Highly concentrated traffic flow at the network access point (NAP) is an ideal location for enterprise traffic analysis
– Single location to observe ingress & egress flows
– When the conditions are right, could even slow/stop the intrusion packets before they spread too deep, too broad into the network
• Commercial systems
– Deep Packet Inspection (DPI) engines, DAG cards
– Some virus and spamming filters at the gateway
– Firewall is one of the oldest products for such purpose
Stream Mode Packet Flow Analysis
[Figure: packet flow analysis pipeline. Packet sensor: promiscuous mode NIC card, router feed, Libpcap, TCPDUMP, ... Rules: N-gram, regular expression. Feature extractors: HW: Bivio, Cloudshield; SW: Flex. Feature instances: remote image src, src-dest IP pairs, URL.]
How to identify malicious traffic from the time series of feature instances?
Feature: Any string that fits a regular expression rule, e.g., “URL link”
Feature instance: An instance of a feature, e.g., “www.cs.tamu.edu”
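
The feature / feature instance distinction can be illustrated with a minimal Python sketch. The two regular expressions and the names FEATURE_RULES and extract_features are assumptions made for this illustration only; they are not the rules used by the extractors named on the slide.

```python
import re

# Hypothetical feature rules: each feature is a regular expression with one
# capture group, and every captured string is a feature instance.
FEATURE_RULES = {
    "url": re.compile(r"(https?://[^\s\"'<>]+)", re.IGNORECASE),
    "image_src": re.compile(r"<img[^>]+src=[\"']([^\"']+)[\"']", re.IGNORECASE),
}


def extract_features(text):
    """Return {feature name: [feature instances]} in order of appearance."""
    return {
        name: [match.group(1) for match in rule.finditer(text)]
        for name, rule in FEATURE_RULES.items()
    }


if __name__ == "__main__":
    sample = 'Visit http://www.cs.tamu.edu/ now <img src="http://img.example.net/a.gif">'
    for name, instances in extract_features(sample).items():
        print(name, instances)
```

Applied to a stream of messages, the output of such an extractor is the time series of feature instances that the rest of the talk analyzes.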
Two Key Issues: Algorithms and Resource Management
• Fast algorithms
• Efficient data structures
– Memory efficiency critical for stateful detection
• e.g., a 32 bit, y/n hash table: 500MB
• Real time vs. virtual clocks
• Progressive Email Classifier (PEC) system architecture
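
The 500MB figure is consistent with a table that keeps one yes/no bit for every possible 32-bit hash value, since 2^32 bits is 512 MiB. The sketch below checks that arithmetic and shows a small bit table of the same kind; the one-bit-per-value reading and the names table_bytes and BitTable are assumptions for illustration.

```python
def table_bytes(hash_bits):
    """Bytes needed for one yes/no bit per possible hash value."""
    return ((1 << hash_bits) + 7) // 8


class BitTable:
    """Yes/no table with one bit per hash value (hypothetical helper)."""

    def __init__(self, hash_bits):
        self.bits = bytearray(table_bytes(hash_bits))

    def mark(self, hashed):
        # Set the bit that belongs to this hash value.
        self.bits[hashed >> 3] |= 1 << (hashed & 7)

    def is_marked(self, hashed):
        return bool(self.bits[hashed >> 3] & (1 << (hashed & 7)))


if __name__ == "__main__":
    print(table_bytes(32) / 2**20, "MiB for a 32-bit hash")
    small = BitTable(16)  # a 16-bit table needs only 8 KiB
    small.mark(12345)
    print(small.is_marked(12345), small.is_marked(54321))
```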
Email spamming is no longer just a nuisance
• Some Facts:
– Botnet farms can hit any target (over millions of them)
– Bandwidth waste (3:1 or higher)
– Network resource exploit & information stealing (malware planting)
– Highly effective hit and run strategy at different protocol levels (BGP, DNS, domain name, credit card fraud)
• Existing anti-spamming ware
– Large number of software copies and signatures to maintain
– Comprehensive detection rules, but slow to respond
• Signatures management a major bottleneck
– Acquisition and the deployment of signatures to numerous machines
– A small variation in the known signatures can easily defeat a signature based filter
– Spammers can test their designs with anti-spamming ware before starting the (hit and run) campaign
Spamming Behavior at a Glance
• Spammers do not have full freedom in launching spamming.
– Follow the transport protocols to deliver messages
– Messages must be perceivable and appealing to human users
– Expensive to compose and personalize spamming messages:
• interactive (click my URL links) or passive
• Low yield combined with greed lead to high spamming volumes
• Cheap to launch spamming: millions of zombie machines each send a few copies
– Any “hit back, interactive” method could cause severe harm to the innocents
• Summary
– Very difficult for spammers to achieve financial goals without leaving noticeable signatures, i.e. feature instances
– A challenge is how to keep up with their speed, volume, and diversity
Our Approach
• Lossy detection:
– Focused mainly on the major offenders
– Avoid false positive
• Timely acquisition of the spamming signatures:
– Features and their instances
– Position the detector at the Network Access Points (NAP)
• Regular emails are expected to have white noise like distributions of strings that happen to fall into the spamming feature space
– Mediated delivery of bulk, legitimate email
• The content of a spamming campaign is divided into invariants and variants
– An invariant that also appears in regular emails cannot be used for filtering
– For the first cut effort: URL (over 95% spamming have them)
Competitive Aging-Scoring Scheme (CASS)
• A spamming invariant (string) is called its feature instance (FI). The essence of our technique:
– Extract FIs of emails and keep track of their occurrences. If exceeding a threshold: an UNBE stream
• In a naïve approach, it takes O(1) to update the score of an FI, but O(N) to update ages of all other FIs
– A major computing cost
• CASS:
– The time-to-live of an FI is reset each time when its score is increased by one (when a new copy arrives)
– The time-to-live of all other FIs is reduced by one
– New complexity: O(1) for both scoring and aging
– Exceeding a threshold: black; move it to the blacklist
– No further copies in a time period: white; discard the feature instance
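
One way to get constant-time aging is sketched below in Python. The slides do not give the implementation, so this is an assumption: a virtual clock advances by one per arriving FI, each entry stores the clock value of its last arrival instead of a countdown, and entries are kept in arrival order so that only the oldest needs to be checked for expiry. The class name CassScoreboard and its parameters are hypothetical.

```python
from collections import OrderedDict


class CassScoreboard:
    """Sketch of competitive aging-scoring with amortized O(1) work per arrival."""

    def __init__(self, score_threshold, ttl):
        self.score_threshold = score_threshold
        self.ttl = ttl                 # arrivals an FI survives without a new copy
        self.clock = 0                 # virtual clock: one tick per arriving FI
        self.entries = OrderedDict()   # FI -> (score, last_seen), oldest first
        self.blacklist = set()

    def observe(self, fi):
        self.clock += 1
        # Aging: every other FI implicitly loses one unit of time-to-live,
        # because its last_seen falls one tick further behind the clock.
        while self.entries:
            oldest, (_, last_seen) = next(iter(self.entries.items()))
            if self.clock - last_seen <= self.ttl:
                break
            del self.entries[oldest]   # white: no further copies in time
        if fi in self.blacklist:
            return "black"
        # Scoring: bump the score and reset the time-to-live of this FI.
        score, _ = self.entries.pop(fi, (0, 0))
        score += 1
        if score > self.score_threshold:
            self.blacklist.add(fi)     # black: move it to the blacklist
            return "black"
        self.entries[fi] = (score, self.clock)  # re-inserted as the newest entry
        return "unknown"


if __name__ == "__main__":
    board = CassScoreboard(score_threshold=3, ttl=5)
    for _ in range(4):
        print(board.observe("http://bulk.example.net/"))
```

Each FI is inserted and deleted at most once per arrival, so the expiry loop costs amortized constant time, in contrast to the naïve O(N) sweep over all ages.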
PEC Architecture
[Figure: PEC architecture. Labels: email flow (Sendmail); feature instance extraction; 32-bit hash vs. string; hash table of known strings (Berkeley DB); new string; string identified; birth & death of strings; aging and scoring of unknown strings.]
Data Structure of Scoreboard
[Figure: scoreboard data structure. Scoreboard Hit (SH) Table with entries for feature instances, checked against "Exceeds anomaly threshold (ATF)?"; Scoreboard Miss (SM) Table with entries for feature instances, checked against "Exceeds miss threshold (MTF)".]
An execution snapshot of scoreboard
[Figure: scoreboard execution snapshot with ATF = 10, MTF = 20, queue size = 20. Labels: current feature being processed, HashURL : (414738(20-bit)+3724(12-bit)); next feature instance, HashURL : (124489(20-bit)+176(12-bit)); active features arranged in their ages (mod N) in the MOD queue, on a time axis from oldest to newest; placement history; the current time location; entry moved to blacklist; the entry [862 1822] is purged.]
Testbed Environment
Three modules included:
1. Email generation
2. PEC (blacklist and scoreboard)
3. Control and visualization console
Experimental Configuration
• Email generator: Intel P4-3.0, Windows XP
• Email server: Xeon 3.0GHz, two single core CPUs, Linux, Sendmail 8.14.1
• Within a bin, the sender sends 2000 copies of emails (mixed with bulks and regulars).
– The distribution of bulks and regulars is uniform.
– Default score threshold: 50
– Miss table length: 2048
– The average mail size: 1.5K bytes
– Email generator sends one mail per 0.088 seconds on average.
Workflow of Email Generation
[Figure: email generation workflow. Windows control console: simulation parameters; density generation (uniform dist.) producing a bulk/regular sequence (U R U U ... R); feature dictionary (bulk: URL, image src; regular); MIME structures; random text message composer; spamming keyword selection; subject generation; "From" generation. Emails (bulk/regular) are sent as MIME over the SMTP protocol to the Linux email server (Sendmail).]
Email Generation
• Generate bulk/regular mixed email copies by injecting different features, such as URL links or image sources
– Can adjust density or interval time between bulk copies, placement of variants and invariants.
• According to the parsed parameters, message composer picks the materials to generate MIME messages (bulks or regulars).
– Extracted from 2005 TREC Public Spam Corpus, http://plg.uwaterloo.ca/~gvcormac/treccorpus/about.html
– Random text: from Internet
– Keywords: user defined.
• The message composer calls an SMTP module to send the generated emails.
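
A rough Python sketch of such a generator follows. It only composes one bin of messages and does not send anything; the word list, addresses, subject formats, and the names compose and generate_bin are placeholders invented for this illustration, not the materials or modules of the actual testbed. Bulk copies share one invariant URL and are placed at uniformly chosen positions within the bin.

```python
import random
from email.message import EmailMessage

# Placeholder filler vocabulary standing in for the random text source.
WORDS = ["meeting", "report", "lunch", "update", "draft", "budget", "notes", "plan"]


def compose(body, subject, sender="generator@example.org", rcpt="inbox@example.org"):
    """Build a simple text message from the given parts."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = rcpt
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def generate_bin(bin_size, bulk_count, bulk_url, seed=0):
    """Return [(kind, message)] with bulk copies at uniformly chosen positions."""
    rng = random.Random(seed)
    bulk_slots = set(rng.sample(range(bin_size), bulk_count))
    messages = []
    for i in range(bin_size):
        filler = " ".join(rng.choice(WORDS) for _ in range(8))  # the variant part
        if i in bulk_slots:
            body = filler + "\nVisit " + bulk_url + " today\n"   # the invariant part
            messages.append(("bulk", compose(body, "offer %d" % rng.randint(0, 9999))))
        else:
            messages.append(("regular", compose(filler + "\n", "note %d" % i)))
    return messages


if __name__ == "__main__":
    batch = generate_bin(bin_size=2000, bulk_count=100, bulk_url="http://bulk.example.net/promo")
    print(sum(1 for kind, _ in batch if kind == "bulk"), "bulk copies in the bin")
```

In a real run, each message would then be handed to an SMTP client (for example `smtplib.SMTP.send_message`) at the configured sending interval.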
Detection Latency of Single UNBE source
• Fix threshold and age table length under different densities.
• Test six different UNBE densities (50, 100, 150, 200 …, 300 UNBE messages/bin)
[Figure: detection latency (axis range 0 to 2500) vs. number of messages in a bin (50 to 300); series: Experimental Value, Expected Value.]
Interactive Effects Under Multiple UNBE Sources
• Observe the change of the detection latency of UNBE A in the tests.
• Given an UNBE source A, six tests were made where one additional UNBE source is added to the experiment at a time.
• The density of A is fixed at 100 instances per bin, and the density of every remaining UNBE source is increased from 50 to 300 instances/bin.
• Line Test2: Detection latency of UNBE A when adding 2 additional UNBE sources.
• Conclusion: The more UNBE sources there are, the lower the detection latency of an UNBE.
[Figure: detection latency (axis range 0 to 2500) vs. number of messages in a bin for each non-A UNBE (50 to 300); series: test 1 to test 6, other sources.]
Throughput of Feature Parser
[Figure: throughput (1000 bodies/sec, axis range 0 to 30) vs. size of mail body (1.5K to 7.5K bytes).]
The average email size is from 1.5 KB to 7.5 KB, and each email has 2 URLs.
Throughput of Scoreboard and Blacklist
• Scoreboard: 1.2M transactions
• Blacklist: 0.9M (avg. 30 B) URLs, without including database access
[Figure: throughput (K URLs/sec, axis range 0 to 1000) vs. URL length (30 to 150 bytes).]
Pointer Table
• During the detection time window, only a limited number of hashed values need to be tracked
• Full table for 32-bit hash system takes too much space
• Higher order bits are used as the index, and the remaining bits are maintained by a linked list (for each entry)
• If the pointer table uses 20 bits for indexing, it has 1M entries; with an age table length of 20K~70K, the maximum depth of a linked list pointed to by the pointer table is 2.
• Very effective in reducing the actual space requirements, at minor cost of more search cycles
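
The index/remainder split can be sketched in Python as follows. This is an illustration under assumptions, not the PEC code: CRC-32 stands in for the unspecified 32-bit hash, a Python list per slot stands in for the linked list, and the names PointerTable and hash32 are hypothetical.

```python
import zlib


class PointerTable:
    """High-order hash bits select a slot; low-order bits are kept in a short chain."""

    def __init__(self, index_bits=20, hash_bits=32):
        self.tag_bits = hash_bits - index_bits
        self.tag_mask = (1 << self.tag_bits) - 1
        self.slots = [None] * (1 << index_bits)  # None marks an empty slot

    def split(self, hashed):
        """Split a hash into (slot index, remaining low-order bits)."""
        return hashed >> self.tag_bits, hashed & self.tag_mask

    def put(self, hashed, value):
        index, tag = self.split(hashed)
        chain = self.slots[index]
        if chain is None:
            self.slots[index] = [[tag, value]]
            return
        for entry in chain:
            if entry[0] == tag:
                entry[1] = value      # same hash seen again: overwrite
                return
        chain.append([tag, value])    # different hash sharing the slot

    def get(self, hashed, default=None):
        index, tag = self.split(hashed)
        for entry_tag, value in self.slots[index] or ():
            if entry_tag == tag:
                return value
        return default

    def remove(self, hashed):
        index, tag = self.split(hashed)
        chain = self.slots[index]
        if chain is None:
            return
        chain[:] = [entry for entry in chain if entry[0] != tag]
        if not chain:
            self.slots[index] = None  # free the slot once its chain is empty


def hash32(feature_instance):
    """32-bit hash of a feature instance (CRC-32 chosen only for the sketch)."""
    return zlib.crc32(feature_instance.encode("utf-8")) & 0xFFFFFFFF


if __name__ == "__main__":
    table = PointerTable()
    print(table.split((414738 << 12) | 3724))
    table.put(hash32("www.cs.tamu.edu"), 1)
    print(table.get(hash32("www.cs.tamu.edu")))
```

Only the slots that currently hold tracked hash values carry a chain, which is where the space saving over a full 32-bit table comes from; the price is a short scan of the chain on each lookup.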
Current Work
• The first generation PEC demonstrates the feasibility of high speed UNBE filtering
– Not meant to replace existing solutions, but to defeat major offenders (80-20 rule)
• Next Step
– Packet level filtering
– Handle multiple features (bad words, dirty subnets, black lists, etc.)
– Integration with existing tools
Screen Shot (4): Aging out an Orphaned Packet
Screenshot (7): Parsing
An email message has 3 packets. Parser 1 uses DFA 0 to extract a URL link, and uses DFA 1 to extract a domain name in this email message.
System Performance Parameters
Thank You!