Botnet and Spam Detection in High-Speed Networks
Wenke Lee and Nick Feamster, Georgia Tech
Overview
• Problem: Botnet and Spam Detection in high-speed networks
• Common theme: Examine network-level properties and build classifier
• Two systems: BotMiner and SNARE
– Overview
– Integration with SMITE architecture
• Current integration status and plan
BotMiner: Structure and Protocol Independent
• Botnets can change their C&C content (encryption, etc.), protocols (IRC, HTTP, etc.), structures (P2P, etc.), C&C servers, infection models …
[Figure: two botnet structures. (a) bots around a C&C node; (b) bots with no C&C node shown.]
Definition of a Botnet
• “A coordinated group of malware instances that are controlled by a botmaster via some C&C channel”
– Hosts that have similar C&C-like traffic and similar malicious activities
• We need to monitor two planes
– C-plane (C&C communication plane): “who is talking to whom”
– A-plane (malicious activity plane): “who is doing what”
BotMiner Architecture
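[Figure: BotMiner architecture. Network traffic feeds an A-plane monitor (scan, spam, binary downloading, exploit, ...) that produces an activity log, and a C-plane monitor that produces a flow log. A-plane clustering and C-plane clustering feed cross-plane correlation, which produces reports. The components are grouped into sensors, algorithms, and correlation.]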
BotMiner C-plane Clustering
• What characterizes a communication flow (C-flow) between a local host and a remote service?
– <protocol, srcIP, dstIP, dstPort>
– Temporal statistical distribution information, e.g., BPS (bytes per second), FPH (flows per hour)
– Spatial statistical distribution information, e.g., BPP (bytes per packet), PPF (packets per flow)
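The C-flow grouping above can be sketched in a few lines of Python. This is a minimal illustration, not BotMiner's implementation: the Flow record, its field names, and the cflow_features function are hypothetical, and the sketch reduces each feature to a single average, whereas the slide refers to full statistical distributions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Flow:
    # Hypothetical flow record; field names are assumptions for this sketch.
    protocol: str
    src_ip: str
    dst_ip: str
    dst_port: int
    duration: float   # seconds
    n_packets: int
    n_bytes: int


def cflow_features(flows, window_hours):
    """Group flows by <protocol, srcIP, dstIP, dstPort> and summarize each C-flow."""
    groups = defaultdict(list)
    for f in flows:
        groups[(f.protocol, f.src_ip, f.dst_ip, f.dst_port)].append(f)

    features = {}
    for key, fs in groups.items():
        total_bytes = sum(f.n_bytes for f in fs)
        total_packets = sum(f.n_packets for f in fs)
        total_seconds = sum(f.duration for f in fs)
        features[key] = {
            "FPH": len(fs) / window_hours,                              # flows per hour
            "PPF": total_packets / len(fs),                             # packets per flow
            "BPP": total_bytes / total_packets if total_packets else 0.0,  # bytes per packet
            "BPS": total_bytes / total_seconds if total_seconds else 0.0,  # bytes per second
        }
    return features


example_flows = [
    Flow("tcp", "10.0.0.1", "198.51.100.7", 6667, 5.0, 10, 1000),
    Flow("tcp", "10.0.0.1", "198.51.100.7", 6667, 15.0, 30, 3000),
]
example_features = cflow_features(example_flows, window_hours=2)
```

Keeping the source port out of the key means repeated connections from one host to the same remote service collapse into one C-flow, which is what makes per-hour flow counts meaningful.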
A-plane Clustering
• Capture “similar activity patterns”
Cross-plane Correlation
• Botnet score s(h) for every host h
– A host has a higher score if it is in more activity clusters and in both activity and communication clusters
– A host with a high score is a bot
• Similarity score between bot hosts hi and hj
– Two hosts in the same A-clusters and in at least one common C-cluster are clustered together
– Each cluster is a botnet
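One way to obtain a score with the two properties listed above is sketched here. The slides do not give the formula, so the Jaccard-overlap terms, the unit weights, and every name in the code are assumptions chosen for illustration: the score grows when a host appears in several A-clusters of different activity types, and again when it shares a C-cluster with an A-cluster.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two host sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def botnet_score(host, a_clusters, c_clusters):
    """Illustrative score s(h).

    a_clusters: list of (activity_type, set_of_hosts)
    c_clusters: list of set_of_hosts
    """
    mine = [(t, hosts) for t, hosts in a_clusters if host in hosts]
    score = 0.0
    # Term 1: membership in A-clusters of different activity types.
    for (t1, a1), (t2, a2) in combinations(mine, 2):
        if t1 != t2:
            score += jaccard(a1, a2)
    # Term 2: membership in both an A-cluster and a C-cluster.
    for _, a in mine:
        for c in c_clusters:
            if host in c:
                score += jaccard(a, c)
    return score


a_clusters = [("scan", {"h1", "h2", "h3"}), ("spam", {"h1", "h2"})]
c_clusters = [{"h1", "h2"}, {"h4", "h5"}]
scores = {h: botnet_score(h, a_clusters, c_clusters) for h in ["h1", "h3", "h4"]}
```

In this toy data h1 scans, spams, and shares a communication cluster with h2, so it scores highest; h3 only scans and h4 only communicates, so both score zero.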
SMITE Integration: BotMiner
Integrating BotMiner and SMITE
• Sensors
– Feature extraction for C-plane and A-plane clustering
– C-flow temporal and statistical features
  • Counting packets and connections between each pair of endpoints: bytes per second, flows per hour, bytes per packet, packets per flow
– A-plane header and payload features
  • Destination IP addresses and ports, payload bytes/strings
– These sensors are not specific to BotMiner
Integrating BotMiner and SMITE
• Algorithms
– C-plane clustering
  • Multi-step clustering based on statistical and temporal C-flow features
– A-plane clustering
  • Based on activity-specific similarity measures, e.g., spread of destination IP addresses and ports, Dice’s coefficient of string similarity, and byte frequency or entropy of payload
– Bot scoring and botnet clustering methods
  • Scoring based on participation in C-plane and A-plane clusters
  • Clustering based on common memberships in the C-plane and A-plane clusters
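Two of the A-plane similarity measures named above have standard definitions that are easy to show. The sketch below computes Dice's coefficient over character bigrams and the Shannon entropy of payload bytes; the function names are illustrative, and how BotMiner applies these measures to actual payloads is not specified in the slides.

```python
import math
from collections import Counter


def bigrams(s):
    """Multiset of adjacent character pairs in s."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice_coefficient(a, b):
    """Dice's coefficient on character bigrams: 2|X ∩ Y| / (|X| + |Y|)."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Strings too short to have bigrams: fall back to exact equality.
        return 1.0 if a == b else 0.0
    overlap = sum((x & y).values())
    return 2 * overlap / total


def byte_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0 to 8)."""
    if not payload:
        return 0.0
    n = len(payload)
    counts = Counter(payload)
    return -sum(c / n * math.log2(c / n) for c in counts.values())
```

For example, dice_coefficient("night", "nacht") shares one bigram ("ht") out of eight and evaluates to 0.25. Entropy near 8 bits per byte suggests compressed or encrypted payload, while plain-text commands score much lower.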
Integrating BotMiner and SMITE
• Correlation
– Botnet detection involves both vertical and horizontal analysis/clustering:
  • Vertical: what activities a host has been involved in
    – Bot detection
  • Horizontal: what other hosts have similar (vertical) behavior patterns
    – Botnet detection
– Similar analysis can be applied to other alerts
  • Improve botnet detection
  • Understand malicious activities and plans of attacks
  • Measure the scale of attacks
Network-Based Spam Detection
• Filter email based on how it is sent, in addition to simply what is sent.
• Network-level properties are less malleable
– Hosting or upstream ISP (AS number)
– Membership in a botnet (spammer, hosting infrastructure)
– Network location of sender and receiver
– Set of target recipients
Finding the Right Features
• Goal: Sender reputation from a single packet header?
– Low overhead
– Fast classification
– In-network
– Perhaps more evasion resistant
• Key challenge
– What features satisfy these properties and can distinguish spammers from legitimate senders?
Network-Level Features
• Single-Packet
– AS of sender’s IP
– Distance to k nearest senders
– Status of email service ports
– Geodesic distance
– Time of day
• Single-Message
– Number of recipients
– Length of message
• Aggregate (Multiple Message/Recipient)
Sender-Receiver Geodesic Distance
90% of legitimate messages travel 2,200 miles or less
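The geodesic-distance feature can be computed with the haversine formula once sender and receiver coordinates are known. The sketch assumes those coordinates come from some IP geolocation source, which the slides do not name; the function name and the mean Earth radius constant are choices made here.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def geodesic_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))
```

The resulting distance is one input among several; the 2,200-mile figure above describes the observed distribution for legitimate mail rather than a hard cutoff.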
Density of Senders in IP Space
For spammers, k nearest senders are much closer in IP space
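A simple way to measure sender density is to treat each IPv4 address as a 32-bit integer and average the numeric distance to the k closest other senders. The function below is a sketch under that assumption; the name and the exact aggregation (a plain mean) are not taken from the slides.

```python
import ipaddress


def avg_distance_to_k_nearest(sender, other_senders, k):
    """Mean numeric distance in IPv4 space from sender to its k nearest other senders."""
    s = int(ipaddress.IPv4Address(sender))
    distances = sorted(
        abs(s - int(ipaddress.IPv4Address(o)))
        for o in other_senders
        if o != sender
    )
    nearest = distances[:k]
    # With no other senders seen, report an infinite distance.
    return sum(nearest) / len(nearest) if nearest else float("inf")


seen = ["10.0.0.1", "10.0.0.9", "10.0.1.5", "192.168.0.1"]
density_k2 = avg_distance_to_k_nearest("10.0.0.5", seen, k=2)
```

A small value means the sender sits in a block of addresses that are also sending mail, the pattern the slide associates with spammers.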
Local Time of Day at Sender
Spammers “peak” at different local times of day
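Local time of day has to be derived from the receive timestamp and the sender's time zone. The sketch assumes a UTC offset per sender is already available (for instance from geolocation, which is an assumption here) and builds an hourly histogram; both function names are illustrative.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def sender_local_hour(utc_timestamp, utc_offset_hours):
    """Hour of day (0-23) at the sender for a Unix timestamp."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(utc_timestamp, tz).hour


def hourly_histogram(messages):
    """messages: iterable of (utc_timestamp, utc_offset_hours) pairs."""
    return Counter(sender_local_hour(ts, off) for ts, off in messages)
```

Comparing such histograms for spam and legitimate mail is what reveals the different peaks.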
Other Network-Level Features
• Time-of-day at sender
• Upstream AS of sender
• Message size (and variance)
• Number of recipients (and variance)
Combining Features: RuleFit
• Put features into the RuleFit classifier
• 10-fold cross validation on one day of query logs from a large spam filtering appliance provider
• Comparable performance to Spamhaus
– Incorporating into the system can further reduce FPs
• Using only network-level features
• Completely automated
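The evaluation protocol, 10-fold cross validation, can be sketched independently of the classifier. The generator below only produces the train/test index splits; RuleFit itself is not shown, and the function name, shuffling, and seed are assumptions of this sketch.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

Each sample is held out exactly once, so the reported error rates are averaged over predictions made on data the model did not see during training.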
Benefits of Whitelisting
Whitelisting top 50 ASes: false positives reduced to 0.14%
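The effect of an AS whitelist on the false positive rate can be illustrated with a small calculation. The slide does not say how the top 50 ASes were ranked, so ranking by legitimate mail volume is an assumption here, and the record layout, function names, and AS numbers (from the documentation range) are hypothetical.

```python
from collections import Counter


def false_positive_rate(messages, whitelist=frozenset()):
    """messages: iterable of (sender_as, is_spam, flagged_by_classifier).

    A false positive is a legitimate message that is flagged and whose
    sender AS is not whitelisted.
    """
    legit = [(asn, flagged) for asn, is_spam, flagged in messages if not is_spam]
    if not legit:
        return 0.0
    fps = sum(1 for asn, flagged in legit if flagged and asn not in whitelist)
    return fps / len(legit)


def top_legit_ases(messages, n):
    """The n ASes sending the most legitimate mail (ranking is an assumption)."""
    counts = Counter(asn for asn, is_spam, _ in messages if not is_spam)
    return frozenset(asn for asn, _ in counts.most_common(n))


sample = [
    (64500, False, True), (64500, False, False), (64500, False, False),
    (64501, False, True), (64501, False, False), (64502, True, True),
]
before = false_positive_rate(sample)
after = false_positive_rate(sample, top_legit_ases(sample, 1))
```

On this toy data the rate drops from 0.4 to 0.2 after whitelisting the single busiest legitimate AS; the numbers are synthetic and say nothing about the 0.14% result.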
Integrating SNARE and SMITE
[Figure: SNARE integration diagram with components labeled Sensors and Algorithms/Correlation]
Integration with SMITE
• Sensors
– Extract network features from traffic
– IP addresses
– Combine with auxiliary data (routing, time, etc.)
• Algorithms
– Clustering algorithm to identify behavioral fingerprints
– Learning algorithm to classify based on multiple features
• Correlation
– Clusters formed by aggregating sending behavior observed across multiple sensors
– Various features also require input from data collected across collections of IP addresses
SMITE Integration Challenges
• Sources of labeled data
– SNARE requires clean sources of labeled data for training
• Data collection
– SNARE’s performance improves when behavior can be observed across multiple domains
Overall SMITE Integration
SMITE Integration: Current Work
• Study pipeline architecture and code
• Modify flow-analyzer to dump 5-tuple flow information
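As an illustration of what 5-tuple flow information looks like, the sketch below aggregates packets by source IP, destination IP, source port, destination port, and protocol, then renders one comma-separated line per flow. It is not the SMITE flow-analyzer or its output format; the FiveTuple type, the function names, and the line layout are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str


def aggregate_flows(packets):
    """packets: iterable of (FiveTuple, timestamp, size_in_bytes)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0, "first": None, "last": None})
    for key, ts, size in packets:
        rec = flows[key]
        rec["packets"] += 1
        rec["bytes"] += size
        rec["first"] = ts if rec["first"] is None else min(rec["first"], ts)
        rec["last"] = ts if rec["last"] is None else max(rec["last"], ts)
    return dict(flows)


def to_lines(flows):
    """One line per flow: the 5-tuple, then packets, bytes, first and last timestamps."""
    lines = []
    for key, rec in sorted(flows.items()):
        fields = (*key, rec["packets"], rec["bytes"], rec["first"], rec["last"])
        lines.append(",".join(str(x) for x in fields))
    return lines
```

Records of this shape carry enough information to derive the per-flow byte, packet, and timing statistics that both BotMiner and SNARE consume.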
SMITE Integration: Phase I
• Modify flow-analyzer with SMITE team to generate 5-tuple flow information (mid-March)
• Spam/scan detection, flow aggregation in BotMiner; Spam feature extraction in SNARE (end of March)
• Clustering and correlation in BotMiner; Classifier in SNARE (end of April)
SMITE Integration: Phase II
• Evaluate performance of BotMiner and SNARE
– How many hours to process one day of traffic, or what is the “lag” time between event and detection?
• Design real-time detection algorithms
– A two-tier system: off-line module outputs lists of suspicious hosts, and real-time module inspects all packets of these hosts; or, off-line module outputs clusters
• Design algorithms to handle asymmetric traffic
– Cluster on each direction of traffic and cross-correlate
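The two-tier idea in the plan above can be sketched as a small class: an off-line pass periodically replaces the list of suspicious hosts, and the real-time path does only a set lookup per packet to decide whether deeper inspection is needed. The class and method names are hypothetical, and the sketch covers only the host-list variant, not the cluster-output variant.

```python
class TwoTierDetector:
    """Minimal sketch of an off-line list feeding a real-time filter."""

    def __init__(self):
        self.suspicious = set()

    def update_from_offline(self, suspicious_hosts):
        # Called whenever the off-line module finishes a batch.
        self.suspicious = set(suspicious_hosts)

    def should_inspect(self, src_ip, dst_ip):
        # Real-time tier: inspect a packet only if either endpoint is suspicious.
        return src_ip in self.suspicious or dst_ip in self.suspicious


detector = TwoTierDetector()
detector.update_from_offline(["10.0.0.7"])
```

The constant-time lookup keeps the per-packet cost low, which is the point of splitting the work: expensive clustering runs off-line, and the line-rate path stays cheap.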
Thank You!