andrew g. west and insup lee ceas `11 – september 1, 2011 towards the effective spatio- temporal...

Andrew G. West and Insup LeeCEAS `11 – September 1, 2011

Towards the Effective Spatio-Temporal Mining of Spam Blacklists

Big Idea / Outline

BIG IDEA: Identify IP addresses that have temporally correlated spam behavior; harness this info. predictively

•Related work; motivations•Blacklists as ground truth

• Data collection• Measurement study

•Temporal association mining• Technique• Parameterization• Negative results; discussion

Usage Example

Blacklist history (time)

20 min. 20 min. tnow

What to do “now”?•Assume IPy will be blacklisted

•Start blocking; decrease listing latency

Motivations

AS-REP

BLK-REP

IP-REP

SpatialFunctions

Plot into3-D SpaceClassify (SVM)

SPAM or HAM

REPALG

History

Recent research leveraging group behaviors [1—5]:•Overcome “cold-start”•Grouping functions: subnets, rDNS hosts, AS, etc.

Botnets a driving force•Non-contiguous in IP space•“Campaigns” should give rise to temporal correlations•Can we calculate grouping function; use for reputation?

Related Work

“How to determine botnet membership?”• Parsing P2P communication graphs

– Issues: Unproven, reqs. expansive view (BotGrep [6])– Blacklists have inherent global view

• Similarity algs. over email bodies/URLs– Issues: Privacy, complexity (Botnet Judo [7])– Mining uses only IP addresses in computation

• Law enforcement infiltrations– Data only useful in ex post facto fashion

BLACKLIST MEASUREMENT STUDY

Blacklists

• Why blacklists?– Global compilation; aggregate; low false-positives– We have tons of data

• Spamhaus blacklists [8]1.PBL (Policy Block List) – Dynamic IP ranges2.SBL (Spamhaus Block List) – Static ranges

belonging to spam gangs3.XBL (Exploits Block List) – IPs spamming due to

malware, Trojans (i.e., botnet nodes)

Blacklist Ops

IPx Blacklist history (time)

listednot- listedlisted

de-listing listing duration (d)listing re-listing

Blacklist Size

• Why?: Desirable to show that blacklists are a reasonable proxy for the spam problem

#1 #2• #1: Spike typical of

holiday seasons

• #2: Shutdowns of Spamit.com affiliate and Rustock

• Small spikes: Evidence of campaigns

Listing Duration (d)

• Why?: Re-listings (basis for patterns/ correlations) limited by de-listing speed

• Almost universal d=7.5 days

• Speaks to static TTL delisting policy

• Must only correlate listings, not overlapping durations

DHCP Issues

PBLXBL

“possiblydynamic”11%

79%10%

“knowndynamic”

≈18.4% of all IP space is on the PBL

• Why?: Dynamic IPs may not be able to accumulate enough history for mining, or produce stale predictions

• A large percentage (80%+) of IPs are dynamic

• More important, is how dynamic they are [9]

• This fact supports narrow learning windows

Relisting Quantity

• Why?: Central issue: do some IPs have histories extensive enough to be mined?

#1 #2 • #1: 50% of IPs have only 1 listing. Discard. Trim problem space.

• #2: 20% of IPs have 5+ listings, yet these account for 66% of all listings (non-trivial).

Relisting Rates

• Why?: Dynamism supports tight learning, thus we want all re-listings well clustered temporally.

• Media re-listing time is 18 days

• Far from a uniform distribution

• Also speaks to infection lifetimes

TEMPORAL ASSOCIATION MINING

Association Rules

• Developed for “market basket” data– “Beer and diapers” example– Apriori and FP-Growth algs.

• Example rule– {DIAPERS} →{BEER}– Interest measures [10]:

– lift(DIAPERS → BEER) = (3/5) / (4/5) * (4/5) = 0.94– Ratio of actual support, to expected rand. support

ID BEER MILK DIAPERS

1 Y N Y

2 N Y N

3 Y Y Y

4 Y N Y

5 N N Y

Correlations

• Previous: discrete, unordered, and transactional data• Spam data defies these– Continuously distributed– Bi-directionally ordered

• Define “correlation radius” (r) to make binary associations

• Symmetric but non-associative

• Radius enables probabilistic lift and support equivalents

Best Pairs

For every IP address, produce a finite “best pairs” list for persistent storage, where ordering determined by “lift”

Implementation

• 232 × 232 = Scalability issues• Prune search space with “minimum support”

– M=3 produces a 54.3 trillion entry matrix

– But 98% sparse

• Multi-threaded runtime= 3 days; we used EC2

Free Variables

• Correlation radius (r)– Try to capture campaigns with minimal noise– r = 2 hours (4 hour diameter)

• Training window length (length(h’))– Narrow: Infection lifetimes [11], DHCP addresses– Broad: Need for re-listings, bot-to-campaign map– length(h’) = 3 months

• Minimum support (m)– Derived based on scalability needs (m=3)

RESULTS AND DISCUSSION

1. “Best pairs” significance2. Botnet membership capture

3. Blacklist prediction

Rule Significance

• Intuition: Lift matrix should have values higher than random chance would suggest

• #1: Flip expected; About 0.6% of all pairs correlate more than random

• #2: Even at lift=120, 36% chance the correlation is rand. AGGREGATE

Botnet Membership

• Intuition: Given a set of botnet IPs, shared member/ pair lifts should exceed member/non-member pairs

• Actual dumps: Kraken + Cutwail

• 70-80% of IPs are XBL listed, 40% at min. support

• 6.0% of shared have non-zero lift, compared to 2.8%

Blacklist Prediction (1)

Blacklist Prediction (2)• Prediction criteria

– No ballot stuffing; can’t re-guess

• Experiment with different thresholds• Same story: Outperforming random, but too

minimal to be of any consequence

Discussion (1)

• Scalability issues × minor performance increments don’t warrant production

• Focus on acute areas of improvement:

1. DHCP research– 90%+ of IPs at min. support are dynamic, how?– Need reliable IP classification; churn rates

2. Refining windows/correlations– Non-binary correlations. Gaussian weights.– Time-decay of events in training windows

Discussion (2)3. Appropriateness of blacklist data

– Desirable conciseness (500 million listings = 12GB)– Blacklists inherently latent. Their aggregate, opaque,

and binary triggers may blur campaign-level data. – Install on an email server? Collect other metadata

• Takeaway; Utility in negative result– Measurement study builds on prior research– Our model serves as foundation for future efforts– Lessons learned about botnet dynamics– Identified poorly understood dynamism areas

References

[1] A. Ramachandran and N. Feamster. Understanding the network-level behavior of spammers. In SIGCOMM, 2006.

[2] F. Li and M.-H. Hsieh. An empirical study of clustering behavior of spammers and group-based anti-spam strategies. In CEAS, 2006.

[3] S. Hao, et al. Detecting spammers with SNARE: Spatio- temporal network-level automated reputation engine. In USENIX Security, 2009.

[4] Z. Qian, et al. On network-level clusters for spam detection. In NDSS, 2010[5] A. G. West, et al. Spam mitigation using spatio-temporal reputations from

blacklist history. In ACSAC, 2010.[6] S. Nagaraja, et al. BotGrep: Finding P2P bots with structured graph

analysis. In USENIX Security, 2010.[7] A. Pitsillidis, et al. Botnet judo: Fighting spam with itself. In NDSS, 2010.[8] Spamhaus Project. http://www.spamhaus.org/[9] Y. Xie, et al. How dynamic are IP addresses? In SIGCOMM, 2007.[10] L. Geng and H. J. Hamilton. Interestingness measures for data mining: A

survey. ACM Comp. Surveys, 38(9), 2006.[11] J. E. Dunn. Botnet PCs stay infected for years. Tech World, 2009.

Backup Slides (1)

Backup Slides (2)

Above: Lift distributions as a consequence of altering minimum support.

Above: Lift distributions as a consequence of altering correlation radius and minimum support

Backup Slides (3)

andrew g. west and insup lee ceas `11 – september 1, 2011 towards the effective spatio- temporal...

block ip

ip addresses

mail ip

ip space campaigns

decrease listing latency

dynamic ips

ip x blacklist history

infection lifetimes

Documents

ceas atlantic model a6a

towards measuring the effectiveness of telephony blacklists...

bank instruments, embargoes, and blacklistsand blacklists or...

understanding blacklists

ceas e-news

mise en page 1 - ceas...ceas trustee: andrey shustov –...

how keyword blacklists are killing reach and …

andrew g. west, jian chang, krishna venkatasubramanian, oleg...

beyond blacklists: learning to detect malicious web sites...

ceas library

ceas - issue 2 - june 2012

occupy ceas

ceas symposium program

reputation digital vaccine: reinventing internet blacklists

ceas - issue 1 - march 2011

beyond blacklists: learning to detect malicious web sites...

labor / gilded age. identification (5 points) 1.blacklists

europe - svfw.ch · presentation of ceas aerospace europe...

webinar: the ultimate guide to blacklists

ceas annual report 2015