learning to detect malicious urls justin ma, lawrence saul, stefan savage, geoff voelker computer...
TRANSCRIPT
Learning to Detect Malicious URLs
Justin Ma, Lawrence Saul, Stefan Savage, Geoff VoelkerComputer Science & Engineering
UC San Diego
Presentation for GoogleMay 5, 2010
Malicious Web Sites
Spam-advertised goods
Trojan downloads
Phishing: which one is real?
3
Visiting Malicious Web Sites
Predict what is safe without
committing to risky actions
• Safe URL?
• Malicious download?
• Spam-advertised site?
• Phishing site?
URL = Uniform Resource Locator
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
http://fblight.com
http://mail.ru
http://www.cs.ucsd.edu/~jtma/
4
Problem in a Nutshell URL features to identify malicious Web sites
No context, no content
Different classes of URLs Benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious
facebook.com fblight.com
5
What we want...
6
How to build this service?http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
Blacklist
Hand-picked features (properties of URL)
Machine learning-based classifier
7
State of the Practice
Current approaches Blacklists [SORBS,
URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense]
Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09]
Limitations Cannot learn from newest examples quickly Cannot quickly adapt to newest features
Arms race: fast feedback cycle is critical
More automated approach?
8
Contributions
URL classification system Used by large Web mail providers
Practical large-scale machine learning in computer security
Large feature sets + online learning Related work: Whittaker, Ryner, Nazif, “Large-Scale
Automatic Classification of Phishing Pages” (NDSS'10)
9
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
10
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
11
Moving Beyond Blacklists
Preliminary study: Do our features work? Batch algorithms for now 104 examples and features
Outline System overview Features ← focus of this segment Experimental results
[Ma, Saul, Savage, Voelker (KDD 2009)]
12
URL Classification System
Label Example Hypothesis
13
Data Sets
Malicious URLs 5,000 from PhishTank (phishing) 15,000 from Spamscatter (spam, phishing, etc)
Benign URLs 15,000 from Yahoo Web directory 15,000 from DMOZ directory
Malicious x Benign → 4 Data Sets 30,000 – 55,000 features per data set
14
Algorithms
Logistic regression w/ L1-norm regularization
Other models Naive Bayes Support vector machines (linear, RBF kernels)
Implicit feature selection Easier to interpret
15
Features
Example
16
Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...
[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical
17
Features to consider?
1)Blacklists
2)Simple heuristics
3)Domain name registration
4)Host properties
5)Lexical
18
(1) Blacklist Queries List of known malicious sites Providers: SORBS, URIBL, SURBL, Spamhaus Not comprehensive
http://www.bfuduuioo1fp.mobiIn blacklist?
Yes
http://fblight.com
No
In blacklist?
http://www.bfuduuioo1fp.mobi
Blacklist queries as features
........................................
........................................
19
stopgap.cn registered 2 May 2010
(2) Manually-Selected Features
Considered by previous studies IP address in hostname? Number of dots in URL WHOIS (domain name) registration date
[Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]
http://72.23.5.122/www.bankofamerica.com/
http://www.bankofamerica.com.qytrpbcw.stopgap.cn/
20
(3) WHOIS Features
Domain name registration Date of registration, update, expiration Registrant: Who registered domain? Registrar: Who manages registration?
http://sleazysalmon.com
http://angryalbacore.com
http://mangymackerel.com
http://yammeringyellowtail.com
Registered on4 May 2010
By SpamMedia
21
(4) Host-Based Features
Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
WHOIS: registrar, registrant, dates
IP address: Which ASes/IP prefixes?
DNS: TTL? PTR record exists/resolves?
Geography-related: Locale? Connection speed?
75.102.60.0/2269.63.176.0/20
facebook.com fblight.com
Bad Parts of the Internet
22
(5) Lexical Features
Tokens in URL hostname + path Length of URL Number of dots
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
[Kolari et al., 2006]
23
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
4,000
# Features
13,000
4
7
17,000
More features → Better accuracy
Error rate (%)
24
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
Full 96—99% accuracy
4,000
# Features
13,000
4
7
17,000
30,000
Error rate (%)
25
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
Full
w/o WHOIS/Blacklist
4,000
# Features
13,000
4
7
17,000
30,000
26,000
Error rate (%)
Blacklists and WHOIS are not comprehensive
26
Beyond Blacklists
Blacklist
Full features
Yah
oo-P
his
hTan
k
Higher detection rate for given false positive rate
27
Summary
Detect malicious URLs with high accuracy Only using URL Diverse feature set helps: 99% w/ 30,000+ features Model analysis (more in KDD'09 paper)
What about scalable and adaptive malicious URL detection system?
28
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
29
URL Classification System
Label Example Hypothesis
30
Live URL Classification System
Label Example Hypothesis
31
Large-Scale Online Learning
How do we scale to live, large-scale data? Outline
Live training feed Challenges: scale and non-stationarity Need for large, fresh training sets Online learning
[Ma, Saul, Savage, Voelker (ICML 2009)]
32
Live Training Feed
Malicious URLs (spamming and phishing) 6,000—7,500 per day from Web mail provider
Benign URLs From Yahoo Web directory
Total of 20,000 URLs per day Collected Jan – May, 2009
120 days More than two million examples
33
Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...
[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical
60+ features 1.1 million 1.8 millionGROWING
Day 100
34
New Features Encountered All the Time
Many binary features Enumerating tokens, ISPs, registrants, etc... 2.9 million features after 100 days
35
Live URL Classficiation System
Online learning
36
Practical Challenges of ML in Systems
Industrial concerns Scale: millions of examples, features Non-stationarity: examples change over time (arms
race w/ criminals)
Pivotal decision: batch or online?
37
Batch vs. Online Learning
Batch/offline learning SVM, logistic regression,
decision trees, etc Multiple passes over
data No incremental updates Potentially high memory
and processing overhead
Online learning Perceptron-style
algorithms Single pass over data Incremental updates Low memory and
processing overhead
Online learning addresses scale and non-stationarity
38
Evaluations
Online learning for URL reputation Need for large, fresh training sets Comparing online algorithms Continuous retraining Growing feature vector
39
Need lots of fresh training data?SVM trained once
SVM retrained daily
Fresh data helps
40
Need lots of fresh training data?
Fresh data helps More data helps
SVM trained once on 2 weeks
SVM w/ 2-week sliding window
41
Which online algorithm?
Perceptron
Stochastic Gradient Descent for Logistic Regression
Confidence-Weighted Learning
42
Perceptron
Convergence result:
[Rosenblatt, 1958]
+−+ +
+
+
+−
−
−
−
−
Update on each mistake:
radius
margin
Number of mistakes ≤
43
Logistic Regression with SGD
Log likelihood:
where
[Bottou, 1998]
For every example:
Proportional
44
Confidence-Weighted Learning
Maintain Gaussian distribution over weight vector:
[Dredze et al., 2008] [Crammer et al., 2009]
Constrained problem:
Closed-form update:Update features at different rates
Diagonal covariance matrix
for evals
45
Which online algorithms?
Perceptron
46
Which online algorithms?
LR w/ SGD
Proportional update helps
Perceptron
47
Which online algorithms?
Proportional update helps Per-feature confidence really helps
Confidence-Weighted
LR w/ SGD
Perceptron
48
Batch...
Fresh data helps More data helps
Batch
49
Batch vs. Online
Confidence-Weighted
Batch
Fresh data helps More data helps Online matches batch
50
Why online does well?
SVM w/ 2-week sliding window
Confidence-Weighted
51
Why online does well?
Confidence-Weightedonce-a-day
More data eventually helps Continuous retraining helps
SVM w/ 2-week sliding window
Confidence-Weighted
52
Growing feature vector?
Confidence-Weightedfixed features
53
Growing feature vector?
Confidence-Weightedfixed features
Confidence-Weightedgrowing features
Growing feature vector helps
54
Fixed vs. Variable Features
Perceptronfixed features
Growing feature vector helps
55
Fixed vs. Variable Features
Perceptronfixed features
Growing feature vector helps Growing + CW really helps
Perceptrongrowing features
56
Proof-of-Concept Plugin
57
Examine URL details...
58
It sure looks like phishing...The Real Site
59
Blacklisted later on
60
Summary
Detecting malicious URLs Relevant real-world problem Successful application of online learning
What helps? More, fresher data Continuous retraining Growing feature vector
Confidence-Weighted vs. Batch As accurate More adaptive Fewer resources
61
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
62
Impact
Public data set (UCI ML repo) Industrial impact
Mail providers have adopted our approach for classifying URLs in email messages
Project Info + Data Sethttp://sysnet.ucsd.edu/projects/url/
63
Final Thoughts: Systems + ML
Systems: high-impact, large-scale applications ML: Methodical approaches Systems: Embrace real-world constraints ML: More than “plug-and-play” solutions
Systems ↔ Machine Learning