![Page 1: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/1.jpg)
Learning to Detect Malicious URLs
Justin Ma, Lawrence Saul, Stefan Savage, Geoff VoelkerComputer Science & Engineering
UC San Diego
Presentation for GoogleMay 5, 2010
![Page 2: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/2.jpg)
Malicious Web Sites
Spam-advertised goods
Trojan downloads
Phishing: which one is real?
![Page 3: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/3.jpg)
3
Visiting Malicious Web Sites
Predict what is safe without
committing to risky actions
• Safe URL?
• Malicious download?
• Spam-advertised site?
• Phishing site?
URL = Uniform Resource Locator
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
http://fblight.com
http://mail.ru
http://www.cs.ucsd.edu/~jtma/
![Page 4: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/4.jpg)
4
Problem in a Nutshell URL features to identify malicious Web sites
No context, no content
Different classes of URLs Benign, spam, phishing, exploits, scams... For now, distinguish benign vs. malicious
facebook.com fblight.com
![Page 5: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/5.jpg)
5
What we want...
![Page 6: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/6.jpg)
6
How to build this service?http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
Blacklist
Hand-picked features (properties of URL)
Machine learning-based classifier
![Page 7: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/7.jpg)
7
State of the Practice
Current approaches Blacklists [SORBS,
URIBL, SURBL, Spamhaus, SiteAdvisor, WOT, IronPort, WebSense]
Learning on hand-tuned features [Kan & Thi '05, Garera et al. '07, CANTINA, Guan et al. '09]
Limitations Cannot learn from newest examples quickly Cannot quickly adapt to newest features
Arms race: fast feedback cycle is critical
More automated approach?
![Page 8: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/8.jpg)
8
Contributions
URL classification system Used by large Web mail providers
Practical large-scale machine learning in computer security
Large feature sets + online learning Related work: Whittaker, Ryner, Nazif, “Large-Scale
Automatic Classification of Phishing Pages” (NDSS'10)
![Page 9: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/9.jpg)
9
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
![Page 10: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/10.jpg)
10
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
![Page 11: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/11.jpg)
11
Moving Beyond Blacklists
Preliminary study: Do our features work? Batch algorithms for now 104 examples and features
Outline System overview Features ← focus of this segment Experimental results
[Ma, Saul, Savage, Voelker (KDD 2009)]
![Page 12: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/12.jpg)
12
URL Classification System
Label Example Hypothesis
![Page 13: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/13.jpg)
13
Data Sets
Malicious URLs 5,000 from PhishTank (phishing) 15,000 from Spamscatter (spam, phishing, etc)
Benign URLs 15,000 from Yahoo Web directory 15,000 from DMOZ directory
Malicious x Benign → 4 Data Sets 30,000 – 55,000 features per data set
![Page 14: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/14.jpg)
14
Algorithms
Logistic regression w/ L1-norm regularization
Other models Naive Bayes Support vector machines (linear, RBF kernels)
Implicit feature selection Easier to interpret
![Page 15: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/15.jpg)
15
Features
Example
![Page 16: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/16.jpg)
16
Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...
[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical
![Page 17: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/17.jpg)
17
Features to consider?
1)Blacklists
2)Simple heuristics
3)Domain name registration
4)Host properties
5)Lexical
![Page 18: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/18.jpg)
18
(1) Blacklist Queries List of known malicious sites Providers: SORBS, URIBL, SURBL, Spamhaus Not comprehensive
http://www.bfuduuioo1fp.mobiIn blacklist?
Yes
http://fblight.com
No
In blacklist?
http://www.bfuduuioo1fp.mobi
Blacklist queries as features
........................................
........................................
![Page 19: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/19.jpg)
19
stopgap.cn registered 2 May 2010
(2) Manually-Selected Features
Considered by previous studies IP address in hostname? Number of dots in URL WHOIS (domain name) registration date
[Fette et al., 2007][Zhang et al., 2007][Bergholz et al., 2008]
http://72.23.5.122/www.bankofamerica.com/
http://www.bankofamerica.com.qytrpbcw.stopgap.cn/
![Page 20: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/20.jpg)
20
(3) WHOIS Features
Domain name registration Date of registration, update, expiration Registrant: Who registered domain? Registrar: Who manages registration?
http://sleazysalmon.com
http://angryalbacore.com
http://mangymackerel.com
http://yammeringyellowtail.com
Registered on4 May 2010
By SpamMedia
![Page 21: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/21.jpg)
21
(4) Host-Based Features
Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
WHOIS: registrar, registrant, dates
IP address: Which ASes/IP prefixes?
DNS: TTL? PTR record exists/resolves?
Geography-related: Locale? Connection speed?
75.102.60.0/2269.63.176.0/20
facebook.com fblight.com
Bad Parts of the Internet
![Page 22: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/22.jpg)
22
(5) Lexical Features
Tokens in URL hostname + path Length of URL Number of dots
http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
[Kolari et al., 2006]
![Page 23: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/23.jpg)
23
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
4,000
# Features
13,000
4
7
17,000
More features → Better accuracy
Error rate (%)
![Page 24: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/24.jpg)
24
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
Full 96—99% accuracy
4,000
# Features
13,000
4
7
17,000
30,000
Error rate (%)
![Page 25: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/25.jpg)
25
Which feature sets?
Blacklist
Manual
WHOIS
Host-based
Lexical
Full
w/o WHOIS/Blacklist
4,000
# Features
13,000
4
7
17,000
30,000
26,000
Error rate (%)
Blacklists and WHOIS are not comprehensive
![Page 26: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/26.jpg)
26
Beyond Blacklists
Blacklist
Full features
Yah
oo-P
his
hTan
k
Higher detection rate for given false positive rate
![Page 27: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/27.jpg)
27
Summary
Detect malicious URLs with high accuracy Only using URL Diverse feature set helps: 99% w/ 30,000+ features Model analysis (more in KDD'09 paper)
What about scalable and adaptive malicious URL detection system?
![Page 28: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/28.jpg)
28
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
![Page 29: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/29.jpg)
29
URL Classification System
Label Example Hypothesis
![Page 30: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/30.jpg)
30
Live URL Classification System
Label Example Hypothesis
![Page 31: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/31.jpg)
31
Large-Scale Online Learning
How do we scale to live, large-scale data? Outline
Live training feed Challenges: scale and non-stationarity Need for large, fresh training sets Online learning
[Ma, Saul, Savage, Voelker (ICML 2009)]
![Page 32: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/32.jpg)
32
Live Training Feed
Malicious URLs (spamming and phishing) 6,000—7,500 per day from Web mail provider
Benign URLs From Yahoo Web directory
Total of 20,000 URLs per day Collected Jan – May, 2009
120 days More than two million examples
![Page 33: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/33.jpg)
33
Feature Vector Constructionhttp://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
WHOIS registration: 3/25/2009Hosted from 208.78.240.0/22IP hosted in San MateoConnection speed: T1Has DNS PTR record? YesRegistrant “Chad”...
[ _ _ … 0 0 0 1 1 1 … 1 0 1 1 …]Real-valued Host-based Lexical
60+ features 1.1 million 1.8 millionGROWING
Day 100
![Page 34: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/34.jpg)
34
New Features Encountered All the Time
Many binary features Enumerating tokens, ISPs, registrants, etc... 2.9 million features after 100 days
![Page 35: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/35.jpg)
35
Live URL Classficiation System
Online learning
![Page 36: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/36.jpg)
36
Practical Challenges of ML in Systems
Industrial concerns Scale: millions of examples, features Non-stationarity: examples change over time (arms
race w/ criminals)
Pivotal decision: batch or online?
![Page 37: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/37.jpg)
37
Batch vs. Online Learning
Batch/offline learning SVM, logistic regression,
decision trees, etc Multiple passes over
data No incremental updates Potentially high memory
and processing overhead
Online learning Perceptron-style
algorithms Single pass over data Incremental updates Low memory and
processing overhead
Online learning addresses scale and non-stationarity
![Page 38: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/38.jpg)
38
Evaluations
Online learning for URL reputation Need for large, fresh training sets Comparing online algorithms Continuous retraining Growing feature vector
![Page 39: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/39.jpg)
39
Need lots of fresh training data?SVM trained once
SVM retrained daily
Fresh data helps
![Page 40: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/40.jpg)
40
Need lots of fresh training data?
Fresh data helps More data helps
SVM trained once on 2 weeks
SVM w/ 2-week sliding window
![Page 41: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/41.jpg)
41
Which online algorithm?
Perceptron
Stochastic Gradient Descent for Logistic Regression
Confidence-Weighted Learning
![Page 42: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/42.jpg)
42
Perceptron
Convergence result:
[Rosenblatt, 1958]
+−+ +
+
+
+−
−
−
−
−
Update on each mistake:
radius
margin
Number of mistakes ≤
![Page 43: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/43.jpg)
43
Logistic Regression with SGD
Log likelihood:
where
[Bottou, 1998]
For every example:
Proportional
![Page 44: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/44.jpg)
44
Confidence-Weighted Learning
Maintain Gaussian distribution over weight vector:
[Dredze et al., 2008] [Crammer et al., 2009]
Constrained problem:
Closed-form update:Update features at different rates
Diagonal covariance matrix
for evals
![Page 45: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/45.jpg)
45
Which online algorithms?
Perceptron
![Page 46: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/46.jpg)
46
Which online algorithms?
LR w/ SGD
Proportional update helps
Perceptron
![Page 47: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/47.jpg)
47
Which online algorithms?
Proportional update helps Per-feature confidence really helps
Confidence-Weighted
LR w/ SGD
Perceptron
![Page 48: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/48.jpg)
48
Batch...
Fresh data helps More data helps
Batch
![Page 49: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/49.jpg)
49
Batch vs. Online
Confidence-Weighted
Batch
Fresh data helps More data helps Online matches batch
![Page 50: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/50.jpg)
50
Why online does well?
SVM w/ 2-week sliding window
Confidence-Weighted
![Page 51: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/51.jpg)
51
Why online does well?
Confidence-Weightedonce-a-day
More data eventually helps Continuous retraining helps
SVM w/ 2-week sliding window
Confidence-Weighted
![Page 52: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/52.jpg)
52
Growing feature vector?
Confidence-Weightedfixed features
![Page 53: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/53.jpg)
53
Growing feature vector?
Confidence-Weightedfixed features
Confidence-Weightedgrowing features
Growing feature vector helps
![Page 54: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/54.jpg)
54
Fixed vs. Variable Features
Perceptronfixed features
Growing feature vector helps
![Page 55: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/55.jpg)
55
Fixed vs. Variable Features
Perceptronfixed features
Growing feature vector helps Growing + CW really helps
Perceptrongrowing features
![Page 56: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/56.jpg)
56
Proof-of-Concept Plugin
![Page 57: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/57.jpg)
57
Examine URL details...
![Page 58: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/58.jpg)
58
It sure looks like phishing...The Real Site
![Page 59: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/59.jpg)
59
Blacklisted later on
![Page 60: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/60.jpg)
60
Summary
Detecting malicious URLs Relevant real-world problem Successful application of online learning
What helps? More, fresher data Continuous retraining Growing feature vector
Confidence-Weighted vs. Batch As accurate More adaptive Fewer resources
![Page 61: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/61.jpg)
61
Today's Talk
Problem Moving Beyond Blacklists Large-Scale Online Learning Conclusion
![Page 62: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/62.jpg)
62
Impact
Public data set (UCI ML repo) Industrial impact
Mail providers have adopted our approach for classifying URLs in email messages
Project Info + Data Sethttp://sysnet.ucsd.edu/projects/url/
![Page 63: Learning to Detect Malicious URLs Justin Ma, Lawrence Saul, Stefan Savage, Geoff Voelker Computer Science & Engineering UC San Diego Presentation for Google](https://reader035.vdocument.in/reader035/viewer/2022062308/56649f005503460f94c16089/html5/thumbnails/63.jpg)
63
Final Thoughts: Systems + ML
Systems: high-impact, large-scale applications ML: Methodical approaches Systems: Embrace real-world constraints ML: More than “plug-and-play” solutions
Systems ↔ Machine Learning