KDD Cup ’99: Classifier Learning Predictive Model for Intrusion Detection
Charles Elkan, 1999 Conference on Knowledge Discovery and Data Mining
Presented by Chris Clifton
KDD Cup Overview
• Held annually in conjunction with the Knowledge Discovery and Data Mining conference (now ACM-sponsored)
• Challenge problem(s) released well before the conference
  – Goal is to give the best solution to the problem
  – Relatively informal “contest”
  – Gives a “standard” test for comparing techniques
• Winner announced at the KDD conference
  – Lots of recognition for the winner
Classifier Learning for Intrusion Detection
• One of two KDD’99 challenge problems
  – The other was a knowledge discovery problem
• Goal is to learn a classifier that labels TCP/IP connections as intrusion or okay
  – Data: collection of features describing a TCP connection
  – Class: non-attack or type of attack
• Scoring: cost per test sample
  – Wrong answers penalized based on the type of “wrong”
Data: TCP “connection” information
• Dataset developed for the 1998 DARPA Intrusion Detection Evaluation Program
  – Nine weeks of raw TCP dump data from a simulated USAF LAN
  – Simulated attacks to give positive examples
  – Processed into 5 million training “connections”, 2 million test
  – Some “attributes” derived from raw data
• Twenty-four attack types in training data, falling into four classes:
  – DOS: denial-of-service, e.g. SYN flood
  – R2L: unauthorized access from a remote machine, e.g. password guessing
  – U2R: unauthorized access to local superuser (root) privileges, e.g. various “buffer overflow” attacks
  – probing: surveillance and other probing, e.g. port scanning
• Test set includes fourteen attack types not found in the training set
Basic features of individual TCP connections
| feature name | description | type |
| --- | --- | --- |
| duration | length (number of seconds) of the connection | continuous |
| protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete |
| service | network service on the destination, e.g. http, telnet, etc. | discrete |
| src_bytes | number of data bytes from source to destination | continuous |
| dst_bytes | number of data bytes from destination to source | continuous |
| flag | normal or error status of the connection | discrete |
| land | 1 if connection is from/to the same host/port; 0 otherwise | discrete |
| wrong_fragment | number of “wrong” fragments | continuous |
| urgent | number of urgent packets | continuous |
Content features within a connection suggested by domain knowledge
| feature name | description | type |
| --- | --- | --- |
| hot | number of “hot” indicators | continuous |
| num_failed_logins | number of failed login attempts | continuous |
| logged_in | 1 if successfully logged in; 0 otherwise | discrete |
| num_compromised | number of “compromised” conditions | continuous |
| root_shell | 1 if root shell is obtained; 0 otherwise | discrete |
| su_attempted | 1 if “su root” command attempted; 0 otherwise | discrete |
| num_root | number of “root” accesses | continuous |
| num_file_creations | number of file creation operations | continuous |
| num_shells | number of shell prompts | continuous |
| num_access_files | number of operations on access control files | continuous |
| num_outbound_cmds | number of outbound commands in an ftp session | continuous |
| is_hot_login | 1 if the login belongs to the “hot” list; 0 otherwise | discrete |
| is_guest_login | 1 if the login is a “guest” login; 0 otherwise | discrete |
Traffic features computed using a two-second time window
| feature name | description | type |
| --- | --- | --- |
| count | number of connections to the same host as the current connection in the past two seconds | continuous |
| serror_rate | % of connections that have “SYN” errors | continuous |
| rerror_rate | % of connections that have “REJ” errors | continuous |
| same_srv_rate | % of connections to the same service | continuous |
| diff_srv_rate | % of connections to different services | continuous |

Note: serror_rate through diff_srv_rate refer to these same-host connections.

| feature name | description | type |
| --- | --- | --- |
| srv_count | number of connections to the same service as the current connection in the past two seconds | continuous |
| srv_serror_rate | % of connections that have “SYN” errors | continuous |
| srv_rerror_rate | % of connections that have “REJ” errors | continuous |
| srv_diff_host_rate | % of connections to different hosts | continuous |

Note: srv_serror_rate through srv_diff_host_rate refer to these same-service connections.
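A minimal sketch of how such two-second-window counts could be derived from a time-sorted connection log. The record layout `(time, dst_host, service)` is hypothetical, and this version counts only prior connections in the window; whether the original extraction also counted the current connection is a detail not specified on the slide.

```python
from collections import deque

def window_counts(connections, window=2.0):
    """For each (time, dst_host, service) record, count prior connections
    within `window` seconds to the same host (count) and to the same
    service (srv_count).  Input must be sorted by time."""
    recent = deque()   # records still inside the sliding window
    features = []
    for t, host, service in connections:
        # Drop records older than the window.
        while recent and t - recent[0][0] > window:
            recent.popleft()
        count = sum(1 for _, h, _ in recent if h == host)
        srv_count = sum(1 for _, _, s in recent if s == service)
        recent.append((t, host, service))
        features.append((count, srv_count))
    return features
```

The percentage features (serror_rate, same_srv_rate, etc.) would be computed the same way, as fractions over the records held in `recent`.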
Scoring
• Each prediction gets a score from a cost matrix
  – Row is the correct answer
  – Column is the prediction made
• Overall score is the average over all predictions

| actual \ predicted | normal | probe | DOS | U2R | R2L |
| --- | --- | --- | --- | --- | --- |
| normal | 0 | 1 | 2 | 2 | 2 |
| probe | 1 | 0 | 2 | 2 | 2 |
| DOS | 2 | 1 | 0 | 2 | 2 |
| U2R | 3 | 2 | 2 | 0 | 2 |
| R2L | 4 | 2 | 2 | 2 | 0 |
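The scoring rule can be sketched in a few lines of Python; the class ordering follows the cost matrix above:

```python
# Cost matrix from the slide: COST[actual][predicted].
CLASSES = ["normal", "probe", "DOS", "U2R", "R2L"]
COST = [
    [0, 1, 2, 2, 2],  # actual normal
    [1, 0, 2, 2, 2],  # actual probe
    [2, 1, 0, 2, 2],  # actual DOS
    [3, 2, 2, 0, 2],  # actual U2R
    [4, 2, 2, 2, 0],  # actual R2L
]

def score(actual, predicted):
    """Average per-sample cost, the quantity used to rank entries."""
    idx = {c: i for i, c in enumerate(CLASSES)}
    total = sum(COST[idx[a]][idx[p]] for a, p in zip(actual, predicted))
    return total / len(actual)
```

Note the asymmetry: calling an R2L attack “normal” costs 4, while the reverse mistake costs only 2, so missed remote-to-local attacks are punished hardest.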
Results
• Twenty-four entries; scores:
  0.2331 0.2356 0.2367 0.2411 0.2414 0.2443 0.2474 0.2479 0.2523 0.2530 0.2531 0.2545 0.2552 0.2575 0.2588 0.2644 0.2684 0.2952 0.3344 0.3767 0.3854 0.3899 0.5053 0.9414
• 1-Nearest Neighbor scored 0.2523
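The 1-Nearest Neighbor baseline is conceptually tiny; a toy version over numeric feature vectors might look like this (the real baseline, of course, ran over millions of preprocessed connection records):

```python
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`
    under Euclidean distance.  `train` is a list of
    (feature_vector, label) pairs."""
    best_label, best_dist = None, math.inf
    for features, label in train:
        d = math.dist(features, query)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label
```

That a method this simple lands in the middle of the pack (0.2523 vs. the winning 0.2331) is itself a notable result of the contest.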
Winning Method: Bagged Boosting
• Submitted by Bernhard Pfahringer, ML Group, Austrian Research Institute for AI
• 50 samples drawn from the original 5-million-odd example set
  – Contrary to standard bagging, the sampling was slightly biased:
  – all examples of the two smallest classes, U2R and R2L
  – 4000 PROBE, 80000 NORMAL, and 400000 DOS examples
  – duplicate entries in the original data set removed
• Ten C5 decision trees induced from each sample
  – Used both C5’s error-cost and boosting options
• Final predictions computed from the 50 per-sample predictions by minimizing “conditional risk”
  – Minimizes the sum of error costs times class probabilities
• Training took approximately one day on a two-processor 200 MHz Sparc
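The “conditional risk” step can be sketched as follows. `class_probs` stands in for the class-probability estimates aggregated from the 50 tree ensembles (a simplified illustration under that assumption, not Pfahringer’s actual code); `COST` is the matrix from the scoring slide:

```python
# COST[actual][predicted], from the contest's scoring slide.
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]

def min_risk_class(class_probs):
    """Pick the prediction whose expected cost,
    sum over actual classes of P(actual) * COST[actual][predicted],
    is smallest."""
    risks = [sum(p * COST[a][pred] for a, p in enumerate(class_probs))
             for pred in range(len(COST))]
    return min(range(len(risks)), key=risks.__getitem__)
```

Because the costs are asymmetric, this decision rule can prefer an attack label even when “normal” is the single most probable class, e.g. with probability split evenly between normal and R2L it predicts R2L (expected cost 1.0 vs. 2.0).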
Confusion Matrix (Breakdown of score)

Winning entry (rows: actual class; columns: predicted class; 0 = normal, 1 = probe, 2 = DOS, 3 = U2R, 4 = R2L):

| actual \ predicted | 0 | 1 | 2 | 3 | 4 | %correct |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 60262 | 243 | 78 | 4 | 6 | 99.5% |
| 1 | 511 | 3471 | 184 | 0 | 0 | 83.3% |
| 2 | 5299 | 1328 | 223226 | 0 | 0 | 97.1% |
| 3 | 168 | 20 | 0 | 30 | 10 | 13.2% |
| 4 | 14527 | 294 | 0 | 8 | 1360 | 8.4% |
| %correct | 74.6% | 64.8% | 99.9% | 71.4% | 98.8% | |

For 1-NN:

| actual \ predicted | 0 | 1 | 2 | 3 | 4 | %correct |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 60322 | 212 | 57 | 1 | 1 | 99.6% |
| 1 | 697 | 3125 | 342 | 0 | 2 | 75.0% |
| 2 | 6144 | 76 | 223633 | 0 | 0 | 97.3% |
| 3 | 209 | 5 | 1 | 8 | 5 | 3.5% |
| 4 | 15785 | 308 | 1 | 0 | 95 | 0.6% |
| %correct | 72.5% | 83.9% | 99.8% | 88.9% | 92.2% | |
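The reported scores can be cross-checked from these confusion matrices: total cost is the element-wise product of counts and costs, divided by the number of test samples. This reproduces both 0.2331 and 0.2523:

```python
# Cost matrix and confusion matrices as given on the slides
# (rows = actual class, columns = predicted class).
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]
WINNER = [
    [60262, 243, 78, 4, 6],
    [511, 3471, 184, 0, 0],
    [5299, 1328, 223226, 0, 0],
    [168, 20, 0, 30, 10],
    [14527, 294, 0, 8, 1360],
]
ONE_NN = [
    [60322, 212, 57, 1, 1],
    [697, 3125, 342, 0, 2],
    [6144, 76, 223633, 0, 0],
    [209, 5, 1, 8, 5],
    [15785, 308, 1, 0, 95],
]

def matrix_score(confusion, cost):
    """Average cost per test sample: sum of count*cost over all cells,
    divided by the total number of predictions."""
    total = sum(confusion[a][p] * cost[a][p]
                for a in range(5) for p in range(5))
    n = sum(map(sum, confusion))
    return total / n

print(round(matrix_score(WINNER, COST), 4))  # → 0.2331
print(round(matrix_score(ONE_NN, COST), 4))  # → 0.2523
```

Most of the winner’s cost comes from row 4: the 14527 R2L connections predicted as normal contribute 14527 × 4 = 58108 of its 72500 total cost.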
Analysis of winning entry
• Result comparable to 1-NN except on “rare” classes
  – Winner’s training sample was biased toward rare classes
  – Does this give us a general principle?
• Misses badly for some attack categories
  – True for 1-NN as well
  – Problem with the feature set?
Second and Third Places (probably not statistically significant)
• Itzhak Levin, LLSoft, Inc.: Kernel Miner
  – Link broken?
• Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin, MP13, Moscow, Russia
  – Three stages: verbal rules constructed by an expert, then a first echelon of voting decision trees, then a second echelon of voting decision trees
  – Stages run sequentially; a branch to the next stage occurs whenever the current one fails to recognize the connection
  – Trees constructed using their own (previously developed) tree learning algorithm