CISC 879 - Machine Learning for Solving Systems Problems
Presented by: Rag Mayur Chevuri
Dept of Computer & Information Sciences
University of Delaware
Behavioural Classification
Tony Lee and Jigar J Mody
Automatic malware classification
• Human analysis is inefficient and inadequate.
• Large number of new virus/spyware families
• Our focus: the classification problem
• Effective classification leads to:
  - Better detection
  - Better cleaning
  - Better analysis solutions
Classification Process
Objectives of classification methodologies
• Efficiently and automatically.
• Minimal information loss.
• Structured to be stored, analyzed and referenced efficiently.
Objectives of classification methodologies (contd..)
• Applies learned knowledge to automatically identify familiar patterns and similarity relations in a given target
• Adaptable and has innate learning abilities.
Approach
• Automated classification method based on:
  - runtime behavioral data
  - machine learning
• Represent a file by its runtime behavior
• Structure the event information
• Store it in a database
• Construct classifiers
• Apply the classifiers to new objects
A “good” knowledge representation
• Effectively captures knowledge of the object it represents.
• Can persist in permanent storage.
• Enables classifiers to efficiently and effectively correlate data across a large number of objects.
Representing behavior:
• The meaning of a particular action - resulting state
• Construct the representation in a consistent canonical format.
Vector Approach
• Process data in vector format using statistical and probabilistic algorithms
• Problem: vector size, scalability, and factorability.
The Opaque Object Approach
• Objects represent data in rich syntax
• Rich semantic representation of the actual object
• Precise distance between objects used for Clustering
Events Representation
• Sequence of events
• Ordered according to
  - time of occurrence of program actions
  - environment state transitions
[Figure: event timeline from 00:00 to 00:04 showing Registry Query, File Write, Open Process, Network Listen, Registry Write, Allocate VM, Write VM, Terminate Process, Open Mutant, Create Mutant]
Event Properties
• Event ID
• Event object (e.g. registry, file, process, socket, etc.)
• Event subject if applicable (i.e. the process that takes the action)
• Kernel function called if applicable
• Action parameters (e.g. registry value, file path, IP address)
• Status of the action (e.g. file handle created, registry removed, etc.)
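The properties listed above map naturally onto a small record type. The following is a minimal sketch, not the authors' actual schema: the class name `Event`, its field names, and the sample values (process name, registry key, file path) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Event:
    """One observed runtime event (hypothetical layout, for illustration)."""
    event_id: int
    obj: str                               # event object: "registry", "file", "process", ...
    subject: Optional[str] = None          # process that takes the action, if applicable
    kernel_function: Optional[str] = None  # kernel function called, if applicable
    parameters: Tuple[str, ...] = ()       # e.g. registry value, file path, IP address
    status: Optional[str] = None           # e.g. "file handle created"


# A made-up two-event trace, ordered by time of occurrence.
trace = [
    Event(1, "registry", subject="sample.exe",
          kernel_function="NtQueryValueKey",
          parameters=(r"HKLM\Software\Example",), status="value read"),
    Event(2, "file", subject="sample.exe",
          kernel_function="NtWriteFile",
          parameters=(r"C:\temp\example.dat",), status="file handle created"),
]
```

Making the record immutable (`frozen=True`) keeps events hashable, which is convenient if they are later stored in sets or used as dictionary keys.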
An example (Register Event)
Generate Classifier for Classification
Which classifier?
• Case-based classifier: treat the existing malware collection as a database of solutions.
• Learn by case-based reasoning (CBR)
• Nearest Neighbor algorithms.
• To make the CBR approach scalable, apply “Clustering”.
Clustering
• Unsupervised learning
• Organize objects into clusters
• A cluster is a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Distance Measure
• Levenshtein distance: “minimum cost required to transform one sequence of objects to another sequence by applying a set of operations.”
• Operation = Op(Event)
• Cost(Transformation) = Σ_i Cost(Operation_i)
• Cost of an operation depends on the operator as well as the operand
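The distance described above can be sketched as a weighted edit distance computed by dynamic programming. This is an illustration under stated assumptions, not the authors' implementation: events are reduced to plain strings, and the cost functions and the `WEIGHT` table are hypothetical stand-ins for the operation cost matrix.

```python
def edit_distance(a, b, insert_cost, delete_cost, substitute_cost):
    """Minimum total cost of transforming sequence a into sequence b."""
    m, n = len(a), len(b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # delete every event of a
        dist[i][0] = dist[i - 1][0] + delete_cost(a[i - 1])
    for j in range(1, n + 1):          # insert every event of b
        dist[0][j] = dist[0][j - 1] + insert_cost(b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + delete_cost(a[i - 1]),
                dist[i][j - 1] + insert_cost(b[j - 1]),
                dist[i - 1][j - 1] + substitute_cost(a[i - 1], b[j - 1]),
            )
    return dist[m][n]


# Hypothetical costs: the cost depends on the operator (insert, delete,
# substitute) and on the operand (which event is involved).
WEIGHT = {"Network Listen": 2.0}


def indel(event):
    return WEIGHT.get(event, 1.0)


def substitute(x, y):
    return 0.0 if x == y else max(indel(x), indel(y))


seq_a = ["Registry Query", "File Write", "Open Process"]
seq_b = ["Registry Query", "Network Listen", "Open Process"]
d = edit_distance(seq_a, seq_b, indel, indel, substitute)
```

With unit costs for every operation this reduces to the classic Levenshtein distance; the per-event weights are what let some behaviors count more than others.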
Operation Cost Matrix for Similarity Measure
k-medoid partitioning clustering algorithm
1. Place K points into the space. These points represent the initial group medoids.
2. Assign each object to the group that has the closest medoid.
3. Recalculate the positions of the K medoids.
4. Repeat steps 2 and 3 until the medoids no longer move.
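The four steps above can be sketched in a few lines. This is a simplified illustration rather than the authors' code: the function names, the random initialisation, the `seed` and `max_iter` parameters, and the one-dimensional toy data are assumptions. Any distance function can be plugged in, including an edit distance over event sequences.

```python
import random


def assign(objects, medoids, distance):
    """Step 2: map each medoid index to the indices of its nearest objects."""
    clusters = {m: [] for m in medoids}
    for i, obj in enumerate(objects):
        nearest = min(medoids, key=lambda m: distance(obj, objects[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(objects, k, distance, seed=0, max_iter=100):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(objects)), k)   # step 1: initial medoids
    for _ in range(max_iter):
        clusters = assign(objects, medoids, distance)
        new_medoids = []
        for m, members in clusters.items():        # step 3: recompute medoids
            if not members:                        # empty cluster keeps its medoid
                new_medoids.append(m)
                continue
            # The medoid is the member with the smallest total distance
            # to all other members of its cluster.
            best = min(members, key=lambda c: sum(
                distance(objects[c], objects[j]) for j in members))
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):       # step 4: stop when stable
            break
        medoids = new_medoids
    return medoids, assign(objects, medoids, distance)


# Toy data: two obvious groups on a line, absolute difference as distance.
points = [1, 2, 3, 10, 11, 12]
medoids, clusters = k_medoids(points, 2, lambda a, b: abs(a - b))
```

Unlike k-means, the cluster centre is always an actual object, so only pairwise distances are needed and no averaging of event sequences is required.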
Classifying a new object
Nearest Neighbor Classification
Compare the new object to all the medoids.
Assign the new object the family name of the closest medoid.
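A minimal sketch of this nearest-medoid lookup follows. The family names, the event sequences, and the simple set-based distance are made up for illustration; in practice the distance would be the sequence distance used for clustering.

```python
def classify(new_object, labelled_medoids, distance):
    """Return the family name of the medoid closest to new_object.

    labelled_medoids is a list of (family_name, medoid_object) pairs.
    """
    family, _ = min(labelled_medoids,
                    key=lambda pair: distance(new_object, pair[1]))
    return family


def set_distance(a, b):
    # Stand-in distance: number of event types present in only one sequence.
    return len(set(a) ^ set(b))


labelled_medoids = [
    ("FamilyA", ["Registry Query", "File Write"]),
    ("FamilyB", ["Network Listen", "Open Process"]),
]
sample = ["Registry Query", "File Write", "Open Mutant"]
predicted = classify(sample, labelled_medoids, set_distance)
```

Because the new object is compared only against the K medoids rather than every stored sample, the cost of classification grows with the number of clusters and not with the size of the collection.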
Experiment
• An automated distributed replication system
Data Analysis
• Test data:
  - Experiment 1: 461 samples of 3 families
  - Experiment 2: 760 samples of 11 families
• 10-fold cross validation
• We vary and contrast experiments by adjusting two parameters:
  - number of clusters (K)
  - maximum number of events (E)
• Measure error rate and accuracy gain
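For readers unfamiliar with 10-fold cross validation, the split can be sketched as follows. This is a generic illustration, not the authors' experimental harness: the function name `k_fold_indices` and the `seed` parameter are assumptions, and the sample count 461 is taken from Experiment 1 above.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(461))
```

Each round builds the classifier from nine folds and measures the error rate on the held-out fold; the reported figure is then typically the average over the ten rounds.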
• Error rate: ER = number of incorrectly classified samples / total number of samples
• Accuracy: AC = 1 - ER
• Accuracy gain of x over y: G(x, y) = | (ER(y) - ER(x)) / ER(x) |
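The three measures translate directly into code. The sketch below follows the definitions above; the function names and the sample counts are invented for illustration and are not results from the experiments.

```python
def error_rate(n_incorrect, n_total):
    return n_incorrect / n_total


def accuracy(er):
    return 1 - er


def accuracy_gain(er_x, er_y):
    # G(x, y) = |(ER(y) - ER(x)) / ER(x)|; undefined when ER(x) is zero.
    return abs((er_y - er_x) / er_x)


# Made-up example: configuration x misclassifies 5 of 100 samples,
# configuration y misclassifies 10 of 100.
er_x = error_rate(5, 100)
er_y = error_rate(10, 100)
gain = accuracy_gain(er_x, er_y)
```

Note that the gain is normalised by ER(x), the error rate of the configuration being credited, so halving an already small error rate yields a large gain.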
Experiment A
Observations
• Accuracy vs. #Clusters: The error rate decreases as the number of clusters increases.
• Accuracy vs. Maximum #Events: The error rate decreases as the event cap increases. The more events we observe, the more accurately behavior is captured, and the more likely the clustering discovers the semantic similarity among variants of a family.
• Accuracy Gain vs. Number of Events: The gain in accuracy is more substantial at lower event caps (100 vs. 500) than at higher event caps (500 vs. 1000).
• Accuracy vs. Number of Families: The 11-family experiment outperforms the 3-family experiment in accuracy in high event cap tests (1000), but the result is the opposite in lower event cap tests (100).
Conclusion
• Runtime behavior + machine learning allows us to focus on pattern/similarity recognition in behavior semantics
• Lack of code structural information
• Combine with static analysis to improve classification accuracy
• “Developing automated classification process that applies classifiers with innate learning ability on near lossless knowledge representation is the key to the future of malware classification and defense.”