rechkov. lomonosov report
Introduction Assembler as a native language Anomalies detection
Detecting abnormal executable files using binary code mining
Rechkov Anton
TU Berlin Germany & TTI SFU Russia
21st March 2012
Rechkov Anton, Lomonosov Scholarship Report, 21st March 2012
Malware evolution
Ciphered: The virus code is encrypted.
Oligomorphic: Generation of a decryptor by randomly selecting each piece of the decryptor from several predefined alternatives.
Polymorphic: Generation of a sample by encrypting the malware body and modifying the decryptor on each replication.
Metamorphic: Reprogramming the whole virus body with an obfuscation engine.
Modern detection techniques
Signature analysis: Searching for a fixed pattern in code.
Emulation: Unpacking and analysis through emulation of the malware code, followed by signature analysis.
Behavioral analysis: Analysis of the function flow graph.
Code modification
Obfuscation: Transformation of executable program code which preserves functionality but complicates analysis and understanding of the algorithms.
Deobfuscation: Resolving irrelevant code by means of:
    Algebraic models
    Formal grammars
Outline
1 Assembler as a native language
    Binary code mining
    Native language processing
    Stochastic models
2 Anomalies detection
Binary code mining
Table of Contents
1 Assembler as a native language
    Binary code mining
    Native language processing
    Stochastic models
2 Anomalies detection
    Preparation
    Code generator lexemes
    Anomalies detection by neural networks
    Anomalies detection by probability model
Structure of compiler
Code generator engine:
    Machine code generator
    Optimizers: interprocedural optimization (IPO), profile-guided optimization (PGO), high-level optimizations
    Mutation code generator / obfuscator
Figure: Common compiler scheme
Common code generator features
High-level optimizations:
    Unique intermediate language
    Preoptimizing in intermediate representation
Code generation:
    Code templates from intermediate to target
    Number of used instruction types
Machine-dependent optimizer:
    Instruction costs
Validating the theory
Experiment
Determine instruction sequences
Compile source code with compilers
Compare distributions
Compilers
⇒ MSVC
⇒ LLVM
⇒ GCC
⇒ Intel C++ Compiler
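The experiment above (determine instruction sequences, compile, compare distributions) could be sketched roughly as follows. This is only an illustration: the slides do not say which similarity measure was used, so cosine similarity is an assumption here, and the two mnemonic listings are made-up stand-ins for disassembled compiler output.

```python
from collections import Counter
from math import sqrt

def ngram_distribution(mnemonics, n=3):
    """Relative frequency of each n-instruction sequence in a mnemonic list."""
    grams = [tuple(mnemonics[i:i + n]) for i in range(len(mnemonics) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency distributions."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Hypothetical mnemonic streams standing in for two compilers' output.
listing_a = ["push", "mov", "sub", "mov", "xor", "add", "push", "mov", "sub"]
listing_b = ["mov", "lea", "xor", "add", "mov", "lea", "xor", "ret"]

dist_a = ngram_distribution(listing_a)
dist_b = ngram_distribution(listing_b)
similarity_ab = cosine_similarity(dist_a, dist_b)  # low when compilers differ
similarity_aa = cosine_similarity(dist_a, dist_a)  # identical distributions
```

A value near 1 means the two binaries use nearly the same instruction sequences with similar frequencies; values near 0 indicate little overlap.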
XTEA distribution test
Figure: Frequency of words in the binary. Panels: (a) LLVM, (b) MSVC, (c) Intel C++, (d) GCC
Mean distribution of optimized binaries
Native language processing
Text Mining
Language detection
Author detection
Text Classification
Document clustering
Stochastic models
Neural networks
Advantages
+ effective with a small number of training vectors
+ assessment of the proximity of all samples
Disadvantages
- predetermined model:
    manual word definition
    manual analysis of redundant elements
    retraining limitations
Probability model
Advantages
+ self-sufficient word definition
+ training on positive vectors only
+ unified training (flexible retraining)
Disadvantages
- large sample set needed for training
- errors when determining the distribution
- computational complexity
Outline
1 Assembler as a native language
2 Anomalies detection
    Preparation
    Code generator lexemes
    Anomalies detection by neural networks
    Anomalies detection by probability model
Preparation
Collecting statistical samples
Python:
    Building the list of maximal repeated sequences
    Disassembling
    Searching strings
Matlab:
    Stochastic models
Code generator lexemes
From disassembly to lexemes
Lexeme: a sequence of 3 to 6 instructions
    ignore unknown bytes
    maximal repeated sequences
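A minimal sketch of how lexemes of this kind might be collected from a disassembled mnemonic stream. Several details are assumptions rather than the author's method: unknown bytes are represented by `None` and treated as breaks in the stream, and a sequence counts as "repeated" when it occurs at least twice. The function name and sample stream are hypothetical.

```python
from collections import Counter

UNKNOWN = None  # hypothetical marker for bytes the disassembler could not decode

def extract_lexemes(mnemonics, min_len=3, max_len=6, min_count=2):
    """Count 3..6-instruction sequences; keep those seen at least min_count times."""
    # Split the stream at unknown bytes so no lexeme spans an undecoded gap.
    runs, current = [], []
    for m in mnemonics:
        if m is UNKNOWN:
            if current:
                runs.append(current)
            current = []
        else:
            current.append(m)
    if current:
        runs.append(current)

    counts = Counter()
    for run in runs:
        for n in range(min_len, max_len + 1):
            for i in range(len(run) - n + 1):
                counts[tuple(run[i:i + n])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}

stream = ["push", "mov", "sub", "call", None, "push", "mov", "sub", "call", "ret"]
lexemes = extract_lexemes(stream)
```

Here the four-instruction prologue `push, mov, sub, call` and its two three-instruction sub-sequences are kept, while sequences that occur only once are dropped. Filtering down to maximal repeats only is not done in this sketch.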
Lexeme analysis
Suffix tree:
    Memory-efficient
    String searching faster than O(N^2)
    Fast assessment of maximal repeats in strings
Figure: Suffix tree example
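To illustrate how suffix structures expose repeats, the sketch below finds the longest repeated instruction sequence. Note that it swaps in a naively built suffix array instead of a suffix tree, purely to keep the example short; the naive sort does not have the speed of a real suffix tree, and nothing here is taken from the author's implementation.

```python
def longest_repeated_run(tokens):
    """Longest sequence occurring at least twice, via a naive suffix array."""
    n = len(tokens)
    # Sort suffix start positions by the suffix they denote (naive construction).
    suffixes = sorted(range(n), key=lambda i: tokens[i:])
    best = []
    for a, b in zip(suffixes, suffixes[1:]):
        # Longest common prefix of two lexicographically adjacent suffixes.
        k = 0
        while a + k < n and b + k < n and tokens[a + k] == tokens[b + k]:
            k += 1
        if k > len(best):
            best = tokens[a:a + k]
    return best

# Hypothetical mnemonic stream with one repeated four-instruction sequence.
stream = ["push", "mov", "sub", "call", "xor", "push", "mov", "sub", "call", "ret"]
repeat = longest_repeated_run(stream)
```

Because suffixes that share a long prefix end up next to each other after sorting, only adjacent pairs need to be compared. The two occurrences found this way may overlap in general.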
Anomalies detection by neural networks
Radial basis networks
    no need to choose the number of hidden layers
    no pathological convergence
    fast convergence through a combination of learning algorithms
Figure: Neural net architecture
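For readers unfamiliar with radial basis networks, here is a small pure-Python sketch of one: a single hidden layer of Gaussian units followed by a linear output. It is an assumption-laden toy, not the network from the slides: the centers are simply fixed at the training points, only the output weights are trained (with the delta rule), and the data is the XOR problem rather than compiler features.

```python
import math

def rbf_features(x, centers, sigma):
    """Gaussian activation of each hidden unit for input vector x."""
    return [math.exp(-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * sigma ** 2))
            for c in centers]

def train_rbf(samples, targets, centers, sigma=0.5, lr=0.5, epochs=200):
    """Fit the linear output weights with the delta rule; centers stay fixed."""
    weights = [0.0] * len(centers)
    for _ in range(epochs):
        for x, t in zip(samples, targets):
            phi = rbf_features(x, centers, sigma)
            err = t - sum(w * p for w, p in zip(weights, phi))
            weights = [w + lr * err * p for w, p in zip(weights, phi)]
    return weights

def predict(x, centers, weights, sigma=0.5):
    """Network output: weighted sum of the hidden-unit activations."""
    return sum(w * p for w, p in zip(weights, rbf_features(x, centers, sigma)))

# Toy data (XOR), not compiler features: one hidden unit per training point.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
targets = [0.0, 1.0, 1.0, 0.0]
weights = train_rbf(samples, targets, centers=samples)
outputs = [predict(x, samples, weights) for x in samples]
```

The single hidden layer is why no choice of layer count is needed, and because the output stage is linear in the weights, training it is a convex problem.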
Compiler detection
Figure: Compiler detection testing
Anomalies detection by probability model
Multivariate Gamma
Using a set of bivariate and trivariate Gamma distributions:
    Assumed Gamma distribution
    Sample proximity
    Fast training
Figure: Empirical and theoretical PDF of an element (x-axis: X, y-axis: PDF; curves: Gamma PDF, Empirical PDF)
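As a rough illustration of fitting a Gamma density to the observed values of one element, the sketch below uses the method of moments. This is an assumption: the slides do not state the fitting procedure, and they use bivariate and trivariate Gamma distributions, whereas this example is univariate. The sample frequencies are invented.

```python
import math

def fit_gamma_moments(values):
    """Method-of-moments estimate of Gamma shape k and scale theta."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean * mean / var, var / mean  # (k, theta)

def gamma_pdf(x, k, theta):
    """Gamma density, evaluated in log space to avoid overflow for large k."""
    if x <= 0:
        return 0.0
    log_pdf = (k - 1) * math.log(x) - x / theta - math.lgamma(k) - k * math.log(theta)
    return math.exp(log_pdf)

# Hypothetical relative frequencies of one lexeme across training binaries.
freqs = [0.021, 0.034, 0.018, 0.040, 0.027, 0.031, 0.024, 0.045, 0.029, 0.036]
k, theta = fit_gamma_moments(freqs)
density_at_mean = gamma_pdf(k * theta, k, theta)
```

The fitted density can then be compared against a histogram of the same values, which is the comparison the figure shows.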
Probability model testing
Error graphs of compiler probabilities as a function of the coefficient applied to the minimal value $P^{i}_{\min}$:
$p = P^{i}_{\min} \cdot 10^{\mathrm{coef}}$
Figure, plot 1: error (0 to 1) vs. coeff for min value (0 to 10); curves: false positive GCC O0, false negative Clang, false negative Intel, false negative GCC O2, false negative MS
Figure, plot 2: error (0 to 1) vs. coeff for min value (0 to 20); curves: false positive MS, false negative LLVM
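One possible reading of the threshold formula is sketched below: the smallest probability seen on training binaries is scaled by a power of ten, and a test binary is attributed to the compiler only if its probability reaches that threshold. The acceptance rule, the function names and all probability values are assumptions made for illustration; the slides give only the formula and the resulting error curves.

```python
def threshold(train_probs, coef):
    """p = P_min * 10**coef, with P_min the smallest training probability."""
    return min(train_probs) * 10 ** coef

def error_rates(own_probs, foreign_probs, p):
    """False-negative rate on own samples, false-positive rate on foreign ones."""
    false_neg = sum(1 for v in own_probs if v < p) / len(own_probs)
    false_pos = sum(1 for v in foreign_probs if v >= p) / len(foreign_probs)
    return false_neg, false_pos

# Hypothetical model probabilities; none of these values come from the slides.
train_probs = [1e-6, 5e-6, 2e-5, 8e-5]   # training binaries of the modeled compiler
own_test = [3e-6, 4e-5, 9e-7, 6e-5]      # unseen binaries of the same compiler
foreign_test = [1e-9, 2e-7, 5e-6, 3e-8]  # binaries built by other compilers

# One (coef, false_neg, false_pos) triple per coefficient value.
curve = [(coef, *error_rates(own_test, foreign_test, threshold(train_probs, coef)))
         for coef in range(0, 4)]
```

Under this reading, raising the coefficient tightens the threshold, so false positives fall while false negatives rise, which is the trade-off such error graphs typically show.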
Problem of zero-valued elements
Figure: two plots of error (0 to 1) vs. coeff for min value (0 to 10); curves in each plot: false positive GCC O2, false negative Clang, false negative Intel, false negative GCC O0, false negative MS
Conclusion
Proposed a connection between native language and assembler
Developed algorithms for lexical analysis of assembler language
Developed experimental stochastic models:
    Based on neural networks
    Based on a probability model
Implemented lexical analysis of assembler language
Approximate false positive errors of compiler detection:
    27%
    10-15%
Questions?