m. zubair shafiq syed ali khayam muddassar farooq · embedded malware detection using markov...

26
Embedded Malware Detection using Markov ngrams M Zubair Shafiq 1 Syed Ali Khayam 2 Muddassar Farooq 1 M. Zubair Shafiq , Syed Ali Khayam , Muddassar Farooq 1 Next Generation Intelligent Networks Research Center 2 School of Electrical Engineering & Computer Sciences National University of Computer & Emerging Sciences Islamabad, Pakistan http://www.nexginrc.org National University of Sciences & Technology Rawalpindi, Pakistan http://wisnet.niit.edu.pk 1

Upload: others

Post on 30-Aug-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Embedded Malware Detection using 

Markov n‐grams

M Zubair Shafiq1 Syed Ali Khayam2 Muddassar Farooq1M. Zubair Shafiq , Syed Ali Khayam , Muddassar Farooq

1 Next Generation Intelligent Networks Research Center 2 School of Electrical Engineering & Computer Sciencesg

National University of Computer & Emerging Sciences

Islamabad, Pakistan

http://www.nexginrc.org

g g p

National University of Sciences & Technology

Rawalpindi, Pakistan

http://wisnet.niit.edu.pk

1

Page 2: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

AgendaAgenda

Introduction to problem domain

Mathematical Modeling

Introduction to problem domain

Discussion on Results

Mathematical Modeling

Conclusions

Discussion on Results

Future Work

Conclusions

2

Future Work

Page 3: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Introduction to Problem Domain

3

Page 4: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

IntroductionIntroduction

4

Page 5: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Problem with State‐of‐the‐art antivirus

Si t t hiSignature matching• Identify sequence of instructions unique to a virus => virus signaturevirus signature

• Match program to a database of signatures

P blProblems• Inability to detect zero‐day attacks• Scan only starting portions of files due to high overhead and false alarms; vulnerable!

• Size of signature database cannot scale in future

5

Size of signature database cannot scale in future

Page 6: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Embedded MalwareEmbedded Malware

benign program infected program

6

Page 7: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Mathematical Modeling

7

Page 8: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Our ApproachOur Approach

• Differentiate anomalous behaviorAnomaly Differentiate anomalous behavior from the normal workflow

• Primarily utilize statistical modeling

Anomaly detection

Statistical model of benign DOC, EXE, JPG, MP3, PDF, ZIP files using Markov n‐gramsg

Feature extraction using well known information‐theoretic measure, entropy rate, py

Threshold based detection using Gaussianity of sum of sampled entropy rate distribution

8

py

Page 9: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

n‐gram analysisn gram analysis

n‐gram Definition [en‐wikipedia]• An n gram is a sequence of n symbols in a given sequence• An n‐gram is a sequence of n symbols in a given sequence

3-gram

9

Page 10: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Whole file n‐gram analysisWhole file n gram analysis

1-gram benign PDF 1-gram infected PDF

Whole file, 1-gram analysis DOES NOT show significant perturbations

10

Page 11: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Block wise 1‐gram analysisBlock wise 1 gram analysis

Calculate Mahanalobis distance between benign model andCalculate Mahanalobis distance between benign model and given file in a block wise manner (blocksize = 1000bytes)

Benign model distribution is:

• average byte value frequencyaverage byte value frequency• their standard deviation

11

Page 12: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Block wise 1‐gram analysisBlock wise 1 gram analysis

12

Page 13: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Block wise 2‐gram analysisBlock wise 2 gram analysis

13

Page 14: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Analysis of resultsAnalysis of results

2‐gram is better than 1‐gram; at the cost of exponentially2 gram is better than 1 gram; at the cost of exponentially higher computational complexity!

Cannot increase n due to a fixed block size of 1000 bytes!

2‐gram distribution is a joint distribution

Some redundant information in joint distribution that can be removed using conditional distribution. HOW?

14

Page 15: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Advantage of Conditional DistributionAdvantage of Conditional Distribution

Reduction in sample spaceReduction in sample space

15

Page 16: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

From Conditional Distribution to k hMarkov Chain

A conditional distribution can be directly mapped to aA conditional distribution can be directly mapped to a Markov chain 

S b l f di ib i d i M kSymbols of distribution mapped to states in Markov Chain

How to select order of Markov chain?

Byte level autocorrelation!

16

Page 17: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Byte‐level autocorrelation resultsByte level autocorrelation results

Random information beyond lag=1y g

1st order Discrete Time Markov Chain

17

Page 18: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Markov n‐gram ModelMarkov n gram Model

1st order 256 state Markov Chain1st order, 256 state Markov Chain

Can be constructed using conditional 2‐gram distribution

18

Page 19: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Quantification featureQuantification featureaverage information in the Markov Chain

Entropy rate gives • Time density of average information in the Markov chain• Time density of average information in the Markov chain

For a 256 state Markov chain

19

Page 20: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Entropy rate Resultspy

20

Page 21: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Threshold selectionThreshold selection

Sum of sampled entropy rate distributions approach Gaussianity*Sum of sampled entropy rate distributions approach Gaussianity

Threshold selected at µ ± 5σ (99.99%)

21

* Direct consequence of central limit theorem

Page 22: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

ResultsMahanalobis n‐gram Detector

(%)Markov n‐gram Detector

(%)Percentage Improvement

(%)

MP3

TP rate 63.8 95.0 31.2

FP rate 32.3 0.2 32.1

JPG

TP rate 76.3 95.4 19.1

FP rate 35.7 2.7 33.0

PDF

TP rate 75.4 84.5 9.1

FP rate 46.8 31.8 15.0

ZIP

TP rate 60.0 90.4 30.4

FP rate 29.9 8.3 21.6

EXE

TP rate 54.1 84.9 30.8

FP t 47 3 16 7 10 6

22

FP rate 47.3 16.7 10.6

DOC

TP rate 65.6 66.3 0.7

FP rate 48.8 29.2 19.6

Page 23: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Conclusion & Future Work

23

Page 24: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

ConclusionConclusion

AdvantagesAdvantages• Ability do identify location of malware• Significantly improved TP rate and FP rate as compared• Significantly improved TP rate and FP rate as compared to Mahanalobis detector

LimitationsLimitations• Still slightly high false positive rate for DOC and PDF files• Embedded Malware ‐> dormant malware•Mimicry Attacks

24

Page 25: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Future WorkFuture Work

How to reduce false positive rate?How to reduce false positive rate?

Complement with signature based detectorp g

• Inability to detect zero‐day attacks 

Multiple features and correlation

• Initial results have shown promise ☺• Subject of forthcoming journal publication…

Patent pending on current work

25

p g

Page 26: M. Zubair Shafiq Syed Ali Khayam Muddassar Farooq · Embedded Malware Detection using Markov n‐grams M. Zubair Shafiq1, Syed Ali Khayam2, Muddassar Farooq1 1 Next Generation IntelligentNetworks

Thank you!Thank you!

Contact authors for queries/suggesstions:

[email protected]@[email protected]

26