getting fuzzy with it - jesse kornblumjessekornblum.com/presentations/htcia12.pdffuzzy hashing •...
TRANSCRIPT
Getting Fuzzy With It
Jesse Kornblum
2
Outline • Introduction • Identical • Similarity • Generic Data • Fuzzy Hashing • Similar Text • Similar Pictures • Similar Programs • Questions
3
Introduction • U.S. Air Force Office of Special Investigations • U.S. Department of Justice
• Applied Computer Forensics Research
• Tools I've written – Memory forensics – md5deep/hashdeep – fuzzy hashing (ssdeep) – Foremost
Motivation
4
Identical • A == B • Difficult for humans • Easy for computers
• Requires storing the original A and B – Big files – Could be illegal or private content
5
Identical • Cryptographic Hashing shortcut • SHA-3 and friends • If SHA-3(A) == SHA-3(B) then A == B*
– * to within a high degree of certainty – Chance of random collision is very very very very small
• Hashes signatures are small • Impossible to recover input from signature
6
Identical Data • Cryptographic
hashes are spoiled by even a single byte difference in the input
• Very similar things have wildly different cryptographic hashes
7
Image courtesy of Flickr user krystalchu and used under Create Commons license, http://www.flickr.com/photos/krystalclear/3189597813/
Similarity • What does it mean for two things to be similar?
8
Similarity • Depends on:
– The kind of things be compared – How they’re being compared
9
Example
10
Example • Both live in Washington DC • Both like a good hamburger • Both like dogs
• Conclusion: Similar
• President Obama is 7 cm taller • Jesse does not have gray hair • Work in different career fields
• Conclusion: Not similar
11
Generic Data • Thing being compared:
– Ones and zeros of the data • How being compared:
– Files with identical runs of ones and zeros
12
Disclaimer
• I didn’t invent this math • Originally Dr. Andrew Tridgell
– Samba – rsync was part of his thesis – Modified slightly for spamsum
• Spam detector in his “junk code” folder
Fuzzy Hashing • Combination of a rolling hash and traditional hash • Rolling hash looks only at last few bytes
F o u r s c o r e -> 83,742,221 F o u r s c o r e -> 5 F o u r s c o r e -> 90,281 • When processing a file, compute block size using file size • If rolling hash equals random magic value, it’s a trigger point
14
Fuzzy Hashing • Compute traditional hash while processing file • On each trigger point, record value • Reset traditional hash and continue
• Example – Excerpt from "The Raven" by Edgar Allan Poe – Triggers on ood and ore
15
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever dared to dream before; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore?, This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever
dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore ?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever
dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore ?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
28163
491522
57
145410213
738210
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever
dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore ?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
28163
491522
57
145410213
738210
Signature = 32730
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, fearing Doubting, dreaming dreams no mortals ever
dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore ?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting,
dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore ?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting,
dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore ?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
28163
381739
57
145410213
738210
Fuzzy Hashing Deep into the darkness peering, long I stood there, wondering, I AM THE LIZARD KING!!!1! fearing Doubting,
dreaming dreams no mortals ever dared to dream before ; But the silence was unbroken, and the stillness gave no token, And the only word there spoken was the whispered word, Lenore ?, This I whispered, and an echo murmured back the word, "Lenore !" Merely this, and nothing more
28163
381739
57
145410213
738210 Original Signature = 32730 New Signatures = 39730
Fuzzy Hashing • Comparing Signatures • Edit Distance
– Number of changes to turn one into the other
– 32730 – 39730 – Edit distance = 1
• If edit distance is small relative to length, fuzzy hash match
25
Fuzzy Hashing • Data format agnostic
– Doesn't consider the type of data being processed • Gets confused easily
– Identifies surrounding format rather than content – Many small changes throughout input
26
sdhash • Another algorithm for matching similar files • Developed by Dr. Vassil Roussev, University of New Orleans
• Like ssdeep, ignores data format • Variable sized hashes
– About 3% of input size • Handles data reordering
• Matches more files than ssdeep – This is not, necessarily, “better” – But usually is better
• Code, paper, roadmap: – http://sdhash.org/
27
Similar Text • What makes two text documents similar?
28
Similar Text • Documents which contain identical words • Documents which contain identical phrases
– N-grams
• Reduce the data • Stop words
– Language dependent – A, the, I, of, and, for
• Porter Stemming – Reduce words to a root – Example: dog dogs dogged dogging
• dog
29
Similar Text • Example:
• Input 1: • the dog run was small
– dog run wa small
• Input 2: • running with dogged pursuit
– run with dog pursuit
• These are documents which contain identical words
30
Similar Text • Synonyms • Reduce words to a base synonym • Example:
– darkness, dimness, dusk, gloom, nightfall, shade, shadows
• "Deep into the darkness peering" – Deep into the $DARK peer
• "The gloom that breathes upon me with these airs" – The $DARK that breath upon me with these air
31
sdtext • Tool to identify similar text
– Professor Clay Shields at Georgetown University – Formerly GUCS – http://sdtext.com/
• Identifies documents with identical words • Which words? • Select words which are unique, but not too unique
– Avoid stop words – Avoid words which only appear in one document
• Bit-vector of words • Compare with cosine similarity
32
Similar Pictures
33
WARNING: EXPLICIT IMAGERY
LAW ENFORCEMENT SENSITIVE - DO NOT DUPLICATE
Similar Pictures
34
Similar Pictures
35
kitty1.jpg kitty1.png
$ md5deep -b kitty1* 1dcfeb21e006836bb4a999e07ad54571 kitty1.jpg 12de826dce2567c85af22750644bc9ed kitty1.png $ ssdeep -b -a -d kitty1* kitty1.png matches kitty1.jpg (0)
Similar Pictures • Just like generic data and text • Simplify the data • Compare the simplified versions
• Example from Dr. Neal Krawetz – http://bit.ly/kuTKXR
• Reduce picture size • Reduce number of colors • Compute average pixel value • Record whether each pixel is above or below average • Create hash based on bits from the last step
36
Similar Pictures
37
268407468280 268407468280
• Hamming distance is zero • Images are similar!
Perceptual Hashing
38
Perceptual Hashing
39
268407468280 259820151032
• Small Hamming distance, but not zero • Number of matches to any image depends on match threshold
Perceptual Hashing Tools • Several exist and are free to law enforcement
– Plug in to major forensics programs – Have signatures you can share
• Free and open source library – pHash, http://phash.org/
• This technique works for audio and video too
40
Similar Programs • What makes computer programs similar?
41
Similar Programs • Programs which
– Do the same thing – Process the same kind of files – Have the same look and feel – Written by the same person – Connect to the same Command and Control server
42
Features • What features can we extract from a program?
43
Features • Signed code? • Which APIs are called • How often APIs are called • Order in which APIs are called • Entropy • DLLs used • Percentage of code coverage • Magic strings • N-grams of instructions • Control-flow graph • IP addresses accessed • …
44
Image courtesy of Flickr user doctor_keats and used under Create Commons license.
Similar Programs • Programs aren't static
• They can be different every time – Depend on user interaction
• No single best way to compare them
45
46
Outline • Introduction • Identical • Similarity • Generic Data • Fuzzy Hashing • Similar Text • Similar Pictures • Similar Programs • Questions