dec 29, 2015...

66
Stylistic fingerprints Stylometry has been applied to: Fineart Music Unconventional text Translated text Source code Dec 29, 2015 1 of 60

Upload: others

Post on 20-Jul-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Stylistic  fingerprints• Stylometry  has  been  applied  to:– Fine-­‐art– Music– Unconventional  text– Translated  text– Source  code

Dec 29, 2015

1 of 60

Page 2: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Supervised  stylometry• Given  a  set  of  documents  of  known  authorship,  classify  a  document  of  unknown  authorship– Classifier  trained  on  undisputed  text

• Scenario:  Alice  the  Anonymous  Blogger  vs.  Bob  the  Abusive  Employer– Alice  blogs  about  abuses  in  Bob’s  company

• Blog  posted  anonymously  (Tor,  pseudonym,  etc)– Bob  obtains  5,000 words  of  each  employee’s  writing

• Bob  uses  stylometry  to  identify  Alice  as  the  blogger

Dec 29, 2015

2 of 60

Page 3: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

channel  partner  advocate  for  Cisco  Alert

Dec 29, 2015

3 of 60

Page 4: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

• Stylistic  fingerprints

– Use  stylometric  fingerprints  to  find  who  “theconnor”  is.– Collect  the  rest  of  the  tweets  in  the  timeline,  compare  to  the  cover  

letters  submitted  to  Cisco  and  identify  theconnor.

Frequency  of  function  words

Frequency  of  punctuation

Fingerprints  in  textual  data

tweets of

theconnor

Ciscocover letter

from Person A B C D …modelfor

A,  B,  C,…

Who istheconnor?

extract features

extract features

Dec 29, 2015

4 of 60

Page 5: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

theconnor made her Twitter profile private and deleted all information on her homepageright after the event but it was too late since search engines cache search results which can lead to old information.

Dec 29, 2015

5 of 60

Page 6: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

theconnor  à Connor  Riley  à

Dec 29, 2015

6 of 60

Page 7: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

What  about  fingerprints  in  source  code?Dec 29, 2015

7 of 60

Page 8: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

DE-­‐ANONYMIZING  PROGRAMMERS  VIA

CODE  STYLOMETRY

De-anonymizing Programmers via Code Stylometry. Aylin Caliskan-Islam, Richard Harang, Andrew Liu, Arvind Narayanan, Clare Voss, Fabian Yamaguchi, and Rachel Greenstadt.Usenix Security Symposium, 2015

Page 9: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Source  code  stylometry

• Everyone  learns  coding  on  an  individual  basis,  as  a  result  code  in  a  unique  style,  which  makes  de-­‐anonymization  possible.

• Software  engineering  insights  – programmer  style  changes  while  implementing   sophisticated  

functionality  – differences  in  coding  styles  of  programmers  with  different  skill  sets

• Identify  malicious  programmers.

Dec 29, 2015

9 of 60

Page 10: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Source  code  stylometry:Who  wrote  this  code?

• Scenario  1:Alice  analyzes  a  library  with  malicious  source  code.Bob  has  a  source  code  collection  with  known  authors.Bob  will  search  his  collection  to  find  Alice’s  adversary.

Dec 29, 2015

10 of 60

Page 11: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Source  code  stylometry:Who  wrote  this  code?

• Scenario  2:Alice  got  an  extension   for  her  programming  assignment.  Bob,  the  professor  has  everyone  else’s  code.  Bob  wants  to  see  if  Alice  plagiarized.

Dec 29, 2015

11 of 60

Page 12: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Source  code  stylometryIran  confirms  death  sentence  for  'porn  site'  web  programmer

No technical difference between security-enhancing and privacy-infringing…

Dec 29, 2015

12 of 60

Page 13: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Comparison  to  related  workRelatedWork

Author

Size

Instances AverageLOC

Language Fetaures Method Result

MacDonellet  al.

7 351 148 C++ lexical &layout

Case-­‐based  reasoning

88.0%

Frantzeskou et  al. 8 107 145 Java lexical &layout

Nearestneighbor

100.0%

Elenbogen and  Seliya

12 83 100 C++ lexical &layout

C4.5  decision  tree

74.7%

Shevertalov et.  al. 20 N/A N/A Java lexical &layout

Genetic  algorithm

80%

Frantzeskou et  al. 30 333 172 Java lexical &layout

Nearest  neighbor

96.9%

Ding  andSamadzadeh

46 225 N/A Java lexical &layout

Nearest  neighbor

75.2%

Ours 35 315 68 C++ lexical &layout  &syntactic

Randomforest

100.0%

Ours 250 2,250 77 C++ 98.0%

Ours 1,600 14,400 70 C++93.6%

Dec 29, 2015

13 of 60

Page 14: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Comparison  to  related  workRelatedWork

AuthorSize

Instances AverageLOC

Language Fetaures Method Result

Frantzeskouet  al.

30 333 172 Java lexical &layout

Nearest  neighbor

96.9%

Ding  andSamadzadeh

46 225 N/A Java lexical &layout

Nearest  neighbor

75.2%

Ours 35 315 68 C++lexical &layout  &syntactic

Randomforest

100.0%

Ours 250 2,250 77 C++ 98.0%Ours 1,600 14,400 70 C++

93.6%

Dec 29, 2015

14 of 60

Page 15: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Comparison  to  related  work

RelatedWork

AuthorSize

Instances

AverageLOC

Language Fetaures Method Result

Frantzeskouet  al.

30 333 172 Java lexical &layout

Nearest  neighbor

96.9%

Ding  andSamadzadeh

46 225 N/A Java lexical &layout

Nearest  neighbor

75.2%

Ours 35 315 68 C++lexical &layout  &syntactic

Randomforest

100.0%Ours 250 2,250 77 C++ 98.0%Ours 1,600 14,400 70 C++

93.6%

Dec 29, 2015

15 of 60

Page 16: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Source  code  stylometryMachine  learning  workflow

Dataset in CPP ~100,000 users

preprocessing

Fuzzy AST parser

Extract features

Random Forest

classificationmajority

vote

A B C D

Dec 29, 2015

16 of 60

Page 17: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

• 2008-­‐2014• Same  problems• Limited  time• Problems  get  harder• C++  most  common

Dec 29, 2015

17 of 60

Page 18: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Source  code  stylometry

• Stylometry  can  be  used  in  source  code  to  identify   the  author  of  a  program.  

• Extract  layout  and  lexical  features  from  source  code.

• Abstract   syntax  trees   (AST)   in  code   represent   the  structure   of   the  program.      

• Preprocess   source   code   to  obtain   AST.

• Parse  AST   to  extract  coding   style   features.

Source CodeAST

Dec 29, 2015

18 of 60

Page 19: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Random  Forests  are  made  of  decision  treesDec 29, 2015

19 of 60

Page 20: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Decision  Trees• Representation– Each  internal  node  tests  an  attribute– Each  branch  is  an  attribute  value– Each  leaf  assigns  a  classification

Dec 29, 2015

20 of 60

Page 21: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Choosing  an  attributeDec 29, 2015

21 of 60

Which is better?

Page 22: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Dec 29, 2015

22 of 60

Information gain

• A chosen attribute A divides the training set E into subsets E1, … , Ev according to their values for A, where A has v distinct values.

• Information Gain (IG) or reduction in entropy from the attribute test:

• remainder(A) is the remaining uncertainty after splitting on the attribute

• Choose the attribute with the largest IG

Page 23: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Dec 29, 2015

23 of 60

Information gainFor the training set, p = n = 6, I(6/12, 6/12) = 1 bit

Consider the attributes Patrons and Type (and others too):

Patrons has the highest IG of all attributes and so is chosen by the DTL algorithm as the root

Page 24: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

934  important  features  of  code  stylometry  by  information  gainOUT  OF  120,000  FEATURES

FeatureType Percentage CountWord  Unigram Frequency 55% 517AST  Node-­‐Bigram  Frequency 31% 291AST  Node  AverageDepth 5% 48AST  Node Frequency 4% 38AST  Node TFIDF 2% 19C++ Keywords 2% 15Layout  Features 1% 6

Dec 29, 2015

24 of 60

Page 25: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

When  machine  learning  goes  wrong

• Bias  vs  variance

Dec 29, 2015

25 of 60

Page 26: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Random  Forest• Individual  Trees  have  low  bias,  but  high  variance

• Problem:  Overfitting• Solution:  Forest  not  a  tree,  build  trees  on  subsets  of  the  training  data,  using  subsets  of  the  features

Dec 29, 2015

26 of 60

Page 27: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Source  code  stylometryMethod①Use  random  forest  as  the  machine  learning  classifier,  

①avoid  over-­‐fitting②multi-­‐class  classifier  by  nature

②K-­‐fold  cross  validation③Validate  method  on  a  different  dataset

Dec 29, 2015

27 of 60

Page 28: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  1:  Authorship  attribution• Who  is  this  anonymous  programmer?• Who  is  Satoshi?

Dec 29, 2015

28 of 60

Page 29: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  1:  Authorship  attribution

Train on 1,600 authorsto identify the authors of

14,400 files train

test

94% accuracy

• 94%  accuracy  in  identifying  1,600  authors  of  14,400  anonymous  program  files.

Dec 29, 2015

29 of 60

Page 30: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  1:  Authorship  attribution

Train on the suspect setto de-anonymize theinitial Bitcoin author

train

test

Satoshi = git contributor

• If  only  we  had  a  suspect  set  for  Satoshi…

Dec 29, 2015

30 of 60

Page 31: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  2:  Obfuscation• Who  is  the  programmer  of  this  obfuscated  source  code?

• Code  is  obfuscated  to  become  unrecognizable.• Our  authorship  attribution  technique  is  impervious  to  common  off-­‐the-­‐shelf  source  code  obfuscators.

Dec 29, 2015

31 of 60

Page 32: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  2:  C++  Obfuscation  -­‐ STUNNIXDec 29, 2015

32 of 60

Page 33: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  2:  C++  Obfuscation  -­‐ STUNNIXDec 29, 2015

33 of 60

Page 34: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  2:  C++  Obfuscation  -­‐ STUNNIX

Same set  of  20  authorswith  180  program  files

Classification  Accuracy

Original   source code 99%STUNNIX-­‐Obfuscated  source  code 99%

Dec 29, 2015

34 of 60

Page 35: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  2:  C  Obfuscation  -­‐ TIGRESSDec 29, 2015

35 of 60

Page 36: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  2:  C  Obfuscation  -­‐ TIGRESSDec 29, 2015

36 of 60

Page 37: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  2:  C  Obfuscation  -­‐ TIGRESS

Same set  of  20  authorswith  180  program  files

Classification  Accuracy

Original  C  source code 96%TIGRESS-­‐Obfuscated  source  code 67%Random chance  of  correct  de-­‐anonymization 5%

Dec 29, 2015

37 of 60

Page 38: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  3:  Coding  style  throughout  years• Is  programming  style  consistent?• If  yes,  we  can  use  code  from  different  years  for  authorship  attribution.

Dec 29, 2015

38 of 60

Page 39: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  3:  Coding  style  throughout  years• Coding  style  is  preserved  up  to  some  degree  throughout  years.

train

Train on 25 authors from 2012to identify the author of

25 files in 2014test

96% accuracy

Dec 29, 2015

39 of 60

Page 40: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Case  3:  Coding  style  throughout  years• 98%  accuracy,  train  and  test  in  2014• 96%  accuracy,  train  on  2012,  test  on  2014

train

Train on 25 authors from 2012to identify the author of

25 files in 2014test

96% accuracy

Dec 29, 2015

40 of 60

Page 41: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Feature set: Using ‘only’ the Python equivalents of syntactic featuresApplication Programmers Instances Result

Python programmer  de-­‐anonymization 229 2,061 53.9%

Top-­‐5  relaxed  classification 229 2,061 75.7%

Python programmer  de-­‐anonymization 23 207 87.9%

Top-­‐5  relaxed  classification 23 207 99.5%

Case  4:  Generalizing  the  approach  -­‐ pythonDec 29, 2015

41 of 60

Page 42: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Results

Application Classes Instances Accuracy

Stylometric  plagiarism detection 250  class 2,250 98.0%

Large scale  de-­‐anonymization 1,600 class 14,400 93.6%

Copyright investigation Two-­‐class 1,080 100.0%

Authorship   verification Two-­‐class/One-­‐class 2,240 91.0%

Open world  problem Multi-­‐class 420 96.0%

A new principled method with a robust syntactic feature set for de-anonymizing programmers.

Dec 29, 2015

42 of 60

Page 43: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Git Blame  Who?• A  lot  of  code  is  collaborative• >  70%  accuracy  for  individual  attribution  for  a  single  git commit,  higher  for  multiple  commits/account

• For  accounts,  we  can  attribute  single  commits  and  ensemble  them  (errors  are  uncorrelated,  accuracy  quickly  reaches  close  to  100%)

43 of 66

Page 44: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Attributing  Code  FragmentsDec 29, 2015

44 of 60

Page 45: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Open  World  AttributionDec 29, 2015

45 of 60

Page 46: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

How  much  data  for  attribution?Dec 29, 2015

46 of 60

Page 47: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

How  much  code?Dec 29, 2015

47 of 60

Page 48: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Aggregated  SamplesDec 29, 2015

48 of 60

Page 49: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

What  about  executable  binaries?Compiled  code  looks  cryptic

00100000   00000000  00001000  00000000  00101000   00000000      00000000   00000000  00110100  00000000  00000000   00000000    00000100   00001000  00000000  00000001  00000000   00000000    00000000   00000001  00000000  00000000  00000101   00000000    00000000   00000000  00000100  00000000  00000000   00000000    00000011   00000000  00000000  00000000  00110100   00000001    00000000   00000000  00110100  10000001  00000100   00001000    00000000   00000000  00010011  00000000  00000000   00000000    00000100   00000000  00000000  00000000  00000001   00000000    00000000   00000000  00000001  00000000  00000000   00000000    00000000   00000000  00000000  00000000  00000000   10000000    00000100   00001000  00000000  10000000  00000100   00001000    11001000   00010111  00000000  00000000  11001000   00010111    00000000   00000000  00000101  00000000  00000000   00000000    00000000   00010000  00000000  00000000  00000001   00000000    00000000   00000000  11001000  00010111  00000000   00000000    11001000   10100111  00000100  00001000  11001000   10100111    00000100   00001000  00101100  00000001  00000000   00000000    00000000   00000000  00000000  00010000  00000000   00000000    00000010  00000000   00000000  00000000  11011100  00010111    

Source  Code#include   <cstdio>#include   <algorithm>using  namespace  std;#define For(i,a,b)   for(int i  =  a;  i  <  b;  i++)#define FOR(i,a,b)  for(int i  =  b-­‐1;  i  >=  a;  i-­‐-­‐)double   nextDouble()   {

double   x;scanf("%lf",  &x);return x;}

int nextInt()   {int x;scanf("%d",  &x);return   x;  }

int n;double   a1[1001],  a2[1001];int main()  {

freopen("D-­‐small-­‐attempt0.in",   "r",  stdin);freopen("D-­‐small.out",   "w",  stdout);int tt  =  nextInt();For(t,1,tt+1)   {

int n  =  nextInt(); . . . . . .

Dec 29, 2015

49 of 60

Page 50: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Can  we  identify  the  author  of  this  binary?00100000  00000000   00001000  00000000  00101000  00000000      00000000  00000000  00110100   00000000  00000000  00000000    00000100  00001000  00000000   00000001  00000000  00000000    00000000  00000001  00000000   00000000  00000101  00000000    00000000  00000000  00000100   00000000  00000000  00000000    00000011  00000000  00000000   00000000  00110100  00000001    00000000  00000000  00110100   10000001  00000100  00001000    00000000  00000000  00010011   00000000  00000000  00000000    00000100  00000000  00000000   00000000  00000001  00000000    00000000  00000000  00000001   00000000  00000000  00000000    00000000  00000000  00000000   00000000  00000000  10000000    00000100  00001000  00000000   10000000  00000100  00001000    11001000  00010111  00000000   00000000  11001000  00010111    00000000  00000000  00000101   00000000  00000000  00000000    00000000  00010000  00000000   00000000  00000001  00000000    00000000  00000000  11001000   00010111  00000000  00000000    11001000  10100111  00000100   00001000  11001000  10100111    00000100  00001000  00101100   00000001  00000000  00000000    00000000  00000000  00000000   00010000  00000000  00000000    00000010  00000000  00000000   00000000  11011100  00010111    

.  .  .

Dec 29, 2015

50 of 60

Page 51: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

WHEN  CODING  STYLE  SURVIVES  COMPILATION:DE-­‐ANONYMIZING  PROGRAMMERS  FROM  

EXECUTABLE  BINARIES

When Coding Style Survives Compilation: De-anonymizing Programmers from Executable BinariesAylin Caliskan-Islam, Fabian Yamaguchi, Edwin Dauber, Richard Harang, Konrad Rieck, Rachel Greenstadt, and Arvind Narayanan.Under Submission, 2016 available on arXiv

Page 52: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Finding  the  author  of  an  executable  binary?

• Coding  style  in  compiled  code

• Threat  to  privacy  and  anonymity

• Malware  classification?

Dec 29, 2015

52 of 60

Page 53: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Related  workDec 29, 2015

53 of 60

Page 54: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Our  workflow

54 of 60

Page 55: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

FeaturesDec 29, 2015

55 of 60

Page 56: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Feature  set  from  100  programmersDec 29, 2015

56 of 60

Page 57: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Large  Scale  Programmer  DeanonymizationDec 29, 2015

57 of 60

Page 58: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Dec 29, 2015

58 of 60

Page 59: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Compiler  Optimization

The  drop  in  accuracy  is  not  tragic!  

Dec 29, 2015

59 of 60

Page 60: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Reconstructing  original  features

• Original  vs predicted  features– Average  cos distance:  0.81

• Original  vs decompiled  features– Average  cos distance:  0.35

Dec 29, 2015

60 of 60

Page 61: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Reconstructing  original  features• Original  vs predicted  features– Average  cos distance:  0.81

• This  suggests  that  original  features  are  transformed  but  not  entirely  lost  in  compilation.

Dec 29, 2015

61 of 60

Page 62: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Insights

More  advanced  programmers  are  easier  to  de-­‐anonymize

Dec 29, 2015

62 of 60

Page 63: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Real  World  Binary  Attribution• Code  repositories  from  GitHub– 65%  accuracy  (less  code  to  train  on)

• Binaries  mined  from  leaked  Nulled.io forum– 4  users,  3  with  enough  data  to  train  a  model.  3  correctly  identified,  4th identified  as  not  one  of  the  other  3.

63 of 66

Page 64: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Future  work• Anonymizing  executable  binaries– optimizations  is  not  the  answer

• De-­‐anonymizing  collaborative  binaries• Malware  family  classification

Dec 29, 2015

64 of 60

Page 65: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

Available  tools• Programmer  de-­‐anonymization– https://github.com/calaylin

• Jstylo– prose  authorship  attribution  framework

• Anonymouth– writing  anonymization

Dec 29, 2015

65 of 60

Page 66: Dec 29, 2015 Stylistic(fingerprintssummerschool-croatia.cs.ru.nl/2016/slides/RachelGreenstadt.pdf · Stylistic(fingerprints • Stylometry(has(been(applied(to: – Fine7art – Music

THANKS CO-AUTHORS Edwin Dauber and Dr. Richard Harang Dr. Konrad Rieck Dr. Arvind Narayanan

Dr. Clare Voss Dr. Fabian Yamaguchi Dr. Aylin Caliskan-Islam