Page 1: Graph Mining for Vulnerability Discovery (seminaire-dga.gforge.inria.fr/2015/20151009_KonradRieck.pdf)

GEORG-AUGUST-UNIVERSITÄT GÖTTINGEN

Graph Mining for Vulnerability Discovery

Konrad Rieck

Computer Security Group

University of Göttingen

University of Göttingen

Göttingen: 117,000 citizens, 31% students

Rennes: 201,000 citizens, 31% students

About me

› Konrad Rieck

› Fun with security and machine learning for 10 years

› Research group at the University of Göttingen

› Research focus: intelligent security systems

› Intrusion detection, malware analysis, vulnerability discovery

Vulnerabilities in Software

› Vulnerabilities — A root cause for security breaches

› 02/2014: Security flaw in Apple's TLS/SSL code ("Goto Fail"): all Apple devices vulnerable to MITM attacks

› 04/2014: Security flaw in OpenSSL library ("Heartbleed"): memory readable on millions of Internet servers

› 09/2014: Security flaw in Unix shell Bash ("Shellshock"): millions of web services vulnerable via CGI

› 07/2015: Security flaw in Android platform ("Stagefright"): remote code execution on majority of devices

Finding Vulnerabilities

› Discovery of vulnerabilities far from trivial

› Some low-hanging fruit (strcat, strcpy, sprintf, …)

› More often subtle errors in programming

› Current strategies for discovery of vulnerabilities in code

› Testing and fuzzing of implementations

› Taint analysis and symbolic execution

› ... still many bugs only discovered by manual auditing

↯ Fully automated discovery impossible (Rice’s theorem)

[Figure labels: Researchers; Vulnerabilities in code; Our focus]

Our Research Focus

› Methods to make vulnerability discovery more effective

› No tools for monkeys: supporting, not replacing the analyst

› Address different scenarios encountered during analysis

› If the method is not practical, we do not care about it

› Main concept: Assisted discovery of vulnerabilities

› Analysis supported by data mining and machine learning

› Augment view of analyst and help her save time

› Suggest interesting code and guide auditing

Our Approaches

› Vulnerability Extrapolation (ACSAC 2012)

› Finding code similar to a known vulnerability

› Missing-Check Detection (CCS 2013)

› Discovery of missing and faulty security checks

› Code Property Graphs (IEEE S&P 2014)

› Mining for vulnerabilities using graph databases

› Taint-Style Vulnerabilities (IEEE S&P 2015)

› Detection patterns for common vulnerabilities


This talk

Code Property Graphs

Modeling and Discovering Vulnerabilities with Code Property Graphs.

Fabian Yamaguchi, Nico Golde, Daniel Arp, and Konrad Rieck. 35th IEEE Symposium on Security & Privacy, 2014

Vulnerability in LibSSH2

[Fig. 4: Code property graph for the code sample given in Figure 1. Edge types: AST edge, CFG edge, PDG edge.]

shown. The nodes of the graph mainly match the AST in Figure 2a (except for the irrelevant FUNC and IF node), while the transformed CFG and PDG are indicated by colored edges.

V. TRAVERSALS FOR WELL-KNOWN TYPES OF VULNERABILITIES

The code property graph allows many different kinds of programming patterns to be expressed, however, it is not immediately clear how it can be employed to discover vulnerabilities. In this section, we show that code property graphs can be effectively mined to identify many different types of security flaws and develop templates for the description of vulnerabilities. We begin by exploring the limitations of purely syntactic descriptions of code in Section V-B and proceed to show that additional control flow information only provides a slight improvement (Section V-C). Finally, in Section V-D, data flow, control flow and syntactical information are combined, thus making a large variety of vulnerabilities accessible.

A. Motivational Example

We begin with a recent example of a buffer overflow found in an SSH implementation by Esser [7] exposing many Apple iOS applications to attack. Esser employed a regular expression to spot the vulnerable code shown in Figure 5.

     1  [...]
     2  if (channelp) {
     3      /* set signal name (without SIG prefix) */
     4      uint32_t namelen =
     5          _libssh2_ntohu32(data + 9 + sizeof("exit-signal"));
     6      channelp->exit_signal =
     7          LIBSSH2_ALLOC(session, namelen + 1);
     8      [...]
     9      memcpy(channelp->exit_signal,
    10             data + 13 + sizeof("exit_signal"), namelen);
    11      channelp->exit_signal[namelen] = '\0';
    12      [...]
    13  }
    14  [...]

Fig. 5: Excerpt from the code of libssh2 showing a vulnerability in the function libssh2_packet_add

The vulnerable statement (marked in red) allocates memory for the buffer exit_signal using the function LIBSSH2_ALLOC on line 6. The amount of memory to allocate is calculated directly in the argument by adding 1 to the variable namelen. Unfortunately, this variable is attacker-controlled and thus if it is chosen to be the maximum size of a 32-bit unsigned integer, the summation wraps and a value of 0 is passed to the allocation function resulting in the allocation of only a few padding bytes. When namelen bytes are then copied into the undersized buffer on line 9, a buffer overflow occurs.
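The wrap-around described above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python that emulates 32-bit unsigned addition with a mask; the helper name add_u32 is hypothetical and not part of libssh2.

```python
UINT32_MAX = 0xFFFFFFFF  # largest value of a 32-bit unsigned integer


def add_u32(a: int, b: int) -> int:
    """Emulate C's uint32_t addition, which wraps modulo 2**32."""
    return (a + b) & UINT32_MAX


# The attacker chooses the maximum 32-bit value for namelen.
namelen = UINT32_MAX
alloc_size = add_u32(namelen, 1)  # size that reaches the allocation function
copy_size = namelen               # size that later reaches memcpy

print(alloc_size)  # 0
print(copy_size)   # 4294967295
```

The allocation is sized from the wrapped sum, while the copy still uses the original value of namelen, which is the mismatch that produces the overflow.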

Esser was able to discover the vulnerable statement on line 6 using the following regular expression:

ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;

Unfortunately, the regular expression only describes the summation inside the allocation call, one of a number of necessary conditions for the vulnerability. Moreover, the description is inherently vague as regular expressions cannot match the nested structure of code. However, the biggest drawback of the formulation is that the description fails to model attacker control over the variable namelen. Furthermore, the vulnerability would not exist if the variable had been properly sanitized. Finally, the width of the variable is vital for the vulnerability.
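To see what the expression does and does not capture, it can be tried on a few statements. This is a small sketch using Python's re module; only the first sample comes from Figure 5, the other three statements are made up for illustration.

```python
import re

# Esser's regular expression as quoted in the text.
PATTERN = re.compile(r"ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;")

samples = {
    "vulnerable": "LIBSSH2_ALLOC(session, namelen + 1);",      # from Figure 5
    "no_math": "LIBSSH2_ALLOC(session, namelen);",             # hypothetical
    "member_access": "LIBSSH2_ALLOC(session, packet->len);",   # hypothetical
    "constant_only": "LIBSSH2_ALLOC(session, 16 + 1);",        # hypothetical
}

# True where the expression finds a match in the statement.
hits = {name: bool(PATTERN.search(code)) for name, code in samples.items()}
for name, hit in hits.items():
    print(f"{name}: {hit}")
```

The last sample only adds two constants and is harmless, yet it is reported as well, because the expression sees arithmetic in the argument but knows nothing about where the operands come from.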

This simple example gives insight into the different properties of code that play a role in the characterization of vulnerability patterns. In summary, the following aspects need to be covered.

1) Sensitive operations. Security sensitive operations such as calls to protected functionality, copying into buffers or the allocation of memory need to be describable. As the example shows, nested code such as arithmetic operations inside allocations are of great interest, and thus full access to an AST is necessary.

2) Type usage. Many vulnerabilities are tightly bound to data types used in a program. For example, the vulnerability shown in Figure 5 would not exist if namelen was a 16 bit integer as opposed to a 32 bit integer. This information is present in the AST.

3) Attacker control. Analysts must be able to express which data sources are under attacker control. Referring to the example, it is highly likely that variables returned by _libssh2_ntohu32 are attacker-controlled as the routine converts an integer from network to host byte order and hence the integer is almost certainly received from

Discovered by Stefan Esser (SyScan’13)


Attacker-controlled data in unsigned 32-bit integer


Addition in argument


Heap overflow!

Vulnerability Discovery in Practice

› Q: How did Stefan Esser find the bug?

› Black-box and white-box fuzzing? Nope.

› Dynamic taint tracking and symbolic execution? Nope.

› Theorem proving? Model checking? Nope.

› A: A regular expression for grep!

ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;


Allocation function

Math in argument

Vulnerability Discovery in Practice

› Expert knowledge key to discovery of vulnerabilities

› Not necessarily a need for specialized auditing tools

› Simple search tools often sufficient, e.g. grep

› Encoding of expert knowledge as search query

› Our example

› Knowledge: “Math risky in allocation functions”

› Query: ALLOC[A-Z0-9_]*\s*\([^,]*,[^;]*[*+-][^>][^;]*\)\s*;

A Generic Framework

› Framework for searching vulnerabilities

› Design of a code database for vulnerability discovery

› Modeling vulnerabilities with queries for the system

› Link between automatic analysis and expert knowledge

› Vision of my PhD student Fabian Yamaguchi…

[Figure: source code passes through analysis steps (parsing, control flow, data flow) into a database, which the expert queries]

What do we need?

› Comprehensive view on program code

› Syntactical analysis

› What does the program code look like?

› Control-flow analysis

› How is the program code executed?

› Data-flow analysis

› How is data processed by the code?

Abstract Syntax Trees (AST)

› Representation of syntax and language constructs

› Nodes: Statements, declarations, calls, operators, …

› Edges: nesting of language constructs

Source code:

    void foo()
    {
        int x = source();
        if (x < MAX)
        {
            int y = 2 * x;
            sink(y);
        }
    }

[Figure: AST of foo() with inner nodes FUNC, DECL, IF, PRED (also labeled COND), STMT, CALL and ARG, and leaves for types, operators, identifiers and constants]
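As a rough illustration of the nesting an AST records, the tree for foo() can be written down by hand and walked. This is a simplified sketch in Python: the Node class and the walk helper are hypothetical, and the tree omits the operator and identifier leaves shown on the slide.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """AST node: a type label, optional code text, and nested children."""
    type: str
    code: str = ""
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all nodes nested below it (pre-order)."""
    yield node
    for child in node.children:
        yield from walk(child)


# Hand-built, simplified AST for foo() from the slide.
ast = Node("FUNC", "foo", [
    Node("DECL", "int x = source()", [
        Node("CALL", "source()"),
    ]),
    Node("IF", "", [
        Node("PRED", "x < MAX"),
        Node("STMT", "", [
            Node("DECL", "int y = 2 * x"),
            Node("CALL", "sink(y)", [Node("ARG", "y")]),
        ]),
    ]),
])

# Collect the code of every call, wherever it is nested.
calls = [n.code for n in walk(ast) if n.type == "CALL"]
print(calls)  # ['source()', 'sink(y)']
```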

Control-Flow Graph (CFG)

› Representation of code logic and execution

› Nodes: statements and conditions

› Edges: conditional flow of control

› Derivation from nodes of AST

[Figure: source code of foo() and its CFG. ENTRY leads to int x = source(), then to if (x < MAX); the true edge leads to int y = 2 * x, then sink(y), then EXIT; the false edge leads directly to EXIT; unconditional edges are labeled ε]
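The CFG of foo() is small enough to write out as an adjacency list. The sketch below, in Python, stores each edge with its label and enumerates the execution paths; the dictionary layout and the paths helper are illustrative choices, with "eps" standing in for the label ε.

```python
# CFG of foo(): each node maps to a list of (successor, edge label) pairs.
CFG = {
    "ENTRY": [("int x = source()", "eps")],
    "int x = source()": [("if (x < MAX)", "eps")],
    "if (x < MAX)": [("int y = 2 * x", "true"), ("EXIT", "false")],
    "int y = 2 * x": [("sink(y)", "eps")],
    "sink(y)": [("EXIT", "eps")],
    "EXIT": [],
}


def paths(cfg, start, goal, prefix=()):
    """Enumerate all acyclic paths from start to goal as tuples of nodes."""
    prefix = prefix + (start,)
    if start == goal:
        yield prefix
        return
    for succ, _label in cfg[start]:
        if succ not in prefix:  # never revisit a node on the same path
            yield from paths(cfg, succ, goal, prefix)


all_paths = list(paths(CFG, "ENTRY", "EXIT"))
print(len(all_paths))  # 2
```

One path runs through the true branch and reaches sink(y), the other takes the false edge straight to EXIT.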

Program Dependence Graph (PDG)

› Representation of data flow and dependencies

› Nodes: statements and conditions

› Edges: Control (C) and data (D) dependencies

› Derivation from nodes and edges of CFG

[Figure: source code of foo() and its PDG. Data-dependence edges Dx run from int x = source() to if (x < MAX) and to int y = 2 * x; a data-dependence edge Dy runs from int y = 2 * x to sink(y); control-dependence edges Ctrue run from if (x < MAX) to int y = 2 * x and to sink(y)]
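The data-dependence edges of the PDG can be derived from the CFG with a reaching-definitions pass. The following Python sketch covers only data dependencies (not control dependencies); the per-statement defs and uses sets are written by hand, and the function name data_dependencies is hypothetical.

```python
# Statements of foo() with CFG successors, defined and used variables.
STMTS = {
    "int x = source()": {"succ": ["if (x < MAX)"], "defs": {"x"}, "uses": set()},
    "if (x < MAX)": {"succ": ["int y = 2 * x"], "defs": set(), "uses": {"x"}},
    "int y = 2 * x": {"succ": ["sink(y)"], "defs": {"y"}, "uses": {"x"}},
    "sink(y)": {"succ": [], "defs": set(), "uses": {"y"}},
}


def data_dependencies(stmts, entry):
    """Return edges (definer, user, variable) computed via reaching definitions."""
    reach_in = {name: set() for name in stmts}  # (variable, definer) pairs
    visited = set()
    worklist = [entry]
    while worklist:
        name = worklist.pop()
        visited.add(name)
        info = stmts[name]
        # Definitions leaving this statement: drop redefined variables, add own.
        out = {(v, d) for (v, d) in reach_in[name] if v not in info["defs"]}
        out |= {(v, name) for v in info["defs"]}
        for succ in info["succ"]:
            if succ not in visited or not out <= reach_in[succ]:
                reach_in[succ] |= out
                worklist.append(succ)
    edges = set()
    for name, info in stmts.items():
        for var, definer in reach_in[name]:
            if var in info["uses"]:
                edges.add((definer, name, var))
    return edges


edges = data_dependencies(STMTS, "int x = source()")
for definer, user, var in sorted(edges):
    print(f"{definer} --D{var}--> {user}")
```

The three resulting edges correspond to the two Dx edges and the one Dy edge on the slide.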

A Combined Representation

› Vulnerabilities often reflected in all representations

› “… find call X reachable from Y processing Z …”

› Idea: Merge AST/CFG/PDG in a combined representation

› Graph structure with shared nodes and edges

› Possibility to jump back and forth between views

› Basis for powerful search queries


Subtree in the AST

Path in the PDG

Path in the CFG and PDG

Code Property Graph

› Meet the Code Property Graph

› Nodes from AST & edges from AST, CFG and PDG

› Edge-labeled multi-graph with properties attached to nodes

› … or in short: property graph

[Figure: code property graph of foo(), with the AST nodes connected by AST, CFG and PDG edges]


Example: Call to sink reachable if x < MAX processing x
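The example query touches all three views at once, which a single edge-labeled graph makes straightforward. Below is a minimal Python sketch at statement granularity; the node names, the label scheme ("AST", "CFG:...", "C:...", "D:...") and both helper functions are assumptions made for illustration, not the format used by the authors' tooling.

```python
# Nodes carry properties; edges are (source, label, target) triples.
NODES = {
    "decl_x": {"type": "DECL", "code": "int x = source()"},
    "pred":   {"type": "PRED", "code": "x < MAX"},
    "decl_y": {"type": "DECL", "code": "int y = 2 * x"},
    "call":   {"type": "CALL", "code": "sink(y)"},
    "arg":    {"type": "ARG",  "code": "y"},
}
EDGES = [
    ("call", "AST", "arg"),
    ("decl_x", "CFG:eps", "pred"),
    ("pred", "CFG:true", "decl_y"),
    ("decl_y", "CFG:eps", "call"),
    ("decl_x", "D:x", "pred"),
    ("decl_x", "D:x", "decl_y"),
    ("decl_y", "D:y", "call"),
    ("pred", "C:true", "decl_y"),
    ("pred", "C:true", "call"),
]


def preds(node, prefix):
    """Nodes with an edge into `node` whose label starts with `prefix`."""
    return [s for (s, label, t) in EDGES if t == node and label.startswith(prefix)]


def data_sources(node):
    """All nodes from which data flows into `node` along D edges, transitively."""
    seen, stack = set(), [node]
    while stack:
        for src in preds(stack.pop(), "D:"):
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen


# Query: calls to sink, guarded by "x < MAX", that process data from source().
results = []
for name, props in NODES.items():
    if props["type"] != "CALL" or not props["code"].startswith("sink"):
        continue  # syntactic condition on the node
    guards = [NODES[g]["code"] for g in preds(name, "C:true")]  # control dependence
    origins = [NODES[s]["code"] for s in data_sources(name)]    # data dependence
    if "x < MAX" in guards and any("source()" in o for o in origins):
        results.append(props["code"])

print(results)  # ['sink(y)']
```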

Queries for Graphs

› Queries modelled as traversals in code property graph

› Inspiration from modern graph databases

› Traversals = Walk from one set of nodes to another

› Walk based on edge labels and node properties

› Implementation with Gremlin (and Cypher)

› Query languages for graph databases

› Supported by common databases, such as Neo4J

Chaining Traversals

› Chaining of traversals by function composition

› Input and output domain identical

› Complex queries based on chain of simple traversals

› Idea: Construct utility traversals for vulnerability discovery

› Modelling of common analysis steps

› Example: A match traversal

› All AST nodes below nodes X satisfying predicate p

$\mathrm{Match}_p(X) = \mathrm{Filter}_p \circ \mathrm{TNodes}(X)$

where $\mathrm{TNodes}(X)$ yields all lower nodes in the tree and $p$ is the predicate to match.
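The match traversal can be sketched in Python as plain function composition. The AST below is a hand-built stand-in for `malloc(len + 1)`; the node ids, the type names and the choice to include the root nodes themselves in `tnodes` are assumptions made for this example.

```python
from functools import reduce

# Hypothetical AST for the call malloc(len + 1).
AST_CHILDREN = {
    "call": ["callee", "args"],
    "args": ["arg0"],
    "arg0": ["add"],
    "add": ["len", "one"],
}
NODE_TYPE = {
    "call": "CallExpression",
    "callee": "Identifier",
    "args": "ArgumentList",
    "arg0": "Argument",
    "add": "AdditiveExpression",
    "len": "Identifier",
    "one": "PrimaryExpression",
}


def compose(*steps):
    """Right-to-left composition: compose(f, g)(X) == f(g(X))."""
    return lambda nodes: reduce(
        lambda acc, step: step(acc), reversed(steps), nodes)


def tnodes(nodes):
    """All nodes in the AST subtrees rooted at `nodes` (roots included)."""
    seen, stack = set(), list(nodes)
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(AST_CHILDREN.get(node, ()))
    return seen


def filter_by(pred):
    """Traversal that keeps only the nodes satisfying `pred`."""
    return lambda nodes: {n for n in nodes if pred(n)}


def match(pred):
    """Match_p = Filter_p ∘ TNodes."""
    return compose(filter_by(pred), tnodes)


def is_additive(node):
    return NODE_TYPE[node] == "AdditiveExpression"


print(match(is_additive)({"call"}))
```

Input and output are both node sets, which is the property that makes the composition well defined.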


‹Example: LibSSH2 Bug

› Implementation as chain of traversals

› Extract all function calls & find all calls to malloc

› Extract all arguments & find all first arguments

› Check for math operations in first argument

› Gremlin code for querying the code property graph

getArguments('malloc', '0')
  .astNodes()
  .filter{ it.type == 'AdditiveExpression' }


‹Example: Overflows in Linux Kernel

› Query for hunting overflows in write handlers

› Find all calls to memcpy with argument called count

› Find all unsanitized data flows from variable count

› Gremlin code for this traversal

getFunctionASTsByName('*_write*')
  .getArguments('memcpy', '2')                          // Sink: memcpy in write functions
  .unsanitized({ it._().or(                             // Check for sanitization
      _().isCheck('.*' + paramName + '.*'),
      _().codeContains('.*alloc.*' + paramName + '.*'),
      _().codeContains('.*min.*')
  )})
  .param('.*c(ou)?nt.*')                                // Source: variable from userspace
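The core of the sanitization check in the query above can be sketched in Python as a reachability test on the control flow graph: is there a path from the source to the sink that passes through no sanitizing statement? This is a simplified stand-in, not Joern's implementation of `unsanitized`; the function name, the two toy control flow graphs and the regular expression for spotting a check are all assumptions.

```python
import re


def unsanitized_path_exists(cfg, source, sink, is_sanitizer):
    """True if some CFG path from `source` reaches `sink` without
    passing through a node for which `is_sanitizer` holds."""
    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        for succ in cfg.get(node, ()):
            if succ == sink:
                return True
            if succ in seen or is_sanitizer(succ):
                continue  # already explored, or path is sanitized here
            seen.add(succ)
            stack.append(succ)
    return False


# Statement text per CFG node (hypothetical write handler).
CODE = {
    "param": "size_t count",
    "flag": "if (fast)",
    "check": "if (count > sizeof(buf))",
    "copy": "memcpy(buf, data, count)",
}
CHECK = re.compile(r"if \(.*count.*[<>]")


def is_check(node):
    return CHECK.search(CODE[node]) is not None


# Every path to the copy passes through the check.
GUARDED = {"param": ["check"], "check": ["copy"]}
# One branch skips the check entirely.
BYPASS = {"param": ["flag"], "flag": ["check", "copy"], "check": ["copy"]}

print(unsanitized_path_exists(GUARDED, "param", "copy", is_check))
print(unsanitized_path_exists(BYPASS, "param", "copy", is_check))
```

The sketch only asks whether a check is present on the path, not whether the check is correct, so it trades some false negatives for fewer false positives.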


‹Evaluation

› Are code property graphs and traversals really helpful?

› Evaluation with security expert from industry

› Nico Golde from Qualcomm

› Audit of internal and Linux kernel code

› Analysis of past Linux vulnerabilities

› Joint design of five traversals for common flaws

› Types: buffer overflows (2), zero-byte allocation (1), memory mapping bugs (1), memory disclosure (1)


‹Coverage of Vulnerabilities

› Analysis of all vulnerabilities in the Linux kernel in 2012

› 10 out of 12 types covered by code property graphs

… we present an evaluation of our approach, showing that the types of vulnerabilities covered are indeed relevant for today’s security-critical code.

VI. EVALUATION

We proceed to evaluate the practical efficacy of our approach on the source code of the Linux kernel, a large code base that is regularly audited for vulnerabilities by several software vendors and the open-source community. Our evaluation is carried out in two steps: first, we conduct a coverage analysis by reviewing the code of all vulnerabilities reported for the Linux kernel in 2012 and determining which vulnerability types can be modeled using graph traversals (Section VI-B). Second, we study the ability of our approach to discover vulnerabilities by constructing traversals for prevalent vulnerabilities and applying them to the code property graph of the Linux kernel (Section VI-C).

A. Implementation

For our evaluation we implement a static code analysis system based on the idea of code property graphs [4]. Our system employs a robust C/C++ parser to first extract ASTs for each function in a given code base. We then transform these ASTs into CFGs and PDGs and merge all three representations to a code property graph as outlined in Section IV. Additionally, we introduce nodes for global variables and structure declarations in the code. We finally link together the graphs of all functions based on visible caller-callee relationships, thus representing the entire code base as one large code property graph.

For the source code of the Linux kernel, we obtain a graph with 52 million nodes and 87 million edges. Obviously, mining information in such a large graph on commodity hardware is far from being a trivial task. Fortunately, we can make use of specialized graph databases that are capable of providing efficient access to very large property graphs (we employ Neo4J Version 1.9.5). Moreover, these graph databases allow us to benefit from sophisticated caching algorithms that accelerate traversals over the graph.

Using a prototype implementation, importing the Linux kernel version 3.10-rc1 with approximately 1.3 million lines of code takes a total of 110 minutes on a laptop computer with a 2.5 GHz Intel Core i5 CPU and 8 GB of main memory. The resulting database requires 14 GB of disk space for nodes and edges as well as another 14 GB for efficient indexing.

For implementing graph traversals, we find Gremlin to be a well suited graph language, as it allows user-defined traversals to be chained and provided to the database, thus implementing a mechanism similar to stored procedures in SQL databases. This enables us to convert the different traversals presented in Section V directly to Gremlin code. Furthermore, Gremlin is one of the few languages interfacing with databases over the compatibility layer Blueprints, thereby enabling us to run all crafted traversals against other graph database implementations without modification.

[4] http://mlsec.org/joern/

Running the traversals for vulnerability discovery presented in this paper takes under 40 seconds on a cold database, i.e., when database contents needs to be read from hard disk. Once nodes and edges are cached in main memory, execution time reduces to 30 seconds where the vast majority of time is spent to determine viable control flow paths in large functions using the traversal UNSANITIZED.

B. Coverage Analysis

We begin our analysis by querying the central vulnerability database maintained by the MITRE organization for all CVE identifiers allocated to vulnerabilities in the Linux kernel in 2012. In total we retrieve 69 identifiers addressing 88 unique vulnerabilities in the source code of the kernel. To categorize these vulnerabilities into different types we manually inspect the patches for each of the vulnerabilities and determine the root-cause of the reported flaw. With this information we are able to assign the 88 vulnerabilities to 12 common types as shown in Table I. More than half of the vulnerabilities (47 out of 88) are either memory disclosures, buffer overflows or resource leaks, all of which can be expressed well using graph traversals as discussed in Section V.

To assess the coverage of our approach, we analyze which code representations are necessary to describe the 12 vulnerability types discovered in the Linux kernel. In particular, we analyze the coverage of (a) an AST alone, (b) the combination of an AST and PDG, (c) the combination of an AST and CFG, and (d) the combination of an AST, PDG and CFG. The results of this analysis are presented in Table II.

Vulnerability type            AST   AST+PDG   AST+CFG   AST+CFG+PDG
Memory Disclosure                                       X
Buffer Overflow                     (X)                 X
Resource Leaks                                X         X
Design Errors
Null Pointer Dereference                                X
Missing Permission Checks           X                   X
Race Conditions
Integer Overflows                                       X
Division by Zero                    X                   X
Use After Free                                (X)       (X)
Integer Type Issues                                     X
Insecure Arguments            X     X         X         X

TABLE II: Coverage of different code representations for modeling vulnerability types.

Obviously, the AST alone provides only little information for spotting security flaws and thus only some forms of insecure arguments, e.g. incorrect type casts, can be discovered using this representation. By combining the information from the AST and PDG, we obtain a better view of the code and can describe different classes of buffer overflows, missing permission checks and divisions by zero. However, the combination of an AST and PDG is of limited use in cases where the order of statements matters, for example, when the location of a security check needs to be determined. The combination of an AST and CFG also misses some vulnerabilities, since most …

Hard to model

Depends on runtime


‹Anything New?

18 unknown vulnerabilities in the Linux kernel

Type                  Location                                  Developer Feedback  Identifier
Buffer Overflow       arch/um/kernel/exitcode.c                 Fixed               CVE-2013-4512
Buffer Overflow       drivers/staging/ozwpan/ozcdev.c           Fixed               CVE-2013-4513
Buffer Overflow       drivers/s390/net/qeth_core_main.c         Fixed               CVE-2013-6381
Buffer Overflow       drivers/staging/wlags49_h2/wl_priv.c      Fixed               CVE-2013-4514
Buffer Overflow       drivers/scsi/megaraid/megaraid_mm.c       Fixed               -
Buffer Overflow       drivers/infiniband/hw/ipath/ipath_diag.c  Fixed               -
Buffer Overflow       drivers/infiniband/hw/qib/qib_diag.c      Fixed               -
Memory Disclosure     drivers/staging/bcm/Bcmchar.c             Fixed               CVE-2013-4515
Memory Disclosure     drivers/staging/sb105x/sb_pci_mp.c        Fixed               CVE-2013-4516
Memory Mapping        drivers/video/au1200fb.c                  Fixed               CVE-2013-4511
Memory Mapping        drivers/video/au1100fb.c                  Fixed               CVE-2013-4511
Memory Mapping        drivers/uio/uio.c                         Fixed               CVE-2013-4511
Memory Mapping        drivers/staging/.../drv_interface.c       Fixed               -
Memory Mapping        drivers/gpu/drm/i810/i810_dma.c           Fix underway        -
Zero-byte Allocation  fs/xfs/xfs_ioctl.c                        Fixed               CVE-2013-6382
Zero-byte Allocation  fs/xfs/xfs_ioctl32.c                      Fixed               CVE-2013-6382
Zero-byte Allocation  drivers/net/wireless/libertas/debugfs.c   Fixed               CVE-2013-6378
Zero-byte Allocation  drivers/scsi/aacraid/commctrl.c           Fixed               CVE-2013-6380

TABLE III: Zero-day vulnerabilities discovered using our four graph traversals

… suitable traversals $T_0$ for attacker controlled sources, $T_1^s$ for sanitizers and finally, $T_2$ for security sensitive sinks. We begin with the sources controlled by an attacker, which are specific to the application under examination. For the Linux kernel, the following two prominent sources of potentially harmful input are considered.

• User/kernel space interfaces. Data can be copied from user to kernel space using a number of different API functions. As an example, we consider the copy_from_user function which taints its first argument with data controlled by an attacker. This can be captured using the traversal $T_0^0 = \mathrm{ARG}^1_{\texttt{copy\_from\_user}}$.

• Parameters of system call handlers. Attackers have direct control over the parameters of system call handlers by invoking the corresponding system call. We consider count-parameters of write system calls as an example and use the traversal $T_0^1 = \mathrm{FUNC}_{\texttt{\_write}} \circ \mathrm{PARAM}_{\texttt{cnt}}$, where $\mathrm{PARAM}_p$ and $\mathrm{FUNC}_f$ are non-empty for parameters named $p$ and nodes in functions with names containing the sub-string $f$, respectively.

As data sinks, we consider the length fields (i.e., third arguments) passed to copy_from_user and memcpy calls as we are interested in identifying cases where an attacker controls the amount of data copied into a buffer. In the system call handler case, we further restrict the analysis by analyzing only calls to copy_from_user to ensure that not only the length field but also the copied data is under the attacker’s control. The sink traversals are thus given by $T_2^0 = \mathrm{ARG}^3_{\texttt{memcpy}}$ and $T_2^1 = \mathrm{ARG}^3_{\texttt{copy\_from\_user}}$.

Finally, we reduce the number of false positives by assuming that a length field is properly sanitized if at least one of the following conditions is met.

• Dynamic allocation of the destination buffer. The destination buffer is dynamically allocated using the length field to specify the size of the buffer and thus the buffer is large enough to hold the data.

• Relational expressions. The length field is used in a relational expression inside a condition, e.g., x < buffer_size, or in a call to the macro min. Note that such checks may be incorrect and thus this rule is a practical example of a trade-off between false-positives and false-negatives.

Hence, we define the sanitizer traversal $T_1^s$ to be $T_1^s = \mathrm{OR}(\mathcal{V}_0^s, \mathcal{V}_1^s)$ where $\mathcal{V}_0^s$ is a match traversal matching allocations where the first argument contains $s$ and $\mathcal{V}_1^s$ is a match traversal matching relational expressions and calls to min containing $s$. The final traversal is then given by

$$\mathrm{OR}\left( T_0^0 \circ \mathrm{UNSANITIZED}_{T_1^s} \circ \mathrm{OR}(T_2^0, T_2^1),\; T_0^1 \circ \mathrm{UNSANITIZED}_{T_1^s} \circ T_2^1 \right)$$
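The two sanitizer conditions and the OR combinator can be approximated in a few lines of Python that work on statement text alone. This is only a rough sketch under strong assumptions: statements are plain strings rather than AST nodes, and the regular expressions are hand-written heuristics that do not reproduce the match traversals of the paper. The names `or_`, `code_matches` and `sanitizer` are introduced here for illustration.

```python
import re


def or_(*traversals):
    """OR combinator: union of the results of all given traversals."""
    return lambda nodes: set().union(*(t(nodes) for t in traversals))


def code_matches(pattern):
    """Traversal keeping the statements whose text matches `pattern`."""
    rx = re.compile(pattern)
    return lambda nodes: {n for n in nodes if rx.search(n)}


def sanitizer(symbol):
    """Heuristic stand-in for OR(V0, V1) for the length field `symbol`."""
    s = re.escape(symbol)
    # Relational operators; '>' is ignored when it is part of '->'.
    op = r"(?:<=|>=|<|(?<!-)>)"
    # V0: allocation whose first argument mentions the symbol.
    alloc = code_matches(r"alloc\w*\([^,)]*\b" + s + r"\b")
    # V1: relational expression or call to min mentioning the symbol.
    relational = code_matches(
        r"\b" + s + r"\b\s*" + op + r"|" + op + r"\s*\b" + s + r"\b")
    min_call = code_matches(r"\bmin\w*\(.*\b" + s + r"\b")
    return or_(alloc, or_(relational, min_call))


STATEMENTS = {
    "buf = kmalloc(count, GFP_KERNEL)",
    "if (count > MAX_LEN)",
    "n = min(count, size)",
    "memcpy(buf, src, count)",
    "p = kmalloc(size, count)",
}

for stmt in sorted(sanitizer("count")(STATEMENTS)):
    print(stmt)
```

Text matching of this kind is deliberately permissive, which mirrors the trade-off between false positives and false negatives noted above.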

Running this traversal on the entire Linux source code returns the eleven functions shown in Table IV. Of those eleven functions, seven are buffer overflow vulnerabilities. As an example, Figure 6 shows the vulnerable function qeth_snmp_command. On line 13, attacker controlled data is used to initialize the local variable req_len. This variable is used without performing sanitization on line 28 as a length-field of a copy-operation. An attacker can thus overflow the buffer snmp possibly allowing execution of arbitrary code.

Filename                                    Function
arch/um/kernel/exitcode.c                   exitcode_proc_write *
security/smack/smackfs.c                    smk_write_rules_list
drivers/staging/ozwpan/ozcdev.c             oz_cdev_write *
drivers/infiniband/hw/ipath/ipath_diag.c    ipath_diagpkt_write *
drivers/infiniband/hw/qib/qib_diag.c        qib_diagpkt_write *
drivers/scsi/megaraid/megaraid_mm.c         mimd_to_kioc *
drivers/scsi/megaraid.c                     megadev_ioctl
drivers/char/xilinx_.../xilinx_hwicap.c     hwicap_write
drivers/s390/net/qeth_core_main.c           qeth_snmp_command *
drivers/staging/wlags49_h2/wl_priv.c        wvlan_uil_put_info *
arch/ia64/sn/kernel/sn2/sn_hwperf.c         sn_hwperf_ioctl

TABLE IV: The 11 functions extracted from the Linux kernel using the graph traversal discussed in this section. Vulnerabilities (shaded in the original) are marked with *, following the buffer overflow entries of Table III.

… and an email from Linus Torvalds.


‹Conclusions and Outlook


‹Summary

› Finding vulnerabilities challenging and demanding

› Automatic approaches often fail due to complexity

› Idea: Assisted discovery of vulnerabilities

› Guided auditing: Suggest interesting code to analyst

› Better modeling: Design tools specifically for bug hunting

› Example: Code property graphs

› Blend of classic code analysis and graph mining

› Basis for finding 40 vulnerabilities in popular software


‹Next Steps

› Extension of code property graphs

› Automatic construction of traversals (IEEE S&P 2015)

› Incorporation of other programming languages, e.g. PHP

› Incorporation of external resources, e.g. network data

› Multi-layered view: source code ~ IR ~ binary code

› Open-source project “Joern”

› Developed by Fabian Yamaguchi and Alwin Maier

› http://www.mlsec.org/joern


‹Thanks! Questions?


‹Gremlin Examples

> g.v(1).out('knows')
==>v[2]
==>v[4]
> g.v(1).out('knows').filter{it.age < 30}
==>v[2]
> g.v(1).out('knows').filter{it.age < 30}.name
==>vadas
> g.v(1).out.loop(1){it.loops < 3}
==>v[5]
==>v[3]

Code property graph

Graph traversals
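The console session above can be mirrored in plain Python to show what each Gremlin step does. The sketch assumes that the session ran against TinkerPop's classic six-vertex toy graph, which the slide does not state; the vertex and edge data below are taken from that toy graph, and the helper `out` is a hypothetical stand-in for the Gremlin step of the same name.

```python
# TinkerPop's classic toy graph (assumed to be the one used on the slide).
VERTICES = {
    1: {"name": "marko", "age": 29},
    2: {"name": "vadas", "age": 27},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh", "age": 32},
    5: {"name": "ripple", "lang": "java"},
    6: {"name": "peter", "age": 35},
}
EDGES = [
    (1, "knows", 2), (1, "knows", 4), (1, "created", 3),
    (4, "created", 5), (4, "created", 3), (6, "created", 3),
]


def out(vertices, label=None):
    """Follow outgoing edges, optionally restricted to one edge label."""
    return [dst for v in vertices
            for (src, lbl, dst) in EDGES
            if src == v and (label is None or lbl == label)]


# g.v(1).out('knows')
friends = out([1], "knows")
# g.v(1).out('knows').filter{it.age < 30}
young = [v for v in friends if VERTICES[v]["age"] < 30]
# g.v(1).out('knows').filter{it.age < 30}.name
names = [VERTICES[v]["name"] for v in young]
# g.v(1).out.loop(1){it.loops < 3}: two hops along any outgoing edge
two_hops = out(out([1]))

print(friends, young, names, two_hops)
```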


‹Defensive vs. Offensive Security

› Defensive security: detection of attacks, analysis of attacks, prevention of attacks

› Offensive security: discovery of vulnerabilities, analysis of vulnerabilities, exploitation of vulnerabilities