[ieee 2011 international conference on recent trends in information systems (retis) - kolkata, india...

An Indexing Method for Efficient Querying of an Attack Graph

Atin Ruia, Vishal Parekh and Aveek Chakrabarti

Department of Computer Science and Engineering, Jadavpur University

Kolkata, India

{atinruia.jucse, vishal.jucse, aveek.chakrabarti}@gmail.com

Abstract − Attack graphs are a novel way of examining how safe a

network is from attacks and analysing the shortcomings of these

networks. The analysis of the attack graph may help in assessing

network security. However, an attack graph can be very large, containing up to a million nodes and a million edges. Thus, analysing such a large graph becomes problematic and time consuming. We propose an indexing scheme for fast data retrieval. Using this indexing scheme, we can identify the vulnerable machines in a network corresponding to an attack pattern.

Keywords−attack graph; graph mining; indexing; query

I. INTRODUCTION

An attack graph is a directed graph which gives a succinct representation of different attack scenarios, depicted by attack paths. An attack path is a logical succession of exploits where each exploit in the series satisfies the preconditions for subsequent exploits, establishing a causal relationship among them. Attack graphs have been proposed as a way to identify critical network weaknesses, construct adversary models, analyse the security of a network and suggest changes to improve it. They enumerate multi-stage attacks. Thus analysis of the attack graph may help in assessing network security from the hackers' perspective. Network security consists of the provisions and policies adopted by the network administrator to prevent and monitor unauthorized access, misuse, modification, or denial of the computer network and network-accessible resources. It deals with a trade-off between the degrees of accessibility and protection of the network. The aim of network security is to provide resources and information only to authorized users. However, in practice it is extremely difficult to prevent unauthorised users from accessing protected resources. An attack graph plays a key role in providing, in advance, information about possible sequences of malicious actions and attacks in a protected network.

The process of identifying patterns and extracting

knowledge from large graph databases is known as graph

mining. A typical attack graph can range from hundreds of

nodes to millions of nodes and is generally represented using

the Resource Description Framework (RDF) [1-2]. As the size

of these RDF databases increases to millions of tuples, efficient graph matching and querying methods become increasingly

important. Due to the large size of these graphs, they cannot be

completely stored in main memory. This makes the scalability

of querying and searching methods extremely problematic.

Typically, indexes can be constructed for fast access. An index

is a data structure that improves the speed of data retrieval

operations. The space required to store the index is typically

less than the amount of space required to store the actual data.

In this paper an indexing method is proposed which generates a key value for each node in the attack graph based on its label. These values are then inserted into a B-tree, an indexing structure which is highly optimized for searching.

Thus, according to our proposed method, searching for an

object involves mapping a label to its key value and then

performing a lookup in the index to find the desired object.

This method has several advantages. Firstly, it is not necessary

to know the schema of the data in advance as the indexing

depends on the data itself. The data accompanying each node

and edge of the attack graph may have a large number of

attributes associated with it. These attributes could differ for

different objects of the graph and could also be of different

types. However, our scheme requires only the label of each node or edge, which is a string, and is independent of all other

attributes. Furthermore, the index structure can handle dynamic

changes of the graph.

Certain queries are very useful in analysing large attack

graphs. It is often required to identify certain machines

corresponding to an attack pattern. These patterns could also occur multiple times in different parts of the graph. In large graphs it is not feasible to manually locate these

unknown machines. Thus, an algorithm has been proposed to

query an attack graph to identify all the possible groups of

machines compromised in a given attack pattern.

The rest of the article has been arranged as follows. Section

II describes the related work. The proposed approach has been

described in Section III with the proposed algorithm in Section

IV. Section V shows the time complexity and Section VI

concludes the article.

II. RELATED WORK

Many methods have been proposed in the literature for the generation and analysis of large attack graphs [3-6].

However, to the best of our knowledge this is the first time an

indexing and searching approach for attack graphs has been

proposed. A number of systematic approaches to statically

analyse attack graphs by means of reasoning mechanisms based on logical expressions and conditional preference


978-1-4577-0792-6/11/$26.00 ©2011 IEEE


networks have been presented in [3]. That work also provides a method to compute preventative paths that protect the network from malicious attacks and to select appropriate countermeasures based on given conditional decision preferences and relevant factors. Instead of generating the full attack graph, the concept of minimal attack graphs has been introduced in [4] and [6]. A minimal attack graph is an attack graph in which all the attack paths terminate at a goal condition. In [7], Jha et al. propose a minimisation analysis of attack graphs. It addresses the issue of finding a minimum set of countermeasures to prevent all attacks in a given attack graph. In [8], an index has been proposed for sub-graph matching and query processing by partitioning the graph. However, the index method is inefficient for dynamic graphs as the index structure changes drastically with any change in the network. An index for semi-structured data has been proposed in [9]. The authors have proposed the use of a trie to provide fast access to the data. There has also been some research on query answering over graph datasets related to bioinformatics [10]. However, in these cases the datasets are small enough to fit into main memory and hence the need for efficient storage and retrieval does not arise.

III. PROPOSED APPROACH

An attack graph G (V, E) consists of nodes and edges, all of

which have labels. However, such a graph could contain up to a million nodes and a million edges. An

indexing method using B-trees [11] has been proposed. B-trees

are balanced search trees designed for minimizing disk I/O

operations. An assumption has been made in our indexing

scheme, that no two nodes of the graph will have the same

label.

This assumption is realistic and applicable to all attack

graphs. A node in an attack graph specifies a service which is provided by one device to another device. In a network, one

service may be provided by many devices. However the

combination of the device providing the service and the device

using the service makes each node unique. Thus, no two nodes

of the attack graph will have the same label.

Consider Figure 1 as an example. The diagram shows the

network configuration. In the network configuration an attacker

is positioned at the workstation machine 0. The attacker can

initially remotely execute a shell in machine 0 without having

to provide a password. The goal of the attacker is to gain super-

user or root privilege at machine 2. A firewall is located

between the attacker and the internal network.

Figure 1 – Network configuration

Figure 2 shows the attack graph corresponding to the network in Figure 1. In the attack graph a node is used to

denote capabilities or conditions while edges are used to depict

exploits or attacks. Capabilities or conditions are denoted in

two ways –

a. f(x,y) – machine x has the capability to

perform the service f on y.

b. f(x) – the service f can be performed locally

by x.

The terms service and privilege have been used

interchangeably in the rest of the paper. A node with an

outgoing edge signifies a pre-condition and a node with an

incoming edge signifies a post-condition. Exploits or attacks

can be denoted as –

e(x,y) – machine x can execute the exploit e on y.
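The node notations f(x,y) and f(x) and the exploit notation e(x,y) share one shape: a name followed by one or two machine ids. A minimal sketch of splitting such a label into its parts is given below, assuming labels are written exactly as in the text. The names `Label` and `parse_label` are illustrative and do not come from the paper.

```python
import re
from typing import NamedTuple, Tuple


class Label(NamedTuple):
    name: str                  # service/privilege or exploit name, e.g. "ftp_c"
    machines: Tuple[str, ...]  # one or two machine ids (or query variables such as "x")


# name "(" id [ "," id ] ")" with optional whitespace around each token
_LABEL_RE = re.compile(r"^\s*([A-Za-z_]\w*)\s*\(\s*(\w+)\s*(?:,\s*(\w+)\s*)?\)\s*$")


def parse_label(text: str) -> Label:
    """Split a label such as "ftp_c(0,2)" or "execute(0)" into name and machine ids."""
    match = _LABEL_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid label: {text!r}")
    name, first, second = match.groups()
    machines = (first,) if second is None else (first, second)
    return Label(name, machines)
```

Machine ids are kept as strings so that the same function also accepts the unknowns x and y used in query graphs.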

Figure 2 - Attack graph of the network in Figure 1


A node may have multiple incoming edges that carry the same label. In this case all of the pre-conditions corresponding to that node for that label must be simultaneously present for the exploit to be executed. For example in Figure 2, the node trust(2,0) has two edges leading to it having the label ftp_rhost(0,2). Thus, the attacker must have both the capabilities ftp_c(0,2) and execute(0) before being able to execute the attack ftp_rhost(0,2).
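This conjunction rule can be sketched as follows, assuming the attack graph is available as a list of (pre-condition, exploit, post-condition) triples. The triple layout and the function name `can_execute` are assumptions made for illustration only.

```python
from typing import List, Set, Tuple

# (pre-condition node, exploit label on the edge, post-condition node)
Edge = Tuple[str, str, str]


def can_execute(exploit: str, target: str, edges: List[Edge], held: Set[str]) -> bool:
    """True if every pre-condition of `exploit` leading to `target` is currently held."""
    required = {pre for pre, label, post in edges if label == exploit and post == target}
    # An exploit with no recorded pre-condition edges is treated as unknown.
    return bool(required) and required <= held


# The two edges described in the text for the node trust(2,0).
edges: List[Edge] = [
    ("ftp_c(0,2)", "ftp_rhost(0,2)", "trust(2,0)"),
    ("execute(0)", "ftp_rhost(0,2)", "trust(2,0)"),
]
```

With both ftp_c(0,2) and execute(0) held the function returns True; with only one of them it returns False.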

An index of this attack graph is to be created. A single B-tree is constructed for storing all the nodes of the graph. A simple function is used to generate a key value for each node in the graph. The nodes are inserted into the B-tree based on this key value, which is derived from the label of each node:

f(n) = ∑ (ASCII values of all the characters of the privilege denoted by node n)

The key value of a node is thus the sum of the ASCII values of all the characters of the privilege denoted by that node. The label of the first node in Figure 2 is ftp_c(0,2) and the privilege it denotes is ftp_c. Thus the key value for that node is 102+116+112+95+99 = 524.

The data pointers of each node of the B-tree, along with the key values, are stored in a single memory block. Whenever a node of the B-tree is to be accessed, its corresponding block is loaded into main memory, provided it is not already present. So the whole B-tree never needs to be loaded into main memory at once.
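The key function f(n) can be written in a few lines. The sketch below assumes that labels contain only ASCII characters and that the privilege name is the part of the label before the opening parenthesis. The name `generate_key` is taken from the algorithm listing, while `key_for_label` is a hypothetical helper.

```python
def generate_key(service_name: str) -> int:
    """Sum of the character codes of the service (privilege) name."""
    return sum(ord(ch) for ch in service_name)


def key_for_label(label: str) -> int:
    """Key of a node label such as "ftp_c(0,2)"; only the text before "(" is used."""
    return generate_key(label.split("(", 1)[0])
```

For the label ftp_c(0,2) this yields 524, matching the worked example. Because addition is order-independent, any two names that are permutations of each other receive the same key.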

Each data pointer of this B-tree points to a secondary memory location where that service is stored. In this location a list mid_list and a table exploit_table are present. The list mid_list stores, in sorted order, all the machine ids between which that service is present. The table exploit_table stores all exploits which are incident on that service. For each exploit in this table the location and name of the service at the opposite end of the exploit are also stored.

For example in Figure 2, in the case of the privilege ftp_c, the data pointer for the key value 524 will point to a location in secondary memory where the details of that service are stored. Each entry of the list mid_list for that service will have two attributes. The values in this list will be (0,1), (0,2), (1,2). Also the exploit_table for that service will store the exploit ftp_rhost. Along with the exploit, the service at the opposite end, i.e. trust, and its location will also be stored.
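One possible in-memory layout for such a service record is sketched below. The class names, the field `opposite_addr` and the address value 0x2000 are assumptions made for illustration; the paper only specifies that a sorted mid_list and an exploit_table are stored for each service.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ExploitEntry:
    opposite_service: str  # name of the service at the other end of the exploit
    opposite_addr: int     # hypothetical secondary-memory address of that service


@dataclass
class ServiceRecord:
    name: str
    # machine id combinations between which the service is present, kept sorted
    mid_list: List[Tuple[int, ...]] = field(default_factory=list)
    # exploits incident on the service, keyed by exploit name
    exploit_table: Dict[str, ExploitEntry] = field(default_factory=dict)


# The ftp_c record from the example; the address 0x2000 is made up.
ftp_c = ServiceRecord(
    name="ftp_c",
    mid_list=sorted([(0, 2), (1, 2), (0, 1)]),
    exploit_table={"ftp_rhost": ExploitEntry("trust", opposite_addr=0x2000)},
)
```

Keying exploit_table by exploit name is a simplification: it allows only one entry per exploit name for each service.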

However, there may be multiple nodes in the attack graph having the same key value. For example the two

nodes denoting the services abcd and dcba both have the same

key value. In this case provisions are made such that the data

pointer of the node in the B-tree points to a table,

collision_table. This table will contain the actual names of the

service of the nodes having the same key value, along with the

corresponding secondary memory locations of the services.
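The collision handling can be sketched as below. A plain dictionary stands in for the B-tree, and all addresses are invented for illustration. The function name `resolve` is hypothetical: it combines the lookup and locate steps that the paper keeps separate.

```python
from typing import Dict, Optional, Union

Address = int
CollisionTable = Dict[str, Address]          # service name -> address
IndexEntry = Union[Address, CollisionTable]  # what a data pointer may refer to


def generate_key(service_name: str) -> int:
    return sum(ord(ch) for ch in service_name)


def resolve(index: Dict[int, IndexEntry], service_name: str) -> Optional[Address]:
    """Return the address of `service_name`, or None if it is not indexed."""
    entry = index.get(generate_key(service_name))
    if entry is None:
        return None
    if isinstance(entry, dict):         # several services share this key
        return entry.get(service_name)  # None if the name is not in the table
    # Single service for this key: returned without comparing names, as in the
    # text; a robust index would also verify the stored service name.
    return entry


index: Dict[int, IndexEntry] = {
    generate_key("ftp_c"): 0x1000,
    generate_key("abcd"): {"abcd": 0x3000, "dcba": 0x3400},  # same key value
}
```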

Depending on the requirements, a B-tree or a B+-tree may be used. In this paper a B-tree has been considered under the assumption that the number of insertions and deletions of nodes and edges will be much smaller than the number of search operations required. This is because, for a typical enterprise network, an attack graph is generated once and only minor changes occur during its maintenance. In this case a B-tree is most effective as the data pointers are not all required to be at the leaf level. In a B+-tree all the data pointers are present at the leaf level. Thus, searching for a value will require a larger number of block accesses than in the case of a B-tree, as the whole height of the B+-tree has to be traversed. For a graph of this size, these additional block accesses can make searching time consuming.

An attack graph can also be stored in a relational database

(RDBMS). It is possible to use the inbuilt searching and

indexing techniques provided by the database. However there

are certain drawbacks in using this method. The searching

techniques adopted by the database are not specifically

designed for optimising the searching of an attack graph. The methods provided may be of no use or, worse, could have an adverse effect in this specific application.

IV. PROPOSED ALGORITHM

The formation of the index requires basic B-tree operations like insertion and deletion. The key values of nodes are determined using the function described above. The nodes are

then inserted into the node B-tree using standard B-tree

algorithms [11]. The node B-tree formed from the attack graph

in Figure 2 is shown below in Figure 3.

Figure 3 - Node B-tree corresponding to the attack graph in Fig. 2

The minimum degree t of the B-tree is taken to be 2. Thus the minimum number of key values in each node is (t-1) or 1 and the maximum number is (2t-1) or 3. The key values of all the nodes are determined and are inserted accordingly.
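The effect of the minimum degree t on node capacity can be seen in a compact in-memory B-tree. The sketch below follows the standard insertion-with-split and search procedures from [11]. It stores a value next to every key to stand for the data pointer, and it ignores disk blocks entirely. The class names are illustrative.

```python
from bisect import bisect_left
from typing import Any, List, Optional


class BTreeNode:
    def __init__(self, leaf: bool = True) -> None:
        self.keys: List[int] = []
        self.values: List[Any] = []  # data pointer kept next to each key
        self.children: List["BTreeNode"] = []
        self.leaf = leaf


class BTree:
    def __init__(self, t: int = 2) -> None:
        self.t = t  # minimum degree: at most 2t-1 keys per node
        self.root = BTreeNode()

    def search(self, key: int) -> Optional[Any]:
        node = self.root
        while True:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                return node.values[i]
            if node.leaf:
                return None
            node = node.children[i]

    def insert(self, key: int, value: Any) -> None:
        if len(self.root.keys) == 2 * self.t - 1:  # full root: grow upwards
            old_root, self.root = self.root, BTreeNode(leaf=False)
            self.root.children.append(old_root)
            self._split_child(self.root, 0)
        node = self.root
        while True:
            i = bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                node.values[i] = value  # existing key: overwrite its pointer
                return
            if node.leaf:
                node.keys.insert(i, key)
                node.values.insert(i, value)
                return
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if key == node.keys[i]:
                    node.values[i] = value
                    return
                if key > node.keys[i]:
                    i += 1
            node = node.children[i]

    def _split_child(self, parent: BTreeNode, i: int) -> None:
        t, full = self.t, parent.children[i]
        right = BTreeNode(leaf=full.leaf)
        mid_key, mid_val = full.keys[t - 1], full.values[t - 1]
        right.keys, right.values = full.keys[t:], full.values[t:]
        full.keys, full.values = full.keys[: t - 1], full.values[: t - 1]
        if not full.leaf:
            right.children, full.children = full.children[t:], full.children[:t]
        parent.keys.insert(i, mid_key)  # the median key moves up
        parent.values.insert(i, mid_val)
        parent.children.insert(i + 1, right)
```

With t = 2, a fourth key arriving at a full node triggers a split, so every node holds between one and three keys.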

The proposed algorithm is used for identifying unknown machines from an attack pattern. The query is shown in Figure 4 for the attack graph in Figure 2. The attack query is also in the form of an attack graph. The unknown machine ids are denoted by x and y. The index structure formed for the attack graph in Figure 2 is partially shown in Figure 5. According to the algorithm, the start node of the query (selected arbitrarily) is initially taken to be ftp_c(x,y). The location of the service ftp_c is found from the node B-tree using the function lookup, shown in Figure 5 as step 1. If the service cannot be found in the B-tree then there exists no possible solution for the query and null is returned. If there exist multiple services for that key value, i.e. there are collisions, then the required service location is obtained from collision_table using the locate function.


Algorithm: Identifying unknown machine ids from an attack graph

Input: Attack graph G, query graph GQ, B-tree nodeBTree

Output: Machine list result

    Q ← createQueue
    service_name ← get_service(startNode(GQ))
    key ← generate_key(service_name)
    service_addr ← lookup(nodeBTree, key)   /* service_addr denotes a memory location */
    if service_addr is null
        return null
    else if service_addr points to a collision_table
        service_addr ← locate(collision_table, service_name)
    enqueue service_addr in Q
    n ← load the service stored at service_addr
    result ← getMIDlist(n)
    while Q is not empty
        service_addr ← dequeue from Q
        n ← load the service stored at service_addr
        result ← result ∩ getMIDlist(n)
        for all edges e incident on n in GQ
            exploit ← search(exploit_table, e)
            if exploit is null
                return null
            service_addr ← get_opposite_service_addr(exploit_table, exploit)
            service_name ← get_opposite_service_name(exploit_table, exploit)
            if e.endnode.service_name does not match service_name
                return null
            if e.endnode is not visited
                enqueue service_addr in Q
        end for
    end while
    return result
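An executable reading of the algorithm is sketched below under several simplifying assumptions that are not in the paper. Dictionaries stand in for the node B-tree and for secondary memory. Each service appears at most once in the query graph. Each exploit_table holds one entry per exploit name. A set intersection replaces the merge of sorted lists. The toy data reuses the ftp_c values from the text, while the trust machine list and both addresses are invented.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

MachineIds = Tuple[int, ...]
Address = int


@dataclass
class Service:
    name: str
    mid_list: List[MachineIds]
    # exploit name -> (name of opposite service, address of opposite service)
    exploit_table: Dict[str, Tuple[str, Address]] = field(default_factory=dict)


@dataclass
class QueryGraph:
    start: str  # service name of the start node
    # service name -> [(exploit name, service name at the other end), ...]
    edges: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)


def generate_key(service_name: str) -> int:
    return sum(ord(ch) for ch in service_name)


def run_query(
    index: Dict[int, Union[Address, Dict[str, Address]]],
    store: Dict[Address, Service],
    gq: QueryGraph,
) -> Optional[List[MachineIds]]:
    entry = index.get(generate_key(gq.start))  # lookup
    if isinstance(entry, dict):                # collision_table: locate by name
        entry = entry.get(gq.start)
    if entry is None:
        return None
    queue = deque([entry])
    visited = {gq.start}
    result = set(store[entry].mid_list)
    while queue:
        service = store[queue.popleft()]       # "load" the service
        result &= set(service.mid_list)
        for exploit, end_name in gq.edges.get(service.name, []):
            found = service.exploit_table.get(exploit)
            if found is None:                  # exploit absent in attack graph
                return None
            opposite_name, opposite_addr = found
            if opposite_name != end_name:      # services at far end differ
                return None
            if end_name not in visited:
                visited.add(end_name)
                queue.append(opposite_addr)
    return sorted(result)


# Toy data: the trust list and both addresses are made up for illustration.
store = {
    0x1000: Service("ftp_c", [(0, 1), (0, 2), (1, 2)], {"ftp_rhost": ("trust", 0x2000)}),
    0x2000: Service("trust", [(0, 2), (1, 2)]),
}
index = {generate_key("ftp_c"): 0x1000, generate_key("trust"): 0x2000}
pattern = QueryGraph(start="ftp_c", edges={"ftp_c": [("ftp_rhost", "trust")]})
```

The sketch marks a query node as visited when it is enqueued, a step the listing leaves implicit.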

In step 2 the service stored at that location in secondary memory is loaded into main memory. Initially the machine id list mid_list of this service is stored as the result. The location of this service is then inserted into a queue.

Figure 4 - Search query

At each iteration a service location is dequeued and the service is loaded into main memory. A new possible set of solutions is obtained from the mid_list of this service (step 3). An intersection operation is then carried out between this new list and the old result list to obtain a new result list. In step 4, for all the exploits on that service in the query graph, a check is made to determine whether the same exploits also exist on that service in the attack graph. The function search in the algorithm is used for this purpose. If this function returns null, then there is no possible solution and null is returned. For each exploit another check is also necessary. The two services at the opposite end of each exploit, i.e. e.endnode.service_name in the case of the query graph and service_name in the case of the attack graph, must match. If they do not, null is also returned. Step 4 also shows how the location of the service trust at the other end of the exploit ftp_rhost is obtained. If the node in the query graph at the other end of the exploit has not been visited, then the service represented by that node is enqueued. In the next iteration the entire process is repeated for the next service, as shown in steps 2′, 3′ and 4′ in Figure 5. This process is repeated until the queue is empty.

The machine id lists are always stored in a sorted manner. This is to ensure that the intersection operation, which finds the elements common to two lists, can be carried out efficiently. At the first step, the first element of the first list, say list 1, is compared with the first element of the second list, say list 2. Since the lists are sorted, the search for the second element of list 1 only needs to start from the element of list 2 which was checked last. This guarantees that each list is traversed at most once. Thus the intersection operation can be carried out in linear time.
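The linear-time intersection described above corresponds to the usual two-pointer merge over sorted lists. The sketch below assumes that machine id combinations are tuples compared lexicographically and that each list contains no duplicates. The function name `intersect_sorted` is illustrative.

```python
from typing import List, Sequence, Tuple

MachineIds = Tuple[int, ...]


def intersect_sorted(a: Sequence[MachineIds], b: Sequence[MachineIds]) -> List[MachineIds]:
    """Elements common to two sorted lists; each list is scanned at most once."""
    i = j = 0
    common: List[MachineIds] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot occur later in b
        else:
            j += 1  # b[j] cannot occur later in a
    return common
```

Each iteration advances at least one index, so the number of comparisons is bounded by the combined length of the two lists.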


Figure 5 – Steps involved in querying the attack graph

V. TIME COMPLEXITY

Number of nodes in the attack graph = n

Minimum degree of the node B-tree = t

Height of the node B-tree = h

Avg. number of edges incident on a node = e

Avg. number of machine id combinations in a machine list = m

Number of nodes in a query graph = p

Number of edges in a query graph = q

Time required to search for a node in the node B-tree, $T(N) = O(t \cdot h) = O(t \log_t n)$

Time required to search for an edge (exploit) in exploit_table, $T(E) = O(\log_2 e)$

In the first case, the time to search for a key value within a node is negligible compared to the time required to load a new block from secondary memory to main memory.

Thus, $T(N) = O(\log_t n)$

Time required to perform an intersection operation between two sorted machine lists, $T(I) = O(m)$

Thus, time to process a query graph

= Time required to search for the first service in the node B-tree + (p-1) × (Time required to perform an intersection operation) + q × (Time required to search for an exploit in the exploit table)

$= T(N) + (p-1) \times T(I) + q \times T(E)$

$= O(\log_t n) + (p-1) \times O(m) + q \times O(\log_2 e)$
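To get a feel for the relative size of the three terms, the expression can be evaluated for one set of values. The numbers used here are assumed purely for illustration and are not measurements from this work: n = 10^6, t = 100, p = 4, q = 3, m = 50 and e = 8.

```latex
\begin{align*}
T(N) &= O(\log_t n) &&\to\; \log_{100} 10^{6} = 3 \text{ block loads} \\
(p-1) \times T(I) &= (p-1) \times O(m) &&\to\; 3 \times 50 = 150 \text{ comparisons} \\
q \times T(E) &= q \times O(\log_2 e) &&\to\; 3 \times \log_2 8 = 9 \text{ comparisons}
\end{align*}
```

Under these assumed values the B-tree is traversed once, at a cost of a few block loads, and the intersection term accounts for most of the remaining work.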

The time required to search for an exploit in the exploit table is very low, as for each service in an attack graph the number of possible exploits is limited. Further, the traversal of the B-tree, which requires comparatively more time, is performed only once at the beginning. Thus the number of nodes in the query graph and the time required to perform an intersection operation are the two major factors.

VI. CONCLUSIONS

In this paper we have suggested an indexing method for the

efficient querying of attack graphs. In large attack graphs

consisting of a number of machines, it is often desirable to

identify which machines are vulnerable to attacks. Due to the

presence of such vulnerable machines, the whole network may be compromised. Thus the identification of these machines in a

network helps the network administrator in preventing attacks

by introducing remedial measures. We have proposed an

algorithm which uses the above index to identify the presence

of all such vulnerable machines in a network from a given

attack pattern. Our algorithm is efficient as the solution is

generated by a single traversal of the query graph. Further as

shown, the time complexity depends mainly on the time taken

to perform an intersection operation.

This is the first time such an analysis has been performed

on attack graphs, to the best of our knowledge. In future, we plan to utilise this indexing scheme for solving other types of

queries which will lead to a better analysis of attack graphs.

ACKNOWLEDGMENT

We would like to acknowledge our guide Mridul Sankar

Barik of the Department of Computer Science and

Engineering, Jadavpur University for his constant support. His knowledge of attack graphs and his tutelage have made our work possible.

REFERENCES

[1] A. Kiryakov, D. Ognyanov, D. Manov, “OWLIM - a pragmatic semantic repository for OWL”, in WISE Workshops, 2005, pp. 182–192.

[2] Y. Theoharis, V. Christophides, G. Karvounarakis, “Benchmarking database representations of RDF/S stores”, 2005, pp. 685–701.

[3] P. Kijsanayothin, R. Hewett, “Analytical Approach to Attack Graph Analysis for Network Security”, in ARES '10, Krakow, 2010.

[4] P. Ammann, J. Pamula, R. Ritchey, J. Street, “A host-based approach to network attack chaining analysis”, in Proceedings of the 21st Annual Computer Security Applications Conference (ACSAC 2005), 2005, pp. 72-84.


[5] S. Noel, S. Jajodia, “Understanding complex network attack graphs through clustered adjacency matrices”, in Proceedings of the Computer Security Applications Conference (ACSAC), 2005, pp. 160–169.

[6] N. Ghosh, S. K. Ghosh, “An intelligent technique for generating minimal attack graph”, First Workshop on Intelligent Security (Security and Artificial Intelligence), Sec Art 2009.

[7] S. Jha, O. Sheyner, J. Wing, “Two formal analyses of attack graphs”, in CSFW '02, Proceedings of the 15th IEEE Workshop on Computer Security Foundations, Washington D.C., USA, 2002.

[8] M. Brocheler, A. Pugliese, V. S. Subrahmanian, “DOGMA: A Disk-Oriented Graph Matching Algorithm for RDF Databases”, Proceedings of the 8th International Semantic Web Conference, 2009.

[9] B. Cooper, N. Sample, M. Franklin, G. Hjaltason, M. Shadmon, “A Fast Index for Semistructured Data”, Proceedings of the 27th VLDB Conference, Rome, Italy, 2001.

[10] Y. Tian, R. C. McEachin, C. Santos, “SAGA: A Subgraph Matching Tool for Biological Graphs”, Bioinformatics 23(2), 2007, 232.

[11] T. Cormen, C. Leiserson, R. Rivest, C. Stein, Introduction to Algorithms, MIT Press.

