an efﬁcient features-based processing technique for...

An Efficient Features-Based Processing Technique forSupergraph Queries

Sherif SakrNICTA and University of New South Wales

Sydney, NSW, [email protected]

Ghazi Al-NaymatCemagref - LISC

Aubiere Cedex, [email protected]

ABSTRACTGraphs are widely used for modeling complicated data suchas social networks, chemical compounds, protein interac-tions, XML documents and multimedia databases. To beable to effectively understand and utilize any collection ofgraphs, a graph database that efficiently supports elemen-tary querying mechanisms is crucially required. Supergraphquery is an important type of graph queries which has manypractical applications. Given a graph database D, the an-swer set of a supergraph query q is computed by retrievingall graphs in D which are fully contained in q. A primarychallenge in computing the answers of graph queries is thatpair-wise comparisons of graphs are usually hard problems.For example, subgraph isomorphism is known to be NP-complete. Clearly, the success of any graph database appli-cation is directly dependent on the efficiency of the graphindexing and query processing mechanisms. In this paper,we study the problem of using the relational infrastructureto achieve an efficient evaluation of supergraph queries. Werely on an effective and efficient layer of features-based sum-mary structures, called graph features knowledge, to reducethe required number of pair-wise graph comparisons andboost the efficiency of query processing. Finally, we con-duct an extensive set of experiments on real and syntheticdata sets to demonstrate the efficiency and the scalability ofour approach.

Categories and Subject DescriptorsH.1 [Information Systems Models and Principles]: Sys-tems and Information Theory; E.1 [Data Structures]: Graphsand networks

General TermsAlgorithms, Performance

KeywordsGraph Database, Graph Query, Supergraph Query

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise,to republish, to post on servers or to redistribute to lists, requires prior spe-cific permission and/or a fee. IDEAS10 2010, August 16-18, Montreal, QC[Canada]Editor: Bipin C. DESAICopyright©2010 ACM 978-1-60558-900-8/10/08 $10.00

1. INTRODUCTIONGraphs are among the most complicated and general form ofdata structures. Recently, graphs have been widely used tomodel many complex structured and schemaless data such assemantic web [19], protein networks [15], social networks [5]and chemical compounds [17]. Retrieving related graphscontaining a query graph from a large graph database is akey performance issue in all of these graph-based applica-tions. Therefore, the success of any graph database appli-cation is directly dependent on the efficiency of the graphindexing and query processing mechanisms. In principle,there are two main types of graph queries:

1. Subgraph queries: This category searches for a specificpattern in the graph database. Formally, given a graphdatabase D = {g1, g2, ..., gn} and a subgraph query q,the query answer set A = {gi|q ⊆ gi, gi ∈ D}.

2. Supergraph queries: This category searches for the graphdatabase members of which their whole structures arecontained in the input query. Formally, given a graphdatabase D = {g1, g2, ..., gn} and a supergraph queryq, the query answer set A = {gi|q ⊇ gi, gi ∈ D}.

Clearly, it is an inefficient and a very time consuming taskto perform a sequential scan over the entire graph databaseD to check whether each graph database member gi be-longs to the answer set of a graph query q. Hence, thereis a clear necessity to build graph indices in order to im-prove the performance of processing graph queries. In re-cent years, the problem of subgraph query processing has at-tracted much research attention and many algorithms havebeen proposed [8, 10, 14, 16, 26, 29, 31, 33, 34]. Althoughthe importance of the supergraph query processing prob-lem in many practical application domains (e.g, chemicalcompounds), it has not been extensively considered in theresearch literature. Thus far, only a few approaches havebeen presented to deal with this problem [7, 32].

Relational Database Management Systems (RDBMSs) haverepeatedly been shown to be highly efficient, scalable andsuccessful in hosting types of data which have formerly notbeen anticipated to be stored inside relational databasessuch as: complex objects [9, 24], spatio-temporal data [4]and XML data [30, 13]. In addition, RDBMSs have shownits ability to handle vast amounts of data very efficientlyusing its powerful indexing mechanisms.

Graph Database

Supergraph Query Query Results

Figure 1: An illustrating example of supergraphsearch queries in graph databases

In our previous work [20], we presented an approach for em-ploying the powerful features of the relational infrastructureto achieve an efficient and scalable processing for subgraphqueries. In this paper we extend our approach to efficientlyexecute the supergraph queries. The proposed approach in-volves utilizing a layer of compact feature-based summarystructures of the underlying graph databases, called graphfeatures knowledge, to reduce the required number of pair-wise graph comparisons and boost the efficiency of queryprocessing. In summary, we make the following contribu-tions:

1. We systematically investigate the use of the relationalinfrastructure for indexing graph databases and pro-cessing supergraph search queries. We propose an ap-proach for extracting descriptive features for the un-derlying graph database members. The extracted fea-tures are then used to effectively filter the candidategraphs and to get rid of many false positives.

2. We present a novel graph features-based mechanismfor minimizing the number of subgraph isomorphismtests which are required during a relational processingfor supergraph queries. The mechanism relies on acompact features summary structure of the underlyinggraph database.

3. We show the efficiency and the scalability of the per-formance of our approach through an extensive set ofexperiments.

The remainder of the paper is organized as follows: We dis-cuss some background knowledge in Section 2. Section 3provides an overview of our relational approach for stor-ing graph databases. Section 4 describes the components ofthe summary layer of graph features knowledge. Section 5presents our mechanism to utilize the graph features knowl-edge in order to reduce the number of subgraph isomorphismtests and improve the relational processing for supergraphqueries. We evaluate our method by conducting an exten-sive set of experiments which are described in Section 6. Therelated work is surveyed in Section 7 before we conclude thepaper in Section 8.

2. PRELIMINARIESThis paper focuses on the directed-labeled and connectedsimple graphs (we refer to them as graphs in the rest of thepaper). However, the algorithms proposed in the paper canbe easily extended to other kinds of graphs.

Definition 1. [Labeled Graph] A labeled graph g is de-noted as (V,E, Lv, Le, Fv, Fe) where V is the set of vertices;

E ⊆ V × V is the set of edges joining two distinct vertices;Lv is the set of vertex labels; Le is the set of edge labels; FV

is a function V → Lv that assigns labels to vertices and Fe

is a function E → Le that assigns labels to edges.

Definition 2. [Graph Database] A graph database D is acollection of member graphs gi where D = {g1, g2, ..., gn}

In principle, there are two main types of graph databases.The first type consists of a few number of very large graphssuch as the web graph and social networks. The secondtype consists of a large set of small graphs such as chemicalcompounds and biological pathways. This paper focuses onthe second type of graph databases.

Definition 3. [Subgraph Isomorphism]Let g1 = (V1, E1, Lv1, Le1, Fv1, Fe1) andg2 = (V2, E2, Lv2, Le2, Fv2, Fe2) be two graphs, g1 is definedas a graph isomorphism to g2, if and only if there exists atleast one bijective function f : V1 → V2 such that:1) For any edge uv ∈ E1, there is an edge f(u)f(v) ∈ E2.2) Fv1(u) = Fv2(f(u)) and Fv1(v) = Fv2(f(v)). 3) Fe1(uv) =Fe2(f(u)f(v)).

The subgraph search and supergraph search problems ingraph databases are formulated as follows.

Definition 4. [Subgraph Search] Given a graph databaseD = {g1, g2, ..., gn} and a subgraph query q, the query an-swer set A = {gi|q ⊆ gi, gi ∈ D}.

Definition 5. [Supergraph Search] Given a graph databaseD = {g1, g2, ..., gn} and a supergraph query q, the query an-swer set A = {gi|q ⊇ gi, gi ∈ D}.

Figure 1 provides an illustrative example for the supergraphsearch problem in graph databases which is the main focusof this paper.

3. RELATIONAL ENCODING OF GRAPHDATABASES

The starting point of our approach for processing super-graph queries is to find an efficient and suitable encodingfor each graph member gi in the graph database D. There-fore, we propose the Vertex-Edge mapping scheme as anefficient, simple and intuitive relational storage scheme forstoring our targeting directed labeled graphs. In this map-ping scheme, each graph database member gi is assigned aunique identity graphID. Each vertex is assigned a sequencenumber (vertexID) inside its graph. Each vertex is repre-sented by one tuple in a single table (Vertices table) whichstores all vertices of the graph database. Each vertex isidentified by the graphID to which the vertex belongs andthe vertex ID. Additionally, each vertex has an additionalattribute to store the vertex label. Similarly, all edges ofthe graph database are stored in a single table (Edges table)where each edge is represented by a single tuple in this ta-ble. Each edge tuple describes the graph database member

g1

graphID vertexID vLabel

1 1 A

1 2 B

1 3 A

2 1 C

2 2 D

2 3 A

3 1 B

3 2 C

3 3 C

3 4 D

graphID sVertex dVertex eLabel

1 1 2 m

1 1 3 n

1 3 2 x

2 1 2 m

2 1 3 x

2 3 2 x

3 1 2 x

3 1 3 x

3 2 4 m

3 3 4 m

1A

B A

C

A

D

n

xx

m

B C

C D

mxg3

3

2

3

1

2

2

43

1

x

m

g2

x

C Dm43

Edges TableVertices Table

Figure 2: Vertex-Edge relational mapping scheme ofgraph database

vertexID vLabel

1 B

2 C

3 C

4 D

5 A

sVertex dVertex eLabel

1 2 x

1 3 x

2 4 m

3 4 m

2 5 x

5 4 x

B C

C D

mx

m

q2

43

1

Edges Table (q)Vertices Table (q)

x

A 5

x

x

Figure 3: Vertex-Edge mapping scheme of a samplesupergraph query

to which the edge belongs, the id of the source vertex of theedge, the id of the destination vertex of the edge and theedge label. Therefore, the relational storage scheme of ourVertex-Edge mapping is described as follows:

• V ertices(graphID, vertexID, vertexLabel)• Edges(graphID, sV ertex, dV ertex, edgeLabel)

Figure 2 illustrates an example of the Vertex-Edge relationalmapping scheme of a sample graph database. Figure 3 illus-trates the encoding of a supergraph query q. Evaluating thesample supergraph query against the sample graph databaseshould return an answer set that contains the graph databasemembers {g1, g2}. Let us assume a supergraph query q withM nodes and N edges. The answer set of this query wouldthus contain all graph database members gi with x nodesand y edges where x 6 M , y 6 N and structurally con-tained in q. Apparently, using a straightforward SQL-basedapproach for evaluating the answer of this type of queriesis quite difficult and expensive. A traditional filtering-and-verification can go through the following steps:

• Filtering phase: This phase can be represented by adistinct union statement between a set of SQL-basedintermediate results which represent the set of graphdatabase members for each pair of integer values x andy where x 6 M , y 6 N , x > y using the SQL-basedtemplate illustrated in Figure 4In this template, each referenced table Vi (Line number2) represents an instance from the table V ertices andmaps the information of one vertex of the set of verticesof the supergraph query q represented by the tablesQVi (Line number 4). Similarly, each referenced tableEj (Line number 3) represents an instance from thetable Edges and maps the information of one edge ofthe supergraph query q represented by the tables QEj

1 SELECT V1.graphID2 FROM Vertices as V1,..., Vertices as Vx,3 Edges as E1,..., Edges as Ey,4 qVertices as QV1,...,qVertices as QVx,5 qEdges as QE1,...,qEdges as QEy

6 WHERE7 ∀xi=2(V1.graphID = Vi.graphID)8 AND ∀yj=1(V1.graphID = Ej .graphID)9 AND ∀xi=1(Vi.vertexLabel = QVi.vertexLabel)10 AND ∀yj=1(Ej .edgeLabel = QEj .edgeLabel);

Figure 4: SQL-Based Template of the FilteringPhase

(Line number 5). Line number 7 of the SQL trans-lation template represents a set of x − 1 conjunctiveconditions to ensure that all queried vertices belong tothe same graph. Similarly, Line number 8 of the SQLtranslation template represents a set of y conjunctiveconditions to ensure that all queried edges belong tothe same graph of the queried vertices. Lines number9 and 10 represent the set of conjunctive predicatesof the vertex and edge labels respectively. The dis-tinct union statement is required to get rid of dupli-cate graphs which can appear in different intermediateresults because they contain subgraphs with differentsizes of the supergraph query q.• Verification phase: This phase needs to ensure that

each candidate graph database member is fully con-tained in the supergraph query q and does not haveany extra non-filtered nodes or edges which are notconsidered in the SQL-based template of the filteringphase. An obvious problem of the straightforward SQLtranslation template of the filtering phase, in its cur-rent form, is that it will return many false positivegraphs because it considers any graph database mem-ber that has a subgraph of the supergraph query q as acandidate result. The main reason behind this is thatthe filtering predicates of the SQL template is quitelimited as it does not consider any information aboutthe features or the topology of the supergraph query.Moreover, it has no means to check if these filteredgraphs are fully contained in the supergraph query q ornot. In the next section, we will describe our approachto tackle this problem and dramatically improve theperformance and the effectiveness of the SQL-based fil-tering phase of supergraph queries. Clearly, improvingthe effectiveness of the filtering phase will consequentlyreduce the cost of the verification phase.

4. GRAPH FEATURES KNOWLEDGEIn order to improve the efficiency of the SQL-based approachfor evaluating supergraph queries, we employ the idea of au-tomatic creation and usage of higher-level data about data(metadata) [3, 6], hereafter referred to as graph featuresknowledge. This information is used during the query opti-mization and execution phases to improve the performanceof query evaluation. We propose two metadata layers: graphdescriptors and aggregate graph descriptors.

g1

gID NN NE NL EL MID MOD MD MPL

1 3 3 2 3 2 2 2 2

2 4 4 3 2 2 2 2 2

3 5 6 5 6 3 2 3 3

1A

B A

nm

B C

C D

mx

m

g2

32

2

43

1

x

g3x

A B

C D

my

z

2

43

1x

Em

n5

Graph Descriptors

Figure 5: An illustrating example of graph descrip-tors

Graph DescriptorsIn the first layer, we maintain a description record aboutthe features of each graph database member in the graphdatabase. Each graph descriptor includes the following com-ponents:

• No. Nodes (NN): represents the total number of nodesin the graph.• No. Edges (NE): represents the total number of edges

in the graph.• Unique Node Labels (NL): represents the number of

unique node labels in the graph.• Unique Edge Labels (EL): represents the number of

unique edge labels in the graph.• Maximum in-Node Degree (MID): represents the max-

imum number of incoming edges for a node in thegraph.• Maximum out-Node Degree (MOD): represents the max-

imum number of outgoing edges for a node in thegraph.• Maximum io-Node Degree (MD): represents the maxi-

mum number of total number of incoming and outgo-ing edges for a node in the graph.• Maximum Path Length (MPL): where a path length

in a graph represents a sequence of consecutive edgesand the length of the path is the number of traversededges.

Figure 5 illustrates an example of a graph descriptor ta-ble for a sample graph database. In fact, the componentsof the graph descriptor can vary from one database to an-other based on the characteristics of the underlying graphdatabase members. For example, if there is a guarantee thateach node or edge in has a unique label within a graph thenwe can avoid storing the components which represent thenumber of unique node and edge labels in each graph. Itis also possible to add some more descriptive fields such as:the number of occurrences of the maximum path length orthe number of occurrences of the maximum (in/out/io)-nodedegrees.

On the one hand, the more indexed features the higher thepruning power of the filtering phases. However, it also in-creases the space cost. On the other hand, the less indexedfeatures would lead to poor pruning power in the filteringprocess and consequently resulting in a performance hit inthe query processing time.

In general, computing the components of our graph descrip-tor (except the maximum path length which is recursivein nature) can be achieved in a straightforward way usingSQL-based aggregate queries. For example, computing thenumber of nodes graph descriptor component (NN) can berepresented using the following SQL query:

SELECT graphID, count(*) as NN FROM VerticesGROUP BY graphID;

Similarly, computing the number of edges descriptor com-ponent (NE) can be represented using the following SQLquery:

SELECT graphID, count(*) as NE FROM EdgesGROUP BY graphID;

Aggregate Graph DescriptorsIn the second layer of our graph features knowledge, we main-tain a compact aggregated version of the graph descriptorstable in the following form:(BID,NN,NE,NL,EL,MID,MOD,MD,MPL, graphCount)

In fact, the main idea behind the aggregate graph descrip-tors is that it groups the graph database members with thesame features into one bucket identified by its bucket ID(BID). Hence, it represents a guide summary of the fea-tures of the graph database members [11]. The size of thisaggregated version is usually, by far, less than the size ofthe descriptor table and can easily fit into the main mem-ory. Based on a defined threshold t, we create an auxiliaryinverted list [2] of all the related graph database membersfor each summary bucket with a number of graph databasemembers (graphCount) which is less than t. The main goalof these inverted lists is to allow for cheap and quick accessto the set of the few graphs which share the same featuresof that bucket.

5. FEATURES-BASED SQL EVALUATION OFSUPERGRAPH QUERIES

In this section, we describe how to use the two layers ofgraph features knowledge (Section 4) to improve the SQL-based evaluation of supergraph queries. Figure 6 gives anoverview of our approach for features-based SQL translationof supergraph queries which goes through the following mainsteps:

1. Database Preprocessing : In this step we analyze theunderlying graph database to build the components ofthe layer of the graph features knowledge (GFK).

2. Determining The Pruning Strategy : The query proces-sor recognize the features of the input graph query andprobes the components of theGFK with the query fea-tures to identify the pruning strategy of the filteringphase.

3. Filtering : It uses the outcome decisions from probingthe GFK to apply an effective and efficient pruningstrategy for filtering the graph database members, getrid of many false positive graphs (graphs that are notpossible to belong to the query answer) and producea list of graph database members which are candidateto constitute the query result.

4. Verification: It validates each candidate graph to checkif it really belongs to the answer set of the query graph.

Graph Database

Graph Features

Knowledge

SupergraphQuery

Query Processor

Off-line Database Preprocessing

ProbesSupergraphQuery

Descriptor

Computes

Candidate Answer Set

Final Answer Set

Filtering Phase (Features-Based SQL Evaluation)

Verification Phase

Figure 6: An overview of features-based SQL Eval-uation of supergraph queries

Clearly, the query response time consists of the filtering timeand the verification time. The less candidates identified af-ter the filtering process, the faster the query response time.Hence, pruning power in the filter process is critical to theoverall query performance. Therefore, given a supergraphquery q, the query evaluation process starts by computingthe graph query descriptor (qd) after which by using theaggregated graph descriptors layer (AGD), we identify theset of buckets where the features of each bucket are fullycontained in the graph query. In fact, the graph query de-scriptor is considered to be fully contained in a bucket if andonly if the value of each component in the query descriptoris greater than or equal to its associated component in thebucket. This matching process is defined as follows:

Definition 6. [Bucket Match] Given an aggregate graphdescriptor AGD = {b1, b2, ..., bn} and a supergraph querydescriptor qd, the matched bucket set MB = {bi|qd ⊆bi, bi ∈ AGD} where qd ⊆ bi iff qd.NN ≥ bi.NN ∧qd.NE ≥bi.NE ∧ qd.NL ≥ bi.NL ∧ qd.EL ≥ bi.EL ∧ qd.MID ≥bi.MId∧qd.MOD ≥ bi.MOD∧qd.MD ≥ bi.MD∧qd.MPL ≥bi.MPL.

Figure 9 illustrates an example of the matching process ofa supergraph query descriptor against the buckets of an ag-gregate graph descriptors. Based on this matching process,the supergraph query evaluation strategy is determined asdepicted in the steps of Algorithm 1 (Figure 7) and Algo-rithm 2 (Figure 8). The strategies of our supergraph queryevaluation process are described as follows:

• Case SupStrategy1: If the count of the matchedbuckets is equal to zero, we return an empty resultset because none of the graph database members hasfeatures which are fully contained in the supergraphquery descriptor.

• Case SupStrategy2: If the count of the matchedbuckets is greater than zero and the size of each matchedbucket (bi.graphcount) is less than the defined thresh-old t then we unfold their associated inverted lists toretrieve and verify their related graph database mem-bers in order to compute the answer set of the super-graph query q.

Require: q: Supegraph Query,AD: Aggregate Descriptors,t: Threshold of Bucket Size with an Inverted List,

Ensure: SupStrategy: integermb: Array of Matched Aggregated Descriptors Entries

1: qd:= ComputeGraphQueryDescriptor(q)2: mb:= IdentifySupMatchedBuckets(qd,AD)3: if mb.length = 0 then4: SupStrategy:= 15: return6: end if7: SupStrategy:= 2;8: for i = 1 to mb.length do9: if mb[i].graphcount > t then

10: SupStrategy :=311: return12: end if13: end for

Figure 7: Algorithm 1 (IdentifySupStrateg): Deter-mines the supergraph query evaluation strategy

Require: D: Graph Database, q: Supergraph Query,GD: Graph Descriptors,AD: Aggregate Descriptors,t: Threshold of Bucket Size with an Inverted List,

Ensure: ASGQ: Answer set of the supergraph query (q)1: IdentifySupStrategy(q, AD, t, SupStrategy, mb)2: if SupStrategy = 1 then3: return empty4: end if5: if SupStrategy = 2 then6: for i = 1 to mb.length do7: for all g ∈ mb[i].InvertedList.graphs do8: if Verify(g, q) then9: ASGQ.ADD(g)

10: end if11: end for12: end for13: return ASGQ14: end if15: if SupStrategy = 3 then16: for i = 1 to mb.length do17: if mb[i].graphcount ≤ t then18: for all g ∈ mb[i].InvertedList.graphs do19: if Verify(g, q) then20: ASGQ.ADD(g)21: end if22: end for23: else24: sqg:= ExtractFeaturedSubgraphs(q,mb[i])25: for all eqg ∈ sqg do26: cg:= SQLFeaturedFilter(D, eqg,GD,mb[i])27: for all g ∈ cg do28: if Verify(g, q) then29: ASGQ.ADD(g)30: end if31: end for32: end for33: end if34: end for35: return ASGQ36: end if

Figure 8: Algorithm 2: Features-Based Evaluationof Supergraph Queries

BID NN NE NL EL MID MOD MD

1 4 4 2 3 3 1 3

2 4 6 2 4 2 1 2

3 5 6 3 6 1 1 2

4 6 10 6 7 3 1 4

Aggregated Graph Database Descriptors

NN NE NL EL MID MOD MD

4 5 2 5 3 2 4

Supergraph Query Descriptor

XX

X

Aggregated Graph Database Descriptors

Figure 9: Matching a graph query descriptor againstan aggregated graph descriptors

• Case SupStrategy3: This case handles the situationwhen we have a number of matched buckets with asize less than the threshold t and a number of matchedbuckets with a size greater than the threshold t. Sim-ilar to SupStrategy2, for each matched bucket witha size less than t, we unfold its associated invertedlist to retrieve and verify the related graphs wherethe target graphs are added to the query answer set.However, for each matched bucket (mbs) with a sizegreater than t, we convert the supergraph search prob-lem into a subgraph search problem. We use the fea-tures of the matched bucket (mbs) to apply a back-tracking step which uses the features of the matchedbucket to extract from the supergraph query (q) allsubgraphs (sqg) with exactly the same features. Wethen treat each extracted subgraph (eqg) as a subgraphquery where we issue a features-based SQL query us-ing the translation template depicted in Figure 10 toretrieve all graph database members that fulfill thefeatures of both the extracted subgraph (eqg) and thematched bucket (mbs). Clearly, this backtracking stepdramatically reduces the search space. In this tem-plate, the number of the instantiated instances of theV ertices and Edges tables are determined by the val-ues of the number of nodes (NN) and the number ofedges (NE) components in the identified bucket (mbs)respectively. Each referenced table Vi and Ej maps theinformation of vertices and edges of the extracted sub-graph of the supergraph query q respectively. Linesnumber 7,8,9,10,11,12 and 13 use the graph descrip-tors information to retrieve only the graph databasemembers which satisfy the features of the identifiedbuckets. Lines number 14 and 15 of the SQL transla-tion template represents a set of conjunctive conditionsto ensure that all queried vertices and edges belong tothe same graph respectively. Similarly, Lines number16 and 17 represent the set of conjunctive predicatesof the vertex and edge labels respectively. Lines num-ber 18 and 19 represents the topology conditions ofthe edges between the mapped vertices where f is themapping function between each subgraph query vertexand its associated vertices table instance Vi.

1 SELECT DISTINCT V1.graphID2 FROM Vertices as V1,..., Vertices as Vmbs.NN ,3 Edges as E1,..., Edges as Embs.NE ,4 GraphDescriptors as GD5 WHERE6 V1.graphID = GD.graphID7 AND GD.NN = mbs.NN8 AND GD.NE = mbs.NE9 AND GD.NL = mbs.NL10 AND GD.EL = mbs.EL11 AND GD.MID = mbs.MID12 AND GD.MOD = mbs.MOD13 AND GD.MD = mbs.MD14 AND ∀mi=2(V1.graphID = Vi.graphID)15 AND ∀nj=1(V1.graphID = Ej .graphID)16 AND ∀mi=1(Vi.vertexLabel = QVi.vertexLabel)17 AND ∀nj=1(Ej .edgeLabel = QEj .edgeLabel)18 AND ∀nj=1(Ej .sV ertex = Vf .vertexID AND19 Ej .dV ertex = Vf .vertexID);

Figure 10: SQL Template for Features-Based Sub-graph Search

In contrast to the verification phase described in Sec-tion 3, the features-based verification phase does notneed to ensure that the candidate graph database mem-bers are fully contained in the supergraph query q anddo not have any extra non-filtered nodes or edges. Thisis already guaranteed by the conditions of the filteringphase. In principle, this phase is optionally requiredif and only if more than one vertex of the set of sub-graph query vertices have the same label. In this case,it verifies that each vertex in the set of filtered verticesfor each candidate graph database member is distinct.This can be easily achieved using their vertexID. Al-though the fact that the conditions of the verificationprocess could be injected into the SQL translation tem-plate of the features-based filtering phase, we foundthat it is more efficient to avoid the cost of performingthese conditions over each graph database membersby delaying their processing (if required) to a separatephase after pruning the candidate list.

6. PERFORMANCE EVALUATIONIn this section, we present an extensive experimental studyto demonstrate the efficiency and scalability of our proposedtechniques. We conducted our experiments using the Post-greSQL RDBMS running on a PC with 3.2 GHZ Intel Xeonprocessors, 4 GB of main memory storage and 250 GB ofSCSI secondary storage. In our experiments we used thefollowing two kinds of datasets:

1) The real AIDS antiviral screen dataset (NCI/NIH) whichis publicly available on the website of the Developmen-tal Therapeutics Program [1]. For this dataset, differ-ent query sets are used, each of which has 1000 queries.These 1000 queries are constructed by randomly selecting1000 graphs from the antiviral screen dataset and thenextracting a connected m edge subgraph from each graphrandomly. Each query set is denoted by its edge size (m)

D2kV6E10L40M50 D10kV9E15L70M100 D50kV12E18L100M150 AIDS

0

10

20

30

40

50

Dis

k S

tora

ge

Siz

e (i

n M

B)

Graph Database

(a) Size of Graph Features Knowledge

D2kV6E10L40M50 D10kV9E15L70M100 D50kV12E18L100M150 AIDS

0

50

100

150

200

250

Off

line

Pro

cess

ing

Tim

e (i

n S

eco

nd

s)

Graph Database

(b) Construction Time of Graph Features Knowledge

Figure 11: Offline Database Preprocessing Cost.

as Qm.2) A set of synthetic datasets which is generated by our

implemented data generator that follows the same ideaproposed by Kuramochi and Karypis in [18]. The gener-ator allows the user to specify the number of graphs (D),the average number of vertices for each graph (V ), theaverage number of edges for each graph (E), the numberof distinct vertices labels (L) and the number of distinctedge labels (M). We generated different datasets withdifferent parameters according to the nature of each ex-periment. We use the notation DdV vEeLlMm to rep-resent the generation parameters of each data set. Forexample a data set D5000V 10E20L40M50 means thatthe data set contains 5000 graphs, the average number ofvertices in each graph is 10, the average number of edgesin each graph is 20, there are 40 possible vertices labelsand there are 50 possible edge labels.

Figure 11 shows the offline preprocessing cost of differentdatasets which are used in our experiments. Figure 11(a)shows the disk storage size of the graph features knowl-edge. The results of the experiments show that the storagesize increases linearly w.r.t the size of the graph databases(more specifically the number of graph database members).Obviously, the main reason behind this is that each graphdatabase member requires a new record in the graph de-scriptor table. In fact, the size of the aggregate graph de-scriptors is very small and could be neglected. The sizeincrease of this aggregate component is not affected by theincrease in the number of graph database members. How-ever, it is affected by the number of unique features of thegraph database members.

One of the main advantages of using a relational databaseto store and query graph databases is to exploit their well-know scalability feature. Figure 12 shows the query perfor-mances of our features-based SQL evaluation of supergraphqueries. Figure 12(a) illustrates the average execution timesfor the features-based SQL translations of 1000 supergraph

queries with different sizes on graph databases with differentcharacteristics. In this figure, the running time for the su-pergraph query processing is presented in the Y-axis, the X-axis represents the different query sizes. The running time ofthese experiments include both the filtering and verificationphases. However, on average the running time of the veri-fication phase represents 3% of the total running time andcan be considered to have a negligible effect on all querieswith small result sets.

In practice, the SQL translation template of the filteringphase involves join operations between several instances ofthe encoding relations. Hence, for the buckets with large val-ues for the NN and NE components, some of the relationalquery engines may fail to execute their SQL translations be-cause they are too long or too complex (this does not meanthey must consequently be too expensive). Therefore, werely on our selectivity-aware decomposition algorithm pre-viously introduced in [20] to tackle this problem. Since themain focus of this paper is on graph databases which containlarge sets of small graphs, the occurrences of these patternsare not frequent in our experiments. In labeled graphs, it isgenerally the case that the number of distinct vertices andedges labels are far less than the number of vertices andedges respectively. In order to improve the performance ofthe filtering phase, we created partitioned B-tree indexes [12]over the vertices and edges labels to reduce the access costsof the secondary storage. The experiment results show thatthe execution times of our approach performs and scales ina near linear fashion with respect to the graph database andquery sizes.

Although the size of the graph features knowledge can beconsidered as an additional overhead, it is really worth topay as it significantly improves the performance of queryevaluation especially when considering its relatively cheapand scalable construction time (Figure 11(b)). Figure 12(b)indicates the percentage of the speed-up improvement onthe execution times of the features-based SQL evaluation of

Q8 Q12 Q16 Q20

100

1000

10000

100000

Ex

ec

uti

on

Tim

e (

ms

)

Query Size (Number of Edges)

D2kV4E8L3M5D10kV4E8L3M5D50kV4E8L3M5AIDS

(a) Execution Times

Q8 Q12 Q16 Q200.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

Sp

ee

du

p Im

pro

ve

me

nt

(%)

Query Size (Number of Edges)

D2kV4E8L3M5D10kV4E8L3M5D50kV4E8L3M5AIDS

(b) Speed up Improvement

Figure 12: Query Performance.

supergraph queries in comparison to the straightforward ap-proach. The reported percentage of speed up improvementsare computed using the formula: (1 − F

C)% where F repre-

sents the execution time of our features-based SQL transla-tion while C represents the execution time of the straightfor-ward SQL translation. Clearly, using our approach improvesthe query performance with several order of magnitudes.The percentage of improvement increases linearly with re-spect to the increase in the database and query sizes due tothe effectiveness of the filtering phase of our features-basedapproach to get rid of many of the false positives. In partic-ular, we specify the reasons behind the improvement of thefeatures-based evaluation of supergraph queries as follows:

• It effectively reduces the size of the intermediate re-sults. Matching the information of the graph descrip-tors with the features of the identified buckets dramat-ically reduces the size of each intermediate result andgets rid of many false positive candidates.• It avoids the need of having an intermediate result for

each pair of values x and y where x > y, x 6 the num-ber of the query nodes and y 6 the number of queryedges (Section 3). Instead, the intermediate results aredetermined by the extracted subgraph from the super-graph queries according to the features of the matchedbuckets which are fully contained in the features of thesupergraph query. Thus, these intermediate results areensured to be very effectively pruned. Moreover, evalu-ating the intermediate results for some of the matchedbuckets can be avoided if their sizes (graphCount) areless than the defined threshold t.• The intermediate results are disjoint and duplicate-

free. Each candidate graph database member can be-long to only one of the intermediate result based onthe matching of its graph descriptor with the uniquefeatures of each identified bucket. Disjointness andduplicate-free play an effective role in reducing the to-tal number of records in the intermediate results.

7. RELATED WORKRecently, graph database has attracted a lot of attentionfrom the database community [21]. There are many graphindexing techniques that have been proposed to deal withthe problem of subgraph query processing [20, 23, 29, 33].They can be classified into the following two main approaches:

• Non Mining-Based Graph Indexing Techniques:These techniques focus on indexing the whole con-structs of the graph database instead of indexing onlysome selected features (GraphGrep [10], Closure-Tree [14],gString [16], GraphREL [20], GDIndex [26]). The maincriticisms of these approaches are: 1) They can be lesseffective in their pruning power. 2) They may need toconduct expensive structure comparisons in the filter-ing process and thus degrades the filtering efficiency.Therefore, these techniques need to employ efficientfiltering and prunning mechanisms to overcome theselimitations.• Mining-Based Graph Indexing Techniques: These

techniques apply graph mining methods [25, 27, 28] toextract some features from the graph database mem-bers (gIndex [29], FG-Index [8], TreePi [31], Tree-δ [33]).An inverted index is created for each feature. Answer-ing a subgraph query q is achieved through two steps:1) Identifying the set of features of the subgraph query.2) Using the inverted index to retrieve all graphs thatcontain the same features of q. The rationale behindthis type of query processing techniques is that if somefeatures of graph q do not exist in a data graph G,then G cannot contain q as its subgraph. Clearly, theeffectiveness of these filtering methods depends on thequality of mining techniques to effectively identify theset of features.

Sakr [20] has presented a purely relational framework forprocessing graph queries named GraphREL. The main op-timization technique of GraphREL is based on the observa-tion that the size of the intermediate results dramatically af-fects the overall evaluation performance of SQL scripts [22].

Therefore, GraphREL keeps statistical information aboutthe less frequently existing nodes and edges in the graphdatabase. For a graph query q, the maintained statisticalinformation is used to identify the highest pruning pointon its structure (nodes or edges with very low frequency)to firstly filter out, as many as possible, the false positivesgraphs that are guaranteed to not exist in the final resultsbefore passing the candidate result set to an optional verifi-cation process.

Few approaches have been presented to deal with the su-pergraph search problem. Chen et al. [7] have presented anapproach named cIndex . The indexing unit of this approachis the subgraphs which are extracted based on the rarity oftheir occurrence in historical query graphs. Sometimes, theextracted subgraphs are very similar to each other. cIndexuses a redundancy-aware selection mechanism to sort out themost significant and distinctive contrast subgraphs betweendatabase graphs and queries. Zhang et al. [32] proposedan approach for a compact organization of graph databasemembers named GPTree. In this approach, all of the graphdatabase members are stored into one graph where the com-mon subgraphs are stored only once. Based on the contain-ment relationship between the sets of the extracted features,a mathematical approach is used to determine the orderingof the feature set which can reduce the number of subgraphisomorphism tests during query processing.

8. CONCLUSIONEfficient supergraph query processing plays a critical rolein many application domains which involve complex struc-tures such as: bioinformatics and chemical compounds. Inthis paper, we studied the problem of using the relational in-frastructure to achieve an efficient evaluation of supergraphqueries. Our approach converts a graph into an intuitive re-lational schema and uses compact summary structures of theunderlying graph databases to achieve efficient SQL execu-tion plans for evaluating supergraph queries. We presentedan algorithm for features-based SQL-based evaluation of su-pergraph queries. The proposed algorithm reduces the re-quired number of pair-wise graph comparisons and booststhe efficiency of relational-based query processing. We eval-uated the performance of our approach with extensive exper-iments using real and synthetic datasets. The results showthat our approach is up to several orders of magnitude fasterthan the straightforward solution.

9. REFERENCES[1] Developmental therapeutics Program NCI/NIH. .

http://dtp.nci.nih.gov/.

[2] Ricardo A. Baeza-Yates and Berthier A. Ribeiro-Neto.Modern Information Retrieval. ACM Press /Addison-Wesley, 1999.

[3] Kevin S. Beyer, Peter J. Haas, Berthold Reinwald,Yannis Sismanis, and Rainer Gemulla. On synopsesfor distinct-value estimation under multisetoperations. In Proceedings of the ACM SIGMODInternational Conference on Management of Data,pages 199–210, 2007.

[4] Viorica Botea, Daniel Mallett, Mario A. Nascimento,and Jorg Sander. PIST: An Efficient and PracticalIndexing Technique for Historical Spatio-Temporal

Point Data. GeoInformatica, 12(2):143–168, 2008.

[5] Deng Cai, Zheng Shao, Xiaofei He, Xifeng Yan, andJiawei Han. Community Mining from Multi-relationalNetworks. In Proceedings of the 9th EuropeanConference on Principles and Practice of KnowledgeDiscovery in Databases (PKDD), pages 445–452, 2005.

[6] Sunil Chakkappen, Thierry Cruanes, Benoıt Dageville,Linan Jiang, Uri Shaft, Hong Su, and Mohamed Zaıt.Efficient and scalable statistics gathering for largedatabases in Oracle 11g. In Proceedings of the ACMSIGMOD International Conference on Management ofData, pages 1053–1064, 2008.

[7] Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han,Dong-Qing Zhang, and Xiaohui Gu. Towards GraphContainment Search and Indexing. In Proceedings ofthe 33rd International Conference on Very Large DataBases (VLDB), pages 926–937, 2007.

[8] James Cheng, Yiping Ke, Wilfred Ng, and An Lu.Fg-index: towards verification-free query processingon graph databases. In Proceedings of the ACMSIGMOD International Conference on Management ofData, pages 857–872, 2007.

[9] Shirley Cohen, Patrick Hurley, Karl W. Schulz,William L. Barth, and Brad Benton. Scientific formatsfor object-relational database systems: a study ofsuitability and performance. SIGMOD Record,35(2):10–15, 2006.

[10] Rosalba Giugno and Dennis Shasha. Graphgrep: Afast and universal method for querying graphs. InProceedings of the 16th International Conference onPattern Recognition (ICPR), pages 112–115, 2002.

[11] Roy Goldman and Jennifer Widom. DataGuides:Enabling Query Formulation and Optimization inSemistructured Databases. In Proceedings of the 23rdInternational Conference on Very Large Data Bases(VLDB), pages 436–445, 1997.

[12] Goetz Graefe. Sorting And Indexing With PartitionedB-Trees. In Proceedings of the 1st Biennial Conferenceon Innovative Data Systems Research (CIDR), 2003.

[13] Torsten Grust, Sherif Sakr, and Jens Teubner. XQueryon SQL Hosts. In Proceedings of the 30th InternationalConference on Very Large Data Bases (VLDB), pages252–263, 2004.

[14] Huahai He and Ambuj K. Singh. Closure-Tree: AnIndex Structure for Graph Queries. In Proceedings ofthe 22nd International Conference on DataEngineering (ICDE), page 38, 2006.

[15] Jun Huan, Wei Wang, Deepak Bandyopadhyay, JackSnoeyink, Jan Prins, and Alexander Tropsha. Miningprotein family specific residue packing patterns fromprotein structure graphs. In Proceedings of the 8thAnnual International Conference on ComputationalMolecular Biology (RECOMB), pages 308–315, 2004.

[16] Haoliang Jiang, Haixun Wang, Philip S. Yu, andShuigeng Zhou. GString: A Novel Approach forEfficient Search in Graph Databases. In Proceedings ofthe 23rd International Conference on DataEngineering (ICDE), pages 566–575, 2007.

[17] Stefan Klinger and Jim Austin. Chemical similaritysearching using a neural graph matcher. In Proceedingsof the 13th European Symposium on Artificial NeuralNetworks (ESANN), pages 479–484, 2005.

[18] Michihiro Kuramochi and George Karypis. FrequentSubgraph Discovery. In Proceedings of the IEEEInternational Conference on Data Mining (ICDM),pages 313–320, 2001.

[19] Frank Manola and Eric Miller. RDF Primer: WorldWide Web Consortium Proposed Recommendation,February 2004. http://www.w3.org/TR/rdf-primer/.

[20] Sherif Sakr. GraphREL: A Decomposition-Based andSelectivity-Aware Relational Framework forProcessing Sub-graph Queries. In Proceedings of the14th International Conference on Database Systemsfor Advanced Applications (DASFAA), pages 123–137,2009.

[21] Sherif Sakr and Ghazi Al-Naymat. Graph Indexingand Querying: A Review. International Journal ofWeb Information Systems (IJWIS), 6(2):101–120,2010.

[22] Jens Teubner, Torsten Grust, Sebastian Maneth, andSherif Sakr. Dependable cardinality forecasts forXQuery. Proceedings of the VLDB Endowment(PVLDB), 1(1):463–477, 2008.

[23] Yuanyuan Tian and Jignesh M. Patel. TALE: A Toolfor Approximate Large Graph Matching. InProceedings of the 24th International Conference onData Engineering (ICDE), pages 963–972, 2008.

[24] Can Turker and Michael Gertz. Semantic integritysupport in SQL: 1999 and commercial(object-)relational database management systems.VLDB J., 10(4):241–269, 2001.

[25] Takashi Washio and Hiroshi Motoda. State of the artof graph-based data mining. SIGKDD Explorations,5(1):59–68, 2003.

[26] David W. Williams, Jun Huan, and Wei Wang. GraphDatabase Indexing Using Structured GraphDecomposition. In Proceedings of the 23rdInternational Conference on Data Engineering(ICDE), pages 976–985, 2007.

[27] Xifeng Yan and Jiawei Han. gSpan: Graph-BasedSubstructure Pattern Mining. In Proceedings of theIEEE International Conference on Data Mining(ICDM), pages 721–724, 2002.

[28] Xifeng Yan and Jiawei Han. CloseGraph: miningclosed frequent graph patterns. In Proceedings of the9th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining (KDD), pages286–295, 2003.

[29] Xifeng Yan, Philip S. Yu, and Jiawei Han. GraphIndexing: A Frequent Structure-based Approach. InProceedings of the ACM SIGMOD InternationalConference on Management of Data, pages 335–346,2004.

[30] Masatoshi Yoshikawa, Toshiyuki Amagasa, TakeyukiShimura, and Shunsuke Uemura. XRel: a path-basedapproach to storage and retrieval of XML documentsusing relational databases. ACM Transactions onInternet Technology (TOIT), 1(1):110–141, 2001.

[31] Shijie Zhang, Meng Hu, and Jiong Yang. TreePi: ANovel Graph Indexing Method. In Proceedings of the23rd International Conference on Data Engineering(ICDE), pages 966–975, 2007.

[32] Shuo Zhang, Jianzhong Li, Hong Gao, and ZhaonianZou. A novel approach for efficient supergraph query

processing on graph databases. In Proceedings of the12th International Conference on Extending DatabaseTechnology (EDBT), pages 204–215, 2009.

[33] Peixiang Zhao, Jeffrey Xu Yu, and Philip S. Yu.Graph Indexing: Tree + Delta >= Graph. InProceedings of the 33rd International Conference onVery Large Data Bases (VLDB), pages 938–949, 2007.

[34] Lei Zou, Lei Chen, Jeffrey Xu Yu, and Yansheng Lu.A novel spectral coding in a large graph database. InProceedings of the 11th International Conference onExtending Database Technology (EDBT), pages181–192, 2008.

an efﬁcient features-based processing technique for...

Documents