33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna

Towards Graph Containment Search and Indexing

Chen Chen1, Xifeng Yan2, Philip S. Yu2, Jiawei Han1, Dong-Qing Zhang3, Xiaohui Gu2

1 University of Illinois at Urbana-Champaign

2 IBM T. J. Watson Research Center

3 Thomson Research

Problem Definition

Given a graph database D = {g1, … gn}, and a graph query q, one could formulate two basic search problems:

(1) (traditional) graph search: find all graphs gi in D s.t. q is a subgraph of gi.

GraphGrep: D. Shasha, J. T.-L. Wang, and R. Giugno. PODS 2002.
gIndex: X. Yan, P. S. Yu, and J. Han. SIGMOD 2004.
C-Tree: H. He and A. K. Singh. ICDE 2006.
Tree+Δ: P. Zhao, J. X. Yu, and P. S. Yu. VLDB 2007.

(2) graph containment search: find all graphs gi in D s.t. q is a supergraph of gi


Applications

Chem-Informatics

Pattern Recognition

Cyber Security (Virus Signature Detection)

Information Management (User-interest Mapping)


Example


Preliminary Definitions

Subgraph Isomorphism: For two labeled graphs g and g’, a subgraph isomorphism is an injective function f : V(g) -> V(g’), s.t.

∀v ∈ V(g), l(v) = l’(f(v));

∀(u, v) ∈ E(g), (f(u), f(v)) ∈ E(g’) and l(u, v) = l’(f(u), f(v)).

f is called an embedding of g in g’.

Subgraph and Supergraph: If there exists an embedding of g in g’, then g is a subgraph of g’, denoted by g ⊆ g’, and g’ is a supergraph of g.
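The definition above can be checked directly by brute force. The following is a minimal sketch, not the method used in the talk: it tries every injective vertex map and tests the two label conditions. The graph encoding (a pair of dictionaries) and the names `find_embedding`, `small`, and `large` are assumptions made for illustration, and the exponential running time is only acceptable for tiny graphs.

```python
from itertools import permutations

def find_embedding(g, h):
    """Return an embedding of g in h as a dict, or None if none exists.

    Illustrative encoding (an assumption, not from the slides): a graph is a
    pair (vertex_labels, edge_labels), where vertex_labels maps vertex -> label
    and edge_labels maps frozenset({u, v}) -> label for undirected edges.
    Edge labels are assumed to be non-None.
    """
    g_vlabels, g_elabels = g
    h_vlabels, h_elabels = h
    g_vertices = list(g_vlabels)
    # Enumerate every injective map f from V(g) into V(h).
    for image in permutations(h_vlabels, len(g_vertices)):
        f = dict(zip(g_vertices, image))
        # Condition 1: vertex labels are preserved.
        if any(g_vlabels[v] != h_vlabels[f[v]] for v in g_vertices):
            continue
        # Condition 2: every edge of g maps to an edge of h with the same label.
        if all(h_elabels.get(frozenset(f[v] for v in edge)) == label
               for edge, label in g_elabels.items()):
            return f
    return None

# Toy graphs: a C-O edge, and a C-C-O path.
small = ({0: "C", 1: "O"}, {frozenset({0, 1}): "single"})
large = ({"a": "C", "b": "C", "c": "O"},
         {frozenset({"a", "b"}): "single", frozenset({"b", "c"}): "single"})

print(find_embedding(small, large))  # an embedding exists, so small ⊆ large
print(find_embedding(large, small))  # None: large has more vertices than small
```

Because an empty graph yields the empty dictionary, callers should compare the result with `None` rather than rely on truthiness.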


Feature-based Indexing Methodology

Naïve solution (SCAN): Examine the database D sequentially and compare each graph gi with the query graph q to decide whether q ⊇ gi.

The subgraph isomorphism problem is NP-complete, so every such comparison is expensive.

Feature-based indexing: Similar model graphs gi and gj are likely to have similar isomorphism testing results w.r.t. the same query graph. Let f be a common substructure shared by gi and gj. If f ⊄ q, then gi ⊄ q and gj ⊄ q. Therefore, one test on f saves the isomorphism tests on gi and gj.

Select a feature set F from graph database D. If feature f ∈ F is not a subgraph of q, then the graphs having f as subgraph are pruned.


Basic Framework

Off-line index construction: Generate and select a feature set F from the graph database D. For feature f ∈ F, Df = {g|f ⊆ g, g ∈ D}, which can be represented by an inverted list over D.

Search: Test the indexed features in F against the query q, which returns all f ⊄ q, and compute the candidate query answer set, Cq = D – ∪f Df (f ⊄ q, f ∈ F).

Verification: Check each graph g in the candidate set Cq to see whether g is really a subgraph of q.
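The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: graphs are modeled as sets of edge names and `contains` uses plain set inclusion as a stand-in for a real subgraph isomorphism test. The names `build_index`, `containment_search`, `contains`, and the toy data are hypothetical.

```python
def contains(big, small):
    # Stand-in for "small is a subgraph of big" (assumption: graphs are
    # modeled as sets of edge names, so containment is set inclusion).
    return small <= big

def build_index(db, features):
    """Off-line step: inverted list D_f for each feature f."""
    return {fid: {gid for gid, g in db.items() if contains(g, f)}
            for fid, f in features.items()}

def containment_search(q, db, features, index):
    """Return (answers, candidates) for the containment query q."""
    # Search: a feature that is NOT contained in q prunes every graph
    # that contains that feature.
    pruned = set()
    for fid, f in features.items():
        if not contains(q, f):
            pruned |= index[fid]
    candidates = set(db) - pruned
    # Verification: keep only the candidates that really are subgraphs of q.
    answers = {gid for gid in candidates if contains(q, db[gid])}
    return answers, candidates

db = {"ga": {"ab", "bc", "cd"}, "gb": {"ab", "bc"}, "gc": {"ab"}}
features = {"f1": {"ab"}, "f2": {"bc"}, "f3": {"ab", "bc"}, "f4": {"cd"}}
index = build_index(db, features)

answers, candidates = containment_search({"ab", "bc", "xy"}, db, features, index)
print(sorted(candidates), sorted(answers))
```

In this toy run only f4 is missing from the query, so only the graph containing f4 is pruned before verification.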


Cost Model

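Search Time Formula:

$$T = \sum_{f \in F} T(f, q) + \sum_{g \in C_q} T(g, q) + T_{index}$$

The two sums are proportional to |F| + |Cq|; the index lookup time T_index is negligible.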


Feature Graph Matrix

      ga   gb   gc
f1    1    1    1
f2    1    1    0
f3    1    1    0
f4    1    0    0


Feature Generation

Good features should be frequent, but not too frequent in the database.

Frequent: the feature indexes more graphs in the database.

Too frequent: the feature is simple and easily contained by the query graph, so it rarely prunes anything.

Use frequent subgraph mining algorithms, e.g. gSpan[1], to generate an initial set of frequent subgraphs.


Feature Selection

Given a set of queries {q1, q2, …, qr}, an optimal index should be able to maximize the total gain from naïve SCAN:

$$J_{total} = r|D| - \sum_{l=1}^{r} |C_{q_l}| - r|F| = \sum_{l=1}^{r} \Big| \bigcup_{f \in F,\ f \not\subseteq q_l} D_f \Big| - r|F|$$
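As a small worked example of the gain, the sketch below counts, for each query, the graphs pruned by the selected features that the query does not contain, and then subtracts one feature test per selected feature per query. The function name `total_gain`, the argument layout, and the toy numbers are assumptions for illustration only.

```python
def total_gain(inverted, selected, missing_per_query):
    """Gain of an index over SCAN.

    inverted: feature -> set of database graph ids containing it (D_f).
    selected: the chosen feature set F.
    missing_per_query: one set per query, holding the features that are
        NOT subgraphs of that query (these are the features that prune).
    """
    r = len(missing_per_query)
    saved = 0
    for missing in missing_per_query:
        pruned = set()
        for f in selected:
            if f in missing:
                pruned |= inverted[f]
        saved += len(pruned)          # isomorphism tests avoided for this query
    return saved - r * len(selected)  # minus the cost of testing every feature

inverted = {"f1": {1, 2, 3, 4, 5}, "f2": {4, 5, 6}, "f3": {6}}
missing_per_query = [{"f1"}, {"f1", "f2"}, set()]

for F in (["f1"], ["f1", "f2"], ["f1", "f2", "f3"]):
    print(F, total_gain(inverted, F, missing_per_query))
```

With these numbers, adding f2 and f3 prunes almost nothing new, so the per-query testing cost outweighs the benefit and the gain drops as the feature set grows.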


Feature Selection

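For each query, set the i-th row of its feature-graph matrix to 0 if the query has feature fi as its subgraph, since fi then prunes nothing for that query.

Concatenate the per-query feature-graph matrices to form a global matrix. Each fi covers a set of columns, so feature selection becomes Maximum Coverage with Cost: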


Maximum Coverage with Cost

Given a set of subsets S = {S1, S2, …, Sm} of the universal set U = {1, 2, …, n} and a cost parameter λ associated with any Si ∈ S, find a subset T of S such that |∪_{Si ∈ T} Si| - λ|T| is maximized.

Can be reduced from set cover, and therefore is NP-complete.

Greedy heuristic method, in each iteration: Select the row i with the largest number of non-zero entries in the global matrix M, then set the j-th column to 0 for every j with Mij = 1.

Note that selecting a row is associated with a cost r, so stop iterating when no row has more than r non-zero entries.

The greedy heuristic achieves an approximation ratio of 1-1/e.


Algorithm: cIndex-Basic

Input: Graph Matrix M over r queries
Output: Selected Features F

F = ∅;
while ∃i, ∑j Mij > r do
    select row i with most non-zero entries in M;
    F = F ∪ {fi};
    for each column j s.t. Mij is not zero do
        delete column j from M;
    delete row i;
return F;
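A minimal Python sketch of the greedy loop above, assuming the global matrix is given in sparse form as a mapping from each feature to the set of column indices where its row is non-zero. The name `cindex_basic` and the toy input are illustrative, not taken from the talk.

```python
def cindex_basic(rows, r):
    """Greedy feature selection.

    rows: feature -> set of columns of the global matrix M that the
          feature covers (the non-zero entries of its row).
    r:    number of queries, i.e. the cost of selecting one feature.
    """
    remaining = {f: set(cols) for f, cols in rows.items()}  # work on a copy
    selected = []
    while remaining:
        # Row with the most non-zero entries (ties: first in dict order).
        best = max(remaining, key=lambda f: len(remaining[f]))
        if len(remaining[best]) <= r:
            break                      # no row is worth its cost any more
        covered = remaining.pop(best)  # "delete row i"
        selected.append(best)
        for cols in remaining.values():
            cols -= covered            # "delete column j" for covered columns
    return selected

rows = {"f1": {0, 1, 2, 3, 4}, "f2": {1, 4, 5, 6}, "f3": {0}}
print(cindex_basic(rows, 2))
print(cindex_basic(rows, 1))
```

After f1 is chosen, f2 covers only two new columns, so it is selected when the cost is 1 but rejected when the cost is 2.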


Complexity

Time Complexity: O(|F0||D||r|), where |D| and |r| can be reduced by sampling and clustering on graph database and queries.

Space Complexity: Use a compact matrix, reduce the space complexity from O(|F0||D||r|) to O(|F0||D| + |F0||r|)

      q1   q2   q3
f1    0    3    0
f2    2    2    0
f3    0    2    2
f4    1    1    1


Hierarchical Indexing Models

The cIndex-Basic algorithm builds a flat index structure, where each feature is tested sequentially and deterministically against any input queries.

A hierarchical index may improve the performance. Two variants: Bottom-up and Top-down.


cIndex-BottomUp

Build the index layer by layer, starting from the bottom-level graphs. The first-level index L1 is built on the original graph database by cIndex-Basic.

The features in L1 can be regarded as another graph database, where cIndex-Basic can be executed again to form second-level index L2.

Disadvantage: high-level features are simple and easily contained by queries.


cIndex-TopDown

Select feature fi that covers most columns in global graph matrix M.

Divide queries into two groups: contain fi and do not contain fi. Divide M into two parts according to query groups.

Run the above steps recursively on new matrices, until we reach a small number of queries in a group (to avoid overfitting).
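The recursion above might be sketched as below. Several details are assumptions not stated on the slide: the sparse inputs (`has`, `inverted`, `alive`), the stopping rule that a feature must cover more columns than the number of queries in the group (borrowed from the cost argument of cIndex-Basic), and the dictionary layout of the resulting tree. The name `build_topdown` is hypothetical.

```python
def build_topdown(group, has, inverted, alive, min_queries=2):
    """Recursively build a decision tree of features.

    group:    list of training query ids in this node.
    has:      query id -> set of features that are subgraphs of the query.
    inverted: feature -> set of database graph ids containing it (D_f).
    alive:    query id -> graph ids not yet pruned for that query.
    """
    if len(group) < min_queries:        # small group: stop to avoid overfitting
        return None

    def coverage(f):
        # Columns (query, graph) of this node's matrix that f would prune.
        return sum(len(inverted[f] & alive[q]) for q in group if f not in has[q])

    best = max(inverted, key=coverage)
    if coverage(best) <= len(group):    # assumed cost: one test per query
        return None
    yes = [q for q in group if best in has[q]]
    no = [q for q in group if best not in has[q]]
    alive = dict(alive)                 # do not mutate the caller's mapping
    for q in no:
        alive[q] = alive[q] - inverted[best]
    return {"feature": best,
            "contained": build_topdown(yes, has, inverted, alive, min_queries),
            "not_contained": build_topdown(no, has, inverted, alive, min_queries)}

inverted = {"f1": {1, 2, 3, 4}, "f2": {3, 4, 5, 6, 7, 8}, "f3": {8}}
has = {"q1": {"f1", "f2"}, "q2": {"f1", "f2"}, "q3": set(), "q4": {"f2"}}
alive = {q: set(range(1, 9)) for q in has}
tree = build_topdown(list(has), has, inverted, alive)
print(tree)
```

In this toy run the root tests f1; queries that do not contain f1 are then split again on f2, while the group that contains f1 finds no feature worth its cost.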


Experiments


Thank You!
