33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna

Towards Graph Containment Search and Indexing

Chen Chen 1, Xifeng Yan 2, Philip S. Yu 2, Jiawei Han 1, Dong-Qing Zhang 3, Xiaohui Gu 2

1 University of Illinois at Urbana-Champaign
2 IBM T. J. Watson Research Center
3 Thomson Research

Problem Definition

Given a graph database D = {g1, …, gn} and a graph query q, one can formulate two basic search problems:

(1) (traditional) graph search: find all graphs gi in D s.t. q is a subgraph of gi.

GraphGrep: D. Shasha, J. T.-L. Wang, and R. Giugno. PODS 2002.
gIndex: X. Yan, P. S. Yu, and J. Han. SIGMOD 2004.
C-Tree: H. He and A. K. Singh. ICDE 2006.
Tree+Δ: P. Zhao, J. X. Yu, and P. S. Yu. VLDB 2007.

(2) graph containment search: find all graphs gi in D s.t. q is a supergraph of gi.

Applications

Chem-Informatics

Pattern Recognition

Cyber Security (Virus Signature Detection)

Information Management (User-interest Mapping)

Example

Preliminary Definitions

Subgraph Isomorphism: For two labeled graphs g and g’, a subgraph isomorphism is an injective function f : V(g) -> V(g’), s.t.

∀v ∈ V(g), l(v) = l’(f(v));

∀(u, v) ∈ E(g), (f(u), f(v)) ∈ E(g’) and l(u, v) = l’(f(u), f(v)).

f is called an embedding of g in g’.

Subgraph and Supergraph: If there exists an embedding of g in g’, then g is a subgraph of g’, denoted by g ⊆ g’, and g’ is a supergraph of g.
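The definition above can be checked directly by brute force. The following is a minimal sketch, not the method used in the talk: it assumes simple undirected graphs without self-loops, represented as a pair of dictionaries (vertex labels, edge labels) with string labels. The names `find_embedding` and `is_subgraph` and the graph encoding are hypothetical choices made for illustration.

```python
from itertools import permutations

def find_embedding(g, h):
    """Return an embedding of g in h as a dict, or None if none exists.

    Assumed representation: a graph is (vlabel, elabel) where
      vlabel maps vertex -> label and
      elabel maps frozenset({u, v}) -> label (undirected, no self-loops).
    """
    g_vlabel, g_elabel = g
    h_vlabel, h_elabel = h
    g_nodes = list(g_vlabel)
    # Enumerate every injective assignment of g's vertices to h's vertices.
    for image in permutations(h_vlabel, len(g_nodes)):
        f = dict(zip(g_nodes, image))
        # Vertex labels must be preserved: l(v) = l'(f(v)).
        if any(g_vlabel[v] != h_vlabel[f[v]] for v in g_nodes):
            continue
        # Every edge of g must map to an equally labeled edge of h.
        edges_ok = True
        for edge, label in g_elabel.items():
            u, v = tuple(edge)
            mapped = frozenset({f[u], f[v]})
            if mapped not in h_elabel or h_elabel[mapped] != label:
                edges_ok = False
                break
        if edges_ok:
            return f
    return None

def is_subgraph(g, h):
    """True if g ⊆ h, i.e. some embedding of g in h exists."""
    return find_embedding(g, h) is not None

# Toy labeled graphs: a single C-O edge and a C-C-O path.
edge_co = ({1: "C", 2: "O"}, {frozenset({1, 2}): "single"})
path_cco = (
    {"a": "C", "b": "C", "c": "O"},
    {frozenset({"a", "b"}): "single", frozenset({"b", "c"}): "single"},
)
print(find_embedding(edge_co, path_cco))  # {1: 'b', 2: 'c'}
```

The enumeration is exponential in the worst case, which is why avoiding these tests is the goal of the indexing scheme.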

Feature-based Indexing Methodology

Naïve solution (SCAN): Examine the database D sequentially and compare each graph gi with the query graph q to decide whether q ⊇ gi.

Each comparison is a subgraph isomorphism test, and the subgraph isomorphism problem is NP-complete.

Feature-based indexing: Similar model graphs gi and gj are likely to have similar isomorphism testing results w.r.t. the same query graph. Let f be a common substructure shared by gi and gj. If f ⊈ q, then gi ⊈ q and gj ⊈ q. Therefore, one test on f saves the isomorphism tests on gi and gj.

Select a feature set F from the graph database D. If a feature f ∈ F is not a subgraph of q, then the graphs having f as a subgraph are pruned.

Basic Framework

Off-line index construction: Generate and select a feature set F from the graph database D. For feature f ∈ F, Df = {g|f ⊆ g, g ∈ D}, which can be represented by an inverted list over D.

Search: Test the indexed features in F against the query q, which returns all f ⊈ q, and compute the candidate query answer set, Cq = D − ∪f Df (f ⊈ q, f ∈ F).

Verification: Check each graph g in the candidate set Cq to see whether g is really a subgraph of q.
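The three steps can be sketched as follows. This is an illustration under a strong simplifying assumption: each graph is modeled as a frozenset of edge identifiers, so the subgraph test degenerates to set inclusion (the `subset_test` stand-in). A real system would plug in a subgraph isomorphism test instead. All function and variable names are hypothetical.

```python
def build_index(database, features, is_subgraph):
    """Off-line step: inverted list D_f = {g in D : f ⊆ g} per feature."""
    return {
        f_name: {g_name for g_name, g in database.items() if is_subgraph(f, g)}
        for f_name, f in features.items()
    }

def containment_search(query, database, features, index, is_subgraph):
    """Return the names of all database graphs g with g ⊆ query."""
    # Search: a feature NOT contained in the query prunes every graph
    # that contains that feature.
    pruned = set()
    for f_name, f in features.items():
        if not is_subgraph(f, query):
            pruned |= index[f_name]
    candidates = set(database) - pruned  # the candidate set C_q
    # Verification: one real containment test per surviving candidate.
    return {name for name in candidates if is_subgraph(database[name], query)}

def subset_test(small, big):
    """Toy stand-in for subgraph isomorphism (assumption of this sketch)."""
    return small <= big

D = {"ga": frozenset("abcd"), "gb": frozenset("abc"), "gc": frozenset("a")}
F = {"f1": frozenset("a"), "f2": frozenset("b"),
     "f3": frozenset("c"), "f4": frozenset("d")}

index = build_index(D, F, subset_test)
q = frozenset("abce")  # contains f1, f2, f3 but not f4, so ga is pruned
print(sorted(containment_search(q, D, F, index, subset_test)))  # ['gb', 'gc']
```

Here only two verification tests are run instead of three, because the single failed test on f4 removes ga from the candidate set.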

Cost Model

Search Time Formula:

\[ T = \sum_{f \in F} T(f, q) + \sum_{g \in C_q} T(g, q) + T_{index} \]

The two sums amount to |F| + |Cq| isomorphism tests; T_index is negligible.

Feature Graph Matrix

      ga   gb   gc
f1    1    1    1
f2    1    1    0
f3    1    1    0
f4    1    0    0

Feature Generation

Good features should be frequent, but not too frequent, in the database.
frequent: the feature indexes more graphs in the database
too frequent: the feature is simple and easily contained in the query graph, so it rarely prunes anything

Use a frequent subgraph mining algorithm, e.g. gSpan [1], to generate an initial set of frequent subgraphs.

Feature Selection

Given a set of queries {q1, q2, …, qr}, an optimal index should be able to maximize the total gain from naïve SCAN:

\[ J_{total} = r|D| - \sum_{l=1}^{r} |C_{q_l}| - r|F| = \sum_{l=1}^{r} \Big| \bigcup_{f \in F,\ f \not\subseteq q_l} D_f \Big| - r|F| \]

Feature Selection

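For each query, take a copy of the feature graph matrix and set the i-th row to 0 if the query has feature fi as its subgraph (such a feature cannot prune anything for that query).

Concatenate the per-query feature graph matrices to form a global matrix. Each fi covers a set of columns, so feature selection becomes Maximum Coverage with Cost.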
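The construction of the global matrix can be sketched as below. The feature graph matrix is taken from the Feature Graph Matrix slide; the query containment pattern `contained_in_query` is invented for illustration (in practice it would come from isomorphism tests of each feature against each query), and `global_matrix` is a hypothetical helper name.

```python
# Rows f1..f4, columns ga, gb, gc (from the Feature Graph Matrix slide).
feature_graph = [
    [1, 1, 1],  # f1
    [1, 1, 0],  # f2
    [1, 1, 0],  # f3
    [1, 0, 0],  # f4
]

# Assumed example: contained_in_query[l][i] is True when feature i is a
# subgraph of query l. Three queries q1, q2, q3.
contained_in_query = [
    [True, False, True, False],    # q1 contains f1 and f3
    [False, False, False, False],  # q2 contains no feature
    [True, True, False, False],    # q3 contains f1 and f2
]

def global_matrix(feature_graph, contained_in_query):
    """Concatenate one masked copy of the feature graph matrix per query."""
    rows = [[] for _ in feature_graph]
    for contained in contained_in_query:
        for i, row in enumerate(feature_graph):
            # A feature contained in the query cannot prune for that query,
            # so its row is zeroed in that query's block.
            rows[i].extend([0] * len(row) if contained[i] else row)
    return rows

M = global_matrix(feature_graph, contained_in_query)
for name, row in zip(["f1", "f2", "f3", "f4"], M):
    print(name, row)
```

Each column of `M` stands for one (query, graph) pair, and a 1 in row i means that feature fi prunes that graph for that query.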

Maximum Coverage with Cost

Given a set of subsets S = {S1, S2, …, Sm} of the universal set U = {1, 2, …, n} and a cost parameter λ associated with any Si ∈ S, find a subset T of S such that |∪Si∈T Si| − λ|T| is maximized.

Set cover reduces to this problem, so it is NP-hard (its decision version is NP-complete).

Greedy heuristic method. In each iteration:
Select the row i with the most non-zero entries in the global matrix M.
Set the j-th column to 0 for every j with Mij = 1.

Note that selecting a row is associated with a cost r, so stop iterating when no row has more than r non-zero entries.

The greedy heuristic achieves an approximation ratio of 1 − 1/e.

Algorithm: cIndex-Basic

Input: Graph matrix M over r queries
Output: Selected features F

F = ∅;
while ∃i, ∑j Mij > r do
    select row i with the most non-zero entries in M;
    F = F ∪ {fi};
    for each column j s.t. Mij is not zero do
        delete column j from M;
    delete row i;
return F;
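A runnable rendering of the pseudocode might look like this. It is a sketch that tracks deleted rows and columns with index sets instead of mutating the matrix, and it breaks ties by the smallest row index, which the slides do not specify. The example matrix `M` is an assumed 4-feature, 3-query, 3-graph global matrix, not data from the talk.

```python
def cindex_basic(M, r):
    """Greedy feature selection on a 0/1 global matrix M (list of rows).

    Returns the indices of the selected rows, in selection order.
    """
    active_cols = set(range(len(M[0]))) if M else set()
    remaining = set(range(len(M)))
    selected = []
    while remaining:
        # Non-zero entries of each remaining row over undeleted columns.
        cover = {i: [j for j in active_cols if M[i][j]] for i in remaining}
        # Ties go to the smallest row index (assumption of this sketch).
        best = max(sorted(remaining), key=lambda i: len(cover[i]))
        if len(cover[best]) <= r:
            break  # no row has more than r non-zero entries
        selected.append(best)
        active_cols -= set(cover[best])  # delete the covered columns
        remaining.remove(best)           # delete the selected row
    return selected

# Assumed global matrix: rows f1..f4, three query blocks of three columns.
M = [
    [0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
]
print(cindex_basic(M, 3))  # [1]
print(cindex_basic(M, 1))  # [1, 2]
```

With r = 3 only f2 (row index 1) covers more than three columns, so the loop stops after one selection. Lowering the cost to 1 lets f3 be selected as well.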

Complexity

Time Complexity: O(|F0||D||r|), where F0 is the initial feature set, and |D| and |r| can be reduced by sampling and clustering on the graph database and the queries.

Space Complexity: Using a compact matrix reduces the space complexity from O(|F0||D||r|) to O(|F0||D| + |F0||r|).

      q1   q2   q3
f1    0    3    0
f2    2    2    0
f3    0    2    2
f4    1    1    1
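One way to read the compact representation, offered here as an assumption rather than as the authors' data structure, is to keep the |F0| × |D| feature graph matrix once together with an |F0| × r table recording which features each query contains. Any entry of the global matrix, and the per-query counts in the table above, can then be derived on demand. The containment pattern below is chosen so that the counts match that table; it is not stated in the slides.

```python
# |F0| x |D| feature graph matrix (rows f1..f4, columns ga, gb, gc).
feature_graph = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
]

# Assumed containment pattern: contained_in_query[l][i] is True when
# feature i is a subgraph of query l.
contained_in_query = [
    [True, False, True, False],    # q1
    [False, False, False, False],  # q2
    [True, True, False, False],    # q3
]

def global_entry(i, l, j):
    """Entry of the global matrix for feature i, query l, graph j."""
    return 0 if contained_in_query[l][i] else feature_graph[i][j]

def per_query_counts():
    """Number of graphs feature i prunes for query l, for all i and l."""
    return [
        [0 if contained[i] else sum(row) for contained in contained_in_query]
        for i, row in enumerate(feature_graph)
    ]

print(per_query_counts())  # [[0, 3, 0], [2, 2, 0], [0, 2, 2], [1, 1, 1]]
```

Only the two small tables are stored, so memory grows with |F0||D| + |F0||r| rather than with their product.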

Hierarchical Indexing Models

The cIndex-Basic algorithm builds a flat index structure, where each feature is tested sequentially and deterministically against every input query.

A hierarchical index may improve performance. Two variants:
Bottom-up
Top-down

cIndex-BottomUp

Build the index layer by layer, starting from the bottom-level graphs. The first-level index L1 is built on the original graph database by cIndex-Basic.

The features in L1 can be regarded as another graph database, on which cIndex-Basic can be executed again to form the second-level index L2.

Disadvantage: high-level features are simple and easily contained by queries.

cIndex-TopDown

Select the feature fi that covers the most columns in the global graph matrix M.

Divide the queries into two groups: those that contain fi and those that do not. Divide M into two parts according to the query groups.

Run the above steps recursively on the new matrices, until a group holds only a small number of queries (to avoid overfitting).
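The recursion can be sketched as a small tree builder. Several details are assumptions of this sketch and not taken from the slides: the stopping threshold `min_queries`, tie-breaking by feature order, dropping the graphs already pruned by fi in the "does not contain fi" branch, and the dictionary-based tree layout. The input tables are the same assumed toy data used for illustration, not data from the talk.

```python
feature_graph = [
    [1, 1, 1],  # f1 over ga, gb, gc
    [1, 1, 0],  # f2
    [1, 1, 0],  # f3
    [1, 0, 0],  # f4
]
contained_in_query = [
    [True, False, True, False],    # q1
    [False, False, False, False],  # q2
    [True, True, False, False],    # q3
]

def build_topdown(queries, graphs, features, min_queries=1):
    """Return a decision tree of features, or None for a leaf."""
    if len(queries) <= min_queries or not features:
        return None  # too few queries left: stop to avoid overfitting

    def coverage(i):
        # Columns (query, graph) of the current sub-matrix covered by i.
        lacking = sum(1 for l in queries if not contained_in_query[l][i])
        holding = sum(1 for j in graphs if feature_graph[i][j])
        return lacking * holding

    best = max(features, key=coverage)
    if coverage(best) == 0:
        return None
    rest = [i for i in features if i != best]
    with_f = [l for l in queries if contained_in_query[l][best]]
    without_f = [l for l in queries if not contained_in_query[l][best]]
    # Queries lacking the feature have its graphs pruned already.
    survivors = [j for j in graphs if not feature_graph[best][j]]
    return {
        "feature": best,
        "contains": build_topdown(with_f, graphs, rest, min_queries),
        "lacks": build_topdown(without_f, survivors, rest, min_queries),
    }

tree = build_topdown([0, 1, 2], [0, 1, 2], [0, 1, 2, 3])
print(tree)
```

At query time one would walk the tree from the root, testing only the features on the path selected by the query, instead of testing every feature as the flat index does.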

Experiments

Thank You!