33 rd international conference on very large data bases, sep. 2007, vienna towards graph containment...
Post on 21-Dec-2015
213 views
TRANSCRIPT
33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Towards Graph Containment Search and Indexing
Chen Chen1, Xifeng Yan2, Philip S. Yu2, Jiawei Han1, Dong-Qing Zhang3, Xiaohui Gu2
1 University of Illinois at Urbana-Champaign2 IBM T. J. Watson Research Center
3 Thomson Research
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Problem Definition
Given a graph database D = {g1, … gn}, and a graph query q, one could formulate two basic search problems:
(1) (traditional) graph search: find all graphs gi in D s.t. q is a subgraph of gi.
GraphGrep: D. Shasha, J. T.-L. Wang, and R. Giugno. PODS 2002. gIndex: X. Yan, P. S. Yu, and J. Han. SIGMOD 2004. C-Tree: H. He and A. K. Singh. ICDE 2006. Tree+Δ: P. Zhao, J. X. Yu, and P. S. Yu. VLDB 2007.
(2) graph containment search: find all graphs gi in D s.t. q is a supergraph of gi
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Applications
Chem-Informatics
Pattern Recognition
Cyber Security (Virus Signature Detection)
Information Management (User-interest Mapping)
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Example
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Preliminary Definitions
Subgraph Isomorphism: For two labeled graphs g and g’, a subgraph isomorphism is an i
njective functionf : V(g) -> V(g’), s.t.
∀v ∈ V(g), l(v) = l’(f(v));
∀(u, v) ∈ E(g), (f(u), f(v)) ∈ E(g’) and l(u, v) = l’(f(u), f(v)). f is called an embedding of g in g’
Subgraph and Supergraph If there exists an embedding of g in g’, then g is a subgraph of g’,
denoted by g ⊆ g’, and g’ is a supergraph of g.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature-based Indexing Methodology
Naïve solution (SCAN): Examines the database D sequentially and compares each grap
h gi with the query graph q to decide whether q ⊇ gi.
Subgraph isomorphism problem is NP-complete.
Feature-based indexing: Similar model graphs gi and gj are likely to have similar isomorphi
sm testing results w.r.t. the same query graph. Let f be a common substructure shared by gi and gj. If f ⊆ q, then
gi ⊆ q and gj ⊆ q. Therefore, we can save on isomorphism test.
Select a feature set F from graph database D. If feature f ∈ F is not a subgraph of q, then the graphs having f as subgraph are pruned.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Basic Framework
Off-line index construction: Generate and select a feature set F from the graph database D. For feature f ∈ F, Df = {g|f ⊆ g, g ∈ D}, which can be represented by an inverted list over D.
Search: Test indexed features in F against the query q which returns all f ⊆ q, and compute the candidate query answer set, Cq = D – ∪f Df (f ⊆ q, f ∈ F).
Verification: Check each graph g in the candidate set Cq to see whether g is really a subgraph of q.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Cost Model
Search Time Formula:
|F| + |Cq| (negligible)
Ff Cg
index
q
TqgTqfT ),(),(
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Graph Matrix
ga gb gc
f1 1 1 1
f2 1 1 0
f3 1 1 0
f4 1 0 0
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Generation
Good features should be frequent, but not too frequent in the database. frequent: index more graphs in database too frequent: simple and easy to be contained by query graph
Use frequent subgraph mining algorithms, e.g. gSpan[1], to generate an initial set of frequent subgraphs.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Selection
Given a set of queries {q1, q2, …, qr}, an optimal index should be able to maximize the total gain from naïve SCAN:
||||
|)|(||
,
1
1
FrD
FrCqDrJ
fFf
r
l
qf
r
l
ltotal
l
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Feature Selection
Set i-th row to 0 if the query has feature fi as its subgraph.
Concatenate feature graph matrix to form a global matrix. fi covers a set of columns -> Maximum Coverage with Cost:
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Maximum Coverage with Cost
Given a set of subsets S = {S1, S2, …, Sm} of the universal set U = {1, 2, …, n} and a cost parameter λassociated with any Si ∈ S, find a subset T of S such that |∪Si ∈TSi| - λ|T| is maximized.
Can be reduced from set cover, and therefore is NP-complete.
Greedy heuristic method, in each iteration: Select a row i with the most # of non-zero entries from global mat
rix M. Set j-th column to 0 if Mij = 1
Note that selecting a row is associate with a cost r, so stop the iteration if no rows have more than r non-zero entries.
The greedy heuristic achieves an approximation ratio of 1-1/e.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Algorithm: cIndex-Basic
Input: Graph Matrix M over r queries Output: Selected Features F.
1: F = ;∅ 2: while ∃i, ∑j Mij > r do
3: select row i with most non-zero entries in M; 4: F = F {∪ fi};
5: for each column j s.t. Mij is not zero do
6: delete column j from M 7: delete row i; 8: return F;
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Complexity
Time Complexity: O(|F0||D||r|), where |D| and |r| can be reduced by sampling and clustering
on graph database and queries.
Space Complexity: Use a compact matrix, reduce the space complexity from O(|F0||D||r|) to
O(|F0||D| + |F0||r|)
q1 q2 q3
f1 0 3 0
f2 2 2 0
f3 0 2 2
f4 1 1 1
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Hierarchical Indexing Models
The cIndex-Basic algorithm builds a flat index structure, where each feature is tested sequentially and deterministically against any input queries.
Hierarchical index may improve the performance: Bottom-up Top-down
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
cIndex-BottomUp
Build index layer by layer staring from the bottom-level graphs. The first-level index L1 is built on the original graph database by cIndex-Basic.
The features in L1 can be regarded as another graph database, where cIndex-Basic can be executed again to form second-level index L2.
Disadvantage: high-level features are simple and easy to be contained be queries.
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
cIndex-TopDown
Select feature fi that covers most columns in global graph matrix M.
Divide queries into two groups: contain fi and do not contain fi. Divide M into two parts according to query groups.
Run the above steps recursively on new matrices, until we reach a small number of queries in a group (to avoid overfitting).
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Experiments
Chen Chen, Xifeng Yan, Philip S. Yu, Jiawei Han, Dong-Qing Zhang, Xiaohui Gu.33rd International Conference on Very Large Data Bases, Sep. 2007, Vienna
Thank You!