Mining Frequent Subgraphs
COMP 790-90 Seminar
Spring 2007
Overview
Introduction
Finding recurring subgraphs from graph databases.
gSpan
FFSM
Labeled Graph
We define a labeled graph G as a five-element tuple G = (V, E, Σ_V, Σ_E, δ), where
V is the set of vertices of G,
E ⊆ V × V is a set of undirected edges of G,
Σ_V (Σ_E) is the set of vertex (edge) labels,
δ is the labeling function, δ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels.
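The definition above maps fairly directly onto a small data structure. The following is only a sketch: the class name `LabeledGraph`, its methods, and the integer vertex ids are illustrative choices, and the triangle built at the end is an assumed reading of the example graph (Q).

```python
from dataclasses import dataclass, field

@dataclass
class LabeledGraph:
    """Undirected labeled graph; the two dicts together play the role of δ."""
    vertex_labels: dict = field(default_factory=dict)  # vertex -> label in Σ_V
    edge_labels: dict = field(default_factory=dict)    # {u, v} -> label in Σ_E

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added first")
        # A frozenset key makes (u, v) and (v, u) the same undirected edge.
        self.edge_labels[frozenset((u, v))] = label

    def label(self, *item):
        """One argument looks up a vertex label, two arguments an edge label."""
        if len(item) == 1:
            return self.vertex_labels[item[0]]
        return self.edge_labels[frozenset(item)]

# Assumed reading of graph (Q): one vertex a, two vertices b, edges y, y, x.
q = LabeledGraph()
for vertex, lab in [(1, "a"), (2, "b"), (3, "b")]:
    q.add_vertex(vertex, lab)
for u, v, lab in [(1, 2, "y"), (1, 3, "y"), (2, 3, "x")]:
    q.add_edge(u, v, lab)
```

Storing each edge under a `frozenset` key keeps the graph undirected without duplicating entries.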
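[Figure: three labeled graphs (P), (Q), and (S). (P) has vertices p1 to p5 with labels a, b, b, c, d; (Q) has vertices q1 to q3 and (S) has vertices s1 to s3, each with labels a, b, b. Edges carry the labels x and y.]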
Frequent Subgraph Mining
Support threshold: σ = 2/3
[Figure: the graph database GD, consisting of the labeled graphs (P), (Q), and (S).]
Input: A set GD of labeled undirected graphs
Output: All frequent subgraphs (w.r.t. σ) from GD.
[Figure: the frequent subgraphs of (P), (Q), and (S).]
Finding Frequent Subgraphs
Given a graph database GD = {G0, G1, …, Gn}, find all subgraphs appearing in at least a fraction σ of the graphs.
Isomorphic subgraphs are considered the same subgraph.
Apriori approaches
Generation of subgraph candidates is complicated and expensive.
Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
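To make the support definition concrete, here is a brute-force sketch that counts how many database graphs contain a pattern. It is not how gSpan or FFSM test containment: it simply tries every injective vertex mapping, which is only workable for tiny graphs. The helper names (`graph`, `contains`, `support`), the vertex ids, and the edge lists for (P), (Q), and (S) are assumptions reconstructed from the example figures.

```python
from fractions import Fraction
from itertools import permutations

def graph(labels, edge_list):
    """Pack vertex labels and (u, v, label) edges into one structure."""
    return labels, {frozenset((u, v)): lab for u, v, lab in edge_list}

def contains(g, pattern):
    """True if pattern occurs in g as a (non-induced) labeled subgraph."""
    g_labels, g_edges = g
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Brute force: try every injective mapping of pattern vertices into g.
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_labels[v] != g_labels[m[v]] for v in p_nodes):
            continue
        if all(g_edges.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_edges.items()):
            return True
    return False

def support(pattern, db):
    """Fraction of graphs in db that contain the pattern."""
    return Fraction(sum(contains(g, pattern) for g in db), len(db))

# Assumed edge lists for the example graphs (vertex ids are arbitrary).
P = graph({1: "a", 2: "b", 3: "b", 4: "c", 5: "d"},
          [(1, 2, "y"), (1, 3, "y"), (2, 3, "x"), (2, 4, "y"), (3, 5, "y")])
Q = graph({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (1, 3, "y"), (2, 3, "x")])
S = graph({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (1, 3, "y")])
GD = [P, Q, S]
SIGMA = Fraction(2, 3)

edge_bc = graph({1: "b", 2: "c"}, [(1, 2, "y")])
frequent = [name for name, pat in [("Q", Q), ("S", S), ("b-y-c", edge_bc)]
            if support(pat, GD) >= SIGMA]
```

Under these assumptions the triangle (Q) and the path (S) meet σ = 2/3, while the single edge b-y-c, which occurs only in (P), does not.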
gSpan
DFS without candidate generation
Relabels graph representation to support DFS.
Discovers all frequent subgraphs without candidate generation or pruning.
DFS Representation
Map each graph to a DFS code (sequence).
Lexicographically order the codes.
Construct a search tree based on the lexicographic order.
Depth-First Search Tree
[Figure: four panels (a) to (d), showing a graph and depth-first search trees over it.]
DFS Codes
Given e1 = (i1, j1) and e2 = (i2, j2), e1 < e2 if:
  i1 = i2 and j1 < j2, or
  i1 < j1 and j1 = i2
code(G, T) = the edge sequence (ei) with ei < ei+1
DFS codes of (b), (c), and (d); each entry is (i, j, label of vertex i, edge label, label of vertex j):
edge  (b)          (c)          (d)
0     (0,1,X,a,Y)  (0,1,Y,a,X)  (0,1,X,a,X)
1     (1,2,Y,b,X)  (1,2,X,a,X)  (1,2,X,a,Y)
2     (2,0,X,a,X)  (2,0,X,b,Y)  (2,0,Y,b,X)
3     (2,3,X,c,Z)  (2,3,X,c,Z)  (2,3,Y,b,Z)
4     (3,1,Z,b,Y)  (3,0,Z,b,Y)  (3,0,Z,c,X)
5     (1,4,Y,d,Z)  (0,4,Y,d,Z)  (2,4,Y,d,Z)
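The three codes in the table come from different DFS trees, yet they should describe one and the same graph. The sketch below rebuilds a graph from each code and checks isomorphism by brute force over vertex permutations. The function names `graph_from_code` and `isomorphic` are illustrative, and the permutation search is a stand-in for a real isomorphism test that is only practical for graphs this small.

```python
from itertools import permutations

CODES = {
    "b": [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")],
    "c": [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X"), (2, 0, "X", "b", "Y"),
          (2, 3, "X", "c", "Z"), (3, 0, "Z", "b", "Y"), (0, 4, "Y", "d", "Z")],
    "d": [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X"),
          (2, 3, "Y", "b", "Z"), (3, 0, "Z", "c", "X"), (2, 4, "Y", "d", "Z")],
}

def graph_from_code(code):
    """Rebuild (vertex labels, edge labels) from a DFS code."""
    labels, edges = {}, {}
    for i, j, li, lij, lj in code:
        # Every mention of a vertex must carry the same label.
        if labels.setdefault(i, li) != li or labels.setdefault(j, lj) != lj:
            raise ValueError("inconsistent vertex label in code")
        edges[frozenset((i, j))] = lij
    return labels, edges

def isomorphic(g, h):
    """Brute-force labeled graph isomorphism for small graphs."""
    (gl, ge), (hl, he) = g, h
    if len(gl) != len(hl) or len(ge) != len(he):
        return False
    nodes = sorted(gl)
    for image in permutations(sorted(hl)):
        m = dict(zip(nodes, image))
        if all(gl[v] == hl[m[v]] for v in nodes) and all(
                he.get(frozenset(m[v] for v in e)) == lab
                for e, lab in ge.items()):
            return True
    return False

graphs = {name: graph_from_code(code) for name, code in CODES.items()}
same_graph = all(isomorphic(graphs["b"], graphs[k]) for k in ("c", "d"))
```

This is the situation the minimum DFS code resolves: many codes per graph, one of which is chosen as the canonical label.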
DFS Lexicographic Order
α = code(Gα, Tα) = (a0, a1, …, am)
β = code(Gβ, Tβ) = (b0, b1, …, bn)
α ≤ β iff (1) or (2):
(1) ∃t, 0 ≤ t ≤ min(m, n), such that ak = bk for k < t, and at <e bt
(2) ak = bk for 0 ≤ k ≤ m, and n ≥ m
Minimum DFS code
The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G.
Graphs A and B are isomorphic if and only if min(A) = min(B).
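Conditions (1) and (2) amount to a prefix-aware sequence comparison, which can be sketched as follows. The edge order itself is passed in as the parameter `edge_less`, because the slides do not spell it out. The plain tuple comparison used in the example is only a placeholder and is not claimed to be gSpan's actual edge order; the function and variable names are also illustrative.

```python
def dfs_code_leq(alpha, beta, edge_less):
    """Return True if alpha <= beta under conditions (1) and (2)."""
    for a, b in zip(alpha, beta):
        if a != b:
            # Condition (1): the first differing position t decides.
            return edge_less(a, b)
    # Condition (2): alpha is a prefix of beta (or equal to it).
    return len(alpha) <= len(beta)

def placeholder_edge_less(a, b):
    """Stand-in edge order: plain tuple comparison (an assumption)."""
    return a < b

code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X")]
code_d = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X")]

d_before_b = dfs_code_leq(code_d, code_b, placeholder_edge_less)
prefix_first = dfs_code_leq(code_b[:2], code_b, placeholder_edge_less)
```

Condition (2) is what makes a parent code sort before all of its children, which is why pre-order traversal of the code tree follows this order.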
DFS Codes: Parents and Children
If α = (a0, a1, …, am) and β = (a0, a1, …, am, b):
β is the child of α.
α is the parent of β.
A valid DFS code requires that b grows from a vertex on the rightmost path.
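The rightmost path can be read directly off a DFS code. A minimal sketch, assuming vertices are numbered in discovery order (so the rightmost vertex is the one with the largest index) and that an edge (i, j, …) with i < j is a forward edge; the function name `rightmost_path` is illustrative.

```python
def rightmost_path(code):
    """Vertices on the rightmost path of a DFS code, root first."""
    parent = {}
    for i, j, *_ in code:
        if i < j:              # forward edge: j was discovered from i
            parent[j] = i
    v = max(parent, default=0)  # rightmost vertex = last one discovered
    path = [v]
    while v in parent:          # walk back up to the root (vertex 0)
        v = parent[v]
        path.append(v)
    return path[::-1]

# Code (b) from the DFS code table.
code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]
path_b = rightmost_path(code_b)   # [0, 1, 4]
```

For code (b) the last discovered vertex is 4, reached from 1, which was reached from 0, so a child code may only add an edge that grows from vertex 0, 1, or 4.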
DFS Code Trees
Organize DFS code nodes as parent-child.
Pre-order traversal follows DFS lexicographic order.
If s and s’ are the same graph with different DFS codes, s’ is not the minimum and can be pruned.
gSpan
D is the set of all graphs. S is the result set.
Algorithm 1: GraphSet_Projection(D, S)
  sort labels in D by frequency
  remove infrequent vertices and edges
  relabel remaining vertices and edges
  S’ = all frequent 1-edge graphs in D
  sort S’ in DFS lexicographic order
  S = S’
  foreach edge e in S’ do
    s = graph defined by e
    s.D = subgraphs in D containing e
    Subgraph_Mining(D, S, s)
    D = D - e
    if |D| < minSup
      break
Subprocedure 1: Subgraph_Mining(D, S, s)
  if s != min(s)
    return
  S = S ∪ {s}
  s’ = +1-edge children of s in s.D
  foreach child c of s’ do
    if support(c) ≥ minSup
      Subgraph_Mining(Ds, S, c)
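The first steps of GraphSet_Projection (count, drop the infrequent, collect the frequent 1-edge graphs) can be sketched on the example database. This is not a gSpan implementation: relabeling, DFS lexicographic sorting, and the recursive mining are left out, and the final `sorted` call uses plain tuple order. The function name, the vertex ids, and the edge lists for (P), (Q), and (S) are assumptions reconstructed from the example figures.

```python
from collections import Counter

def frequent_one_edge_graphs(db, min_sup):
    """Return the (vertex label, edge label, vertex label) triples that
    occur in at least min_sup graphs of db."""
    counts = Counter()
    for labels, edges in db:
        seen = set()
        for u, v, lab in edges:
            # Undirected edge: put the endpoint labels in a fixed order.
            lu, lv = sorted((labels[u], labels[v]))
            seen.add((lu, lab, lv))
        counts.update(seen)    # each graph counts at most once per triple
    return sorted(t for t, c in counts.items() if c >= min_sup)

# Assumed edge lists for the example graphs (vertex ids are arbitrary).
P = ({1: "a", 2: "b", 3: "b", 4: "c", 5: "d"},
     [(1, 2, "y"), (1, 3, "y"), (2, 3, "x"), (2, 4, "y"), (3, 5, "y")])
Q = ({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (1, 3, "y"), (2, 3, "x")])
S = ({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (1, 3, "y")])

one_edge = frequent_one_edge_graphs([P, Q, S], min_sup=2)
```

With minSup = 2 (that is, σ = 2/3 of three graphs), the edges to c and d are dropped because they occur only in (P).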
Runtime: Synthetic
Runtime: Chemical
[Figures: runtime (sec) for Apriori (FSG) and gSpan; the surviving axis is Support Threshold (%), from 0 to 30.]
gSpan Advantages
Lower memory requirements.
Faster than naïve FSG by an order of magnitude.
No candidate generation.
Lexicographic ordering minimizes search tree.
False positives pruning.
Any disadvantage?
FFSM: Fast Frequent Subgraph Mining -- An Overview
How to solve the graph isomorphism problem?
A novel graph canonical form: CAM
How to tackle the subgraph isomorphism problem (NP-complete)?
Incrementally maintained embeddings
How to enumerate subgraphs?
An efficient data structure: CAM tree
Two operations: CAM-join, CAM-extension.
Adjacency Matrix
Every diagonal entry of adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of this vertex. Every off-diagonal entry in the lower triangle part of M corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or zero if there is no edge. (For an undirected graph, the upper triangle is always a mirror of the lower triangle.)
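A lower-triangular labeled adjacency matrix depends on the order in which the vertices are listed. The sketch below builds one for a given order. The function name `adjacency_matrix`, the vertex ids 1 to 5, and the edge list for graph (P) are assumptions chosen so that the identity order reproduces matrix M1 of the example.

```python
def adjacency_matrix(order, labels, edges):
    """Lower-triangular labeled adjacency matrix for a vertex order.

    Row k holds k off-diagonal entries followed by the vertex label."""
    index = {v: k for k, v in enumerate(order)}
    rows = [["0"] * (k + 1) for k in range(len(order))]
    for k, v in enumerate(order):
        rows[k][k] = labels[v]          # diagonal: vertex label
    for u, v, lab in edges:
        r, c = max(index[u], index[v]), min(index[u], index[v])
        rows[r][c] = lab                # below the diagonal: edge label
    return rows

# Assumed encoding of graph (P); vertex ids are arbitrary.
P_LABELS = {1: "a", 2: "b", 3: "b", 4: "c", 5: "d"}
P_EDGES = [(1, 2, "y"), (1, 3, "y"), (2, 3, "x"), (2, 4, "y"), (3, 5, "y")]

m1 = adjacency_matrix([1, 2, 3, 4, 5], P_LABELS, P_EDGES)
m2 = adjacency_matrix([1, 2, 3, 5, 4], P_LABELS, P_EDGES)
for row in m1:
    print(" ".join(row))
```

Swapping the last two vertices in the order gives a different matrix for the same graph, which is why a canonical choice among the matrices is needed.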
[Figure: graph (P) and three of its adjacency matrices; lower triangles shown.]
M1           M2           M3
a            a            b
y b          y b          x b
y x b        y x b        y 0 d
0 y 0 c      0 0 y d      0 y 0 c
0 0 y 0 d    0 y 0 0 c    y y 0 0 a
Code
A code of an n × n adjacency matrix M is defined as the sequence of its lower triangular entries (including the diagonal entries) in the order: M1,1 M2,1 M2,2 … Mn,1 Mn,2 … Mn,n-1 Mn,n
[Figure: matrices M1, M2, and M3.]
Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a
The Canonical Adjacency Matrix (CAM) of a graph is the adjacency matrix that produces the maximal code, using lexicographic order.
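The code of a matrix and the maximal code of a graph can be sketched as follows. Two things are assumptions here: the symbol order a > b > c > d > x > y > 0 (chosen because it is consistent with the inequalities stated for M1, M2, and M3), and the encoding of graph (P) with vertex ids 1 to 5. The search over all vertex permutations is a brute-force stand-in and not the way FFSM finds CAMs.

```python
from itertools import permutations

# Assumed symbol order: a > b > c > d > x > y > 0.
RANK = {s: r for r, s in enumerate("0yxdcba")}

def code(order, labels, edges):
    """Row-by-row lower-triangular code for the given vertex order."""
    index = {v: k for k, v in enumerate(order)}
    entry = {frozenset((index[u], index[v])): lab for u, v, lab in edges}
    out = []
    for r, v in enumerate(order):
        out.extend(entry.get(frozenset((r, c)), "0") for c in range(r))
        out.append(labels[v])           # diagonal entry closes the row
    return "".join(out)

def canonical_code(labels, edges):
    """Maximal code over all vertex orders (brute force, small graphs only)."""
    return max((code(p, labels, edges) for p in permutations(labels)),
               key=lambda s: [RANK[ch] for ch in s])

P_LABELS = {1: "a", 2: "b", 3: "b", 4: "c", 5: "d"}
P_EDGES = [(1, 2, "y"), (1, 3, "y"), (2, 3, "x"), (2, 4, "y"), (3, 5, "y")]

code_m1 = code([1, 2, 3, 4, 5], P_LABELS, P_EDGES)
code_m2 = code([1, 2, 3, 5, 4], P_LABELS, P_EDGES)
code_m3 = code([3, 2, 5, 4, 1], P_LABELS, P_EDGES)
best = canonical_code(P_LABELS, P_EDGES)
```

Under the assumed symbol order, the maximal code over all 120 vertex orders of (P) coincides with Code(M1).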
MP Submatrix
For an m × m matrix A, an n × n matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last non-zero entry from A.
[Figure: example matrices M1 to M6.]
We define a CAM to be connected iff the corresponding graph is connected.
Theorem I: A CAM’s MP submatrix is a CAM.
Theorem II: A connected CAM’s MP submatrix is connected.
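One way to read the MP submatrix definition is sketched below. The interpretation is an assumption, not something the slide states in full: the "last non-zero entry" is taken to be the last edge entry in the last row, and when that was the row's only edge entry the whole row (the vertex together with its label) is dropped. The function name `mp_submatrix` and the list-of-rows representation are illustrative.

```python
def mp_submatrix(rows):
    """MP submatrix of a lower-triangular matrix given as a list of rows.

    Returns None for a 1 x 1 matrix, which has nothing left to remove."""
    if len(rows) <= 1:
        return None
    last = rows[-1]
    # Columns of the edge entries in the last row (diagonal excluded).
    edge_cols = [c for c, x in enumerate(last[:-1]) if x != "0"]
    head = [list(r) for r in rows[:-1]]
    if len(edge_cols) <= 1:
        return head                 # the last vertex disappears with its edge
    new_last = list(last)
    new_last[edge_cols[-1]] = "0"   # erase only the last edge entry
    return head + [new_last]

triangle = [["a"], ["y", "b"], ["y", "x", "b"]]
chain = [triangle]
while chain[-1] is not None:
    chain.append(mp_submatrix(chain[-1]))
chain.pop()   # drop the trailing None
```

Starting from the 3 × 3 triangle matrix, repeated application first erases the x entry, then drops a row at a time down to the single vertex a.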
CAM Tree: Subgraphs
[Figure: CAM tree of the subgraphs of graph (P).]
CAM Tree: Frequent Subgraphs
[Figure: CAM tree of the frequent subgraphs of (P), (Q), and (S) for σ = 2/3.]
How to Enumerate Nodes in a CAM Tree?
Two operations to explore the CAM tree:
CAM-Join
CAM-Extension
Augmenting the CAM tree with suboptimal CAMs
Objectives:
no false dismissals
no redundancy
Plus: we want to do this efficiently!
Suboptimal Tree
[Figure: suboptimal CAM tree for graph (P), with tree edges marked j or e.]
We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.
Summary
Theorem: For a graph G, let C_(K-1) (C_K) be the set of the suboptimal CAMs of all the size-(K-1) (size-K) subgraphs of G (K ≥ 2). Every member of C_K can be enumerated unambiguously either by joining two members of C_(K-1) or by extending a member of C_(K-1).
Experimental Study
Predictive Toxicology Evaluation Competition (PTE)
Contains 337 compounds.
Each graph contains 27 nodes and 27 edges on average.
NIH DTP Anti-Viral Screen Test (DTP CA/CM)
Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI).
We formed a dataset containing CA (423) and CM (1083).
Each graph contains 25 nodes and 27 edges on average.
Performance (PTE)
[Figure: two plots, each with Support Threshold (%) on the x-axis.]
Performance (DTP CACM)
[Figure: two plots, each with Support Threshold (%) on the x-axis.]