Mining Frequent Subgraphs

COMP 790-90 Seminar

Spring 2007

Overview

Introduction

Finding recurring subgraphs from graph databases.

gSpan

FFSM


Labeled Graph

We define a labeled graph G as a five-element tuple $G = \{V, E, \Sigma_V, \Sigma_E, \delta\}$ where

$V$ is the set of vertices of G,

$E \subseteq V \times V$ is the set of undirected edges of G,

$\Sigma_V$ ($\Sigma_E$) is the set of vertex (edge) labels,

$\delta$ is the labeling function, $V \to \Sigma_V$ and $E \to \Sigma_E$, that maps vertices and edges to their labels.
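The definition can be sketched in Python as follows. This is only an illustration: the class name `LabeledGraph`, the use of integer vertex ids, and the example triangle are assumptions, and the labeling function is stored as two dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph: vertices, edges, and the labeling function."""
    vertex_labels: dict = field(default_factory=dict)  # delta on V: id -> label
    edge_labels: dict = field(default_factory=dict)    # delta on E: {u, v} -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        # An undirected edge is an unordered pair, so (u, v) and (v, u) coincide.
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must be added before the edge")
        self.edge_labels[frozenset((u, v))] = label

    def label(self, *item):
        """delta: one argument gives a vertex label, two give an edge label."""
        if len(item) == 1:
            return self.vertex_labels[item[0]]
        return self.edge_labels[frozenset(item)]

    @property
    def vertex_label_set(self):  # Sigma_V
        return set(self.vertex_labels.values())

    @property
    def edge_label_set(self):    # Sigma_E
        return set(self.edge_labels.values())


# Hypothetical example: a triangle with vertex labels a, b, b and edge labels y, y, x.
g = LabeledGraph()
for v, lab in [(1, "a"), (2, "b"), (3, "b")]:
    g.add_vertex(v, lab)
g.add_edge(1, 2, "y")
g.add_edge(1, 3, "y")
g.add_edge(2, 3, "x")
print(g.label(1), g.label(3, 2))
```

Storing each edge under a `frozenset` keeps the edge set undirected without duplicating entries.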

[Figure: three example labeled graphs: (P) with vertices p1 to p5, (Q) with vertices q1 to q3, and (S) with vertices s1 to s3. Vertex labels are a, b, c, d; edge labels are x, y.]

Frequent Subgraph Mining

Input: A set GD of labeled undirected graphs.

[Figure: the input graphs (P), (Q), and (S).]

Support threshold: σ = 2/3

Output: All frequent subgraphs (w.r.t. σ) from GD.

[Figure: the frequent subgraphs of GD, built from vertex labels a, b and edge labels x, y.]

Finding Frequent Subgraphs

Given a graph database GD = {G0,G1,…,Gn}, find all subgraphs that appear in at least minSup graphs, where minSup is the minimum support threshold.

Isomorphic subgraphs are considered the same subgraph.
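To make the support count concrete, here is a brute-force sketch in Python. The graph encoding, the function names, and the vertex ids are assumptions made for illustration; graph P follows the adjacency matrix code given for (P) in these slides, while the shapes of Q (a triangle) and S (a path) are our reading of the example figure. The containment test tries every vertex mapping, so it is only usable on tiny graphs.

```python
from itertools import permutations

# A graph is (vertex_labels, edge_labels): {vertex: label} and {(u, v): label}.
P = ({1: "a", 2: "b", 3: "b", 4: "c", 5: "d"},
     {(1, 2): "y", (1, 3): "y", (2, 3): "x", (2, 4): "y", (3, 5): "y"})
Q = ({1: "a", 2: "b", 3: "b"}, {(1, 2): "y", (1, 3): "y", (2, 3): "x"})
S = ({1: "a", 2: "b", 3: "b"}, {(1, 2): "y", (1, 3): "y"})


def edge_label(graph, u, v):
    """Label of the undirected edge {u, v}, or None if it is absent."""
    edges = graph[1]
    return edges.get((u, v), edges.get((v, u)))


def occurs_in(pattern, graph):
    """Brute-force subgraph isomorphism test (exponential time)."""
    p_vertices = list(pattern[0])
    for image in permutations(graph[0], len(p_vertices)):
        mapping = dict(zip(p_vertices, image))
        labels_ok = all(pattern[0][v] == graph[0][mapping[v]] for v in p_vertices)
        edges_ok = all(edge_label(graph, mapping[u], mapping[v]) == lab
                       for (u, v), lab in pattern[1].items())
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, database):
    """Number of database graphs that contain the pattern."""
    return sum(occurs_in(pattern, g) for g in database)


GD = [P, Q, S]
min_sup = 2  # a threshold of 2/3 on three graphs
triangle = Q
print(support(triangle, GD), support(triangle, GD) >= min_sup)
```

Because every mapping is tried, isomorphic copies of a pattern are automatically treated as the same subgraph; practical miners avoid this exhaustive search.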

Apriori approaches

Generation of subgraph candidates is complicated and expensive.

Subgraph isomorphism is an NP-complete problem, so pruning is expensive.

gSpan

DFS without candidate generation

Relabels the graph representation to support DFS.

Discovers all frequent subgraphs without candidate generation or pruning.

DFS Representation

Map each graph to a DFS code (sequence).

Lexicographically order the codes.

Construct a search tree based on the lexicographic order.

Depth-First Search Tree

[Figure: four panels (a), (b), (c), (d) illustrating depth-first search trees of a graph.]

DFS Codes

Given $e_1 = (i_1, j_1)$ and $e_2 = (i_2, j_2)$, $e_1 < e_2$ if either:

(1) $i_1 = i_2$ and $j_1 < j_2$, or

(2) $i_1 < j_1$ and $j_1 = i_2$.

code(G,T) is the sequence of the edges of G ordered so that $e_i < e_{i+1}$.

edge  (b)          (c)          (d)
0     (0,1,X,a,Y)  (0,1,Y,a,X)  (0,1,X,a,X)
1     (1,2,Y,b,X)  (1,2,X,a,X)  (1,2,X,a,Y)
2     (2,0,X,a,X)  (2,0,X,b,Y)  (2,0,Y,b,X)
3     (2,3,X,c,Z)  (2,3,X,c,Z)  (2,3,Y,b,Z)
4     (3,1,Z,b,Y)  (3,0,Z,b,Y)  (3,0,Z,c,X)
5     (1,4,Y,d,Z)  (0,4,Y,d,Z)  (2,4,Y,d,Z)

DFS Lexicographic Order

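$\alpha = \mathrm{code}(G_\alpha, T_\alpha) = (a_0, a_1, \ldots, a_m)$

$\beta = \mathrm{code}(G_\beta, T_\beta) = (b_0, b_1, \ldots, b_n)$

$\alpha \le \beta$ iff (1) or (2) holds:

(1) $a_k = b_k$ for $0 \le k \le m$, and $n \ge m$

(2) $\exists t$, $0 \le t \le \min(m, n)$, such that $a_k = b_k$ for $k < t$ and $a_t \prec_e b_t$, where $\prec_e$ is the order on single edges

Minimum DFS code

The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G.

Graphs A and B are isomorphic if and only if min(A) = min(B).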
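The two conditions can be sketched directly in Python. The function name `dfs_lex_le` is hypothetical, and the edge order is passed in as a parameter; the demo below uses plain tuple comparison as an assumed stand-in for the edge order, which is simpler than the order gSpan actually defines on forward and backward edges.

```python
def dfs_lex_le(alpha, beta, edge_lt):
    """Return True iff alpha <= beta in DFS lexicographic order.

    alpha, beta: sequences of edges; edge_lt(a, b): strict order on single edges.
    """
    m, n = len(alpha) - 1, len(beta) - 1
    for t in range(min(m, n) + 1):
        if alpha[t] != beta[t]:
            # The first position where the codes differ decides the comparison.
            return edge_lt(alpha[t], beta[t])
    # No difference on the shared prefix: alpha <= beta iff alpha is no longer.
    return n >= m


def tuple_lt(e1, e2):
    """Assumed stand-in for the edge order: ordinary tuple comparison."""
    return e1 < e2


# DFS codes (b), (c), (d) from the DFS Codes table.
code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]
code_c = [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X"), (2, 0, "X", "b", "Y"),
          (2, 3, "X", "c", "Z"), (3, 0, "Z", "b", "Y"), (0, 4, "Y", "d", "Z")]
code_d = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X"),
          (2, 3, "Y", "b", "Z"), (3, 0, "Z", "c", "X"), (2, 4, "Y", "d", "Z")]

smallest = code_b
for candidate in (code_c, code_d):
    if dfs_lex_le(candidate, smallest, tuple_lt):
        smallest = candidate
print(smallest is code_d)
```

Under this stand-in order the three codes already differ in their first edge, so (d) is the smallest of the three.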

DFS Codes: Parents and Children

If $\alpha = (a_0, a_1, \ldots, a_m)$ and $\beta = (a_0, a_1, \ldots, a_m, b)$:

$\beta$ is the child of $\alpha$.

$\alpha$ is the parent of $\beta$.

A valid DFS code requires that b grows from a vertex on the rightmost path.
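A small sketch of the parent/child relation and the rightmost path, assuming a DFS code is a list of tuples whose first two fields are the DFS indices (i, j). The function names are hypothetical, and the growth check implements only the condition stated above (the new edge starts at a vertex on the rightmost path).

```python
def is_child(parent_code, child_code):
    """A child code is its parent's code plus exactly one extra edge."""
    return (len(child_code) == len(parent_code) + 1
            and list(child_code[:-1]) == list(parent_code))


def rightmost_path(code):
    """Vertices on the path from the DFS root to the rightmost vertex."""
    parent = {}
    for edge in code:
        i, j = edge[0], edge[1]
        if i < j:              # a forward edge discovers vertex j from vertex i
            parent[j] = i
    if not parent:
        return [0]
    v = max(parent)            # rightmost vertex: the last one discovered
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]


def grows_from_rightmost_path(code, new_edge):
    """Check that the new edge starts at a vertex on the rightmost path."""
    return new_edge[0] in rightmost_path(code)


# DFS code (b) from the DFS Codes table.
code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]
print(rightmost_path(code_b))  # [0, 1, 4]
```

Backward edges (i > j) are skipped when building the parent map because they do not belong to the DFS tree.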

DFS Code Trees

Organize DFS code nodes as parent-child.

Pre-order traversal follows DFS lexicographic order.

If s and s’ are the same graph with different DFS codes, s’ is not the minimum and can be pruned.

gSpan

D is the set of all graphs.

S is the result set.

Algorithm 1: GraphSet_Projection(D,S)

    sort labels in D by frequency
    remove infrequent vertices and edges
    relabel remaining vertices and edges
    S’ = all frequent 1-edge graphs in D
    sort S’ in DFS lexicographic order
    S = S’
    foreach edge e in S’ do
        s = graph defined by e
        s.D = graphs in D containing e
        Subgraph_Mining(D,S,s)
        D = D - e
        if |D| < minSup
            break

Subprocedure 1: Subgraph_Mining(D,S,s)

    if s != min(s)
        return
    S = S U {s}
    s’ = +1-edge children of s in s.D
    foreach child c in s’ do
        if support(c) ≥ minSup
            Subgraph_Mining(Ds,S,c)
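The first part of Algorithm 1 (finding and sorting the frequent 1-edge graphs) can be sketched as below. This is not the full gSpan procedure: the graph encoding and function names are assumptions, the shapes of Q and S are our reading of the example figure, and plain tuple order stands in for DFS lexicographic order.

```python
from collections import Counter

# A graph is (vertex_labels, edge_labels): {vertex: label} and {(u, v): label}.
P = ({1: "a", 2: "b", 3: "b", 4: "c", 5: "d"},
     {(1, 2): "y", (1, 3): "y", (2, 3): "x", (2, 4): "y", (3, 5): "y"})
Q = ({1: "a", 2: "b", 3: "b"}, {(1, 2): "y", (1, 3): "y", (2, 3): "x"})
S = ({1: "a", 2: "b", 3: "b"}, {(1, 2): "y", (1, 3): "y"})


def edge_types(graph):
    """Distinct labeled edges as (vertex_label, edge_label, vertex_label)."""
    vlabels, edges = graph
    types = set()
    for (u, v), lab in edges.items():
        lo, hi = sorted((vlabels[u], vlabels[v]))  # orientation does not matter
        types.add((lo, lab, hi))
    return types


def frequent_one_edge_graphs(database, min_sup):
    """Edge types contained in at least min_sup graphs, in sorted order."""
    counts = Counter()
    for graph in database:
        # A set is passed in, so each graph counts at most once per edge type.
        counts.update(edge_types(graph))
    return sorted(t for t, c in counts.items() if c >= min_sup)


print(frequent_one_edge_graphs([P, Q, S], min_sup=2))
```

Each surviving edge type would then seed one call to Subgraph_Mining; the edges to c and d are dropped here because they occur only in P.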

Runtime: Synthetic

[Figure: runtime (sec) on the synthetic dataset.]

Runtime: Chemical

[Figure: runtime (sec) versus support threshold (%) on the chemical dataset, comparing Apriori (FSG) and gSpan.]

gSpan Advantages

Lower memory requirements.

Faster than naïve FSG by an order of magnitude.

No candidate generation.

Lexicographic ordering minimizes the search tree.

False-positive pruning.

Any disadvantage?

FFSM: Fast Frequent Subgraph Mining -- An Overview:

How to solve the graph isomorphism problem?

A novel graph canonical form: CAM

How to tackle the subgraph isomorphism problem (NP-complete)?

Incrementally maintained embeddings

How to enumerate subgraphs?

An efficient data structure: the CAM tree

Two operations: CAM-join and CAM-extension.

Adjacency Matrix

Every diagonal entry of an adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of that vertex. Every off-diagonal entry in the lower triangle of M corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or zero if there is no edge. (For an undirected graph, the upper triangle is always a mirror of the lower triangle.)
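As a sketch, the lower triangle of such a matrix can be built from a labeled graph and a chosen vertex order. The function name and graph encoding are assumptions; the example graph is our reading of graph (P), with hypothetical vertex ids 1 to 5 that need not match p1 to p5.

```python
def adjacency_matrix(vertex_labels, edge_labels, order):
    """Lower-triangular adjacency matrix (list of rows) for a given vertex order."""
    rows = []
    for r, u in enumerate(order):
        row = []
        for v in order[:r]:
            # Edge label if the edge exists in either orientation, else "0".
            row.append(edge_labels.get((u, v), edge_labels.get((v, u), "0")))
        row.append(vertex_labels[u])  # diagonal entry: the vertex label
        rows.append(row)
    return rows


vertex_labels = {1: "a", 2: "b", 3: "b", 4: "c", 5: "d"}
edge_labels = {(1, 2): "y", (1, 3): "y", (2, 3): "x", (2, 4): "y", (3, 5): "y"}

for row in adjacency_matrix(vertex_labels, edge_labels, [1, 2, 3, 4, 5]):
    print(" ".join(row))
```

Changing the vertex order changes the matrix but not the graph, which is why a canonical choice among these matrices is needed.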

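[Figure: graph (P) with vertices p1 to p5, and three of its adjacency matrices M1, M2, M3 (lower triangles shown).]

M1:
a
y b
y x b
0 y 0 c
0 0 y 0 d

M2:
a
y b
y x b
0 0 y d
0 y 0 0 c

M3:
b
x b
y 0 d
0 y 0 c
y y 0 0 a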

Code

The code of an $n \times n$ adjacency matrix M is defined as the sequence of its lower triangular entries (including the diagonal entries) in the order: $M_{1,1}\, M_{2,1}\, M_{2,2}\, \ldots\, M_{n,1}\, M_{n,2}\, \ldots\, M_{n,n-1}\, M_{n,n}$
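In Python, assuming a matrix is stored as a list of lower-triangular rows with single-character labels, the code is a plain row-by-row concatenation. The function names are hypothetical; M1 is the matrix whose code these slides give as aybyxb0y0c00y0d.

```python
def code(rows):
    """Concatenate lower-triangular rows: M[1][1] M[2][1] M[2][2] ... M[n][n]."""
    return "".join(entry for row in rows for entry in row)


def rows_from_code(text):
    """Inverse of code() for single-character labels."""
    rows, start = [], 0
    while start < len(text):
        width = len(rows) + 1          # row k holds k entries
        row = list(text[start:start + width])
        if len(row) != width:
            raise ValueError("code length is not a triangular number")
        rows.append(row)
        start += width
    return rows


M1 = [["a"],
      ["y", "b"],
      ["y", "x", "b"],
      ["0", "y", "0", "c"],
      ["0", "0", "y", "0", "d"]]
print(code(M1))
```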

Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a

The Canonical Adjacency Matrix (CAM) is the one that produces the maximal code, using lexicographic order.
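A brute-force sketch of the canonical form: try every vertex order and keep the largest code. The symbol order is an assumption. The comparison Code(M2) > Code(M3) implies that a ranks above b; the rest of the ranking (a > b > c > d for vertices, y > x > 0 for edges) is a guess that happens to reproduce Code(M1) for graph (P). Function names and vertex ids are hypothetical.

```python
from itertools import permutations

# Assumed symbol order, lowest to highest rank.
RANK = {symbol: rank for rank, symbol in enumerate("0xydcba")}


def code_for_order(vertex_labels, edge_labels, order):
    """Code of the adjacency matrix induced by the given vertex order."""
    out = []
    for r, u in enumerate(order):
        for v in order[:r]:
            out.append(edge_labels.get((u, v), edge_labels.get((v, u), "0")))
        out.append(vertex_labels[u])
    return "".join(out)


def sort_key(code):
    """Lexicographic comparison under the assumed symbol order."""
    return [RANK[ch] for ch in code]


def canonical_code(vertex_labels, edge_labels):
    """Maximal code over all vertex orders (factorial time; tiny graphs only)."""
    return max((code_for_order(vertex_labels, edge_labels, order)
                for order in permutations(vertex_labels)), key=sort_key)


vertex_labels = {1: "a", 2: "b", 3: "b", 4: "c", 5: "d"}
edge_labels = {(1, 2): "y", (1, 3): "y", (2, 3): "x", (2, 4): "y", (3, 5): "y"}
print(canonical_code(vertex_labels, edge_labels))
```

FFSM avoids this exhaustive search; the sketch only shows what "maximal code" means.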

MP Submatrix

For an $m \times m$ matrix A, an $n \times n$ matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last nonzero entry from A.

[Figure: matrices M1 through M6 illustrating successive MP submatrices.]

We define a CAM to be connected iff the corresponding graph is connected.

Theorem I: A CAM’s MP submatrix is a CAM.

Theorem II: A connected CAM’s MP submatrix is connected.
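The following sketch applies the MP submatrix operation repeatedly to the matrix with code aybyxb0y0c00y0d. It rests on an interpretation that the slides do not spell out: "the last nonzero entry" is taken to be the last edge entry of the last row, and when that entry was the row's only edge the whole row (vertex) is dropped. Function names are hypothetical.

```python
def mp_submatrix(rows):
    """Remove the last edge entry; drop the last row if it loses its only edge."""
    if len(rows) <= 1:
        raise ValueError("a single-vertex matrix has no MP submatrix")
    result = [list(row) for row in rows]
    last = result[-1]
    edge_positions = [i for i, e in enumerate(last[:-1]) if e != "0"]
    if not edge_positions:
        raise ValueError("last row has no edge entry")
    if len(edge_positions) == 1:
        result.pop()                      # the vertex would become isolated
    else:
        last[edge_positions[-1]] = "0"    # erase only the last edge
    return result


def code(rows):
    return "".join(entry for row in rows for entry in row)


matrix = [["a"],
          ["y", "b"],
          ["y", "x", "b"],
          ["0", "y", "0", "c"],
          ["0", "0", "y", "0", "d"]]

chain = [code(matrix)]
while len(matrix) > 1:
    matrix = mp_submatrix(matrix)
    chain.append(code(matrix))
print(chain)
```

Each step removes exactly one edge, so the chain walks from the full matrix down to a single vertex.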

CAM Tree: Subgraphs

[Figure: the CAM tree of graph (P), whose nodes are the CAMs of the subgraphs of (P).]

CAM Tree: Frequent Subgraphs

[Figure: the CAM tree restricted to the frequent subgraphs of the graphs (P), (Q), and (S), with σ = 2/3.]

How to Enumerate Nodes in a CAM Tree?

Two operations to explore the CAM tree:

CAM-Join

CAM-Extension

Augmenting the CAM tree with suboptimal CAMs

Objectives:

No false dismissals

No redundancy

Plus: we want to do this efficiently!

Suboptimal Tree

[Figure: the suboptimal CAM tree of graph (P); tree edges are marked j or e, presumably for CAM-join and CAM-extension.]

We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.
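The definition can be checked by brute force on small matrices. The sketch below makes two assumptions: the symbol order (a > b > c > d, y > x > 0) is a guess consistent with the code comparison shown earlier, and the MP submatrix drops the last row when it removes that row's only edge. All function names are hypothetical.

```python
from itertools import permutations

RANK = {symbol: rank for rank, symbol in enumerate("0xydcba")}  # assumed order


def rows_from_code(text):
    """Split a code string into lower-triangular rows (single-character labels)."""
    rows, start = [], 0
    while start < len(text):
        rows.append(list(text[start:start + len(rows) + 1]))
        start += len(rows)   # len(rows) now equals the width of the row just added
    return rows


def is_cam(rows):
    """True iff no vertex reordering yields a larger code (factorial time)."""
    def entry(i, j):  # symmetric lookup into the lower triangle
        return rows[i][j] if i >= j else rows[j][i]

    def ranked_code(order):
        return [RANK[entry(u, v)] for r, u in enumerate(order) for v in order[:r + 1]]

    own = ranked_code(tuple(range(len(rows))))
    return all(ranked_code(p) <= own for p in permutations(range(len(rows))))


def mp_submatrix(rows):
    """Remove the last edge entry; drop the last row if it loses its only edge."""
    result = [list(row) for row in rows]
    last = result[-1]
    edge_positions = [i for i, e in enumerate(last[:-1]) if e != "0"]
    if len(edge_positions) == 1:
        result.pop()
    else:
        last[edge_positions[-1]] = "0"
    return result


def is_suboptimal_cam(rows):
    """A matrix is a suboptimal CAM iff its MP submatrix is a CAM."""
    return is_cam(mp_submatrix(rows))


for text in ("aybyxb0y0c00y0d", "aybyxb0y0d00y0c", "aybyxb00yd0y00c"):
    rows = rows_from_code(text)
    print(text, is_cam(rows), is_suboptimal_cam(rows))
```

Under these assumptions the second matrix is suboptimal without being a CAM, which is the kind of extra node the suboptimal CAM tree adds.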

Summary

Theorem: For a graph G, let $C_{K-1}$ ($C_K$) be the set of the suboptimal CAMs of all the size $K-1$ ($K$) subgraphs of G ($K \ge 2$). Every member of $C_K$ can be enumerated unambiguously either by joining two members of $C_{K-1}$ or by extending a member of $C_{K-1}$.

Experimental Study

Predictive Toxicology Evaluation Competition (PTE)

Contains 337 compounds.

Each graph contains 27 nodes and 27 edges on average.

NIH DTP Anti-Viral Screen Test (DTP CA/CM)

Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI).

We formed a dataset containing the CA (423) and CM (1083) compounds.

Each graph contains 25 nodes and 27 edges on average.

Performance (PTE)

[Figure: two performance plots for PTE, each with Support Threshold (%) on the horizontal axis.]

Performance (DTP CA/CM)

[Figure: two performance plots for DTP CA/CM, each with Support Threshold (%) on the horizontal axis.]
