
Page 1: Mining Frequent Subgraphs

Mining Frequent Subgraphs

COMP 790-90 Seminar

Spring 2007

Page 2: Mining Frequent Subgraphs

Overview

Introduction

Finding recurring subgraphs from graph databases.

gSpan

FFSM

Page 3: Mining Frequent Subgraphs

Labeled Graph

We define a labeled graph G as a five-element tuple G = (V, E, Σ_V, Σ_E, δ), where

V is the set of vertices of G,

E ⊆ V × V is a set of undirected edges of G,

Σ_V (Σ_E) is the set of vertex (edge) labels,

δ is the labeling function, δ: V → Σ_V and E → Σ_E, which maps vertices and edges to their labels.
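As a minimal sketch, the five-element tuple can be held in a small Python class. The class name `LabeledGraph`, its method names, and the vertex ids `v1` to `v3` are illustrative assumptions, not part of the slides; the triangle built at the end is only an example in the style of graph (Q).

```python
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """Undirected labeled graph: vertices, edges, and their labels."""
    vertex_labels: dict = field(default_factory=dict)  # vertex id -> label
    edge_labels: dict = field(default_factory=dict)    # frozenset({u, v}) -> label

    def add_vertex(self, v, label):
        self.vertex_labels[v] = label

    def add_edge(self, u, v, label):
        # Edges are undirected, so the endpoint pair is stored as a frozenset.
        if u == v:
            raise ValueError("self-loops are not supported in this sketch")
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("add both endpoints before adding the edge")
        self.edge_labels[frozenset((u, v))] = label

    def label(self, item):
        """Labeling function: accepts a vertex id or an edge given as (u, v)."""
        if isinstance(item, tuple):
            return self.edge_labels[frozenset(item)]
        return self.vertex_labels[item]

    def neighbors(self, v):
        """Sorted list of vertices adjacent to v."""
        return sorted(next(iter(e - {v})) for e in self.edge_labels if v in e)


# Illustrative triangle: one vertex labeled a, two labeled b.
g = LabeledGraph()
for vid, lab in [("v1", "a"), ("v2", "b"), ("v3", "b")]:
    g.add_vertex(vid, lab)
g.add_edge("v1", "v2", "y")
g.add_edge("v1", "v3", "y")
g.add_edge("v2", "v3", "x")
```

Storing each edge under a `frozenset` key makes lookups independent of endpoint order, which matches the undirected definition.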

[Figure: three example labeled graphs, (P) with vertices p1 to p5, (Q) with vertices q1 to q3, and (S) with vertices s1 to s3. Vertex labels are a, b, c, d; edge labels are x, y.]

Page 4: Mining Frequent Subgraphs

Frequent Subgraph Mining

Input: A set GD of labeled undirected graphs.

[Figure: the input graphs (P), (Q), and (S).]

Support threshold: σ = 2/3.

Output: All frequent subgraphs (w.r.t. σ) from GD.

[Figure: the frequent subgraphs of GD for σ = 2/3.]

Page 5: Mining Frequent Subgraphs

Finding Frequent Subgraphs

Given a graph database GD = {G0, G1, …, Gn} and a support threshold σ, find all subgraphs that appear in at least a fraction σ of the graphs.

Isomorphic subgraphs are considered the same subgraph.

Apriori approaches

Generation of subgraph candidates is complicated and expensive.

Subgraph isomorphism is an NP-complete problem, so pruning is expensive.
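To make the cost concrete, here is a brute-force sketch of support counting. It tries every injective mapping of pattern vertices into a graph, which is exponential in the pattern size. The helper names `make_graph`, `contains`, and `support`, and the three toy graphs, are assumptions for illustration and are not the graphs from the slides.

```python
from itertools import permutations


def make_graph(vertex_labels, edges):
    """vertex_labels: dict id -> label; edges: iterable of (u, v, label)."""
    return vertex_labels, {frozenset((u, v)): lab for u, v, lab in edges}


def contains(graph, pattern):
    """True if pattern occurs in graph as a (not necessarily induced) subgraph."""
    g_vl, g_el = graph
    p_vl, p_el = pattern
    p_vertices = list(p_vl)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_vl, len(p_vertices)):
        m = dict(zip(p_vertices, image))
        if any(p_vl[v] != g_vl[m[v]] for v in p_vertices):
            continue  # a vertex label does not match
        # Every pattern edge must map to a graph edge with the same label.
        if all(g_el.get(frozenset(m[v] for v in e)) == lab
               for e, lab in p_el.items()):
            return True
    return False


def support(database, pattern):
    """Fraction of graphs in the database that contain the pattern."""
    return sum(contains(g, pattern) for g in database) / len(database)


# Toy database (illustrative only).
db = [
    make_graph({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (1, 3, "y"), (2, 3, "x")]),
    make_graph({1: "a", 2: "b", 3: "c"}, [(1, 2, "y"), (2, 3, "y")]),
    make_graph({1: "a", 2: "c"}, [(1, 2, "y")]),
]
pattern_ab = make_graph({1: "a", 2: "b"}, [(1, 2, "y")])
pattern_bb = make_graph({1: "b", 2: "b"}, [(1, 2, "x")])
```

With a threshold of 2/3, `pattern_ab` is frequent in this toy database and `pattern_bb` is not.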

Page 6: Mining Frequent Subgraphs

gSpan

DFS without candidate generation

Relabels graph representation to support DFS.

Discovers all frequent subgraphs without candidate generation or pruning.

DFS Representation

Map each graph to a DFS code (sequence).

Lexicographically order the codes.

Construct a search tree based on the lexicographic order.

Page 7: Mining Frequent Subgraphs

Depth-First Search Tree

[Figure: four panels, (a) to (d).]

Page 8: Mining Frequent Subgraphs

DFS Codes

[Figure: four panels, (a) to (d).]

Given e1 = (i1, j1) and e2 = (i2, j2), e1 < e2 if either:

i1 = i2 and j1 < j2, or

i1 < j1 and j1 = i2.

code(G, T) is the sequence of edges of G ordered so that ei < ei+1.

DFS codes for panels (b), (c), and (d):

edge  (b)          (c)          (d)
0     (0,1,X,a,Y)  (0,1,Y,a,X)  (0,1,X,a,X)
1     (1,2,Y,b,X)  (1,2,X,a,X)  (1,2,X,a,Y)
2     (2,0,X,a,X)  (2,0,X,b,Y)  (2,0,Y,b,X)
3     (2,3,X,c,Z)  (2,3,X,c,Z)  (2,3,Y,b,Z)
4     (3,1,Z,b,Y)  (3,0,Z,b,Y)  (3,0,Z,c,X)
5     (1,4,Y,d,Z)  (0,4,Y,d,Z)  (2,4,Y,d,Z)

Page 9: Mining Frequent Subgraphs

DFS Lexicographic Order

α = code(Gα, Tα) = (a0, a1, …, am)

β = code(Gβ, Tβ) = (b0, b1, …, bn)

α ≤ β iff (1) or (2):

(1) ∃ t, 0 ≤ t ≤ min(m, n), such that ak = bk for k < t and at ≺e bt

(2) ak = bk for 0 ≤ k ≤ m, and n ≥ m

Minimum DFS code

The minimum DFS code min(G), in DFS lexicographic order, is the canonical label of graph G.

Graphs A and B are isomorphic if and only if min(A) = min(B).
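The two rules can be sketched as a comparison function. The function name `dfs_code_leq` and the `edge_less` parameter are assumptions; the default edge order is plain Python tuple comparison, which is only a stand-in for the edge order used by gSpan and is not taken from the slides.

```python
def dfs_code_leq(alpha, beta, edge_less=lambda a, b: a < b):
    """Return True iff alpha <= beta in DFS lexicographic order.

    alpha, beta: sequences of edge tuples (i, j, label_i, edge_label, label_j).
    edge_less:   strict linear order on edges (assumed; defaults to tuple order).
    """
    for a, b in zip(alpha, beta):
        if a != b:
            # Rule (1): the first position where the codes differ decides.
            return edge_less(a, b)
    # Rule (2): all shared positions agree, so alpha <= beta
    # exactly when alpha is no longer than beta.
    return len(alpha) <= len(beta)


# The three DFS codes listed for panels (b), (c), and (d).
code_b = [(0, 1, "X", "a", "Y"), (1, 2, "Y", "b", "X"), (2, 0, "X", "a", "X"),
          (2, 3, "X", "c", "Z"), (3, 1, "Z", "b", "Y"), (1, 4, "Y", "d", "Z")]
code_c = [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X"), (2, 0, "X", "b", "Y"),
          (2, 3, "X", "c", "Z"), (3, 0, "Z", "b", "Y"), (0, 4, "Y", "d", "Z")]
code_d = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y"), (2, 0, "Y", "b", "X"),
          (2, 3, "Y", "b", "Z"), (3, 0, "Z", "c", "X"), (2, 4, "Y", "d", "Z")]
```

Under the assumed tuple order the three codes differ already at edge 0, so the comparison is decided by the labels of the first edge, and code (d) is the smallest of the three.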

Page 10: Mining Frequent Subgraphs

DFS Codes: Parents and Children

If α = (a0, a1, …, am) and β = (a0, a1, …, am, b):

β is the child of α.

α is the parent of β.

A valid DFS code requires that b grows from a vertex on the rightmost path.
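A short sketch of how the rightmost path can be read off a DFS code, assuming edges are tuples whose first two fields are DFS discovery indices and that an edge (i, j) with i < j is a forward edge. The function name `rightmost_path` is hypothetical.

```python
def rightmost_path(code):
    """Vertices from the root (0) to the rightmost vertex of a DFS code."""
    parent = {}
    for i, j, *_ in code:
        if i < j:
            # Forward edge: vertex j was discovered from vertex i.
            parent[j] = i
    if not parent:
        return [0]  # no forward edge yet: only the root exists
    v = max(parent)  # rightmost vertex = last vertex discovered
    path = [v]
    while v in parent:
        v = parent[v]
        path.append(v)
    return path[::-1]


# Codes (b), (c), and (d) from the DFS Codes slide, indices only.
edges_b = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1), (1, 4)]
edges_c = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0), (0, 4)]
edges_d = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 0), (2, 4)]
```

For code (b) the rightmost path is 0, 1, 4, so a child code may only add an edge that starts from one of those three vertices.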

Page 11: Mining Frequent Subgraphs

DFS Code Trees

Organize DFS code nodes as parent-child.

Pre-order traversal follows DFS lexicographic order.

If s and s’ are DFS codes of the same graph and s’ is not the minimum code, s’ can be pruned.

Page 12: Mining Frequent Subgraphs

gSpan

D is the set of all graphs.

S is the result set.

Algorithm 1: GraphSet_Projection(D, S)

    sort labels in D by frequency
    remove infrequent vertices and edges
    relabel remaining vertices and edges
    S’ = all frequent 1-edge graphs in D
    sort S’ in DFS lexicographic order
    S = S’
    foreach edge e in S’ do
        s = graph defined by e
        s.D = graphs in D containing e
        Subgraph_Mining(D, S, s)
        D = D - e
        if |D| < minSup
            break

Subprocedure 1: Subgraph_Mining(D, S, s)

    if s != min(s)
        return
    S = S ∪ {s}
    s’ = +1-edge children of s in s.D
    foreach child c in s’ do
        if support(c) ≥ minSup
            Subgraph_Mining(Ds, S, c)
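The first two steps of Algorithm 1 can be sketched as a preprocessing pass. This is a simplified reading that treats "frequent" as "label occurs in at least minSup graphs"; the function name `prune_infrequent`, the graph representation, and the toy database are assumptions for illustration.

```python
from collections import Counter


def prune_infrequent(database, min_sup):
    """Drop vertices and edges whose label occurs in fewer than min_sup graphs.

    database: list of (vertex_labels, edges), where vertex_labels is a
              dict id -> label and edges is a list of (u, v, label).
    """
    v_count, e_count = Counter(), Counter()
    for vertex_labels, edges in database:
        # Each graph contributes at most once per label.
        v_count.update(set(vertex_labels.values()))
        e_count.update({lab for _, _, lab in edges})
    keep_v = {lab for lab, n in v_count.items() if n >= min_sup}
    keep_e = {lab for lab, n in e_count.items() if n >= min_sup}

    pruned = []
    for vertex_labels, edges in database:
        new_vl = {v: lab for v, lab in vertex_labels.items() if lab in keep_v}
        # An edge survives only if its label and both endpoints survive.
        new_edges = [(u, v, lab) for u, v, lab in edges
                     if lab in keep_e and u in new_vl and v in new_vl]
        pruned.append((new_vl, new_edges))
    return pruned


# Toy database (illustrative only).
toy_db = [
    ({1: "a", 2: "b", 3: "b"}, [(1, 2, "y"), (1, 3, "y"), (2, 3, "x")]),
    ({1: "a", 2: "b", 3: "c"}, [(1, 2, "y"), (2, 3, "y")]),
    ({1: "a", 2: "b", 3: "d"}, [(1, 2, "y"), (2, 3, "z")]),
]
pruned_db = prune_infrequent(toy_db, min_sup=2)
```

Removing infrequent labels first is safe because any subgraph that contains an infrequent vertex or edge label cannot itself be frequent.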

Page 13: Mining Frequent Subgraphs

Runtime: Synthetic

[Figure: runtime (sec) on the synthetic data.]

Page 14: Mining Frequent Subgraphs

Runtime: Chemical

[Figure: runtime (sec, axis ticks at 1, 10, 100, 1000) versus support threshold (%, from 0 to 30) on the chemical data, comparing Apriori (FSG) and gSpan.]

Page 15: Mining Frequent Subgraphs

gSpan Advantages

Lower memory requirements.

Faster than naïve FSG by an order of magnitude.

No candidate generation.

Lexicographic ordering minimizes search tree.

False positives pruning.

Any disadvantage?

Page 16: Mining Frequent Subgraphs

FFSM: Fast Frequent Subgraph Mining

An Overview

How to solve the graph isomorphism problem?

A novel graph canonical form: CAM

How to tackle the subgraph isomorphism problem (NP-complete)?

Incrementally maintained embeddings

How to enumerate subgraphs?

An efficient data structure: CAM tree

Two operations: CAM-join, CAM-extension

Page 17: Mining Frequent Subgraphs

Adjacency Matrix

Every diagonal entry of the adjacency matrix M corresponds to a distinct vertex in G and is filled with the label of this vertex.

Every off-diagonal entry in the lower triangle part of M (1) corresponds to a pair of vertices in G and is filled with the label of the edge between the two vertices, or zero if there is no edge.

(1) For an undirected graph, the upper triangle is always a mirror of the lower triangle.
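A minimal sketch of building the lower triangle for a chosen vertex order. The five-vertex graph is an assumption: it is reconstructed from the code string given for M1 on the Code slide, and the vertex ids `v1` to `v5` and the function name `adjacency_matrix` are hypothetical.

```python
def adjacency_matrix(order, vertex_labels, edge_labels):
    """Lower-triangular labeled adjacency matrix as a list of rows.

    Row r holds the edge labels between order[r] and order[0..r-1]
    ("0" when there is no edge), followed by the label of order[r].
    """
    rows = []
    for r, u in enumerate(order):
        row = [edge_labels.get(frozenset((u, order[c])), "0") for c in range(r)]
        row.append(vertex_labels[u])  # diagonal entry: the vertex label
        rows.append(row)
    return rows


# Assumed reconstruction of graph (P): labels a, b, b, c, d and five edges.
P_VERTICES = {"v1": "a", "v2": "b", "v3": "b", "v4": "c", "v5": "d"}
P_EDGES = {
    frozenset(("v1", "v2")): "y",
    frozenset(("v1", "v3")): "y",
    frozenset(("v2", "v3")): "x",
    frozenset(("v2", "v4")): "y",
    frozenset(("v3", "v5")): "y",
}
m1 = adjacency_matrix(["v1", "v2", "v3", "v4", "v5"], P_VERTICES, P_EDGES)
```

A different vertex order gives a different matrix for the same graph, which is why a canonical choice among them is needed.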

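[Figure: graph (P) with vertices p1 to p5 and three of its adjacency matrices, M1, M2, and M3 (lower triangles shown).]

M1:
a
y b
y x b
0 y 0 c
0 0 y 0 d

M2:
a
y b
y x b
0 0 y d
0 y 0 0 c

M3:
b
x b
y 0 d
0 y 0 c
y y 0 0 a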

Page 18: Mining Frequent Subgraphs

Code

The code of an n × n adjacency matrix M is the sequence of its lower triangular entries (including the diagonal entries) in the order:

M1,1 M2,1 M2,2 … Mn,1 Mn,2 … Mn,n-1 Mn,n

Code(M1): aybyxb0y0c00y0d > Code(M2): aybyxb00yd0y00c > Code(M3): bxby0d0y0cyy00a

These inequalities hold under a symbol order in which a ranks above b and every label ranks above 0.

[Figure: the adjacency matrices M1, M2, and M3.]

The Canonical Adjacency Matrix (CAM) is the one that produces the maximal code, using lexicographic order.
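The definition can be checked by brute force over all vertex orders. This sketch assumes single-character labels and a symbol ranking of a > b > c > d > x > y > 0; only "a above b" and "labels above 0" follow from the inequalities on the slide, the rest of the ranking is an assumption. The graph is an assumed reconstruction of (P) from Code(M1), and the names `RANK`, `code`, and `canonical_code` are hypothetical.

```python
from itertools import permutations

# Assumed symbol ranking, lowest to highest.
RANK = {symbol: rank for rank, symbol in enumerate("0yxdcba")}


def code(order, vertex_labels, edge_labels):
    """Row-by-row lower-triangle code of the matrix for this vertex order."""
    out = []
    for r, u in enumerate(order):
        out.extend(edge_labels.get(frozenset((u, order[c])), "0")
                   for c in range(r))
        out.append(vertex_labels[u])
    return "".join(out)


def canonical_code(vertex_labels, edge_labels):
    """Maximal code over all vertex orders (factorial time, small graphs only)."""
    return max(
        (code(p, vertex_labels, edge_labels) for p in permutations(vertex_labels)),
        key=lambda s: [RANK[ch] for ch in s],
    )


# Assumed reconstruction of graph (P).
VL = {"v1": "a", "v2": "b", "v3": "b", "v4": "c", "v5": "d"}
EL = {
    frozenset(("v1", "v2")): "y",
    frozenset(("v1", "v3")): "y",
    frozenset(("v2", "v3")): "x",
    frozenset(("v2", "v4")): "y",
    frozenset(("v3", "v5")): "y",
}
```

Enumerating all n! orders is only practical for tiny graphs; it serves here to confirm that the code of M1 is the maximal one under the assumed ranking.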

Page 19: Mining Frequent Subgraphs

MP Submatrix

For an m × m matrix A, an n × n matrix B is A’s maximal proper submatrix (MP submatrix) iff B is obtained by removing the last non-zero entry from A.

[Figure: matrices M1 to M6, each the MP submatrix of the next, with codes a, ayb, ayby0b, aybyxb, aybyxb0y0c, and aybyxb0y0c00y0d.]

We define a CAM to be connected iff the corresponding graph is connected.

Theorem I: A CAM’s MP submatrix is a CAM.

Theorem II: A connected CAM’s MP submatrix is connected.
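One possible reading of the MP submatrix operation, sketched on the row-list representation: clear the last non-zero edge entry of the last row, and drop that row if the vertex is left without edges. This interpretation, the function name `mp_submatrix`, and the helper `to_code` are assumptions rather than the authors' definition.

```python
def mp_submatrix(rows):
    """MP submatrix of a lower-triangular matrix given as a list of rows.

    Each row is [edge entries..., vertex label]; "0" means no edge.
    Returns None for a single-vertex matrix.
    """
    if len(rows) <= 1:
        return None
    new_rows = [list(r) for r in rows]
    last = new_rows[-1]
    edge_cols = [c for c in range(len(last) - 1) if last[c] != "0"]
    if not edge_cols:
        raise ValueError("last vertex has no edge; the graph is not connected")
    last[edge_cols[-1]] = "0"      # remove the last non-zero edge entry
    if len(edge_cols) == 1:
        new_rows.pop()             # vertex would be isolated: drop its row
    return new_rows


def to_code(rows):
    """Concatenate the rows into the matrix code."""
    return "".join("".join(r) for r in rows)


m6 = [["a"],
      ["y", "b"],
      ["y", "x", "b"],
      ["0", "y", "0", "c"],
      ["0", "0", "y", "0", "d"]]

# Repeatedly take the MP submatrix until a single vertex remains.
chain = [to_code(m6)]
m = mp_submatrix(m6)
while m is not None:
    chain.append(to_code(m))
    m = mp_submatrix(m)
```

Starting from the 5 × 5 matrix, the chain walks back through smaller matrices down to the single vertex a, which is the path from a node to the root in a CAM tree.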

Page 20: Mining Frequent Subgraphs

CAM Tree: Subgraphs

[Figure: CAM tree of the subgraphs of graph (P).]

Page 21: Mining Frequent Subgraphs

CAM Tree: Frequent Subgraphs

[Figure: CAM tree of the frequent subgraphs of graphs (P), (Q), and (S) for support threshold σ = 2/3.]

Page 22: Mining Frequent Subgraphs

How to Enumerate Nodes in a CAM Tree?

Two operations to explore the CAM tree:

CAM-Join

CAM-Extension

Augmenting the CAM tree with suboptimal CAMs

Objectives:

No false dismissals

No redundancy

Plus: we want to do this efficiently!

Page 23: Mining Frequent Subgraphs

Suboptimal Tree

[Figure: suboptimal CAM tree for graph (P); tree edges are marked j or e, presumably for CAM-join and CAM-extension.]

We define a suboptimal CAM as a matrix whose MP submatrix is a CAM.

Page 24: Mining Frequent Subgraphs

Summary

Theorem: For a graph G, let CK-1 (CK) be the set of the suboptimal CAMs of all the size-(K-1) (size-K) subgraphs of G (K ≥ 2). Every member of CK can be enumerated unambiguously either by joining two members of CK-1 or by extending a member of CK-1.

Page 25: Mining Frequent Subgraphs

Experimental Study

Predictive Toxicology Evaluation Competition (PTE)

Contains 337 compounds.

Each graph contains 27 nodes and 27 edges on average.

NIH DTP Anti-Viral Screen Test (DTP CA/CM)

Chemicals are classified as Confirmed Active (CA), Confirmed Moderately Active (CM), or Confirmed Inactive (CI).

We formed a dataset containing the CA (423) and CM (1083) compounds.

Each graph contains 25 nodes and 27 edges on average.

Page 26: Mining Frequent Subgraphs

Performance (PTE)

[Figure: two performance plots on the PTE data set; x-axis: Support Threshold (%).]

Page 27: Mining Frequent Subgraphs

Performance (DTP CACM)

[Figure: two performance plots on the DTP CA/CM data set; x-axis: Support Threshold (%).]