
Graph Mining

Laks V.S. Lakshmanan. Based on Xifeng Yan and Jiawei Han: gSpan: Graph-Based Substructure Pattern Mining. ICDM 2002. Also see their tech. report (UIUC CS web site).

What's the relevance to this course?
Specifically, what does GM have to do with influence analysis? The connection is technical: we need to build up some background in GM to appreciate it. We will do influence analysis after covering GM.

The problem
Given a database of graphs*, find sub-graphs with support >= minSup (a user-specified threshold).
*Graphs may be labeled, weighted, etc. We will consider a simple version for clarity.
There is another problem formulation: given one graph, find frequent pattern sub-graphs.

Basics
Given a database D = {G0, ..., Gn} and a pattern graph g, supportD(g) = the number of graphs in D that contain g as a sub-graph. We want to find all g with supportD(g) >= minSup.
◦Mainly interested in frequent connected subgraphs. (Why?)
Analogous to: given a database of market basket transactions, find all itemsets S with support(S) >= minSup, as used for association rules and correlations.
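To make the definition concrete, here is a minimal sketch of supportD(g) in Python, assuming the networkx library; the toy database, labels, and helper are my own illustration, and real miners avoid running one subgraph-isomorphism test per (pattern, graph) pair:

import networkx as nx
from networkx.algorithms import isomorphism

def support(D, g):
    # supportD(g) = number of graphs in D containing g as a sub-graph.
    # subgraph_is_isomorphic() looks for an induced subgraph of G
    # isomorphic to g, which suffices for this illustration.
    match = isomorphism.categorical_node_match("label", None)
    return sum(
        1 for G in D
        if isomorphism.GraphMatcher(G, g, node_match=match).subgraph_is_isomorphic()
    )

def lg(edges, labels):
    # Tiny helper (mine): build a node-labeled undirected graph.
    G = nx.Graph()
    for v, lab in labels.items():
        G.add_node(v, label=lab)
    G.add_edges_from(edges)
    return G

D = [lg([(0, 1), (1, 2), (0, 2)], {0: "X", 1: "X", 2: "Y"}),
     lg([(0, 1), (1, 2)], {0: "X", 1: "X", 2: "Y"})]
g = lg([(0, 1)], {0: "X", 1: "Y"})   # pattern: a single X-Y edge
print(support(D, g))                 # 2: both graphs contain an X-Y edge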

Quick review of AR mining next.

Review of Association Rule Mining 1/2
I = universe of items. t = a transaction (a set of items). D = database (a set of transactions). S = a set of items.
support(S) = |{t ∈ D | S ⊆ t}|. Given minSup, S is frequent if support(S) >= minSup.
For itemsets X, Y: conf(X → Y) = support(X ∪ Y) / support(X).
Given D, find all rules X → Y such that X ∪ Y is frequent and conf(X → Y) >= minConf.
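These definitions are a one-liner each; a minimal sketch over a toy database (the database and names are mine):

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]

def support(S, D):
    return sum(1 for t in D if S <= t)    # S <= t means S ⊆ t

def conf(X, Y, D):
    # conf(X → Y) = support(X ∪ Y) / support(X)
    return support(X | Y, D) / support(X, D)

print(support({"a", "c"}, D))   # 2
print(conf({"a"}, {"c"}, D))    # 0.666...: {a,c} occurs in 2 of the 3 transactions containing a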

Review of AR Mining 2/2
Two major steps:
1. Find all frequent itemsets.
2. Find all ARs with conf >= minConf.
By far, step 1 is the expensive one. Numerous algorithms exist; we will sketch two major ones:
1. Apriori (classic).
2. FP-trees (more recent and more efficient).

Itemset Lattice
[Figure: the itemset lattice over {A, B, C}: {} at the bottom; then {A}, {B}, {C}; then {A,B}, {A,C}, {B,C}; then {A,B,C} at the top.]
Just conceptual: the real itemset lattice is over tens of thousands of items!

Apriori
Finding frequent itemsets (a full sketch follows the candidate-generation slide below):
◦Sweep the itemset lattice bottom-up.
◦Generate candidate itemsets whose frequency is to be checked. If X is infrequent, none of its supersets can be frequent [support is anti-monotone; frequent itemsets are downward closed].
◦Check the frequency of the candidate itemsets and filter through the frequent ones.
◦Repeat until the candidate set is empty.

Apriori
Generate rules:
◦For each frequent itemset S and every partition (X, Y) of S, compute conf(X → Y) and output the rule if conf(X → Y) >= minConf.
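A sketch of this rule-generation step over one frequent itemset (toy data; the helper names are mine):

from itertools import combinations

def support(S, D):
    return sum(1 for t in D if S <= t)

def gen_rules(S, D, min_conf):
    # For frequent itemset S, try every partition (X, Y = S - X) and
    # output X → Y when conf = support(S)/support(X) clears min_conf.
    S = frozenset(S)
    for r in range(1, len(S)):
        for X in map(frozenset, combinations(sorted(S), r)):
            c = support(S, D) / support(X, D)
            if c >= min_conf:
                yield set(X), set(S - X), c

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
for X, Y, c in gen_rules({"a", "c"}, D, min_conf=0.6):
    print(X, "->", Y, round(c, 2))   # {a} -> {c} 0.67 and {c} -> {a} 0.67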

Candidate generation: let Ck and C'k be any two frequent itemsets of size k such that |Ck ∩ C'k| = k-1. It is easy to pick such pairs if we keep the itemsets sorted. Then Ck ∪ C'k is a viable candidate set.
Details: Agrawal & Srikant, VLDB 1994.
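Putting the sweep, the sorted join, and the anti-monotone prune together, a minimal Apriori sketch (my own toy code in the spirit of Agrawal & Srikant, not their optimized implementation):

from itertools import combinations

def apriori(D, min_sup):
    # Itemsets are kept as sorted tuples so that the join on a shared
    # (k-1)-prefix is a simple equality test.
    def support(S):
        return sum(1 for t in D if set(S) <= t)

    items = sorted({i for t in D for i in t})
    Lk = [(i,) for i in items if support((i,)) >= min_sup]
    frequent = list(Lk)
    while Lk:
        prev = set(Lk)
        # Join: two frequent k-itemsets agreeing on their first k-1
        # items yield one (k+1)-candidate.
        cands = {tuple(sorted(set(a) | set(b)))
                 for a, b in combinations(Lk, 2) if a[:-1] == b[:-1]}
        # Prune (anti-monotone): every k-subset must itself be frequent.
        cands = {c for c in cands
                 if all(s in prev for s in combinations(c, len(c) - 1))}
        # Frequency check against D; filter through the frequent ones.
        Lk = sorted(c for c in cands if support(c) >= min_sup)
        frequent.extend(Lk)
    return frequent

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(apriori(D, min_sup=2))
# [('a',), ('b',), ('c',), ('a', 'c'), ('b', 'c')]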

FP-trees
Apriori is quite efficient, but has an expensive candidate itemset generation step built in. Not all candidates may turn out to be frequent. Can we cut down on, or even completely avoid, candidate generation?
Answer: "Yes, using FP-trees!" [J. Han, J. Pei, Y. Yin. Mining Frequent Patterns without Candidate Generation. SIGMOD 2000.]

Key strengths of the FP-tree-based approach
Compress the database into a compact structure (a tree plus links) which may often fit in memory.
◦Make each frequent (single) item a node and use prefix sharing to factor transactions.
◦Frequent items are more likely to combine to form frequent itemsets.
We never need to look at the original DB again: two passes over the DB build the FP-tree, and mining it yields all frequent patterns without candidate generation.

Example DB from the paper
TID | Items
 1  | f, a, c, d, g, i, m, p
 2  | a, b, c, f, l, m, o
 3  | b, f, h, j, o
 4  | b, c, k, s, p
 5  | a, f, c, e, l, p, m, n

Assume minSup = 3.
Scan 1: find all frequent items: f:4, c:4, a:3, b:3, m:3, p:3 (ordered in descending frequency order).
Scan 2: read in every transaction (restricted to frequent items, in that order) and insert it into the tree, creating new nodes or incrementing existing node counts, using prefix sharing. Prefix sharing achieves great compression.

Example (contd.)
[Figure: the FP-tree for the example DB. From the root: f:4 → c:3 → a:3 → m:2 → p:2, with a:3 also having the branch b:1 → m:1, and f:4 the branch b:1; a second root branch c:1 → b:1 → p:1. A header table lists f, c, a, b, m, p, each linked to its node occurrences in the tree.]

FP-tree construction
A simple recursive algorithm: when you read a transaction (restricted to frequent items, ordered in descending frequency order), reflect each item into the current tree by either:
◦incrementing the count of the matching (prefix) node, OR
◦starting a new branch.
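A minimal sketch of both scans (dict-based nodes; the layout and names are mine, and tie-breaking among equally frequent items is arbitrary, so the tree may differ cosmetically from the figure above):

from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fptree(D, min_sup):
    # Scan 1: rank the frequent items in descending frequency order.
    freq = Counter(i for t in D for i in t)
    order = {i: r for r, (i, c) in enumerate(freq.most_common()) if c >= min_sup}
    # Scan 2: insert each filtered, reordered transaction; shared
    # prefixes bump existing counts, new items start new branches.
    # (The header table with node links is omitted for brevity.)
    root = Node(None)
    for t in D:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            if item not in node.children:
                node.children[item] = Node(item)   # start a new branch
            node = node.children[item]
            node.count += 1                        # bump the shared prefix
    return root

D = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
     list("bcksp"), list("afcelpmn")]
root = build_fptree(D, min_sup=3)
print({i: n.count for i, n in root.children.items()})   # {'f': 4, 'c': 1}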

Frequent Pattern Generation
Start with the header table, in ascending frequency order.
Item p: freq(p) = 3 (just follow the link path from the header table entry for p).
What about patterns containing p? Consider the p-projected DB:
◦{(fcam: 2), (cb: 1)}.
◦c:3 is the only frequent item here.
The p-conditional FP-tree is therefore the single chain (c:3), yielding the patterns p:3 and cp:3.

FP Generation (contd.)
Item m: m is frequent with freq(m) = 3. The m-projected DB = {(fca:2), (fcab:1)}. The resulting m-conditional FP-tree is "(fca:3)".
The call mine(fca, m) generates the recursive calls mine(fc, am), mine(f, cm), and mine({}, fm).
These calls generate the FPs m:3, am:3, cm:3, fm:3, cam:3, fam:3, fcm:3, fcam:3, i.e., all FPs containing m.
Other nodes (items) are processed similarly.
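The recursion is easiest to see on projected databases directly. Below is a simplified sketch that mines lists of (prefix, count) pairs instead of actual conditional FP-trees (same output as the mine() calls above, less bookkeeping; all names are mine):

from collections import Counter

def mine(proj, suffix, min_sup, out):
    # proj: a projected DB as (prefix, count) pairs, each prefix in
    # descending global frequency order; suffix: the items fixed so far.
    freq = Counter()
    for items, cnt in proj:
        for i in dict.fromkeys(items):   # each distinct item once per prefix
            freq[i] += cnt
    for item, cnt in freq.items():
        if cnt < min_sup:
            continue
        pattern = [item] + suffix
        out.append(("".join(pattern), cnt))
        # The item-projected DB keeps only what precedes `item`.
        sub = [(items[:items.index(item)], c) for items, c in proj if item in items]
        mine([(p, c) for p, c in sub if p], pattern, min_sup, out)

out = []
mine([("fca", 2), ("fcab", 1)], ["m"], 3, out)   # the m-projected DB
print(out)   # fm, cm, fcm, am, fam, cam, fcam, each with support 3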

FP-trees: concluding remarks
Note the avoidance of the expensive candidate generation step.
Read both the SIGMOD 2000 paper and the journal version (DMKD 2004) for more details and for ideas on parallelizing using divide and conquer.
Practical note: OTOH, Apriori is among the most highly optimized codes for pattern mining.

Additional Topics on Frequent Patterns
ARs with constraints: e.g., X → Y such that SUM(X.price) <= c and AVG(Y.price) >= c'. (See Ng et al. SIGMOD 98, Lakshmanan et al. SIGMOD 99, and numerous follow-ups.)
Leung et al. (20??): push constraints into the FP-tree algorithm.
Closed patterns: background and Galois lattice theory (see Zaki et al. and Pasquier et al.).
Maximal patterns, and patterns with the highest frequency.

Graph Mining Resumed
Early work on GM followed in the footsteps of Apriori:
◦Generate candidate subgraphs of size k+1 from the size-k frequent subgraphs (expensive).
◦Eliminate infrequent candidates with a subgraph isomorphism test (NP-complete).
gSpan replaces both by a simple depth-first search (no false positives to prune!). Key idea: the DFS code.
Recall: our main interest is in frequent connected subgraphs.

gSpan Framework
G = (V, E, L, l): node- and edge-labeled, undirected.
Only interested in connected frequent subgraphs.
◦Frequent(g) iff support(g) >= minSup.
Recall the notions of isomorphism, subgraph isomorphism, and automorphism.

DFS and Pruning
See the tech. report, Fig. 1, p. 5. How does it relate to the itemset lattice for frequent itemset mining?
DFS Tree (GT): see TR, Fig. 2, p. 6.
◦Clearly, many DFS trees are possible for a graph.
◦Root, right-most node, right-most path.
◦Forward and backward edges.
◦Partial order on fwd & bwd edges.

DFS & Pruning (contd.)
1: PO on fwd edges: (i1,j1) < (i2,j2) iff j1 < j2. E.g., (v0,v1) < (v2,v3) in Fig. 2b.
2: PO on bwd edges: (i1,j1) < (i2,j2) iff either (a) i1 < i2, OR (b) i1 = i2 and j1 < j2. E.g., (v2,v0) < (v3,v1) [Fig. 2b] and (v2,v0) < (v3,v0) [Fig. 2c-d].

DFS & Pruning (contd.)
3: PO between fwd and bwd edges: e1 = (i1,j1) < e2 = (i2,j2) iff:
◦e1 is fwd, e2 is bwd, and j1 <= i2; OR
◦e1 is bwd, e2 is fwd, and i1 < j2.
Theorem: 1-3 define a total order. Proof: exercise.
Exercise: prove that the above TO is equivalent to: "< is the transitive closure of (i) e1 < e2 if i1 = i2 and j1 < j2; (ii) e1 < e2 if i1 < j1 and j1 = i2."
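Rules 1-3 transcribe directly into a comparison function, which is also a quick way to sanity-check the exercise (edges as (i, j) pairs of DFS discovery times; the function name is mine):

def edge_lt(e1, e2):
    # e1 < e2 under the order 1-3; an edge (i, j) is forward iff i < j.
    (i1, j1), (i2, j2) = e1, e2
    fwd1, fwd2 = i1 < j1, i2 < j2
    if fwd1 and fwd2:                          # rule 1
        return j1 < j2
    if not fwd1 and not fwd2:                  # rule 2
        return i1 < i2 or (i1 == i2 and j1 < j2)
    if fwd1:                                   # rule 3: fwd vs bwd
        return j1 <= i2
    return i1 < j2                             # rule 3: bwd vs fwd

print(edge_lt((0, 1), (2, 3)))   # True: rule 1, as in Fig. 2b
print(edge_lt((2, 0), (3, 1)))   # True: rule 2, as in Fig. 2b
print(edge_lt((0, 1), (2, 0)))   # True: rule 3, fwd (0,1) precedes bwd (2,0)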

DFS & Pruning (contd.)
DFS Code: the unique ordering of the edges induced by a DFS tree (e.g., Fig. 2b-d). Procedure (sketched in code below):
◦Pick a start node (the root).
◦Repeat {
add the fwd edge connecting the current "code" to some new node;
add all bwd edges connecting the new node to old nodes; }
◦until no more edges are left. E.g., Fig. 2.
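A sketch of the procedure as one DFS (the adjacency-dict representation and names are mine; labels are omitted, so an edge is just its pair of discovery times):

def dfs_code(adj, root):
    # One DFS code of an undirected graph {v: set_of_neighbors}: on
    # discovering a new node, emit the forward edge to it, then all
    # backward edges from it to previously visited nodes.
    time = {root: 0}
    code = []

    def visit(v):
        for u in sorted(adj[v]):               # neighbor order: a free choice
            if u in time:
                continue                       # edge already emitted elsewhere
            time[u] = len(time)
            code.append((time[v], time[u]))    # forward edge to the new node
            back = sorted(time[w] for w in adj[u] if w in time and w != v)
            code.extend((time[u], t) for t in back)   # its backward edges
            visit(u)

    visit(root)
    return code

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(dfs_code(adj, "a"))   # [(0, 1), (1, 2), (2, 0), (2, 3)]: triangle plus a pendant edge

Different roots or neighbor orders give different codes, which is exactly the problem the next slide addresses.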

DFS & Pruning (contd.)
But a given graph has numerous DFS codes, making it complex to check whether a pattern g is isomorphic to a subgraph of some graph Gi ∈ D.
Solution: in one pass, gather the counts of node labels and edge labels, and order them in non-ascending frequency order. We will assume that the frequency order = alphabetical order, for clarity.

DFS & Pruning (contd.)
E.g., X-a->X < X-b->X, X-a->X < X-a->Y, etc. Think of a simple lexicographic ordering on strings of length 3 (nodeLabel.edgeLabel.nodeLabel).
Visiting time (wrt the DFS) is important too: e.g., (0,1,?,?,?) < (0,2,?,?,?), (1,?,?,?,?), etc.
So what? Well, we get a unique minimum DFS code for every graph. Revisit Fig. 2a: γ is its minimum DFS code.
Where does this help? Pruning (revisit Fig. 1)!
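With labels attached, each edge becomes a 5-tuple (i, j, l_i, l_(i,j), l_j), and for codes sharing the same (i, j) skeleton, Python's built-in lexicographic comparison of tuple lists reduces exactly to this label order, so min() picks the canonical code. A toy illustration for the labeled path X-a-X-a-Y (my own codes; in general the (i, j) parts must be compared via rules 1-3, and real implementations enumerate candidate codes far more cleverly):

# Two DFS codes of the labeled path X -a- X -a- Y, rooted at either end.
code1 = [(0, 1, "X", "a", "X"), (1, 2, "X", "a", "Y")]   # rooted at the far X
code2 = [(0, 1, "Y", "a", "X"), (1, 2, "X", "a", "X")]   # rooted at the Y end
print(min([code1, code2]) == code1)   # True: code1 is the minimum DFS code
print((0, 1, "X", "a", "X") < (0, 1, "X", "b", "X"))     # True: X-a-X < X-b-X
print((0, 1, "X", "a", "X") < (0, 1, "X", "a", "Y"))     # True: X-a-X < X-a-Y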

gSpan Algorithm
1. Graph Set Projection.
2. Sub-graph Mining.
◦Makes crucial use of the min. DFS code.
◦Involves the expensive step of computing the min. version of a given DFS code.
Remarks: gSpan is one of the best-known algorithms for graph mining. It is much more efficient than the prior art (on synthetic and real data sets), but still quite expensive.

Graph Mining (Final Remarks)
Considerable work since gSpan: constraints on patterns, and pushing constraints into the "mining loop".
X. Yan et al. Mining significant graph patterns by leap search. SIGMOD 2007.
M. Deshpande et al. Frequent substructure-based approaches for classifying chemical compounds. IEEE TKDE, 17:1036-1050, 2005.
X. Yan et al. Graph indexing: A frequent structure-based approach. SIGMOD, pages 335-346, 2004.
M. Hasan et al. ORIGAMI: Mining representative orthogonal graph patterns. In Proc. of ICDM, pages 153-162, 2007.
F. Pennerath and A. Napoli. Mining frequent most informative subgraphs. In the 5th Int. Workshop on Mining and Learning with Graphs, 2007.
X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns. SIGKDD, pages 286-295, 2003.