
Page 1: Frequent Structure Mining

Frequent Structure Mining
Robert Howe
University of Vermont
Spring 2014

Page 2: Frequent Structure Mining

Original Authors

• This presentation is based on the paper: Zaki MJ (2002). Efficiently mining frequent trees in a forest. Proceedings of the 8th ACM SIGKDD International Conference.

• The author's original presentation was used to make this one.

• I further adapted this from Ahmed R. Nabhan's modifications.

Page 3: Frequent Structure Mining

Outline

• Graph Mining Overview
• Mining Complex Structures - Introduction
• Motivation and Contributions of the Author
• Problem Definition and Case Examples
• Main Ingredients for Efficient Pattern Extraction
• Experimental Results
• Conclusions

Page 4: Frequent Structure Mining

Outline

• Graph Mining Overview   ← current section
• Mining Complex Structures - Introduction
• Motivation and Contributions of the Author
• Problem Definition and Case Examples
• Main Ingredients for Efficient Pattern Extraction
• Experimental Results
• Conclusions

Page 5: Frequent Structure Mining

Why Graph Mining?

• Graphs are convenient structures that can represent many complex relationships.

• We are drowning in graph data:
  • Social networks
  • Biological networks
  • World Wide Web

Page 6: Frequent Structure Mining

[Figure: Facebook friend network with clusters labeled High School, UVM, and BU. Source: Wolfram|Alpha Facebook Report]

Page 7: Frequent Structure Mining

[Figure: Facebook friend network. Source: Wolfram|Alpha Facebook Report]

Page 8: Frequent Structure Mining

[Figure: Biological network. Source: KEGG Pathways Database]

Page 9: Frequent Structure Mining

Some Graph Mining Problems

• Pattern Discovery

• Graph Clustering

• Graph Classification and Label Propagation

• Structure and Dynamics of Evolving Graphs


Page 10: Frequent Structure Mining

Graph Mining Framework

Mining graph patterns is a fundamental problem in data mining.

[Diagram: graph data is mined into an exponential pattern space, from which relevant patterns are selected for exploratory tasks such as clustering, classification, and structure indices.]

Page 11: Frequent Structure Mining

Basic Concepts

• Graph – A graph G is a 3-tuple G = (V, E, L) where:
  • V is the finite set of nodes.
  • E ⊆ V × V is the set of edges.
  • L is a labeling function for edges and nodes.

• Subgraph – A graph G1 = (V1, E1, L1) is a subgraph of G2 = (V2, E2, L2) iff:
  • V1 ⊆ V2
  • E1 ⊆ E2
  • L1(v) = L2(v) for all v ∈ V1.
  • L1(e) = L2(e) for all e ∈ E1.

[Figure: a four-node graph with labels A, B, C, D and a three-node subgraph with labels A, B, C.]
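A minimal sketch of this subgraph test (the Graph type is a hypothetical stand-in, not from the paper), assuming G1's nodes and edges are drawn directly from G2 so the label conditions can be checked by identity:

    from dataclasses import dataclass

    @dataclass
    class Graph:
        V: set     # node ids
        E: set     # (u, v) pairs, E ⊆ V × V
        L: dict    # labels for nodes and edges

    def is_subgraph(G1, G2):
        # Check all four conditions from the definition above.
        return (G1.V <= G2.V
                and G1.E <= G2.E
                and all(G1.L[v] == G2.L[v] for v in G1.V)
                and all(G1.L[e] == G2.L[e] for e in G1.E))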

Page 12: Frequent Structure Mining

Basic Concepts

• Graph Isomorphism – "A bijection between the vertex sets of G1 and G2 such that any two vertices u and v which are adjacent in G1 are also adjacent in G2." (Wikipedia)

[Figure: two graphs with the same structure, one with nodes labeled A–E and one with nodes labeled 1–5.]

• Subgraph Isomorphism is even harder (NP-complete!)

Page 13: Frequent Structure Mining

Basic Concepts

• Graph Isomorphism – Let G1 = (V1, E1, L1) and G2 = (V2, E2, L2). A graph isomorphism is a bijective function f : V1 → V2 satisfying:
  • L1(u) = L2( f (u)) for all u ∈ V1.
  • For each edge e1 = (u, v) ∈ E1, there exists e2 = ( f (u), f (v)) ∈ E2 such that L1(e1) = L2(e2).
  • For each edge e2 = (u, v) ∈ E2, there exists e1 = ( f⁻¹(u), f⁻¹(v)) ∈ E1 such that L1(e1) = L2(e2).

Page 14: Frequent Structure Mining

Discovering Subgraphs

• TreeMiner and gSpan both employ subgraph or substructure pattern mining.

• Graph or subgraph isomorphism can be used as an equivalence relation between two structures.

• There is an exponential number of subgraph patterns inside a larger graph (there are 2^n node subsets of an n-node graph, before even considering edges).

• Finding frequent subgraphs (or subtrees) tends to be useful in data mining.

Page 15: Frequent Structure Mining

Outline

• Graph Mining Overview
• Mining Complex Structures - Introduction   ← current section
• Motivation and Contributions of the Author
• Problem Definition and Case Examples
• Main Ingredients for Efficient Pattern Extraction
• Experimental Results
• Conclusions

Page 16: Frequent Structure Mining

Mining Complex Structures

• Frequent structure mining tasks:
  • Itemsets – Transactional, unordered data.
  • Sequences – Temporal/positional data, text, biological sequences.
  • Tree patterns – Semi-structured data, web mining, bioinformatics, etc.
  • Graph patterns – Bioinformatics, web data.

• "Frequent" is a broad term:
  • Maximal or closed patterns in dense data.
  • Correlation and other statistical metrics.
  • Interesting, rare, non-redundant patterns.

Page 17: Frequent Structure Mining

Anti-Monotonicity

(Source: SIGMOD '08)

• A monotonic function is a consistently increasing or decreasing function.*

• The author refers to a monotonically decreasing function as anti-monotonic.

• The frequency of a supergraph cannot be greater than the frequency of a subgraph (similar to Apriori). For example, if a triangle pattern occurs in 40 graphs of a database, any supergraph containing that triangle can occur in at most those 40 graphs.

* Very informal definition.

[Figure: support curve that only decreases as pattern size grows — the black line is always decreasing.]

Page 18: Frequent Structure Mining

Outline

• Graph Mining Overview
• Mining Complex Structures - Introduction
• Motivation and Contributions of the Author   ← current section
• Problem Definition and Case Examples
• Main Ingredients for Efficient Pattern Extraction
• Experimental Results
• Conclusions

Page 19: Frequent Structure Mining

Tree Mining – Motivation

• Captures intricate (subspace) patterns.

• Patterns can be used (as features) to build global models (classification, clustering, etc.).

• Ideally suited for categorical, high-dimensional, complex, and massive data.

• Interesting applications:
  • Semi-structured data – Mine structure and content.
  • Web usage mining – Log mining (user sessions as trees).
  • Bioinformatics – RNA secondary structures, phylogenetic trees.

(Source: University of Washington)

Page 20: Frequent Structure Mining

Classification Example

• Subgraph patterns can be used as features for classification: "Hexagons are a commonly occurring subgraph in organic compounds."

  # of sides:  2  3  4  5  6  7  8
  Amount:      0  1  0  0  1  0  0

• Off-the-shelf classifiers (like neural networks or genetic algorithms) can be trained using these vectors.

• Feature selection is very useful too.
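A minimal sketch (hypothetical helper, not from the paper) of turning subgraph-pattern counts into the fixed-length feature vector shown above:

    # Map subgraph-pattern counts onto a fixed pattern order to
    # produce a classifier-ready feature vector.
    def feature_vector(pattern_counts, patterns):
        return [pattern_counts.get(p, 0) for p in patterns]

    patterns = list(range(2, 9))      # ring sizes 2..8, as in the table
    counts = {3: 1, 6: 1}             # one triangle, one hexagon
    print(feature_vector(counts, patterns))  # [0, 1, 0, 0, 1, 0, 0]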

Page 21: Frequent Structure Mining

Contributions

• Systematic subtree enumeration.

• Extensions for mining unlabeled or unordered subtrees or sub-forests.

• Optimizations:
  • Representing trees as strings.
  • Scope-lists for subtree occurrences.

Page 22: Frequent Structure Mining

Outline

• Graph Mining Overview
• Mining Complex Structures - Introduction
• Motivation and Contributions of the Author
• Problem Definition and Case Examples   ← current section
• Main Ingredients for Efficient Pattern Extraction
• Experimental Results
• Conclusions

Page 23: Frequent Structure Mining

How does searching for patterns work?

• Start with graphs of small size.

• Extend each size-k graph by one node to generate size-(k + 1) candidate patterns.

• Use a scoring function to evaluate each candidate.

• A popular scoring function is minimum support: only graphs with frequency at least minsup are kept.
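A minimal sketch of this level-wise loop (the seeds, extend, and support helpers are abstract placeholders, not the paper's implementation; anti-monotonicity justifies the pruning step):

    # Level-wise pattern search with minimum-support pruning.
    # Patterns must be hashable so candidates can be deduplicated.
    def mine(db, seeds, extend, support, minsup):
        frequent = []
        level = [p for p in seeds if support(p, db) >= minsup]
        while level:
            frequent.extend(level)
            candidates = {c for p in level for c in extend(p)}  # k -> k + 1
            level = [c for c in candidates if support(c, db) >= minsup]
        return frequent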

Page 24: Frequent Structure Mining

How does searching for patterns work?

• "The generation of size k + 1 subgraph candidates from size k frequent subgraphs is more complicated and more costly than that of itemsets" – Yan and Han (2002), on gSpan

• Where do we add a new edge? It is possible to add a new edge to a pattern and then find that the resulting pattern doesn't exist in the database.

• The main story of this presentation is good candidate generation strategies.

Page 25: Frequent Structure Mining

TreeMiner

• TreeMiner uses a technique for numbering tree nodes based on DFS.

• This numbering is used to encode trees as vectors.

• Subtrees sharing a common prefix (the first k numbers of their vectors) form an equivalence class.

• Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees (as in Apriori).

Page 26: Frequent Structure Mining

TreeMiner

• This is important because candidate subtrees are generated only once!

• (Remember the subgraph isomorphism problem that makes it likely to generate the same pattern over and over)


Page 27: Frequent Structure Mining

Definitions

• Tree – An undirected graph in which there is exactly one path between any two vertices.

• Rooted Tree – A tree with a special node called the root.

[Figure: two trees — one with no root node (an unrooted tree), one with a designated root node (a rooted tree).]

Page 28: Frequent Structure Mining

Definitions

• Ordered Tree – The ordering of a node's children matters.

• Example: XML documents.

• Exercise – Prove that ordered trees must be rooted.

[Figure: the same children drawn in two different orders (v1, v2, v3 vs. v2, v1, v3); as ordered trees, the two are not equal.]

Page 29: Frequent Structure Mining

Definitions

• Labeled Tree – Nodes have labels.

• Rooted trees also have some special terminology:
  • Parent – The node one edge closer to the root.
  • Ancestor – A node n edges closer to the root, for any n.
  • Siblings – Two nodes with the same parent.

• As Prolog-style rules:

  ancestor(X,Y) :- parent(X,Y).
  ancestor(X,Y) :- parent(Z,Y), ancestor(X,Z).
  sibling(X,Y)  :- parent(Z,X), parent(Z,Y).

[Figure: a tree with parent, ancestor, sibling, and embedded-sibling relationships annotated.]

Page 30: Frequent Structure Mining

Definitions

• Embedded Siblings – Two nodes sharing a common ancestor.

• Numbering – A node's position in a traversal (normally DFS) of the tree.

• A node has a number ni and a label L(ni).

• Scope – The scope of a node nl is [l, r], where nr is the rightmost leaf under nl (again, DFS numbering).

Page 31: Frequent Structure Mining

Definitions

• Embedded Subtree – S = (Ns, Bs) is an embedded subtree of T = (N, B) if and only if the following conditions are met:
  • Ns ⊆ N (the nodes must be a subset).
  • b = (nx, ny) ∈ Bs iff nx is an ancestor of ny in T.

• For each subset of nodes Ns there is exactly one embedded subtree or subforest.

[Figure: a nine-node tree v0–v8 and the embedded subtree on {v1, v4, v5} (v1 with children v4 and v5); colors show corresponding nodes.]

Page 32: Frequent Structure Mining

Definitions

• Match Label – The node numbers (DFS numbers) in T of the nodes in S with matching labels. Example: {v1, v4, v5}, or simply {1, 4, 5}.

• A match label uniquely identifies a subtree occurrence.

• This is useful because a labeling function need not be injective: several nodes can share a label, so labels alone do not uniquely identify a subtree.

[Figure: the same tree v0–v8 and the subtree on {v1, v4, v5}; colors show corresponding nodes.]

Page 33: Frequent Structure Mining

Definitions

• Subforest – A disconnected pattern generated in the same way as an embedded subtree.

[Figure: the tree v0–v8 and the subforest with components {v1, v4} and {v7, v8}; colors show corresponding nodes.]

Page 34: Frequent Structure Mining

Problem Definition

• Given a database (forest) D of trees, find all frequent embedded subtrees.

• Frequent – Occurring a minimum number of times (a user-defined minsup).

• Support(S) – The number of trees in D that contain at least one occurrence of S.

• Weighted-Support(S) – The number of occurrences of S across all trees in D.
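A minimal sketch of the two support notions (the occurrences(S, T) helper, returning all embedded occurrences of S in tree T, is an assumed placeholder rather than the paper's scope-list machinery):

    # Support: in how many trees does S occur at least once?
    def support(S, D, occurrences):
        return sum(1 for T in D if occurrences(S, T))

    # Weighted support: how many occurrences of S are there in total?
    def weighted_support(S, D, occurrences):
        return sum(len(occurrences(S, T)) for T in D)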

Page 35: Frequent Structure Mining

Exercise

Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or a subforest, and why? Assume a labeling function L(x) = x.

[Figure: the tree v0–v8 and the resulting pattern: v1 with children v2 and v5.]

Answer: This is an embedded subtree, because v1 is an ancestor of both v2 and v5, so all of the nodes are connected.

(*Cough* Exam Question *Cough*)

Page 36: Frequent Structure Mining

Outline

• Graph Mining Overview
• Mining Complex Structures - Introduction
• Motivation and Contributions of the Author
• Problem Definition and Case Examples
• Main Ingredients for Efficient Pattern Extraction   ← current section
• Experimental Results
• Conclusions

Page 37: Frequent Structure Mining

Main Ingredients

• Pattern representation – Trees as strings.

• Candidate generation – No duplicates.

• Pattern counting:
  • Scope-based lists (TreeMiner)
  • Pattern-based matching (PatternMatcher)

Page 38: Frequent Structure Mining

String Representation

• With N nodes, M branches, and a maximum fanout of F:
  • An adjacency matrix takes N(F + 1) space.
  • An adjacency list requires 4N – 2 space.
  • A (node, child, sibling) tree requires 3N space.
  • The string representation requires only 2N – 1 space.

Page 39: Frequent Structure Mining

String Representation

• The string representation lists node labels in DFS order, with a backtrack operator, –1, emitted each time the traversal moves back up the tree:

  0 1 3 1 –1 2 –1 –1 2 –1 –1 2 –1

[Figure: the encoded tree — root 0 with children 1 and 2; node 1 has children 3 and 2; node 3 has children 1 and 2.]
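A minimal sketch of this encoding (the Tree class is a hypothetical stand-in, not the paper's code):

    # Emit each label on the way down and -1 on the way back up.
    class Tree:
        def __init__(self, label, children=()):
            self.label, self.children = label, list(children)

    def encode(node):
        out = [node.label]
        for child in node.children:
            out += encode(child) + [-1]
        return out

    t = Tree(0, [Tree(1, [Tree(3, [Tree(1), Tree(2)]), Tree(2)]), Tree(2)])
    print(encode(t))  # [0, 1, 3, 1, -1, 2, -1, -1, 2, -1, -1, 2, -1]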

Page 40: Frequent Structure Mining

Candidate Generation

• Equivalence Classes – Two k-subtrees are in the same equivalence class iff they share a common prefix string P up to the (k – 1)-th node.

• This reduces equivalence testing to comparing fixed-size arrays.

• Fast and parallel – this can even be run on a GPU.

• Caveat – the order of the tree matters.

Page 41: Frequent Structure Mining

Candidate Generation

• Generate new candidate (k + 1)-subtrees from equivalence classes of k-subtrees.

• Consider each pair of elements in a class, including self-extensions.

• Up to two new candidates result from each pair of joined elements.

• All possible candidate subtrees are enumerated.

• Each subtree is generated only once!

Page 42: Frequent Structure Mining

Candidate Generation

• Each class is represented in memory by a prefix string and a set of ordered pairs indicating the nodes that exist in that class.

• A class is extended by applying a join operator ⊗ to all ordered pairs in the class.
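A minimal sketch of this in-memory representation (hypothetical types, not the paper's code):

    # An equivalence class: the shared prefix string plus its elements,
    # each an ordered pair (label, attach_position).
    from dataclasses import dataclass, field

    @dataclass
    class EquivClass:
        prefix: tuple                                  # e.g. (1, 2)
        elements: list = field(default_factory=list)   # [(label, pos), ...]

    example_class = EquivClass(prefix=(1, 2), elements=[(3, 1), (4, 0)])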

Page 43: Frequent Structure Mining

Candidate Generation

Equivalence class with prefix string (1, 2).

[Figure: two 3-subtrees sharing the prefix — one attaches a node labeled 3 under the node at position 1, the other attaches a node labeled 4 under the node at position 0.]

This generates two elements: (3, v1) and (4, v0). The element notation can be confusing because the first item is a label and the second item is a DFS node number.

Page 44: Frequent Structure Mining

Candidate Generation

• Theorem 1. Define a join operator ⊗ on two elements as (x, i) ⊗ (y, j), and apply one of the following cases:

(1) If i = j and P is not empty, add (y, j) and (y, j + 1) to class [Px]. If P is empty, only add (y, j + 1) to [Px].

(2) If i > j, add (y, j) to [Px].

(3) If i < j, no new candidate is possible.
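A minimal sketch of the three cases (using the hypothetical (label, pos) element representation from the earlier sketch, not the paper's code):

    # Elements that (x, i) ⊗ (y, j) contributes to the new class [Px].
    def join(x_elem, y_elem, prefix_nonempty):
        (x, i), (y, j) = x_elem, y_elem
        if i == j:                                   # case (1)
            return [(y, j), (y, j + 1)] if prefix_nonempty else [(y, j + 1)]
        if i > j:                                    # case (2)
            return [(y, j)]
        return []                                    # case (3): i < j

    # Example from the next slide, with class P = (1, 2):
    print(join((3, 1), (3, 1), True))   # [(3, 1), (3, 2)]
    print(join((3, 1), (4, 0), True))   # [(4, 0)]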

Page 45: Frequent Structure Mining

Candidate Generation

• Consider the prefix class from the previous example: P = (1, 2), which contains two elements, (3, v1) and (4, v0).

1. Join (3, v1) ⊗ (3, v1) – Case (1) applies, producing (3, v1) and (3, v2) for the new class P3 = (1, 2, 3).

2. Join (3, v1) ⊗ (4, v0) – Case (2) applies.

(Don't worry, there's an illustration on the next slide.)

Page 46: Frequent Structure Mining

Candidate Generation

A class with prefix (1, 2) contains a node with label 3. This is written as (3, v1), meaning a node labeled '3' is attached at position 1 in the DFS order of nodes.

[Figure: joining this subtree with itself produces two candidates — one attaching the new node 3 as a child of the existing node 3, one attaching it as a sibling.]

Prefix = (1, 2, 3); new nodes = (3, v2), (3, v1).

Page 47: Frequent Structure Mining

Candidate Generation

[Figure: joining (3, v1) with (4, v0) under prefix (1, 2) adds the element (4, v0) to the class with prefix (1, 2, 3).]

Prefix = (1, 2, 3); new nodes = (3, v2), (3, v1), (4, v0).

Page 48: Frequent Structure Mining

The Algorithm

TREEMINER(D, minsup):
    F1 = { frequent 1-subtrees }
    F2 = { classes [P]1 of frequent 2-subtrees }
    for all [P]1 ∈ F2 do
        ENUMERATE-FREQUENT-SUBTREES([P]1)

ENUMERATE-FREQUENT-SUBTREES([P]):
    for each element (x, i) ∈ [P] do
        [Px] = ∅
        for each element (y, j) ∈ [P] do
            R = { (x, i) ⊗ (y, j) }        -- candidate extensions
            L(R) = { L(x) ∩⊗ L(y) }        -- scope-list joins
            for each R ∈ R, if R is frequent then
                [Px] = [Px] ∪ {R}
        ENUMERATE-FREQUENT-SUBTREES([Px])

Page 49: Frequent Structure Mining

ScopeList Join

• Recall that the scope of a node is the interval [l, r] from its own DFS number l to the DFS number r of its rightmost descendant leaf.

• This can be used to calculate support.

[Figure: the tree v0–v8 annotated with scopes — v0: [0, 8], v1: [1, 5], v2: [2, 2], v3: [3, 5], v4: [4, 4], v5: [5, 5], v6: [6, 6], v7: [7, 8], v8: [8, 8].]
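A minimal sketch of computing scopes with a single DFS (using the hypothetical Tree class from the string-encoding sketch):

    # Assign DFS numbers and record each node's scope [l, r], where r is
    # the DFS number of the node's rightmost descendant leaf.
    def scopes(root):
        result, counter = {}, [0]
        def dfs(node):
            l = counter[0]
            counter[0] += 1
            r = l
            for child in node.children:
                r = dfs(child)
            result[l] = (l, r)
            return r
        dfs(root)
        return result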

Page 50: Frequent Structure Mining

ScopeList Join

• ScopeLists are used to calculate support.

• Let x and y be nodes with scopes sx = [lx, ux] and sy = [ly, uy].

• sx contains sy iff lx ≤ ly and ux ≥ uy.

• A scope list represents the entire forest.
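Both scope tests are constant-time (a sketch, with scopes as simple (l, u) pairs; the embedded-sibling reading follows the next slide):

    # y is a descendant of x iff x's scope contains y's.
    def contains(sx, sy):
        return sx[0] <= sy[0] and sx[1] >= sy[1]

    # x's subtree ends before y begins (used for embedded siblings).
    def strictly_before(sx, sy):
        return sx[1] < sy[0]

    print(contains((1, 5), (3, 5)))         # True: v3 lies under v1
    print(strictly_before((2, 2), (3, 5)))  # True: v2 precedes v3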

Page 51: Frequent Structure Mining

ScopeList Join

• A ScopeList is a list of (t, m, s) 3-tuples, where:
  • t is the tree ID.
  • m is the match label of the (k – 1)-length prefix of xk.
  • s is the scope of the last item, xk.

• Scope lists allow constant-time tests of whether y is a descendant or embedded sibling of x.
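A minimal sketch of one scope-list entry (hypothetical types, not the paper's data structures):

    from typing import NamedTuple

    class ScopeEntry(NamedTuple):
        tree_id: int          # t: which tree in the database
        match_label: tuple    # m: DFS numbers of the (k - 1)-prefix match
        scope: tuple          # s: (l, r) scope of the last node

    entry = ScopeEntry(tree_id=0, match_label=(1, 4), scope=(5, 5))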

Page 52: Frequent Structure Mining

Outline

• Graph Mining Overview
• Mining Complex Structures - Introduction
• Motivation and Contributions of the Author
• Problem Definition and Case Examples
• Main Ingredients for Efficient Pattern Extraction
• Experimental Results   ← current section
• Conclusions

Page 53: Frequent Structure Mining

Experimental Results

• Machine: 500 MHz Pentium II, 512 MB memory, 9 GB disk, RHEL 6.0

• Synthetic data: web browsing
  • Parameters: N = #labels, M = #nodes, F = max fanout, D = max depth, T = #trees
  • Create a master website tree W.
  • For each node in W, generate #children (0 to F).
  • Assign probabilities of following each child or backtracking, adding up to 1.
  • Recursively continue until D is reached.
  • Generate a database of T subtrees of W: start at the root; at each node, generate a random number in (0, 1) to decide which child to follow or whether to backtrack.
  • Default parameters: N = 100, M = 10,000, D = 10, F = 10, T = 100,000
  • Three datasets: D10 (all default values), F5 (F = 5), T1M (T = 10^6)

• Real data: CSLOGS – one month of web log files at RPI CS
  • Over 13,361 pages accessed (#labels)
  • 59,691 user browsing trees (#trees)
  • Average string length of 23.3 per tree

Page 54: Frequent Structure Mining

Distribution of Frequent Trees

[Figure: distribution of frequent tree sizes in the sparse and dense datasets. F5: max fanout = 5; T1M: 10^6 trees.]

Take-home point: many large, frequent trees can be discovered.

Page 55: Frequent Structure Mining

Experiments (Sparse)

[Figure: runtime comparison of TreeMiner and PatternMatcher on the sparse datasets.]

Take-home point: both algorithms are able to cope with the relatively short patterns in sparse data.

Page 56: Frequent Structure Mining

Experiments (Dense)

[Figure: results on the sparse (artificial) dataset and the dense (real-world) dataset.]

Take-home point: with long patterns (length ≈ 20) at low support, the level-wise approach suffers. The author uses the artificial dataset to justify TreeMiner as roughly 20 times faster than PatternMatcher.

Page 57: Frequent Structure Mining

Outline

• Graph Mining Overview
• Mining Complex Structures - Introduction
• Motivation and Contributions of the Author
• Problem Definition and Case Examples
• Main Ingredients for Efficient Pattern Extraction
• Experimental Results
• Conclusions   ← current section

Page 58: Frequent Structure Mining

Conclusions

• TreeMiner: a novel tree mining approach.
  • Non-duplicate candidate generation.
  • Scope-list joins for frequency computation.

• A framework for tree-mining tasks:
  • Frequent subtrees in a forest of rooted, labeled, ordered trees.
  • Frequent subtrees in a single tree.
  • Extensions exist for unlabeled and unordered trees.

Page 59: Frequent Structure Mining

Caveats

• Frequent does not always mean significant!

• Exhaustive enumeration is a problem even though TreeMiner's candidate generation is efficient.

• Low minsup values increase true positives at the cost of increasing false positives.

• State-of-the-art graph miners exploit the structure of the search space (e.g. the LEAP search algorithm) to extract only significant structures.

• Candidate structures can be generated by tree miners and evaluated by some other means.

Page 60: Frequent Structure Mining

Question One

Generate an embedded subtree or subforest for the set of nodes Ns = {v1, v2, v5}. Is this an embedded subtree or a subforest, and why? Assume a labeling function L(x) = x.

[Figure: the tree v0–v8 and the resulting pattern: v1 with children v2 and v5.]

Answer: This is an embedded subtree, because v1 is an ancestor of both v2 and v5, so all of the nodes are connected.

Page 61: Frequent Structure Mining

Question Two

Why is the frequency of subgraphs a good function for evaluating candidate patterns? How could it be better?

Answer: The frequency of subgraphs is a monotonically decreasing (anti-monotonic) function, meaning supergraphs are never more frequent than their subgraphs. Combined with a minimum support threshold, this property prunes the search space as subgraph patterns get bigger. However, frequency does not always imply significance; another metric must be used to evaluate the candidates generated by a graph miner for significance.

Page 62: Frequent Structure Mining

Question Three

How is a string representation of a tree useful in graph mining? What requirements does it place on the graph?

Answer: A string representation of a tree is useful because string comparisons are worst-case O(n) and can be easily optimized. However, it requires that the tree be rooted and ordered; otherwise the string comparison operator would not be well-defined.