mining for tree-query associations in a graph

66
Mining for Tree-Query Associations in a Graph Jan Van den Bussche Hasselt University, Belgium joint work with Bart Goethals (U Antwerp, Belgium) and Eveline Hoekx (U Hasselt, Belgium)

Upload: luz

Post on 08-Jan-2016

35 views

Category:

Documents


2 download

DESCRIPTION

Mining for Tree-Query Associations in a Graph. Jan Van den Bussche Hasselt University, Belgium joint work with Bart Goethals (U Antwerp, Belgium) and Eveline Hoekx (U Hasselt, Belgium). Graph Data. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Mining for Tree-Query Associations  in a Graph

Mining for Tree-Query Associations in a Graph

Jan Van den Bussche Hasselt University, Belgium

joint work with Bart Goethals (U Antwerp, Belgium)and Eveline Hoekx (U Hasselt, Belgium)

Page 2: Mining for Tree-Query Associations  in a Graph

2

Graph Data

A (directed) graph over a set of nodes N is a set G of edges: ordered pairs ij with ij N.

Snapshot of a graph representing the metabolic pathway of a human.

Applications: life sciences, biology, social sciences, WWW, ...

Page 3: Mining for Tree-Query Associations  in a Graph

3

Graph Mining

Transactional category– dataset: set of many small graphs (transactions)

– frequency: transactions in which the pattern occurs (at least once)

– ILP: Warmr

[AGM, FSG, TreeMiner, gSpan, FFSM, Horvath-Ramon-Wrobel]

Single graph category– dataset: single large graph

– frequency: copies of the pattern in the large graph

[Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom]

Focus on pattern mining, few work on association rule mining!

Page 4: Mining for Tree-Query Associations  in a Graph

4

Tree-Query Pattern

• powerful tree-shaped pattern• inspired by conjunctive database queries

• special features:– existential nodes– parameterized nodes

• occurrence of the pattern in G is any homomorphism from the pattern in G

0

8 x frequency: x z: 0z G z8 G zx G

Page 5: Mining for Tree-Query Associations  in a Graph

5

Association rules

0

8 x

x1

x3 x2

(x1,x2,x3) (0, ,8)x

• fully fledged associations over tree-query patterns• example:

Page 6: Mining for Tree-Query Associations  in a Graph

6

Experimental results: Real-life datasets

• Food web nodes1 edges30

x1

20

x2

20

frequency = 176

x1

x2

x1

20

x2

confidence = 89%

Page 7: Mining for Tree-Query Associations  in a Graph

7

Experimental results: Real-life datasets

• Food web nodes1 edges30

x1

20

x2

frequency = 176

x1

x2

x1

20

x2

confidence = 89%

Page 8: Mining for Tree-Query Associations  in a Graph

8

Experimental results: Food web

nodes1 edges30

⇒ x4

x3

x2

(x1,x2,x3,x4,x5)

x1

x5

x4

x3

x2

(0,x2,x3,x4,x5)

0

x5

⇒ x4

x3

x2

(x1,x2,x3,x4,x5)

x1

x5

x3

x2

x1

(x1,x2,x3,x4,x5)

0

x4

x5

45% 55%

Page 9: Mining for Tree-Query Associations  in a Graph

9

Experimental results: Real-life datasets

• Protein interactions graph nodes211 edges80

x1

x2

x1

22

x2 confidence = 10%

Page 10: Mining for Tree-Query Associations  in a Graph

10

Experimental results: Protein interaction graph

nodes211 edges80

746

x2

(x1,x2)

x1

(x1,x2)

746

x2

x1

376

90%

Page 11: Mining for Tree-Query Associations  in a Graph

11

Outline rest of the talk

• Formal problem definition• Algorithm

– overall approach– levelwise generation of tree patterns– generation of containment mappings– generation of parameter assignments

• Equivalent association rules• Certhia• Performance and Experimental results• Future work

Page 12: Mining for Tree-Query Associations  in a Graph

12

Tree pattern

0

8 x

Page 13: Mining for Tree-Query Associations  in a Graph

13

Tree pattern

0

8 x

Page 14: Mining for Tree-Query Associations  in a Graph

14

Tree pattern

0

8 x

Page 15: Mining for Tree-Query Associations  in a Graph

15

Tree pattern

0

8 x

Page 16: Mining for Tree-Query Associations  in a Graph

16

Tree pattern

0

8 x

select distinct G3.to as xfrom G G1, G G2, G G3where G1.from=5 and G1.to=G2.from

and G1.to=G3.from and G2.to=8

Page 17: Mining for Tree-Query Associations  in a Graph

17

Matching

0

8 x

P:

0

1

5 4

2 3

7

6 8

G: z y z x

Page 18: Mining for Tree-Query Associations  in a Graph

18

Matching

0

8 x

P:

0

1

5 4

2 3

7

6 8

G: z y z x

Page 19: Mining for Tree-Query Associations  in a Graph

19

Matching

0

1

5 4

2 3

7

6 8

G: z y z x

h1 0 1 8 4

0

8 x

P:

Page 20: Mining for Tree-Query Associations  in a Graph

20

Matching

0

1

5 4

2 3

7

6 8

G: z y z x

h 0 1 8 4

h 0 1 8 8

0

8 x

P:

Page 21: Mining for Tree-Query Associations  in a Graph

21

Matching

0

1

5 4

2 3

7

6 8

G: z y z x

h 0 1 8 4

h 0 1 8 8

h 0 2 8 4

0

8 x

P:

Page 22: Mining for Tree-Query Associations  in a Graph

22

Matching

0

1

5 4

2 3

7

6 8

G: z y z x

h 0 1 8 4

h 0 1 8 8

h 0 2 8 4

h 0 2 8 5

0

8 x

P:

Page 23: Mining for Tree-Query Associations  in a Graph

23

Matching

0

1

5 4

2 3

7

6 8

G: z y z x

h 0 1 8 4

h 0 1 8 8

h 0 2 8 4

h 0 2 8 5

h 0 2 8 8

0

8 x

P:

Page 24: Mining for Tree-Query Associations  in a Graph

24

Frequency

0

1

5 4

2 3

7

6 8

G: z y z x

h 0 1 8 4

h 0 1 8 8

h 0 2 8 4

h 0 2 8 5

h 0 2 8 8

frequency = 3

0

8 x

P:

Page 25: Mining for Tree-Query Associations  in a Graph

25

Tree Query

0

8 x

( , , 8)x x

P, body

H, head

• Q = (H,P)

Page 26: Mining for Tree-Query Associations  in a Graph

26

Association Rule

• AR: Q1 Q2

Confidence (AR) = freq(Q2)/freq(Q1) Q2 Q1

x

8 6

0

( , , 6)x x

x1

1

8 x3

0

(x1, x2, x3)

x2

2

Q1 Q2

{ (x1,x2,x3) | Q1(x1,x2,x3) G} { (x,x,6) | Q2(x,x,6) G }

Page 27: Mining for Tree-Query Associations  in a Graph

27

Examples of Association Rules

x1

x2

(x1, x2)

,) ⇒

x1

8

(x1, 8)

,)

x1

(x1)

,) ⇒

x1

8

(x1)

,)

(1) (2)

Page 28: Mining for Tree-Query Associations  in a Graph

28

Association Rule

• AR: Q1 Q2

Confidence (AR) = freq(Q2)/freq(Q1) Q2 Q1

x

8 6

0

( , , 6)x x

x1

1

8 x3

0

(x1, x2, x3)

x2

2

Q1 Q2

{ (x1,x2,x3) | Q1(x1,x2,x3) G} { (x,x,6) | Q2(x,x,6) G }

Page 29: Mining for Tree-Query Associations  in a Graph

29

Containment Mapping

x

8 6

0

( , , 6)x x

x1

1

8 x3

0

(x1, x2, x3)

x2

2

Q1 Q2

containment mapping

Page 30: Mining for Tree-Query Associations  in a Graph

30

Containment Mapping

x

8 6

0

( , , 6)x x

x1

1

8 x3

0

(x1, x2, x3)

x2

2

Q1 Q2

containment mapping

Page 31: Mining for Tree-Query Associations  in a Graph

31

Containment Mapping

x

8 6

0

( , , 6)x x

x1

1

8 x3

0

(x1, x2, x3)

x2

2

Q1 Q2

containment mapping

Page 32: Mining for Tree-Query Associations  in a Graph

32

Containment Mapping

x

8 6

0

( , , 6)x x

x1

1

8 x3

0

(x1, x2, x3)

x2

2

Q1 Q2

containment mapping

Page 33: Mining for Tree-Query Associations  in a Graph

33

Containment Mapping

x

8 6

0

( , , 6)x x

x1

1

8 x3

0

(x1, x2, x3)

x2

2

Q1 Q2

containment mapping

Q Q containment mapping from Q1 to Q2

Page 34: Mining for Tree-Query Associations  in a Graph

34

Problem statement: Mining tree queries

Given a graph G and a threshold k, find all tree queries that

have frequency at least k in G, those queries are calledfrequent.

Page 35: Mining for Tree-Query Associations  in a Graph

35

Problem statement: Association rules

• Input:– a graph G– minsup

– Qleft frequent in G

– minconf

• Output: All association rules Qleft Q– frequent in G– confident in G.

Page 36: Mining for Tree-Query Associations  in a Graph

36

Algorithm: mining tree queries

Outer loop: Generate, incrementally, all possible trees of increasing sizes. Avoid generation of isomorphic trees.

Inner loop: For each newly generated tree, generate all queries based on that tree, and test their frequency.

...

x1

x4x3

x2

x2x1

x2x1

x1

2

Page 37: Mining for Tree-Query Associations  in a Graph

37

Outer loop

• It is well known how to efficiently generate all trees uniquely up to isomorphism

• Based on canonical form of trees.

• [Scions, Li-Ruskey, Zaki, Chi-Young-Muntz]

Page 38: Mining for Tree-Query Associations  in a Graph

38

Inner loop: Levelwise approach

• A query Q is characterized by Q set of existential nodes Q set of selected nodes– Labeling Q of the selected nodes by constants.

• Q11 1 1 specializes Q22 2 2 if 12, 1 2 and 1 agrees with 2 on 2.

• If Q1 specializes Q2 then freqQ1 freqQ2

• Most general query: T = (, , )

Page 39: Mining for Tree-Query Associations  in a Graph

39

Inner loop: Candidate generation

• CanTab is a candidate queryFreqTab is a frequent query

• Q’=’ ’ is a parent of Q= if either:– ’ and has precisely one more node than

’, or– ’ and has precisely one more node than ’

• Join Lemma: Each candidacy table can be computed by taking the natural join of its parent frequency tables.

Page 40: Mining for Tree-Query Associations  in a Graph

40

Inner loop: Frequency counting

• Each candidacy table can be computed by a single SQL query. (ref. Join lemma).

• Suppose: Gfrom to table in the database, then each frequency table can be computed with a single SQL query.

» formulate in SQL and count

» formulate in SQL E» natural join of E with CanTab

» group by » count each group

Page 41: Mining for Tree-Query Associations  in a Graph

41

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

Page 42: Mining for Tree-Query Associations  in a Graph

42

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

• Join expression:

CanTab{x2}{x1,x3} = FreqTabx2x1 ⋈ FreqTabx2x3 ⋈ FreqTabx1x3

Page 43: Mining for Tree-Query Associations  in a Graph

43

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

• Join expression:

CanTab{x2}{x1,x3} = FreqTabx2x1 ⋈ FreqTabx2x3 ⋈ FreqTabx1x3

Page 44: Mining for Tree-Query Associations  in a Graph

44

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

• Join expression:

CanTab{x2}{x1,x3} = FreqTabx2x1 ⋈ FreqTabx2x3 ⋈ FreqTabx1x3

Page 45: Mining for Tree-Query Associations  in a Graph

45

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

• Join expression:

CanTab{x2}{x1,x3} = FreqTabx2x1 ⋈ FreqTabx2x3 ⋈ FreqTabx1x3

Page 46: Mining for Tree-Query Associations  in a Graph

46

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

• Join expression:

CanTab{x2}{x1,x3} = FreqTabx2x1 ⋈ FreqTabx2x3 ⋈ FreqTabx1x3

Page 47: Mining for Tree-Query Associations  in a Graph

47

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

• SQL expression E for x2

select distinct G1.from as x1, G2.to as x3, G3.to as x4

from G G1, G G2, G G3where G1.to = G2.from and G3.from = G2.from

Page 48: Mining for Tree-Query Associations  in a Graph

48

Inner loop: Example

0

8 x

:Q

x1

x2

x3 x4

T: x2

x1 x3

x10 x38

• SQL expression for filling the frequency table:

select distinct E.x1, E.x3, count(E.x4)from E, CanTab{x2}{x1,x3} as CT

where E.x1 = CT.x1 and E.x3 = CT.x3group by E.x1, E.x3having count(E.x4) >= k

Page 49: Mining for Tree-Query Associations  in a Graph

49

Algorithm: Mining association rules

Loop 1: Generate incrementally all possible trees T of increasing sizes.

Loop 2: For each T, generate all frequent tree patterns P based T.

Loop 3: For each P, generate all containment mappings f from Pleft to P.

Loop 4: For each f, generate Q=(f(Hleft),P) and all parameter instantiations for Qleft Q.

Page 50: Mining for Tree-Query Associations  in a Graph

50

Pattern database

• For each P a table FreqTabP, that contains all frequent parameter instantiations.

Pattern Database

Page 51: Mining for Tree-Query Associations  in a Graph

51

Loop 3: Generation of containment mappings

Efficiently solvable, due to tree shape.

Page 52: Mining for Tree-Query Associations  in a Graph

52

Loop 4: Generation of parameter

instantiations single relational algebra expression (SQL)

plistσ FreqTabP .freq

FreqTabPleft .freq≥minconf

(σ ϑ left (FreqTab Pleft ) ><θ FreqTab P )

ϑ :σ ∈∑left

∧FreqTab Pleft.σ = FreqTab P . f (σ )

ϑ left :σ ∈∑left

∧FreqTab Pleft.σ = σ left (σ )

plist: all P.σ withσ ∈∑,FreqTabP .freq,

FreqTabP .freqFreqTabPleft

.freq

Page 53: Mining for Tree-Query Associations  in a Graph

53

Example: Loop 4

x2

x

1

(x2, x, x)

x

Qleft

1

x2

4 5

P

Page 54: Mining for Tree-Query Associations  in a Graph

54

Example: Loop 4

x2

x

1

(x2, x, x)

x

Qleft

1

x2

4 5

Q

(x2, x2, 5)

Page 55: Mining for Tree-Query Associations  in a Graph

55

Example: Loop 4

x2

x

1

(x2, x, x)

x

Qleft

1

x2

4 5

Q

(x2, x2, 5)

select freqQleft.x1, freqQleft.x4, freqP.x1, freqP.x4, freqP.x5, freqP.freq, freqP.freq/freqQleft.freqfrom freqP, freqQleftwhere freqQleft.x1=freqP.x1 and freqQleft.x4=freqP.x4and freqP.freq/freqQleft.freq >= minconf

Page 56: Mining for Tree-Query Associations  in a Graph

56

Equivalent queries

Queries Q1 and Q2 are equivalent if same result sets on all

graphs G (up to renaming of the distinguished variables)

• 2 cases of equivalent queries:1. Q1 has fewer nodes than Q2

2. Q1 and Q2 have the same number of nodes

Page 57: Mining for Tree-Query Associations  in a Graph

57

Equivalence theorem

A containment mapping from Q1 to Q2 is a h: Q1 Q2 that

maps distinguished variables of Q1 one-to-one to distinguished

variables of Q2, and maps selected nodes of Q1 to selected

nodes of Q2, preserving labels

Two queries are equivalent if and only if there are containment mappings between them in both directions.

Page 58: Mining for Tree-Query Associations  in a Graph

58

Q2 x1

x3

x2

Case 1: Q1 fewer nodes than Q2

Redundancy lemma: Let Q be a tree query without selected nodes. Then Q has aredundancy if and only if it contains a subtree C in the form of

alinear chain of nodes (possibly just a single node), such that

the parent of C has another subtree that is at least as deep as C.

Q1 x1

x3

x2

Redundantsubtree

Page 59: Mining for Tree-Query Associations  in a Graph

59

Case 2: Q1 and Q2 same number of nodes

• Q1 and Q2 must be isomorphic.

• Canonical form of queries: refine the canonical ordering of the underlying unlabeled tree, taking into account node labels.

Page 60: Mining for Tree-Query Associations  in a Graph

60

Equivalent Association Rules

(x1, x2, x3, x)

,) ⇒

x1

x3

x2

(x1, x2, x2, x3)

,)

x1

x3

x2

x

(x1, x2, x3, x)

,) ⇒

x1

x3

x2

(x1, x2, x3, x3)

,)

x1

x3

x3

x

(1)

(2)

Page 61: Mining for Tree-Query Associations  in a Graph

61

Equivalence detection for rules

• Many cases efficiently checked.• But worst case still as hard as general graph isomorphism checking.

• Fast heuristics for graph isomorphism checking i.e. Nauty

Page 62: Mining for Tree-Query Associations  in a Graph

62

Certhia

• Loop 1 + Loop 2: preprocessing step Pattern database

• Loop 3 + Loop 4: interactive browsing tool Certhia Demo session

Page 63: Mining for Tree-Query Associations  in a Graph

63

Experimental results: Performance

• Fully implemented on top of IBM DB2

• Preliminary performance results:– adequate performance– huge number of patterns– constant overhead per discovered pattern

Page 64: Mining for Tree-Query Associations  in a Graph

64

Performance: Association rules

• Loop 3 and Loop 4: –very fast–constant overhead per rule

Page 65: Mining for Tree-Query Associations  in a Graph

65

Future work

• Serious scientific data mining• Loosen restriction to trees

Page 66: Mining for Tree-Query Associations  in a Graph

66

Publications

• B. Goethals, E. Hoekx, J. Van den Bussche, “Mining tree queries in a graph”, KDD’05, p 61–69.

• E. Hoekx, J. Van den Bussche, “Mining tree-query associations in a graph”, ICDM’06 regular paper.

• “Certhia: Tree-query mining in large graphs”, ICDM’06 software demo.

• http://alpha.uhasselt.be/~vdbuss