
Upload: harvey-dickerson

Post on 21-Dec-2015


QSX: Querying Social Graphs

Graph Pattern Matching

Graph pattern matching via subgraph isomorphism

Graph pattern matching via graph simulation

Revisions of graph simulation for social network analysis


The need for studying graph pattern matching

Prevalent use in traditional and emerging applications

Applications

• pattern recognition

• knowledge discovery

• intelligence analysis

• transportation network analysis

• Web site classification

• social position and community detection

• social media marketing

• knowledge fusion

• . . .

Subgraph isomorphism: complexity and algorithm

Social Graphs

Directed graph G = (V, E, fA), where fA(u) gives the attributes of node u.

Simplification: node labels. Assume fA(u) has a unique attribute: label.

[Figure: example social graph with node labels Gen, Med, Soc, AI, Chem, DB, Eco]

Subgraph isomorphism

A function f from the nodes of Q to the nodes of G such that:

• for each node u in Q, u and f(u) have the same label;
• there exists an edge (u, u') in Q if and only if there exists an edge (f(u), f(u')) in G.

A bijection: identical label matching, edge-to-edge relations
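The two conditions can be checked mechanically for a given mapping. The following is a minimal sketch, assuming graphs are given as a dict of node labels plus a set of directed edge pairs; the function name and this representation are illustrative and not taken from the slides.

```python
def is_subgraph_isomorphism(f, q_labels, q_edges, g_labels, g_edges):
    """Return True if f (dict: node of Q -> node of G) satisfies both conditions."""
    # f must be defined on every node of Q and be injective
    if set(f) != set(q_labels) or len(set(f.values())) != len(f):
        return False
    # identical label matching
    if any(q_labels[u] != g_labels.get(f[u]) for u in f):
        return False
    # edge (u, u') in Q if and only if edge (f(u), f(u')) in G
    return all(((u, u2) in q_edges) == ((f[u], f[u2]) in g_edges)
               for u in f for u2 in f)


# Hypothetical example data
q_labels = {"a": "A", "b": "B"}
q_edges = {("a", "b")}
g_labels = {1: "A", 2: "B", 3: "B"}
g_edges = {(1, 2), (3, 1)}

ok = is_subgraph_isomorphism({"a": 1, "b": 2}, q_labels, q_edges, g_labels, g_edges)
bad = is_subgraph_isomorphism({"a": 1, "b": 3}, q_labels, q_edges, g_labels, g_edges)
```

In this example the first mapping is accepted, while the second is rejected because the pattern edge (a, b) has no counterpart (1, 3) in G.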

[Figure: data graph G with nodes labeled A, B, D, E, including nodes v1 and v2; pattern Q with nodes labeled A, B, D, E]

Matching by subgraph isomorphism

Input: A directed graph G, and a graph pattern Q

Output: all subgraphs of G that are isomorphic to Q

Complexity: intractable

• NP-complete
• Exponentially many matches
• Remains NP-hard even when
  – Q is a tree and G is a forest
  – Q is acyclic and G is a tree
• PTIME if Q is a forest and G is a tree

The lower bound is rather robust.

Algorithms for computing subgraph isomorphism

Input: pattern Q and graph G

Output: all isomorphic mappings P from Q to G

Match(P)

• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P;
• for each pair p = (u, v) in S(P)
  • if p passes the feasibility check
  • then P' ← P ∪ {p}; call Match(P');
• restore data structures

Notes:

• P: partial mapping, initially empty.
• S(P): nodes that are directly connected to those already in P, with the same labels.
• Recursion, refinement: for each pair p = (u, v) in S(P), enumerate all possible extensions; if the feasibility test is not successful, drop p and try the next pair.

Guarantee correctness
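The recursion can be sketched in a few lines. This is a simplified illustration rather than the algorithm of any particular paper: it extends P with the next unmapped pattern node in a fixed order instead of computing S(P) from the neighbors of mapped nodes, and the graph representation (label dict plus edge set) is an assumption.

```python
def match_all(q_labels, q_edges, g_labels, g_edges):
    """Enumerate all mappings of Q into G that satisfy the 'if and only if' edge condition."""
    q_nodes = list(q_labels)
    results = []

    def feasible(p, u, v):
        # same label, and v is not already the image of another pattern node
        if q_labels[u] != g_labels[v] or v in p.values():
            return False
        # edges between (u, v) and every mapped pair (and itself) must agree
        for u2, v2 in list(p.items()) + [(u, v)]:
            if ((u, u2) in q_edges) != ((v, v2) in g_edges):
                return False
            if ((u2, u) in q_edges) != ((v2, v) in g_edges):
                return False
        return True

    def match(p):
        if len(p) == len(q_nodes):       # P covers all nodes in Q
            results.append(dict(p))
            return
        u = q_nodes[len(p)]              # next unmapped pattern node
        for v in g_labels:               # candidate pairs (u, v)
            if feasible(p, u, v):
                p[u] = v                 # P' = P ∪ {(u, v)}
                match(p)
                del p[u]                 # restore data structures

    match({})
    return results


# Hypothetical example data
matches = match_all({"a": "A", "b": "B"}, {("a", "b")},
                    {1: "A", 2: "B", 3: "B"}, {(1, 2), (1, 3)})
```

Here the pattern edge A → B has two matches in G, one for each B node, which hints at why the number of matches can grow exponentially.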

VF2

VF2: a popular algorithm for subgraph isomorphism

Match(P)

• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P;
• for each pair p = (u, v) in S(P)
  • if p passes the feasibility check
  • then P' ← P ∪ {p}; call Match(P');
• restore data structures

Feasibility rules: five k-look-ahead rules, to make sure that P is a partial isomorphic mapping. For each pair (u, v) in P:

• their predecessors are already mapped and included in P;
• their successors can possibly be mapped;
• certain conditions on the cardinalities of predecessors and successors ensure correctness and expandability.

The rules guarantee correctness and reduce backtracking.

L. P. Cordella, P. Foggia, C. Sansone, M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004

Ullmann's algorithm

An algorithm that is still being used.

Backtrack(P)

• if P covers all nodes in Q then output P and return;
• for each node u in Q that is not yet in P
  • find a node v in G; p ← (u, v); P' ← P ∪ {p};
  • if P' makes a partial mapping (injective function, preserving edges)
  • then call Backtrack(P');

Notes:

• Uses adjacency matrices of G and Q, their transposes, and a form of permutation matrices; P is represented by expanding permutation matrices.
• For each candidate pair p = (u, v): enumerate all possible extensions, for refinement.
• Backtracking: no matter whether the test is successful or not, go back to the previous level and try another p.

J. R. Ullmann. An Algorithm for Subgraph Isomorphism. JACM, 1976.

Graph simulation: complexity and algorithm


Graph Simulation

A relation: identical label matching, edge-to-edge mapping (relations as opposed to functions).

A binary relation R on the nodes of Q and the nodes of G such that:

• for each node u in Q, there exists a node v in G such that (u, v) is in R;
• for each pair (u, v) in R, u and v have the same label;
• for each pair (u, v) in R and each edge (u, u') in Q, there exists an edge (v, v') in G such that (u', v') is in R.

[Figure: data graph G with nodes labeled A, B, D, E, including nodes v1 and v2; pattern Q with nodes labeled A, B, D, E]

Matching by graph simulation

Input: a directed graph G, and a graph pattern Q

Output: the maximum simulation relation R

Maximum simulation relation: always exists and is unique.

• If a match relation exists, then there exists a maximum one.
• Otherwise, it is the empty set, which is still maximum.

Use relations instead of functions.

Complexity: quadratic time, O((|V| + |VQ|)(|E| + |EQ|))

The output is a unique relation, possibly of size |VQ||V|.

Data locality

Given a pattern Q, a graph G and a node v in G, can we decide whether v matches some node in Q by inspecting only nodes within d hops of v, where d is determined by Q only?

• Subgraph isomorphism has data locality: with d the diameter of Q, we only need to inspect the d-neighborhood of v.
• Graph simulation does not have data locality: it is a recursive computation.

[Figure: pattern Q and data graph G]

Algorithm for computing graph simulation

Input: pattern Q and graph G

Output: for each u in Q, sim(u): the matches w in G

Similarity(Q, G)

• for all nodes u in Q do
  • sim(u) ← the set of candidate matches w in G;
• while there exist an edge (u, v) in Q and a node w in sim(u) such that successor(w) ∩ sim(v) = ∅
  • sim(u) ← sim(u) \ {w};
• output sim(u) for all u in Q

Candidate matches: nodes w with the same label as u; moreover, if u has an outgoing edge, so does w.

Refinement: successor(w) ∩ sim(v) = ∅ means that there exists an edge from u to v in Q, but the candidate w of u has no corresponding edge to a node w' that matches v.

Correct, but not in quadratic time.
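The refinement loop can be written directly as a fixpoint computation. This is a minimal sketch under assumptions: graphs are a label dict plus a collection of directed edge pairs, and candidates are chosen by label equality only. It makes no attempt at the quadratic bound.

```python
def naive_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return sim: dict mapping each pattern node u to its set of matches in G."""
    succ = {w: set() for w in g_labels}
    for a, b in g_edges:
        succ[a].add(b)
    # candidate matches: nodes of G with the same label
    sim = {u: {w for w in g_labels if g_labels[w] == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, v in q_edges:
            for w in list(sim[u]):
                if not (succ[w] & sim[v]):   # successor(w) ∩ sim(v) = ∅
                    sim[u].remove(w)         # w violates the simulation condition
                    changed = True
    return sim


# Hypothetical example: node 3 has label A but no edge to a B node
sim = naive_simulation({"a": "A", "b": "B"}, [("a", "b")],
                       {1: "A", 2: "B", 3: "A"}, [(1, 2)])
```

If some sim(u) ends up empty, G does not match Q and the maximum simulation relation is the empty set.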

Speedup

For each node u in pattern Q, maintain prevsim(u): a superset of sim(u), containing the nodes once considered as candidate matches of u.

• Invariant: for each edge (u, v) in Q and each w in sim(u), successor(w) ∩ prevsim(v) ≠ ∅.
• prevsim(u) \ sim(u): invalid candidates. Each node in prevsim(u) is looked up only once.
• If successor(w) ∩ prevsim(v) = ∅, then w should be removed from sim(u), where u is a predecessor of v: propagate violations upward.
• Once w is removed, it is never put back.
• Terminate if prevsim(u) = sim(u) for all nodes u in Q: the relation can't be refined further.

Algorithm

Similarity(Q, G)

• for all nodes v in Q do
  • sim(v) ← the set of candidate matches in G;
  • prevsim(v) ← the set of all the nodes in G;
• while there exists a node v in Q such that sim(v) ≠ prevsim(v)
  • remove ← predecessor(prevsim(v)) \ predecessor(sim(v));
  • for all u in predecessor(v) do
    • sim(u) ← sim(u) \ remove;
  • prevsim(v) ← sim(v);
• output sim(v) for all v in Q

Notes:

• Candidate matches: nodes with the same label; moreover, if u has an outgoing edge, so does w.
• Refinement: removals are propagated up to the predecessors of v.
• For each w ∈ prevsim(v) \ sim(v), w is checked only once, hence |VQ||V| checks in total.
• remove can be maintained dynamically.

Can be implemented in O((|V| + |VQ|)(|E| + |EQ|)) time.
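The prevsim-based refinement can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: it recomputes predecessor sets in every round instead of maintaining remove dynamically, so it does not achieve the stated bound. It also assigns a snapshot of sim(v), taken before the removals, to prevsim(v); that choice keeps the invariant intact when a pattern node has a self-loop.

```python
def simulation_with_prevsim(q_labels, q_edges, g_labels, g_edges):
    """Maximum simulation via prevsim refinement; returns {u: set of matches}."""
    pred_g = {w: set() for w in g_labels}
    g_has_succ = set()
    for a, b in g_edges:
        pred_g[b].add(a)
        g_has_succ.add(a)
    pred_q = {v: set() for v in q_labels}
    q_has_succ = set()
    for u, v in q_edges:
        pred_q[v].add(u)
        q_has_succ.add(u)

    def pre(nodes):
        # predecessor(X): all nodes of G with an edge into some node of X
        return set().union(*(pred_g[w] for w in nodes))

    # candidates: same label; if v has an outgoing edge, so must w
    sim = {v: {w for w in g_labels
               if g_labels[w] == q_labels[v]
               and (v not in q_has_succ or w in g_has_succ)}
           for v in q_labels}
    prevsim = {v: set(g_labels) for v in q_labels}

    while True:
        v = next((x for x in q_labels if sim[x] != prevsim[x]), None)
        if v is None:
            return sim                    # prevsim = sim everywhere: done
        current = set(sim[v])             # snapshot of sim(v)
        remove = pre(prevsim[v]) - pre(current)
        for u in pred_q[v]:
            sim[u] -= remove              # propagate violations upward
        prevsim[v] = current


# Hypothetical example: chain A -> B -> C; node 5 (B) has no C successor,
# so node 4 (A), whose only successor is 5, must be removed as well.
result = simulation_with_prevsim(
    {"a": "A", "b": "B", "c": "C"}, [("a", "b"), ("b", "c")],
    {1: "A", 2: "B", 3: "C", 4: "A", 5: "B"}, [(1, 2), (2, 3), (4, 5)])
```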

Graph simulation revised for social network analysis


Graph pattern matching: the conventional notions

Find all matches of a pattern in a graph.

Input: a query Q and a data graph G.

Output: all the matches of Q in G.

• Subgraph isomorphism: a bijective function f on nodes such that (u, u') ∈ Q iff (f(u), f(u')) ∈ G.
• Graph simulation: a binary relation S on nodes such that for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G with (u', v') ∈ S.

Can we use the conventional notions for social network analysis?

Example query: graph pattern matching

Identify suspects in a drug ring.

[Figure: pattern graph over nodes B, AM, S, FW with hop bounds on its edges; data graph with nodes B, A1, ..., Am and several W nodes. Source: "Understanding the structure of drug trafficking organizations"]

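Pattern matching in social graphs

Neither subgraph isomorphism nor graph simulation works:

• relation instead of function: some matches are not allowed by a bijection;
• edges to paths: pattern edges need to be mapped to paths in the data graph.

The revisions are needed for both scalability and effectiveness.

[Figure: the drug ring pattern graph (B, AM, S, FW) and data graph (B, A1, ..., Am, W)]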

Social graphs: modeling attributes

Directed graph G = (V, E, fA), with attributes fA(u): a tuple (A1 = a1, ..., An = an), e.g., label, keywords, blogs, comments, rating, …

[Figure: example social graph with nodes Gen, Med, Soc, AI, Chem, DB, Eco and attribute tuples ('dept'=CS, 'field'=AI), ('dept'=CS, 'field'=DB), ('dept'=Bio, 'field'=Gen), ('dept'=Bio, 'field'=Eco)]

Bounded patterns

Pattern graph: Q = (VQ, EQ, fv, fe)

• fv(u): a search condition, i.e., a conjunction of atoms A op a, where op is one of <, ≤, =, ≠, >, ≥; for example, fv(u): 'dept' = CS.
• fe(u, u'): a bound, either a constant k (bounded: within k hops) or the symbol ∗ (unbounded).

Bounded patterns incorporate search conditions and bounds on the number of hops.

[Figure: pattern with nodes CS, Bio, Soc, Med and edge bounds ∗, 3, ∗, 2, 2, 3]
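A search condition fv(u) can be evaluated against a node's attribute tuple as a plain conjunction of comparisons. The sketch below is illustrative: the encoding of a condition as a list of (attribute, operator, constant) triples and the ASCII operator spellings are assumptions.

```python
import operator

# ASCII spellings of the comparison operators <, ≤, =, ≠, >, ≥
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">": operator.gt, ">=": operator.ge}


def satisfies(attrs, condition):
    """attrs: dict of node attributes fA(v).
    condition: list of (A, op, a) atoms, read as a conjunction."""
    return all(A in attrs and OPS[op](attrs[A], a) for A, op, a in condition)


node = {"dept": "CS", "field": "AI"}
is_cs = satisfies(node, [("dept", "=", "CS")])
is_cs_db = satisfies(node, [("dept", "=", "CS"), ("field", "=", "DB")])
```

A node that lacks an attribute mentioned in the condition is treated as not satisfying it; that is a design choice of this sketch.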

Bounded Simulation

G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via bounded simulation if there exists a binary relation S ⊆ VQ × V such that:

• S is a total mapping: for each u ∈ VQ, there exists v ∈ V such that (u, v) ∈ S;
• S satisfies the search conditions: for each (u, v) ∈ S, the attributes fA(v) satisfy the predicate fv(u);
• S maps edges to bounded paths: for each (u, v) ∈ S, each edge (u, u') in EQ is mapped to a path from v to v' in G whose length is within the bound fe(u, u'), with (u', v') ∈ S.

[Figure: pattern with nodes CS, Bio, Soc, Med and edge bounds ∗, 3, ∗, 2, 2, 3; data graph with nodes DB, AI, Gen, Eco, Chem, Soc, Med; relation S between them]

There exists a unique maximum match.
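A naive fixpoint computation makes the definition concrete. This sketch is not the cubic-time algorithm from the reading list; it assumes that a bound k means a nonempty path of at most k edges, that None encodes the unbounded symbol ∗, and that search conditions are given as Python predicates.

```python
def reach_within(succ, v, k):
    """Nodes reachable from v via a nonempty path of at most k edges (k=None: any length)."""
    out, frontier, depth = set(), {v}, 0
    while frontier and (k is None or depth < k):
        frontier = {y for x in frontier for y in succ[x]} - out
        out |= frontier
        depth += 1
    return out


def bounded_simulation(q_pred, q_edges, g_attrs, g_edges):
    """q_pred: {u: predicate on attribute dicts}; q_edges: {(u, u2): k or None}."""
    succ = {v: set() for v in g_attrs}
    for a, b in g_edges:
        succ[a].add(b)
    # candidates: nodes whose attributes satisfy the search condition
    sim = {u: {v for v in g_attrs if pred(g_attrs[v])} for u, pred in q_pred.items()}
    changed = True
    while changed:
        changed = False
        for (u, u2), k in q_edges.items():
            for v in list(sim[u]):
                if not (reach_within(succ, v, k) & sim[u2]):
                    sim[u].remove(v)      # no bounded path to a match of u2
                    changed = True
    if any(not s for s in sim.values()):  # S must be total, else no match
        return {u: set() for u in sim}
    return sim


# Hypothetical example: CS node must reach a Bio node within 2 hops
g_attrs = {1: {"dept": "CS"}, 2: {"dept": "Soc"}, 3: {"dept": "Bio"},
           4: {"dept": "CS"}, 5: {"dept": "Soc"}, 6: {"dept": "Soc"}, 7: {"dept": "Bio"}}
g_edges = [(1, 2), (2, 3), (4, 5), (5, 6), (6, 7)]
q_pred = {"cs": lambda a: a["dept"] == "CS", "bio": lambda a: a["dept"] == "Bio"}

within_2 = bounded_simulation(q_pred, {("cs", "bio"): 2}, g_attrs, g_edges)
unbounded = bounded_simulation(q_pred, {("cs", "bio"): None}, g_attrs, g_edges)
```

With the bound 2, node 4 is dropped because its nearest Bio node is 3 hops away; with the unbounded edge both CS nodes match.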

Bounded simulation in social graphs

The match is the set of all suspects involved in a drug ring: edges are mapped to paths, and a relation is used instead of a function.

[Figure: the drug ring pattern graph (B, AM, S, FW) and data graph (B, A1, ..., Am, W)]

Complexity

Input: pattern Q and data graph G

Output: Q(G), the unique maximum match relation (it always exists)

• Subgraph isomorphism: intractable
• Graph simulation: O((|V| + |VQ|)(|E| + |EQ|))
• Bounded simulation: O(|V||E| + |EQ||V|² + |VQ||V|), cubic time; comparable to graph simulation, since Q is small in practice

Query driven approximation: use bounded simulation instead of subgraph isomorphism, to identify sensible matches that are computable in low PTIME. Criteria:

• lower complexity;
• effectiveness: the query answers are sensible.

Algorithm? See the reading list.

Bounded simulation vs. graph simulation

Graph simulation: a special case of bounded simulation

• The same bound 1 on all pattern edges (edge-to-edge mapping)
• Unique attributes vs. search conditions: label equality

Complexity: O((|VG| + |VQ|)(|EG| + |EQ|)) for graph simulation vs. O(|VG||EG| + |EQ||VG|² + |VQ||VG|) for bounded simulation.

Applications: process calculus, Web site classification, social position detection, …

Bounded simulation captures more sensible matches in social graphs (by 80%).

Homeomorphism and monomorphism

Graph homeomorphism: G = (V, E) matches Q = (VQ, EQ) via

• an injective function from VQ to V (a function rather than a relation);
• edges mapped to pairwise node-disjoint simple paths in G (constraints on paths).

Monomorphism revised: G = (V, E) matches Q = (VQ, EQ) via

• an injective function from VQ to V;
• edges mapped to nonempty paths in G.

Intractable, even when Q is a tree and G is a DAG.

The goal: strike a balance between expressive power and complexity.

Graph pattern matching:

• Incorporating edge relationships

Edge relationships

What is this pattern to find?

Edge types: S: supervise; C: co-author.

[Figure: data graph with nodes (Ann, CS), (Pat, DB), (John, DB), (Mat, DB), (Bill, Bio), (Tom, Bio), (Don, Gen) and edges labeled S or C; pattern with nodes DB, CS, Bio, Bio and edges labeled C and S+]

Edge relation

(Alice, Facebook)

(Alice, Sunita)

(Jose, Twitter)

(Jose, Sunita)

(Mikhail, Facebook)

(Mikhail, Twitter)

(Sunita, Facebook)

(Sunita, Alice)

(Sunita, Jose)

[Figure: graph over the nodes Alice, Sunita, Jose, Mikhail, Twitter, Facebook]

Graph encodings: Adding edge types

(Alice, fan-of, Facebook)

(Alice, friend-of, Sunita)

(Jose, fan-of, Twitter)

(Jose, friend-of, Sunita)

(Mikhail, fan-of, Facebook)

(Mikhail, fan-of, Twitter)

(Sunita, fan-of, Facebook)

(Sunita, friend-of, Alice)

(Sunita, friend-of, Jose)

Adding edge labels

[Figure: the same graph with edges labeled fan-of and friend-of]

Graph encodings: Adding weights

(Alice, fan-of, 0.5, Facebook)

(Alice, friend-of, 0.9, Sunita)

(Jose, fan-of, 0.5, Twitter)

(Jose, friend-of, 0.3, Sunita)

(Mikhail, fan-of, 0.8, Facebook)

(Mikhail, fan-of, 0.7, Twitter)

(Sunita, fan-of, 0.7, Facebook)

(Sunita, friend-of, 0.9, Alice)

(Sunita, friend-of, 0.3, Jose)
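The weighted, typed edge list above maps naturally onto an adjacency structure. A small sketch, assuming each edge is a (source, type, weight, target) tuple as listed; the helper name is illustrative.

```python
from collections import defaultdict

edges = [
    ("Alice", "fan-of", 0.5, "Facebook"),
    ("Alice", "friend-of", 0.9, "Sunita"),
    ("Jose", "fan-of", 0.5, "Twitter"),
    ("Jose", "friend-of", 0.3, "Sunita"),
    ("Mikhail", "fan-of", 0.8, "Facebook"),
    ("Mikhail", "fan-of", 0.7, "Twitter"),
    ("Sunita", "fan-of", 0.7, "Facebook"),
    ("Sunita", "friend-of", 0.9, "Alice"),
    ("Sunita", "friend-of", 0.3, "Jose"),
]


def build_adjacency(edge_list):
    """Map each source node to a list of (edge type, weight, target) entries."""
    adj = defaultdict(list)
    for src, etype, weight, dst in edge_list:
        adj[src].append((etype, weight, dst))
    return adj


adj = build_adjacency(edges)
# friends of Sunita, regardless of weight
friends_of_sunita = sorted(dst for etype, _, dst in adj["Sunita"] if etype == "friend-of")
# pages that Mikhail is a fan of with weight at least 0.75
strong_fan = [dst for etype, w, dst in adj["Mikhail"] if etype == "fan-of" and w >= 0.75]
```

Dropping the weight or the type from each tuple gives back the simpler encodings shown earlier.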

[Figure: the same graph with edge labels (fan-of, friend-of) and edge weights]

Even further, you can add weights and others.

Regular patterns

Pattern: Q = (VQ, EQ, fv, fe)

• fv(u): a conjunction of atoms A op a, where op is one of <, ≤, =, ≠, >, ≥
• fe(u, u'): a regular expression of the form F ::= c | c^k | c+ | FF, where c is an edge color; c^k is bounded and c+ is unbounded

Edges are mapped to paths satisfying the associated regular expressions.

[Figure: pattern with nodes DB, CS, Bio, Bio and edges labeled C and S+]

Simple regular expressions:

• fairly common;
• optimizing patterns (checking containment in linear time);
• low complexity in matching.
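Whether the sequence of edge colors along a path satisfies such an expression can be checked with an ordinary regular expression engine. The sketch below rests on assumptions that the slides do not spell out: an expression is encoded as a list of (color, bound) atoms, and c^k is read as "between 1 and k consecutive c-edges".

```python
import re


def compile_path_expr(atoms):
    """atoms: list of (color, bound); bound is 1 for c, an int k for c^k, '+' for c+."""
    parts = []
    for color, bound in atoms:
        c = f"(?:{re.escape(color)})"
        parts.append(f"{c}+" if bound == "+" else f"{c}{{1,{bound}}}")
    return re.compile("".join(parts))


def path_satisfies(colors, atoms):
    """colors: list of single-character edge colors along a path, in order."""
    return compile_path_expr(atoms).fullmatch("".join(colors)) is not None


# Hypothetical expression: up to two C edges followed by one or more S edges
expr = [("C", 2), ("S", "+")]
ok_short = path_satisfies(["C", "S", "S"], expr)
ok_long = path_satisfies(["C", "C", "S", "S", "S"], expr)
too_many_c = path_satisfies(["C", "C", "C", "S"], expr)
```

Single-character colors keep the concatenated string unambiguous; longer color names would need a separator.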

Complexity

Input: pattern Q and data graph G

Output: Q(G)

Complexity: O(|V||E| + m|EQ||V|² + |VQ||V|), where m is the number of distinct colors in Q.

Bounded simulation is a special case: a single color c (hence m = 1), with fe(u, u') = c.

Adding edge colors does not incur extra complexity.

What about general regular expressions?

Graph pattern matching:

• Capturing graph topology


Limitations of graph simulation

What is wrong?

• A disconnected graph matches a connected pattern.
• The yellow node in the pattern has 3 "parents", in contrast to 1 in the data graph.
• An undirected cycle matches a tree.

Simulation does not preserve the topology in matching.

[Figure: pattern and data graph]

Limitations of graph simulation

• A cycle with two nodes matches a cycle of unbounded length.
• The match relation may be excessively large.
• When social distances increase, the closeness of relationships decreases.

Hence the need for revising simulation to enforce locality.

[Figure: pattern and data graph]

Dual simulation

G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via dual simulation if there exists a binary relation S ⊆ VQ × V such that S is a total mapping, satisfies the search conditions, and preserves both "child" and "parent" relationships: for each (u, v) ∈ S,

• each edge (u, u') in EQ is mapped to an edge (v, v') in G with (u', v') ∈ S;
• each edge (u', u) in EQ is mapped to an edge (v', v) in G with (u', v') ∈ S.

Dual simulation preserves "parent" relationships and connectivity.

Q(G): a unique maximum match relation.
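Dual simulation can be computed by a fixpoint that checks both directions of every pattern edge. This is a naive sketch under assumptions: label equality stands in for the search conditions, and graphs are a label dict plus a list of directed edge pairs.

```python
def dual_simulation(q_labels, q_edges, g_labels, g_edges):
    """Return the maximum dual simulation as {u: set of matches}, empty sets if none."""
    succ = {w: set() for w in g_labels}
    pred = {w: set() for w in g_labels}
    for a, b in g_edges:
        succ[a].add(b)
        pred[b].add(a)
    sim = {u: {w for w in g_labels if g_labels[w] == q_labels[u]}
           for u in q_labels}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # child condition: each match of u needs a successor matching u2
            for v in list(sim[u]):
                if not (succ[v] & sim[u2]):
                    sim[u].discard(v)
                    changed = True
            # parent condition: each match of u2 needs a predecessor matching u
            for v2 in list(sim[u2]):
                if not (pred[v2] & sim[u]):
                    sim[u2].discard(v2)
                    changed = True
    if any(not s for s in sim.values()):   # S must be a total mapping
        return {u: set() for u in sim}
    return sim


# Hypothetical example: node 3 (B) has no A parent
dual = dual_simulation({"a": "A", "b": "B"}, [("a", "b")],
                       {1: "A", 2: "B", 3: "B"}, [(1, 2)])
```

Plain graph simulation would keep node 3 as a match of b; the parent condition removes it.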

Locality

• Diameter dQ: the maximum shortest distance between nodes of Q (via undirected paths).
• dQ-radius subgraph G[v, dQ]: the subgraph of G centered at v, containing the nodes within dQ hops of v.
• Locality: matches are contained in G[v, dQ] for some v.

[Figure: an excessive match around a node v]
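Both the diameter dQ and the subgraph G[v, dQ] come down to breadth-first search over undirected distances. The sketch below assumes that Q is connected, that hops in G are also counted along undirected paths, and that G[v, d] keeps all edges of G between the selected nodes; the function names are illustrative.

```python
from collections import deque


def undirected_adj(nodes, edges):
    adj = {v: set() for v in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hops_from(adj, v):
    """Shortest undirected distance from v to every reachable node."""
    dist, queue = {v: 0}, deque([v])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def diameter(nodes, edges):
    """Maximum shortest undirected distance between nodes (connected graph assumed)."""
    adj = undirected_adj(nodes, edges)
    return max(max(hops_from(adj, v).values()) for v in nodes)


def ball(nodes, edges, v, d):
    """G[v, d]: nodes within d hops of v, plus the edges of G among them."""
    adj = undirected_adj(nodes, edges)
    inside = {x for x, h in hops_from(adj, v).items() if h <= d}
    return inside, {(a, b) for a, b in edges if a in inside and b in inside}


# Hypothetical example data
d_q = diameter(["a", "b", "c"], [("a", "b"), ("c", "b")])
ball_nodes, ball_edges = ball([1, 2, 3, 4, 5], [(1, 2), (2, 3), (3, 4), (4, 5)], 3, 1)
```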

Strong simulation

G matches Q via strong simulation if there exists a node v in G such that G[v, dQ] matches Q via dual simulation. It combines duality and locality.

Match: the subgraph GS of G[v, dQ] representing the maximum match S:

• for each (u, v) in the maximum match S, v is in GS;
• for each edge (u, u') in Q and each (u, v) in S, an edge (v, v') is in GS if (u', v') ∈ S.

Matching: given Q and G, find the set Q(G) of all matches.

Preserving the topology of patterns

Strong simulation preserves:

• child and parent relationships;
• connectivity: if Q is connected (via undirected paths), so is GS;
• cycles: a directed (resp. undirected) cycle in Q matches a directed (resp. undirected) cycle in GS;
• bounded matches: the diameter of GS is at most 2 · dQ, and |M(Q, G)| ≤ |V|.

What about graph simulation?

Strong simulation vs. graph simulation

Input: pattern Q and data graph G

Output: Q(G)

Complexity of strong simulation: O(|V|(|V| + (|VQ| + |EQ|)(|V| + |E|))), cubic time.

Hierarchy: G matches Q via subgraph isomorphism ⇒ via strong simulation ⇒ via dual simulation ⇒ via graph simulation.

• Graph simulation does not preserve parents, connectivity, undirected cycles, or bounded matches.
• Dual simulation preserves topology, but not bounded matches.

Strong simulation: a balance between the complexity and the ability to preserve topology.

Making strong simulation stronger?

• Bounded cycles: if G matches Q, then the longest simple cycle in G is no longer than its counterpart in Q.
• Bisimulation instead of simulation: find all subgraphs that are bisimilar to a pattern. For each (u, v) ∈ S,
  – each edge (u, u') in EQ is mapped to an edge (v, v') in GS with (u', v') ∈ S;
  – each edge (v, v') in GS is mapped to an edge (u, u') in EQ with (u', v') ∈ S.

Both extensions take matching from PTIME to intractable.

Summing up

Various notions for graph pattern matching

Query driven approximation: from subgraph isomorphism (intractable) to strong simulation or bounded simulation (cubic time)

matching                complexity        |M(Q, G)|
subgraph isomorphism    NP-complete       |V|^|VQ|
graph simulation        quadratic time    |V||VQ|
bounded simulation      cubic time        |V||VQ|
regular matching        cubic time        |V||VQ|
strong simulation       cubic time        |V|

Summary

Graph pattern matching:

– Subgraph isomorphism
– Graph simulation
– Bounded simulation
– Regular matching
– Strong simulation
– . . .

Querying both topology and data content:

• What query language should we use for social data analysis?
• Strike a balance between expressivity and complexity.

A uniform framework for these.

The study has raised as many questions as it has answered.

Reading: W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT 2012. (survey of graph pattern matching)

Summary and review

• What is subgraph isomorphism? Complexity? Algorithm? Name a few applications.
• What is graph simulation? Complexity? Understand its algorithm. Name a few applications.
• Why do we need to revise conventional graph pattern matching for social network analysis? How should we do it? Why?
• Understand bounded simulation. Read its algorithm. Complexity?
• What is strong simulation? Complexity? Name a few applications in which strong simulation is useful.
• Find other revisions of conventional graph pattern matching that are not covered in the lecture.

Project (1)

A development project. Recall bounded graph simulation.

• Implement an algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via bounded simulation.
• Develop optimization strategies.
• Experimentally evaluate your algorithm, especially its scalability with the size of G.
• Write a survey on revisions of conventional graph simulation, as related work.

Project (2)

A research and development project. Recall graph simulation.

• Develop a MapReduce algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via graph simulation.
• Develop optimization strategies.
• Experimentally evaluate your algorithm, especially its scalability with the size of G.
• Write a survey on revisions of conventional graph simulation, as part of the related work.

Project (3)

A development project. Recall subgraph isomorphism.

• Develop two algorithms that, given a pattern Q and a graph G, compute the maximum match of Q in G via subgraph isomorphism, in
  – MapReduce (see Lecture 4)
  – BSP (see Lecture 5)
• Develop optimization strategies to reduce parallel computational cost and data shipment cost.
• Experimentally evaluate your algorithms, especially their scalability with the size of G.
• Write a survey on parallel algorithms for subgraph isomorphism.

Papers for you to review

• M. R. Henzinger, T. Henzinger, and P. Kopke. Computing simulations on finite and infinite graphs. FOCS, 1995. http://infoscience.epfl.ch/record/99332/files/HenzingerHK95.pdf

• L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs. IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004. (search Google Scholar)

• A. Fard, M. U. Nisar, J. A. Miller, and L. Ramaswamy. Distributed and scalable graph pattern matching: models and algorithms. Int. J. Big Data. http://cobweb.cs.uga.edu/~ar/papers/IJBD_final.pdf

• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph pattern matching: From intractable to polynomial time. VLDB, 2010.

• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries. ICDE, 2011.

• S. Ma, Y. Cao, W. Fan, J. Huai, and T. Wo. Strong simulation: Capturing topology in graph pattern matching. TODS 39(1): 4, 2014.