QSX: Querying Social Graphs
Graph Pattern Matching
Graph pattern matching via subgraph isomorphism
Graph pattern matching via graph simulation
Revisions of graph simulation for social network analysis
The need for studying graph pattern matching
Prevalent use in traditional and emerging applications
Applications
• pattern recognition
• knowledge discovery
• intelligence analysis
• transportation network analysis
• Web site classification
• social position and community detection
• social media marketing
• knowledge fusion
• . . .
Social Graphs
Directed graph G = (V, E, fA), with attributes fA(u) on each node u
Simplification: node labels
Assume fA(u) has a unique attribute: label
[Figure: a social graph whose nodes carry the labels Gen, Med, Soc, AI, Chem, DB, Eco]
Subgraph isomorphism
A function f from the nodes of Q to the nodes of G such that
• for each node u in Q, u and f(u) have the same label;
• there exists an edge (u, u') in Q if and only if there exists an edge (f(u), f(u')) in G
A bijection between Q and the matched subgraph of G: identical label matching, edge-to-edge relations
[Figure: pattern Q with nodes labeled A, B, D, E; data graph G with nodes labeled A, B, D, E, including two B nodes v1 and v2]
Matching by subgraph isomorphism
Input: A directed graph G, and a graph pattern Q
Output: all subgraphs of G that are isomorphic to Q
Intractable
Complexity
• NP-complete
• Exponentially many matches
• Remains NP-hard even when
 • Q is a forest and G is a tree
 • Q is acyclic and G is a tree
• PTIME if Q is a tree and G is a forest
The lower bound is rather robust
Algorithms for computing subgraph isomorphism
Input: pattern Q and graph G
Output: all isomorphic mappings P from Q to G
Match(P)
• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P
• for each pair p = (u, v) in S(P)
 • if p passes feasibility check
 • then P' ← P ∪ {p}; call Match(P');
• restore data structures
P: partial mappings, initially empty
S(P): nodes that are directly connected to those already in P, with the same labels
Recursion, refinement: for each pair p = (u, v) in S(P), enumerate all possible extensions; if the feasibility test is not successful, drop the pair and try the next
The feasibility check guarantees correctness
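The Match(P) recursion above can be sketched in a few lines of Python. This is a minimal illustration, not VF2 or Ullmann's algorithm: the function name `subgraph_isomorphisms` and the dict/set graph encoding are assumptions made for the example, and the candidate set is simply every node of G rather than only nodes adjacent to the current partial mapping.

```python
def subgraph_isomorphisms(q_nodes, q_edges, g_nodes, g_edges):
    """Enumerate mappings f from Q to G that preserve labels and satisfy
    (u, u') in Q  iff  (f(u), f(u')) in G.

    q_nodes / g_nodes: dict node -> label (hypothetical encoding).
    q_edges / g_edges: set of directed (src, dst) pairs.
    """
    order = list(q_nodes)      # fixed order in which pattern nodes are mapped
    results = []

    def feasible(p, u, v):
        # Same label, and v not used yet (the mapping must be injective).
        if q_nodes[u] != g_nodes[v] or v in p.values():
            return False
        # Self-loops must agree.
        if ((u, u) in q_edges) != ((v, v) in g_edges):
            return False
        # Edges to and from already mapped nodes must agree in both directions.
        for u2, v2 in p.items():
            if ((u, u2) in q_edges) != ((v, v2) in g_edges):
                return False
            if ((u2, u) in q_edges) != ((v2, v) in g_edges):
                return False
        return True

    def match(p):
        if len(p) == len(order):       # P covers all nodes in Q
            results.append(dict(p))
            return
        u = order[len(p)]
        for v in g_nodes:              # simplified candidate set S(P)
            if feasible(p, u, v):
                p[u] = v               # P' = P ∪ {(u, v)}
                match(p)
                del p[u]               # restore data structures

    match({})
    return results


# Toy example: pattern A -> B against a graph with two B children of a.
q_nodes = {"u1": "A", "u2": "B"}
q_edges = {("u1", "u2")}
g_nodes = {"a": "A", "b1": "B", "b2": "B", "d": "D"}
g_edges = {("a", "b1"), ("a", "b2"), ("b1", "d")}
matches = subgraph_isomorphisms(q_nodes, q_edges, g_nodes, g_edges)
```

The feasibility test here checks every mapped pair on each extension, which keeps the sketch short but gives none of the pruning that look-ahead rules provide.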
VF2
VF2: a popular algorithm for subgraph isomorphism
Match(P)
• if P covers all nodes in Q then output P;
• else compute the set S(P) of all candidate pairs for inclusion in P
• for each pair p = (u, v) in S(P)
 • if p passes feasibility check
 • then P' ← P ∪ {p}; call Match(P');
• restore data structures
Feasibility check: five k-look-ahead rules, to make sure that P is a partial isomorphic mapping
Feasibility rules: for each pair (u, v) in P
• their predecessors are already mapped and included in P
• their successors can possibly be mapped
• certain conditions on the cardinalities of predecessors and successors ensure correctness and expandability
Guarantee correctness and reduce backtracking
L. P. Cordella, P. Foggia, C. Sansone, M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004
Ullmann's algorithm
An algorithm that is still being used
Backtrack(P)
• if P covers all nodes in Q then output P and return;
• for each node u in Q that is not yet in P
 • find a node v in G; p ← (u, v); P' ← P ∪ {p};
 • if P' makes a partial mapping (injective function, preserving edges)
 • then call Backtrack(P');
Use adjacency matrices of G and Q, their transposes, and a form of permutation matrices
Expanding permutation matrices representing P
For each candidate pair p = (u, v): enumerate all possible extensions, for refinement
Backtracking: no matter whether the test is successful or not, go back to the previous level and try another p
J. R. Ullmann. An Algorithm for Subgraph Isomorphism. JACM 1976
Graph Simulation
A relation: identical label matching, edge-to-edge mapping
A binary relation R on the nodes of Q and the nodes of G such that
• for each node u in Q, there exists a node v in G such that (u, v) is in R, and u and v have the same label;
• for each pair (u, v) in R, if there exists an edge (u, u') in Q, then there exists an edge (v, v') in G such that (u', v') is in R
Relations as opposed to functions
[Figure: pattern Q with nodes labeled A, B, D, E; data graph G with nodes labeled A, B, D, E, including two B nodes v1 and v2]
Matching by graph simulation
Input: A directed graph G, and a graph pattern Q
Output: the maximum simulation relation R
Use relations instead of functions
Maximum simulation relation: always exists and is unique
• If a match relation exists, then there exists a maximum one
• Otherwise, it is the empty set, which is still maximum
Complexity: O((|V| + |VQ|)(|E| + |EQ|)), quadratic time
The output is a unique relation, possibly of size |Q||V|
Data locality
Given a pattern Q, a graph G and a node v in G, can we decide whether v
matches some node in Q by inspecting only nodes within d hops of v, where d
is determined by Q only?
d: the diameter of Q
With data locality, we only need to inspect the d-neighborhood of v
Subgraph isomorphism has data locality
Graph simulation does not have data locality: it is a recursive computation
Algorithm for computing graph simulation
Input: pattern Q and graph G
Output: for each u in Q, sim(u): the matches w in G
Similarity(P)
• for all nodes u in Q do
 • sim(u) ← the set of candidate matches w in G;
• while there exist an edge (u, v) in Q and w in sim(u) (in G) that violate the simulation condition
 • sim(u) ← sim(u) \ {w};
• output sim(u) for all u in Q
Candidate matches: nodes w with the same label as u; moreover, if u has an outgoing edge, so does w
Violation of the simulation condition: successor(w) ∩ sim(v) = ∅
• There exists an edge from u to v in Q, but the candidate w of u has no corresponding edge to a node w' that matches v
Refinement: correct, but not in quadratic time
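The refinement loop above can be sketched directly in Python. This is an illustrative, unoptimized fixpoint under an assumed encoding (label dicts and lists of directed edge pairs); the name `naive_simulation` is hypothetical. Following the slides, the result is the empty relation when some pattern node has no match.

```python
def naive_simulation(q_label, q_edges, g_label, g_edges):
    """Maximum graph simulation by repeated refinement (not quadratic time).

    q_label / g_label: dict node -> label; q_edges / g_edges: list of (src, dst).
    Returns sim: pattern node -> set of matching graph nodes.
    """
    succ = {w: set() for w in g_label}
    for a, b in g_edges:
        succ[a].add(b)
    q_has_out = {u for u, _ in q_edges}

    # Candidates: same label; if u has an outgoing edge, so must w.
    sim = {
        u: {w for w in g_label
            if g_label[w] == q_label[u] and (u not in q_has_out or succ[w])}
        for u in q_label
    }

    changed = True
    while changed:
        changed = False
        for u, v in q_edges:
            for w in list(sim[u]):
                # Violation: successor(w) ∩ sim(v) is empty.
                if not (succ[w] & sim[v]):
                    sim[u].discard(w)
                    changed = True

    # If some pattern node has no match, the maximum relation is empty.
    if any(not s for s in sim.values()):
        return {u: set() for u in q_label}
    return sim


# Toy example: pattern A -> B -> D.
q_label = {"A": "A", "B": "B", "D": "D"}
q_edges = [("A", "B"), ("B", "D")]
g_label = {"a1": "A", "b1": "B", "d1": "D", "a2": "A", "b2": "B"}
g_edges = [("a1", "b1"), ("b1", "d1"), ("a2", "b2"), ("b2", "a2")]
sim = naive_simulation(q_label, q_edges, g_label, g_edges)
```

In the example, b2 is dropped because it has no D successor, and that removal then propagates to a2.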
Speedup
For each node u in pattern Q, maintain prevsim(u), a superset of sim(u)
• the nodes once considered as candidate matches of u
• for each edge (u, v) in Q and each w in sim(u), successor(w) ∩ prevsim(v) ≠ ∅
• terminate if prevsim(u) = sim(u) for all nodes u in Q: sim can't be refined further
prevsim(u) \ sim(u): invalid candidates
• Each node in prevsim(u) \ sim(u) is looked up only once
If successor(w) ∩ prevsim(v) = ∅
• w should be removed from sim(u); u: a predecessor of v
• Propagate violations upward
Once w is removed, it is never put back
Algorithm
Similarity(P)
• for all nodes v in Q do
 • sim(v) ← the set of candidate matches in G;
 • prevsim(v) ← the set of all the nodes in G;
• while there exists a node v in Q such that sim(v) ≠ prevsim(v)
 • remove ← predecessor(prevsim(v)) \ predecessor(sim(v));
 • for all u in predecessor(v) do
  • sim(u) ← sim(u) \ remove;
 • prevsim(v) ← sim(v);
• output sim(v) for all v in Q
Candidate matches: nodes with the same label; moreover, if u has an outgoing edge, so does w
Refinement: remove is maintained dynamically and propagated up to the predecessors of v
For each w ∈ prevsim(v) \ sim(v), w is checked only once, hence |VQ||V| in total
Can be implemented in O((|V| + |VQ|)(|E| + |EQ|)) time
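The prevsim-based refinement can be sketched as follows. This is a plain set-based illustration of the loop on the slide, not an implementation that achieves the stated time bound (that requires the counting and indexing tricks of Henzinger et al.). The name `refined_simulation` and the graph encoding are assumptions, and the sketch snapshots sim(v) before updating so that a self-loop on v in Q is handled safely.

```python
def refined_simulation(q_label, q_edges, g_label, g_edges):
    """Graph simulation driven by prevsim(v), a superset of sim(v)."""
    g_pred = {w: set() for w in g_label}
    g_succ = {w: set() for w in g_label}
    for a, b in g_edges:
        g_succ[a].add(b)
        g_pred[b].add(a)
    q_pred = {v: set() for v in q_label}
    q_has_out = set()
    for u, v in q_edges:
        q_pred[v].add(u)
        q_has_out.add(u)

    def pre(nodes):
        # All graph nodes with an edge into some node of `nodes`.
        return set().union(*(g_pred[w] for w in nodes))

    sim = {
        v: {w for w in g_label
            if g_label[w] == q_label[v] and (v not in q_has_out or g_succ[w])}
        for v in q_label
    }
    prevsim = {v: set(g_label) for v in q_label}

    while True:
        v = next((x for x in q_label if sim[x] != prevsim[x]), None)
        if v is None:
            break                      # prevsim = sim everywhere: done
        current = set(sim[v])          # snapshot, in case v is its own predecessor
        # Nodes that could reach prevsim(v) but can no longer reach sim(v).
        remove = pre(prevsim[v]) - pre(current)
        for u in q_pred[v]:
            sim[u] -= remove           # propagate the violation upward
        prevsim[v] = current

    if any(not s for s in sim.values()):
        return {v: set() for v in q_label}
    return sim


q_label = {"A": "A", "B": "B", "D": "D"}
q_edges = [("A", "B"), ("B", "D")]
g_label = {"a1": "A", "b1": "B", "d1": "D", "a2": "A", "b2": "B"}
g_edges = [("a1", "b1"), ("b1", "d1"), ("a2", "b2"), ("b2", "a2")]
sim = refined_simulation(q_label, q_edges, g_label, g_edges)
```

Each iteration only inspects the nodes that left sim(v) since the last visit, which is the source of the speedup over re-scanning all candidates.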
Graph pattern matching: The conventional
Input: a query Q and a data graph G
Output: all the matches of Q in G
• subgraph isomorphism
 a bijective function f on nodes: (u, u') ∈ Q iff (f(u), f(u')) ∈ G
• graph simulation
 a binary relation S on nodes: for each (u, v) ∈ S, each edge (u, u') in Q is mapped to an edge (v, v') in G, such that (u', v') ∈ S
Find all matches of a pattern in a graph
Can we use the conventional notions for social network analysis?
Example query: graph pattern matching
Identify suspects in a drug ring
“Understanding the structure of drug trafficking organizations”
[Figure: pattern graph with nodes B, AM, S, FW and edge bounds 3 and 1; data graph with nodes B, A1, ..., Am and several W nodes]
Pattern matching in social graphs
Neither subgraph isomorphism nor graph simulation works
• not allowed by a bijection: use a relation instead of a function
• edges to paths
For both scalability and effectiveness
[Figure: the drug-ring pattern graph (nodes B, AM, S, FW; edge bounds 3 and 1) and data graph (nodes B, A1, ..., Am and several W nodes)]
Social graphs: modeling attributes
Directed graph G = (V, E, fA), with attributes fA(u): a tuple (A1 = a1, ..., An = an)
• label, keywords, blogs, comments, rating …
[Figure: a social graph whose nodes (Gen, Med, Soc, AI, Chem, DB, Eco) carry attribute tuples such as ('dept'=CS, 'field'=AI), ('dept'=CS, 'field'=DB), ('dept'=Bio, 'field'=Gen), ('dept'=Bio, 'field'=Eco)]
Bounded patterns
Pattern graph: Q = (VQ, EQ, fv, fe)
• fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥ (search condition), e.g., fv(u): 'dept'=CS
• fe(u, u'): a constant k (bounded: within k hops) or a symbol ∗ (unbounded)
Incorporating search conditions and bounds on the number of hops
[Figure: pattern with nodes CS, Bio, Soc, Med and edge bounds 2, 3 and ∗]
Bounded Simulation
G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via bounded simulation, if there exists a binary relation S ⊆ VQ × V such that S is a total mapping, satisfies search conditions and bounds on edge-to-path mappings:
• for each u ∈ VQ, there exists v ∈ V such that (u, v) ∈ S
• for each (u, v) ∈ S,
 • the attributes fA(v) satisfy the predicate fv(u)
 • each (u, u') in EQ is mapped to a path from v to v' in G whose length is within the bound fe(u, u'), with (u', v') ∈ S
Mapping edges to bounded paths
[Figure: the pattern (nodes CS, Bio, Soc, Med; edge bounds 2, 3, ∗) matched against the social graph (nodes CS, DB, Soc, Med, Gen, Eco, AI, Chem) via the relation S]
There exists a unique maximum match
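To make the edge-to-path condition concrete, here is a minimal sketch of bounded simulation as a naive fixpoint. It is not the cubic-time algorithm from the reading list. The encoding is assumed for illustration: search conditions are Python predicates over an attribute dict, a bound is an integer k or `None` for ∗, and paths are taken to be nonempty with length at most k.

```python
from collections import deque


def nonempty_distances(succ, src):
    """Length of the shortest nonempty path from src to each reachable node."""
    dist = {}
    queue = deque((n, 1) for n in succ[src])
    while queue:
        node, d = queue.popleft()
        if node in dist:
            continue
        dist[node] = d
        queue.extend((n, d + 1) for n in succ[node])
    return dist


def bounded_simulation(q_cond, q_bounds, g_attrs, g_edges):
    """q_cond: pattern node -> predicate on attributes (plays the role of fv).
    q_bounds: dict (u, u2) -> int k, or None for '*' (plays the role of fe).
    g_attrs: graph node -> attribute dict; g_edges: list of (src, dst)."""
    succ = {w: [] for w in g_attrs}
    for a, b in g_edges:
        succ[a].append(b)
    dist = {w: nonempty_distances(succ, w) for w in g_attrs}

    # Candidates: nodes whose attributes satisfy the search condition.
    sim = {u: {w for w, attrs in g_attrs.items() if cond(attrs)}
           for u, cond in q_cond.items()}

    def within(w, w2, k):
        d = dist[w].get(w2)
        return d is not None and (k is None or d <= k)

    changed = True
    while changed:
        changed = False
        for (u, u2), k in q_bounds.items():
            for w in list(sim[u]):
                # w needs some match of u2 reachable within the bound.
                if not any(within(w, w2, k) for w2 in sim[u2]):
                    sim[u].discard(w)
                    changed = True

    if any(not s for s in sim.values()):
        return {u: set() for u in q_cond}
    return sim


# Pattern: a CS node that reaches a Bio node within 2 hops.
q_cond = {"cs": lambda a: a["dept"] == "CS", "bio": lambda a: a["dept"] == "Bio"}
g_attrs = {"c1": {"dept": "CS"}, "x": {"dept": "Med"}, "b1": {"dept": "Bio"},
           "c2": {"dept": "CS"}, "y": {"dept": "Med"}, "z": {"dept": "Med"},
           "b2": {"dept": "Bio"}}
g_edges = [("c1", "x"), ("x", "b1"), ("c2", "y"), ("y", "z"), ("z", "b2")]
bounded = bounded_simulation(q_cond, {("cs", "bio"): 2}, g_attrs, g_edges)
unbounded = bounded_simulation(q_cond, {("cs", "bio"): None}, g_attrs, g_edges)
```

With bound 2 only c1 matches the CS node (c2 is three hops from b2); with ∗ both do.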
Bounded simulation in social graphs
The set of all suspects involved in a drug ring
• edges to paths
• relation instead of function
[Figure: the drug-ring pattern graph (nodes B, AM, S, FW; edge bounds 3 and 1) and its matches in the data graph (nodes B, A1, ..., Am and several W nodes)]
Complexity
Input: Pattern Q and data graph G
Output: Q(G), the unique maximum match relation
Bounded simulation: O(|V||E| + |EQ||V|² + |VQ||V|), cubic time
Graph simulation: O((|V| + |VQ|)(|E| + |EQ|))
Subgraph isomorphism: intractable
The bounds for bounded simulation and graph simulation are comparable: Q is small in practice
Query driven approximation: use bounded simulation instead of subgraph isomorphism, to identify sensible matches and be computable in low PTIME. Criteria:
• Lower complexity
• Effectiveness: the query answers are sensible, and always exist
Algorithm? The reading list
Bounded simulation vs. graph simulation
Graph simulation: a special case of bounded simulation
• The same bound 1 on all pattern edges (edge-to-edge mapping)
• Unique attributes vs. search conditions: label equality
O((|VG| + |VQ|)(|EG| + |EQ|)) vs. O(|VG||EG| + |EQ||VG|² + |VQ||VG|)
Applications: process calculus, Web site classification, social position detection, …
Capture more sensible matches in social graphs (by 80%)
Homeomorphism and monomorphism
Graph homeomorphism: G = (V, E) matches Q = (VQ, EQ)
• an injective function from VQ to V (function rather than relation)
• edges to pairwise node-disjoint simple paths in G (constraints on paths)
Monomorphism revised: G = (V, E) matches Q = (VQ, EQ)
• an injective function from VQ to V
• edges to nonempty paths in G
Intractable, even when Q is a tree and G is a DAG
Strike a balance between expressive power and complexity
Edge relationships
What is this pattern to find?
Edge types: S: supervise; C: co-author
[Figure: data graph with nodes (Ann, CS), (Pat, DB), (John, DB), (Mat, DB), (Bill, Bio), (Don, Gen), (Tom, Bio), connected by S and C edges; pattern with nodes DB, CS, Bio, Bio and edges labeled C, C and S+]
Edge relation
(Alice, Facebook)
(Alice, Sunita)
(Jose, Twitter)
(Jose, Sunita)
(Mikhail, Facebook)
(Mikhail, Twitter)
(Sunita, Facebook)
(Sunita, Alice)
(Sunita, Jose)
[Figure: the graph over Alice, Sunita, Jose, Mikhail, Facebook and Twitter encoded by the pairs above]
Graph encodings: Adding edge types
(Alice, fan-of, Facebook)
(Alice, friend-of, Sunita)
(Jose, fan-of, Twitter)
(Jose, friend-of, Sunita)
(Mikhail, fan-of, Facebook)
(Mikhail, fan-of, Twitter)
(Sunita, fan-of, Facebook)
(Sunita, friend-of, Alice)
(Sunita, friend-of, Jose)
Adding edge labels
[Figure: the same graph over Alice, Sunita, Jose, Mikhail, Facebook and Twitter, with each edge labeled fan-of or friend-of]
Graph encodings: Adding weights
(Alice, fan-of, 0.5, Facebook)
(Alice, friend-of, 0.9, Sunita)
(Jose, fan-of, 0.5, Twitter)
(Jose, friend-of, 0.3, Sunita)
(Mikhail, fan-of, 0.8, Facebook)
(Mikhail, fan-of, 0.7, Twitter)
(Sunita, fan-of, 0.7, Facebook)
(Sunita, friend-of, 0.9, Alice)
(Sunita, friend-of, 0.3, Jose)
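As a small illustration of this encoding, the tuples above can be loaded into an adjacency structure and filtered by edge type and weight. The helper name `neighbors` and its parameters are hypothetical; only the tuples come from the slide.

```python
from collections import defaultdict

# (source, edge type, weight, target), copied from the list above.
edges = [
    ("Alice", "fan-of", 0.5, "Facebook"),
    ("Alice", "friend-of", 0.9, "Sunita"),
    ("Jose", "fan-of", 0.5, "Twitter"),
    ("Jose", "friend-of", 0.3, "Sunita"),
    ("Mikhail", "fan-of", 0.8, "Facebook"),
    ("Mikhail", "fan-of", 0.7, "Twitter"),
    ("Sunita", "fan-of", 0.7, "Facebook"),
    ("Sunita", "friend-of", 0.9, "Alice"),
    ("Sunita", "friend-of", 0.3, "Jose"),
]

# Adjacency list keyed by source node.
out = defaultdict(list)
for src, etype, weight, dst in edges:
    out[src].append((etype, weight, dst))


def neighbors(node, etype=None, min_weight=0.0):
    """Targets of node's outgoing edges, optionally filtered by type and weight."""
    return [dst for t, w, dst in out[node]
            if (etype is None or t == etype) and w >= min_weight]


close_friends = neighbors("Sunita", "friend-of", min_weight=0.5)
```

Dropping the weight, or the weight and the type, recovers the two simpler encodings shown earlier.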
Even further, you can add weights and others
[Figure: the same graph over Alice, Sunita, Jose, Mikhail, Facebook and Twitter, with edge labels (fan-of, friend-of) and weights]
Regular patterns
Pattern: Q = (VQ, EQ, fv, fe)
• fv(u): a conjunction of A op a, op in <, ≤, =, ≠, >, ≥
• fe(u, u'): a regular expression of the form F ::= c | c^k | c+ | FF
 • c^k: bounded
 • c+: unbounded
Mapping edges to paths satisfying associated regular expressions
Simple regular expressions:
• fairly common
• optimizing patterns (checking containment in linear time)
• low complexity in matching
[Figure: pattern with nodes DB, CS, Bio, Bio and edges labeled C, C and S+]
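Checking whether the edge colors along one path satisfy such an expression can be sketched with Python's `re` module. The encoding is an assumption made for illustration: an expression F is a list of (color, bound) atoms, where bound 1 stands for c, an integer k for c^k (read here as one to k consecutive c-edges, which is an assumed interpretation of "bounded"), and `None` for c+. Colors are assumed not to contain the ";" separator.

```python
import re


def compile_edge_regex(atoms):
    """Translate a list of (color, bound) atoms into a compiled regex.

    Each edge color on a path is rendered as 'color;' so that multi-character
    colors cannot be confused with one another.
    """
    parts = []
    for color, bound in atoms:
        token = "(?:" + re.escape(color) + ";)"
        parts.append(token + ("+" if bound is None else "{1,%d}" % bound))
    return re.compile("".join(parts))


def path_satisfies(path_colors, atoms):
    """True if the sequence of edge colors matches the expression exactly."""
    text = "".join(color + ";" for color in path_colors)
    return compile_edge_regex(atoms).fullmatch(text) is not None


# F = C S+ : one co-author edge followed by one or more supervise edges.
c_then_s_plus = [("C", 1), ("S", None)]
ok = path_satisfies(["C", "S", "S"], c_then_s_plus)
too_short = path_satisfies(["C"], c_then_s_plus)
```

This only tests a single given path; matching a pattern edge against a graph additionally requires searching for some path between the two matched nodes.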
Complexity
Input: Pattern Q and data graph G
Output: Q(G)
O(|V||E| + m|EQ||V|² + |VQ||V|), where m is the number of distinct colors in Q
Bounded simulation: a special case, with a single color c (hence m = 1) and fe(u, u') = c
Adding edge colors does not incur extra complexity
General regular expressions?
Limitations of graph simulation
What is wrong?
• A disconnected graph matches a connected pattern
• The yellow node in the pattern has 3 “parents”, in contrast to 1 in the data graph
• An undirected cycle matches a tree
Simulation does not preserve the topology in matching
[Figure: a pattern graph and the data graph it matches via graph simulation]
Limitations of graph simulation
• A cycle with two nodes matches a cycle of unbounded length
• The match relation may be excessively large
When social distances increase, the closeness of relationships decreases
The need for revising simulation to enforce locality
[Figure: a two-node cycle pattern graph and a long cycle that matches it]
Dual simulation
G = (V, E, fA) matches Q = (VQ, EQ, fv, fe) via dual simulation, if there exists a binary relation S ⊆ VQ × V such that S is a total mapping, satisfies search conditions, and preserves both “child” and “parent” relationships:
for each (u, v) ∈ S,
• each (u, u') in EQ is mapped to an edge (v, v') in G, with (u', v') ∈ S
• each (u', u) in EQ is mapped to an edge (v', v) in G, with (u', v') ∈ S
Preserve “parent” relationships and connectivity
Q(G): a unique maximum match relation
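The two conditions of dual simulation can be sketched as a fixpoint that prunes on both children and parents. As before, this is a naive illustration under an assumed encoding (label equality stands in for the search condition fv), and `dual_simulation` is a hypothetical name.

```python
def dual_simulation(q_label, q_edges, g_label, g_edges):
    """Maximum dual simulation: preserves child and parent relationships."""
    succ = {w: set() for w in g_label}
    pred = {w: set() for w in g_label}
    for a, b in g_edges:
        succ[a].add(b)
        pred[b].add(a)

    sim = {u: {w for w in g_label if g_label[w] == q_label[u]} for u in q_label}

    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            # Child condition: a match of u needs a successor matching u2.
            for w in list(sim[u]):
                if not (succ[w] & sim[u2]):
                    sim[u].discard(w)
                    changed = True
            # Parent condition: a match of u2 needs a predecessor matching u.
            for w2 in list(sim[u2]):
                if not (pred[w2] & sim[u]):
                    sim[u2].discard(w2)
                    changed = True

    if any(not s for s in sim.values()):
        return {u: set() for u in q_label}
    return sim


# Pattern: C has two parents, A and B.
q_label = {"A": "A", "B": "B", "C": "C"}
q_edges = [("A", "C"), ("B", "C")]
g_label = {"a1": "A", "c1": "C", "b1": "B", "c2": "C",
           "a2": "A", "b2": "B", "c3": "C"}
g_edges = [("a1", "c1"), ("b1", "c2"), ("a2", "c3"), ("b2", "c3")]
dual = dual_simulation(q_label, q_edges, g_label, g_edges)
```

Plain graph simulation would keep every node in this example, since each A and each B has a C child; the parent condition is what rules out c1 and c2, which have only one of the two required parents.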
Locality
Diameter dQ: the maximum shortest distance between nodes of Q (undirected paths)
dQ-radius subgraph G[v, dQ]: centered at v, within dQ hops
Locality: matches contained in G[v, dQ] for some v
[Figure: a ball centered at v, used to rule out an excessive match]
Strong simulation
G matches Q via strong simulation, if there exists a node v in G such that G[v, dQ] matches Q via dual simulation
• duality
• locality
Match: the subgraph GS of G[v, dQ] representing the maximum match S
• for each (u, w) in the maximum match S, w is in GS
• for each edge (u, u') in Q, an edge (w, w') of G[v, dQ] is in GS if (u, w) ∈ S and (u', w') ∈ S
Matching: given Q and G, find the set Q(G) of all matches
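Strong simulation as defined above combines the two ingredients: for every candidate center, build the ball of radius dQ and run dual simulation inside it. The sketch below follows the definition as stated on the slide; any further conditions in the cited paper are not modeled. It assumes Q is connected, and all function names and the graph encoding are hypothetical.

```python
from collections import deque


def undirected_adj(nodes, edges):
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hops_from(adj, src, limit=None):
    """BFS over undirected edges: node -> hop count, cut off at `limit`."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        n = queue.popleft()
        if limit is not None and dist[n] == limit:
            continue
        for m in adj[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist


def dual_simulation(q_label, q_edges, g_label, g_edges):
    succ = {w: set() for w in g_label}
    pred = {w: set() for w in g_label}
    for a, b in g_edges:
        succ[a].add(b)
        pred[b].add(a)
    sim = {u: {w for w in g_label if g_label[w] == q_label[u]} for u in q_label}
    changed = True
    while changed:
        changed = False
        for u, u2 in q_edges:
            for w in list(sim[u]):          # child condition
                if not (succ[w] & sim[u2]):
                    sim[u].discard(w)
                    changed = True
            for w2 in list(sim[u2]):        # parent condition
                if not (pred[w2] & sim[u]):
                    sim[u2].discard(w2)
                    changed = True
    if any(not s for s in sim.values()):
        return {u: set() for u in q_label}
    return sim


def strong_simulation(q_label, q_edges, g_label, g_edges):
    """Return center -> maximum dual-simulation match inside G[center, dQ]."""
    q_adj = undirected_adj(q_label, q_edges)
    # Diameter of Q over undirected paths (assumes Q is connected).
    d_q = max(max(hops_from(q_adj, u).values()) for u in q_label)
    g_adj = undirected_adj(g_label, g_edges)
    matches = {}
    for center in g_label:
        ball = set(hops_from(g_adj, center, d_q))
        ball_label = {w: g_label[w] for w in ball}
        ball_edges = [(a, b) for a, b in g_edges if a in ball and b in ball]
        sim = dual_simulation(q_label, q_edges, ball_label, ball_edges)
        if all(sim.values()):
            matches[center] = sim
    return matches


# Pattern: a two-node cycle A <-> B.  Graph: one 2-cycle and one 4-cycle.
q_label = {"A": "A", "B": "B"}
q_edges = [("A", "B"), ("B", "A")]
g_label = {"a1": "A", "b1": "B", "a2": "A", "b2": "B", "a3": "A", "b3": "B"}
g_edges = [("a1", "b1"), ("b1", "a1"),
           ("a2", "b2"), ("b2", "a3"), ("a3", "b3"), ("b3", "a2")]
strong = strong_simulation(q_label, q_edges, g_label, g_edges)
whole_graph = dual_simulation(q_label, q_edges, g_label, g_edges)
```

Dual simulation over the whole graph accepts the 4-cycle as well, whereas restricting to balls of radius dQ = 1 keeps only the 2-cycle.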
Preserving the topology of patterns
• child and parent relationships
• connectivity: if Q is connected (via undirected paths), so is GS
• cycles: a directed (resp. undirected) cycle in Q matches a directed (resp. undirected) cycle in GS
• bounded matches:
 • the diameter of GS is at most 2 * dQ
 • |M(Q, G)| ≤ |V|
What about graph simulation?
Strong simulation vs. graph simulation
Complexity of strong simulation
Input: Pattern Q and data graph G
Output: Q(G)
O(|V| (|V| + (|VQ| + |EQ|)(|V| + |E|))), cubic time
Hierarchy:
• G matches Q via subgraph isomorphism
• ⇒ G matches Q via strong simulation
• ⇒ G matches Q via dual simulation: preserves topology, but not bounded match
• ⇒ G matches Q via graph simulation: does not preserve parents, connectivity, undirected cycles, bounded match
A balance between the complexity and the ability to preserve topology
Making strong simulation stronger?
• Bounded cycles: if G matches Q, then the longest simple cycle in G is no longer than its counterpart in Q
• Bisimulation instead of simulation: find all subgraphs that are bisimilar to a pattern
 for each (u, v) ∈ S,
 • each (u, u') in EQ is mapped to an edge (v, v') in GS, with (u', v') ∈ S
 • each edge (v, v') in GS is mapped to an edge (u, u') in EQ, with (u', v') ∈ S
Both extensions take matching from PTIME to intractable
Various notions for graph pattern matching
Query driven approximation: from subgraph isomorphism (intractable) to strong simulation or bounded simulation (cubic-time)

matching               complexity       |M(Q, G)|
subgraph isomorphism   NP-complete      |V|^|VQ|
graph simulation       quadratic time   |V| |VQ|
bounded simulation     cubic time       |V| |VQ|
regular matching       cubic time       |V| |VQ|
strong simulation      cubic time       |V|
Summary
Graph pattern matching
• Subgraph isomorphism
• Graph simulation
• Bounded simulation
• Regular matching
• Strong simulation
• . . .
A uniform framework for these
Querying both topology and data content
• What query language should we use for social data analysis?
• Strike a balance between the expressivity and complexity
The study has raised as many questions as it has answered
Reading: W. Fan. Graph Pattern Matching Revised for Social Network Analysis, ICDT 2012. (survey of graph pattern matching)
Summary and review
What is subgraph isomorphism? Complexity? Algorithm? Name
a few applications
What is graph simulation? Complexity? Understand its
algorithm. Name a few applications
Why do we need to revise conventional graph pattern matching
for social network analysis? How should we do it? Why?
Understand bounded simulation. Read its algorithm.
Complexity?
What is strong simulation? Complexity? Name a few
applications in which strong simulation is useful.
Find other revisions of conventional graph pattern matching that
are not covered in the lecture.
Project (1)
Recall bounded graph simulation
• Implement an algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via bounded simulation
• Develop optimization strategies
• Experimentally evaluate your algorithm, especially its scalability with the size of G
• Write a survey on revisions of conventional graph simulation, as related work
A development project
Project (2)
Recall graph simulation
• Develop a MapReduce algorithm that, given a pattern Q and a graph G, computes the maximum match of Q in G via graph simulation
• Develop optimization strategies
• Experimentally evaluate your algorithm, especially its scalability with the size of G
• Write a survey on revisions of conventional graph simulation, as part of the related work
A research and development project
Project (3)
Recall subgraph isomorphism
• Develop two algorithms that, given a pattern Q and a graph G, compute all matches of Q in G via subgraph isomorphism, in
 • MapReduce (see Lecture 4)
 • BSP (see Lecture 5)
• Develop optimization strategies to reduce parallel computational cost and data shipment cost
• Experimentally evaluate your algorithms, especially their scalability with the size of G
• Write a survey on parallel algorithms for subgraph isomorphism
A development project
Papers for you to review
• M. R. Henzinger, T. Henzinger, and P. Kopke. Computing simulations on finite and infinite graphs. FOCS, 1995.
 http://infoscience.epfl.ch/record/99332/files/HenzingerHK95.pdf
• L. P. Cordella, P. Foggia, C. Sansone, M. Vento. A (Sub)Graph Isomorphism Algorithm for Matching Large Graphs, IEEE Trans. Pattern Anal. Mach. Intell. 26, 2004 (search Google Scholar)
• A. Fard, M. U. Nisar, J. A. Miller, L. Ramaswamy. Distributed and scalable graph pattern matching: models and algorithms. Int. J. Big Data.
 http://cobweb.cs.uga.edu/~ar/papers/IJBD_final.pdf
• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Graph pattern matching: From intractable to polynomial time, VLDB, 2010.
• W. Fan, J. Li, S. Ma, N. Tang, and Y. Wu. Adding Regular Expressions to Graph Reachability and Pattern Queries, ICDE 2011.
• S. Ma, Y. Cao, W. Fan, J. Huai, T. Wo. Strong simulation: Capturing topology in graph pattern matching. TODS 39(1): 4, 2014.