![Page 1: TALE: A Tool for Approximate Large Graph Matching · 2009-12-29 · TALE 0 6 NA 3.2% 910.0 0.3 mouse vs. human Graemlin TALE 18 42 5.0% 13.6% 16305.5 0.8 # KEGGshit: number of pathways](https://reader036.vdocument.in/reader036/viewer/2022081613/5fb5fb911d05805cdc4c18d0/html5/thumbnails/1.jpg)
Yuanyuan Tian and Jignesh M. Patel
University of Michigan
TALE: A Tool for Approximate Large Graph Matching

Motivation
• Graphs are everywhere.
  ◦ Social networks, computer networks, biological networks
• Graph databases are large and growing rapidly in size.
• A wealth of information is encoded in graph databases.
[Figure: Growth of the KEGG Database. x-axis: Year (1999 to 2007); y-axis: # Pathways (0 to 80,000); labeled values: 2706, 29921, 41689, 66407]
Need: Graph Matching

Motivation
• Previous studies largely focus on exact graph matching.
  ◦ Assume precise graph data
  ◦ Subgraph isomorphism (NP-complete)
• Real-life graphs are noisy and incomplete.
  ◦ More challenging (need heuristic methods)
Need: Approximate Graph Matching
[Figure: an approximate match between two graphs, annotated with: Match, Node Similarity, Gap Node, Difference in Node Connectivity]

Motivation
• Most existing methods are applicable only to small query graphs.
  ◦ 10s of nodes and edges
• Support for large queries is increasingly needed.
  ◦ Protein Interaction Networks (PINs):
    ▪ 100s to 1000s of nodes and edges
    ▪ Compare the PIN of one species against those of other species
Need: Large Approximate Graph Matching

TALE: A Tool for Approximate Large Graph Matching
• A Novel Disk-based Indexing Method
  ◦ High pruning power
  ◦ Index size linear in the database size
• Index-based Matching Algorithm
  ◦ Significantly outperforms existing methods
  ◦ Gracefully handles large queries and databases
• Experiments on Real Datasets
  ◦ Effectiveness
  ◦ Efficiency

Neighborhood Indexing
• Index Unit?

| Index Unit    | Pruning Power | Index Size |
|---------------|---------------|------------|
| Nodes         | Low ☹         | O(n) ☺     |
| Subgraphs     | High ☺        | O(n^k) ☹   |
| Neighborhoods | High ☺        | O(n) ☺     |

Neighborhood: the induced subgraph of a node and its neighbors

Index Unit
• Index Unit: Neighborhood
  ◦ Which node is at the center?
    ▪ Node label
  ◦ How many neighbors does the node have?
    ▪ Node degree
  ◦ How do the neighbors connect to each other?
    ▪ NeighborConnection: # edges between neighbors
  ◦ Who are the neighbors?

Index Unit
• Who are the neighbors?
  ◦ Naïve approach: list the labels of the neighbors
    ▪ Problem: the number of neighbors varies.
  ◦ If the number of labels in the problem domain is a small constant:
    ▪ Deterministic bit array
  ◦ What if the number of labels is huge?
    ▪ Bloom filter: label → hash → position in an m-bit array
• Information in the index unit
  ◦ (label, degree, nConn, nArray)
Neighbor array example (one bit per label): A = 1, B = 0, C = 0, D = 1, E = 1
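
The index unit described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the graph is an adjacency dictionary of sets, the bit array is packed into a Python integer, and a single CRC32 hash with an array width of 32 bits stands in for the Bloom-filter hashing (the slides do not say which hash function or width TALE uses). The names `label_bit`, `IndexUnit`, and `build_index_unit` are hypothetical.

```python
import zlib
from dataclasses import dataclass
from itertools import combinations

M_BITS = 32  # assumed width of the neighbor array (not given in the slides)


def label_bit(label: str, m: int = M_BITS) -> int:
    """Map a label to one position in an m-bit array (single hash, assumed)."""
    return zlib.crc32(label.encode("utf-8")) % m


@dataclass(frozen=True)
class IndexUnit:
    label: str    # label of the center node
    degree: int   # number of neighbors
    n_conn: int   # number of edges between the neighbors
    n_array: int  # neighbor array, packed into an int


def build_index_unit(graph, labels, node, m=M_BITS):
    """Summarize the neighborhood of `node` as (label, degree, nConn, nArray)."""
    neighbors = graph[node]
    # Count edges whose two endpoints are both neighbors of `node`.
    n_conn = sum(1 for u, v in combinations(sorted(neighbors), 2) if v in graph[u])
    n_array = 0
    for nb in neighbors:
        n_array |= 1 << label_bit(labels[nb], m)
    return IndexUnit(labels[node], len(neighbors), n_conn, n_array)


# Toy graph: c is connected to a, d and e; a and d are also connected.
graph = {"a": {"c", "d"}, "c": {"a", "d", "e"}, "d": {"a", "c"}, "e": {"c"}}
labels = {"a": "A", "c": "C", "d": "D", "e": "E"}
unit = build_index_unit(graph, labels, "c")
```

Because several labels can hash to the same position, the array can only over-approximate the neighbor set, which is why it serves for pruning rather than as a final answer.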

Match a Query Neighborhood
[Figure: a query neighborhood Nq (Query) and a database neighborhood Ndb (DB)]
Exact
• Nq.label = Ndb.label
• Nq.degree ≤ Ndb.degree
• Nq.nConn ≤ Ndb.nConn
• (NOT Ndb.nArray) AND Nq.nArray = 0
Approximate
• group(Nq.label) = group(Ndb.label), where group() groups nodes based on similarity
• Nq.degree ≤ Ndb.degree + ε
• Nq.nConn ≤ Ndb.nConn + δ
• |(NOT Ndb.nArray) AND Nq.nArray| ≤ ε
ρ: percentage of the neighbors of a query node that have no corresponding match in the neighborhood of a database node
Max # missing neighbors: ε = ρ · Nq.degree
Max # missing nConn: δ = ε(ε − 1)/2 + ε(Nq.degree − ε)
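
The exact and approximate conditions above translate almost directly into code. The sketch below is a minimal illustration, not the TALE implementation: `Unit`, `tolerances`, `exact_match`, and `approx_match` are hypothetical names, the neighbor array is a Python integer, rounding ε down to an integer is an assumption, and `group` defaults to the identity function because the slides do not define the grouping.

```python
from typing import Callable, NamedTuple


class Unit(NamedTuple):
    label: str
    degree: int
    n_conn: int
    n_array: int  # neighbor array packed into an int


def missing_bits(q: Unit, db: Unit) -> int:
    """|(NOT Ndb.nArray) AND Nq.nArray|: query bits absent from the DB array."""
    return bin(~db.n_array & q.n_array).count("1")


def exact_match(q: Unit, db: Unit) -> bool:
    return (q.label == db.label
            and q.degree <= db.degree
            and q.n_conn <= db.n_conn
            and missing_bits(q, db) == 0)


def tolerances(q_degree: int, rho: float):
    """epsilon = rho * degree (rounded down, assumed); delta as on the slide."""
    eps = int(rho * q_degree)
    # Edges among the eps missing neighbors, plus edges from missing to present ones.
    delta = eps * (eps - 1) // 2 + eps * (q_degree - eps)
    return eps, delta


def approx_match(q: Unit, db: Unit, rho: float,
                 group: Callable[[str], str] = lambda label: label) -> bool:
    eps, delta = tolerances(q.degree, rho)
    return (group(q.label) == group(db.label)
            and q.degree <= db.degree + eps
            and q.n_conn <= db.n_conn + delta
            and missing_bits(q, db) <= eps)


q = Unit("A", 4, 3, 0b1011)
db = Unit("A", 3, 1, 0b0011)
print(exact_match(q, db))         # False: the DB node has too few neighbors
print(approx_match(q, db, 0.25))  # True: one missing neighbor is tolerated
```

With ρ = 0.25 and a query degree of 4, ε = 1 and δ = 0 + 1·3 = 3, so the database neighborhood may lack one neighbor and up to three neighbor connections.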

Index Structure
• Support efficient search for DB neighborhoods:
  group(Ndb.label) = group(Nq.label)
  Ndb.degree ≥ Nq.degree − ε
  Ndb.nConn ≥ Nq.nConn − δ
  |(NOT Ndb.nArray) AND Nq.nArray| ≤ ε
• Simple implementation in RDBMSs.
• Use existing robust disk-based index structures in RDBMSs.
[Figure: Hybrid Index Structure. Neighborhood entries n0 to n5 with a bitmap index on nArray (example rows: 1 0 0 1 and 1 1 0 0)]
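
As a rough illustration of the search conditions on this slide, the sketch below stores index units in a relational table and lets an ordinary multi-column index serve the group, degree, and nConn predicates. It rests on several assumptions: SQLite (through Python's `sqlite3` module) stands in for the RDBMS, the table and column names are invented, and the bitmap index on nArray is not reproduced; the nArray condition is simply checked in Python on the rows that survive the range probe.

```python
import sqlite3


def build_db(units):
    """Load (node, grp, degree, nconn, narray) tuples and index the scalar fields."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE nh (node TEXT, grp TEXT, degree INTEGER, "
        "nconn INTEGER, narray INTEGER)"
    )
    # Ordinary multi-column index; the RDBMS decides how to use it.
    con.execute("CREATE INDEX nh_idx ON nh (grp, degree, nconn)")
    con.executemany("INSERT INTO nh VALUES (?, ?, ?, ?, ?)", units)
    return con


def probe(con, grp, degree, nconn, narray, eps, delta):
    """Return DB neighborhoods satisfying the four conditions on the slide."""
    rows = con.execute(
        "SELECT node, narray FROM nh "
        "WHERE grp = ? AND degree >= ? AND nconn >= ?",
        (grp, degree - eps, nconn - delta),
    )
    # nArray condition: at most eps query bits may be absent from the DB array.
    return sorted(
        node for node, db_arr in rows
        if bin(~db_arr & narray).count("1") <= eps
    )


units = [
    ("n0", "A", 3, 1, 0b1001),
    ("n1", "A", 4, 3, 0b1100),
    ("n2", "B", 4, 3, 0b1011),
    ("n3", "A", 1, 0, 0b1011),
]
con = build_db(units)
matches = probe(con, "A", 4, 3, 0b1011, eps=1, delta=3)
print(matches)  # ['n0']
```

Here n1 is rejected by the nArray test (two query bits missing), n2 by its group, and n3 by its degree.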

Index Probing
• Probe the B+-tree for group, degree, and nConn
  ◦ Easy
• Probe bitmaps for nArrays
  ◦ Naïve approach: look at each row of a bitmap
  ◦ A better approach
    ▪ Operate on bit slices.
    ▪ Up to 12X speedup!
[Figure: example bitmap rows 1 0 0 1 and 1 1 0 0]
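
One way to picture the difference between scanning rows and operating on bit slices is sketched below. The slides do not describe the actual slicing algorithm, so this is an assumed formulation: the bitmap is stored column-wise (one integer per bit position, with bit i belonging to row i), and a small set of running bitmaps tracks which rows have exactly k missing query bits so far. All function names are hypothetical.

```python
def probe_rows(rows, q_array, eps):
    """Naive approach: test every row of the bitmap individually."""
    return [i for i, arr in enumerate(rows)
            if bin(~arr & q_array).count("1") <= eps]


def to_slices(rows, m):
    """Transpose row bitmaps into m bit slices; bit i of slice j is row i's bit j."""
    slices = [0] * m
    for i, arr in enumerate(rows):
        for j in range(m):
            if (arr >> j) & 1:
                slices[j] |= 1 << i
    return slices


def probe_slices(slices, n_rows, q_array, eps):
    """Bit-sliced approach: one pass of bitwise operations per query bit."""
    all_rows = (1 << n_rows) - 1
    # miss[k] is the set of rows with exactly k missing query bits so far.
    miss = [all_rows] + [0] * eps
    for j, present in enumerate(slices):
        if not (q_array >> j) & 1:
            continue  # the query does not need this bit
        absent = ~present & all_rows
        nxt = [miss[0] & present]
        for k in range(1, eps + 1):
            nxt.append((miss[k] & present) | (miss[k - 1] & absent))
        miss = nxt  # rows exceeding eps misses drop out of every set
    hits = 0
    for bitmap in miss:
        hits |= bitmap
    return [i for i in range(n_rows) if (hits >> i) & 1]


rows = [0b1001, 0b1100, 0b1011, 0b0011]
slices = to_slices(rows, 4)
print(probe_slices(slices, len(rows), 0b1011, eps=1))  # [0, 2, 3]
```

The sliced version touches only the slices that correspond to bits set in the query array, and each step processes all rows at once with word-level operations, which is the general reason slice-wise probing can beat a row-by-row scan.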

Observations
• Observation 1: Not every node plays the same role in a graph.
  ◦ Node importance
• Observation 2: A good match should be more tolerant towards missing unimportant nodes than missing important nodes.

Matching Algorithm Overview
• Step 1: Match the important nodes from the query.
• Step 2: Progressively extend the node matches.
[Figure: Query Graph and Database Graph]

TALE Matching Algorithm
• Step 1: Match important nodes from the query.
  ◦ Select important nodes.
    ▪ Importance measure: degree centrality
    ▪ The percentage of important nodes: P
  ◦ Probe the Neighborhood Index to match important nodes.
  ◦ For each candidate graph in the database, find the one-to-one mappings to the important query nodes.
    ▪ Maximum weighted bipartite graph matching
[Figure: bipartite graph between query nodes and nodes in a DB graph; edge weight reflects matching neighbors and neighbor connections]
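
Step 1 can be illustrated with a small sketch. Two simplifications are assumed here and are not taken from the slides: the important nodes are the top P fraction by degree with ties broken by node name, and the maximum weighted bipartite matching is found by brute force over permutations, which is only feasible for a handful of nodes (a real system would use a polynomial-time assignment algorithm). The candidate weights are made-up numbers standing in for the neighborhood-similarity weights.

```python
from itertools import permutations


def important_nodes(graph, p):
    """Top fraction p of nodes by degree (degree centrality); ties by name."""
    k = max(1, int(p * len(graph)))
    ranked = sorted(graph, key=lambda n: (-len(graph[n]), n))
    return ranked[:k]


def best_mapping(query_nodes, db_nodes, weights):
    """Brute-force maximum-weight one-to-one mapping.

    `weights[(q, d)]` exists only for candidate pairs returned by the index
    probe; a query node may stay unmatched (paired with None).
    """
    slots = list(db_nodes) + [None] * len(query_nodes)
    best, best_score = {}, 0.0
    for perm in permutations(slots, len(query_nodes)):
        pairs = [(q, d) for q, d in zip(query_nodes, perm) if d is not None]
        if any(pair not in weights for pair in pairs):
            continue  # uses a pair that is not a candidate match
        score = sum(weights[pair] for pair in pairs)
        if score > best_score:
            best, best_score = dict(pairs), score
    return best, best_score


query = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
important = important_nodes(query, p=0.5)
# Hypothetical candidate weights for the important query nodes.
weights = {("a", "x"): 3.0, ("a", "y"): 2.0, ("b", "x"): 2.5, ("b", "y"): 0.5}
mapping, score = best_mapping(important, ["x", "y"], weights)
print(important, mapping, score)  # ['a', 'b'] {'a': 'y', 'b': 'x'} 4.5
```

In this toy case a greedy choice would pair a with x first and end with a total of 3.5, whereas the optimal one-to-one mapping reaches 4.5, which is the reason for solving the assignment as a whole.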

TALE Matching Algorithm
• Step 2: Progressively extend the node matches.
  ◦ Start from the important node matches.
  ◦ Match “nearby” nodes of already matched nodes.
    ▪ Not just immediate neighbors
    ▪ Also nodes two hops away
[Figure: extending matches across gap nodes and differences in node connectivity]
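
The extension step can be pictured with the following sketch. The slides do not say how nearby candidates are scored or ordered, so the choices here are assumptions made only for illustration: candidates are paired greedily in sorted order, two nodes are considered compatible when their labels are equal, and "nearby" means within two hops in both graphs. The function names are hypothetical.

```python
from collections import deque


def within_two_hops(graph, node):
    """Nodes reachable from `node` in one or two hops, excluding `node`."""
    one = set(graph[node])
    two = set()
    for n in one:
        two |= graph[n]
    return (one | two) - {node}


def extend_matches(q_graph, db_graph, q_labels, db_labels, seeds):
    """Grow a node mapping outward from the seed (important-node) matches."""
    mapping = dict(seeds)
    used = set(mapping.values())
    queue = deque(sorted(mapping.items()))
    while queue:
        q, d = queue.popleft()
        db_near = sorted(within_two_hops(db_graph, d))
        for qn in sorted(within_two_hops(q_graph, q)):
            if qn in mapping:
                continue
            for dn in db_near:
                # Assumed compatibility test: identical labels.
                if dn not in used and db_labels[dn] == q_labels[qn]:
                    mapping[qn] = dn
                    used.add(dn)
                    queue.append((qn, dn))
                    break
    return mapping


# Query path q1-q2-q3; the DB graph has an extra gap node g between d1 and d2.
q_graph = {"q1": {"q2"}, "q2": {"q1", "q3"}, "q3": {"q2"}}
q_labels = {"q1": "A", "q2": "B", "q3": "C"}
db_graph = {"d1": {"g"}, "g": {"d1", "d2"}, "d2": {"g", "d3"}, "d3": {"d2"}}
db_labels = {"d1": "A", "g": "X", "d2": "B", "d3": "C"}
result = extend_matches(q_graph, db_graph, q_labels, db_labels, {"q1": "d1"})
print(result)  # {'q1': 'd1', 'q2': 'd2', 'q3': 'd3'}
```

Looking two hops away is what lets q2 be matched to d2 even though the database graph has the gap node g between d1 and d2.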

Experimental Evaluation
• Implementation
  ◦ C++ on top of PostgreSQL
• Evaluation Platform
  ◦ 2.8 GHz P4, 2 GB RAM, 250 GB SATA disk, FC2
  ◦ PostgreSQL: version 8.1.3, 512 MB buffer pool
• Experimental Datasets
  ◦ BIND protein interaction networks
  ◦ ASTRAL protein structures
• Evaluation Measures:
  ◦ Effectiveness
  ◦ Efficiency

Effectiveness Experiment
• Protein Interaction Network Comparison (BIND)

| Species | # nodes | # edges |
|---------|---------|---------|
| rat     | 830     | 942     |
| mouse   | 2991    | 3347    |
| human   | 8470    | 11260   |

| Comparison      | Method   | # KEGGs hit | KEGG coverage | Time (sec) |
|-----------------|----------|-------------|---------------|------------|
| rat vs. human   | Graemlin | 0           | NA            | 910.0      |
| rat vs. human   | TALE     | 6           | 3.2%          | 0.3        |
| mouse vs. human | Graemlin | 18          | 5.0%          | 16305.5    |
| mouse vs. human | TALE     | 42          | 13.6%         | 0.8        |

# KEGGs hit: number of pathways aligned between 2 species
KEGG coverage: fraction of proteins aligned within a pathway

Efficiency Experiment
• Query ASTRAL datasets of increasing size
• 20 queries (153.1 nodes and 592.0 edges on average)
• Top 20 results
[Figure: Index Construction Time (sec) vs. Database Size (#graphs); x-axis 0 to 80,000, y-axis 0 to 6,000]
[Figure: Index Size (MB) vs. Database Size (#graphs); x-axis 0 to 80,000, y-axis 0 to 2,500]
[Figure: Execution Time (sec) vs. Database Size (#graphs); x-axis 0 to 80,000, y-axis 0 to 100]

Related Work
• Index-based Approximate Graph Matching
  ◦ Graphfil, PIS, CDIndex, C-Tree, SAGA
  ◦ Limited approximation: Graphfil, PIS, CDIndex, C-Tree
  ◦ For small queries: Graphfil, PIS, CDIndex, SAGA
• Pairwise Graph Alignment Methods
  ◦ NetworkBlast, MaWIsh, Graemlin
  ◦ Specific to protein interaction networks
  ◦ Very slow for database search (no index)

Conclusion
• TALE: Approximate Large Graph Matching
• Neighborhood Indexing
  ◦ Disk-based index using existing index structures in RDBMSs
  ◦ High pruning power
  ◦ Index size linear in the database size
• Index-based Matching Algorithm
  ◦ Distinguish nodes by importance
  ◦ Match important nodes, then extend to others
• Experiments on Real Datasets
  ◦ Improved effectiveness and efficiency over existing methods
Questions?
Suggestions?
Thanks! ☺☺☺☺