Exploiting Local Similarity for Indexing Paths in Graph-Structured Data
by Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes
Abdullah Mueen
Outline
• X No Outline
• X No Confusing Syntax
• X No Pseudocode
• Examples
• Results
XML as Data Graph
[Figure: an XML document drawn as a data graph, with callouts for oid, label(3) and value(13)]
Non-tree edges: model IDREF relationships in the document
Some Notations
• node path:
– 1.2.3.7.14
• label path:
– ROOT.metro.cultural.museum.name
• 1.2.3.7 matches ROOT.metro.cultural.museum
• 2.3.7 does not match metro.cultural.museum.name
• 7 and 6 both match ROOT.metro.cultural.museum
• k-path:
– Label path of length ≤ k
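The matching rule in these notations can be sketched in a few lines of Python. The label table and edge list are assumptions: only the labels of nodes 1, 2, 3, 6, 7 and 14 can be read off the examples, and the edge 3→6 is inferred from the statement that 6 and 7 both match ROOT.metro.cultural.museum.

```python
# Hypothetical fragment of the XML data graph (assumption, not the full figure).
LABEL = {1: "ROOT", 2: "metro", 3: "cultural", 6: "museum", 7: "museum", 14: "name"}
EDGES = {(1, 2), (2, 3), (3, 6), (3, 7), (7, 14)}


def is_node_path(nodes):
    """True if consecutive nodes are joined by an edge of the data graph."""
    return all((a, b) in EDGES for a, b in zip(nodes, nodes[1:]))


def matches(node_path, label_path):
    """A node path matches a label path if it is a path in the graph and
    the labels of its nodes, read in order, spell out the label path."""
    nodes = [int(n) for n in node_path.split(".")]
    labels = label_path.split(".")
    return is_node_path(nodes) and [LABEL[n] for n in nodes] == labels


print(matches("1.2.3.7", "ROOT.metro.cultural.museum"))   # True
print(matches("2.3.7", "metro.cultural.museum.name"))     # False
```

A node such as 6 or 7 is then said to match a label path when some node path ending at it matches.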
Path Expression
Syntax:
• label : matches a node with that label
• - : matches any label
• . : sequencing
• | : alternation
• * : repetition
• ? : optional
Examples:
• ROOT.metro.cultural.museum → 6, 7
• ROOT.(-.-.-).name → 12, 14, 16, 19, 22, 24
• ROOT.-*.hotel → all hotel nodes
• ROOT.metro.neighborhoods.neighborhood.(-|-.-)?.(hotel|museum).name → 12, 14, 16, 19
XPath and other query languages that use path expressions:
• http://saxon.sourceforge.net/saxon6.5.3/expressions.html
• http://www.w3.org/1999/09/ql/docs/xquery.html
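Because the operators mirror regular-expression operators, one way to test a single label path against a path expression is to translate the expression into a Python regex. This is only a sketch: the function names and the "/"-terminated encoding of label paths are assumptions made for illustration, and labels are assumed to be plain identifiers.

```python
import re

# A token is either a label (identifier) or one of the operator characters.
TOKEN = re.compile(r"[A-Za-z_]\w*|[-.()|*?]")


def to_regex(path_expr):
    """Translate a path expression into a regex over encoded label paths.

    A label path such as ROOT.metro.hotel is encoded as "ROOT/metro/hotel/",
    so every label (or wildcard) becomes one self-contained group.
    """
    parts = []
    for tok in TOKEN.findall(path_expr):
        if tok == ".":
            continue                      # sequencing is plain concatenation
        elif tok == "-":
            parts.append(r"(?:[^/]+/)")   # wildcard: any single label
        elif tok == "(":
            parts.append("(?:")
        elif tok in ")|*?":
            parts.append(tok)             # same meaning as in regex syntax
        else:
            parts.append("(?:" + re.escape(tok) + "/)")
    return re.compile("".join(parts))


def label_path_matches(path_expr, label_path):
    """True if the whole label path is accepted by the path expression."""
    encoded = "".join(label + "/" for label in label_path.split("."))
    return to_regex(path_expr).fullmatch(encoded) is not None


print(label_path_matches("ROOT.-*.hotel", "ROOT.metro.neighborhoods.neighborhood.hotel"))
```

This checks label paths only; finding the matching nodes still requires walking the graph.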
The Problem
• Given a graph G and a path expression P, which nodes of G match P?
• One possible solution is to evaluate the path expression query directly on the data graph.
• But the data graph can be too large to fit in main memory, and too large to search completely even if it fits.
Indexing Data Graph
• No schema
• No keys
• Only structural information is available, which can be summarized by a smaller graph I(G). This summary graph serves as an index for the whole data graph.
Indexing Data Graph : Example(1)
[Figure: data graph G (nodes 0 to 9, labels R, A, B, C, D) and index graph I(G)]
Extents in I(G): {1}, {2,4}, {3}, {5,7}, {6}, {8,9}
Extent: ext(17) = {5,7}, ext(13) = {2,4}
On G: R.A.-*.C = {5,7}, R.-.B = {4,2}
On I(G): R.A.-*.C = {5,7}, R.-.B = {4,2}
Precise Index, e.g. DataGuide, 1-index
Indexing Data Graph : Example(2)
[Figure: data graph G and index graph I(G)]
Extents in I(G): {1}, {2,4}, {3,5,7}, {6,8,9}
On G: R.A.-*.C = {5,7}, R.-.-*.B = {4}
On I(G): R.A.-*.C = {3,5,7}, R.-.-*.B = {2,4}
Safe Index
Indexing Data Graph : Example(3)
[Figure: data graph G and index graph I(G)]
Extents in I(G): {1}, {2,4}, {3,5,7}, {6,8,9}
On G: R.A.-*.C = {5,7}, R.-.-*.B = {2}
On I(G): R.A.-*.C = {3,5,7}, R.-.-*.B = { }
Unsafe Index
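The three examples differ only in how the result on I(G) relates to the result on G. A minimal sketch of that comparison, using the numbers from Examples (1) to (3), is given here. The function name is hypothetical, and a real classification would have to hold for all queries, not just the two listed per example.

```python
def classify(on_graph, on_index):
    """Compare per-query results; both arguments map query -> set of node ids.

    precise: the index returns exactly the data graph result
    safe:    the index returns at least the data graph result (no misses)
    unsafe:  the index misses some node that the data graph returns
    """
    if all(on_index[q] == on_graph[q] for q in on_graph):
        return "precise"
    if all(on_index[q] >= on_graph[q] for q in on_graph):
        return "safe"
    return "unsafe"


example1 = classify({"R.A.-*.C": {5, 7}, "R.-.B": {2, 4}},
                    {"R.A.-*.C": {5, 7}, "R.-.B": {2, 4}})
example2 = classify({"R.A.-*.C": {5, 7}, "R.-.-*.B": {4}},
                    {"R.A.-*.C": {3, 5, 7}, "R.-.-*.B": {2, 4}})
example3 = classify({"R.A.-*.C": {5, 7}, "R.-.-*.B": {2}},
                    {"R.A.-*.C": {3, 5, 7}, "R.-.-*.B": set()})
print(example1, example2, example3)   # precise safe unsafe
```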
Bisimilarity
[Figure: data graph G]
On G: R.A.-*.C = {5,7}, R.-.B = {4,2}
Two nodes u and v are called bisimilar (u ≈b v) if
1. label(u) = label(v)
2. every incoming label path from ROOT to u matches at least one incoming path from ROOT to v, and vice versa.
• 2,4 are bisimilar
• 5,7 are bisimilar
• 8,9 are bisimilar
• 6,8 are not bisimilar
≈b defines an equivalence relation over the set of nodes in G; its equivalence classes partition the nodes.
Needs O(m log n) time to find the partitions (m edges, n nodes)
Equivalence Classes of ≈b → The 1-index
[Figure: data graph G and index graph I(G)]
Extents in I(G): {1}, {2,4}, {3}, {5,7}, {6}, {8,9}
On G: R.A.-*.C = {5,7}, R.-.B = {4,2}
On I(G): R.A.-*.C = {5,7}, R.-.B = {4,2}
Revisiting Bisimilarity
• The size of the 1-index is bounded above by the size (number of nodes) of the data graph.
• For large real documents it is almost 45% of the size of the data graph.
• Bisimilarity partitions nodes by considering all incoming paths from ROOT, which is a global comparison between nodes.
k-bisimilarity
[Figure: data graph G]
Two nodes u and v are called k-bisimilar (u ≈k v) if
1. label(u) = label(v)
2. every incoming label path of length ≤ k to u matches at least one incoming path of length ≤ k to v, and vice versa.
• 2,4 are 0-bisimilar
• 5,7 are 1-bisimilar
• 8,9 are 2-bisimilar
• 6,8 are 1-bisimilar
≈k defines an equivalence relation over the set of nodes in G
The algorithm for computing k-bisimulation will be shown later
Equivalence Classes of ≈0 → A(0) index
[Figure: data graph G and index graph A(0)]
Extents in A(0): {1}, {2,4}, {3,5,7}, {6,8,9}
Label grouping / Label partition
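Label grouping is simple enough to sketch directly. The data graph used here is an assumption: its node labels follow the A(0) extents listed on this slide, but the edge list is a reconstruction chosen to agree with the partitions shown for A(0) to A(3), since the original figure is not recoverable from the text.

```python
from collections import defaultdict

# Hypothetical data graph (assumption): node -> label, and parent -> child edges.
LABEL = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
         5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4),
         (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]


def build_a0(label, edges):
    """Group nodes by label; link two groups if any data edge links members."""
    extent = defaultdict(set)
    for node, lab in label.items():
        extent[lab].add(node)
    index_edges = {(label[u], label[v]) for u, v in edges}
    return dict(extent), index_edges


extent, index_edges = build_a0(LABEL, EDGES)
print(extent["B"], extent["C"], extent["D"])
print(sorted(index_edges))
```

With one index node per label, A(0) is as small as an index of this kind can get, at the cost of the false positives seen in the safe-index example.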
Equivalence Classes of ≈1 → A(1) index
[Figure: data graph G and index graph A(1)]
Extents in A(1): {1}, {2}, {4}, {3}, {5,7}, {6,8,9}
A(k) index family
[Figure: data graph G and the index graphs A(0), A(1), A(2) and A(3)]
Extents in A(0): {1}, {2,4}, {3,5,7}, {6,8,9}
Extents in A(1): {1}, {2}, {4}, {3}, {5,7}, {6,8,9}
Extents in A(2): {1}, {2}, {4}, {3}, {5}, {7}, {6}, {8,9}
Extents in A(3) = 1-index: {1}, {2}, {4}, {3}, {5}, {7}, {6}, {8}, {9}
Properties of A(k) index
[Figure: data graph G and index graph A(1), with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}]
How to compute A(1) index
[Figure: data graph G]
Lookup (label partition): {1} {2,4} {3,5,7} {6,8,9}
Refining, step by step:
{1} {2,4} {3,5,7} {6,8,9}
{1} {2} {4} {3,5,7} {6,8,9}
{1} {2} {4} {3} {5,7} {6,8,9}
Result (1-bisimilar partition): {1} {2} {4} {3} {5,7} {6,8,9}
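One way to reproduce these partitions is signature-based refinement: start from the label partition and, in each round, split a class whenever its members have parents in different classes of the previous round. This is a sketch, not necessarily the procedure animated on the slides. The data graph is an assumption whose edge list was chosen to agree with the listed partitions, and the root (node 0), which the slides omit, appears as its own class.

```python
# Hypothetical data graph (assumption): node -> label, and parent -> child edges.
LABEL = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
         5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4),
         (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]


def refine(cls, edges):
    """One round: the new class of a node is its old class plus the set of
    old classes of its parents."""
    parent_classes = {n: set() for n in cls}
    for u, v in edges:
        parent_classes[v].add(cls[u])
    return {n: (cls[n], frozenset(parent_classes[n])) for n in cls}


def k_bisimilar_partition(label, edges, k):
    """Start from the label partition (k = 0) and refine k times."""
    cls = dict(label)
    for _ in range(k):
        cls = refine(cls, edges)
    return cls


def blocks(cls):
    """Turn a node -> class mapping into a sorted list of node groups."""
    groups = {}
    for node, c in cls.items():
        groups.setdefault(c, []).append(node)
    return sorted(sorted(g) for g in groups.values())


for k in range(4):
    print(k, blocks(k_bisimilar_partition(LABEL, EDGES, k)))
```

On this graph a fourth round changes nothing, which is the sense in which A(3) coincides with the 1-index.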
How to compute A(2) index
[Figure: data graph G]
Lookup (1-bisimilar partition): {1} {2} {4} {3} {5,7} {6,8,9}
Refining, step by step:
{1} {2} {4} {3} {5,7} {6,8,9}
{1} {2} {4} {3} {5} {7} {6,8,9}
{1} {2} {4} {3} {5} {7} {6} {8,9}
Result (2-bisimilar partition): {1} {2} {4} {3} {5} {7} {6} {8,9}
Query Evaluation: Forward or Backward
[Figure: index graph A(1) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}, and the automaton for R.A.-*.C with transitions R, A, -, C]
R.A.-*.C = {5,7}
• Repeated states are prevented: O(|A|*m)
• Backward evaluation using label-group
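Forward evaluation can be sketched as a walk over pairs (automaton state, index node), where each pair is visited at most once. If |A| is read as the number of automaton states and m as the number of index edges (an assumption about the notation), this is what keeps the work within O(|A|*m). The index graph is a hypothetical A(1) whose extents follow the slide but whose edges are assumed, and the automaton for R.A.-*.C is built by hand.

```python
from collections import deque

# Hypothetical A(1) index (assumption): index node -> (label, extent).
INDEX_NODES = {
    "r": ("R", {0}), "a": ("A", {1}), "b2": ("B", {2}), "b4": ("B", {4}),
    "c3": ("C", {3}), "c57": ("C", {5, 7}), "d": ("D", {6, 8, 9}),
}
INDEX_EDGES = {
    "r": ["a", "b2", "c3"], "a": ["b2", "b4"], "b2": ["c57"],
    "b4": ["c57"], "c3": ["d"], "c57": ["d"], "d": [],
}

# Automaton for R.A.-*.C: state -> list of (label or "-", next state).
NFA = {0: [("R", 1)], 1: [("A", 2)], 2: [("-", 2), ("C", 3)], 3: []}
START, ACCEPT = 0, {3}


def evaluate(root="r"):
    """Breadth-first walk over (automaton state, index node) pairs."""
    result, seen, queue = set(), set(), deque()

    def step(state, node):
        # Consume the label of `node` from `state`; enqueue pairs not seen yet.
        for symbol, nxt in NFA[state]:
            if symbol in ("-", INDEX_NODES[node][0]) and (nxt, node) not in seen:
                seen.add((nxt, node))
                queue.append((nxt, node))

    step(START, root)
    while queue:
        state, node = queue.popleft()
        if state in ACCEPT:
            result |= INDEX_NODES[node][1]   # report the whole extent
        for child in INDEX_EDGES[node]:
            step(state, child)
    return result


print(evaluate())   # {5, 7}
```

Backward evaluation would run the same walk from the index nodes carrying the last label of the query, following edges in reverse.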
Query Evaluation : Validation
[Figure: index graph A(1) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}, and the automaton for R.A.B.C.D with transitions R, A, B, C, D]
R.A.B.C.D = {6,8,9}
• Repeated states are prevented: O(|A|*m)
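Validation means checking each candidate returned by the index against the data graph itself. The sketch below does this for R.A.B.C.D on a hypothetical data graph (the edge list is an assumption chosen to agree with the A(k) partitions on the slides). In this assumed graph the candidate set {6,8,9} contains one false positive, node 6; whether the original figure behaves the same way cannot be checked from the text.

```python
# Hypothetical data graph (assumption): node -> label, and parent -> child edges.
LABEL = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B",
         5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 4),
         (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]

PARENTS = {n: [] for n in LABEL}
for u, v in EDGES:
    PARENTS[v].append(u)


def has_incoming_label_path(node, labels):
    """True if some path in the data graph ending at `node` spells `labels`."""
    if LABEL[node] != labels[-1]:
        return False
    if len(labels) == 1:
        return True
    return any(has_incoming_label_path(p, labels[:-1]) for p in PARENTS[node])


def validate(candidates, label_path):
    """Keep only the candidates that really match the label path."""
    labels = label_path.split(".")
    return {n for n in candidates if has_incoming_label_path(n, labels)}


print(validate({6, 8, 9}, "R.A.B.C.D"))   # {8, 9}
```

The check walks parent edges only as deep as the query is long, so its cost grows with the number of candidates and the length of the path.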
Avoiding Validation
[Figure: index graph A(1) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}]
On A(1): R.-*.C.D = {6,8,9}
For Queries like R.-*.p, we can safely avoid validation on A(k) if p is a k-path.
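The rule can be written as a small check. Two assumptions are made here: path length is counted in edges (inferred from the slide treating C.D as a 1-path on A(1)), and any query that is not of the form R.-*.p with a plain label path p is conservatively sent to validation. The function name is hypothetical.

```python
def needs_validation(query, k):
    """Decide whether results from A(k) must be validated for this query."""
    prefix = "R.-*."
    if not query.startswith(prefix):
        return True                       # rule does not apply; validate
    p_labels = query[len(prefix):].split(".")
    if any(not lab.isalnum() for lab in p_labels):
        return True                       # p must be a plain label path
    return len(p_labels) - 1 > k          # p is a k-path iff its length <= k


print(needs_validation("R.-*.C.D", 1))     # False
print(needs_validation("R.-*.B.C.D", 1))   # True
print(needs_validation("R.A.B.C.D", 1))    # True
```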
Results
Conclusion
• The A(k) index is smaller than precise indexes, which brings advantages such as faster execution time while keeping significant accuracy.
• Future presentations
– Change of the indexes with updates.
– Incorporating more complex queries.