Exploiting Local Similarity for Indexing Paths in Graph-Structured Data

by Raghav Kaushik, Pradeep Shenoy, Philip Bohannon and Ehud Gudes

Abdullah Mueen

Outline

• No outline
• No confusing syntax
• No pseudocode
• Examples
• Results


XML as Data Graph

[Figure: an XML document drawn as a data graph; callouts mark a node's oid, label(3) and value(13).]

Non-tree edges model IDREF relationships in the document.

Some Notations

• node path: 1.2.3.7.14
• label path: ROOT.metro.cultural.museum.name
• 1.2.3.7 matches ROOT.metro.cultural.museum
• 2.3.7 does not match metro.cultural.museum.name
• 7 and 6 both match ROOT.metro.cultural.museum
• k-path: a label path of length ≤ k
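The matching rule can be sketched in a few lines of Python. The label table covers only the nodes named on the slide; the rest of the document graph is omitted, and the helper names `label_path_of` and `matches` are assumptions made for illustration.

```python
# Labels of the nodes named on the slide (assumed reading of the figure);
# every other node of the document graph is left out.
LABEL = {1: "ROOT", 2: "metro", 3: "cultural", 6: "museum", 7: "museum", 14: "name"}


def label_path_of(node_path: str) -> str:
    """Map a dotted node path such as '1.2.3.7' to its dotted label path."""
    return ".".join(LABEL[int(oid)] for oid in node_path.split("."))


def matches(node_path: str, label_path: str) -> bool:
    """A node path matches a label path if the labels agree position by position."""
    return label_path_of(node_path) == label_path


print(matches("1.2.3.7", "ROOT.metro.cultural.museum"))  # True
print(matches("2.3.7", "metro.cultural.museum.name"))    # False
```

A node (rather than a node path) matches a label path when some node path ending at it matches, which is why both 6 and 7 match ROOT.metro.cultural.museum.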

Path Expression

Syntax:
• label (e.g. museum)
• - : matches any label
• . : sequencing
• | : alternation
• * : repetition
• ? : optional

Examples:
• ROOT.metro.cultural.museum → 6, 7
• ROOT.(-.-.-).name → 12, 14, 16, 19, 22, 24
• ROOT.-*.hotel → all hotel nodes
• ROOT.metro.neighborhoods.neighborhood.(-|-.-)?.(hotel|museum).name → 12, 14, 16, 19

XPath and other query languages that use path expressions:
• http://saxon.sourceforge.net/saxon6.5.3/expressions.html
• http://www.w3.org/1999/09/ql/docs/xquery.html
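One way to get a feel for the path expression syntax is to translate it into an ordinary regular expression over label sequences. The sketch below is only an illustration, not the evaluation method used in the paper. It assumes that labels are identifier-like and never contain "/", and the names `to_regex` and `matches` are made up for this example.

```python
import re

# Tokens: an identifier-like label, or one of the operators - . | * ? ( )
TOKEN = re.compile(r"[A-Za-z_]\w*|[-.|*?()]")


def to_regex(path_expr: str) -> re.Pattern:
    """Translate a path expression into a regex over '/'-terminated labels."""
    parts = []
    for tok in TOKEN.findall(path_expr):
        if tok == ".":
            continue                                # sequencing is concatenation
        elif tok == "-":
            parts.append(r"(?:[^/]+/)")             # any single label
        elif tok in {"|", "*", "?", "(", ")"}:
            parts.append(tok)                       # same meaning in Python regexes
        else:
            parts.append(f"(?:{re.escape(tok)}/)")  # a literal label
    return re.compile("".join(parts))


def matches(path_expr: str, labels: list) -> bool:
    """True if the label sequence is matched by the whole path expression."""
    encoded = "".join(label + "/" for label in labels)
    return to_regex(path_expr).fullmatch(encoded) is not None


print(matches("ROOT.-*.hotel", ["ROOT", "metro", "hotel"]))         # True
print(matches("ROOT.(-.-.-).name", ["ROOT", "a", "b", "c", "name"]))  # True
```

Wrapping every label in its own group keeps `*` and `?` bound to whole labels rather than to single characters.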

The Problem

• Given a data graph G and a path expression P, find the nodes of G that match P.

• A direct solution is to evaluate the path expression against the data graph itself.

• But the data graph may be too large to fit in main memory, and too large to search exhaustively even if it fits.

Indexing Data Graph

• No schema
• No keys
• Only structural information is available, which can be summarized by a smaller graph I(G). This summary graph serves as an index for the whole data graph.

Indexing Data Graph : Example(1)

[Figure: data graph G (nodes 0-9, labels R, A, B, C, D) and its index graph I(G) (nodes 11-18). Each index node carries an extent, the set of data nodes it represents: {1}, {2,4}, {3}, {5,7}, {6}, {8,9}.]

Extent: ext(17) = {5,7}, ext(13) = {2,4}

R.A.-*.C = {5,7} and R.-.B = {4,2} on both G and I(G).

Precise index, e.g. DataGuide, 1-index
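How an index graph follows from a partition of the data nodes can be sketched as below. The edge list is an assumption: the figure's edges did not survive, so the edges here were chosen to be consistent with the extents and query results on the slide. The function name `build_index` is likewise hypothetical.

```python
# Assumed edges of the data graph (not taken from the original figure).
EDGES = [(0, 1), (0, 3), (1, 2), (1, 4), (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]

# The extents shown on the slide, plus {0} for the root node R.
PARTITION = [{0}, {1}, {2, 4}, {3}, {5, 7}, {6}, {8, 9}]


def build_index(edges, partition):
    """Collapse each block to one index node; keep an index edge between two
    blocks whenever some data edge connects their members."""
    block_of = {v: i for i, block in enumerate(partition) for v in block}
    index_edges = {(block_of[u], block_of[v]) for u, v in edges}
    return block_of, index_edges


block_of, index_edges = build_index(EDGES, PARTITION)
ext = {i: sorted(block) for i, block in enumerate(PARTITION)}

print(ext[block_of[5]])   # [5, 7]
print(len(index_edges))   # 6 index edges summarize the 9 data edges
```

A query is then run on the smaller graph, and the answer is the union of the extents of the index nodes it reaches.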


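Safe Index

[Figure: data graph G (nodes 0-9) and an index graph I(G) (nodes 11-15) with extents {1}, {2,4}, {3,5,7}, {6,8,9}.]

On the data graph G: R.A.-*.C = {5,7}, R.-.-*.B = {4}
On the index graph I(G): R.A.-*.C = {3,5,7}, R.-.-*.B = {2,4}

The index is safe: every answer on G is also returned by I(G), possibly together with false positives.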


Unsafe Index

[Figure: data graph G (nodes 0-9) and an index graph I(G) (nodes 11-15) with extents {1}, {2,4}, {3,5,7}, {6,8,9}.]

On the data graph G: R.A.-*.C = {5,7}, R.-.-*.B = {2}
On the index graph I(G): R.A.-*.C = {3,5,7}, R.-.-*.B = { }

The index is unsafe: I(G) misses node 2, which is an answer on G.

Bisimilarity

[Figure: data graph G, nodes 0-9 with labels R, A, B, C, D.]

R.A.-*.C = {5,7}, R.-.B = {4,2}

Two nodes u and v are called bisimilar (u ≈b v) if
1. label(u) = label(v)
2. every incoming label path from ROOT to u matches at least one incoming path from ROOT to v, and vice versa.

• 2 and 4 are bisimilar.
• 5 and 7 are bisimilar.
• 8 and 9 are bisimilar.
• 6 and 8 are not bisimilar.

≈b is an equivalence relation, so it partitions the nodes of G into equivalence classes.

Finding the partition takes O(m log n) time (m edges, n nodes).

Equivalence Classes of ≈b → The 1-index

[Figure: data graph G (nodes 0-9) and its 1-index I(G) (nodes 11-18) with extents {1}, {2,4}, {3}, {5,7}, {6}, {8,9}.]

R.A.-*.C = {5,7} and R.-.B = {4,2} on both G and I(G).

Revisiting Bisimilarity

The size of the 1-index is bounded above by the size (number of nodes) of the data graph.

For large real documents it is almost 45% of the size of the data graph.

Bisimilarity partitions nodes by considering all incoming paths from ROOT, which is a global comparison between nodes.

k-bisimilarity

[Figure: data graph G, nodes 0-9 with labels R, A, B, C, D.]

Two nodes u and v are called k-bisimilar (u ≈k v) if
1. label(u) = label(v)
2. every incoming label path of length ≤ k to u matches at least one incoming path of length ≤ k to v, and vice versa.

• 2 and 4 are 0-bisimilar.
• 5 and 7 are 1-bisimilar.
• 8 and 9 are 2-bisimilar.
• 6 and 8 are 1-bisimilar.

≈k is an equivalence relation, so it partitions the nodes of G into equivalence classes.

The algorithm for computing k-bisimulation is shown later.
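The path-based wording of the definition can be checked directly on a small graph. The sketch below follows the slide's wording (equal sets of incoming label paths with at most k edges) rather than any particular published algorithm. The edge list is an assumption chosen to reproduce the four examples on the slide; the original figure's edges did not survive. The function names are hypothetical.

```python
# Node labels as read from the slide; edges are assumed, not from the figure.
LABEL = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B", 5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
EDGES = [(0, 1), (0, 3), (1, 2), (1, 4), (3, 4), (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]

PARENTS = {v: [] for v in LABEL}
for u, v in EDGES:
    PARENTS[v].append(u)


def incoming_label_paths(v, k):
    """All label paths with at most k edges that end at node v."""
    paths = {(LABEL[v],)}
    if k > 0:
        for p in PARENTS[v]:
            paths |= {prefix + (LABEL[v],) for prefix in incoming_label_paths(p, k - 1)}
    return paths


def k_bisimilar(u, v, k):
    """Same label (the 0-edge path) and the same incoming label paths up to k edges."""
    return incoming_label_paths(u, k) == incoming_label_paths(v, k)


print(k_bisimilar(2, 4, 0), k_bisimilar(2, 4, 1))  # True False
print(k_bisimilar(8, 9, 2), k_bisimilar(8, 9, 3))  # True False
```

Under these assumed edges, node 4 has an extra parent labeled C, which is what separates 2 from 4 at k = 1, then 5 from 7 at k = 2, and 8 from 9 at k = 3.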

Equivalence Classes of ≈0 → A(0) index

[Figure: data graph G (nodes 0-9) and index graph A(0) (nodes 11-15, labels R, A, B, C, D) with extents {1}, {2,4}, {3,5,7}, {6,8,9}.]

A(0) is the label grouping (label partition).

Equivalence Classes of ≈1 → A(1) index

[Figure: data graph G (nodes 0-9) and index graph A(1) (nodes 11-17) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}.]

A(k) index family

[Figure: data graph G (nodes 0-9) and its index graphs A(0) to A(3).]

Index            Extents
A(0)             {1} {2,4} {3,5,7} {6,8,9}
A(1)             {1} {2} {4} {3} {5,7} {6,8,9}
A(2)             {1} {2} {4} {3} {5} {7} {6} {8,9}
A(3) = 1-index   {1} {2} {4} {3} {5} {7} {6} {8} {9}

Properties of A(k) index

[Figure: data graph G (nodes 0-9) and its A(1) index (nodes 11-17) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}.]

How to compute A(1) index

[Figure: data graph G, nodes 0-9 with labels R, A, B, C, D.]

Lookup: {1} {2,4} {3,5,7} {6,8,9} (label partition)

Refining, step by step:
  {1} {2,4} {3,5,7} {6,8,9}
  {1} {2} {4} {3,5,7} {6,8,9}
  {1} {2} {4} {3} {5,7} {6,8,9} (1-bisimilar partition)

How to compute A(2) index

[Figure: data graph G, nodes 0-9 with labels R, A, B, C, D.]

Lookup: {1} {2} {4} {3} {5,7} {6,8,9} (1-bisimilar partition)

Refining, step by step:
  {1} {2} {4} {3} {5,7} {6,8,9}
  {1} {2} {4} {3} {5} {7} {6,8,9}
  {1} {2} {4} {3} {5} {7} {6} {8,9} (2-bisimilar partition)
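The lookup/refining scheme can be sketched as k rounds of partition refinement. In each round the previous partition plays the role of the lookup copy, and every node is regrouped by its own block together with the set of blocks its parents belong to. This is a minimal, unoptimized sketch; the edge list is an assumption chosen so that the output reproduces the partitions shown on the slides, and `a_k_partition` is a hypothetical name.

```python
# Node labels as read from the slides; edges are assumed, not from the figure.
LABEL = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B", 5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
EDGES = [(0, 1), (0, 3), (1, 2), (1, 4), (3, 4), (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]


def a_k_partition(labels, edges, k):
    """Return the k-bisimilar partition as a sorted list of sorted blocks."""
    parents = {v: set() for v in labels}
    for u, v in edges:
        parents[v].add(u)

    block = dict(labels)  # A(0): a node's block is simply its label
    for _ in range(k):
        # Signature: own block plus the parents' blocks in the lookup copy.
        sig = {v: (block[v], frozenset(block[p] for p in parents[v])) for v in labels}
        ids = {}
        block = {v: ids.setdefault(sig[v], len(ids)) for v in labels}

    groups = {}
    for v in sorted(labels):
        groups.setdefault(block[v], []).append(v)
    return sorted(groups.values())


for k in range(4):
    print(k, a_k_partition(LABEL, EDGES, k))
```

With the assumed edges, k = 1 yields {5,7} and {6,8,9} as the only non-singleton blocks, k = 2 leaves only {8,9}, and k = 3 separates every node. The root node 0, which the slides omit from the partitions, appears here as its own block.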

Query Evaluation : Fwd or Bckwd

[Figure: the query R.A.-*.C (steps R, A, -, C) evaluated on the A(1) index graph (nodes 11-17) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}.]

R.A.-*.C = {5,7}

• Repeated states are prevented: O(|A|*m)
• Backward evaluation uses the label grouping
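Forward evaluation can be sketched as a traversal of (index node, automaton state) pairs, where a visited set prevents repeated states. The automaton for R.A.-*.C is written out by hand, the index node names (B1, C2, ...) and the index edges are assumptions, and only the extents come from the slide.

```python
from collections import deque

ANY = None  # wildcard transition for "-"

# Hand-built automaton for R.A.-*.C: state -> [(label, next state)].
NFA = {0: [("R", 1)], 1: [("A", 2)], 2: [(ANY, 2), ("C", 3)], 3: []}
ACCEPT = {3}

# A(1) index: hypothetical node names, extents from the slide, assumed edges.
LABEL_OF = {"R": "R", "A": "A", "B1": "B", "B2": "B", "C1": "C", "C2": "C", "D": "D"}
EXTENT = {"R": {0}, "A": {1}, "B1": {2}, "B2": {4}, "C1": {3}, "C2": {5, 7}, "D": {6, 8, 9}}
CHILDREN = {"R": ["A", "C1"], "A": ["B1", "B2"], "C1": ["B2", "D"],
            "B1": ["C2"], "B2": ["C2"], "C2": ["D"], "D": []}


def fits(lab, node):
    return lab is ANY or lab == LABEL_OF[node]


def evaluate(root="R"):
    """Union of extents of index nodes reached in an accepting state."""
    start = [(root, nxt) for lab, nxt in NFA[0] if fits(lab, root)]
    seen, queue, result = set(start), deque(start), set()
    while queue:
        node, state = queue.popleft()
        if state in ACCEPT:
            result |= EXTENT[node]
        for child in CHILDREN[node]:
            for lab, nxt in NFA[state]:
                if fits(lab, child) and (child, nxt) not in seen:
                    seen.add((child, nxt))  # each pair is expanded once
                    queue.append((child, nxt))
    return result


print(sorted(evaluate()))  # [5, 7]
```

Because each (node, state) pair is expanded at most once, the work is bounded by the number of automaton states times the size of the index graph.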

Query Evaluation : Validation

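[Figure: the query R.A.B.C.D (steps R, A, B, C, D) evaluated on the A(1) index graph (nodes 11-17) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}.]

R.A.B.C.D = {6,8,9} on A(1)

The path is longer than k = 1, so the extent returned by A(1) may contain false positives and must be validated against the data graph.

Repeated states are prevented: O(|A|*m)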
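Validation can be sketched as a second pass over the candidates returned by the index. The data graph edges below are an assumption (the figure's edges did not survive), so the false positive it exposes is a property of this assumed graph. The names `index_answer` and `validate` are hypothetical.

```python
# Node labels from the slides; edges are assumed, not from the figure.
LABEL = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B", 5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
EDGES = [(0, 1), (0, 3), (1, 2), (1, 4), (3, 4), (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]
A1 = [{0}, {1}, {2}, {4}, {3}, {5, 7}, {6, 8, 9}]  # extents of the A(1) index
ROOT = 0


def index_answer(path):
    """Evaluate a plain label path on the A(1) index; return the union of extents."""
    block_of = {v: i for i, block in enumerate(A1) for v in block}
    index_edges = {(block_of[u], block_of[v]) for u, v in EDGES}
    label_of = {i: LABEL[min(block)] for i, block in enumerate(A1)}
    frontier = {block_of[ROOT]} if label_of[block_of[ROOT]] == path[0] else set()
    for lab in path[1:]:
        frontier = {b for a, b in index_edges if a in frontier and label_of[b] == lab}
    return set().union(*(A1[i] for i in frontier))


def validate(node, path):
    """Check on the data graph that some root-to-node path spells `path`."""
    if LABEL[node] != path[-1]:
        return False
    if len(path) == 1:
        return node == ROOT
    return any(validate(p, path[:-1]) for p, c in EDGES if c == node)


query = ["R", "A", "B", "C", "D"]
candidates = index_answer(query)
print(sorted(candidates))                                  # [6, 8, 9]
print(sorted(v for v in candidates if validate(v, query)))  # [8, 9]
```

Under the assumed edges, node 6 is reached only through R.C.D, so the index reports it for R.A.B.C.D and validation removes it.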

Avoiding Validation

[Figure: the A(1) index graph (nodes 11-17) with extents {1}, {2}, {4}, {3}, {5,7}, {6,8,9}.]

R.-*.C.D = {6,8,9} on A(1)

For queries of the form R.-*.p, validation can be safely skipped on A(k) if p is a k-path.
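The claim can be checked on a small example by evaluating R.-*.p backward, once on the data graph and once on A(1), and comparing the answers. The edges are assumed (not recovered from the figure), every node in the assumed graph is reachable from the root so the R.-*. prefix adds no constraint, and all function names are hypothetical.

```python
from itertools import product

# Node labels from the slides; edges are assumed, not from the figure.
LABEL = {0: "R", 1: "A", 2: "B", 3: "C", 4: "B", 5: "C", 6: "D", 7: "C", 8: "D", 9: "D"}
EDGES = [(0, 1), (0, 3), (1, 2), (1, 4), (3, 4), (2, 5), (4, 7), (3, 6), (5, 8), (7, 9)]
A1 = [{0}, {1}, {2}, {4}, {3}, {5, 7}, {6, 8, 9}]  # extents of the A(1) index


def parents_of(edges):
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    return parents


def ends_with(parents, label, v, p):
    """True if some incoming label path of node v spells the sequence p."""
    if label[v] != p[-1]:
        return False
    if len(p) == 1:
        return True
    return any(ends_with(parents, label, u, p[:-1]) for u in parents.get(v, ()))


def data_answer(p):
    par = parents_of(EDGES)
    return {v for v in LABEL if ends_with(par, LABEL, v, p)}


def index_answer(p):
    block_of = {v: i for i, block in enumerate(A1) for v in block}
    ilabel = {i: LABEL[min(block)] for i, block in enumerate(A1)}
    ipar = parents_of({(block_of[u], block_of[v]) for u, v in EDGES})
    hits = [i for i in ilabel if ends_with(ipar, ilabel, i, p)]
    return set().union(*(A1[i] for i in hits))


# Every 1-path (one edge, two labels) is answered exactly by A(1).
print(all(index_answer(p) == data_answer(p) for p in product("RABCD", repeat=2)))  # True
# A 2-path can still pick up a false positive on A(1).
print(sorted(index_answer(("B", "C", "D"))), sorted(data_answer(("B", "C", "D"))))
```

The second line prints [6, 8, 9] for the index and [8, 9] for the data graph, so a path longer than k still needs validation.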

Results

[Figures: experimental results, two slides.]

Conclusion

• A(k) indexes are smaller than precise indexes and have their own advantages, such as faster execution time with good accuracy.

• Future presentations
  – Changing the indexes under updates
  – Incorporating more complex queries
