Estimating the Selectivity of XML Path Expressions for Internet Scale Applications
Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton
Computer Sciences Department, University of Wisconsin - Madison

Page 1: Estimating the Selectivity of XML Path Expressions for Internet Scale Applications

Ashraf Aboulnaga, Alaa R. Alameldeen, Jeffrey F. Naughton

Computer Sciences Department, University of Wisconsin - Madison

Page 2: Motivation

- XML enables Internet scale applications that query data from many sources
  - Niagara, Xyleme, …
- Queries over XML data use path expressions
- Optimizing these queries requires estimating the selectivity of the path expressions

Focus of this talk: Building statistics for XML data and using them for estimating the selectivity of simple path expressions

Page 3: What is XML?

<readings>
  <play>
    <title>Pygmalion</title>
    <author>Bernard Shaw</author>
  </play>
  <novel>
    <title>David Copperfield</title>
    <author>Charles Dickens</author>
  </novel>
</readings>

Page 4: Querying XML

FOR $n_auth IN document("*")//novel/author
    $p_auth IN document("*")//play/author
WHERE $n_auth/text() = $p_auth/text()
RETURN $n_auth

- Optimizing this query requires estimating the selectivity of the path expressions
- This requires information about the structure of the XML data

Page 5: Goal of this Work

- Build database statistics that capture the structure of XML data
- Ensure that the statistics fit in a small amount of memory
  - For efficient query optimization
  - Important for Internet scale applications
- Use the statistics to estimate the selectivity of simple XML path expressions: //t1/t2/…/tn

Page 6: Outline of Presentation

- Introduction
- Path Trees
- Markov Tables
- Performance Evaluation
- Conclusions

Page 7: Path Trees

<A>
  <B> </B>
  <B>
    <D> </D>
  </B>
  <C>
    <D> </D>
    <E> </E>
    <E> </E>
    <E> </E>
  </C>
</A>

Path tree (tag frequency):

A 1
  B 2
    D 1
  C 1
    D 1
    E 3
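The construction on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the names `build_path_tree` and `estimate` are assumptions, and the path tree is stored as a flat map from root-to-node tag paths to frequencies rather than as linked nodes. A query //t1/…/tn is estimated here by summing the frequencies of all nodes whose root path ends with t1/…/tn.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_path_tree(xml_text):
    """Map each root-to-node tag path to the number of elements reached by it."""
    counts = Counter()

    def visit(elem, prefix):
        path = prefix + (elem.tag,)
        counts[path] += 1
        for child in elem:
            visit(child, path)

    visit(ET.fromstring(xml_text), ())
    return counts


def estimate(path_tree, query_tags):
    """Estimate //t1/.../tn: total frequency of nodes whose path ends with the tags."""
    q = tuple(query_tags)
    return sum(f for path, f in path_tree.items() if path[-len(q):] == q)


doc = "<A><B></B><B><D></D></B><C><D></D><E></E><E></E><E></E></C></A>"
tree = build_path_tree(doc)
print(tree[("A", "B")])            # 2, the "B 2" node of the slide
print(estimate(tree, ["D"]))       # 2, one D under B plus one D under C
print(estimate(tree, ["C", "E"]))  # 3
```

With the complete path tree the estimates are exact; the summarization methods on the following slides trade that exactness for memory.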

Page 8: Summarizing Path Trees

- Path trees contain all the information needed for selectivity estimation
- Problem: May not fit in available memory
  - Small available memory
  - Internet scale
- Remove low frequency nodes
- Removed nodes replaced with *-nodes
  - Tag name: * meaning "any tag"
  - Frequency: Average frequency of replaced nodes
- Sibling-*, Level-*, Global-*, No-*

Page 9: Sibling-* Summarization

[Figure: path tree. Nodes (tag frequency), grouped as on the slide: A 1 | B 13, C 9 | F 15, G 10, H 6 | D 7, E 5 | K 12 | I 2, J 4, K 11]

Page 13: Sibling-* Summarization

[Figure: path tree after the first deletion. I 2 and J 4 are replaced by a *-node labeled f=6, n=2 (total frequency and number of deleted nodes). Nodes: A 1 | B 13, C 9 | F 15, G 10, H 6 | D 7, E 5 | K 12 | K 11, * (f=6, n=2)]

- *-nodes represent deleted sibling nodes
- Memory saved by coalescing nodes

Page 17: Sibling-* Summarization

[Figure: path tree after the second deletion. D 7 and E 5 no longer appear; a *-node labeled f=12, n=2 is added. Nodes: A 1 | B 13, C 9 | F 15, G 10, H 6 | * (f=12, n=2) | K 12 | K 11, * (f=6, n=2)]

Page 20: Sibling-* Summarization

[Figure: path tree after the third deletion. G 10 and H 6 no longer appear; a *-node labeled f=16, n=2 is added. Nodes: A 1 | B 13, C 9 | F 15, * (f=16, n=2) | * (f=12, n=2) | K 12 | K 11, * (f=6, n=2)]

Page 21: Sibling-* Summarization

[Figure: path tree after coalescing. K 11 and K 12 are combined into a single K node labeled f=23, n=2. Nodes: A 1 | B 13, C 9 | * (f=12, n=2), F 15, * (f=16, n=2) | K (f=23, n=2), * (f=6, n=2)]

Page 22: Sibling-* Summarization

[Figure: final sibling-* summary, with each *-node labeled by its average frequency. Nodes: A 1 | B 13, C 9 | * 6, F 15, * 8 | K (f=23, n=2), * 3]

Page 23: Original Path Tree

[Figure: path tree. Nodes (tag frequency), grouped as on the slide: A 1 | B 13, C 9 | F 15, G 10, H 6 | D 7, E 5 | K 12 | I 2, J 4, K 11]

Page 24: Sibling-* Summarization

[Figure: final sibling-* summary. Nodes: A 1 | B 13, C 9 | * 6, F 15, * 8 | K (f=23, n=2), * 3]

Try to retain as much information as possible about the deleted nodes
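The numbers on the sibling-* slides can be checked with a little arithmetic. The sketch below is only that: the helper names `star_label` and `coalesce` are hypothetical, and the code reproduces the labels shown in the walkthrough rather than the full summarization algorithm (which also has to decide which nodes to delete and must rewire the tree).

```python
def star_label(deleted):
    """Summarize deleted sibling nodes as (total frequency f, count n, average f/n)."""
    f = sum(deleted.values())
    n = len(deleted)
    return f, n, f / n


def coalesce(frequencies):
    """Combine same-tag nodes that end up under one *-node: (total f, count n)."""
    return sum(frequencies), len(frequencies)


print(star_label({"I": 2, "J": 4}))    # (6, 2, 3.0)  -> "* f=6 n=2", final label 3
print(star_label({"D": 7, "E": 5}))    # (12, 2, 6.0) -> "* f=12 n=2", final label 6
print(star_label({"G": 10, "H": 6}))   # (16, 2, 8.0) -> "* f=16 n=2", final label 8
print(coalesce([11, 12]))              # (23, 2)      -> "K f=23 n=2"
```

Keeping both f and n while the summary is being built is what lets later merges update the average exactly.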

Page 25: Level-* Summarization

[Figure: path tree. Nodes (tag frequency), grouped as on the slide: A 1 | B 13, C 9 | F 15, G 10, H 6 | D 7, E 5 | K 12 | I 2, J 4, K 11]

Page 27: Level-* Summarization

[Figure: level-* summary. D 7, E 5, H 6, I 2, and J 4 no longer appear. Nodes: A 1 | B 13, C 9 | F 15, G 10 | K 11, K 12 | *-nodes: * 6, * 3]

- Less information about deleted nodes than sibling-*
- Deletes fewer nodes than sibling-*
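The two *-node labels on this slide are consistent with one *-node per tree level, each holding the average frequency of the nodes deleted on that level. The sketch below checks this under an assumption that the transcript does not state explicitly: that D, E, and H sit on level 3 and I and J on level 4. The function name `level_star_nodes` is hypothetical.

```python
def level_star_nodes(deleted_by_level):
    """Return one average frequency per level for the nodes deleted on that level."""
    return {
        level: sum(freqs.values()) / len(freqs)
        for level, freqs in deleted_by_level.items()
    }


# Assumed level assignment of the deleted nodes (tag -> frequency).
deleted = {3: {"D": 7, "E": 5, "H": 6}, 4: {"I": 2, "J": 4}}
print(level_star_nodes(deleted))   # {3: 6.0, 4: 3.0}, matching "* 6" and "* 3"
```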

Page 28: Global-* Summarization

[Figure: path tree. Nodes (tag frequency), grouped as on the slide: A 1 | B 13, C 9 | F 15, G 10, H 6 | D 7, E 5 | K 12 | I 2, J 4, K 11]

Page 30: Global-* Summarization

[Figure: global-* summary. A 1, E 5, I 2, and J 4 no longer appear; a single *-node with frequency 3 is added. Nodes: B 13, C 9 | F 15, G 10, H 6 | D 7 | K 12 | K 11 | * 3]

Page 31: No-* Summarization

[Figure: path tree. Nodes (tag frequency), grouped as on the slide: A 1 | B 13, C 9 | F 15, G 10, H 6 | D 7, E 5 | K 12 | I 2, J 4, K 11]

Page 33: No-* Summarization

[Figure: no-* summary. A 1, I 2, and J 4 no longer appear and no *-node is added. Nodes: B 13, C 9 | F 15, G 10, H 6 | D 7, E 5 | K 12 | K 11]

- Memory savings similar to global-*
- Conservative assumption about deleted nodes
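The global-* and no-* figures can both be reproduced by deleting the lowest-frequency nodes until a fixed number of entries remains. The sketch below assumes a budget of nine entries, chosen only because it reproduces the two slides; the real methods are driven by available memory in bytes and keep the tree structure, which this flat list ignores. The function name `summarize` is hypothetical.

```python
def summarize(nodes, budget, use_star):
    """Keep the highest-frequency nodes that fit in `budget` entries.

    With use_star=True one entry is reserved for a single *-node holding the
    average frequency of the deleted nodes (global-*); with use_star=False
    the deleted nodes are simply dropped (no-*).
    """
    if len(nodes) <= budget:
        return list(nodes), None
    ordered = sorted(nodes, key=lambda tf: tf[1], reverse=True)
    n_keep = budget - 1 if use_star else budget
    kept, deleted = ordered[:n_keep], ordered[n_keep:]
    star = sum(f for _, f in deleted) / len(deleted) if use_star else None
    return kept, star


# Nodes of the example path tree as (tag, frequency); the two K nodes are distinct.
nodes = [("A", 1), ("B", 13), ("C", 9), ("D", 7), ("E", 5), ("F", 15),
         ("G", 10), ("H", 6), ("I", 2), ("J", 4), ("K", 11), ("K", 12)]

kept, star = summarize(nodes, budget=9, use_star=True)
print(sorted(kept), star)    # A, E, I, J are deleted; star == 3.0

kept, star = summarize(nodes, budget=9, use_star=False)
print(sorted(kept), star)    # only A, I, J are deleted; star is None
```

Under this budget global-* gives up one more real node (E 5) than no-* in exchange for the *-node, which illustrates why the slide describes the memory savings of the two methods as similar.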

Page 34: Outline

- Introduction
- Path Trees
- Markov Tables
- Performance Evaluation
- Conclusions

Page 35: Markov Tables

- A table of all distinct paths of length up to m and their frequencies
- For paths of length greater than m, combine paths from the Markov table
- Example: f(A/B/C/D) = f(A/B/C) × f(B/C/D) / f(B/C)
- Uses "short memory" or "Markov" property
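The example on this slide generalizes to any table order m. The formula below is a reconstruction under the Markov assumption, not a quotation from the slides; the symbols t_i (the tags of the query path) and n (its length) are notation introduced here.

```latex
f(t_1/t_2/\dots/t_n) \;=\; f(t_1/\dots/t_m)\,
  \prod_{i=1}^{n-m} \frac{f(t_{1+i}/\dots/t_{m+i})}{f(t_{1+i}/\dots/t_{m+i-1})}
```

With m = 3 and n = 4 the product has a single factor, f(t_2/t_3/t_4) / f(t_2/t_3), which gives the slide's example for A/B/C/D.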

Page 36: Markov Tables

Path  Freq    Path  Freq
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8

Path tree (tag frequency):

A 1
  B 11
    C 9
      D 8
    D 7
  C 6
  D 4
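The table on this slide can be derived from the path tree, and longer paths can then be estimated from it. The following is a minimal sketch, not the authors' code: the names `markov_table` and `markov_estimate` are hypothetical, the path tree is given as a map from root-to-node tag paths to frequencies, and the estimator assumes m ≥ 2.

```python
from collections import Counter


def markov_table(path_tree, m):
    """Frequencies of all tag sequences of length 1..m occurring in the data."""
    table = Counter()
    for path, freq in path_tree.items():
        for k in range(1, min(m, len(path)) + 1):
            table[path[-k:]] += freq   # the sequence of length k ending at this node
    return table


def markov_estimate(table, path, m):
    """Estimate f(t1/.../tn); exact for n <= m, a Markov approximation otherwise."""
    path = tuple(path)
    if len(path) <= m:
        return table.get(path, 0)
    est = table.get(path[:m], 0)
    for i in range(1, len(path) - m + 1):
        window = path[i:i + m]
        denom = table.get(window[:-1], 0)
        if denom == 0:
            return 0.0
        est *= table.get(window, 0) / denom
    return est


# Path tree of the slide: root path -> frequency.
path_tree = {
    ("A",): 1, ("A", "B"): 11, ("A", "C"): 6, ("A", "D"): 4,
    ("A", "B", "C"): 9, ("A", "B", "D"): 7, ("A", "B", "C", "D"): 8,
}
table = markov_table(path_tree, m=2)
print(table[("C",)], table[("D",)], table[("C", "D")])   # 15 19 8
print(markov_estimate(table, "ABC", m=2))    # about 9   (= 11 * 9/11)
print(markov_estimate(table, "ABCD", m=2))   # about 4.8 (= 11 * 9/11 * 8/15)
```

In this example the estimate for A/B/C matches the path tree (9), while the estimate for A/B/C/D (about 4.8) differs from the true frequency of 8, because with m = 2 the table cannot tell that every D below a C occurs on the branch under B.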

Page 37: Summarizing Markov Tables

- Exact selectivities for paths of length up to m
- Approximate selectivities for paths longer than m
- Problem: May not fit in available memory
- Remove low frequency paths
  - Discard removed paths of length > 2
  - Replace removed paths of length 1 or 2 with *-paths
- Suffix-*, Global-*, No-*

Page 38: Suffix-* Summarization

Path  Freq    Path  Freq
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8

Page 39: Suffix-* Summarization

Path  Freq    Path  Freq
A     1       AC    6
B     11      AD    4
C     15      BC    9
D     19      BD    7
AB    11      CD    8
*     0       **    0

Page 41: Suffix-* Summarization

Path  Freq        Path  Freq
                  AC    6
B     11          AD    4
C     15          BC    9
D     19          BD    7
AB    11          CD    8
*     f=1, n=1    **    0

Page 42: Suffix-* Summarization

Path  Freq        Path  Freq
                  AC    6
B     11          AD    4
C     15          BC    9
D     19          BD    7
AB    11          CD    8
*     f=1, n=1    **    0

SD = { }

SD: Set of deleted paths of length 2

Page 43: Suffix-* Summarization

Path  Freq        Path  Freq
                  AC    6
B     11
C     15          BC    9
D     19          BD    7
AB    11          CD    8
*     f=1, n=1    **    0

SD = { (AD,4) }

Page 46: Suffix-* Summarization

Path  Freq        Path  Freq
                  A*    f=10, n=2
B     11
C     15          BC    9
D     19          BD    7
AB    11          CD    8
*     f=1, n=1    **    0

SD = { }

Page 48: Suffix-* Summarization

Path  Freq        Path  Freq
                  A*    f=10, n=2
B     11
C     15          BC    9
D     19
AB    11          CD    8
*     f=1, n=1    **    0

SD = { (BD,7) }

Page 50: Suffix-* Summarization

Path  Freq        Path  Freq
                  A*    f=10, n=2
B     11
C     15          BC    9
D     19
AB    11
*     f=1, n=1    **    0

SD = { (BD,7), (CD,8) }

Page 53: Suffix-* Summarization

Path  Freq        Path  Freq
                  A*    f=10, n=2
B     11
C     15          B*    f=16, n=2
D     19
AB    11
*     f=1, n=1    **    0

SD = { (CD,8) }

Page 55: Suffix-* Summarization

Path  Freq        Path  Freq
B     11
C     15          B*    f=16, n=2
D     19
AB    11
*     f=1, n=1    **    f=10, n=2

SD = { (CD,8) }

Page 56: Suffix-* Summarization

Path  Freq    Path  Freq
B     11
C     15      B*    8
D     19
AB    11
*     1       **    6

SD = { }
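The walkthrough above can be replayed in code to check the final table. This is a sketch under stated assumptions, not the authors' algorithm: the function name `suffix_star_replay` is hypothetical, tags are assumed to be single characters, and the deletion order is copied from the slides because the transcript does not say how the next path to delete is chosen beyond "remove low frequency paths". It also does not handle a third deleted path with a prefix that already has an X* entry, a case the slides do not show.

```python
def suffix_star_replay(table, deletions):
    """Replay a fixed deletion order and return the summarized Markov table."""
    table = dict(table)
    star = [0, 0]         # [f, n] for the length-1 path *
    star_star = [0, 0]    # [f, n] for the length-2 path **
    sd = {}               # SD: deleted length-2 paths waiting for a partner
    prefix_stars = {}     # "X*" -> [f, n]
    for path in deletions:
        if path in prefix_stars:            # deleting X* folds it into **
            f, n = prefix_stars.pop(path)
            star_star[0] += f
            star_star[1] += n
        elif len(path) == 1:                # a length-1 path goes into *
            star[0] += table.pop(path)
            star[1] += 1
        else:
            f = table.pop(path)
            partner = next((p for p in sd if p[0] == path[0]), None)
            if partner is None:
                sd[path] = f                # wait in SD for a path with the same first tag
            else:                           # combine the pair into X*
                prefix_stars[path[0] + "*"] = [f + sd.pop(partner), 2]
    for f in sd.values():                   # paths left in SD end up in **
        star_star[0] += f
        star_star[1] += 1
    summary = dict(table)
    for name, (f, n) in prefix_stars.items():
        summary[name] = f / n               # *-paths store average frequencies
    summary["*"] = star[0] / star[1] if star[1] else 0
    summary["**"] = star_star[0] / star_star[1] if star_star[1] else 0
    return summary


table = {"A": 1, "B": 11, "C": 15, "D": 19, "AB": 11,
         "AC": 6, "AD": 4, "BC": 9, "BD": 7, "CD": 8}
order = ["A", "AD", "AC", "BD", "CD", "BC", "A*"]   # order shown on the slides
print(suffix_star_replay(table, order))
```

The result keeps B 11, C 15, D 19, and AB 11 and yields B* = 16/2 = 8, * = 1, and ** = (10 + 8)/3 = 6, in agreement with the final slide of the walkthrough.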

Page 57: Global-*, No-* Summarization

- Global-*
  - Two *-paths, * and **
  - Deletes fewer paths than suffix-* to summarize the Markov table
- No-*
  - No *-paths
  - Conservatively assumes that paths not in the Markov table do not exist in the data

Page 58: Outline

- Introduction
- Path Trees
- Markov Tables
- Performance Evaluation
- Conclusions

Page 59: Data Sets for Experiments

- Synthetic data set
  - 100,000 XML elements
  - Path tree: 3197 nodes, 6 levels, 38 KB
  - Element frequencies: Zipfian (z=1)
- DBLP data set
  - 1,399,765 XML elements
  - Path tree: 5883 nodes, 6 levels, 69 KB

Page 60: Query Workloads

- 1,000 paths of length between 1 and 4
- Random paths
  - All query paths exist in the data
- Random tags
  - Most query paths of length 2 or more do not exist in the data
- Available memory between 5 and 50 KB

Page 61: Best Summarization Methods

- Path trees
  - Query paths in data: Global-*
  - Query paths not in data: No-*
- Markov tables
  - m = 2 is best
  - Query paths in data: Suffix-*
  - Query paths not in data: No-*

Page 62: Path Trees vs. Markov Tables

- When to use path trees and when to use Markov tables?
- Also compared against Pruned Suffix Trees (PSTs) [Chen et al, ICDE 2001]
  - Can handle branching path expressions
  - Can handle conditions on element values

Page 63: Synthetic Data – Random Paths

[Figure: absolute error versus available memory (KB) for Tree Global-*, Markov Suffix-*, and PST]

Page 64: Synthetic Data – Random Tags

[Figure: absolute error versus available memory (KB) for Tree No-*, Markov No-*, and PST]

Page 65: DBLP Data – Random Paths

[Figure: absolute error versus available memory (KB) for Tree Global-*, Markov Suffix-*, and PST]

Page 66: DBLP Data – Random Tags

[Figure: absolute error versus available memory (KB) for Tree No-*, Markov No-*, and PST]

Page 67: When are Markov Tables Better?

- DBLP
  - Repeated sub-structures effectively captured by Markov tables

<sigmod>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</sigmod>

<vldb>
  <inproceedings>
    <author>…</author>
    …
  </inproceedings>
  …
</vldb>

Page 68: Conclusions

- Novel statistics for estimating the selectivity of XML path expressions
  - Scale to "all the XML data on the Internet"
  - More accurate than best previously known alternative
- Repeated sub-structures: Markov tables
- No repeated sub-structures: Path trees
- Query paths exist in the data: Global-*, Suffix-*
- Query paths do not exist in the data: No-*
- To appear in VLDB 2001