legodb 1 data binding workshop, avaya labs, june 2003 legodb: cost-based xml to relational...

18
LegoDB 1 ta Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint work with: Juliana Freire (Bell Labs, OGI) Jayant Haritsa (IISc, Bangalore) Maya Ramanath (IISc, Bangalore) Prasan Roy (Bell Labs, ITT Bombay)

Upload: bryce-norman

Post on 30-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

LegoDB:Cost-based XML to Relational

“Shredding”Jerome Simeon

Bell Labs – Lucent Technologies

joint work with:Juliana Freire (Bell Labs, OGI)

Jayant Haritsa (IISc, Bangalore)Maya Ramanath (IISc, Bangalore)

Prasan Roy (Bell Labs, ITT Bombay)

Page 2: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

XML Data Management

Application is based on XML– XML documents to represent information– XML Schema to model information– XPath / XQuery to access and manipulate information

XMLDocuments

XqueryEngine

XMLqueries

XMLresults

Find the title, year and box office proceeds for all 2001 movies

for $v in document(“imdbdata”)/imdb/showwhere $v/year=2001return($v/title, $v/year, $v/box_office)

RDBMS

Page 3: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Why Storing XML in a RDBMS?

For some applications it makes sense to use a relational database backend:

– Leverage many years of development of relational technology

– concurrency control/transaction support

– Scalability

– Safety (crash recovery, duplication)

– Integrate with existing data stored in an RDBMS

– Performance matters!!

But storing and querying XML data in an RDBMS is a non-trivial task

Page 4: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

XML and Relational Databases

Mismatch between the relational model and XML. How to store XML data into relational tables?

– Mapping (“Shredding”) XML data into flat and regular tables

How to evaluate XML queries over relational tables?– Mapping XQuery into SQL (or SQL-XML?)...

Litterature filled with various mapping proposals:– Many variations over binary tables (Florescu et al, Grust et

al).– Shanmugasundaram et. All try to inline as much nested

elements as possible in the same table.There are many alternative mappings!

Page 5: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Various mappings...TABLE Show

(show_id INT,

title STRING,

year INT,

box_office INT,

seasons INT)

TABLE Review

(review_id INT,

tilde STRING,

review STRING,

parent_Show INT)

(I) Inline as many elements as possible

TABLE Show

(show_id INT,

title STRING,

year INT,

box_office INT,

seasons INT)

TABLE NYTReview

(review_id INT,

review STRING,

parent_Show INT)

TABLE Review

(review_id INT,

tilde STRING,

review STRING,

parent_Show INT)

(II)Partition reviews table-one for NYT,one for the rest

TABLE Show1

(show1_id INT,

title STRING,

year INT,

box_office INT)

TABLE Show2

(show2_id INT,

title STRING,

year INT,

seasons INT)

TABLE Review

(review_id INT,

tilde STRING,

review STRING,

parent_Show INT)

(III)Split Show table into TV and Movies

Define group Show { element show { element title type string, element year type integer?, element review { element * type string* }, (element box type string | element seasons type

string)}

Page 6: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

...have various performances

0

0.2

0.4

0.6

0.8

1

1.2

1.4

Q1 Q2 Q3 Q4 W1 W2

(I)

(II)

(III)

No given mapping is best

Q1: Simple selection queryQ2: Join involving a fragment of the reviewsQ3: publishing query

Page 7: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

The LegoDB Storage "Shredding" Engine

An optimization approach:– automatically explores a space of possible mappings– selects the mapping which has the lowest cost for a given

application Important features:

– Application-driven: takes into account schema, data statistics and query workload

– Logical/physical independence: interface is XML-based (XML Schema, XQuery, XML data statistics)

– Leverage existing technology: XML standards; XML-specific operations for generating space of mappings; relational optimizer for evaluating configurations

Page 8: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Mapping an XML-Schema into Relations

Define group Show { element show { element title type string, element year type integer, group Reviews*, ... }define group Reviews { element review type string}

XML Schema

TABLE Show_Table( Show_id INT,

title STRING,

year INT )

TABLE Review_Table ( Review_id INT,

review STRING,

parent_Show INT )

Relational schema

Page 9: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

A word about mapping queries

XQuery

for $x in //show/reviewwhere contains($x/review, “Potter”)

SQLselect reviewfrom Show_Table, Review_Tablewhere Parent_show = Show_id // join here!and review contains “Potter”

Page 10: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Transforming Schemas

Key idea: A given document can be validated by different XML Schemas:– Different but equivalent regular expressions can be used

to define an element– The presence or absence of a type name does not change

the semantics of an XML Schema Applying transformations that manipulate the types

(but preserve the element structure of schema) leads to a space of distinct relational configurations

Define XML Schema transformations that – Exploit the structure of the schema, and– lead to useful relational configurations

Page 11: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Adding a group corresponds to splitting a tabledefine group Show {

element show {

element title type string,

element year type integer,

element review type string*,

...

Original schema

Define group Show { element show { element title type string, element year type integer, group Reviews*, ... }define group Reviews { element review type string}

Transformed schema

TABLE Show( Show_id INT,

title STRING,

year INT )

TABLE Review (Review_id INT,

review STRING,

parent_Show INT )

Transformed Relational schema

TABLE Show( Show_id INT,

title STRING,

year INT,

review STRING)

Relational schema

Page 12: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Regular expression rewritings

(group A|group B),group C ==>group D|group E horizontal table partitiondefine group D=(group A,group C)define group E=(group B,group C)

group A+ ==> put first item in a table, put the group A, group A* rest in a different table

element * type string extracts certain element ==> in separate tableelement a type string |element (^a) type string ...........

Page 13: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

LegoDB: Storage Design

Physical SchemaTransformation

Query/Schema/Stats

Translation

TraditionalRelational Query

Optimizer

XML data statistics

XML Schema

Relationalschema,

stats and workload

Goodconfiguration

Physical

schemaXQueryworkloa

d

cost estimate

Physical SchemaGeneration

Physical schema

Page 14: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Searching for a good configuration

Cost is key: use a relational optimizer as a black box– Support different cost-models– Quality of selected configuration depends on the accuracy of

the optimizer! Set of possible configurations that result from applying

the rewritings is very large - possibly infinite! How to search for the optimal solution?

– LegoDB use a greedy search Importance of statistics for cost evaluation

– Large collections vs. small collections (e.g., many new york time review or not?)

– Selectivity of predicates– XML structure distribution (distribution of children / parent

relation)

Page 15: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

LegoDB: Runtime Support

goodconfiguration and

mapping specification

XQueryworkloa

d

DB Loader

XML document

Query Translation

mapping specification

XML result

Commercial RDBMS

tuples SQL query/results

Page 16: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

Conclusion

Cost-based approach for generating relational storage for XML

– takes application characteristics into account» schema + data 'statistics' + queries

Performance shows that storage significant performance improvements

The same is likely to be true for other indexing techniques:– Full text index on XML (best for text queries?)– Native XML indexes (best for Xpath queries without

schema?)– Files!!! (Best when a lot of small documents an no need

for concurrency control)

There is no one best way to store XML...We should try to hide that from the user!

Page 17: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

A few other things I've done...

Mapping XQuery to SQL/XML– Work with Jonathan Robie– SQL/XML is an extension of SQL to XML

» Standard mapping from table to XML document» Standard mapping from relational schema to XML Schema» Extension of SQL query language to builds XML elements

– Goal is to evaluation XQueries on top of an ODBC driver supporting XQuery

– Approach: identify a fragment of XQuery which has a direct syntactic mapping into SQL / XML

– Surprise: the syntactic approach worked really well, because of the way XQuery is designed. (i.e., FLWOR close to SQL statement).

Galax : XQuery 1.0 implementation

Page 18: LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint

LegoDB 1Data Binding Workshop, Avaya Labs, June 2003

My take on the Data Binding problem

The main problem will be about the mismatch between type systems:

SQL vs. XML Schema vs. Object-Oriented

I've never seen a good proposal that brings any two type systems together (then builds a language one top of it)

Where I see the best chance of success:– Use XML has the data model (can represent pretty

much anything).– XML Schema can represent relational schema (see

SQL/XML standard)– Can it represent Object-Oriented?