management of xml documents without schema in relational database systems

26
10/14/2001 Management of XML Documents without Schema in Relational Database Systems Workshop Objects, <XML> and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics

Upload: richard-ayala

Post on 31-Dec-2015

9 views

Category:

Documents


0 download

DESCRIPTION

Management of XML Documents without Schema in Relational Database Systems. Workshop Objects, and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics. Overview. Introduction Motivation Main Issues - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Management of XML Documents without Schema in Relational Database Systems

10/14/2001

Management of XML Documents without Schema in Relational Database Systems

Workshop

Objects, <XML> and Databases

OOPSLA 2001, Tampa

Thomas Kudrass

Leipzig University of Applied Sciences

Department of Computer Science and Mathematics

Page 2: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

2

Overview Introduction

– Motivation– Main Issues

Structure-Oriented Approach– Storage / Data Model– Queries– Evaluation

Opaque Approach– Storage– Queries (vs. XPath) – Evaluation

Prototype Implementation – Interface– Experience

Outlook

Page 3: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

3

Motivation

XML is used in:– data publishing (document-centric documents)

– data exchange (data-centric documents)

Why XML Documents without Schema?– generated by programs

– mostly data-centric documents, e.g., account statements

– high update frequency of the document structure evolving schemas

Problems– How to deal with XML documents without DTD / XML Schema

in databases?

– Evaluate approaches

– Use relational database systems

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 4: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

4

Main Issues

Evaluate Storage and Retrieval Methods– no defined document schema

– platform: Oracle 8i

XML-to-Relational Mapping Approaches– structure-oriented decomposition

– opaque approach

Identify Parameters of a Unified XML-DB Interface

Implement a Testbed– qualitiative assessment

– performance of both approaches

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 5: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

5

Structure-Oriented Approach

Characteristics– decomposition of an XML document into smaller units

(elements)

– depends on the document structure only

– target system: relational DBMs, generic schema

Variety of Mapping Methods– model XML document als directed ordered labeled graphs and

map them to tables

– proposed algorithms:

• edge tables [Florescu, Kossmann]

• universal table

• inlining techniques [Shanmugasundaram et.al.]

• model-based fragmentation

• Monet XML-model

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 6: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

6

Storing XML Data in Relations

XML-QL Data Model <tree> <person age=’55‘> <name>Peter</name> </person> <person age=’38‘> <name>Mary</name> <address>Fruitdale Ave. </address> </person></tree>• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

tre e

p e rson p e rso n

nam

e

nam

e a d ress

3 4

Pe te rM a ry 4711

Fruitd a le Ave .

1

2

[a g e = 55] [a g e = 38]

Page 7: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

7

Data Model

tblDocs

DocId url

tblEdge

SourceId TargetId LeafId AttrId DocId EdgeName Type Depth

tblLeafs

LeafId Value

tblAttrs

AttrId Value

1

n

1

0/1 0/1

1• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

tblDocs

DocId url

tblEdge

SourceId TargetId LeafId AttrId DocId EdgeName Type Depth

tblLeafs

LeafId Value

tblAttrs

AttrId Value

1

n

1

0/1 0/1

1

Page 8: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

8

Import Algorithm<tree>

<person age=“36“>

<name>Peter</name>

<address>

<street>Main Road 4</street>

<zip>04236</zip>

<city>Leipzig</city>

</address>

</person>

</tree>

SourceId

Target Id

LeafId

AttrId

DocId

EdgeName Type Depth

0 1 NULL NULL 1 tree ref 3

1 2 NULL NULL 1 person ref 2

2 3 NULL 1 1 age attr 02 4 1 NULL 1 name leaf 02 5 NULL NULL 1 address ref 15 6 2 NULL 1 street leaf 05 7 3 NULL 1 zip leaf 0

5 8 4 NULL 1 city leaf 0

DocId url1 Sample.xml

LeafId Value1 Peter2 Main

Road 43 042364 Leipzig

AttrId Value1 36

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 9: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

9

Query Processing

Query Language– XML query language most appropriate

– data model of our solution is based on XML-QL XML-QL preferred choice

XML-Relational Mismatch– relational DBMS “understands“ SQL only

requires translation from XML-QL to SQL

– generate result document from the tuples retrieved

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 10: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

10

Query Processing

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

XML-QLQuery Parser

GenerateSQL Statement

ObjectStructure

SQLStatement

Row Set

ExecuteSQL Statement

ConstructResult Document

XMLDocument

DB

Page 11: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

11

Generate an SQL-Statement XML-QL QueryCONSTRUCT <result> {

WHERE

<person>

<name>$n</name>

<address>$a</address>

</person>

CONSTRUCT

<person>

<name>$n</name>

<address>$a</address>

</person>

} </result>

SQL StatementSELECT DISTINCT B.Type AS n_Type, B.TargetId AS n_TargetId, B.Depth AS n_Depth, C.Value AS n_Value, D.Type AS a_Type, D.TargetId AS a_TargetId, D.Depth AS a_Depth, E.Value AS a_ValueFROM tblEdge A,tblEdge B,tblLeafs C, tblEdge D,tblLeafs EWHERE (A.EdgeName = ‘person’) AND (A.TargetId = B.SourceId) AND (B.EdgeName = ‘name’) AND (B.LeafId = C.LeafId(+)) AND (A.TargetId = D.SourceId) AND (D.EdgeName = ‘address’) AND (D.LeafId = E.LeafId(+))

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 12: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

12

Construct Result Document Result Tuple

Subtree ReconstructionSELECT A.EdgeName, A.Type, Al.Value AS A_LeafVal, Aa.Value AS A_AttrVal FROM tblEdge A, tblLeafs Al, tblAttrs Aa WHERE A.SourceId=5 AND A.leafId=Al.leafId(+) AND A.attrId=Aa.attrId(+)

n_Type n_TargetId

n_Depth n_Value a_Type a_TargetId

a_Depth a_Value

leaf 4 0 Peter ref 5 1

EdgeName Type A_LeafVal A_AttrVal

street leaf Main Road 4

zip leaf 04236city leaf Leipzig

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 13: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

13

Query Result

XML Result Document<result>

<person>

<name>Peter</name>

<address>

<street>Main Road 4</street>

<zip>04236</zip>

<city>Leipzig</city>

</address>

</person>  

</result>

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 14: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

14

Advantages

Vendor Independence– no specific DBMS features needed

Stability High Flexibility of Queries

– retrieve and update single values

– full SQL functionality can be used

Well-Suited for Structure-Oriented Queries– structures are represented in tables

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 15: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

15

Drawbacks Information Loss

– Comments– Processing Instructions– Prolog– CDATA Sections– Entities

Restrictions– only one text (content) per element

<element> Text1 <subelement/> Text2 lost

</element>– element text as VARCHAR(n); n <= 4000

Increased Load Time – sample document: 3.3. MB, 130.000 tuples, 13 minutes

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 16: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

16

Opaque Approach

Characteristics– XML document stored as Large Object (LOB)

– document completely preserved

Storage

Insert into tblXMLClob values (1,‘person.xml‘,‘

<person>

<name>Mary</name>

</person>‘ );

DocId url content

1 person.xml <person>

<name>Mary</name>

</person>

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 17: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

17

Oracle interMedia Text

Query Facilities of interMedia Text– full-text retrieval (word matching only)

– path expression only together with content search

– no range queries

Example in interMedia Text:SELECT DocId FROM tblXMLClob WHERE

CONTAINS(content,‘(Mary WITHIN name) WITHIN person‘)>0

XML Full-Text Index– Autosectioner Index

– XML Sectioner Index

– WITHIN operator• text_subquery WITHIN elementname

• searches the entire text content of the named tag

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 18: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

18

Comparison of Queries

return document IDs word matching (default) no existence test for

elements or attributes restricted set of path

expressions using WITHINe.g.: (xml WITHIN title) WITHIN book

provides limited attribute value searches, no nesting of attribute searches

numeric and data values not type-converted

no range searches on attribute values

return document fragments substring matching search for existing

elements or attributes path expressions

structure-oriented queries //Book/Title/[contains(..‘xml‘)]

searches for attribute values and element text can be combined

considers also decimal values

range searches possible using filters

interMedia Text XPath

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 19: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

19

XPath Queries with PL/SQL

Prerequisite– XDK for PL/SQL installed on the server

Parse CLOB into DOM representation

XPath Query

Search Document IDs of all CLOBs of

the XML Table

Execute XPath Query on the DOM tree for each

CLOB Objects of a Document ID

DocIDs with Result XML Documents

server-side

DB

Doc IDs

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 20: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

20

Advantages

Information Preservation Handling of Large Documents

– appropriate for document-centric documents with little structure and prose-rich elements

Different XML Document APIs– interMedia Text: restricted set of XPath functionality

– generate a DOM of the document before using XPath queries

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 21: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

21

Drawbacks

Restricted Expressiveness of Text Queries Performance vs. Accuracy of Query Results

– interMedia Text queries on CLOBs faster than the DOM-API

– sample document: 12.5 MB, parse time 3 min, load time 5 min

Restrictions of Indexes– maximum tag names for indexing (incl. namespace) 64 bytes

Problems with Markup– character entities

Vendor Dependence– text engines are proprietary, e.g., Oracle interMedia

Stability– maximum document size 50 MB

– memory errors may occur with smaller documents

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 22: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

22

User Interface

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 23: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

23

XML Database Interface

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• OutlookDatabase Server

DB

StorageMethod

ExportMethod

DeleteMethod

DocumentList

Query Method

Transf. to SQL

Transf. to XML

Middleware / XML Interface

SQLInsert Statem

ent

SQL

Select Statement

Row

Set

SQL

Delete

Statement SQ

LSe

lec t

Stat

emen

tD

ocID

s / D

ocN

ames

SQL

Sele

ctSt

atem

ent

Row S

et

Client

XMLDocument

XMLQuery

DocList

DocID /Doc Name

Page 24: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

24

Comparison

Import– loss of information– time-consuming– produces lots of tuples

Queries– XML-QL

• fast• new document as

result

Import– no loss of information– faster than structure-

oriented decomposition

Queries– interMedia Text

• fast• only document IDs as

result– XPath

• high response time• flexible granularity of

results

Structure-Oriented Approach Opaque Approach

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 25: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

25

Implementation Experience

Import Problems– structure-oriented decomposition

• SAX parser produces FillBuf Error for XML documents > 3 MB

• limitations of VARCHAR columns (max. 4000 bytes)

– opaque Approach• OutOfMemory error during import of XML documents

> 4MB – reason: to little heap size in Java – use start option Xmx<HeapSize>

Queries– Opaque Approach

• OutOfMemory error when parsing CLOB into DB– increase java_pool_size (100-150 MB)– increase shared_pool_size (150-200 MB)

Export and Delete without Problems

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook

Page 26: Management of XML Documents without Schema in Relational Database Systems

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

26

Outlook

Experience– problems with larger documents in both approaches

– no universal solution for all requirements

Co-Existence of Multiple Storage Approaches – integrate different storage engines

– combine structure-oriented decomposition and opaque approach

– need for a generic XML data type

New Data Model for Structure-Oriented Approach– reduce loss of information

XML Database Interface / Middleware– combines different approaches

– parameterize the XML database interface

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook