management of xml documents without schema in relational database systems

10/14/2001

Management of XML Documents without Schema in Relational Database Systems

Workshop

Objects, <XML> and Databases

OOPSLA 2001, Tampa

Thomas Kudrass

Leipzig University of Applied Sciences

Department of Computer Science and Mathematics

T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001

2

Overview Introduction

– Motivation– Main Issues

Structure-Oriented Approach– Storage / Data Model– Queries– Evaluation

Opaque Approach– Storage– Queries (vs. XPath) – Evaluation

Prototype Implementation – Interface– Experience

Outlook


3

Motivation

XML is used in:– data publishing (document-centric documents)

– data exchange (data-centric documents)

Why XML Documents without Schema?– generated by programs

– mostly data-centric documents, e.g., account statements

– high update frequency of the document structure evolving schemas

Problems– How to deal with XML documents without DTD / XML Schema

in databases?

– Evaluate approaches

– Use relational database systems

• Introduction

• Structure-Oriented Approach

• Opaque Approach

• Prototype Implementation

• Outlook


4

Main Issues

Evaluate Storage and Retrieval Methods– no defined document schema

– platform: Oracle 8i

XML-to-Relational Mapping Approaches– structure-oriented decomposition

– opaque approach

Identify Parameters of a Unified XML-DB Interface

Implement a Testbed– qualitiative assessment

– performance of both approaches

• Introduction


• Opaque Approach


• Outlook


5

Structure-Oriented Approach

Characteristics– decomposition of an XML document into smaller units

(elements)

– depends on the document structure only

– target system: relational DBMs, generic schema

Variety of Mapping Methods– model XML document als directed ordered labeled graphs and

map them to tables

– proposed algorithms:

• edge tables [Florescu, Kossmann]

• universal table

• inlining techniques [Shanmugasundaram et.al.]

• model-based fragmentation

• Monet XML-model

• Introduction


• Opaque Approach


• Outlook


6

Storing XML Data in Relations

XML-QL Data Model <tree> <person age=’55‘> <name>Peter</name> </person> <person age=’38‘> <name>Mary</name> <address>Fruitdale Ave. </address> </person></tree>• Introduction


• Opaque Approach


• Outlook

tre e

p e rson p e rso n

nam

e

nam

e a d ress

3 4

Pe te rM a ry 4711

Fruitd a le Ave .

1

2

[a g e = 55] [a g e = 38]


7

Data Model

tblDocs

DocId url

tblEdge

SourceId TargetId LeafId AttrId DocId EdgeName Type Depth

tblLeafs

LeafId Value

tblAttrs

AttrId Value

1

n

1

0/1 0/1

1• Introduction


• Opaque Approach


• Outlook

tblDocs

DocId url

tblEdge

SourceId TargetId LeafId AttrId DocId EdgeName Type Depth

tblLeafs

LeafId Value

tblAttrs

AttrId Value

1

n

1

0/1 0/1

1


8

Import Algorithm<tree>

<person age=“36“>

<name>Peter</name>

<address>

<street>Main Road 4</street>

<zip>04236</zip>

<city>Leipzig</city>

</address>

</person>

</tree>

SourceId

Target Id

LeafId

AttrId

DocId

EdgeName Type Depth

0 1 NULL NULL 1 tree ref 3

1 2 NULL NULL 1 person ref 2

2 3 NULL 1 1 age attr 02 4 1 NULL 1 name leaf 02 5 NULL NULL 1 address ref 15 6 2 NULL 1 street leaf 05 7 3 NULL 1 zip leaf 0

5 8 4 NULL 1 city leaf 0

DocId url1 Sample.xml

LeafId Value1 Peter2 Main

Road 43 042364 Leipzig

AttrId Value1 36

• Introduction


• Opaque Approach


• Outlook


9

Query Processing

Query Language– XML query language most appropriate

– data model of our solution is based on XML-QL XML-QL preferred choice

XML-Relational Mismatch– relational DBMS “understands“ SQL only

requires translation from XML-QL to SQL

– generate result document from the tuples retrieved

• Introduction


• Opaque Approach


• Outlook


10

Query Processing

• Introduction


• Opaque Approach


• Outlook

XML-QLQuery Parser

GenerateSQL Statement

ObjectStructure

SQLStatement

Row Set

ExecuteSQL Statement

ConstructResult Document

XMLDocument

DB


11

Generate an SQL-Statement XML-QL QueryCONSTRUCT <result> {

WHERE

<person>

<name>$n</name>

<address>$a</address>

</person>

CONSTRUCT

<person>

<name>$n</name>

<address>$a</address>

</person>

} </result>

SQL StatementSELECT DISTINCT B.Type AS n_Type, B.TargetId AS n_TargetId, B.Depth AS n_Depth, C.Value AS n_Value, D.Type AS a_Type, D.TargetId AS a_TargetId, D.Depth AS a_Depth, E.Value AS a_ValueFROM tblEdge A,tblEdge B,tblLeafs C, tblEdge D,tblLeafs EWHERE (A.EdgeName = ‘person’) AND (A.TargetId = B.SourceId) AND (B.EdgeName = ‘name’) AND (B.LeafId = C.LeafId(+)) AND (A.TargetId = D.SourceId) AND (D.EdgeName = ‘address’) AND (D.LeafId = E.LeafId(+))

• Introduction


• Opaque Approach


• Outlook


12

Construct Result Document Result Tuple

Subtree ReconstructionSELECT A.EdgeName, A.Type, Al.Value AS A_LeafVal, Aa.Value AS A_AttrVal FROM tblEdge A, tblLeafs Al, tblAttrs Aa WHERE A.SourceId=5 AND A.leafId=Al.leafId(+) AND A.attrId=Aa.attrId(+)

n_Type n_TargetId

n_Depth n_Value a_Type a_TargetId

a_Depth a_Value

leaf 4 0 Peter ref 5 1

EdgeName Type A_LeafVal A_AttrVal

street leaf Main Road 4

zip leaf 04236city leaf Leipzig

• Introduction


• Opaque Approach


• Outlook


13

Query Result

XML Result Document<result>

<person>

<name>Peter</name>

<address>

<street>Main Road 4</street>

<zip>04236</zip>

<city>Leipzig</city>

</address>

</person>

</result>

• Introduction


• Opaque Approach


• Outlook


14

Advantages

Vendor Independence– no specific DBMS features needed

Stability High Flexibility of Queries

– retrieve and update single values

– full SQL functionality can be used

Well-Suited for Structure-Oriented Queries– structures are represented in tables

• Introduction


• Opaque Approach


• Outlook


15

Drawbacks Information Loss

– Comments– Processing Instructions– Prolog– CDATA Sections– Entities

Restrictions– only one text (content) per element

<element> Text1 <subelement/> Text2 lost

</element>– element text as VARCHAR(n); n <= 4000

Increased Load Time – sample document: 3.3. MB, 130.000 tuples, 13 minutes

• Introduction


• Opaque Approach


• Outlook


16

Opaque Approach

Characteristics– XML document stored as Large Object (LOB)

– document completely preserved

Storage

Insert into tblXMLClob values (1,‘person.xml‘,‘

<person>

<name>Mary</name>

</person>‘ );

DocId url content

1 person.xml <person>

<name>Mary</name>

</person>

• Introduction


• Opaque Approach


• Outlook


17

Oracle interMedia Text

Query Facilities of interMedia Text– full-text retrieval (word matching only)

– path expression only together with content search

– no range queries

Example in interMedia Text:SELECT DocId FROM tblXMLClob WHERE

CONTAINS(content,‘(Mary WITHIN name) WITHIN person‘)>0

XML Full-Text Index– Autosectioner Index

– XML Sectioner Index

– WITHIN operator• text_subquery WITHIN elementname

• searches the entire text content of the named tag

• Introduction


• Opaque Approach


• Outlook


18

Comparison of Queries

return document IDs word matching (default) no existence test for

elements or attributes restricted set of path

expressions using WITHINe.g.: (xml WITHIN title) WITHIN book

provides limited attribute value searches, no nesting of attribute searches

numeric and data values not type-converted

no range searches on attribute values

return document fragments substring matching search for existing

elements or attributes path expressions

structure-oriented queries //Book/Title/[contains(..‘xml‘)]

searches for attribute values and element text can be combined

considers also decimal values

range searches possible using filters

interMedia Text XPath

• Introduction


• Opaque Approach


• Outlook


19

XPath Queries with PL/SQL

Prerequisite– XDK for PL/SQL installed on the server

Parse CLOB into DOM representation

XPath Query

Search Document IDs of all CLOBs of

the XML Table

Execute XPath Query on the DOM tree for each

CLOB Objects of a Document ID

DocIDs with Result XML Documents

server-side

DB

Doc IDs

• Introduction


• Opaque Approach


• Outlook


20

Advantages

Information Preservation Handling of Large Documents

– appropriate for document-centric documents with little structure and prose-rich elements

Different XML Document APIs– interMedia Text: restricted set of XPath functionality

– generate a DOM of the document before using XPath queries

• Introduction


• Opaque Approach


• Outlook


21

Drawbacks

Restricted Expressiveness of Text Queries Performance vs. Accuracy of Query Results

– interMedia Text queries on CLOBs faster than the DOM-API

– sample document: 12.5 MB, parse time 3 min, load time 5 min

Restrictions of Indexes– maximum tag names for indexing (incl. namespace) 64 bytes

Problems with Markup– character entities

Vendor Dependence– text engines are proprietary, e.g., Oracle interMedia

Stability– maximum document size 50 MB

– memory errors may occur with smaller documents

• Introduction


• Opaque Approach


• Outlook


22

User Interface

• Introduction


• Opaque Approach


• Outlook


23

XML Database Interface

• Introduction


• Opaque Approach


• OutlookDatabase Server

DB

StorageMethod

ExportMethod

DeleteMethod

DocumentList

Query Method

Transf. to SQL

Transf. to XML

Middleware / XML Interface

SQLInsert Statem

ent

SQL

Select Statement

Row

Set

SQL

Delete

Statement SQ

LSe

lec t

Stat

emen

tD

ocID

s / D

ocN

ames

SQL

Sele

ctSt

atem

ent

Row S

et

Client

XMLDocument

XMLQuery

DocList

DocID /Doc Name


24

Comparison

Import– loss of information– time-consuming– produces lots of tuples

Queries– XML-QL

• fast• new document as

result

Import– no loss of information– faster than structure-

oriented decomposition

Queries– interMedia Text

• fast• only document IDs as

result– XPath

• high response time• flexible granularity of

results

Structure-Oriented Approach Opaque Approach

• Introduction


• Opaque Approach


• Outlook


25

Implementation Experience

Import Problems– structure-oriented decomposition

• SAX parser produces FillBuf Error for XML documents > 3 MB

• limitations of VARCHAR columns (max. 4000 bytes)

– opaque Approach• OutOfMemory error during import of XML documents

> 4MB – reason: to little heap size in Java – use start option Xmx<HeapSize>

Queries– Opaque Approach

• OutOfMemory error when parsing CLOB into DB– increase java_pool_size (100-150 MB)– increase shared_pool_size (150-200 MB)

Export and Delete without Problems

• Introduction


• Opaque Approach


• Outlook


26

Outlook

Experience– problems with larger documents in both approaches

– no universal solution for all requirements

Co-Existence of Multiple Storage Approaches – integrate different storage engines

– combine structure-oriented decomposition and opaque approach

– need for a generic XML data type

New Data Model for Structure-Oriented Approach– reduce loss of information

XML Database Interface / Middleware– combines different approaches

– parameterize the XML database interface

• Introduction


• Opaque Approach


• Outlook