management of xml documents without schema in relational database systems
DESCRIPTION
Management of XML Documents without Schema in Relational Database Systems. Workshop Objects, and Databases OOPSLA 2001, Tampa Thomas Kudrass Leipzig University of Applied Sciences Department of Computer Science and Mathematics. Overview. Introduction Motivation Main Issues - PowerPoint PPT PresentationTRANSCRIPT
10/14/2001
Management of XML Documents without Schema in Relational Database Systems
Workshop
Objects, <XML> and Databases
OOPSLA 2001, Tampa
Thomas Kudrass
Leipzig University of Applied Sciences
Department of Computer Science and Mathematics
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
2
Overview Introduction
– Motivation– Main Issues
Structure-Oriented Approach– Storage / Data Model– Queries– Evaluation
Opaque Approach– Storage– Queries (vs. XPath) – Evaluation
Prototype Implementation – Interface– Experience
Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
3
Motivation
XML is used in:– data publishing (document-centric documents)
– data exchange (data-centric documents)
Why XML Documents without Schema?– generated by programs
– mostly data-centric documents, e.g., account statements
– high update frequency of the document structure evolving schemas
Problems– How to deal with XML documents without DTD / XML Schema
in databases?
– Evaluate approaches
– Use relational database systems
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
4
Main Issues
Evaluate Storage and Retrieval Methods– no defined document schema
– platform: Oracle 8i
XML-to-Relational Mapping Approaches– structure-oriented decomposition
– opaque approach
Identify Parameters of a Unified XML-DB Interface
Implement a Testbed– qualitiative assessment
– performance of both approaches
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
5
Structure-Oriented Approach
Characteristics– decomposition of an XML document into smaller units
(elements)
– depends on the document structure only
– target system: relational DBMs, generic schema
Variety of Mapping Methods– model XML document als directed ordered labeled graphs and
map them to tables
– proposed algorithms:
• edge tables [Florescu, Kossmann]
• universal table
• inlining techniques [Shanmugasundaram et.al.]
• model-based fragmentation
• Monet XML-model
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
6
Storing XML Data in Relations
XML-QL Data Model <tree> <person age=’55‘> <name>Peter</name> </person> <person age=’38‘> <name>Mary</name> <address>Fruitdale Ave. </address> </person></tree>• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
tre e
p e rson p e rso n
nam
e
nam
e a d ress
3 4
Pe te rM a ry 4711
Fruitd a le Ave .
1
2
[a g e = 55] [a g e = 38]
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
7
Data Model
tblDocs
DocId url
tblEdge
SourceId TargetId LeafId AttrId DocId EdgeName Type Depth
tblLeafs
LeafId Value
tblAttrs
AttrId Value
1
n
1
0/1 0/1
1• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
tblDocs
DocId url
tblEdge
SourceId TargetId LeafId AttrId DocId EdgeName Type Depth
tblLeafs
LeafId Value
tblAttrs
AttrId Value
1
n
1
0/1 0/1
1
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
8
Import Algorithm<tree>
<person age=“36“>
<name>Peter</name>
<address>
<street>Main Road 4</street>
<zip>04236</zip>
<city>Leipzig</city>
</address>
</person>
</tree>
SourceId
Target Id
LeafId
AttrId
DocId
EdgeName Type Depth
0 1 NULL NULL 1 tree ref 3
1 2 NULL NULL 1 person ref 2
2 3 NULL 1 1 age attr 02 4 1 NULL 1 name leaf 02 5 NULL NULL 1 address ref 15 6 2 NULL 1 street leaf 05 7 3 NULL 1 zip leaf 0
5 8 4 NULL 1 city leaf 0
DocId url1 Sample.xml
LeafId Value1 Peter2 Main
Road 43 042364 Leipzig
AttrId Value1 36
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
9
Query Processing
Query Language– XML query language most appropriate
– data model of our solution is based on XML-QL XML-QL preferred choice
XML-Relational Mismatch– relational DBMS “understands“ SQL only
requires translation from XML-QL to SQL
– generate result document from the tuples retrieved
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
10
Query Processing
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
XML-QLQuery Parser
GenerateSQL Statement
ObjectStructure
SQLStatement
Row Set
ExecuteSQL Statement
ConstructResult Document
XMLDocument
DB
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
11
Generate an SQL-Statement XML-QL QueryCONSTRUCT <result> {
WHERE
<person>
<name>$n</name>
<address>$a</address>
</person>
CONSTRUCT
<person>
<name>$n</name>
<address>$a</address>
</person>
} </result>
SQL StatementSELECT DISTINCT B.Type AS n_Type, B.TargetId AS n_TargetId, B.Depth AS n_Depth, C.Value AS n_Value, D.Type AS a_Type, D.TargetId AS a_TargetId, D.Depth AS a_Depth, E.Value AS a_ValueFROM tblEdge A,tblEdge B,tblLeafs C, tblEdge D,tblLeafs EWHERE (A.EdgeName = ‘person’) AND (A.TargetId = B.SourceId) AND (B.EdgeName = ‘name’) AND (B.LeafId = C.LeafId(+)) AND (A.TargetId = D.SourceId) AND (D.EdgeName = ‘address’) AND (D.LeafId = E.LeafId(+))
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
12
Construct Result Document Result Tuple
Subtree ReconstructionSELECT A.EdgeName, A.Type, Al.Value AS A_LeafVal, Aa.Value AS A_AttrVal FROM tblEdge A, tblLeafs Al, tblAttrs Aa WHERE A.SourceId=5 AND A.leafId=Al.leafId(+) AND A.attrId=Aa.attrId(+)
n_Type n_TargetId
n_Depth n_Value a_Type a_TargetId
a_Depth a_Value
leaf 4 0 Peter ref 5 1
EdgeName Type A_LeafVal A_AttrVal
street leaf Main Road 4
zip leaf 04236city leaf Leipzig
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
13
Query Result
XML Result Document<result>
<person>
<name>Peter</name>
<address>
<street>Main Road 4</street>
<zip>04236</zip>
<city>Leipzig</city>
</address>
</person>
</result>
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
14
Advantages
Vendor Independence– no specific DBMS features needed
Stability High Flexibility of Queries
– retrieve and update single values
– full SQL functionality can be used
Well-Suited for Structure-Oriented Queries– structures are represented in tables
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
15
Drawbacks Information Loss
– Comments– Processing Instructions– Prolog– CDATA Sections– Entities
Restrictions– only one text (content) per element
<element> Text1 <subelement/> Text2 lost
</element>– element text as VARCHAR(n); n <= 4000
Increased Load Time – sample document: 3.3. MB, 130.000 tuples, 13 minutes
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
16
Opaque Approach
Characteristics– XML document stored as Large Object (LOB)
– document completely preserved
Storage
Insert into tblXMLClob values (1,‘person.xml‘,‘
<person>
<name>Mary</name>
</person>‘ );
DocId url content
1 person.xml <person>
<name>Mary</name>
</person>
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
17
Oracle interMedia Text
Query Facilities of interMedia Text– full-text retrieval (word matching only)
– path expression only together with content search
– no range queries
Example in interMedia Text:SELECT DocId FROM tblXMLClob WHERE
CONTAINS(content,‘(Mary WITHIN name) WITHIN person‘)>0
XML Full-Text Index– Autosectioner Index
– XML Sectioner Index
– WITHIN operator• text_subquery WITHIN elementname
• searches the entire text content of the named tag
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
18
Comparison of Queries
return document IDs word matching (default) no existence test for
elements or attributes restricted set of path
expressions using WITHINe.g.: (xml WITHIN title) WITHIN book
provides limited attribute value searches, no nesting of attribute searches
numeric and data values not type-converted
no range searches on attribute values
return document fragments substring matching search for existing
elements or attributes path expressions
structure-oriented queries //Book/Title/[contains(..‘xml‘)]
searches for attribute values and element text can be combined
considers also decimal values
range searches possible using filters
interMedia Text XPath
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
19
XPath Queries with PL/SQL
Prerequisite– XDK for PL/SQL installed on the server
Parse CLOB into DOM representation
XPath Query
Search Document IDs of all CLOBs of
the XML Table
Execute XPath Query on the DOM tree for each
CLOB Objects of a Document ID
DocIDs with Result XML Documents
server-side
DB
Doc IDs
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
20
Advantages
Information Preservation Handling of Large Documents
– appropriate for document-centric documents with little structure and prose-rich elements
Different XML Document APIs– interMedia Text: restricted set of XPath functionality
– generate a DOM of the document before using XPath queries
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
21
Drawbacks
Restricted Expressiveness of Text Queries Performance vs. Accuracy of Query Results
– interMedia Text queries on CLOBs faster than the DOM-API
– sample document: 12.5 MB, parse time 3 min, load time 5 min
Restrictions of Indexes– maximum tag names for indexing (incl. namespace) 64 bytes
Problems with Markup– character entities
Vendor Dependence– text engines are proprietary, e.g., Oracle interMedia
Stability– maximum document size 50 MB
– memory errors may occur with smaller documents
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
22
User Interface
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
23
XML Database Interface
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• OutlookDatabase Server
DB
StorageMethod
ExportMethod
DeleteMethod
DocumentList
Query Method
Transf. to SQL
Transf. to XML
Middleware / XML Interface
SQLInsert Statem
ent
SQL
Select Statement
Row
Set
SQL
Delete
Statement SQ
LSe
lec t
Stat
emen
tD
ocID
s / D
ocN
ames
SQL
Sele
ctSt
atem
ent
Row S
et
Client
XMLDocument
XMLQuery
DocList
DocID /Doc Name
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
24
Comparison
Import– loss of information– time-consuming– produces lots of tuples
Queries– XML-QL
• fast• new document as
result
Import– no loss of information– faster than structure-
oriented decomposition
Queries– interMedia Text
• fast• only document IDs as
result– XPath
• high response time• flexible granularity of
results
Structure-Oriented Approach Opaque Approach
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
25
Implementation Experience
Import Problems– structure-oriented decomposition
• SAX parser produces FillBuf Error for XML documents > 3 MB
• limitations of VARCHAR columns (max. 4000 bytes)
– opaque Approach• OutOfMemory error during import of XML documents
> 4MB – reason: to little heap size in Java – use start option Xmx<HeapSize>
Queries– Opaque Approach
• OutOfMemory error when parsing CLOB into DB– increase java_pool_size (100-150 MB)– increase shared_pool_size (150-200 MB)
Export and Delete without Problems
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook
T. Kudrass OOPSLA Workshop “Objects, <XML> and Databases“, 2001
26
Outlook
Experience– problems with larger documents in both approaches
– no universal solution for all requirements
Co-Existence of Multiple Storage Approaches – integrate different storage engines
– combine structure-oriented decomposition and opaque approach
– need for a generic XML data type
New Data Model for Structure-Oriented Approach– reduce loss of information
XML Database Interface / Middleware– combines different approaches
– parameterize the XML database interface
• Introduction
• Structure-Oriented Approach
• Opaque Approach
• Prototype Implementation
• Outlook