Querying XML Documents and Data
CBU Summer School 13.8. - 20.8.2007 (2 ECTS)
Prof. Pekka Kilpeläinen
Univ of Kuopio, Dept of Computer Science
[email protected]
CBU Summerschool '07
Querying XML: Introduction 2
Introduction & Motivation
XML appears everywhere
How to query it?
[Figure labels: XML, Internet, order, invoice]

Main Topic: Two XML Query Models
Region algebra
– for retrieval of structured text
– "lightweight"
» reduced language; for ad-hoc files; efficient free implementation
XQuery
– for general querying/manipulation of XML
– "heavy"
» comprehensive and complex language; for (data viewed as) XML only; production-use implementations?

Course Outline
Intro and Arrangements; Structured documents
1 Review of XML Basics
1.1 XML and XML docs; 1.2 Document grammars
1.3 XML DTDs; 1.4 XML Namespaces
1.5 XML Schema
2 Region Algebra and sgrep
3 W3C XQuery, and XPath 2.0
(Apologies for potential dis-organization!)

Arrangements
Background:
– W3C Recommendations (XML, XQuery)
– Reports (Region Algebra, sgrep)
– Earlier courses; own research and experiments
Some material (to be posted) at http://www.cs.uku.fi/~kilpelai/CBU07/
Plan: Lectures 12 h; hands-on exercises 8 h

Structured Documents
Document:
– a structured representation of information on some medium (message)
– normally for a human reader
» memos, manuals, articles, books, …
– also application-to-application messages
» e.g., between client and server in Web Services
– "prose-oriented XML" vs "data-oriented XML"
– can be treated as a single unit
» (a web page vs a web site)

Presentation vs Structure
Presentation informs the human reader about the meaning of text and the role of its parts
Markup indicates the presentation or the meaning of different parts of text
– originally hand-written annotations for the typesetter
– nowadays primarily codes embedded in digital documents; <Tags>

Markup and Markup Language
Procedural markup
– commands (start boldface, produce empty line, indent 5 mm, ...)
– proprietary word processor formats, nroff, TeX, ...
Descriptive or generic markup
– indicates conceptual structures using chosen names
– LaTeX: \begin{abstract} ... \end{abstract}
– HTML: <TITLE> ... </TITLE>
Markup language
– a fixed set of markup notations (e.g. nroff, TeX, HTML, SVG, …)

Structure in Documents
Hierarchy or nesting is ubiquitous
– Sections with subsections etc
– (Also overlapping hierarchies!)
Linear order essential in prose documents
– less important in documents representing data objects
Hypertext and cross-references
XML: proper hierarchies, tree-like structures, with cross-references via attribute values

1 Document Instances and Grammars
Overview of fundamentals, and some details, of XML
1.1 XML and XML documents
1.2 Basics of document grammars
1.3 Basics of XML DTDs
1.4 XML Namespaces
1.5 XML Schema

1.1 XML and XML documents
XML - Extensible Markup Language, W3C Recommendation, February 1998
– not an official standard, but a stable industry standard
– 2nd Ed 2000, 3rd Ed 2004, 4th Ed 2006
» editorial revisions, not new versions of XML 1.0
a simplified subset of SGML, Standard Generalized Markup Language, ISO 8879:1986
– valid XML documents are also SGML documents

What is XML?
Extensible Markup Language is not a markup language!
– does not fix a tag set nor its semantics (unlike markup languages such as HTML)
XML documents have no inherent (processing or presentation) semantics
– even though many think that XML is semantic or self-describing; see next

Semantics of XML Markup
Meaning of this XML fragment?
– The application has to "understand" the tags
– But better off with the tags, though!

What is XML (2)?
XML is
– a way to use markup to represent information
– a metalanguage
» supports definition of specific markup languages through XML DTDs (Document Type Definitions) or Schemas
» E.g. XHTML, a reformulation of HTML using XML
Often "XML" means XML + XML technology

How does it look?
<?xml version='1.0' encoding="iso-8859-1" ?>
<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpeläinen</name>
    <email>[email protected]</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>
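As a rough illustration of how an application might read such an invoice, the sketch below parses a copy of it with Python's standard `xml.etree.ElementTree` module. The XML declaration is left out because the text is passed as a Python string, and the email value `someone@example.org` is a made-up placeholder, not the address from the slide.

```python
import xml.etree.ElementTree as ET

# Illustrative copy of the invoice; the email value is a hypothetical placeholder.
INVOICE = """<invoice num="1234">
  <client clNum="00-01">
    <name>Pekka Kilpeläinen</name>
    <email>someone@example.org</email>
  </client>
  <item price="60" unit="EUR">XML Handbook</item>
  <item price="350" unit="FIM">XSLT Programmer's Ref</item>
</invoice>"""

root = ET.fromstring(INVOICE)

# Attributes are (name, string-value) pairs on element nodes.
invoice_num = root.get("num")

# Child elements are reached by simple paths relative to the root.
client_name = root.findtext("client/name")

# One tuple per item: text content plus two attribute values (all strings).
items = [(item.text, item.get("price"), item.get("unit"))
         for item in root.findall("item")]
```

Note that every value comes back as a string; interpreting `"60"` as a price in EUR is left to the application.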

Essential Features of XML
Overview of XML essentials
– many details skipped
– Learn to consult original sources (specifications, documentation etc) for details!
» The XML specification is easy to browse
First of all, XML is a textual or character-based way to represent data

XML Document Characters
XML documents are made of ISO-10646 (32-bit) characters; in practice of their 16-bit Unicode subset (used, e.g., in Java)
– Unicode 2.0 defines almost 39,000 distinct characters
Characters have three different aspects:
– their identification as numeric code points
– their representation by bytes
– their visual presentation

External Aspects of Characters
Documents are stored/transmitted as a sequence of bytes (of 8 bits). An encoding determines how characters are represented by bytes.
– UTF-8 (a superset of 7-bit ASCII) is the XML default encoding
– encoding="KOI8-R" should be OK for Cyrillic texts
» (I cannot comment on parser support)
A font determines the visual presentation of characters

XML Encoding of Structure 1
XML is, essentially, a textual encoding scheme of labelled, ordered and attributed trees:
– internal nodes are elements labelled by type names
– leaves are text nodes labelled by string values, or empty element nodes
– the left-to-right order of children of a node matters
– element nodes may carry attributes = (name, string-value) pairs
This view is shared by many XML techniques (DOM, XPath, XSLT, XQuery, ...)

XML Encoding of Structure 2
XML encoding of a tree
– corresponds to a pre-order walk
– start of an element node with type name A denoted by a start tag <A>, and its end denoted by end tag </A>
– possible attributes written within the start tag: <A attr1="value1" … attrn="valuen">
» Names attr1,…,attrn must be distinct
– text nodes written as their string value
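The encoding rules above can be sketched as a small recursive function. This is only an illustration under assumptions: the tuple representation `(name, attrs, children)` and the function name `serialize` are invented here, and details such as namespaces or character encodings are ignored.

```python
from xml.sax.saxutils import escape, quoteattr


def serialize(node):
    """Pre-order walk: node is a str (text node) or (name, attrs, children)."""
    if isinstance(node, str):
        # Text nodes are written as their string value (markup chars escaped).
        return escape(node)
    name, attrs, children = node
    # Attributes go inside the start tag; dict keys are distinct by construction.
    attr_text = "".join(f" {key}={quoteattr(value)}" for key, value in attrs.items())
    if not children:
        return f"<{name}{attr_text}/>"  # empty element
    inner = "".join(serialize(child) for child in children)
    return f"<{name}{attr_text}>{inner}</{name}>"


# A small tree: S with children W("Hello"), empty E with attribute A=1, W("world!")
tree = ("S", {}, [("W", {}, ["Hello"]), ("E", {"A": "1"}, []), ("W", {}, ["world!"])])
encoded = serialize(tree)
```

The start tag is emitted before the children and the end tag after them, which is what makes the output order a pre-order traversal of the element nodes.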

XML Encoding of Structure: Example
<S><W>Hello</W><E A='1'/><W>world!</W></S>
Tree:
S
  W: "Hello"
  E (A=1)
  W: "world!"

XML: Logical Document Structure
Elements
– indicated by matching (case-sensitive!) tags <ElementTypeName> … </ElementTypeName>
– can contain text and/or subelements
– can be empty: <elem-type></elem-type> or <elem-type/> (e.g. <br/> in XHTML)
– unique root element ⇒ document is a single tree

Logical document structure (2)
Attributes
– name-value pairs attached to elements
– in start-tag after the element type name
<div class="preface" date='990126'> …
– forms "..." and '...' are interchangeable
Also:
– <!-- comments outside other markup -->
– <?note this would be passed to the application as a processing instruction named 'note'?>

CDATA Sections
"CDATA Sections" to include XML markup characters as textual content
<![CDATA[ Here we can easily include markup characters and, for example, code fragments:
<example>if (Count < 5 && Count > 0) </example>
]]>
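A short check, using Python's standard `xml.etree.ElementTree`, suggests that a CDATA section is just another way of writing text: the parser reports the same string as it would for the escaped form. The element name `example` is reused from the slide; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Markup characters written literally inside a CDATA section ...
with_cdata = ET.fromstring(
    "<example><![CDATA[if (Count < 5 && Count > 0)]]></example>")

# ... and the same content written with predefined entity references.
with_entities = ET.fromstring(
    "<example>if (Count &lt; 5 &amp;&amp; Count &gt; 0)</example>")

same_text = with_cdata.text == with_entities.text
```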

Two levels of correctness (1)
Well-formed documents
– roughly: follows the syntax of XML, markup correct (elements properly nested, tag names match, attributes of an element have unique names, ...)
– violation is a fatal error
Valid documents
– (in addition to being well-formed) obey an associated grammar (DTD/Schema)
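To see the "fatal error" behaviour in practice, the sketch below wraps the standard-library parser in a helper; `is_well_formed` is a name invented for this illustration. It checks well-formedness only: `xml.etree.ElementTree` is a non-validating parser and does not check documents against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser stops at the first violation; it does not try to recover.
        return False
    return True


checks = {
    "nested": is_well_formed("<a><b/></a>"),
    "bad nesting": is_well_formed("<a><b></a></b>"),
    "duplicate attribute": is_well_formed('<a x="1" x="2"/>'),
    "two roots": is_well_formed("<a/><b/>"),
}
```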

XML docs and valid XML docs
XML documents = well-formed XML documents
– DTD-valid documents and Schema-valid documents are subsets of these

An XML Processor (Parser)
Reads XML documents and reports their contents to an application
– relieves the application from details of markup
– XML Recommendation specifies:
» recognition of characters as markup or data; what information to pass to applications; how to check the correctness of documents
» validation based on comparing document against its grammar
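One way to picture "reporting contents to an application" is an event stream. The sketch below uses `iterparse` from Python's standard library as one possible processor interface; the helper name `events` and the sample document are assumptions made for this illustration.

```python
import io
import xml.etree.ElementTree as ET


def events(xml_text):
    """List the (event, tag) pairs the parser reports for a document."""
    source = io.BytesIO(xml_text.encode("utf-8"))
    return [(event, elem.tag)
            for event, elem in ET.iterparse(source, events=("start", "end"))]


reported = events("<doc><sec>text</sec><sec/></doc>")
```

The application never sees angle brackets or quoting details, only the sequence of element starts and ends in document order.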
Next: Basics of document grammars

1.2 Basics of document grammars
DTDs are variations of context-free grammars (CFGs), which are widely used for syntax specification (programming languages, XML, …) and for parser/compiler generation (e.g. YACC/GNU Bison)
– No knowledge of them is necessary, but connections with CFGs may be informative for those who know about them

DTD/CFG Correspondence
DTD                        CFG
-------------------------  -----------------
XML document               parse/syntax tree
element type               nonterminal
element type declaration   production
#PCDATA                    terminal

Example: Three Authors of a Ref
Ref
  Author: Aho
  Author: Hopcroft
  Author: Ullman
  Title: The Design and Analysis ...
  PublData: . . .
Ref → Author* Title PublData ∈ P, Author Author Author Title PublData ∈ L(Author* Title PublData)

Extended Productions
Notice the regular expressions in productions
– to describe (potentially infinite) sequences
That is, we are using extended CFGs
– content models (of a DTD) correspond to regular expressions (in an ECFG production)
– → number of an element's children is generally unlimited
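The membership test behind an extended production can be sketched with an ordinary regular expression over child names. This assumes the production `Ref → Author* Title PublData`; the names `REF_CONTENT` and `children_match` are invented for the illustration.

```python
import re

# Each child name is followed by a space so that names cannot run together.
REF_CONTENT = re.compile(r"(?:Author )*Title PublData ")


def children_match(child_names):
    """True if the sequence of child names is in L(Author* Title PublData)."""
    text = "".join(name + " " for name in child_names)
    return REF_CONTENT.fullmatch(text) is not None


three_authors = children_match(["Author", "Author", "Author", "Title", "PublData"])
no_authors = children_match(["Title", "PublData"])
many_authors = children_match(["Author"] * 50 + ["Title", "PublData"])
missing_title = children_match(["Author", "PublData"])
```

Because of the `*`, no fixed bound on the number of `Author` children exists, which is the point of the last bullet above.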

1.3 Basics of XML DTDs
A Document Type Declaration provides a grammar (document type definition, DTD) for a class of documents [Defined in XML Rec]
Syntax (in the prolog of a document instance):
<!DOCTYPE rootElemType SYSTEM "ex.dtd"
  <!-- "external subset" in file ex.dtd -->
  [ <!-- an optional "internal subset" -->
  ]>
DTD = union of the external and internal subset
– internal takes precedence for attribute and entity decls

Markup Declarations
DTD consists of markup declarations
– element type declarations
» ≈ productions of ECFGs
– attribute-list declarations
» for declared element types
– entity declarations
» for physical structures
– notation declarations
(element type and attribute-list declarations describe logical structures)

What do Declarations Look Like?
<!ELEMENT invoice (client, item+)>
<!ATTLIST invoice num NMTOKEN #REQUIRED>
<!ELEMENT client (name, email?)>
<!ATTLIST client clNum NMTOKEN #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT email (#PCDATA)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item
  price NMTOKEN #REQUIRED
  unit (FIM | EUR) "EUR" >

Element Type Declarations
General form:
<!ELEMENT elementTypeName (E)>
where E is a content model ≈ regular expression of element names
Content model operators:
E | F : choice
E, F : concatenation
E? : optional
E* : zero or more
E+ : one or more
(E) : grouping
Must group: (A,B)|C or A,(B|C), but A,B|C forbidden
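Since content models are regular expressions over element names, they can be translated mechanically into the regex syntax of a programming language. The sketch below does this for element-only content models; it is an assumption-laden illustration (the function names are invented, `#PCDATA`, `EMPTY` and `ANY` are not handled, and it does not reject ungrouped forms such as `A,B|C`).

```python
import re

# A token is either an element name or one of the content model operators.
TOKEN = re.compile(r"[A-Za-z_][\w.\-]*|[(),|?*+]")


def compile_content_model(model):
    """Translate e.g. '(client, item+)' into a compiled Python regex."""
    parts = []
    for token in TOKEN.findall(re.sub(r"\s+", "", model)):
        if token == "(":
            parts.append("(?:")          # grouping, without capturing
        elif token == ",":
            continue                     # concatenation is implicit in regexes
        elif token in ")|?*+":
            parts.append(token)          # same meaning in both notations
        else:
            # An element name; ';' terminates it so names cannot run together.
            parts.append("(?:" + re.escape(token) + ";)")
    return re.compile("".join(parts))


def matches(model, child_names):
    """True if the sequence of child element names satisfies the model."""
    text = "".join(name + ";" for name in child_names)
    return compile_content_model(model).fullmatch(text) is not None
```

For example, `matches("(client, item+)", ["client", "item", "item"])` holds, while an invoice with a client but no items does not satisfy the model.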

Attribute-List Declarations
Can declare attributes for elements:
– Name, data type and possible default value
Example:
<!ATTLIST FIG
  id    ID          #IMPLIED
  descr CDATA       #REQUIRED
  class (a | b | c) "a">
Semantics mainly up to the application
– processor checks that ID attributes are unique and that targets of IDREF attributes exist
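The two checks mentioned for ID and IDREF attributes can be sketched by hand. Python's `xml.etree.ElementTree` is non-validating and does not read attribute types from a DTD, so this illustration simply assumes that an attribute named `id` has type ID and one named `ref` has type IDREF; both names and the function `check_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr="id", idref_attr="ref"):
    """Return (duplicate ID values, IDREF values without a matching ID)."""
    seen, duplicates, refs = set(), [], []
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                duplicates.append(value)   # ID values must be unique
            seen.add(value)
        target = elem.get(idref_attr)
        if target is not None:
            refs.append(target)
    # References are resolved only after all IDs of the document are known.
    dangling = [target for target in refs if target not in seen]
    return duplicates, dangling


doc = ET.fromstring(
    '<doc><FIG id="f1" descr="a"/><FIG id="f1" descr="b"/>'
    '<see ref="f2"/><see ref="f1"/></doc>')
duplicates, dangling = check_ids(doc)
```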

Mixed, Empty and Arbitrary Content
Mixed content:
<!ELEMENT P (#PCDATA | I | IMG)*>
– may contain text and elements
Empty content:
<!ELEMENT IMG EMPTY>
Unrestricted content: ANY
(= (#PCDATA | choice-of-all-declared-element-types)* )

Entities (1)
Named storage units of XML documents
Multiple uses:
– character entities:
» &lt; &#60; and &#x3C; all expand to '<' (treated as data, not as start-of-markup)
» other predefined entities: &amp; &gt; &apos; &quot; expand to &, >, ' and "
– general entities are shorthand notations:
<!ENTITY UKU "University of Kuopio">
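The expansions can be observed with Python's standard parser. The sketch below feeds it character references, predefined entities, and a general entity declared in an internal subset; the element name `t` and the variable names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A named reference, a decimal and a hexadecimal character reference for '<'.
less_than = ET.fromstring("<t>&lt;&#60;&#x3C;</t>").text

# The other four predefined entities.
others = ET.fromstring("<t>&amp;&gt;&apos;&quot;</t>").text

# A general entity declared in the internal subset, used as shorthand.
WITH_ENTITY = ('<!DOCTYPE t [<!ENTITY UKU "University of Kuopio">]>'
               "<t>&UKU;</t>")
shorthand = ET.fromstring(WITH_ENTITY).text
```

In all three cases the application receives plain characters; the references themselves are resolved by the parser.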

Entities (2)
physical storage units comprising a document
– parsed entities
<!ENTITY chap1 SYSTEM "http://myweb/ch1">
– document entity is the starting point of processing
– entities and elements must nest properly:
Document entity:
<!DOCTYPE doc [
  <!ENTITY chap1 (… as above …) > ]>
<doc>
  &chap1;
</doc>
Entity chap1:
<sec num="1">
  …
</sec>
<sec num="2">
  …
</sec>

Unparsed Entities and Parameter Entities
Unparsed entities allow XML documents to refer to external binary objects like graphics files
– XML processor handles only text
– I've rarely used these
Parameter entities are used in DTDs
– useful for modularizing declarations
We skip these

1.4 XML Namespaces
Documents often comprise parts processed by different applications (and/or defined by different grammars)
– for example, in XSLT scripts:
<xsl:template match="doc/title">
  <H1>
    <xsl:apply-templates />
  </H1>
</xsl:template>
(H1: HTML element; xsl:template and xsl:apply-templates: XSLT elements/instructions)
– How to manage multiple sets of names?

XML Namespaces (2/5)
Solution:
– By introducing (arbitrary) local name prefixes, and binding them to (fixed) globally unique URIs
– For example, the local prefix "xsl:" conventionally used in XSLT scripts

XML Namespaces briefly (3/5)
Namespace identified by a URI (through the associated local prefix), e.g. http://www.w3.org/1999/XSL/Transform for XSLT
– conventional but not required to use URLs
– the identifier has to be unique, but no need to be an address
Association inherited to sub-elements
– see the next example (of an XSLT script)

XML Namespaces (4/5)
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <!-- XHTML is the 'default namespace' -->
  <xsl:template match="doc/title">
    <H1>
      <xsl:apply-templates />
    </H1>
  </xsl:template>
</xsl:stylesheet>
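A namespace-aware parser replaces each prefix by the URI it is bound to, so the prefix itself carries no meaning. The sketch below parses a copy of the stylesheet with Python's standard `xml.etree.ElementTree`, which writes expanded names as `{uri}local-name`; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

STYLESHEET = """<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/TR/xhtml1/strict">
  <xsl:template match="doc/title">
    <H1><xsl:apply-templates/></H1>
  </xsl:template>
</xsl:stylesheet>"""

XSL = "{http://www.w3.org/1999/XSL/Transform}"
DEFAULT = "{http://www.w3.org/TR/xhtml1/strict}"

root = ET.fromstring(STYLESHEET)

# Expanded names of all elements, in document order.
tags = [elem.tag for elem in root.iter()]
```

The unprefixed `H1` ends up in the default namespace declared on the root element, which shows the inheritance of namespace bindings to sub-elements.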

XML Namespaces briefly (5/5)
Mechanism built on top of basic XML
– overloads attribute syntax (xmlns:) to introduce namespaces
– does not affect validation
» namespace attributes have to be declared for DTD-validity
» all element type names have to be declared (with their initial prefixes!)
– → Other schema languages (XML Schema, Relax NG) better for validating documents with Namespaces

1.5 XML Schemas
A quick look at XML Schema
– W3C Recommendation, 1st Ed. May, 2001; 2nd Ed. Oct, 2004:
» XML Schema Part 0: Primer (readable non-normative introduction; Recommended)
» XML Schema Part 1: Structures
» XML Schema Part 2: Datatypes
– W3C Draft (didn't lead anywhere?):
» Formal Description, 9/2001

Advantages of XML Schema (1)
XML syntax
– easier to manipulate by programs (than DTDs)
Compatibility with namespaces
– can validate against declarations from multiple sources
Content datatypes
– 44 built-in datatypes (including primitive Java datatypes, datatypes of SQL, and XML attribute types)
– mechanisms to derive user-defined datatypes
– used as types of XQuery

XSDL built-in types
(Part 2, Chap. 3)
NB: all simple values in documents are strings
[Figure: hierarchy of the XSDL built-in datatypes; types marked * (e.g. CDATA) are XML attribute types]
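Because values appear in documents as strings, a datatype mainly tells the application how to interpret the lexical form. The sketch below maps a few XML Schema type names to Python converters; the table `CONVERTERS` and the function `typed` are assumptions for illustration and cover neither the full lexical rules nor the facets of these types.

```python
from datetime import date
from decimal import Decimal

# Hypothetical mapping from a few built-in type names to Python converters.
CONVERTERS = {
    "xs:string": str,
    "xs:integer": int,
    "xs:decimal": Decimal,
    "xs:date": date.fromisoformat,
}


def typed(lexical, type_name):
    """Interpret the string found in a document as a value of the given type."""
    return CONVERTERS[type_name](lexical)


price = typed("148.95", "xs:decimal")
quantity = typed("1", "xs:integer")
ship_date = typed("1999-05-21", "xs:date")
```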

Advantages of XML Schema (2)
Element names and content types are independent (compare with CFGs/DTDs, below)
– For example, could define titles
» of people as "Mr."/"Mrs."/"Ms.", and
» of chapters as strings
– → extend the power of CFGs/DTDs
» where non-terminal / tag-name alone determines its allowed content
– (Is this relevant in practice?)

Advantages of XML Schema (3)
Ability to specify uniqueness and keys within selected parts of the document
– for example, that titles of chapters should be unique; or key attributes of relations
– uses XPath
Support for schema documentation
– element annotation with sub-elements documentation (for human readers) and appInfo (for applications)
– Only these contain text (#PCDATA)

Disadvantages of XML Schema
Complexity (esp. Rec Part 1!) vs. added power
– → a long learning curve
– → slow adoption by users
Immaturity of implementations (?)
– W3C web site mentions ~ 60 tools/processors
– Apache Xerces claims full XSDL support
– Some features difficult to implement efficiently
Alternative schema languages have been suggested, too
– Relax NG
– Schematron
– ...

XSDL through Example
– Next: walk-through of an XML schema example
» from Chapter 2 of the XML Schema Primer
– Consider modelling purchase orders like below:
<purchaseOrder orderDate="1999-10-20">
  <shipTo country="US">
    <name>Alice Smith</name>
    <street>123 Maple Street</street>
    <city>Mill Valley</city>
    <state>CA</state>
    <zip>90952</zip>
  </shipTo>
purchaseOrder instance continues

  <billTo country="US">
    <name>Robert Smith</name>
    <street>8 Oak Avenue</street>
    <city>Old Town</city>
    <state>PA</state>
    <zip>95819</zip>
  </billTo>
  <comment>Hurry, my lawn is wild!</comment>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
      <comment>Only if electric</comment>
    </item>
End of the example instance

    <item partNum="926-AA">
      <productName>Baby Phone</productName>
      <quantity>1</quantity>
      <USPrice>39.98</USPrice>
      <shipDate>1999-05-21</shipDate>
    </item>
  </items>
</purchaseOrder>

Next: A schema for such purchase orders
The Purchase Order Schema (1/5)

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="purchaseOrder" type="POrdType"/>
  <xs:element name="comment" type="xs:string"/>
  <xs:complexType name="POrdType">
    <xs:sequence>
      <xs:element name="shipTo" type="USAddr"/>
      <xs:element name="billTo" type="USAddr"/>
      <xs:element ref="comment" minOccurs="0"/>
      <xs:element name="items" type="Items"/>
    </xs:sequence>
    <xs:attribute name="orderDate" type="xs:date"/>
  </xs:complexType>
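The sequence in POrdType is a content model, i.e. a regular expression over child element names: shipTo, billTo, an optional comment, then items. The sketch below checks only that ordering for the direct children of purchaseOrder; it ignores attributes and the types of the children, and it is not a schema validator. The names PORD_MODEL and matches_pord_sequence are hypothetical helpers introduced for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Content model of POrdType as a regular expression over child names.
# Each name is followed by one space so that names cannot run together.
PORD_MODEL = re.compile(r"shipTo billTo (comment )?items ")

def matches_pord_sequence(order):
    """Check only the order and presence of the direct children."""
    child_names = "".join(child.tag + " " for child in order)
    return PORD_MODEL.fullmatch(child_names) is not None

ok = ET.fromstring(
    "<purchaseOrder><shipTo/><billTo/><items/></purchaseOrder>")
bad = ET.fromstring(
    "<purchaseOrder><billTo/><shipTo/><items/></purchaseOrder>")
print(matches_pord_sequence(ok), matches_pord_sequence(bad))  # True False
```

The optional group mirrors minOccurs="0" on the comment reference; a maxOccurs="unbounded" particle would correspond to a starred group in the same style.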
The Purchase Order Schema (2/5)

  <xs:complexType name="USAddr">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="street" type="xs:string"/>
      <xs:element name="city" type="xs:string"/>
      <xs:element name="state" type="xs:string"/>
      <xs:element name="zip" type="xs:decimal"/>
    </xs:sequence>
    <xs:attribute name="country" type="xs:NMTOKEN" fixed="US"/>
  </xs:complexType>
The Purchase Order Schema (3/5)

  <xs:complexType name="Items">
    <xs:sequence>
      <xs:element name="item" minOccurs="0" maxOccurs="unbounded">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="productName" type="xs:string"/>
            <xs:element name="quantity">
              <xs:simpleType>
                <xs:restriction base="xs:positiveInteger">
                  <xs:maxExclusive value="100"/>
                </xs:restriction>
              </xs:simpleType>
            </xs:element>

Annotations: the complexType nested in item is an anonymous type for item; the simpleType nested in quantity is an anonymous type for quantity.
The Purchase Order Schema (4/5)

            <xs:element name="USPrice" type="xs:decimal"/>
            <xs:element ref="comment" minOccurs="0"/>
            <xs:element name="shipDate" type="xs:date" minOccurs="0"/>
          </xs:sequence>
          <xs:attribute name="partNum" type="SKU" use="required"/>
        </xs:complexType>
      </xs:element> <!-- item -->
    </xs:sequence>
  </xs:complexType> <!-- Items -->
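The anonymous type of quantity restricts xs:positiveInteger with maxExclusive="100", so the admissible values are 1 to 99. A minimal sketch of that value check follows; it assumes the usual integer lexical form (an optional plus sign and ASCII digits) and approximates whitespace handling with strip(). The helper name is_valid_quantity is hypothetical.

```python
import re

# Lexical form assumed for a positive integer: optional '+', then ASCII digits.
_INTEGER = re.compile(r"\+?[0-9]+")

def is_valid_quantity(text):
    """Accept 1..99: a positive integer strictly below maxExclusive=100."""
    text = text.strip()  # approximation of whitespace collapsing
    if _INTEGER.fullmatch(text) is None:
        return False
    return 1 <= int(text) < 100

print([is_valid_quantity(v) for v in ["1", "99", "100", "0", "2.5"]])
# [True, True, False, False, False]
```

Both quantity values in the example instance are 1, so they pass this check.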
The Purchase Order Schema (5/5)

  <!-- Type for Stock Keeping Units,
       (codes for identifying products): -->
  <xs:simpleType name="SKU">
    <xs:restriction base="xs:string">
      <!-- defined by a regular expr: -->
      <xs:pattern value="\d{3}-[A-Z]{2}"/>
      <!-- 3 digits, hyphen, 2 letters -->
    </xs:restriction>
  </xs:simpleType>
</xs:schema>
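The SKU pattern can be tried out directly. Schema patterns must match the whole value, so the sketch uses re.fullmatch rather than a search; for this particular expression Python's regular-expression syntax happens to coincide with the schema's. The helper name is_sku is an assumption for illustration, and the two valid codes are the partNum values from the example instance.

```python
import re

# Same expression as in the SKU type: 3 digits, a hyphen, 2 capital letters.
SKU = re.compile(r"\d{3}-[A-Z]{2}")

def is_sku(value):
    """True if the whole value matches the SKU pattern."""
    return SKU.fullmatch(value) is not None

print(is_sku("872-AA"), is_sku("926-AA"), is_sku("872-aa"), is_sku("1872-AA"))
# True True False False
```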
XML Schema: Summary

XSDL: an XML-based grammar formalism
– W3C Recommendation; alternative to DTDs
» support for namespaces
» richer content and attribute datatypes
Well accepted(?) in XML industry
– e.g., to describe messages between clients and servers in Web services (see, e.g., Web Services Description Language, Vers. 2.0, W3C Draft 3/07)
– for typing of XQuery