xml: what and why? application: web services xquery: the xml query language good news: xquery, as...


Upload: steven-bailey

Post on 14-Dec-2015


TRANSCRIPT

  • Slide 1

Slide 2
XML: What and Why?
Application: Web Services
XQuery: The XML Query Language
Good News: XQuery, as a declarative language, is ideal for automatic parallel execution
Bad News: We still need Java
    We limit the automatic parallel execution
XQueryP scripting extension and the tradeoff
CS315B

Slide 3
XML
A universal format for structured documents and data.
XML is designed to describe data and to focus on what data is.
HTML is designed to display data and to focus on how data looks.
Thus, HTML is about displaying information, while XML is about describing information.
CS315B Domain-specific Languages for Parallelism

Slide 4
Can represent a wide variety of both structured and unstructured data
Can be used in integrating heterogeneous data sources (traditional/relational databases, data files, email messages, web pages, etc.)
Can be used on a variety of devices including PCs, PDAs, smart mobile phones, etc.
BUT MAINLY... Helps companies to cut costs in information exchange

Slide 5
Differences (XML vs. Relational)
    Data model: tree (XML); table (relational)
    Schema: in XML, data and schemas should not be correlated. Data can exist with or without schema, or with multiple schemas. In the relational model: schema first, then data.
Commonalities
    Logical and Physical Data Independence
    Declarative Semantics

Slide 6
A Web service (WS) is a class on the Web. Like an RPC, it is identified by a URI (e.g. http://my.service:234), accepts an XML envelope as its argument, and returns an XML response.
[Diagram: several clients exchange XML with a Web Service, which fronts the Server Application Logic]

Slide 7
Typical Architecture
[Diagram: Client, XML, Web Service (XML Domain), Server Application Logic (Java /.NET), XQuery (XML Domain), XML, XML DB]

Slide 8
XQuery is a declarative programming language, designed to manipulate and query XML data.
With XQuery you describe what you want to achieve and leave the how to the runtime system.
It is essentially designed for optimizability, including automatic parallelization of query execution.

Slide 9
[Example XML document; element markup not preserved. Text content:] 333 RDBMS Paul I Information Retrieval using RDBMS Beyond Simple Information Retrieval Extension of RDBMS features

Slide 10
Syntactic sugar that combines FOR, LET, IF
General form:
    for var in expr
    let var := expr
    where expr
    return expr
Example: return the number of title elements of chapter I of the book.
    XQuery                              SQL analogy
    for $chapters in /book//chapter     similar to FROM
    let $titles := $chapters//title     no analogy in SQL
    where $chapters/num = "I"           similar to WHERE
    return count($titles)               similar to SELECT

Slide 11
(docId, sPos, ePos, level)
    docId: identifier of the document
    sPos:  starting position of the element or string within the XML doc
    ePos:  end position of the element (for a string, same as sPos)
    level: nesting depth within the document

Slide 12
To facilitate the evaluation of XQuery expressions, an index is created for all the nodes within the XML database.

    Term      docId   sPos   ePos   level
    book      1       1      36     0
    ISBN      1       2      4      1
    title     1       5      7      1
    chapter   1       11     35     1
    title     1       15     20     2
    title     1       22     26     3
    title     1       28     32     4
    RDBMS     1       6      6      2
    RDBMS     1       19     19     3
    RDBMS     1       30     30     3

Example: suppose we have the containment query chapter//title.
Search the table for all entries in which term = chapter:
    { (1,11,35,1) }
Search the table for all entries in which term = title:
    { (1,5,7,1), (1,15,20,2), (1,22,26,3), (1,28,32,4) }
Combine them!
Result: the title entries nested inside the chapter entry (11 < sPos and ePos < 35):
    { (1,15,20,2), (1,22,26,3), (1,28,32,4) }

Slide 13
Beowulf cluster: an example of a high-performance parallel computing system used for parallel processing of XML queries
Several processing nodes interconnected via a switch
Each node has its own CPU with a sizable cache, a large main memory (typically > 1 GB) and a hard disk

Slide 14
Master: runs the file. Serves as the point system for the clustering software to route duties and monitor all individual nodes (i.e., slaves)
Beowulf: open-source software like Linux
MPI library for broadcasting and point-to-point messages among the cluster's nodes

Slide 15
Phase 1: Distribute the entries of the fully-inverted index among the cluster nodes for processing (e.g., round-robin distribution, hash-based distribution).
Phase 2: Each cluster node processes the containment query to generate the corresponding lists of index entries.
Phase 3: The elements of the generated lists are checked against one another to produce the result set.
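The three phases above can be sketched in Python for the chapter//title containment query. This is a minimal sketch under stated assumptions, not the cluster implementation the slides describe: threads stand in for cluster nodes, no MPI is used, and the names `distribute`, `local_lists`, `contains` and `containment_query` are illustrative. The index rows are a subset of the table on Slide 12.

```python
from concurrent.futures import ThreadPoolExecutor

# (term, docId, sPos, ePos, level): a subset of the index rows from Slide 12.
INDEX = [
    ("ISBN", 1, 2, 4, 1),
    ("title", 1, 5, 7, 1),
    ("chapter", 1, 11, 35, 1),
    ("title", 1, 15, 20, 2),
    ("title", 1, 22, 26, 3),
    ("title", 1, 28, 32, 4),
]


def distribute(index, n_nodes):
    """Phase 1: round-robin distribution of index entries over the nodes."""
    return [index[i::n_nodes] for i in range(n_nodes)]


def local_lists(partition, ancestor, descendant):
    """Phase 2: one node selects its local entries for both query terms."""
    anc = [e[1:] for e in partition if e[0] == ancestor]
    desc = [e[1:] for e in partition if e[0] == descendant]
    return anc, desc


def contains(a, d):
    """True if entry d lies strictly inside entry a within the same document."""
    return a[0] == d[0] and a[1] < d[1] and d[2] < a[2]


def containment_query(index, ancestor, descendant, n_nodes=3):
    partitions = distribute(index, n_nodes)
    # Threads are a stand-in (assumption) for the cluster's slave nodes.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partial = list(
            pool.map(lambda p: local_lists(p, ancestor, descendant), partitions)
        )
    # Phase 3: merge the per-node lists and check them against one another.
    ancestors = [a for anc, _ in partial for a in anc]
    descendants = [d for _, desc in partial for d in desc]
    return sorted(d for d in descendants if any(contains(a, d) for a in ancestors))


if __name__ == "__main__":
    print(containment_query(INDEX, "chapter", "title"))
```

Calling `containment_query(INDEX, "chapter", "title")` returns the three title entries whose position range falls inside the chapter entry. The title at positions 5 to 7 is dropped because it starts before the chapter does. Round-robin distribution is used here only because it is the simplest of the two schemes the slide names.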

Slide 16
Despite XQuery, we still need Java/.NET to:
    implement user interfaces
    call Web services; interact with other programs
    expose functions as Web services
    write complex applications
Trade-off between optimizability (on one side) and flexibility, determinism and expressive power (on the other side)
Query languages are more optimizable but pay a price on the other side
Imperative languages lack optimizability, but their semantics are simpler, deterministic and richer

Slide 17
The ultimate goal: get rid of Java => all XQuery
XQueryP: extension of XQuery for scripting
[Diagram: Client, XML, Web Service (XML Domain), Server Application Logic (Java /.NET), XQuery (XML Domain), XML, XML DB]

Slide 18
Prototype in Big OracleDB
    Presented at Plan-X 2005
Prototype in BerkeleyDB-XML
    Might be open sourced (if there is interest)
MXQuery (Java) http://www.mxquery.org
    Runs on mobile phones: Java CLDC 1.1; some cuts even run on CLDC 1.0
    Eclipse plugin available since March 2007
Zorba C++ engine (FLWOR Foundation)
    Small footprint, performance, extensibility, potentially embeddable in many contexts

Slide 19
Ghassan Z. Qadah: Parallel processing of XML databases [2005 IEEE CCECE/CCGEI]
Xiaogang Li, Swarup Kumar Sahoo, Gagan Agrawal: XQuery Perspective: Using XML/XQuery for Scientific Applications and Applying Scientific Compilation Techniques [2004 SIGMOD]
Daniela Florescu, Donald Kossmann. CS345B: XML and Databases.
http://www.stanford.edu/class/cs345b/
W3C XML Query (XQuery) http://www.w3.org/XML/Query/

Slide 20

Slide 21
Introduces parts of code that will:
    Run in sequential mode
    Define the order in which expressions will be evaluated
    Be strictly deterministic
    Manually handle exceptions

Slide 22
HealthCare Level Seven http://www.hl7.org/
Geography Markup Language (GML)
Systems Biology Markup Language (SBML) http://sbml.org/
XBRL, the XML based Business Reporting standard http://www.xbrl.org/
Global Justice XML Data Model (GJXDM) http://it.ojp.gov/jxdm
ebXML http://www.ebxml.org/
e.g. Encoded Archival Description Application http://lcweb.loc.gov/ead/
Digital photography metadata XMP
An XML grammar for sensor data (SensorML)
Really Simple Syndication (RSS 2.0)

Slide 23
[Diagram: timeline from 1999 to 2007 relating XPath 1.0, XSLT 1.0, XPath 2.0, XSLT 2.0 and XQuery 1.0. Edge labels: "uses"; "extends, almost backwards compatible"; "extends". Additions listed: FLWOR expressions, node constructors, validation]

Slide 24
1. Allows sub-computations to be executed in a different order (parallelization, rescheduling)
2. Makes it possible to use various data access paths
3. Allows lazy evaluation
4. Allows streaming/pipelining between operations (no materialization of intermediate results)
5. Allows various evaluation algorithms for the same logical operation
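
Points 3 and 4 above can be illustrated with a small Python sketch. Python generators stand in for an XQuery runtime here, which is an assumption made purely for illustration. The helper names `scan`, `only_term` and `first_titles` and the sample rows are hypothetical and do not come from any XQuery engine. Each operator hands one entry at a time to the next, so no intermediate list is built and rows that are never needed are never read.

```python
from itertools import islice

# Sample (term, docId, sPos, ePos, level) rows in document order (illustrative).
ROWS = [
    ("ISBN", 1, 2, 4, 1),
    ("title", 1, 5, 7, 1),
    ("chapter", 1, 11, 35, 1),
    ("title", 1, 15, 20, 2),
    ("title", 1, 22, 26, 3),
]


def scan(rows, touched):
    """Yield rows one by one, recording every row that is actually read."""
    for row in rows:
        touched.append(row)
        yield row


def only_term(rows, term):
    """Streaming filter: passes on rows whose term matches, one at a time."""
    return (row for row in rows if row[0] == term)


def first_titles(n):
    """Return the first n title rows and the list of rows read to find them."""
    touched = []
    pipeline = only_term(scan(ROWS, touched), "title")
    result = list(islice(pipeline, n))
    return result, touched


if __name__ == "__main__":
    result, touched = first_titles(1)
    print(result)
    print(len(touched), "of", len(ROWS), "rows read")
```

Asking for a single title stops the scan as soon as the first match is found, so only the rows up to that match are read. An eager version that first built the full list of titles would read every row regardless of how many results the caller wanted.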