Getting Data out of XML Documents
Bálint Joó
School of Physics, University of Edinburgh
May 02, 2003
Contents
In search of a simple API for accessing DOM
The multiple tag problem
What is it?
Is it a problem for us?
How can we get around it?
XPath
What is easy to parse?
Software: XPathReader package
Conclusions
Motivation (Starting Points)
Lack of free data-binding tools for C/C++
Desire to read ILDG Metadata documents, marshal application data
=> Have to write our own tools
Would like simple API to get at document data
Would like same API to cope with ILDG metadata AND application data.
We got as far as reading into a DOM.
Start With Simple Idea
Consider simple API with functions
push(tagname) -- select tag with name tagname
pop() -- move up a level
getType( tagname , result )
Type = string | float | double | int | bool;
Equivalent API: directory like structure with no absolute paths:
cd(tagname) = push(tagname) , cd(..) = pop()
Simple Data: No Attributes, No Namespaces, No Empty Elements.
Example
<?xml version="1.0"?>
<foo>
  <bar>String</bar>
  <fred>5.0</fred>
</foo>

open("file.xml");
push("foo");
string bar;
getString("bar", bar);
double fred;
getDouble("fred", fred);
pop();
So far so good - nice and simple
Current UKQCD Schema has no attributes/namespaces
Empty tags serve no purpose except as placeholders
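The push/pop idea can be sketched in a few lines of standard C++. This is only an illustration under stated assumptions: the `Element` tree and the `LocalReader` class are hypothetical stand-ins for a parsed DOM and for the reader API, and the document is built by hand instead of being read from `file.xml`.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical minimal element tree: a name, text content and children.
struct Element {
    std::string name;
    std::string text;
    std::vector<std::unique_ptr<Element>> children;
    Element* parent = nullptr;

    // Appends a child element and returns a pointer to it.
    Element* add(const std::string& child_name, const std::string& child_text = "") {
        children.push_back(std::make_unique<Element>());
        Element* c = children.back().get();
        c->name = child_name;
        c->text = child_text;
        c->parent = this;
        return c;
    }
};

// Cursor-style reader: push() descends, pop() ascends, getX() reads a child.
class LocalReader {
public:
    explicit LocalReader(Element& document) : current_(&document) {}

    void push(const std::string& tagname) { current_ = &child(tagname); }

    void pop() {
        if (current_->parent == nullptr) throw std::runtime_error("already at top");
        current_ = current_->parent;
    }

    void getString(const std::string& tagname, std::string& result) const {
        result = child(tagname).text;
    }

    void getDouble(const std::string& tagname, double& result) const {
        result = std::stod(child(tagname).text);
    }

private:
    // Returns the first child with the given name, in document order.
    Element& child(const std::string& tagname) const {
        for (const auto& c : current_->children)
            if (c->name == tagname) return *c;
        throw std::runtime_error("no such tag: " + tagname);
    }

    Element* current_;
};

// Builds <foo><bar>String</bar><fred>5.0</fred></foo> by hand and reads it
// back with the same sequence of calls as in the example.
double readFred(std::string& bar) {
    Element document;  // stands in for the parsed file
    Element* foo = document.add("foo");
    foo->add("bar", "String");
    foo->add("fred", "5.0");

    LocalReader reader(document);
    double fred = 0.0;
    reader.push("foo");
    reader.getString("bar", bar);
    reader.getDouble("fred", fred);
    reader.pop();
    return fred;
}
```

Note that `child()` silently picks the first match, which is exactly where repeated tags become a problem.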
BUT Soon we encounter...
The Multiple Tag Problem
Consider the following snippet:
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>
Let's try our API: push("size");
But what does push("axis"); do?
Multiple Tag Problem (cont'd)
push("axis") could select in document order
We could add an index to push("axis")
push("axis", 1) push("axis", 2)
We could add an index attribute to <axis>
<axis index="1"> <axis index="2">
But then we'd need a mechanism to match the index attribute
We could change the names of the axes:
<axis1> <axis2>
We could put the different <axis> elements into different namespaces -- effectively the same as adding an attribute
We could try to match the <dimension> tag.
The consequences
Changing tagnames for simplicity of parsing just seems wrong
Matching the <dimension> tag is not possible without first selecting an <axis> in our scheme (locality)
Adding attributes/namespaces complicates API.
This use of different namespaces would be philosophically wrong.
Adding an order-of-occurrence index to the API is cleanest
No need to change Schema, Instance documents etc.
Document ordering removes random access capability
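A minimal sketch of the order-of-occurrence option, assuming a hypothetical by-value `Element` tree in place of a real DOM. The helper `nthChild` is an illustrative name, not part of any library, and it counts from 1 as in push("axis", 1).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical minimal element: name, text content, child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Returns the occurrence-th child called tagname, counting from 1 in
// document order. Throws if there are fewer matching children.
const Element& nthChild(const Element& parent, const std::string& tagname,
                        std::size_t occurrence) {
    std::size_t seen = 0;
    for (const Element& c : parent.children) {
        if (c.name == tagname && ++seen == occurrence) return c;
    }
    throw std::out_of_range("no such occurrence of " + tagname);
}

// Builds the <size> snippet with its two <axis> children.
Element makeSize() {
    Element size{"size", "", {}};
    size.children.push_back(
        Element{"axis", "", {Element{"dimension", "1", {}}, Element{"length", "16", {}}}});
    size.children.push_back(
        Element{"axis", "", {Element{"dimension", "2", {}}, Element{"length", "16", {}}}});
    return size;
}
```

Neither the schema nor the instance document has to change, but the caller must know in advance which position holds which axis.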
In General
For less simple (more general) XML documents duplicate tags can be distinguished by:
Occurrence Order
Name
Attributes
Content
Namespace
An ideal, simple API should allow matching on all of these to interrogate any XML document.
What about Locality?
push(namespace, tagname, attributes, occurrence)
getType(ns, tagname, attributes, occurrence, result)
But NO local parser can match on element content.
need to open a tag based on value of content
BUT can't get to content without opening tag.
<size>
  <num_dimensions>2</num_dimensions>
  <axis>
    <dimension>2</dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension>1</dimension>
    <length>16</length>
  </axis>
</size>
Document order may not help here: the axes are swapped, yet the Schema document is still satisfied.
Would like to match on the <dimension> tag
Need to abandon locality
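To make the locality point concrete, here is a small sketch that selects an <axis> by the content of its <dimension> child. It has to look inside every candidate before choosing one, which is precisely what a strictly local push() cannot do. The `Element` type and the function `occurrenceWhere` are hypothetical names introduced for illustration only.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal element: name, text content, child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Returns the 1-based occurrence of the first <tagname> child of parent that
// has a <keyTag> child whose text equals keyValue, or 0 if there is none.
std::size_t occurrenceWhere(const Element& parent, const std::string& tagname,
                            const std::string& keyTag, const std::string& keyValue) {
    std::size_t seen = 0;
    for (const Element& candidate : parent.children) {
        if (candidate.name != tagname) continue;
        ++seen;
        // Non-local step: inspect the candidate's children before selecting it.
        for (const Element& grandchild : candidate.children) {
            if (grandchild.name == keyTag && grandchild.text == keyValue) return seen;
        }
    }
    return 0;
}

// Builds the reordered <size> document, where dimension 2 comes first.
Element makeReorderedSize() {
    Element size{"size", "", {}};
    size.children.push_back(Element{"num_dimensions", "2", {}});
    size.children.push_back(
        Element{"axis", "", {Element{"dimension", "2", {}}, Element{"length", "16", {}}}});
    size.children.push_back(
        Element{"axis", "", {Element{"dimension", "1", {}}, Element{"length", "16", {}}}});
    return size;
}
```

In this document the axis for dimension 1 is the second occurrence, so an index-based push("axis", 1) would pick the wrong one.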
Lesson
In order to avoid ambiguity we must
Restrict the form of markup we deal with
Force decisions onto our Schema writers
OR complicate our API
rely on tag ordering (either implicitly or explicitly)
introduce attributes (forcing decision on Schema writers)
give up locality in the API
Global Queries: XPath
Would like a nice way to encode
tag name
attributes
order of occurrence
attribute/content matching predicates
Can this be done?
YES! Using XPath
XPath Axes
XPath Axes specify coordinates for DOM, relative to the current node:
Parent axis: ..
Attribute axis: @
Child axis: ./
Following sibling axis (no compact selector)
Preceding sibling axis (no compact selector)
Some Axes can include more than one node:
ancestors: parent and all its ancestors
XPath Selectors
tagname selects all children of current node called tagname
* selects all children of node
@name selects all attribute nodes called name
@* selects all attribute nodes of current node.
name[i] selects the i-th occurrence of child node called name
.. selects parent of current node
//name selects name with any set of ancestors
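As a rough illustration of how a few of these selectors behave, the sketch below evaluates absolute paths built only from `name`, `*` and `name[i]` steps over a hypothetical `Element` tree. It is a toy written for this note, not an XPath processor: attributes, `..`, `//` and predicates are deliberately left out, and `select` and `makeDocument` are made-up names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal element: name, text content, child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Evaluates paths such as "/", "/size/axis", "/size/*" or "/size/axis[2]".
// The document argument plays the role of the node above the root element.
std::vector<const Element*> select(const Element& document, const std::string& path) {
    std::vector<const Element*> current{&document};
    std::size_t pos = 0;
    while (pos < path.size()) {
        if (path[pos] != '/') throw std::invalid_argument("expected '/'");
        std::size_t end = path.find('/', pos + 1);
        if (end == std::string::npos) end = path.size();
        std::string step = path.substr(pos + 1, end - pos - 1);
        pos = end;
        if (step.empty()) {
            if (path.size() == 1) break;  // the path "/" selects the document
            throw std::invalid_argument("empty step (\"//\" is not supported)");
        }
        // Split off an optional positional index, e.g. "axis[2]".
        std::size_t index = 0;  // 0 means "all occurrences"
        std::size_t bracket = step.find('[');
        if (bracket != std::string::npos) {
            index = std::stoul(step.substr(bracket + 1));  // parsing stops at ']'
            step.erase(bracket);
        }
        std::vector<const Element*> next;
        for (const Element* node : current) {
            std::size_t seen = 0;  // position among matching siblings
            for (const Element& c : node->children) {
                if (step != "*" && c.name != step) continue;
                ++seen;
                if (index == 0 || seen == index) next.push_back(&c);
            }
        }
        current = std::move(next);
    }
    return current;
}

// Builds a document node holding the <size> element with two <axis> children.
Element makeDocument() {
    Element size{"size", "", {}};
    size.children.push_back(
        Element{"axis", "", {Element{"dimension", "1", {}}, Element{"length", "16", {}}}});
    size.children.push_back(
        Element{"axis", "", {Element{"dimension", "2", {}}, Element{"length", "16", {}}}});
    return Element{"", "", {size}};
}
```

The result is a set of nodes, which is why a reader built on such queries needs a rule for queries that match more than one node.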
XPath Examples
<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>
XPath Query: /
Selection: the document root (the whole document)
XPath Examples (cont'd)
Same document as in the first example.
XPath Query: /size
Selection: the <size> element
XPath Examples (cont'd)
Same document as in the first example.
XPath Query: /size/axis
OR /size/*
OR //axis
Selection: both <axis> elements
XPath Examples (cont'd)
<?xml version="1.0"?>
<size>
  <axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </axis>
  <axis>
    <dimension>2</dimension>
    <length>16 </length>
  </axis>
</size>
XPath Query: /size/axis[2] (query on order of occurrence)
OR /size/axis[dimension="2"] (query on element content)
Selection: the second <axis> element
XPath Examples (cont'd)
<?xml version="1.0"?>
<size xmlns:bj="http://fred.org">
  <bj:axis>
    <dimension> 1 </dimension>
    <length>16</length>
  </bj:axis>
  <axis index="2">
    <dimension> 2 </dimension>
    <length>16 </length>
  </axis>
</size>
XPath Query: /size/bj:axis
Selection: the <bj:axis> element
Namespaces are supported
XPath Examples (cont'd)
Same document as in the namespace example.
XPath Query: /size/axis[@index="2"]
Selection: the <axis index="2"> element
Attribute Matching
Visit: http://www.zvon.org/xxl/XPathTutorial
for more ...
XPath Notes
Can return sets of nodes - not just unique node
Has more features:
Functions to turn query results into strings, numbers, booleans
Encodes all features we need
C/C++ linkable XPath Processors exist
Xerces, Xalan, libxml
Solves all our reader API problems in nice way.
XPath Based Reader API
Basic Functions:
open(file/stream);
getType(xpath_string, result);
getAttributeType(xpath_string, attributeName, result);
Semantics:
The xpath_string must identify a unique node.
What is Easy to Parse?
Stylistic discussion on Metadata Mailing list.
One particular question:
"How should we mark up things?"
Chris' Way:
<size>
  <dimensions>4</dimensions>
  <axis>
    <name>X</name>
    <length>16</length>
  </axis>
  <axis>
    <name>Y</name>
    <length>16 </length>
  </axis>
</size>
Tomoteru's Way:
<size>
  <x value="16"/>
  <y value="16"/>
  <z value="16"/>
  <t value="32"/>
</size>
Known as the "Element vs. Attribute" debate in the XML world.
What is Easy to Parse? (cont'd)
One statement is that the attribute way is perhaps easier to parse.
With XPath, both ways are easy to parse.
To get the length of the x dimension:
Chris' Way:
number(//size/axis[normalize-space(string(name))="X"]/length)
getInt("//size/axis[normalize-space(string(name))=\"X\"]/length", intValue);
Tomoteru's Way:
number(//size/x/@value)
getIntAttribute("//size/x", "value", intValue);
Chris' Way has a more complex query, but an equally simple API call.
Element vs. Attribute Debate (aside)
Looked on Web
Tomoteru's way is preferred in general by object modellers (e.g. database people)
Mark up most "atomic" data as attributes
Use tags to indicate "table structure"
Chris' way is perhaps preferred by archivists or librarians (Go Kim!)
Decide for yourself, a discussion is available at:
http://www.oasis-open.org/cover/elementsAndAttrs.html
Found no universally accepted best practice.
Software: XPathReader
Wrote software to implement XPath Reader API in C++
Wraps around free libxml2 (C) library
Uses overloading and templating
Two Classes:
BasicXPathReader:
Use XPath to get at basic C++ types (ints, std::strings, etc)
XPathReader
Allows reading of Complex Numbers and Arrays.
XPathReader Class Public Members
open/close functions:
void open(istream& is);
void close(void);
get value of attribute from node identified by XPath:
template <typename T> void getXPathAttribute(const string& xpath_to_node, const string& attribute_name, T& result);
get value of node identified by XPath:
template <typename T> void getXPath(const string& xpath, T& result);
count results of XPath Query:
int countXPath(const string& xpath_query);
Complex Numbers and Arrays
XPathReader Library provides Classes for Complex Numbers and Arrays:
template<typename T> class TComplex { ... };
template<typename T> class Array { ... };
Can have Complex numbers of arrays
Eg for storing real/imaginary parts of arrays:
TComplex< Array< double > >
Can also have TComplex templated on strings
Mathematically not sensible...
Complex Number Markup & Marshal
Invented simple mark up:
<foo>
  <cmpx>
    <re>real part</re>
    <im>imag part</im>
  </cmpx>
</foo>
Can maintain API through C++ function overloading and recursion:
template <typename T>
void getXPath(const string& path, TComplex<T>& result) {
  getXPath( path+"/cmpx/re", result.real() );
  getXPath( path+"/cmpx/im", result.imag() );
}
similar but slightly more involved for Array.
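The overloading trick can be tried out without an XML library. In the sketch below, `MockReader` is a hypothetical stand-in that maps a path string directly to text content, and `TComplex` is a minimal guess at a class with `real()` and `imag()` accessors returning references. Only the shape of the two `getXPath` overloads follows the snippet above.

```cpp
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Minimal complex-like holder; real() and imag() return references so that
// a reader can fill them in place (an assumption about the real class).
template <typename T>
class TComplex {
public:
    T& real() { return re_; }
    T& imag() { return im_; }

private:
    T re_{};
    T im_{};
};

// Hypothetical stand-in for the reader: each known path maps to its text.
class MockReader {
public:
    explicit MockReader(std::map<std::string, std::string> nodes)
        : nodes_(std::move(nodes)) {}

    // Basic types: look up the text and convert it with stream extraction.
    template <typename T>
    void getXPath(const std::string& path, T& result) const {
        auto it = nodes_.find(path);
        if (it == nodes_.end()) throw std::runtime_error("no node at " + path);
        std::istringstream in(it->second);
        if (!(in >> result)) throw std::runtime_error("cannot convert " + path);
    }

    // Complex numbers: recurse into the <re> and <im> children. Overload
    // resolution prefers this more specialised template for TComplex<T>.
    template <typename T>
    void getXPath(const std::string& path, TComplex<T>& result) const {
        getXPath(path + "/cmpx/re", result.real());
        getXPath(path + "/cmpx/im", result.imag());
    }

private:
    std::map<std::string, std::string> nodes_;
};
```

A call such as `reader.getXPath("/foo", z)` with `z` of type `TComplex<double>` then looks exactly like a call for a plain `double`, which is the point of keeping one API.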
Array Markup
Arrays were marked up as follows:
<foo>
  <array sizeName="size" elemName="el" indexName="idx" indexStart="x">
    <size>N</size>
    <el idx="x"> element[0] </el>
    <el idx="x+1"> element[1] </el>
    ...
    <el idx="x+N-1"> element[N-1] </el>
  </array>
</foo>
This is a general mark up -- suitable for local parsers too
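For illustration, the markup above could be produced from a `std::vector` along the following lines. The function name `arrayMarkup` is hypothetical, the tag and attribute names are fixed to the ones shown, and a numeric `indexStart` is assumed. Output is emitted on one line without indentation.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Writes values as an <array> element using the attribute names shown above.
// indexStart is the index given to the first element (assumed to be numeric).
std::string arrayMarkup(const std::vector<double>& values, int indexStart) {
    std::ostringstream out;
    out << "<array sizeName=\"size\" elemName=\"el\" indexName=\"idx\" indexStart=\""
        << indexStart << "\">";
    out << "<size>" << values.size() << "</size>";
    for (std::size_t i = 0; i < values.size(); ++i) {
        out << "<el idx=\"" << indexStart + static_cast<int>(i) << "\">"
            << values[i] << "</el>";
    }
    out << "</array>";
    return out.str();
}
```

Because the size and the per-element index are both written explicitly, a reader can check them against each other, whether it uses XPath or a local parser.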
Array Markup Example
<size>
  <array sizeName="num_dimensions" elemName="axis" indexName="dimension" indexStart="1">
    <num_dimensions>4</num_dimensions>
    <axis dimension="1"> ... </axis>
    <axis dimension="2"> ... </axis>
    ...
  </array>
</size>
Minimally invasive
Insert <array> </array> tags
Copy <dimension> tag to attribute
Easy to implement with XSL transformation
Working group needn't amend current metadata schema for it.
Conclusions
Discussed API Issues for Parsing XML without full "data binding" tools.
Discussed Repeated Tag problem
Concluded that XPath is simple and elegant way to solve problem - hopefully convinced you too.
Discussed C++ Implementation of an XPathReader API
Discussed how to parse compound data types
Described markup for Complex Numbers and Arrays
Suggest Complex and Array markup be standardised by Metadata Working Group (but not necessarily that it be used in metadata documents) - to assist sharing of data.
References/Links
XML, DOM, XPath: http://www.w3.org
Tutorials (XPath/XSLT): http://www.zvon.org
libxml2: http://www.xmlsoft.org
Elements vs. Attributes (and other discussions):
http://www.oasis-open.org/cover/elementsAndAttrs.html
XPathReader software
send email to me: [email protected]
SciDAC CVS repository at JLAB (xpath_reader)
SciDAC: http://www.lqcd.org