xml - cl.lingfil.uu.semarie/undervisning/textanalys16/xml.pdfdefinition...
Presentation Plan
1 Introduction
2 XML Specificities and Motivations
3 XML: Vocabulary and Techniques
Uppsala May 2016 2/38
Definition
A (document) markup language is a modern system for annotating a document in a way that is syntactically distinguishable from the text.
Examples?
- LaTeX: \textbf{•}, \section{•}
- HTML: <i></i>
- XML
- ...
Why? Historical reasons
Why?
The Internet is huge, diverse and heterogeneous. So how can we achieve:
- data transmission?
- standardization?
- easy manipulation?
XML Specificities
XML:
- is made for storing data
- is not made for displaying information
- lets you invent your own tags
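As a small illustration of "invent your own tags", the sketch below builds a document with Python's standard xml.etree.ElementTree module. The tag names (book, title, author), the year attribute and the text values are all invented for the example; XML itself predefines none of them.

```python
import xml.etree.ElementTree as ET

# Tag names are chosen freely; nothing here is predefined by XML itself.
book = ET.Element("book", attrib={"year": "2016"})
title = ET.SubElement(book, "title")
title.text = "An invented title"
author = ET.SubElement(book, "author")
author.text = "An invented author"

# Serialize the tree to a string: this is the stored data, with no
# information about how it should be displayed.
xml_text = ET.tostring(book, encoding="unicode")
print(xml_text)
```

The result only describes the data; how it is displayed is left to other tools.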
Why is XML useful?
XML is useful because it lets you share highly compatible data between systems and over time.
It is often used in NLP.
XML by-products
Because XML is flexible, other languages can be built on top of it, for example:
- OWL
- MusicXML
- RSS
Flexible does not mean rule-free
An XML document must respect some syntax rules:
- elements must be properly nested (the asterisk marks an example that is not well formed):
    <book><title></title><author></author></book>
    *<book><title><author></title></author></book>
  note: you can also create empty elements: <author/>
- XML is case sensitive:
    <Title></Title><author></author>
    *<Title></title><author></author>
- a root element is mandatory
- comments are written like this: <!-- My comment -->
- attribute values go between quotes: <myTag myAttribute="0">

Vocabulary definition
When an XML document respects this syntax, we say that it is "well formed".
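The rules above can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the test strings are assumptions made for this illustration (the strings are adapted from the examples above and wrapped in a root element where needed).

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Hypothetical test strings, one per rule.
examples = {
    "properly nested": "<book><title></title><author></author></book>",
    "crossed tags": "<book><title><author></title></author></book>",
    "case mismatch": "<book><Title></title></book>",
    "no single root": "<title></title><author></author>",
    "unquoted attribute": "<myTag myAttribute=0></myTag>",
}

results = {label: is_well_formed(text) for label, text in examples.items()}
for label, outcome in results.items():
    print(label, "->", outcome)
```

Only the first string is accepted; each of the others breaks exactly one rule.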
QUIZ
<?xml version="1.0"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don’t forget me this weekend!</body>
</note>

Is this (above) a "well formed" XML document?
1 Yes
2 No
QUIZ
<?xml version="1.0"?>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don’t forget me this weekend!</body>

Is this a "well formed" XML document?
1 Yes
2 No
XML Structure
XML has what we call a tree structure.
[Figure: example of a tree structure]
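One way to see the tree is to print each element indented under its parent. The sketch below does this with Python's standard xml.etree.ElementTree module for the note document from the quiz; the function name outline is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

NOTE = """<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don’t forget me this weekend!</body>
</note>"""

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

tree_lines = outline(ET.fromstring(NOTE))
print("\n".join(tree_lines))
```

Here note is the root of the tree, and to, from, heading and body are its four children.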
Let’s practice
Are you able to give me the tree structure of this XML?
DTD
The structure of an XML document is described by a "Document Type Definition", or "DTD".
DTD
An XML document is sometimes associated with a DTD through an extra line added at the beginning of the document.
(DTD declaration is on the second line)
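The slide's own example is not preserved here, so the following is only a sketch of what such a declaration can look like. The document, the root name note and the DTD file name note.dtd are assumptions. The code reads the declaration back with Python's standard xml.parsers.expat module; the DTD file itself is never opened.

```python
import xml.parsers.expat

# Hypothetical document: the DOCTYPE line (second line) names the root
# element "note" and points to an assumed DTD file called "note.dtd".
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
</note>
"""

def read_doctype(xml_text):
    """Return (root name, system identifier) from the DOCTYPE, or None."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, system_id))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return found[0] if found else None

doctype = read_doctype(DOCUMENT)
print(doctype)
```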
DTD
Anatomy of a DTD
DTD
Example: [figure: an XML document and its DTD]
DTD
Vocabulary
When an XML document respects the DTD, we say that it is "valid".
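To make the idea concrete, here is a sketch with an invented DTD written inline in the document. Python's standard library does not include a validating parser, so the code performs only a much weaker check than real validation: it lists the element names the DTD declares and reports any element used in the document that is not declared. The function names and the sample document are assumptions made for this illustration.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET

# Assumed example: a DTD written inline (internal subset) for a small note.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to, from, heading, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT from (#PCDATA)>
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>See you this weekend</body>
</note>
"""

def declared_elements(xml_text):
    """Return the element names declared with <!ELEMENT ...> in the DTD."""
    names = []
    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = lambda name, model: names.append(name)
    parser.Parse(xml_text, True)
    return names

def undeclared_elements(xml_text):
    """Return the elements used in the document but absent from the DTD.

    This is NOT validation: order and number of children are not checked.
    """
    declared = set(declared_elements(xml_text))
    used = {element.tag for element in ET.fromstring(xml_text).iter()}
    return sorted(used - declared)

print(declared_elements(DOCUMENT))
print(undeclared_elements(DOCUMENT))
```

A document that passes this check can still be invalid, for instance if the children of note appear in the wrong order; only a validating parser checks the full DTD.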
Tools
Commands exist to check that your XML is well formed/valid.
Well formed:
    xmllint --noout document.xml
Valid:
    xmllint --noout --valid document.xml
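If you want to run these checks from a script, one possible approach is to call xmllint through Python's subprocess module. This is only a sketch: the function names are assumptions, the options are written in the double-dash form used in the xmllint documentation, and run_check returns None when xmllint is not installed.

```python
import shutil
import subprocess

def xmllint_command(path, validate=False):
    """Build the xmllint command line for a well-formedness or validity check."""
    command = ["xmllint", "--noout"]
    if validate:
        command.append("--valid")
    command.append(path)
    return command

def run_check(path, validate=False):
    """Return True/False for the check, or None if xmllint is not installed."""
    if shutil.which("xmllint") is None:
        return None
    result = subprocess.run(xmllint_command(path, validate), capture_output=True)
    # xmllint exits with status 0 when it finds no error.
    return result.returncode == 0

print(" ".join(xmllint_command("document.xml", validate=True)))
```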
XPath
Definition
XPath, the XML Path Language, is a query language for selecting nodes from an XML document. XPath uses path expressions to select nodes or node-sets in an XML document. These path expressions look very much like the expressions you see when you work with a traditional computer file system.
XPath tool: xmlstarlet
Among other functions, you can select elements in your XML with xmlstarlet like this:
    xmlstarlet sel -t -v "XpathCommand" document.xml
Example:
    xmlstarlet sel -t -v "/Text_description/contents/w[2]/form" lemms.xml
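The same kind of selection can be sketched in Python, whose standard xml.etree.ElementTree module supports a subset of XPath. The sample document below is an invented stand-in and not the real lemms.xml: only the element names Text_description, contents, w and form come from the expression above; the words and the lemma element are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for lemms.xml: only the element names come from the
# XPath expression above; the words and the <lemma> element are assumptions.
SAMPLE = """<Text_description>
  <contents>
    <w><form>The</form><lemma>the</lemma></w>
    <w><form>cats</form><lemma>cat</lemma></w>
    <w><form>sleep</form><lemma>sleep</lemma></w>
  </contents>
</Text_description>"""

root = ET.fromstring(SAMPLE)

# root is already <Text_description>, so the path starts below it.
# w[2] selects the second <w> child of <contents> (positions start at 1).
second_form = root.find("contents/w[2]/form").text
all_forms = [form.text for form in root.findall("contents/w/form")]

print(second_form)
print(all_forms)
```

Note that ElementTree paths are relative to the element they are called on, which is why the leading /Text_description step is dropped here.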
XPath
XPath expressions look like computer file system paths.
Can you guess what output this XPath expression should give?
    /Text_description/contents/w[2]/form

1 vara
2 ·
3 var
4 Det var så lite.
5 5
What is XPath used for?
XPath is useful for:
- navigating in the XML
- creating XSLT

Definition
XSL stands for EXtensible Stylesheet Language, and is a stylesheet language for XML documents. XSLT stands for XSL Transformations.
XSL
XSL is to XML what CSS is to HTML
XSL advantages and drawbacks
The XSL language manages:
- loops: <xsl:for-each>
- conditions: <xsl:if>
- sorting: <xsl:sort>

Drawbacks:
- the language is very wordy
- no regex in XSLT 1.0 (you must use XSLT 2.0)
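For comparison with a general-purpose language, the sketch below performs the same three operations (loop, condition, sort) in Python with xml.etree.ElementTree. It is not XSLT: the XSLT elements appear only in the comments, to show which construct each step corresponds to. The document and its names (library, book, title, year) are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample data.
LIBRARY = """<library>
  <book year="2016"><title>Gamma</title></book>
  <book year="1999"><title>Alpha</title></book>
  <book year="2010"><title>Beta</title></book>
</library>"""

root = ET.fromstring(LIBRARY)

# Sorting: order the books by title, as <xsl:sort select="title"/> would.
books = sorted(root.findall("book"), key=lambda book: book.findtext("title"))

lines = []
# Loop: visit every book, as <xsl:for-each select="book"> would.
for book in books:
    # Condition: keep recent books only, as <xsl:if test="@year &gt; 2000"> would.
    if int(book.get("year")) > 2000:
        lines.append(book.findtext("title") + " (" + book.get("year") + ")")

print("\n".join(lines))
```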
What should you do when you get an XML and an XSLT?
If you want to display it in a browser: add the line shown in red on the slide to the XML and open the XML.
If you want to get the output directly, use a processor:
    xsltproc document.xsl document.xml > whatever
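The line from the slide is not preserved in this transcript. The usual way to link a stylesheet to an XML document is an xml-stylesheet processing instruction placed after the XML declaration, and the sketch below assumes that this is what was meant. The function name and the file name document.xsl are assumptions.

```python
import xml.etree.ElementTree as ET

def add_stylesheet_link(xml_text, href):
    """Insert an xml-stylesheet processing instruction after the XML declaration."""
    instruction = '<?xml-stylesheet type="text/xsl" href="%s"?>' % href
    if xml_text.startswith("<?xml"):
        # Keep the XML declaration first: it must open the document.
        end = xml_text.index("?>") + 2
        return xml_text[:end] + "\n" + instruction + xml_text[end:]
    return instruction + "\n" + xml_text

linked = add_stylesheet_link('<?xml version="1.0"?>\n<note/>', "document.xsl")
print(linked)

# The result is still well formed.
ET.fromstring(linked)
```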
Conclusion
What experience will tell you
Even if XML can (almost) always be handled by writing your own program and regexes, using XML tools will save you time. Look in the documentation of your favourite language (Python, Java, etc.): there is always a library for it. Like any language, XML, XPath and XSLT are learned through practice!
Summary
- Today we learnt about a markup language called XML. It lets you structure data/text.
- Since XML is hyper-flexible, we sometimes need a Document Type Definition (DTD) associated with it.
- We learned some XML terms, mainly ‘well formed’ and ‘valid’.
- XSL, which is based on the XPath language, is one solution for transforming your XML into another kind of document: plain text, HTML, etc.
QUIZ
When an XML document conforms to the DTD, we say that it is:

1 "Well formed"
2 "Conformist"
3 "Valid"
References
The reference, short and clear, with quizzes and practical exercises:
http://www.w3schools.com/xml/default.asp
An example of a parser that can give results in XML format:
http://nlp.stanford.edu:8080/corenlp/process