xml - cl.lingfil.uu.semarie/undervisning/textanalys16/xml.pdfdefinition...

Post on 21-Apr-2019

216 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

XML

Marie Dubremetzmarie.dubremetz@lingfil.uu.se

Uppsala, May 2016

Presentation Plan

1 Introduction

2 XML Specificities and Motivations

3 XML: Vocabulary and Techniques

Uppsala May 2016 2/38

Table of Contents

1 Introduction

2 XML Specificities and Motivations

3 XML: Vocabulary and Techniques

Uppsala May 2016 3/38

Definition

A (document) markup language is a modern system for annotatinga document in a way that is syntactically distinguishable from thetext.Examples ?

Uppsala May 2016 4/38

Definition

A (document) markup language is a modern system for annotatinga document in a way that is syntactically distinguishable from thetext.Examples ?

LATEX\textbf{•} \section{•}HTML <i></i>XML...

Uppsala May 2016 5/38

Why? Historical reasons

Uppsala May 2016 6/38

Why?

Internet is huge, diverse, heterogeneous, thus how can we perform:data transmission ?standardization ?easy manipulation ?

Uppsala May 2016 7/38

Why?

Internet is huge, diverse, heterogeneous, thus how can we perform:data transmission ?standardization ?easy manipulation ?

Uppsala May 2016 7/38

Why?

Internet is huge, diverse, heterogeneous, thus how can we perform:data transmission ?standardization ?easy manipulation ?

Uppsala May 2016 7/38

Why?

Internet is huge, diverse, heterogeneous, thus how can we perform:data transmission ?standardization ?easy manipulation ?

Uppsala May 2016 7/38

Table of Contents

1 Introduction

2 XML Specificities and Motivations

3 XML: Vocabulary and Techniques

Uppsala May 2016 8/38

XML Specificities

XML:XML is made for storing dataIs not made for displaying informationLets you invent your own tags

Uppsala May 2016 9/38

XML Specificities

XML:XML is made for storing dataIs not made for displaying informationLets you invent your own tags

Uppsala May 2016 9/38

Why is XML useful?

Uppsala May 2016 10/38

Why is XML useful?

XML is useful because:it allows you to share highly compatible data, betweensystems, over time...

It is often used in NLP

Uppsala May 2016 11/38

Why is it useful?

Because XML is flexible, it allows to make some by-products.

Uppsala May 2016 12/38

XML by-products

Because XML is flexible it allows to make some by-products.

OWLMusicXML

Uppsala May 2016 13/38

XML by-products

Because XML is flexible it allows to make some by-products.

OWLMusicXML

Uppsala May 2016 13/38

XML by-products

Because XML is flexible it allows to make some by-products.

OWL

MusicXML

RSS

Uppsala May 2016 13/38

Table of Contents

1 Introduction

2 XML Specificities and Motivations

3 XML: Vocabulary and Techniques

Uppsala May 2016 14/38

Flexible does not mean rule-free

XML document must respect some syntax rules:a good nested order<book><title></title><author></author></book>*<book><title><author></title></author></book>note: you can just as well create empty tags: <author/>XML is case sensitiveroot element is mandatorywrite comments like that: <! -- My comment -->attributes are between " ", <myTag myAttribute="0">

Vocabulary definitionWhen an XML document respects this syntax we say that it is"well formed"

Uppsala May 2016 15/38

Flexible does not mean rule-free

XML document must respect some syntax rules:a good nested orderXML is case sensitive<Title></Title><author></author>*<Title></title><author></author>root element is mandatorywrite comments like that: <! -- My comment -->attributes are between " ", <myTag myAttribute="0">

Vocabulary definitionWhen an XML document respects this syntax we say that it is"well formed"

Uppsala May 2016 15/38

Flexible does not mean rule-free

XML document must respect some syntax rules:a good nested orderXML is case sensitiveroot element is mandatorywrite comments like that: <! -- My comment -->attributes are between " ", <myTag myAttribute="0">

Vocabulary definitionWhen an XML document respects this syntax we say that it is"well formed"

Uppsala May 2016 15/38

Flexible does not mean rule-free

XML document must respect some syntax rules:a good nested orderXML is case sensitiveroot element is mandatorywrite comments like that: <! -- My comment -->attributes are between " ", <myTag myAttribute="0">

Vocabulary definitionWhen an XML document respects this syntax we say that it is"well formed"

Uppsala May 2016 15/38

Flexible does not mean rule-free

XML document must respect some syntax rules:a good nested orderXML is case sensitiveroot element is mandatorywrite comments like that: <! -- My comment -->attributes are between " ", <myTag myAttribute="0">

Vocabulary definitionWhen an XML document respects this syntax we say that it is"well formed"

Uppsala May 2016 15/38

QUIZ

<?xml version="1.0"?><note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don’t forget me this weekend!</body></note>

Is this (above) a "well formed" XML document?1 Yes2 No

Uppsala May 2016 16/38

QUIZ

<?xml version="1.0"?><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don’t forget me this weekend!</body>

Is this a "well formed" XML document?1 Yes2 No

Uppsala May 2016 17/38

XML Structure

Uppsala May 2016 18/38

XML Structure

XML has what we call a tree structure.

Uppsala May 2016 19/38

XML Structure

XML has what we call a tree structure.Example of tree structure

Uppsala May 2016 19/38

Let’s practice

Are you able to give me the tree structure of this XML?

Uppsala May 2016 20/38

DTD

The structure definition of an XML document is described by a"Document Type Definition", or "dtd"

Uppsala May 2016 21/38

DTD

An XML is sometimes associated to a DTD thanks to an extra lineadded at the beginning of the XML.

(DTD declaration is on the second line)

Uppsala May 2016 22/38

DTD

Anatomy of a DTD

Uppsala May 2016 23/38

DTD

Example:

XML

DTD

Uppsala May 2016 24/38

DTD

VocabularyWhen an XML document respects the DTD we say that it is"valid"

Uppsala May 2016 25/38

Tools

Commands exist to check that your XML is well formed/validWell formed:xmllint -noout document.xmlValid:xmllint -noout -valid document.xml

Uppsala May 2016 26/38

XPath

DefinitionXPath, the XML Path Language, is a query language for selectingnodes from an XML document. XPath uses path expressions toselect nodes or node-sets in an XML document. These pathexpressions look very much like the expressions you see when youwork with a traditional computer file system.

Uppsala May 2016 27/38

XPath tool: xmlstarlet

Among other functions, you can select elements in your XML withxmlstarlet like this:xmlstarlet sel -t -v "XpathCommand" document.xml

Example:xmlstarlet sel -t -v "/Text_description/contents/w[2]/form" lemms.xml

Uppsala May 2016 28/38

XPath

XPath commands look like computer file system pathsCan you guess which output should give this XPath command?/Text_description/contents/w[2]/form

1 vara2 ·3 var4 Det var så lite.5 5

Uppsala May 2016 29/38

What XPath is used for?

XPath is useful for:Navigating in the XMLCreating XSLT

DefinitionXSL stands for EXtensible Stylesheet Language, and is a stylesheet language for XML documents. XSLT stands for XSLTransformations.

Uppsala May 2016 30/38

XSL

XSL is to XML what CSS is to HTML

Uppsala May 2016 31/38

XSL

Uppsala May 2016 32/38

XSL advantages and drawbacks

The XSL language manages:loops <xsl:for-each>conditions <xsl:if>sorting <xsl:sort>

this language is very wordyno regex (must use XSL 2.0)

Uppsala May 2016 33/38

What should you do when you get an XML and anXSLT?

If you want to display it in a browser: add this red line to the XMLand open the XML:

If you want to get the output directly, use a processor.xsltproc document.xml document.xsl > whatever

Uppsala May 2016 34/38

Conclusion

What experience will tell youEven if XML can (almost)always be handled by writing your ownprogram and regex... Handling XML tools will save you time. Lookin your favourite language documentation: Python, Java etc. thereis always a library for it. As any language XML, XPath, XSLT arelearned through practice!

Uppsala May 2016 35/38

Conclusion

What experience will tell youEven if XML can (almost)always be handled by writing your ownprogram and regex... Handling XML tools will save you time. Lookin your favourite language documentation: Python, Java etc. thereis always a library for it. As any language XML, XPath, XSLT arelearned through practice!

Uppsala May 2016 35/38

Conclusion

What experience will tell youEven if XML can (almost)always be handled by writing your ownprogram and regex... Handling XML tools will save you time. Lookin your favourite language documentation: Python, Java etc. thereis always a library for it. As any language XML, XPath, XSLT arelearned through practice!

Uppsala May 2016 35/38

Summary

Today we learnt about a mark up language called XML. Itallows to structure data/text.Since XML is hyper-flexible we sometimes need a DocumentType Definition (DTD) associated with itWe learned some part of the XML terms, mainly ‘well formed’and ‘valid’The XSL based on Xpath language is one solution fortransforming your XML into whatever other document: plaintext, html etc.

Uppsala May 2016 36/38

QUIZ

When an XML document is conform to the dtd we say thatit is:

1 "Well formed"2 "Conformist"3 "Valid"

Uppsala May 2016 37/38

References

The reference, short and clear, with quizzes and practical exercises:http://www.w3schools.com/xml/default.aspBelow an example of parser that can give results in XML format:http://nlp.stanford.edu:8080/corenlp/process

Uppsala May 2016 38/38

top related