XML

Marie Dubremetz
marie.dubremetz@lingfil.uu.se

Uppsala, May 2016

Presentation Plan

1 Introduction

2 XML Specificities and Motivations

3 XML: Vocabulary and Techniques

Uppsala May 2016 2/38

1 Introduction


Definition

A (document) markup language is a modern system for annotating a document in a way that is syntactically distinguishable from the text.

Examples?

- LaTeX: \textbf{•} \section{•}
- HTML: <i></i>
- XML: ...


Why? Historical reasons


Why?

The Internet is huge, diverse and heterogeneous. How, then, can we achieve:
- data transmission?
- standardization?
- easy manipulation?


2 XML Specificities and Motivations

XML Specificities

XML:
- is made for storing data
- is not made for displaying information
- lets you invent your own tags


Why is XML useful?

XML is useful because it lets you share highly compatible data between systems and over time.

It is often used in NLP


XML by-products

Because XML is flexible, it can serve as the basis for derived formats ("by-products"), for example:
- OWL
- MusicXML
- RSS

3 XML: Vocabulary and Techniques

Flexible does not mean rule-free

An XML document must respect some syntax rules (a * marks an ill-formed example):
- Elements must be properly nested:
  <book><title></title><author></author></book>
  *<book><title><author></title></author></book>
  Note: you can also create empty elements: <author/>
- XML is case sensitive:
  <Title></Title><author></author>
  *<Title></title><author></author>
- A root element is mandatory.
- Comments are written like this: <!-- My comment -->
- Attribute values go between quotation marks: <myTag myAttribute="0">

Vocabulary definition
When an XML document respects this syntax, we say that it is "well formed".
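The rules above can be tried out with any XML parser, since a parser rejects input that is not well formed. Here is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the course material.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Properly nested elements, with an empty element and a quoted attribute.
good = '<book id="1"><title></title><author/></book>'
# Overlapping elements: <title> is closed while <author> is still open.
bad_nesting = "<book><title><author></title></author></book>"
# Case mismatch between the opening and the closing tag.
bad_case = "<Title></title>"

for sample in (good, bad_nesting, bad_case):
    print(is_well_formed(sample), sample)
```

The parser only reports whether the text is well formed; it says nothing about whether the element names make sense for a given application.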


QUIZ

<?xml version="1.0"?>
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don’t forget me this weekend!</body>
</note>

Is this (above) a "well formed" XML document?
1 Yes
2 No


QUIZ

<?xml version="1.0"?>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don’t forget me this weekend!</body>

Is this a "well formed" XML document?
1 Yes
2 No


XML Structure

XML has what we call a tree structure.

[Figure: example of a tree structure]
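To make the tree structure concrete, the sketch below parses a small document and prints one line per element, indented by depth. The document and the helper name tree_lines are hypothetical and serve only as an illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, used only for illustration.
XML_TEXT = """<book>
  <title>XML Basics</title>
  <authors>
    <author>A. Writer</author>
    <author>B. Editor</author>
  </authors>
</book>"""


def tree_lines(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(tree_lines(child, depth + 1))
    return lines


root = ET.fromstring(XML_TEXT)
print("\n".join(tree_lines(root)))
```

Each level of indentation corresponds to one level of nesting: book is the root, title and authors are its children, and the two author elements are children of authors.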


Let’s practice

Can you give me the tree structure of this XML?

[Figure: XML document to analyse]


DTD

The structure of an XML document is described by a "Document Type Definition", or DTD.


An XML document is sometimes associated with a DTD through an extra line (the DOCTYPE declaration) added at the beginning of the document.

(The DTD declaration is on the second line.)
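As a rough illustration of such a declaration, the sketch below parses a document whose second line is a DOCTYPE declaration and reads that declaration back with Python's standard xml.dom.minidom module. The document and the file name note.dtd are assumptions made for the example. The parser here only reads the declaration; it does not validate the document against the DTD.

```python
from xml.dom import minidom

# Hypothetical document; "note.dtd" is an assumed file name.
XML_TEXT = """<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <to>Tove</to>
  <from>Jani</from>
</note>"""

document = minidom.parseString(XML_TEXT)
doctype = document.doctype

print(doctype.name)      # root element name given after <!DOCTYPE
print(doctype.systemId)  # DTD location given after SYSTEM
```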


Anatomy of a DTD


Example:

[Figure: an XML document and its DTD]


Vocabulary
When an XML document respects the DTD, we say that it is "valid".


Tools

Commands exist to check that your XML is well formed or valid.

Well formed:
xmllint -noout document.xml

Valid:
xmllint -noout -valid document.xml


XPath

Definition
XPath, the XML Path Language, is a query language for selecting nodes from an XML document. XPath uses path expressions to select nodes or node-sets in an XML document. These path expressions look very much like the expressions you see when you work with a traditional computer file system.


XPath tool: xmlstarlet

Among other functions, xmlstarlet lets you select elements in your XML like this:
xmlstarlet sel -t -v "XpathCommand" document.xml

Example:
xmlstarlet sel -t -v "/Text_description/contents/w[2]/form" lemms.xml
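A similar selection can be sketched in Python with the standard xml.etree.ElementTree module, which supports a limited subset of XPath. The document below is hypothetical (the real lemms.xml is not reproduced here, and the lemma element is an assumption). Because the parsed root already is Text_description, the path is written relative to it rather than starting with a slash.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the real lemms.xml is not reproduced here.
XML_TEXT = """<Text_description>
  <contents>
    <w><form>The</form><lemma>the</lemma></w>
    <w><form>cats</form><lemma>cat</lemma></w>
    <w><form>sleep</form><lemma>sleep</lemma></w>
  </contents>
</Text_description>"""

root = ET.fromstring(XML_TEXT)

# root is already <Text_description>, so the path is relative to it.
# w[2] selects the second <w> child (positions start at 1).
second_form = root.find("contents/w[2]/form")
print(second_form.text)

# findall returns every match: here, the form of each word.
all_forms = [form.text for form in root.findall("contents/w/form")]
print(all_forms)
```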


XPath

XPath commands look like computer file system paths.
Can you guess what output this XPath command should give?
/Text_description/contents/w[2]/form

1 vara
2 ·
3 var
4 Det var så lite.
5 5


What is XPath used for?

XPath is useful for:
- navigating in the XML
- creating XSLT

Definition
XSL stands for EXtensible Stylesheet Language, and is a stylesheet language for XML documents. XSLT stands for XSL Transformations.


XSL

XSL is to XML what CSS is to HTML


XSL advantages and drawbacks

The XSL language handles:
- loops: <xsl:for-each>
- conditions: <xsl:if>
- sorting: <xsl:sort>

Drawbacks:
- the language is very wordy
- no regular expressions (unless you use XSLT 2.0)
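For readers who have not seen XSLT yet, the three constructs above can be pictured through a plain-Python analogue. The sketch below is not XSLT; it only mimics, with the standard xml.etree.ElementTree module, what a stylesheet combining for-each, if and sort would express. The catalogue document and its year attribute are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue, used only for illustration.
XML_TEXT = """<catalogue>
  <book year="2011"><title>Parsing</title></book>
  <book year="1999"><title>Markup</title></book>
  <book year="2005"><title>Corpora</title></book>
</catalogue>"""

root = ET.fromstring(XML_TEXT)

lines = []
# Sorting (compare <xsl:sort>): order the books by title.
books = sorted(root.findall("book"), key=lambda b: b.findtext("title"))
# Loop (compare <xsl:for-each>): visit each selected book.
for book in books:
    # Condition (compare <xsl:if>): keep only books from 2000 onwards.
    if int(book.get("year")) >= 2000:
        lines.append(f"{book.findtext('title')} ({book.get('year')})")

print("\n".join(lines))
```

The result is a plain-text listing built from the XML, which is the kind of output an XSLT transformation typically produces.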


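What should you do when you get an XML file and an XSLT file?

If you want to display the result in a browser: add a stylesheet declaration (an xml-stylesheet processing instruction that points to the XSLT file) at the top of the XML, then open the XML in the browser.

If you want to get the output directly, use a processor. With xsltproc, the stylesheet comes first:
xsltproc document.xsl document.xml > whatever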


Conclusion

What experience will tell you
Even if XML can (almost) always be handled by writing your own program and regular expressions, using XML tools will save you time. Look in the documentation of your favourite language (Python, Java, etc.): there is always a library for it. Like any language, XML, XPath and XSLT are learned through practice!


Summary

- Today we learnt about a markup language called XML. It lets you structure data and text.
- Since XML is hyper-flexible, we sometimes need a Document Type Definition (DTD) associated with it.
- We learned some XML terminology, mainly "well formed" and "valid".
- XSL, which is based on the XPath language, is one solution for transforming your XML into another kind of document: plain text, HTML, etc.


QUIZ

When an XML document conforms to the DTD, we say that it is:

1 "Well formed"
2 "Conformist"
3 "Valid"


References

The reference, short and clear, with quizzes and practical exercises:
http://www.w3schools.com/xml/default.asp

Below, an example of a parser that can give results in XML format:
http://nlp.stanford.edu:8080/corenlp/process
