XML <Part> I </Part>

Upload: melvin-malone

Post on 12-Jan-2016


Page 1: XML I. XML Meta-language HTML and XML, which are applications based on SGML (standard generalized markup language), use tags (markup) to represent information

XML

<Part> I </Part>

XML Meta-language

HTML and XML, which are both based on SGML (Standard Generalized Markup Language), use tags (markup) to represent information to both humans and machines, and are meant to work on any device and system

In contrast to HTML (hypertext markup language) which provides a fixed set of predefined formatting tags (e.g., for font face, size, and color; lists) to display a document, XML (extensible markup language) allows users to define their own tags to structure a document

XML separates a document’s content from its format through structural information by describing all parts of the information (i.e., elements), defining their relationships, and constraining the values that they can take

Syntax

While some HTML tags do not have to have an end tag (e.g., the paragraph <p> tag does not need a </p> tag), every XML start tag must have a matching end tag (or be written as an empty-element tag such as <br/>)

In addition, XML tags must be properly nested: an element that is opened inside another element must be closed before the enclosing element is closed

For example, the following is not well-formed:

<River> <Name> Mississippi </River></Name>

The well-formed version is: <River> <Name> Mississippi </Name> </River>

Root element

All elements of an XML document must be contained in a root element, and the whole document is structured as a tree

Each nested element is a child of its enclosing (parent) element, which is in turn a child of its own parent, and so on up to the root

Names of XML tags are case sensitive, i.e., ocean and Ocean are considered to be two different tags.

Aquifer is the root element

<?xml version="1.0" encoding="UTF-8"?>
<Aquifer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="file:/C:/...XML_schemas/Aquifer.xsd">
  <name>Floridan</name>
  <type>Confined</type>
</Aquifer>

Markup Languages

XML is a standard meta-language in the sense that it allows a community of users to create a markup language for their domain of interest (e.g., hydrogeology markup language; structural geology markup language)

These markup languages are domain-specific vocabularies that include terms (e.g., Fault, Mineral) put into a structure based on their relationships, as understood by the community, and permissible values

One main use of XML, in addition to making it possible to make domain-specific markup languages, is automatic data exchange among different applications

XML Applications

Whereas HTML is mostly about appearance for web site publishing, XML concentrates on structuring the content of documents

There are two types of XML applications based on purpose: document and data

The document application manipulates information for human consumption (e.g., for publishing)

The data application manipulates information for automatic software processing

Because XML expresses the structure of a document, it can be automatically converted to different formats, and delivered via a variety of media

Standards

XML has many so-called companion standards. These standards include XSL (Extensible Stylesheet Language) and CSS (Cascading Style Sheets), two style sheet languages that can be used with XML; XSL (through XSLT) also allows conversion of XML code into HTML or other formats

These languages are used for rendering XML on different media (screen, paper)

Document Object Model (DOM) and Simple API for XML (SAX) are APIs through which browsers, editors, and other software access XML documents

Other standards include XLink and XPointer, which allow links among documents and to parts within them

The Namespaces standard qualifies element names with a URI, allowing reuse of existing elements in new documents without naming conflicts

XML Code

XML code can be written in any text editor, but it helps to use an XML editor such as XML Spy or Oxygen. These editors also allow you to write style sheets and use other standards

XMetal, which hides the tags and works like a word processor, is the easiest one to use

Other editors include XML Notepad which is freely available from msdn.microsoft.com, and XML Pro (www.vervet.com)

To view XML code on the Web you need a browser that supports XML; many browsers do

As a standard, XML allows exchanging and publishing information by providing mechanisms to define the structure (syntax) of the content

Note that XML is about syntax, and does not provide any semantics about the document. That’s why RDF, RDFS, and OWL were created to convey the meaning of the contents

Unicode

XML uses the Unicode standard for characters in a document. Unicode assigns a number (code point) to the characters of essentially all human (natural) languages; encodings such as UTF-8 (one to four 8-bit bytes per character) and UTF-16 (one or two 16-bit units per character) store those code points, and every XML parser must support both

Unicode covers far more than the limited 256 characters provided by single-byte character sets, such as the Windows default character set or Latin-1, which only use 8 bits per character

Unicode is very important for information globalization, for example to write the name of an element (e.g., ocean) in an XML document in different natural languages (e.g., French, Chinese)

Latin-1 character set

Encodings other than UTF-8 and UTF-16 must be named in the XML declaration

For example, the second attribute in the following XML declaration (what is between <?xml and ?>) allows the use of the Latin-1 character set for European languages:

<?xml version="1.0" encoding="ISO-8859-1"?>

This allows using accents (e.g., î), among other special characters. A character reference can also be used to insert any Unicode character, for example to write foreign words with accents

Benoît Mandelbrot’s name can be written as:

<name> Beno&#238;t Mandelbrot </name>

‘238’ is the decimal Unicode code point for ‘î’ (ASCII itself stops at 127), which can be checked in a character map such as the one in MS Word
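
As a small sketch of the character reference described above: the same letter can be written with a decimal or a hexadecimal reference, since 238 in decimal is EE in hexadecimal. The names wrapper element is only an assumption, added so that the fragment has a single root.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- 'names' is a hypothetical wrapper element -->
<names>
  <!-- decimal character reference -->
  <name>Beno&#238;t Mandelbrot</name>
  <!-- hexadecimal character reference for the same character -->
  <name>Beno&#xEE;t Mandelbrot</name>
</names>
```

A parser hands both name elements to the application with identical text.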

Predefined entities

Predefined entities are used to escape the delimiters in elements or attributes

XML has five predefined entities:
&lt; for the ‘less than’ sign, <
&amp; for the ampersand sign, &
&gt; for the ‘greater than’ sign, >
&apos; for the apostrophe sign, '
&quot; for the quotation mark, "

The XML parser reading a document with these entities substitutes the value for each entity.

An XML parser recognizes the start tag of any element with the ‘<’ and its end with the ‘>’ character.

It also recognizes the end tag by the ‘</’ and the ‘>’ characters

Entities cont’d

Because the < character is reserved for markup (as is &, which starts an entity reference), we cannot use it in the content of elements without using the entities; > is normally escaped as well

This means that the ‘less than’ and ‘greater than’ characters should appear in the content of an XML document as the entities &lt; and &gt;, respectively

Forward slash does not create a problem on its own, but in ‘</’ the less-than sign must still be escaped, giving &lt;/

For example the content of the element: <equation> x < y </equation> will cause an error because the parser will interpret the ‘<’ sign after x as the beginning of a new tag, which is not closed

The correct way of expressing it is: <equation> x&lt;y </equation>
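
The escaping rules above can be sketched in one small document that uses all five predefined entities, both in element content and in attribute values. The equations wrapper and the note attribute are hypothetical names chosen for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- 'equations' and the 'note' attribute are hypothetical names -->
<equations>
  <!-- less-than, ampersand and greater-than escaped in an attribute and in content -->
  <equation note="x &lt; y &amp; y &gt; 0">x &lt; y</equation>
  <!-- a double quote needs no escaping inside a single-quoted attribute value -->
  <equation note='uses "double" quotes inside'>a &amp;&amp; b</equation>
  <!-- apostrophe and quotation mark written as entities -->
  <equation note="the author&apos;s &quot;quoted&quot; remark">y &gt; 0</equation>
</equations>
```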

We can declare our own entities in the document type declaration (DTD) using the following construct: <!ENTITY entityName "substitute">, where ‘substitute’ is the replacement content which the parser puts in place of the entity when it processes the document

We can reference the entity in a document with ‘&entityName;’, just like the predefined entities

&blurb; and &cc; are examples of references to user-defined entities

Notice that the ampersand (&) and semicolon (;) are used by the parser to mark out the entity name
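
A minimal sketch of how the blurb and cc entities might be declared and used, assuming the declarations are placed in an internal DTD subset. The replacement texts and the Report, Summary, and Footer element names are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The replacement texts and the element names are hypothetical -->
<!DOCTYPE Report [
  <!ENTITY blurb "Data compiled for classroom use only">
  <!ENTITY cc "Copyright held by the course instructor">
]>
<Report>
  <!-- the parser replaces each reference with the declared text -->
  <Summary>&blurb;</Summary>
  <Footer>&cc;</Footer>
</Report>
```

Changing the text in one ENTITY declaration changes it everywhere the entity is referenced.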

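xml:space and xml:lang

Applications that process XML may collapse or discard white space in the content of an element. The xml:space attribute, which can be set to "preserve", signals that the white space should be kept as written (similar in effect to HTML’s <pre> tag). It has nothing to do with names: XML never allows a space inside an element or attribute name

xml:lang allows setting the language for the content of an element and everything nested in it, for example

xml:lang="en-US" specifies the language to be American English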
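
The two reserved attributes can be sketched together as follows. The Glossary, Term, and Layout element names and their content are assumptions made for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- 'Glossary', 'Term' and 'Layout' are hypothetical element names -->
<Glossary xml:lang="en-US">
  <Term>ocean</Term>
  <!-- xml:lang on a nested element overrides the inherited value -->
  <Term xml:lang="fr">océan</Term>
  <!-- xml:space asks applications to keep the white space as written -->
  <Layout xml:space="preserve">depth (m)    age (Ma)
120          2.6</Layout>
</Glossary>
```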

Element Structure

XML documents are text, and include markup, which is enclosed in angle brackets (<>), and character data, which lie between the markup, e.g., <Country> France </Country>

Here, ‘France’ is the content of an instance of the Country element. Together, the markup (tags) and the element content constitute the element

The content of the element (e.g., France) is between the start tag (e.g., <Country>) and the end tag, which starts with ‘</’, e.g., </Country>

The start tag gives a name (generic identifier) and defines the element type (e.g., Country). Descriptive tag names (e.g., Rock, River) may provide informal semantics to humans, but not to software

Element content

Elements can have the following types of content:
element content: i.e., other sub-elements,
character data: i.e., text,
mixed content: text and other sub-elements,
and no text or element

If an element has no text or element content, it is an empty element, and can be written as a single tag with the slash at the end (e.g., <Formula/>) instead of a start tag followed by an end tag

Information in an empty element is carried in its attributes

Empty elements can be used for things that may not have content, for example, a document for minerals may have an element called formula, which may not be known by users
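
The four kinds of content listed above can be sketched in a single fragment. All element and attribute names, including the status attribute, are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- All element and attribute names here are hypothetical -->
<Minerals>
  <!-- element content: only sub-elements -->
  <Mineral>
    <!-- character data: only text -->
    <Name>quartz</Name>
    <!-- mixed content: text and a sub-element together -->
    <Description>Commonly found in <Rock>granite</Rock> and sandstone</Description>
    <!-- empty element: no content, information carried by an attribute -->
    <Formula status="unknown"/>
  </Mineral>
</Minerals>
```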

Naming rule

Valid element names, which are case sensitive, must start with a letter or underscore, and can include letters (including those with accents), numbers, underscore, dot, or hyphen, but no spacing

The following are all valid XML names: <bodyOfWater>, <Ductile-deformation>, <Fenêtre>, and <magma2solid>

Examples of invalid names are: <2ndRidge>, <strike&dip>, <XMLdoc>, <deformation structure>, and <y=x+6>

The string ‘XML’, in any combination of upper and lower case, is reserved and should not start a name, and characters other than the ones mentioned above cannot be included in the name

Colon (:), however, is allowed but should not be used, because namespaces use it to separate a prefix from a name (see below)

In this course, for the sake of consistency and distinction, the names of elements are in upper CamelCase, whereas those of the attributes are in lower camelCase

Tree structure

An element can have child elements nested in it in a tree structure

The topmost element in a document is called the root element, which is the ancestor of all the other elements, i.e., all elements must be nested in this mandatory element

For example, the ‘Aquifer’ element declared in the following schema is the root element of the corresponding instance document

The content between ‘<!--’ and ‘-->’ is a comment, and is written for people, and is ignored by the XML parser

Aquifer Element

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Aquifer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name"/>
        <xs:element name="type"/>
        <!-- the remaining sub-elements not shown for brevity -->
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

Notice that the ‘name’ and ‘type’ elements defined for the Aquifer element could instead be defined as attributes because they have a simple structure

Attributes

Whereas all XML documents have at least one element (the root element), the elements can have zero or more attributes

Data in an XML document may be stored in the elements and/or the attributes of these elements. Attributes are characteristics of the elements that add more information to the content of the elements without modifying the structure of the document

Always ask if an attribute can be an element or a part of an element
If it is atomic, i.e., cannot be broken into smaller information fragments, then it should remain an attribute
It should be an element if it could have its own sub-elements or attributes

For example, it does not make sense to make the structure of a rock an attribute, because structure is a complex entity, which can have many elements with sub-elements and attributes.

The density of a mineral could be an attribute if it only takes one value

Attributes are given names and values in the start tag of the element; the values are enclosed in single or double quotation marks

For example, the Mineral element may have a Boolean monomineralic attribute, which can take a true or false value

Notice the required (single or double) quotation marks that enclose the attribute values (here, the names) in the W3C XSD schema

If an element has more than one attribute, they must be uniquely named, and separated with white space in the start tag

Empty elements can have attributes

Mineral element

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Mineral">
    <xs:complexType>
      <xs:attribute name="type"/>
      <xs:attribute name="name"/>
      <xs:attribute name="monomineralic" type="xs:boolean"/>
      <!-- more attributes -->
    </xs:complexType>
  </xs:element>
</xs:schema>
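
An instance document that conforms to this schema could look like the following sketch. The attribute names come from the schema; the values are hypothetical. Because the complex type declares no sub-elements, Mineral is written as an empty element that carries all its information in attributes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Attribute values are hypothetical; the attribute names follow the Mineral schema -->
<Mineral type="silicate" name="quartz" monomineralic="true"/>
```

Single quotation marks would work equally well, e.g., name='quartz'.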

Structure of XML documents

XML instance documents are text files with the .xml extension

W3C XSD schemas have a .xsd extension

The instance document normally starts with an XML declaration

The declaration is put between <?xml and ?>, and has three attributes:
version, which is required and usually has the "1.0" value,
encoding, which optionally names the character encoding of the document, with a default of "UTF-8" (or UTF-16 when the file starts with a byte order mark), and
standalone, which is optional and takes the value "yes" or "no" (lower case); "yes" declares that the document does not depend on markup declarations in an external DTD

Standalone’s value is treated as "no" when it is omitted and an external DTD is present, and it can be set to "no" explicitly, for example:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Namespace

XML allows different communities of scientists (e.g., oceanography and atmospheric science) to independently develop their own markup languages

More often than not, there is a need to integrate these autonomously developed vocabularies into other applications

It is very common for two markup languages to contain elements that have the same name, but constructed in different element structures

For example, suppose that the mineralogy and sedimentology vocabularies, developed separately by the Mineralogy and Sedimentology communities, both include an element named ‘Mineral’, albeit with a different structure in each, as given in the following XML document code snippet

Mineral instance document

<!-- Mineral element in the Mineralogy domain -->
<Mineral>
  <Name>orthoclase</Name>
  <Type>silicate</Type>
  <Composition>KAlSi3O8</Composition>
  <Color>flesh</Color>
  <Hardness>6</Hardness>
</Mineral>

<!-- Mineral element in the Sedimentology domain -->
<Mineral>
  <Name>quartz</Name>
  <Type>grain</Type>
  <Composition>silicate</Composition>
  <Shape>angular</Shape>
</Mineral>

Avoiding name conflict

The two ‘Mineral’ elements have different sets of sub-elements, and mean different things for the human users (not to software, though!)

This is not a problem as long as the two ‘Mineral’ elements are only used locally in their respective domains

If the two vocabularies are shared by an application, there will be a name conflict because of the two differently structured ‘Mineral’ elements

The XML parser and processor would not know which is which, and may throw an error

Namespace prevents this kind of name collision by assigning the elements with the same name (e.g., Mineral), which belong to different communities, to different URIs that reference these communities

The namespace tells the parser that the identically named elements belong to different vocabularies

Namespace prefix

Declaration of a namespace is done by the xmlns attribute, which allows choosing both an optional prefix, and a URI for the namespace

The prefix and the URI are put in the outermost element where we want to use the namespace, using the format: <MyElement xmlns:prefix="URI">, e.g.,

<Mineral xmlns:min="http://www.geology.org/minerlogy">

The MyElement, in this case, is the one for which we are declaring the namespace, and the URI is the identifier for the namespace

Default namespace

If the optional prefix is not provided, then we have a default namespace

The syntax for default namespace is as follows: <MyElement xmlns="URI">, for example: <Mineral xmlns="http://www.geology.org/minerlogy">

In such a case, MyElement and all unprefixed elements nested in it are members of the default namespace. If we intend to share a document with others, it is a good idea to declare a default namespace

The URI identifier in this case can be used later by other people if they want to integrate it with their own vocabulary. They can later assign a prefix for this URI

Assume that we want to integrate the markups of mineralogy and sedimentology, and both have an element called Mineral. In this XML document, we are declaring the ‘min’ prefix for the mineralogy, and ‘sed’ prefix for the sedimentology markup:

<?xml version="1.0"?>
<!-- assume 'Lithology' is the root element -->
<Lithology xmlns:min="http://www.geology.org/minerlogy"
           xmlns:sed="http://www.geology.org/sedimentology">
  <min:Mineral>
    <name>orthoclase</name>
    <type>silicate</type>
    <composition>KAlSi3O8</composition>
    <color>flesh</color>
    <hardness>6</hardness>
  </min:Mineral>
  <sed:Mineral>
    <name>quartz</name>
    <type>grain</type>
    <composition>silicate</composition>
    <shape>angular</shape>
  </sed:Mineral>
</Lithology>
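
The default-namespace form described earlier can also be combined with a prefix. In the following sketch, which is only one possible arrangement, the mineralogy URI is assumed to be the default namespace and sedimentology keeps its ‘sed’ prefix. Note that Lithology itself and every unprefixed element then belong to the mineralogy namespace, while the sedimentology sub-elements are prefixed explicitly.

```xml
<?xml version="1.0"?>
<!-- Sketch only: mineralogy is made the default namespace, sedimentology keeps a prefix -->
<Lithology xmlns="http://www.geology.org/minerlogy"
           xmlns:sed="http://www.geology.org/sedimentology">
  <!-- unprefixed elements fall into the default (mineralogy) namespace -->
  <Mineral>
    <name>orthoclase</name>
    <type>silicate</type>
  </Mineral>
  <!-- prefixed elements fall into the sedimentology namespace -->
  <sed:Mineral>
    <sed:name>quartz</sed:name>
    <sed:type>grain</sed:type>
  </sed:Mineral>
</Lithology>
```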