structured data


Upload: willem

Post on 24-Feb-2016


Page 1: Structured Data

Structured Data

1. HTML
2. XML
3. XHTML
4. JSON
5. XML Schema

Page 2: Structured Data

Structured Data
• Machine-processable data needs to be structured
• There are many examples
• Properties files:

    host=example.com
    port=8080
    protocol=https

• Comma Separated Values:

    host,port,protocol
    example.com,8080,https

• These are examples of ‘flat files’
• Flat files make it hard to model composite structures
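As a rough illustration of how little structure these flat files carry, the sketch below reads both samples from the slide with the Python standard library. The helper names `parse_properties` and `parse_csv` are made up for this example; every value comes back as a plain string, and neither format can express nesting.

```python
import csv
import io

PROPERTIES_TEXT = """\
host=example.com
port=8080
protocol=https
"""

CSV_TEXT = """\
host,port,protocol
example.com,8080,https
"""


def parse_properties(text):
    """Parse key=value lines into a dict, skipping blank lines and # comments."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


def parse_csv(text):
    """Parse CSV text with a header row into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


props = parse_properties(PROPERTIES_TEXT)
rows = parse_csv(CSV_TEXT)
print(props)
print(rows)
```

Both results are flat key/value mappings; a composite value such as a publisher with its own name and URL has no natural place in either format.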

Page 3: Structured Data

HTML and XML
• Derivatives of Standard Generalized Markup Language (SGML).
• Offer a machine-readable, yet machine-independent, means of conveying information
• Use the angle bracket syntax (<>) to structure the document.
• Based on a tree structure:

    <html>
      <head></head>
      <body>
        <p> hello world </p>
      </body>
    </html>

(Diagram labels: <html> is the root; <head> and <body> are siblings; <p> is a child of <body>.)
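The tree structure can be made visible by parsing the slide's document and walking it. This is a minimal sketch using Python's `xml.etree.ElementTree`; the `describe` helper is a name invented for this example.

```python
import xml.etree.ElementTree as ET

DOC = "<html><head></head><body><p> hello world </p></body></html>"


def describe(element, depth=0):
    """Return one indented line per element, visiting the tree depth-first."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
outline = describe(root)
print("\n".join(outline))
```

The indentation in the printed outline mirrors the nesting: `head` and `body` sit at the same depth (siblings), and `p` sits one level below `body` (child).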

Page 4: Structured Data

Elements and Attributes
• Elements are structural
• Attributes qualify elements

    <html>
      <head></head>
      <body bgcolor="red">
        <p> hello world </p>
      </body>
    </html>

(Diagram labels: bgcolor="red" is an attribute of the <body> element.)

Page 5: Structured Data

Hypertext Markup Language (HTML)

• Its primary purpose is to convey information to a browser for human consumption:
  – <p>, <b>, <i>, <pre> etc.
• It does contain other tags that are not presentational.
• Like one for metadata:
  – <meta>
• And ones that are structural:
  – e.g. <head>, <body>, <div>, <span>
• And some that are sort of in between:
  – e.g. <ol>, <ul>, <h1>, <title>
• HTML can embed information:
  – e.g. <img>, <object>
• It can also contain style and script content in the head:
  – <style>, <script>
• Most importantly, it can link to other resources via the anchor tag and href attribute:
  – e.g. <a href="http://example.com/otherpage.html">
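Because links are carried by the anchor tag's href attribute, a program can follow them without understanding anything else on the page. A minimal sketch with Python's standard `html.parser`; the `LinkCollector` class and the `PAGE` string are invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


PAGE = '<p>See <a href="http://example.com/otherpage.html">the other page</a>.</p>'

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
```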

Page 6: Structured Data

HTML
• HTML example describing a book

    <h1>The Cat in the Hat</h1>
    <br>
    <p>by Dr Seuss</p>
    <ul>
      <li>Publisher: HarperCollins</li>
      <li>Genre: Children’s Fiction</li>
      <li>Year: 2003</li>
      <li>ISBN: 0-00-715853</li>
    </ul>
    <br>
    visit the website <a href="http://harp.co.uk">here</a>

Page 7: Structured Data

HTML
• The main limitations of HTML are:
  – Fixed set of tags
  – Focus on presentation
    • Like the Web, it is primarily for human consumption
  – Not all HTML is ‘well-formed’, i.e. it breaks the tree structure
    • The classic case is orphan <br> tags. Strictly speaking, every opening tag must either have a matching closing tag or be written as an empty tag (<br/>).
    • During the browser wars, mostly between Microsoft and Netscape, browsers became very forgiving of invalid markup in order to recruit users.
    • This is just about OK when dealing with a fixed set of presentational tags, free market economics permitting
    • But it is not sustainable and not good for machine parsing
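The difference between a forgiving HTML parser and a strict XML parser can be seen on a one-line snippet with an orphan <br>. This is a sketch using only the Python standard library; `TagLogger`, `is_well_formed` and the sample strings are names made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

SNIPPET = "<p>line one<br>line two</p>"          # orphan <br>, common in HTML
FIXED = "<p>line one<br/>line two</p>"           # empty-tag form, well-formed


class TagLogger(HTMLParser):
    """Record every opening tag; html.parser does not reject unclosed tags."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def is_well_formed(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


logger = TagLogger()
logger.feed(SNIPPET)
print(logger.tags)               # the HTML parser carries on regardless
print(is_well_formed(SNIPPET))   # the XML parser refuses the orphan <br>
print(is_well_formed(FIXED))
```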

Page 8: Structured Data

Extensible Markup Language (XML)

• XML is (e)xtensible.
  – You can create your own tags, which means
  – Tags can be understood in semantic terms:
    • e.g. <book> contains <author>
  – XML MUST be well-formed (no structural inconsistencies like an unclosed <br>)
  – Validation against a Document Type Definition (DTD), XML Schema or RELAX NG document is easier because it is well-formed.
    • These define what a particular document can contain, e.g. a book element MUST contain >= 1 author elements
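To give a feel for what such a rule means in practice, here is a hand-written check of the "book MUST contain >= 1 author" constraint. It is not DTD, XML Schema or RELAX NG validation, which the Python standard library does not provide; the `check_book` function and the two sample documents are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET


def check_book(xml_text):
    """Return a list of problems; an empty list means the rule is satisfied."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        # A document that is not well-formed cannot be checked any further.
        return [f"not well-formed: {error}"]
    problems = []
    if root.tag != "book":
        problems.append(f"root element is <{root.tag}>, expected <book>")
    elif not root.findall("author"):
        problems.append("<book> must contain at least one <author>")
    return problems


good = "<book><title>The Cat in the Hat</title><author>Dr Seuss</author></book>"
bad = "<book><title>The Cat in the Hat</title></book>"

print(check_book(good))
print(check_book(bad))
```

A real schema language expresses the same rule declaratively, so the check does not have to be re-coded for every document type.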

Page 9: Structured Data

XML
• XML example of a book

    <?xml version="1.0"?>
    <book>
      <title>The Cat in the Hat</title>
      <author>Dr Seuss</author>
      <isbn>0-00-715853</isbn>
      <genre>Children’s Fiction</genre>
      <published>2003</published>
      <publisher>
        <name>HarperCollins</name>
        <url>http://harp.co.uk</url>
      </publisher>
    </book>
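Unlike a flat file, the book document nests a publisher inside the book. The sketch below turns the example (with the isbn element properly closed) into nested Python dictionaries; `to_dict` is a helper invented here and it assumes no repeated child tags and no attributes.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<?xml version="1.0"?>
<book>
  <title>The Cat in the Hat</title>
  <author>Dr Seuss</author>
  <isbn>0-00-715853</isbn>
  <genre>Children’s Fiction</genre>
  <published>2003</published>
  <publisher>
    <name>HarperCollins</name>
    <url>http://harp.co.uk</url>
  </publisher>
</book>"""


def to_dict(element):
    """Leaf elements become strings; elements with children become dicts."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    return {child.tag: to_dict(child) for child in children}


book = to_dict(ET.fromstring(BOOK_XML))
print(book["title"])
print(book["publisher"]["name"])
```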

Page 10: Structured Data

XML Pros
• Plain text
  – Human readable
  – Create/edit in standard text editor (if you really want to)
• Self-describing, structured data
  – Extensible tag language
  – Machine readable
  – Can be validated against DTDs and Schema
• Presentation independent
  – Unlike HTML
  – Format to other languages using transformations (e.g. XSLT)
• Programming language independent
  – Java, C, C++, Visual Basic, Perl…
• Simple to parse
• Widely used in many domains and for many purposes

Page 11: Structured Data

XML Cons

• The main limitations of XML are:
  – Verbose way of describing data
  – How do you include binary data (e.g. images)?
    • (work in progress and not ubiquitously supported)
  – A proliferation of DTD and Schema types because anyone can create their own tags
    • Lots of processing time for each new XML doc and DTD/Schema you come across
    • New software components to understand the new XML docs (their semantics, not structure)
    • How do I know if your <author> tag means the same as my <author> tag?

Page 12: Structured Data

XML Namespaces
• This last issue is addressed through namespaces
  – Allows a tag to be qualified by a URI:

    <a:author xmlns:a="http://andrew/namespace">
    <s:author xmlns:s="http://sue/namespace">

(Diagram labels: in xmlns:a="http://andrew/namespace", a is the prefix and the URI is the namespace; the xmlns declaration is the binding between them.)

• Now I can tell the difference between the two author tags :-)
• But the XML is more complicated :-(
• And what happens if I change the definition of my author tag?
• I suppose I better change the namespace:

    <a:author xmlns:a="http://andrew/namespace/v1">

• That’s better :-)
• But now every client that understood the previous namespace is broken :-(
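A parser sees the two author tags as different names because it expands each prefix to its namespace URI. The sketch below shows this with `xml.etree.ElementTree`; the wrapping `library` element and the text content are invented so the two tags can live in one document.

```python
import xml.etree.ElementTree as ET

DOC = """<library xmlns:a="http://andrew/namespace"
                  xmlns:s="http://sue/namespace">
  <a:author>Andrew's kind of author</a:author>
  <s:author>Sue's kind of author</s:author>
</library>"""

root = ET.fromstring(DOC)

# ElementTree stores each tag as {namespace-uri}local-name.
tags = [child.tag for child in root]
print(tags)

# Lookups use a prefix-to-URI map; the prefixes chosen here are local to
# this code and need not match the ones used in the document.
NS = {"andrew": "http://andrew/namespace", "sue": "http://sue/namespace"}
andrew_author = root.find("andrew:author", NS).text
sue_author = root.find("sue:author", NS).text
print(andrew_author, "|", sue_author)
```

The same mechanism explains the versioning problem: a client looking for `{http://andrew/namespace}author` finds nothing once the document switches to `http://andrew/namespace/v1`.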

Page 13: Structured Data

RDF XML example

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:foaf="http://xmlns.com/foaf/0.1/">
      <foaf:Person rdf:about="#AL">
        <foaf:name>Archibald Leach</foaf:name>
        <foaf:mbox_sha1sum>cf2342293...</foaf:mbox_sha1sum>
        <foaf:knows>
          <foaf:Person>
            <foaf:name>Katharine Hepburn</foaf:name>
          </foaf:Person>
        </foaf:knows>
      </foaf:Person>
    </rdf:RDF>

Page 14: Structured Data

XHTML
• In between HTML and XML
  – It is valid HTML and valid XML
• MUST be well-formed.
• Fixed set of tags
  – Makes use of HTML non-presentational tags.
  – Defers presentational concerns completely to Cascading Style Sheets (CSS)
    • Instead uses element attributes to inject presentational hints to the CSS:

    <div class="my-important-type">I’m important</div>

(Diagram label: class="my-important-type" is the class attribute.)

Page 15: Structured Data

Cascading Style Sheets (CSS)
• A rendering language that goes in the head of an HTML page
  – Property based
    • element-type {presentation-key: value}
• CSS allows for extensibility!
  – I can define a class, and define rendering hints to the browser for that class:

    <style type="text/css">
      .my-important-type {color: red}
    </style>

  And in the document:

    <div class="my-important-type">Hey wait!</div>

• Hey, wait!
• At the same time as defining rendering hints to the browser, I’m also classifying an element in the document.
• Perhaps I can use this to support semantic information, not just rendering information
• So I could call my class .book and have elements inside it like .title and .author. Hmm…

Page 16: Structured Data

XHTML example

    <html>
    <head>
      <title>My Book</title>
    </head>
    <body>
      <div class="book">
        <h1 class="title">The Cat in the Hat</h1>
        <p>by <span class="author">Dr Seuss</span></p>
        <ul>
          <li>Publisher: <span class="pub">HarperCollins</span></li>
          <li>Genre: <span class="genre">Children’s Fiction</span></li>
          <li>Year: <span class="year">2003</span></li>
          <li>ISBN: <span class="isbn">0-00-715853</span></li>
        </ul>
      </div>
      <p>visit the website at <a href="http://harp.co.uk" class="url" title="http://harp.co.uk">here</a></p>
    </body>
    </html>

Page 17: Structured Data

XHTML with some CSS
• Here’s what it looks like in a browser with a bit of CSS in the head of the HTML page:

[Browser screenshot]

• The important thing to take away here is that the data has not been lost through rendering.
• It looks nice for a human, but a machine can still extract the book properties
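Extracting the book properties back out can be sketched as follows. Because XHTML is well-formed, an ordinary XML parser is enough. The markup is a trimmed copy of the book div from the XHTML example, with the isbn span properly closed; the `extract` function and the `FIELDS` tuple are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

XHTML = """<div class="book">
  <h1 class="title">The Cat in the Hat</h1>
  <p>by <span class="author">Dr Seuss</span></p>
  <ul>
    <li>Publisher: <span class="pub">HarperCollins</span></li>
    <li>Genre: <span class="genre">Children’s Fiction</span></li>
    <li>Year: <span class="year">2003</span></li>
    <li>ISBN: <span class="isbn">0-00-715853</span></li>
  </ul>
</div>"""

# Class names that carry book properties in the example markup.
FIELDS = ("title", "author", "pub", "genre", "year", "isbn")


def extract(xhtml, fields=FIELDS):
    """Map each known class name to the text of the element carrying it."""
    found = {}
    for element in ET.fromstring(xhtml).iter():
        css_class = element.get("class")
        if css_class in fields:
            found[css_class] = (element.text or "").strip()
    return found


book = extract(XHTML)
print(book)
```

The labels meant for humans ("Publisher:", "Genre:") are ignored; only the class attributes are used to find the data.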

Page 18: Structured Data

HTML 5
• Builds on HTML 4
• A set of features, rather than a monolithic spec.
• Not all browsers support all features yet.
• HTML 5 can also be written in an XML syntax (XHTML), which MUST be well-formed; the ordinary HTML syntax does not require it
• Some core features:
  – Canvas – drawing area
  – Video – embed directly – no need for plugins
  – Local storage
  – Multi-threaded JavaScript (Web Workers)
  – Geolocation
  – Semantic tags – section, header, footer etc.
  – Micro data – embedded semantic metadata, e.g. licensing, vCards and your own vocabs.

Page 19: Structured Data

HTML 5
• Micro data – embedded semantic metadata, e.g. licensing, vCards and your own vocabs.
• You can create scopes on a tag:

    <section itemscope itemtype="http://data-vocabulary.org/Person">

  – Then mark up elements within the scope:

    <img itemprop="photo" src="…"/>
    <p itemprop="name">Andrew</p>

• Then publish your vocabulary so people can use it.
• Publish in human-readable form, and in RDF for machine processing.

See http://html5demos.com/
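A consumer of micro data only needs to look for the itemscope, itemtype and itemprop attributes. The sketch below is a deliberately simplified reader built on Python's `html.parser`: it handles a single, non-nested scope, the class name `MicrodataReader` is invented, and the image file name `photo.png` is a placeholder because the slide elides the real src value.

```python
from html.parser import HTMLParser

PAGE = """<section itemscope itemtype="http://data-vocabulary.org/Person">
  <img itemprop="photo" src="photo.png"/>
  <p itemprop="name">Andrew</p>
</section>"""


class MicrodataReader(HTMLParser):
    """Collect itemprop values from one flat itemscope (no nested scopes)."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.properties = {}
        self._current = None  # itemprop whose text is being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        prop = attrs.get("itemprop")
        if prop is None:
            return
        if tag == "img":
            # For images the property value is the src attribute.
            self.properties[prop] = attrs.get("src")
        else:
            self._current = prop
            self.properties[prop] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        self._current = None


reader = MicrodataReader()
reader.feed(PAGE)
print(reader.item_type)
print(reader.properties)
```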

Page 20: Structured Data

JavaScript Object Notation (JSON)

• Another structured document type, not based on XML.
• Instead uses properties and nested curly braces to describe data:

    {"location":
      {"id": "WashingtonDC",
       "city": "Washington DC",
       "venue": "Hilton Hotel, Tysons Corner",
       "address": "7920 Jones Branch Drive"
      }
    }

• Essentially a dictionary
• Supports number, string, boolean, null, array (list) and object (map)
• JSON can be parsed into a JavaScript object using JSON.parse(string). eval(string) also works, but it executes arbitrary code, so it is unsafe for untrusted input.
• Popular because it is simpler than XML and natively understood by browsers.
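JSON maps just as directly onto the built-in types of most other languages. A small sketch with Python's standard `json` module, using the location document from the slide:

```python
import json

TEXT = """
{"location":
  {"id": "WashingtonDC",
   "city": "Washington DC",
   "venue": "Hilton Hotel, Tysons Corner",
   "address": "7920 Jones Branch Drive"
  }
}
"""

# json.loads turns JSON objects into dicts and JSON strings into str.
data = json.loads(TEXT)
location = data["location"]
print(location["city"])

# json.dumps goes the other way; parsing its output gives the same data back.
round_trip = json.dumps(data, indent=2)
print(round_trip)
```

Like JSON.parse, `json.loads` only parses data and never executes it.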

Page 21: Structured Data

XML Schema

• XML syntax for describing how XML documents should be structured.
  – Has some built-in data types
• Allows for validation of an XML document
• Allows for code generation
  – Create objects in your favorite programming language to manipulate XML documents

Page 22: Structured Data

    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                targetNamespace="urn:book"
                xmlns:bk="urn:book">

      <xsd:element name="book" type="bk:Book"/>

      <xsd:complexType name="Book">
        <xsd:sequence>
          <xsd:element name="title" type="xsd:string"/>
          <xsd:element name="author" type="xsd:string"/>
          <xsd:element name="isbn" type="xsd:string"/>
          <xsd:element name="genre" type="xsd:string"/>
          <xsd:element name="published" type="xsd:date"/>
          <xsd:element name="publisher" type="bk:Publisher"/>
        </xsd:sequence>
      </xsd:complexType>

      <xsd:complexType name="Publisher">
        <xsd:sequence>
          <xsd:element name="name" type="xsd:string"/>
          <xsd:element name="url" type="xsd:anyURI"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:schema>
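Because a schema is itself XML, an ordinary XML parser can read it. The sketch below lists the fields of each complex type, which is roughly the first step a code generator would take. It does not validate anything (the Python standard library has no XML Schema validator), and the `fields_of` helper is a name made up for this example.

```python
import xml.etree.ElementTree as ET

XSD_NS = {"xsd": "http://www.w3.org/2001/XMLSchema"}

SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="urn:book" xmlns:bk="urn:book">
  <xsd:element name="book" type="bk:Book"/>
  <xsd:complexType name="Book">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="author" type="xsd:string"/>
      <xsd:element name="isbn" type="xsd:string"/>
      <xsd:element name="genre" type="xsd:string"/>
      <xsd:element name="published" type="xsd:date"/>
      <xsd:element name="publisher" type="bk:Publisher"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="Publisher">
    <xsd:sequence>
      <xsd:element name="name" type="xsd:string"/>
      <xsd:element name="url" type="xsd:anyURI"/>
    </xsd:sequence>
  </xsd:complexType>
</xsd:schema>"""


def fields_of(schema_root, type_name):
    """Return (name, type) pairs for the elements of a named complexType."""
    for complex_type in schema_root.findall("xsd:complexType", XSD_NS):
        if complex_type.get("name") == type_name:
            elements = complex_type.findall("xsd:sequence/xsd:element", XSD_NS)
            return [(e.get("name"), e.get("type")) for e in elements]
    return []


schema_root = ET.fromstring(SCHEMA)
book_fields = fields_of(schema_root, "Book")
publisher_fields = fields_of(schema_root, "Publisher")
print(book_fields)
print(publisher_fields)
```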

Page 23: Structured Data

Structured Data

• Why use structured data?
• Understand how structured data encapsulates information
• What are the strengths/weaknesses of different types of structured data?