Post on 01-Jan-2016

lis512 lecture 4

XML: documents and records

up until now

Relational databases can store information that is internal to an organization.

But a lot of information has to relate to the outside world.

This is when other considerations come in.

two basic types

There are two basic types of outside communication tools:
• record sets
• documents

It's difficult to separate them precisely, but let's say that records are much more precisely defined.

general outside communication

Traditional communication has mainly been achieved through the issuing of documents.

Example: a court issues a judgment on a case. Most documents contain character data. But they also contain something else. That's where markup comes in.

special outside communication

In special cases, organizations make records available to others.

These records have a format that allows others to process them too.

That format is quite rigid and usually purpose-built.

metadata

Metadata is another form of records.

The term metadata is usually defined as “data about data”. As such, it is controversial what is metadata and what is data.

As far as we are concerned metadata are records that are attached to documents.

metadata example mail

If you send and receive email, you will sometimes see what are known as email headers.

This collection of fields has the form attribute: value.

Example on next slide

From krichel@openlib.org Sun Jul 12 14:55:16 2009
Date: Sun, 12 Jul 2009 14:55:16 +0700
From: Thomas Krichel <krichel@openlib.org>
To: krichel@lilrc.org
Message-ID: <20090712075516.GA25777@trabbi.openlib.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Envelope-to: Thomas Krichel <krichel@openlib.org>
Return-Path: Thomas Krichel <krichel@openlib.org>
User-Agent: Mutt/1.5.18 (2008-05-17)
Status: RO
Content-Length: 5
Lines: 1
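Because every header has the form attribute: value, software can read these fields without looking at the message body. A minimal sketch, assuming Python and its standard email module, with a shortened copy of the headers above; the one-word body and the variable names are made up for illustration.

```python
from email import message_from_string

# A shortened copy of the example headers, then a blank line, then the body.
raw = (
    "Date: Sun, 12 Jul 2009 14:55:16 +0700\n"
    "From: Thomas Krichel <krichel@openlib.org>\n"
    "To: krichel@lilrc.org\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "test\n"
)

message = message_from_string(raw)

# Each header is one attribute: value pair attached to the message.
for attribute, value in message.items():
    print(f"{attribute} -> {value}")

recipient = message["To"]
```

The headers are records attached to the document: they can be listed and looked up by name, while the body stays plain character data.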

metadata example: http headers

• HTTP/1.1 200 OK
• Date: Wed, 24 Feb 2010 17:34:33 GMT
• Server: Apache/2.2.14 (Debian)
• Last-Modified: Sun, 13 Dec 2009 08:03:42 GMT
• ETag: "5f8271-f76-47a9798613380"
• Accept-Ranges: bytes
• Content-Length: 3958
• Connection: close
• Content-Type: text/html

example id3v1

• A fixed 128 byte format.
– header 3 bytes "TAG"
– title 30 bytes of the title
– artist 30 bytes of the artist name
– album 30 bytes of the album name
– year 4 byte year
– comment 30 bytes of comment, or 28 bytes if a track number is stored (the next two entries then take its last two bytes)
– zero-byte 1 If a track number is stored, this byte contains a binary 0.
– track 1 The number of the track on the album, or 0.
– genre 1 Index in a list of genres, or 255
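A fixed-width record like this can be unpacked by byte position alone. A minimal sketch, assuming Python and its standard struct module; the function names and the sample tag are made up for illustration, and the comment is read as 30 bytes whose last two bytes may hold the zero-byte and the track number.

```python
import struct

# 3 + 30 + 30 + 30 + 4 + 30 + 1 = 128 bytes:
# "TAG", title, artist, album, year, comment, genre.
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")


def text(field: bytes) -> str:
    """Decode a fixed-width field, dropping zero or space padding."""
    return field.split(b"\x00")[0].decode("latin-1").rstrip()


def parse_id3v1(tag: bytes) -> dict:
    header, title, artist, album, year, comment, genre = ID3V1.unpack(tag)
    if header != b"TAG":
        raise ValueError("not an ID3v1 tag")
    # A zero at comment byte 28 followed by a non-zero byte means
    # that the last comment byte is a track number.
    track = comment[29] if comment[28] == 0 and comment[29] != 0 else None
    return {
        "title": text(title),
        "artist": text(artist),
        "album": text(album),
        "year": text(year),
        "comment": text(comment),
        "track": track,
        "genre": genre,
    }


# Hypothetical tag built by hand; in a real file it would be the last 128 bytes.
sample = (
    b"TAG"
    + b"Example Title".ljust(30, b"\x00")
    + b"Example Artist".ljust(30, b"\x00")
    + b"Example Album".ljust(30, b"\x00")
    + b"2009"
    + b"a comment".ljust(28, b"\x00")
    + b"\x00"
    + bytes([7])
    + bytes([255])
)

parsed = parse_id3v1(sample)
```

Note how rigid the format is: a title longer than 30 bytes simply does not fit.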

MARC

MARC is an important example of a record format used by the library community.

Integrated Library Systems (ILSs) all
• import MARC records into relational database systems
• export MARC records from relational database systems

MARC records describe records from library catalogs.

MARC format

• The MARC format is very complicated.
• The basic structure is
– Leader
– Directory
– Variable Control Fields
– Variable Data Fields

MARC leader

• Described in http://www.loc.gov/marc/bibliographic/bdleader.html

• When they talk about a character, they mean a byte.
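The leader is a fixed 24-character record, so it can be read by position. A minimal sketch in Python, covering only a few of the positions in the Library of Congress description; the function name is made up, and the sample leader is an assumed 24-character form of the one in the MARC XML example later in this lecture.

```python
def parse_leader(leader: str) -> dict:
    """Pick a few fixed positions out of a 24-character MARC leader."""
    if len(leader) != 24:
        raise ValueError("a MARC leader has exactly 24 characters")
    return {
        "record_length": int(leader[0:5]),           # positions 00-04
        "record_status": leader[5],                  # position 05
        "type_of_record": leader[6],                 # position 06
        "bibliographic_level": leader[7],            # position 07
        "base_address_of_data": int(leader[12:17]),  # positions 12-16
    }


# Assumed sample: two blanks after "cam" bring the string to 24 characters.
sample_leader = "01142cam  2200301 a 4500"
leader_fields = parse_leader(sample_leader)
```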

MARC directory

• The MARC directory follows the leader.
• I am not sure what its purpose is.

• The general record structure is at http://www.loc.gov/marc/specifications/specrecstruc.html

MARC variable fields

• In MARC all field names are numbers. Each field name has three digits.

• Numbers that start with 00 are for fields that are called control fields.

• Other fields that start with 0 are number and code fields.

• Fields that do not start with 0 are the main fields we study in cataloging.

field indicators

• Each field other than those starting with 00 can have zero, one or two field indicators.

• The field indicator says something additional about the field.

subfields

• Fields other than the ones starting with 00 admit subfields.

• A subfield is identified by a letter a to z.

markup

Markup is the information contained in a document that is not its contents.

Markup mainly comes with two types of information:
• information related to the structure
• information related to the appearance

In good documents, structure and appearance are related.

if there were no markup

If markup did not exist, it would be quite trivial to represent every document with a relational database structure.

You simply have a table with character positions (first character position to last character position) and the character found there.

But this would hardly correspond to our idea of a document.
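The character-position table can be sketched in a few lines. This assumes Python with its standard sqlite3 module and an in-memory database; the table and column names are made up for illustration.

```python
import sqlite3

text = "hello world"

connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE document (pos INTEGER PRIMARY KEY, ch TEXT NOT NULL)"
)

# One row per character: its position and the character found there.
connection.executemany(
    "INSERT INTO document (pos, ch) VALUES (?, ?)",
    enumerate(text, start=1),
)

# Reading the rows back in position order restores the text.
rows = connection.execute("SELECT ch FROM document ORDER BY pos").fetchall()
restored = "".join(ch for (ch,) in rows)
row_count = len(rows)
connection.close()
```

The table holds every character, but nothing in it says where a title, a paragraph or a chapter begins.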

structure

The structure of a document is a bit difficult to define, but easy to understand by example.

In a printed document, the sequence of pages defines one structure.

But if the book has chapters and sections, they too define structures, and so do index pages, title pages etc.

A database table representation of this becomes messy.

appearance

Appearance is usually used to communicate the structure of a document in a way that helps a human understand it.

For example, look at this slide. We can consider that it is a document. Find ways in which the appearance communicates the structure.

appearance

Appearance covers things such as
• fonts used
• background and foreground colors
• positioning of structural elements

If a document has some appearance and structure, it is tough to adapt it to a relational database structure.

XML

XML is a syntax to encode information as documents.

XML is not really a language since it has no vocabulary.

You can use any vocabulary you like.

XML nodes

XML is written in the form of nodes. I will only discuss three types of nodes here:
• character data
• XML elements
• attributes to elements

Character data is just that: characters.

XML elements

If you write an element, write something of the form

<name>contents</name>

Here name is the name of the element and contents is the contents of the element.

The contents can be character data and/or other elements.

XML tags

<name> is the start tag of an element that is called name.

</name> is the end tag of an element that is called name.

XML tags are a syntactic feature of XML. They are not nodes.

empty elements

If an element has no contents whatsoever, it can be written as

<foo></foo> or <foo/>

Either way it is an empty element. The second form is a shorthand for the first.

element examples

<name>Thomas Krichel</name>

<name>Mr. <first>Thomas</first> <last>Krichel</last></name>

<thomaskrichel/>

<foo><bar>hello world</bar></foo>
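To see how a program reads such elements, here is a minimal sketch assuming Python and its standard xml.etree.ElementTree module. The variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# An element whose contents mix character data and other elements.
name = ET.fromstring(
    "<name>Mr. <first>Thomas</first> <last>Krichel</last></name>"
)

element_name = name.tag                         # the name of the element
leading_text = name.text                        # character data before <first>
child_names = [child.tag for child in name]     # names of the child elements
all_characters = "".join(name.itertext())       # all character data, in order

# Both ways of writing an element without contents parse to the same thing.
long_form = ET.fromstring("<foo></foo>")
short_form = ET.fromstring("<foo/>")
same_element = ET.tostring(long_form) == ET.tostring(short_form)
```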

child elements

• If an element is in the contents of another element, it is called a child element.

• When you write an XML document, all elements must be contained in one single element. That single element is called the root element. The root element is the only element without a parent element.
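The single-root rule is something a parser enforces. A minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper is_well_formed and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the string as an XML document."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


# Two <name> elements as children of one root element.
one_root = "<names><name>Thomas</name><name>Krichel</name></names>"

# The same two elements without a common root element.
two_roots = "<name>Thomas</name><name>Krichel</name>"

one_root_ok = is_well_formed(one_root)
two_roots_ok = is_well_formed(two_roots)
```

The parser accepts the first string and rejects the second, because the second <name> comes after the document's only permitted root element has ended.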

attributes

Attributes attach name=value pairs to an element.

These name=value pairs are written in the start tag.

attribute examples

<name type="full">Thomas Krichel</name>

<name type="reverse">Krichel, Thomas</name>

<name string="Thomas Krichel"/>

more on attributes

Attribute names and values are strings.

Attribute values are surrounded by single or double quotes.

Attribute names are separated from values by the = sign.
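A parser hands attributes back as name and value strings. A minimal sketch, assuming Python's standard xml.etree.ElementTree module, using the attribute examples from this lecture (one of them rewritten with single quotes); the variable names are made up.

```python
import xml.etree.ElementTree as ET

# Double quotes around the attribute value.
full = ET.fromstring('<name type="full">Thomas Krichel</name>')

# Single quotes work just as well.
reverse = ET.fromstring("<name type='reverse'>Krichel, Thomas</name>")

# An empty element can still carry attributes.
empty = ET.fromstring('<name string="Thomas Krichel"/>')

full_attributes = full.attrib          # all name=value pairs of the element
reverse_type = reverse.get("type")     # the value of one attribute
empty_string = empty.get("string")
missing = empty.get("type")            # None: this element has no type attribute
```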

XML application examples

HTML is the language used to encode a specific type of document known as a web page.

It has a vocabulary of element names and attribute names.

HTML is written in XML syntax or a syntax that is close to it.

example HTML element <a>

The <a> element creates an anchor. This is a part of the document that leads to another.

Where it leads to is given by an attribute called href. Example

<a href="http://openlib.org/home/krichel"> Thomas Krichel</a>

example HTML element <img/>

The HTML element <img/> requests an image to be included in the web page

<img src="http://openlib.org/home/krichel/ToK.gif" alt="picture of Thomas Krichel"/>

Note that this element is empty.
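A program can pull the href, src and alt attributes out of such markup. A minimal sketch, assuming Python's standard html.parser module; the class name AttributeCollector is made up for illustration.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        # Empty elements such as <img .../> also end up here.
        self.found.append((tag, dict(attrs)))


page = (
    '<a href="http://openlib.org/home/krichel">Thomas Krichel</a>'
    '<img src="http://openlib.org/home/krichel/ToK.gif" '
    'alt="picture of Thomas Krichel"/>'
)

collector = AttributeCollector()
collector.feed(page)
collector.close()

found = collector.found
```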

MARC XML

• In order to increase the interoperability of MARC, the Library of Congress defined a mapping of the MARC format into the XML syntax.

• Not everybody thinks it is a good idea. http://serials.infomotions.com/ngc4lib/archive/2009/200909/1450.html

• A shamelessly copied example is at http://wotan.liu.edu/home/krichel/courses/lis512/external_doc/sandburg.xml

start of the example

<collection>
<record>
<leader>01142cam 2200301 a 4500</leader>
<controlfield tag="001"> 92005291 </controlfield>
<controlfield tag="003">DLC</controlfield>
<controlfield tag="005">19930521155141.9</controlfield>
<controlfield tag="008">920219s1993 caua j 000 0 eng </controlfield>
<datafield tag="010" ind1=" " ind2=" "><subfield code="a"> 92005291 </subfield></datafield>

end of the example

<datafield tag="650" ind1=" " ind2="1"><subfield code="a">Visual perception.</subfield></datafield>
<datafield tag="700" ind1="1" ind2=" "><subfield code="a">Rand, Ted,</subfield><subfield code="e">ill.</subfield></datafield>
</record>
</collection>

comments on example

• In an XML document, there must be one element that all other elements are children of.

• In this case this is the <collection> element.

• The <collection> can contain many <record> elements. In the example, there is just one.

• Find the features of MARC as set out in the description of MARC.
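One way to find the MARC features is to let a program walk the elements. A minimal sketch, assuming Python's standard xml.etree.ElementTree module and a shortened copy of the example record; the variable names are made up. The example here has no XML namespace; files that declare one would need namespace-qualified names in the searches.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example: one control field and two data fields.
marcxml = (
    "<collection><record>"
    '<controlfield tag="003">DLC</controlfield>'
    '<datafield tag="650" ind1=" " ind2="1">'
    '<subfield code="a">Visual perception.</subfield></datafield>'
    '<datafield tag="700" ind1="1" ind2=" ">'
    '<subfield code="a">Rand, Ted,</subfield>'
    '<subfield code="e">ill.</subfield></datafield>'
    "</record></collection>"
)

collection = ET.fromstring(marcxml)
records = collection.findall("record")

fields = []
for record in records:
    # Control fields have a tag and a value, but no indicators or subfields.
    for controlfield in record.findall("controlfield"):
        fields.append((controlfield.get("tag"), None, None, controlfield.text))
    # Data fields have a tag, two indicators and a list of subfields.
    for datafield in record.findall("datafield"):
        subfields = [
            (subfield.get("code"), subfield.text)
            for subfield in datafield.findall("subfield")
        ]
        fields.append(
            (datafield.get("tag"), datafield.get("ind1"),
             datafield.get("ind2"), subfields)
        )

for field in fields:
    print(field)
```

The three-digit field names, the indicators and the subfield letters all show up as attribute values, while the field contents are character data.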
