IN350 Lecture 2: Document Properties and Markup Languages

August 29, 2002
Judith A. Molka-Danielsen

Reference: Ch.6 Baeza-Yates, Text & Multimedia Languages & Properties

Overview

•Review Properties of Documents.
•Introduce the concept of Markup Languages.
•Describe the role of XML.

Classes of document processing

Text Processing: Initially computers were used to do tedious repetitive calculations (billing transactions) on information.

Often the calculations required preprocessing or typesetting of text.

Other issues included information storage (with compression algorithms to store it efficiently), storage methods (indexing), and approaches to information retrieval.

Finally there was the preparation and processing of text for presentation purposes.

Classes of document processing

Document Processing: In the 1980s, technologies such as the PC, Ethernet, laser printers, graphical user interfaces with bitmap displays, and object-based text processing allowed individuals to process documents. A text processing system called Scribe (by Brian Reid at CMU) represented a new kind of processing.

In text processors like IBM's Script, the user marked up text in terms of syntax characteristics, such as "12 point bold courier".

But Scribe formatted in terms of structural characteristics, such as "heading". This was a transition to document processing.

Classes of document processing

Hypertext Processing: In the 1990s we saw the development of internetworks, and ubiquitous interfaces (windows).

Tim Berners-Lee, at the CERN particle physics laboratory, created HTML and URLs (Uniform Resource Locators), so that a simple standardized form of markup, structural like Scribe's, could be used to describe documents and a naming scheme would allow for the universal identification of documents.

So documents could be viewed in graphical format, and large collections could be linked across multiple networks. This is hypertext processing.

Properties of Documents

Syntax - can express structure, presentation style, semantics, and external actions. It can be implicit in the contents of a document or expressed in a language.

Structure - tells how the elements relate to each other within the document. A structural element like a section can have a Formatting Style associated with it.

Presentation Style - is how the document is displayed or printed. It can be embedded in the document, as in TeX (or through macros, as in LaTeX), or it can be defined separately, as with CSS for HTML documents. Presentation style can be determined by the author (in applications or languages) or the reader (Web browser).

Semantics - the meaning within a language; it can be associated with use.

Characteristics continued...

Metadata - information about the organization of the data (data about the data), such as author, publication date, subject codes, etc.

Structured Label Information in Documents

There is a difference between Data and Documents. Documents are formatted.

WYSIWYG word processors have problems:
•They make documents that are for one output medium (printer, online).
•Proprietary codes are for both style & format.
•It is hard to convert old document collections (merge LaTeX and Word).
•Formats like "headline" only mean BIG font size, but have no structural meaning within the document.
•People use too many options within a document (30 fonts on a page).

Text and formats

File formats -
•Word processing formats that are binary formats include Word and WordPerfect.
•text - ASCII (American Standard Code for Information Interchange) by ANSI X3.4. Alternatively there is 16 bit Unicode (ISO 10646).
•raster graphics -
  TIFF - Tagged Image File Format
  GIF - Graphics Interchange Format
  JPEG - Joint Photographic Experts Group
•An example of a vector graphics standard is CGM - Computer Graphics Metafile.
•printing - PostScript, PDF, EPS, PCL, LCDS, XML Printing Formats, ISO/IEC 10180 Standard Page Description Language, ISO 8613 Open Document Architecture (ODA)
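
The difference between the two text encodings on this slide can be illustrated with a short sketch. It assumes Python's built-in codecs; the sample strings are hypothetical and not part of the lecture.

```python
# ASCII assigns each character a 7-bit code, so every byte value is below 128.
ascii_bytes = "Markup".encode("ascii")
print(list(ascii_bytes))

# A hypothetical sample word with letters outside the ASCII range.
word = "blåbær"
try:
    word.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# 16-bit Unicode encoding: two bytes per character for this word
# (big-endian variant, chosen so no byte-order mark is added).
utf16_bytes = word.encode("utf-16-be")
print(fits_in_ascii, len(word), len(utf16_bytes))
```

The same text that ASCII cannot represent is stored without loss once a wider code space is used, at the cost of more bytes per character.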

Text and formats

File formats continued -
•multimedia
  MPEG (Moving Picture Experts Group)
  AVI (Audio Video Interleave)
•email
  email header - RFC 822
  SMTP - Simple Mail Transfer Protocol, RFC 821
  POP - Post Office Protocol
  IMAP - Internet Message Access Protocol (more advanced than POP)
  MIME - Multipurpose Internet Mail Extensions (attachments)
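
As a rough illustration of the email header format listed above, the sketch below parses a small message with Python's standard `email` package. The addresses and message text are hypothetical examples, not taken from the lecture.

```python
from email import message_from_string

# A hypothetical message: header lines, a blank line, then the body.
raw_message = (
    "From: jim@example.com\n"
    "To: joe@example.com\n"
    "Subject: Donuts again\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Donuts are here.\n"
)

message = message_from_string(raw_message)

# Header fields are looked up by name, much like labeled metadata.
sender = message["From"]
subject = message["Subject"]

# The MIME headers describe what kind of content the body carries.
content_type = message.get_content_type()
body = message.get_payload()

print(sender, subject, content_type)
```

The header is itself a simple form of structured labeling: each field name says what the value means, independent of how a mail reader displays it.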

Text and formats

File formats continued -

For document interchange between applications there is RTF (rich text format).

Compression and transfer encoding formats include ARJ and ZIP (compression) and uuencode/uudecode (encoding binary files as text for transfer).

Streaming Video formats include: QuickTime - MOV/QT, DivX - MPEG-4, RealAudio/RealVideo - RAM/RM, Windows Media - WMV

What is Markup?

•Markup is everything in a document that is not content. Typesetters used procedural markup to lay out instructions of how a document should look. (16 pt bold Helvetica)

•Word processing software like Microsoft Word uses procedural markup, with a specific set of markup codes. The codes apply to a single physical way of presenting information, such as on a printed page. They do not define the appearance on other media like CD-ROM or the Internet.

•Descriptive markup, or generic markup, describes the structure of the document rather than the appearance. Content is separate from style. You can publish on all media using the same structure instruction set.
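
The idea of publishing one structure on several media can be sketched in a few lines. This is a minimal illustration, not a real publishing system: the element names, the two style tables, and the `render` function are all hypothetical.

```python
# One structural description of a document: (element name, text) pairs.
# Nothing here says how the document should look.
document = [
    ("title", "Donuts again"),
    ("paragraph", "Donuts are here."),
]

# Separate style rules, one table per output medium (hypothetical).
HTML_STYLE = {
    "title": lambda text: f"<h1>{text}</h1>",
    "paragraph": lambda text: f"<p>{text}</p>",
}
PLAIN_STYLE = {
    "title": lambda text: text.upper(),
    "paragraph": lambda text: "  " + text,
}

def render(doc, style):
    """Apply a medium-specific style table to the same structure."""
    return "\n".join(style[name](text) for name, text in doc)

html_output = render(document, HTML_STYLE)
plain_output = render(document, PLAIN_STYLE)
print(html_output)
print(plain_output)
```

Adding a new medium means adding a new style table; the document itself is left untouched.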

SGML

SGML (Standard Generalized Markup Language, ISO 8879, 1986) specifies a standard method for describing the structure of a document. Structural elements are, for example: title, chapter, paragraph. It is an extensible meta language. It can support an unlimited variety of document structures, such as information bulletins, technical manuals, parts catalogs, design specifications, reports, letters, and memos.

The Document Type Definition (DTD) describes the structure of the document (like a database schema in a database). The DTD provides a framework of elements (chapters, headers). The DTD specifies rules for the relationship between elements, i.e. a chapter header must come after the start of a chapter. A document instance is a document whose content is tagged in conformance with a DTD. A DTD can be applied throughout the whole organization.
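
To make the idea of a structural rule concrete, here is a toy check of the "chapter header comes first" rule. It is only a sketch under loud assumptions: the rule is hand-written in Python rather than expressed in a DTD, it uses an XML parser rather than an SGML one, and the element names `book`, `chapter`, `header`, and `paragraph` are hypothetical.

```python
import xml.etree.ElementTree as ET

def chapters_start_with_header(root):
    """Return True if every <chapter> has a <header> as its first child."""
    for chapter in root.iter("chapter"):
        children = list(chapter)
        if not children or children[0].tag != "header":
            return False
    return True

# A document instance that follows the rule.
good = ET.fromstring(
    "<book><chapter><header>Intro</header>"
    "<paragraph>Text</paragraph></chapter></book>"
)

# A document instance that breaks it: the header comes after a paragraph.
bad = ET.fromstring(
    "<book><chapter><paragraph>Text</paragraph>"
    "<header>Intro</header></chapter></book>"
)

print(chapters_start_with_header(good), chapters_start_with_header(bad))
```

A real DTD states such rules declaratively, so the same validating parser can check any document type without new code.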

SGML continued

SGML uses tagging to identify the content's position within a DTD structure. So we insert tags around the content. You can nest elements. A parser program verifies that a document follows the rules of a DTD. The parser checks if the document is structurally correct.

Documents can be ported to different formats for different output media (printer, screen, CD-ROM, speaker, TV).

Style is usually handled separately by style sheets, like Cascading Style Sheets (CSS).
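
The parser's structural check can be sketched with Python's standard XML parser. Note the assumptions: this uses an XML parser as a stand-in for an SGML one, and it checks only that tags are properly nested, which is a narrower test than conformance to a DTD. The helper name `is_well_formed` and the sample fragments are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts the tagged text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Elements nested correctly: header is closed before chapter.
nested_ok = "<chapter><header>Intro</header></chapter>"

# Elements overlapping: chapter is closed while header is still open.
nested_bad = "<chapter><header>Intro</chapter></header>"

print(is_well_formed(nested_ok), is_well_formed(nested_bad))
```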

HTML

HTML (first version in 1992) is a tagging language that can be used on the World Wide Web for text formatting and linking documents. It adopts the syntax of SGML and is an application of SGML described by a particular DTD. HTML is not an extensible language: authors cannot add their own tags. HTML supports style sheets written in the CSS language (color, font, and layout for web pages) and Frameset to partition the browser window.

XHTML is a modular approach that allows markup tags to be supported on smaller client devices like cell phones, TVs, cars, kiosks, etc.

Positive features of HTML

HTML uses tags to separate content (text) from format (structure, appearance).

It lets amateurs control markup (good and bad).

HTML tags were used for appearance formatting, but little attention was paid to content structuring.

Negative features of HTML

HTML did not offer enough custom control over the WYSIWYG environment.

Things looked different in different browsers (reader interpreted, not author interpreted).

Navigating through hypertext requires user memory.

Designing hypertext (document collections) for easy searching is hard to do. Spiders, crawlers, robots, and indexes such as AltaVista all try to index the Web.

Comments on CSS

Cascading Style Sheets helped HTML by freeing tags like <font> and <b> from carrying format information. That information is put in the style sheet instead.

It lets tags like <header> carry structure information.

CSS is a styling tool that can work with other markup languages like XML.

Comments on separation of format and content

The Document

Formatting
•Structure
•Appearance

Content
•Information
•Data

Structure – HTML does this a little bit. XML has DTD or Schema.

Appearance – or presentation. Before, HTML did this with tags like <b>, but now all such control should be taken out of HTML documents and put in CSS or XSL files.

Why a migration to XML was needed.

Binary files (in native formats) compress tightly for efficient transmission, but they are complex and proprietary. (A negative point for XML: the files are larger, because with markup there is more to store and transfer.)

Moving documents between applications is hard. Data must be saved in a text format and then moved, and conversions were not always good. (With XML, writers define formats and standards for loading, saving, and open transfer, including between databases.)

Lock-in let Microsoft sell new versions of Word that could read the old format and save in a new format, which old versions could not read. But XML handles both document description and data description, so structure and labels are not lost in the move.

XML – what is it?

XML (XML 1.0, 1998, Extensible Markup Language) is also a meta language in that it describes other languages. There is no predefined list of elements.

Elements are specified using a DTD or Schema. Style sheets can also be used to specify the output format of each element (XSL).

XML is a subset of SGML and is considered easier to program. Most current browser versions can also display XML.

XML related standards

•XPath - Specifications for the data model and grammar for navigating an XML document.
•XSL - eXtensible Stylesheet Language includes a language for transforming XML documents (XSLT) and a formatting vocabulary (XSL-FO).
•XSLT - eXtensible Stylesheet Language Transformations defines a transformation language to convert XML documents into other formats.
•XLL - eXtensible Linking Language allows logic to be placed on linking.
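
As a small illustration of navigating a document with path expressions, the sketch below uses Python's `xml.etree.ElementTree`, which supports only a limited subset of XPath. The `report` document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A hypothetical document with two chapters.
report = ET.fromstring(
    "<report>"
    "<chapter id='1'><header>Overview</header>"
    "<paragraph>First.</paragraph></chapter>"
    "<chapter id='2'><header>Markup</header>"
    "<paragraph>Second.</paragraph></chapter>"
    "</report>"
)

# Child steps: every header that sits directly inside a chapter.
headers = [h.text for h in report.findall("./chapter/header")]

# A predicate on an attribute: the header of the chapter with id 2.
second_header = report.find("./chapter[@id='2']/header").text

# The descendant axis: all paragraphs anywhere below the root.
paragraph_count = len(report.findall(".//paragraph"))

print(headers, second_header, paragraph_count)
```

XSLT relies on the same kind of expression in its `match` and `select` attributes to pick out the nodes a template applies to.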

XML related standards & groups

•OAGIS - The Open Applications Group's (www.openapplications.org) Integration Specification for interoperability between ERP packages.
•OASIS-ebXML - Organization for the Advancement of Structured Information Standards (OASIS) Electronic Business XML (www.ebxml.org).
•FinXML - Financial Markup Language (www.finxml.com) supports a universal standard for data interchange within the capital market.
•FpML - Financial Products Markup Language (www.fpml.org) enables e-commerce activities in the financial derivatives field.
•OFX - Open Financial Exchange (www.ofx.net) for the electronic exchange of financial data.

Other languages

•MathML - tags for presenting formulas.
•SMIL - language for scheduling multimedia (Synchronized Multimedia Integration Language). It uses XML markup to identify and manage the presentation of files containing text, images, sound and video in multimedia presentations.
•RDF - Resource Description Framework, a format to contain metadata information for XML.
•HyTime - an SGML architecture that specifies the generic hypermedia structure of documents. It allows for the design of meta-DTDs for complex multimedia presentations, such as providing music with other media presentation.

For more information on markup languages, see http://www.w3.org/

Here is the donut.xml file

<?xml version="1.0"?>
<?xml-stylesheet href="donut.xsl" type="text/xsl"?>
<memo>
  <from>Jim</from>
  <to>Joe</to>
  <subject>Donuts again</subject>
  <date>April 13, 2001</date>
  <content>Donuts are here. But they will not be here for long. Benny ate 3. </content>
</memo>

Here is what you see in IE 6.0 for the donut.xml file

From: Jim
To: Joe
Re: April 13, 2001

Donuts are here. But they will not be here for long. Benny ate 3.

Here is the style sheet donut.xsl

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>
  <xsl:template match="/">
    <html>
      <body>
        <xsl:apply-templates select="memo"/>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="memo">
    <p>From: <xsl:value-of select="from"/></p>
    <p>To: <xsl:value-of select="to"/></p>
    <p>Re: <xsl:value-of select="date"/></p>
    <hr />
    <p><xsl:value-of select="content"/></p>
  </xsl:template>
</xsl:stylesheet>
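
For readers who want to trace what the style sheet does, here is a rough Python equivalent of the same transformation. It is a hand-written sketch, not an XSLT processor (Python's standard library does not include one); the function name `memo_to_html` is hypothetical, and the memo text is copied from donut.xml.

```python
import xml.etree.ElementTree as ET
from html import escape

MEMO_XML = """<memo>
<from>Jim</from>
<to>Joe</to>
<subject>Donuts again</subject>
<date>April 13, 2001</date>
<content>Donuts are here. But they will not be here for long. Benny ate 3. </content>
</memo>"""

def memo_to_html(xml_text):
    """Mimic the two templates: wrap in html/body, then format the memo."""
    memo = ET.fromstring(xml_text)

    def value_of(name):
        # Plays the role of <xsl:value-of select="name"/>.
        return escape(memo.findtext(name, default=""))

    return (
        "<html><body>"
        f"<p>From: {value_of('from')}</p>"
        f"<p>To: {value_of('to')}</p>"
        f"<p>Re: {value_of('date')}</p>"
        "<hr>"
        f"<p>{value_of('content')}</p>"
        "</body></html>"
    )

html_page = memo_to_html(MEMO_XML)
print(html_page)
```

As in the style sheet, the "Re:" line selects the date element, and the subject element is never selected, which is why it does not appear in the browser view.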