XML Tutorial
Outline
•Today’s web: Created by hand for-eyes-only
•Can HTML become smarter?
•SGML -> XML
•The next generation web: XML and component-based commerce
•Prologue: XML and EDI
A Web Created by Hand for Eyes
•Much of the web is “hand-crafted”
•HTML often exploited and extended to achieve specific layout and formatting
•HTML has too low an “Information IQ” to enable many desirable applications
The Limits of “Hand-crafting”
Time to Convert Word Processing Document and Apply HTML Markup (minutes/page)

Number of Pages   1 min/page     10 min/page    60 min/page
10                10 minutes     100 minutes    10 hours
100               100 minutes    16.67 hours    12.5 days
1000              16.67 hours    20.83 days     4.17 months
10000             20.83 days     6.94 months    3.47 years
100000            6.94 months    5.79 years     34.72 years
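The figures in the table can be reproduced with simple arithmetic. The sketch below is one way to do it. The conversion factors (8-hour working days, 30-day months, 12-month years) are not stated on the slide; they are assumptions inferred from the numbers, and the function name `effort` is purely illustrative.

```python
# Assumed conversion factors, inferred from the slide's figures (not stated in the source).
HOURS_PER_DAY = 8      # one working day
DAYS_PER_MONTH = 30
MONTHS_PER_YEAR = 12


def effort(pages, minutes_per_page):
    """Return (value, unit) for the total hand-conversion time."""
    minutes = pages * minutes_per_page
    hours = minutes / 60
    days = hours / HOURS_PER_DAY
    months = days / DAYS_PER_MONTH
    years = months / MONTHS_PER_YEAR
    # Report in the largest unit that the slide appears to use for this size.
    if years >= 1:
        return (round(years, 2), "years")
    if months >= 1:
        return (round(months, 2), "months")
    if hours >= 24:
        return (round(days, 2), "days")
    if minutes > 100:
        return (round(hours, 2), "hours")
    return (minutes, "minutes")


if __name__ == "__main__":
    for pages in (10, 100, 1000, 10000, 100000):
        print(pages, [effort(pages, rate) for rate in (1, 10, 60)])
```

Under these assumptions, a 100,000-page site at an hour per page comes out at decades of effort, which is the point of the slide.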
Low vs. High “IQ” Encoding
•What information can be encoded?
    How adaptable or flexible is the format for encoding style, structure, or markup?
    Can the format tell you what it encodes?
•ASCII is very low IQ: only character info
•SGML is highest IQ: encodes anything and completely specifies the encoding rules
•PDF? HTML?
HTML is too low in IQ
•HTML was designed as a simple markup language
    simple structures: headings, lists, links
    strong emphasis on formatting
    weak for encoding content
•HTML wasn’t designed to encode the structure and semantics needed for complex applications
Web Applications That Need “Smarter” Data
•Data interchange between Web clients
•Moving processing from server to client
•Multiple client-side views w/o new data
•“Information push” from personalized applications
Can HTML be made smarter?
•Create new tags used by your application, or use <META>, DIV, and CLASS (and hope they don’t interfere elsewhere)
•Use a “standard” metadata model (but which one? Dublin Core, PICS, P3, OPS,…)
•Hide applet code in comments (platform dependent?)
•Hack, hack, hack...
Inherent Limitations of HTML
•Not extensible
•Limited capability to encode structure
•No validation
•Lossy interchange
XML
•Extensible Markup Language - a standard way of creating markup languages for the Web
    a file format for data representation
    a schema for describing data or message structures
    a mechanism for extending and annotating HTML with semantic information
•XML is a simplification of SGML, the Standard Generalized Markup Language
    easier to understand and implement
HTML Apartment Listing
<HTML>
<HEAD>
<TITLE>An Apartment For Rent</TITLE>
</HEAD>
<BODY>
<H1>Apartment</H1>
<P>1800 square feet, 3 bedrooms, 7 baths.
<H2>No pets, smoking forbidden!</H2>
<H3>Amenities:</H3>
<P>
Sunny location, good view, has air-conditioner.
<H3>Location</H3>
<P>2008 South E. Avenue, Eureka, CA
<H3>Cost, Etc.</H3>
<P>Price: $3600 a month
<P>Contact: (415) 123-4567
<P>Available immediately
<P>This offer posted 1 August 1997 in the Eureka Daily Times
</BODY>
</HTML>
An XML Apartment Listing
<?xml version="1.0"?>
<!DOCTYPE LISTING SYSTEM "APTLISTING.DTD">
<LISTING>
<ADINFO>
<POSTED>March 26, 1997</POSTED>
<WHERE_POSTED>Belmont Courier</WHERE_POSTED>
<CONTACT>(650) 111-2222</CONTACT>
</ADINFO>
<DESCRIPTION>
<AREA>1400 SQUARE FEET</AREA>
<AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
<COMMENT>Small cottage in a big forest</COMMENT>
</DESCRIPTION>
<POLICIES>
<PETS>Not allowed</PETS>
<BOZOS>Not allowed</BOZOS>
</POLICIES>
<COST>$875</COST>
</LISTING>
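Because each fact in the XML listing sits in its own named element, a program can pick values out by name instead of guessing from headings and paragraphs. The sketch below uses Python's standard `xml.etree.ElementTree` parser on a trimmed copy of the listing. The function name `summarize` and the choice of fields are illustrative only, and the DOCTYPE line is left out because this parser does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the listing above; the DOCTYPE is omitted because
# ElementTree checks well-formedness only, not validity.
LISTING_XML = """<LISTING>
  <ADINFO>
    <POSTED>March 26, 1997</POSTED>
    <WHERE_POSTED>Belmont Courier</WHERE_POSTED>
    <CONTACT>(650) 111-2222</CONTACT>
  </ADINFO>
  <POLICIES>
    <PETS>Not allowed</PETS>
  </POLICIES>
  <COST>$875</COST>
</LISTING>"""


def summarize(xml_text):
    """Pull a few fields out of a listing by element name."""
    root = ET.fromstring(xml_text)
    return {
        "posted": root.findtext("ADINFO/POSTED"),
        "contact": root.findtext("ADINFO/CONTACT"),
        "pets": root.findtext("POLICIES/PETS"),
        # Strip the currency sign so the cost can be compared numerically.
        "monthly_cost": float(root.findtext("COST").lstrip("$")),
    }


if __name__ == "__main__":
    print(summarize(LISTING_XML))
```

Doing the same against the HTML version would mean scraping text out of `<P>` and `<H3>` elements whose names say nothing about what they contain.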
But First: One Minute SGML
•Standard Generalized Markup Language, ISO 8879
•SGML defines the “markup language” that specifies the logical rules for a given type of document
•Markup transforms a flat stream of text into a set of objects or elements that can be manipulated by other applications
•Since there is no “universal tag set” that can describe all documents, SGML provides the means for defining the tag set that meets your needs
SGML’s Big Idea: Document Types
•Idea of document type easy to understand
•The Document Type Definition or DTD defines:
    the class of documents that shares a common information model
    permissible elements and attributes, their contents, the order in which they occur
•The DTD is the “document schema” that makes an instance “self-describing”
•From a DTD a parser can be generated to test any document for conformance
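To make the idea of conformance testing concrete, here is a deliberately simplified stand-in for a validating parser. It is not how a real SGML or XML validator works: the `CONTENT_MODEL` table is a hypothetical hand-written substitute for a DTD, and it only handles fixed, ordered sequences of child elements, whereas a real DTD can also express optional, repeated, and alternative content.

```python
import xml.etree.ElementTree as ET

# Hypothetical content models standing in for a DTD: each element maps to
# the exact ordered sequence of child elements it must contain.
# Elements not listed here are treated as leaves (no child elements).
CONTENT_MODEL = {
    "LISTING": ["ADINFO", "DESCRIPTION", "POLICIES", "COST"],
    "ADINFO": ["POSTED", "WHERE_POSTED", "CONTACT"],
    "DESCRIPTION": ["AREA", "AMENITIES", "COMMENT"],
    "POLICIES": ["PETS", "BOZOS"],
}


def conformance_errors(element):
    """Return a list of messages for every element that breaks its model."""
    errors = []
    expected = CONTENT_MODEL.get(element.tag, [])
    actual = [child.tag for child in element]
    if actual != expected:
        errors.append(
            f"<{element.tag}>: expected children {expected}, found {actual}"
        )
    for child in element:
        errors.extend(conformance_errors(child))
    return errors


if __name__ == "__main__":
    reordered = ET.fromstring("<POLICIES><BOZOS/><PETS/></POLICIES>")
    print(conformance_errors(reordered))
```

An instance that lists `BOZOS` before `PETS` is still well-formed markup, but it fails this check because the order of elements is part of the document type.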
Examples of Document Types
•User manuals
•Reference manuals
•Directories
•Newsletters
•Brochures
•Catalogs
•Datasheets
•Proposals
•Dictionaries
•Technical reports
•Contracts
•Regulations
•Policies and procedures
•Journal Articles
•Textbooks
•Purchase Orders
•Invoices
•Recipes
HTML as a Document Type
•HTML can be described as an application of SGML - the HTML document type
    Simple structures: headings, lists, links
    Strong emphasis on formatting, weak for encoding content
    Not designed to encode the content distinctions for any particular industry or application
•But most HTML doesn’t conform to the HTML DTD
Designing a DTD
•Determine information requirements, purposes, uses (and their priorities)
    deliver in one or more print and online formats
    create new information products
    interchange with other authors or publishers
    integrate information into equipment
    meet company, industry, customer standards
Designing a DTD
•Determine process, tool, external constraints or standards
•Identify and name information components and component containers
•Create categories to organize the components
•Determine when, where, how often components appear
Designing a DTD
•Identify “meta-information” to augment the information components
    bibliographic information
    process and workflow-related information
•Describe the component hierarchy in a graphic notation to visualize it
•Transcribe the graphic notation into formal syntax
•Test the analysis on sample documents
•Document the process and the results
SGML: Close, but no Cigar
•SGML has been successful in niches, but hasn’t been adopted by rank-and-file Web publishers
    “the quiet revolution”
    “the million dollar secret”
•Perceived as too complex (because of features dating from keystroke-minimizing origins)
•Small vendors didn’t have the clout to legitimize SGML in the mass market (but some of them cleverly “dumbed-down” their tools for HTML)
XML: Right Place, Right Time
•Looks like HTML++, but acts like SGML--
•Backed by:
    World Wide Web Consortium (W3C)
    Sun - “give Java something to do”
    Microsoft - with great enthusiasm
    Netscape - with less enthusiasm
    SGML tool vendors and consultants
    Innovators in EDI community
Specific XML Proposals to Simplify SGML
•All elements have start and end tags
•All attributes are: name="value"
•Changed syntax for EMPTY elements
    <toc> => <toc/>
    <graphic file="x.gif"> => <graphic file="x.gif"/>
•No & connector in content models
•No inclusions and exclusions
•DTD not necessary because it can be inferred if instance is “well-formed”
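The rules above are what let a parser accept a document without any DTD. As a rough illustration, the sketch below asks Python's standard `xml.etree.ElementTree` parser whether a fragment is well-formed. The helper name `is_well_formed` is made up for this example; the parser checks well-formedness only and never consults a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    samples = [
        '<doc><toc/></doc>',          # new EMPTY-element syntax
        '<doc><toc></doc>',           # SGML-style empty tag, never closed
        '<graphic file="x.gif"/>',    # attribute written as name="value"
        '<graphic file=x.gif/>',      # unquoted attribute value
    ]
    for sample in samples:
        print(sample, is_well_formed(sample))
```

The second and fourth samples are the SGML-style shortcuts that XML drops: an empty element without a closing slash, and an attribute value without quotes.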
XML Adoption Scenarios
•The transition from the “Web for eyes” to the “automated Web”
    1st generation: XML leaves HTML alone
    2nd generation: HTML as output format created from XML instance
    3rd generation: XML repositories
1st Generation XML
•No disruption of existing HTML production processes
•XML production process may have nothing to do with HTML production process
•XML for processes, HTML for eyes, but XML and HTML can be linked together
1st Generation XML Leaves HTML as is
[Diagram: under CREATION and DELIVERY, a data source goes through conversion to HTML, yielding HTML “for eyes”, and separately through conversion to XML.]
2nd Generation XML
•Creation of XML is primary process
•Replace “hand-crafted” HTML with automated down-translation
•Alternatively, use XML style sheet to create HTML-like presentation(s)
•“instance at a time” retargeting
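Down-translation can be as small as a function that walks one XML instance and emits HTML. The sketch below is one possible shape, not a description of any particular tool: the function name `down_translate`, the field labels, and the trimmed sample listing are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from html import escape

# Trimmed, hypothetical instance in the style of the apartment listing.
SAMPLE_LISTING = """<LISTING>
  <DESCRIPTION>
    <AREA>1400 SQUARE FEET</AREA>
    <AMENITIES>1 bedroom, 1 bathroom</AMENITIES>
  </DESCRIPTION>
  <COST>$875</COST>
</LISTING>"""

# Which XML elements to show, and the label to print for each.
FIELDS = [
    ("Area", "DESCRIPTION/AREA"),
    ("Amenities", "DESCRIPTION/AMENITIES"),
    ("Cost", "COST"),
]


def down_translate(xml_text):
    """Generate an HTML page "for eyes" from one XML listing."""
    listing = ET.fromstring(xml_text)
    lines = ["<HTML>", "<BODY>", "<H1>Apartment</H1>"]
    for label, path in FIELDS:
        value = listing.findtext(path, default="")
        # Escape the text so characters such as & or < stay valid in HTML.
        lines.append(f"<P>{escape(label)}: {escape(value)}</P>")
    lines += ["</BODY>", "</HTML>"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(down_translate(SAMPLE_LISTING))
```

Changing the presentation means editing `FIELDS` or the template lines once, not re-editing every page by hand.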
Up & Down Translation
Content/structure-based text objects: SGML, XML, databases
Formatted electronic text: HTML, word processing files
Unstructured electronic text: ASCII
Printed text
[Diagram axes: “More structure (energy)” toward the top of this list; “Easier to translate to” toward the bottom.]
2nd Generation XML Restores Order
[Diagram labels: XML source; data source with conversion to XML; XML; XML stylesheet(s) giving an “HTML-like” presentation; down-translation to HTML and HDML.]
HTML as an Output Format
•Treating HTML as an output format generated from an SGML source repository insulates you from ongoing changes to HTML and the latest proprietary extensions
•HTML created by “down translation” can be richer in structure and more consistent than HTML created by hand at many times the cost
3rd Generation XML
•reuse, not just retargeting
•XML a first-class citizen from the start
•content-oriented DTD
•native authoring, or enhanced markup by editorial or production staff
•no longer file at a time: create a database and work on it
•support for custom applications
3rd Generation XML Repository
[Diagram: Inputs 1-4 enter the XML repository by “up-translation” or decomposition; Outputs 1-4 leave it by “down-translation” or assembly.]
Retargeting and Reuse Requirements
•different delivery channels
    Web
    CD-ROM, CD-ROM + Web hybrids
    Braille, large print, voice synthesis (ICADD)
•different “dialects” of HTML for different browsers or bandwidths or as HTML changes
•different applications (“slice and dice”)
    reference manual vs help vs tutorial
XML for the Web’s “Little Languages”
•CDF -- “channel definition format”, eliminates need for proprietary “push” plug-in
•OSD -- “open software description”, for describing configurations for automated distribution of software
•PICS -- for content ratings
•RDF -- “resource description framework”, merging Netscape and Microsoft metadata initiatives
•CBL -- common business language in eCo framework
The Next-Generation Web
PROBLEMS
•The Web is eyeballs-only
•No content encoding
•Things can’t be found
•No automation of tasks
SOLUTIONS
•Metadata and Object APIs -- “self-describing smart Web”
•Distributed registries and structure-based retrieval
•Web catalogs and documents in their “native schema”
•Agent-based run-time environment
The Internet Today
[Diagram: an FTP server, three Web servers with documents, two databases, and two applications.]
A Commerce Type Definition (CTD)
<!DOCTYPE Taxonomy PUBLIC "-//CommerceNet//DTD Taxonomy V1.0//EN">
<Taxonomy>
<Head>
<Label>United Airlines</Label>
<Version>1.0</Version>
<Base>World Airline Registry:1.1:2.3.7</Base>
<Registry>toe.commerce.net:2111</Registry>
</Head>
<Body>
<Services>
<Passenger_Flight_Information>
<Flight_Number>UA #200</Flight_Number>
<Flight_Price_US>$168.50</Flight_Price_US>
<Flight_Dest>Honolulu, Hawaii</Flight_Dest>
</Passenger_Flight_Information>
<Cargo_Flight_Information>
</Cargo_Flight_Information>
</Services>
</Body>
</Taxonomy>
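A CTD like this one is meant to be read by programs, not people. As a hedged sketch, the code below extracts the passenger flight entries using Python's standard XML parser. Two things are assumptions: the price element is spelled `Flight_Price_US`, since an XML element name cannot contain a space, and the function name `passenger_flights` is invented for the example. The DOCTYPE line is left out because the parser does not validate.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the CTD; Flight_Price_US is an assumed well-formed
# spelling of the price element.
CTD_XML = """<Taxonomy>
  <Head>
    <Label>United Airlines</Label>
    <Version>1.0</Version>
  </Head>
  <Body>
    <Services>
      <Passenger_Flight_Information>
        <Flight_Number>UA #200</Flight_Number>
        <Flight_Price_US>$168.50</Flight_Price_US>
        <Flight_Dest>Honolulu, Hawaii</Flight_Dest>
      </Passenger_Flight_Information>
      <Cargo_Flight_Information>
      </Cargo_Flight_Information>
    </Services>
  </Body>
</Taxonomy>"""


def passenger_flights(ctd_text):
    """List every passenger flight described in a CTD."""
    root = ET.fromstring(ctd_text)
    flights = []
    for info in root.iter("Passenger_Flight_Information"):
        flights.append({
            "number": info.findtext("Flight_Number"),
            "price_usd": float(info.findtext("Flight_Price_US").lstrip("$")),
            "destination": info.findtext("Flight_Dest"),
        })
    return flights


if __name__ == "__main__":
    print(passenger_flights(CTD_XML))
```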
Step 1: XML Metadata
[Diagram: the same FTP server, Web servers with documents, databases, and applications, each now labeled with a CTD.]
Step 2: Registries
[Diagram: the same resources, each with a CTD, plus four registries.]
Common Business Language (CBL)
•Who am I?
    Company name, contact, public key certificates
•What am I?
    Agent/object (API), document (DTD), database (schema)
•Available data
    Product list, price list, terms and conditions, catalog, order form
•Available services
    Buy, sell, RFQ, search catalog
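One way to picture how these four questions could feed a registry is the toy model below. Everything in it is hypothetical: the `Description` and `Registry` classes, their fields, and the company names are invented for illustration and are not taken from CBL or the eCo framework.

```python
from dataclasses import dataclass, field


@dataclass
class Description:
    """Hypothetical self-description answering the four questions above."""
    company: str                                   # Who am I?
    kind: str                                      # What am I? agent, document, database
    data: list = field(default_factory=list)       # Available data
    services: list = field(default_factory=list)   # Available services


class Registry:
    """Toy in-memory registry that indexes descriptions by service."""

    def __init__(self):
        self._by_service = {}

    def register(self, description):
        for service in description.services:
            self._by_service.setdefault(service, []).append(description)

    def find(self, service):
        """Return the companies that advertise the given service."""
        return [d.company for d in self._by_service.get(service, [])]


if __name__ == "__main__":
    registry = Registry()
    registry.register(
        Description("Example Air", "database", ["price list"], ["search catalog", "RFQ"])
    )
    registry.register(Description("Example Freight", "agent", ["catalog"], ["RFQ"]))
    print(registry.find("RFQ"))
```

An agent would ask such a registry who offers a service and then talk to the matching parties directly.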
Step 3: CBL Components
[Diagram: the same resources, each with a CTD, and four registries.]
Step 4: Agents
[Diagram: the same resources, CTDs, and registries, plus three agents.]
Step 5: Business Services
[Diagram: the same resources, CTDs, registries, and agents, plus Trust Intermediaries and Matchmaking Services.]
Wrapping Up
•HTML will continue to exist, but most serious publishers will produce HTML and XML versions of their content from the same “smarter” source
•XML unifies document and database perspectives and tools for Web publishing and lets them be automated in the same way
Prologue: XML and EDI
•XML appeals to the EDI community because:
    it reinforces the move to Internet EDI
    it suggests a way to make transaction sets easier to define and “self-describing”
•But which kind of XML/EDI?
    incremental strategy of wrapping existing EDI transactions in XML syntax
    radical re-thinking of EDI to create XML “fragments” for transaction components that are dynamically combined as needed
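The incremental strategy can be pictured as a mechanical rewrite of a delimited EDI message into tagged form. The sketch below assumes an X12-style layout with `~` ending each segment and `*` separating data elements. The element names `Transaction`, `Segment`, and `Element` are invented for illustration and do not come from any XML/EDI specification.

```python
import xml.etree.ElementTree as ET


def wrap_segments(edi_text, segment_end="~", element_sep="*"):
    """Wrap a delimited EDI message in XML, one element per data item."""
    root = ET.Element("Transaction")  # hypothetical wrapper element
    for raw in edi_text.split(segment_end):
        raw = raw.strip()
        if not raw:
            continue  # skip the empty piece after the last terminator
        tag, *values = raw.split(element_sep)
        segment = ET.SubElement(root, "Segment", id=tag)
        for position, value in enumerate(values, start=1):
            item = ET.SubElement(segment, "Element", pos=str(position))
            item.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Made-up two-segment message, for illustration only.
    print(wrap_segments("BEG*00*NE~PO1*1*10*EA~"))
```

A wrapper like this keeps positional semantics (`pos="2"` still needs the EDI guide to interpret), which is exactly what the more radical strategy tries to replace with named, self-describing fragments.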
Learning More
•The “mother of all information” about XML is the “SGML Home Page” - www.sil.org/sgml/xml.html
•Best overall book for managers to get started with SGML and XML is ABCD…SGML by Liora Alschuler
•Best overall book for HTML-savvy types is SGML on the Web by Yuri Rubinsky & Murray Maloney