4/13/01

CSE 121/131

Programming

Spring 2001

Lecture Notes 7

2000-2001 A. Sahuguet & V.Tannen

Data on the Web, today: HTML

. . .
<a name="primary"><H2> Primary Faculty </H2>
<DL> <DT><BR>
<A href="http://www.cis.upenn.edu/~alur/info.html"><IMG SRC="images/resdesc.gif" ALIGN=right ALT="resdesc"></A>
<A href="http://www.cis.upenn.edu/~alur/home.html"><IMG SRC="images/home.gif" ALIGN=right ALT="Home"></A>
<B>Rajeev Alur</B><BR>Associate Professor, Computer and Information Science
<DD> Formal support for design and analysis of reactive, real-time, and hybrid systems. Hardware verification; Software engineering; Control of distributed multi-agent systems; Logic and concurrency theory; Distributed computing.

. . .

Data on the Web, tomorrow: XML

. . .
<primary>
  <name>
    <first>Rajeev</first>
    <last>Alur</last>
  </name>
  <title>Associate Professor</title>
  <department>Computer and Information Science</department>
  <bio>http://www.cis.upenn.edu/~alur/info.html</bio>
  <homepage>http://www.cis.upenn.edu/~alur/home.html</homepage>
  <interest>Formal support for design and analysis of reactive, real-time, and hybrid systems. Hardware verification; Software engineering; Control of distributed multi-agent systems; Logic and concurrency theory; Distributed computing.</interest>
</primary>
. . .

What is XML?

• Like HTML, XML is a “document markup language”, i.e., a way to enrich text with tags and attributes.

• HTML’s markup is about visual presentation, which makes it difficult for a program to manipulate the data in an HTML page.

• XML’s markup is about the meaning of the information. This makes it easier for programs to manipulate XML.

• Still, what we saw on the previous slide is an external format. Internally, an XML document is represented as a tree.
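The tree view can be made concrete with a small Java sketch. It assumes the JAXP parser bundled with the JDK (not a parser named on these slides), and the class name XmlTreeSketch and the shortened sample document are made up for illustration. It parses the text and prints one line per node, indented by depth.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class XmlTreeSketch {
    // Parses XML text into an in-memory tree using the JDK's built-in parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Renders the subtree rooted at node as an indented outline:
    // element names are printed bare, non-blank text is printed in quotes.
    static String outline(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        StringBuilder indent = new StringBuilder();
        for (int i = 0; i < depth; i++) indent.append("  ");

        if (node.getNodeType() == Node.ELEMENT_NODE) {
            sb.append(indent).append(node.getNodeName()).append('\n');
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                sb.append(indent).append('"').append(text).append("\"\n");
            }
        }
        // Recurse into the children, one level deeper.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            sb.append(outline(c, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Shortened version of the <primary> record from the previous slide.
        String xml = "<primary><name><first>Rajeev</first><last>Alur</last></name>"
                + "<title>Associate Professor</title></primary>";
        System.out.print(outline(parse(xml).getDocumentElement(), 0));
    }
}
```

Each element becomes an inner node of the tree and each piece of text becomes a leaf under its enclosing element.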

How XML overcomes some HTML limitations

• Using XML, content providers can separate form and content.

[Figure: XML Content is transformed by XSL (Stylesheets) into Wireless Markup Language, HTML, and HTML (Web-TV).]

http://www.wapforum.org/docs/technical/wml-30-apr-98.pdf

Wireless Applications

• Hand-held devices have some constraints
– small display
– narrowband network connection
– limited memory and computational resources

• HTML is not suitable for delivering information to them
-> Need for a Wireless Markup Language (WML)

• What WML offers
– specific layout
– new metaphor (deck, cards)
– state management
– binary XML format to make data more concise

The same metaphor can be used for e-forms in various domains: interactive kiosks, medical forms, etc.

Manipulating XML documents

• Manipulation
– parsing: reading, checking syntax, transforming into an internal format
– navigating
– modifying

• Fortunately, XML comes with a standard API that offers all these features: the Document Object Model (DOM)

API: Application Programming Interface

DOM

• “DOM provides a programmatic access to the content, structure and style of XML documents and allows languages such as Java to extract information from documents containing specific tags as if they were objects.” [Ardent’s white paper on XML]

• Platform neutral API designed by W3C using CORBA/IDL

• Mapping to various programming languages (Java, C++, Perl, etc.)

• DOM supported by all the major players

• DOM makes XML documents parser- and representation-independent

DOM overview

• What DOM does: it exposes markup such as the following as a tree of nodes

<TABLE>
  <TBODY>
    <TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>
    <TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>
  </TBODY>
</TABLE>

The DOM API (overview)

Interface hierarchy:

Node
  Attr
  CharacterData
    Comment
    Text
      CDATASection
  Document
  Element
  Entity
NodeList

interface Document
  createAttribute(…)
  createCDATASection(…)
  createComment(…)
  createElement(…)
  createTextNode(…)

interface Node
  appendChild(…)
  getAttributes(…)
  getChildNodes(…)

interface Element
  getAttribute(name)
  getAttributeNode(name)
  getElementsByTagName(name)

The full API can be found at http://www.w3c.org/DOM
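As a rough illustration of how these interfaces fit together, the following Java sketch builds a tiny document with the Document factory methods and reads it back through Node and Element. It assumes the JAXP DocumentBuilder shipped with the JDK. The class name DomApiSketch, the <Patent> wrapper element and the sample values are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class DomApiSketch {
    // Builds <Patent><!-- ... --><Inventor First_Name="Ada">...</Inventor></Patent>
    // using only DOM factory and tree-editing calls.
    static Document buildPatent() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        Element patent = doc.createElement("Patent");           // Document.createElement
        doc.appendChild(patent);                                 // Node.appendChild

        patent.appendChild(doc.createComment("hypothetical sample data"));

        Element inventor = doc.createElement("Inventor");
        inventor.setAttribute("First_Name", "Ada");              // Element.setAttribute
        inventor.appendChild(doc.createTextNode("sample inventor"));
        patent.appendChild(inventor);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildPatent();
        Element root = doc.getDocumentElement();

        // Navigation: look elements up by tag name, then read an attribute.
        NodeList inventors = root.getElementsByTagName("Inventor");
        Element first = (Element) inventors.item(0);
        System.out.println(first.getAttribute("First_Name"));

        // The root has two children: the comment and the <Inventor> element.
        System.out.println(root.getChildNodes().getLength());
    }
}
```

Nodes are always created by the owning Document and only become part of the tree once appendChild attaches them.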

DOM in action

• We take an HTML page from the IBM Patent server and we XML-ize it.

• From it, we want to extract some specific information, such as the name of the inventors.

• Four ways to do it
– Java DOM
– Java XQL
– Perl
– XML-QL (will return an XML document)

The Patent Example

Converted using W4F

http://www.patents.ibm.com/details?pn=US05592660__&language=en

http://db.cis.upenn.edu/cgi-bin/serveXML?SERVICE=Patent&URL=http://www.patents.ibm.com/details?pn=US05592660__&language=en

DOM with Java

import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;

public class Test {
    public static void main(String args[]) throws Exception {
        // Parse the file named on the command line into a DOM tree.
        Parser parser = new Parser(args[0]);
        Document doc = parser.readStream(new FileInputStream(args[0]));

        // Collect every <Inventor> element, wherever it occurs in the document.
        NodeList nodes = doc.getElementsByTagName("Inventor");
        int n = nodes.getLength();
        for (int i = 0; i < n; i++) {
            // Print the First_Name attribute of each inventor.
            Element node = (Element) nodes.item(i);
            String firstName = node.getAttribute("First_Name");
            System.out.println(firstName);
        }
    }
}
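The same extraction can be sketched without the IBM parser classes. The version below assumes the JAXP DocumentBuilder that ships with the JDK. The class name InventorNamesDom and the two-inventor sample document are hypothetical stand-ins for the XML-ized patent page, and only the Inventor element and its First_Name attribute come from the slide.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class InventorNamesDom {
    // Returns the First_Name attribute of every <Inventor> element, in document order.
    static List<String> firstNames(InputStream in) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(in);

        NodeList nodes = doc.getElementsByTagName("Inventor");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element inventor = (Element) nodes.item(i);
            // getAttribute returns "" (not null) when the attribute is absent.
            names.add(inventor.getAttribute("First_Name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data; a real run would open the patent file instead.
        String xml = "<Patent>"
                + "<Inventor First_Name=\"Ada\"/>"
                + "<Inventor First_Name=\"Alan\"/>"
                + "</Patent>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        for (String name : firstNames(in)) {
            System.out.println(name);
        }
    }
}
```

Only the parser setup differs from the slide's code. The navigation calls (getElementsByTagName, item, getAttribute) are plain DOM and stay the same, which is the parser independence that DOM is meant to provide.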

DOM with Java and XQL (GMD, IBM)

import de.gmd.ipsi.xql.*;
import org.w3c.dom.*;
import com.ibm.xml.parser.*;
import java.io.*;

public class XQLTest {
    public static void main(String args[]) throws Exception {
        // Parse the file named on the command line into a DOM tree.
        Parser parser = new Parser(args[0]);
        Document doc = parser.readStream(new FileInputStream(args[0]));

        // Let the XQL query "//Inventor" select the elements instead of walking the tree.
        XQLResult r = XQL.execute("//Inventor", doc);
        for (int i = 0; i < r.getLength(); i++) {
            Element inventor = (Element) r.getItem(i);
            String firstName = inventor.getAttribute("First_Name");
            System.out.println(firstName);
        }
    }
}
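For readers without the GMD XQL library, a similar query-based style can be sketched with XPath, the W3C path language, through the javax.xml.xpath package in the JDK. This is a substitute technique and not the XQL API used above. The class name InventorNamesXPath and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class InventorNamesXPath {
    // Evaluates a path expression that selects the First_Name attribute
    // of every <Inventor> element, and returns the attribute values.
    static List<String> firstNames(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList attrs = (NodeList) xpath.evaluate(
                "//Inventor/@First_Name",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);

        List<String> names = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            names.add(attrs.item(i).getNodeValue());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data standing in for the XML-ized patent page.
        String xml = "<Patent>"
                + "<Inventor First_Name=\"Ada\"/>"
                + "<Inventor First_Name=\"Alan\"/>"
                + "</Patent>";
        System.out.println(firstNames(xml));
    }
}
```

Compared with the plain DOM version, the selection logic moves out of the Java loop and into the query string, so changing what is extracted often means editing only the expression.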

DOM with Perl

• Extracting the name of the Inventors from the IBM Patent database.

#!/usr/bin/perl

# Include the Perl package.
use XML::DOM;

# Instantiate a new parser and parse the source file.
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile("patent.xml");

# Get the list of nodes that correspond to <Inventor>.
my $nodes = $doc->getElementsByTagName("Inventor");
my $n = $nodes->getLength;

# For each node, extract the First_Name attribute and print it.
for (my $i = 0; $i < $n; $i++) {
    my $node = $nodes->item($i);
    my $first_name = $node->getAttribute("First_Name");
    print $first_name, "\n";
}

SAX, a low-level alternative to DOM

• SAX
– Simple API for XML
– supported by most XML parsers
– event-driven parser

• Instead of reading the entire file into memory and building a tree, SAX reads a stream of tokens and triggers events
– startDocument
– startElement
– endElement
– endDocument

• The programmer has to write a document handler that captures these events and does something with the tokens.

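An Example of SAX

import java.io.PrintWriter;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.Locator;

// Echoes the SAX event stream back out as XML text (SAX 1 DocumentHandler).
public class OutputHandler implements DocumentHandler {
    private PrintWriter pw;

    public OutputHandler() { this.pw = new PrintWriter(System.out); }
    public OutputHandler(PrintWriter pw) { this.pw = pw; }

    // Flushes pending output; the returned string is intentionally empty.
    public String toString() { pw.flush(); return ""; }

    public void startDocument() { pw.println("<?xml version=\"1.0\"?>"); }
    public void endDocument() { pw.println(""); }

    public void startElement(String name, AttributeList atts) {
        pw.print("<" + name);
        if (atts != null)
            for (int i = 0; i < atts.getLength(); ++i)
                pw.print(" " + atts.getName(i) + "=\"" + atts.getValue(i) + "\"");
        pw.println(">");
    }

    public void endElement(String name) { pw.println("</" + name + ">"); }

    // Character data arrives as a slice of a shared buffer: honor start and length.
    public void characters(char[] ch, int start, int length) {
        pw.print(new String(ch, start, length));
    }

    // Remaining DocumentHandler callbacks, not needed for this example.
    public void ignorableWhitespace(char[] ch, int start, int length) { }
    public void processingInstruction(String target, String data) { }
    public void setDocumentLocator(Locator locator) { }
}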
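To connect SAX back to the patent example, here is a hedged sketch of the inventor extraction written as an event handler. It assumes the SAX 2 API bundled with the JDK (DefaultHandler and SAXParserFactory), not the SAX 1 DocumentHandler shown on the slide. The class names and the sample document are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class InventorNamesSax {
    // Reacts to startElement events only; every other event is ignored
    // by the empty default implementations in DefaultHandler.
    static class InventorHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("Inventor".equals(qName)) {
                // getValue returns null if the attribute is missing.
                names.add(atts.getValue("First_Name"));
            }
        }
    }

    // Streams through the document once; no tree is ever built.
    static List<String> firstNames(String xml) throws Exception {
        InventorHandler handler = new InventorHandler();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data standing in for the XML-ized patent page.
        String xml = "<Patent>"
                + "<Inventor First_Name=\"Ada\"/>"
                + "<Inventor First_Name=\"Alan\"/>"
                + "</Patent>";
        System.out.println(firstNames(xml));
    }
}
```

The handler keeps only the state it needs (the list of names), so memory use does not grow with the size of the document being parsed.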

SAX vs DOM

• SAX
– does not store anything in memory (great for stream-based processing)
– navigation in the document is clumsy
– does not allow updating an XML document

• DOM
– permits updates
– offers the DOM API for navigation/construction
– requires the entire document to be stored in main memory

The Missing Link

• There is only a “gentlemen’s agreement” between the application and its XML environment.

• Why do we need to go beyond that?
– performance
– static guarantees (helps to identify and control failures)

• How do we create a tight contract between the application and its XML environment?

[Figure: XML (input) -> Application -> XML (output)]

XML Binding

• Requirements
– high-level specification for XML (e.g. DTD, XML-Schemas, UML, etc.)
– a mapping to your favorite programming language (e.g. Java)
– a compiler that will generate code (“stubs” that define an API)

(Same paradigm as CORBA/IDL or ODMG/ODL)

Sun’s Proposal: <http://www.javasoft.com/xml/white-papers.html>

[Figure: XML spec. -> compiler -> stubs]

Generic (DOM/SAX) vs Domain Specific API

generic API
– generic parsing
– getElement(“order”)
– getAttribute(“date”)
– generic marshalling
only runtime checks

domain specific API
– domain specific parsing
– get_order()
– get_date()
– domain specific marshalling
both static and runtime checks

• Instead of a generic API (e.g. SAX, DOM), the application will use a domain specific API generated from the specification.

• Issues
– accurately mapping XML “types” to a programming language
– static checks vs runtime checks (some features from the specification cannot be checked statically)
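The contrast can be sketched in Java. The Order class below is written by hand as a stand-in for what a binding compiler might generate. It is not the output of any particular tool, the <order date="..."><item>...</item></order> layout is an assumed specification, and the accessors use Java naming (getDate) where the slide writes get_date().

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical domain specific API for an assumed <order> document type.
final class Order {
    private final String date;
    private final List<String> items;

    private Order(String date, List<String> items) {
        this.date = date;
        this.items = items;
    }

    // Domain specific parsing: the generic DOM calls and the runtime checks
    // are confined to this one method.
    static Order unmarshal(Element e) {
        if (!"order".equals(e.getTagName())) {
            throw new IllegalArgumentException("expected <order>, got <" + e.getTagName() + ">");
        }
        if (!e.hasAttribute("date")) {
            throw new IllegalArgumentException("<order> is missing its date attribute");
        }
        List<String> items = new ArrayList<>();
        NodeList nodes = e.getElementsByTagName("item");
        for (int i = 0; i < nodes.getLength(); i++) {
            items.add(nodes.item(i).getTextContent());
        }
        return new Order(e.getAttribute("date"), items);
    }

    // Convenience entry point: parse XML text, then unmarshal its root element.
    static Order parse(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
        return unmarshal(root);
    }

    // Typed accessors: a misspelled name is a compile-time error,
    // unlike a misspelled string passed to getAttribute("...").
    String getDate() { return date; }
    List<String> getItems() { return Collections.unmodifiableList(items); }

    public static void main(String[] args) throws Exception {
        Order order = parse("<order date=\"2001-04-13\"><item>book</item><item>pen</item></order>");
        System.out.println(order.getDate() + " " + order.getItems());
    }
}
```

Application code that calls order.getDate() is checked by the Java compiler, while the checks inside unmarshal still happen at run time, which matches the "both static and runtime checks" entry above.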

XML programming

• Resources
– Java and XML, Brett McLaughlin, Mike Loukides

• XML parsers (DOM/SAX)
– Apache http://xml.apache.org/xerces-j/index.html
– Oracle http://technet.us.oracle.com/tech/xml/
– Sun Project X http://java.sun.com/xml/
– Microsoft http://msdn.microsoft.com/xml/default.asp

• XML-binding frameworks
– Oracle ClassGenerator http://technet.us.oracle.com/tech/xml/classgen/index.htm
– Castor http://castor.exolab.org/
