4/13/01
CSE 121/131
Programming
Spring 2001
Lecture Notes 7
2000-2001 A. Sahuguet & V.Tannen
Data on the Web, today: HTML
. . .
<a name="primary"><H2> Primary Faculty </H2>
<DL> <DT><BR>
<A href="http://www.cis.upenn.edu/~alur/info.html"><IMG SRC="images/resdesc.gif" ALIGN=right ALT="resdesc"></A>
<A href="http://www.cis.upenn.edu/~alur/home.html"><IMG SRC="images/home.gif" ALIGN=right ALT="Home"></A>
<B>Rajeev Alur</B><BR>
Associate Professor, Computer and Information Science
<DD> Formal support for design and analysis of reactive, real-time, and hybrid systems. Hardware verification; Software engineering; Control of distributed multi-agent systems; Logic and concurrency theory; Distributed computing.
. . .
Data on the Web, tomorrow: XML
. . .
<primary>
  <name>
    <first>Rajeev</first>
    <last>Alur</last>
  </name>
  <title>Associate Professor</title>
  <department>Computer and Information Science</department>
  <bio>http://www.cis.upenn.edu/~alur/info.html</bio>
  <homepage>http://www.cis.upenn.edu/~alur/home.html</homepage>
  <interest>Formal support for design and analysis of reactive, real-time, and
  hybrid systems. Hardware verification; Software engineering; Control of
  distributed multi-agent systems; Logic and concurrency theory;
  Distributed computing.</interest>
</primary>
. . .
What is XML?
• Like HTML, XML is a “document markup language”, i.e., a way to enrich text with tags and attributes.
• HTML’s markup is about visual presentation. As a result, it is difficult for a program to manipulate the data in HTML.
• XML’s markup is about the meaning of the information. This makes it easier for programs to manipulate XML.
• Still, what we saw on the previous slide is an external format. Internally, XML is represented as trees.
How XML overcomes some HTML limitations
• Using XML, content providers can separate form and content.
[Diagram: XML Content, combined with XSL (Stylesheets), is rendered as Wireless Markup Language, HTML, or HTML (Web-TV).]
http://www.wapforum.org/docs/technical/wml-30-apr-98.pdf
Wireless Applications
• Hand-held devices have some constraints
  – small display
  – narrowband network connection
  – limited memory and computational resources
• HTML is not suitable for delivering information to them
  -> Need for a Wireless Markup Language (WML)
• What WML offers
  – specific layout
  – new metaphor (deck, cards)
  – state management
  – binary XML format to make data more compact
The same metaphor can be used for e-forms in various domains: interactive kiosks, medical forms, etc.
Manipulating XML documents
• Manipulation
  – parsing: reading, checking syntax, transforming into an internal format
  – navigating
  – modifying
• Fortunately, XML comes with a standard API that offers all these features
Document Object Model (DOM)
API: Application Programming Interface
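The three manipulation steps above (parsing, navigating, modifying) can be sketched with the DOM implementation that ships with the JDK. This is only an illustration under assumptions: it uses the JAXP classes in javax.xml.parsers rather than any parser named in these notes, and the class name, method names and the <office> element are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the lecture material.
class DomManipulationSketch {
    // Parsing: read the text, check that it is well-formed, build the internal tree.
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Navigating: go from the root to the first <title> element and read its text.
    static String titleOf(Document doc) {
        Element root = doc.getDocumentElement();
        Element title = (Element) root.getElementsByTagName("title").item(0);
        return title.getTextContent();
    }

    // Modifying: append a new (made-up) <office> child to the root element.
    static void addOffice(Document doc, String room) {
        Element office = doc.createElement("office");
        office.setTextContent(room);
        doc.getDocumentElement().appendChild(office);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<primary><title>Associate Professor</title></primary>");
        System.out.println(titleOf(doc));
        addOffice(doc, "room number goes here");
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

A malformed input such as "<primary>" with no closing tag makes parse throw an exception, which is the syntax check mentioned in the first step.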
DOM
• “DOM provides a programmatic access to the content, structure and style of XML documents and allows languages such as Java to extract information from documents containing specific tags as if they were objects.” [Ardent’s white paper on XML]
• Platform neutral API designed by W3C using CORBA/IDL
• Mapping to various programming languages (Java, C++, Perl, etc.)
• DOM supported by all the major players
• DOM makes XML documents independent of the parser and of the internal representation
DOM overview
• What DOM is doing
<TABLE>
  <TBODY>
    <TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>
    <TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>
  </TBODY>
</TABLE>
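To make the tree that DOM builds for this table visible, here is a small sketch that parses the markup and prints one node per line, indented by depth. It assumes the JDK's built-in JAXP parser; the class and method names are invented for the illustration, and the table is written without whitespace between tags so that no whitespace-only text nodes appear.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; not part of the lecture material.
class DomTreeSketch {
    // Append one line for this node, then recurse into its children.
    static void outline(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) out.append("  ");
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append('"').append(node.getNodeValue()).append('"');
        } else {
            out.append(node.getNodeName());
        }
        out.append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            outline(children.item(i), depth + 1, out);
        }
    }

    // Parse the XML text and return the indented outline of its tree.
    static String outline(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        outline(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(outline(
            "<TABLE><TBODY>"
            + "<TR><TD>Shady Grove</TD><TD>Aeolian</TD></TR>"
            + "<TR><TD>Over the River, Charlie</TD><TD>Dorian</TD></TR>"
            + "</TBODY></TABLE>"));
    }
}
```

Each element becomes an inner node and each piece of text becomes a leaf under its TD element.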
The DOM API (overview)
Node
  Attr
  CharacterData
    Comment
    Text
      CDATASection
  Document
  Element
  Entity
NodeList

interface Document
  createAttribute(…)
  createCDATASection(…)
  createComment(…)
  createElement(…)
  createTextNode(…)

interface Node
  appendChild(…)
  getAttributes(…)
  getChildNodes(…)

interface Element
  getAttribute(name)
  getAttributeNode(name)
  getElementsByTagName(name)
The full API can be found at http://www.w3c.org/DOM
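As a rough illustration of how these interfaces fit together, the sketch below builds a tiny document from scratch with the Document factory methods, attaches nodes with Node.appendChild, and reads a value back with Element.getAttribute. It assumes the JDK's JAXP implementation; the class name, the <patent> root and the inventor data are made up.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class; element names and values are invented.
class DomApiSketch {
    static Document build() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

        // Document acts as the factory for every kind of node.
        Element root = doc.createElement("patent");
        doc.appendChild(root);

        Element inventor = doc.createElement("Inventor");
        inventor.setAttribute("First_Name", "Ada");
        inventor.appendChild(doc.createTextNode("made-up inventor"));

        // Node.appendChild links the new nodes into the tree.
        root.appendChild(inventor);
        root.appendChild(doc.createComment("built with the DOM API"));
        return doc;
    }

    // Element.getAttribute reads an attribute value by name.
    static String firstInventorName(Document doc) {
        Element inventor = (Element) doc.getElementsByTagName("Inventor").item(0);
        return inventor.getAttribute("First_Name");
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstInventorName(build()));
    }
}
```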
DOM in action
• We take an HTML page from the IBM Patent server and we XML-ize it.
• From it, we want to extract some specific information, such as the name of the inventors.
• 4 ways to do it
  – Java DOM
  – Java XQL
  – Perl
  – XML-QL (will return an XML document)
The Patent Example
Converted using W4F
DOM with Java
import com.ibm.xml.parser.*;
import org.w3c.dom.*;
import java.io.*;

public class Test {
    public static void main(String args[]) throws Exception {
        // Parse the file named on the command line into a DOM tree.
        Parser parser = new Parser(args[0]);
        Document doc = parser.readStream(new FileInputStream(args[0]));

        // Collect every <Inventor> element, wherever it occurs in the tree.
        NodeList nodes = doc.getElementsByTagName("Inventor");
        int n = nodes.getLength();
        for (int i = 0; i < n; i++) {
            Element node = (Element) nodes.item(i);
            // Print the First_Name attribute of each inventor.
            String firstName = node.getAttribute("First_Name");
            System.out.println(firstName);
        }
    }
}
DOM with Java and XQL (GMD, IBM)
import de.gmd.ipsi.xql.*;
import org.w3c.dom.*;
import com.ibm.xml.parser.*;
import java.io.*;

public class XQLTest {
    public static void main(String args[]) throws Exception {
        // Parse the file named on the command line into a DOM tree.
        Parser parser = new Parser(args[0]);
        Document doc = parser.readStream(new FileInputStream(args[0]));

        // The XQL query replaces the explicit getElementsByTagName call.
        XQLResult r = XQL.execute("//Inventor", doc);
        for (int i = 0; i < r.getLength(); i++) {
            Element inventor = (Element) r.getItem(i);
            String firstName = inventor.getAttribute("First_Name");
            System.out.println(firstName);
        }
    }
}
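The same path-style query can be tried without the GMD library. The sketch below swaps in XPath through the JDK's javax.xml.xpath package, which is a different technology from XQL but accepts the same "//Inventor" expression. The class name and the sample patent markup are assumptions made for the illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; XPath stands in for XQL here.
class XPathInventors {
    static List<String> firstNames(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // Evaluate the path expression against the DOM tree.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList hits = (NodeList) xpath.evaluate("//Inventor", doc, XPathConstants.NODESET);

        // The result is an ordinary DOM NodeList, so the loop is unchanged.
        List<String> names = new ArrayList<>();
        for (int i = 0; i < hits.getLength(); i++) {
            names.add(((Element) hits.item(i)).getAttribute("First_Name"));
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the shape the slides assume.
        String xml = "<Patent><Inventor First_Name=\"Ada\"/><Inventor First_Name=\"Alan\"/></Patent>";
        System.out.println(firstNames(xml));
    }
}
```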
DOM with Perl
• Extracting the name of the Inventors from the IBM Patent database.
#!/usr/bin/perl
# Include the Perl package.
use XML::DOM;

# Instantiate a new parser and parse the source file.
my $parser = new XML::DOM::Parser;
my $doc = $parser->parsefile("patent.xml");

# Get the list of nodes that correspond to <Inventor>.
my $nodes = $doc->getElementsByTagName("Inventor");
my $n = $nodes->getLength;

# For each node, extract the First_Name attribute and print it.
for (my $i = 0; $i < $n; $i++) {
    my $node = $nodes->item($i);
    my $first_name = $node->getAttribute("First_Name");
    print $first_name, "\n";
}
SAX, a low-level alternative to DOM
• SAX
  – Simple API for XML
  – supported by most XML parsers
  – event-driven parser
• Instead of reading the entire file into memory and building a tree, SAX reads a stream of tokens and triggers events
  – startDocument
  – startElement
  – endElement
  – endDocument
• The programmer has to write a document handler that captures these events and does something with the tokens.
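An Example of SAX
import java.io.PrintWriter;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.Locator;

public class OutputHandler implements DocumentHandler {
    private PrintWriter pw;

    public OutputHandler() { this.pw = new PrintWriter(System.out); }
    public OutputHandler(PrintWriter pw) { this.pw = pw; }

    public String toString() { pw.flush(); return ""; }

    // Echo character data; only ch[start .. start+length-1] belongs to this event.
    public void characters(char[] ch, int start, int length) {
        pw.print(new String(ch, start, length));
    }

    public void startDocument() { pw.println("<?xml version=\"1.0\"?>"); }

    public void endDocument() {
        pw.println("<!-- end of document -->");
        pw.flush();  // make sure buffered output is written
    }

    // Print the start tag together with its attributes.
    public void startElement(String name, AttributeList atts) {
        pw.print("<" + name);
        if (atts != null)
            for (int i = 0; i < atts.getLength(); ++i)
                pw.print(" " + atts.getName(i) + "=\"" + atts.getValue(i) + "\"");
        pw.println(">");
    }

    public void endElement(String name) { pw.println("</" + name + ">"); }

    // Remaining DocumentHandler callbacks, not needed for this example.
    public void setDocumentLocator(Locator locator) { }
    public void ignorableWhitespace(char[] ch, int start, int length) { }
    public void processingInstruction(String target, String data) { }
}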
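For comparison with the DOM versions of the inventor example, here is a sketch of the same extraction done in event-driven style. It is written against SAX 2 (org.xml.sax.helpers.DefaultHandler and the JDK's SAXParserFactory), not the SAX 1 DocumentHandler interface used in the handler above, and the class name and sample data are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class; uses SAX 2 rather than the SAX 1 API of the slides.
class SaxInventors {
    static List<String> firstNames(String xml) throws Exception {
        final List<String> names = new ArrayList<>();

        // The handler only reacts to startElement events; no tree is ever built.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("Inventor".equals(qName)) {
                    names.add(atts.getValue("First_Name"));
                }
            }
        };

        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample data in the shape the slides assume.
        String xml = "<Patent><Inventor First_Name=\"Ada\"/><Inventor First_Name=\"Alan\"/></Patent>";
        System.out.println(firstNames(xml));
    }
}
```

The handler has to keep its own state (here, the list of names) because the parser forgets each token once the event has been delivered.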
SAX vs DOM
• SAX
  – does not store anything in memory (great for stream-based processing)
  – navigation in the document is clumsy
  – does not permit updating an XML document
• DOM
  – permits updates
  – offers the DOM API for navigation/construction
  – requires the entire document to be stored in main memory
The Missing Link
• There is only a “gentlemen’s agreement” between the application and its XML environment.
• Why do we need to go beyond that?
  – performance
  – static guarantees (helps to identify and control failures)
• How do we create a tight contract between the application and its XML environment?
[Diagram: XML (input) -> Application -> XML (output)]
XML Binding
• Requirements
  – high-level specification for XML (e.g. DTD, XML-Schemas, UML, etc.)
  – a mapping to your favorite programming language (e.g. Java)
  – a compiler that will generate code (“stubs” that define an API)
(Same paradigm as CORBA/IDL or ODMG/ODL)
Sun’s Proposal: <http://www.javasoft.com/xml/white-papers.html>
[Diagram: XML spec. -> compiler -> stubs]
Generic (DOM/SAX) vs Domain Specific API
generic API (only runtime checks)
  – generic parsing
  – getElement(“order”)
  – getAttribute(“date”)
  – generic marshalling
domain specific API (both static and runtime checks)
  – domain specific parsing
  – get_order()
  – get_date()
  – domain specific marshalling
• Instead of a generic API (e.g. SAX, DOM), the application will use a domain specific API generated from the specification.
• Issues
  – accurately mapping XML “types” to a programming language
  – static checks vs runtime checks (some features from the specification cannot be checked statically)
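The difference between the two styles can be sketched as follows. The Order class is a hand-written stand-in for the kind of stub a binding compiler might generate; it is not the output of any actual tool, and every name in it (BindingSketch, Order, fromElement, getDate, which plays the role of get_date() above) is an assumption made for the illustration.

```java
import java.io.StringReader;
import java.time.LocalDate;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical sketch; Order imitates a generated stub, it is not produced by a real compiler.
class BindingSketch {
    static final class Order {
        private final LocalDate date;

        private Order(LocalDate date) { this.date = date; }

        // Domain specific parsing: the remaining runtime checks happen once, here.
        static Order fromElement(Element e) {
            if (!"order".equals(e.getTagName())) {
                throw new IllegalArgumentException("expected <order>, got <" + e.getTagName() + ">");
            }
            return new Order(LocalDate.parse(e.getAttribute("date")));
        }

        // Typed accessor: callers get a LocalDate, not a raw string.
        LocalDate getDate() { return date; }
    }

    static Element parseRoot(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
    }

    public static void main(String[] args) throws Exception {
        Element root = parseRoot("<order date=\"2001-04-13\"/>");

        // Generic API: a misspelled attribute name compiles and yields an empty string.
        String misspelled = root.getAttribute("dat");
        System.out.println(misspelled.isEmpty());

        // Domain specific API: a misspelled method name would be rejected by the compiler.
        Order order = Order.fromElement(root);
        System.out.println(order.getDate().getYear());
    }
}
```

Note that the date format itself is still only checked at run time (LocalDate.parse throws on bad input), which matches the point that some features of the specification cannot be checked statically.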
XML programming
• Resources
  – Java and XML, Brett McLaughlin, Mike Loukides
• XML parsers (DOM/SAX)
  – Apache http://xml.apache.org/xerces-j/index.html
  – Oracle http://technet.us.oracle.com/tech/xml/
  – Sun Project X http://java.sun.com/xml/
  – Microsoft http://msdn.microsoft.com/xml/default.asp
• XML-binding frameworks
  – Oracle ClassGenerator http://technet.us.oracle.com/tech/xml/classgen/index.htm
  – Castor http://castor.exolab.org/