The Simple API for XML (SAX) Part I

©Copyright 2003-2004. These slides are based on material from the upcoming book, “XML and Bioinformatics” (Springer-Verlag) by Ethan Cerami. Please email [email protected] for permission to copy.

Post on 19-Dec-2015


Road Map

• SAX Overview
  – What is SAX?
  – Advantages/Disadvantages

• Basic SAX Examples
  – About Xerces 2 Parser
  – XMLReader Interface
  – ContentHandler Interface
  – Extending the SAX Default Handler

• Checking for Well-Formedness


SAX Overview


Introduction to SAX

• The Simple API for XML (SAX) is a standard, event-based interface for parsing XML documents.

• Versions:
  – SAX 1.0: original standard
  – SAX 2.0: current standard

• SAX is a de facto standard, supported by most XML parsers today.

• Unlike DOM, it is not an official W3C standard.

• SAX was originally built explicitly for Java, but SAX now exists for other languages, including Perl and Python.


SAX Interface

• At its core, SAX is simply a series of interfaces that are implemented by an XML parser.

• Because different parsers implement the same SAX interface, you can easily swap in/out different parsers.


SAX Interface

[Diagram: Java App → SAX Interface → Xerces Parser / Crimson Parser / Ælfred Parser → XML Document]

Implementation details are hidden behind the SAX interface. You can therefore swap parsers in/out.

Same idea as JDBC.
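The swap-in/swap-out idea can be sketched as follows. This is only an illustration, not code from the slides: the class name `ParserSwapSketch` and its methods are hypothetical, and the reader comes from the JDK's bundled JAXP `SAXParserFactory` purely so the sketch runs without extra JAR files. The application method accepts any `XMLReader`, so a reader from Xerces, Crimson, or Ælfred could be passed in its place.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ParserSwapSketch {

    /** Counts elements using whichever XMLReader the caller supplies. */
    static int countElements(XMLReader reader, String xml) throws Exception {
        final int[] count = {0};
        // The handler only sees SAX events; it never touches parser-specific classes.
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    /** One possible source of an XMLReader (assumption: the JDK's default JAXP parser). */
    static XMLReader defaultReader() throws Exception {
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DASDNA><SEQUENCE><DNA>taat</DNA></SEQUENCE></DASDNA>";
        System.out.println(countElements(defaultReader(), xml));
    }
}
```

Only `defaultReader()` knows which parser is in use; `countElements` would stay unchanged if a different implementation were substituted.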


Advantages/Disadvantages

• Advantages
  – Very widely implemented, by just about every XML parser
  – Fast performance
  – Low memory overhead

• Disadvantages
  – Does not provide an easy-to-navigate XML tree like DOM or JDOM.
  – Does not provide an easy mechanism for creating/modifying XML documents.


Basic SAX Example


Xerces 2 Parser

• All of our examples will use the Xerces 2 Parser.

• Xerces 2 is the latest open source XML parser from the Apache XML Group.

• The distribution is available at: http://xml.apache.org/xerces2-j/

• The distribution includes two JAR files:
  – xmlParserAPIs.jar: includes the relevant XML APIs, including DOM Level 2, SAX 2.0, and JAXP 1.2.
  – xercesImpl.jar: includes the Xerces implementation of the XML APIs.


SAXBasic.java

• The first example illustrates the simplest SAX functionality:
  – Creates an XML Parser object
  – Parses a document specified on the command line
  – Receives SAX events and prints these to the console.

• First, let’s examine a sample XML document. Then view the output when this document is parsed.


Sample XML Document

<?xml version='1.0' standalone='no' ?>
<!DOCTYPE DASDNA SYSTEM 'http://servlet.sanger.ac.uk:8080/das/dasdna.dtd' >
<DASDNA>
  <SEQUENCE id="1" version="8.30" start="1000" stop="1050">
    <DNA length="51">
      taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
    </DNA>
  </SEQUENCE>
  <SEQUENCE id="2" version="8.30" start="1000" stop="1050">
    <DNA length="51">
      taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
    </DNA>
  </SEQUENCE>
</DASDNA>

Document contains two sequences of DNA.


Sample Output

Start Document
Start Element: DASDNA
Start Element: SEQUENCE
Start Element: DNA
Characters: taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
Characters: 
End Element: DNA
End Element: SEQUENCE
Start Element: SEQUENCE
Start Element: DNA
Characters: taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
Characters: 
End Element: DNA
End Element: SEQUENCE
End Element: DASDNA
End Document


package com.oreilly.bioxml.sax;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

import java.io.IOException;

/**
 * Basic SAX Example.
 * Illustrates basic implementation of the SAX Content Handler.
 */
public class SAXBasic implements ContentHandler {

    public void startDocument() throws SAXException {
        System.out.println("Start Document");
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endDocument() throws SAXException {
        System.out.println("End Document");
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }

    public void endPrefixMapping(String prefix) throws SAXException {
        // No-op
    }

    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // No-op
    }

    public void processingInstruction(java.lang.String target,
            java.lang.String data) throws SAXException {
        // No-op
    }

    public void setDocumentLocator(Locator locator) {
        // No-op
    }

    public void skippedEntity(String name) throws SAXException {
        // No-op
    }

    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        System.out.println("Start Element: " + localName);
    }

    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        // No-op
    }

    /**
     * Prints Command Line Usage
     */
    private static void printUsage() {
        System.out.println("usage: SAXBasic xml-file");
        System.exit(0);
    }

    /**
     * Main Method
     * Options for instantiating XMLReader Implementation:
     * 1) XMLReader parser = XMLReaderFactory.createXMLReader();
     * 2) XMLReader parser = XMLReaderFactory.createXMLReader
     *        ("org.apache.xerces.parsers.SAXParser");
     * 3) XMLReader parser = new org.apache.xerces.parsers.SAXParser();
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            printUsage();
        }
        try {
            SAXBasic saxHandler = new SAXBasic();
            XMLReader parser = XMLReaderFactory.createXMLReader
                ("org.apache.xerces.parsers.SAXParser");
            parser.setContentHandler(saxHandler);
            parser.parse(args[0]);
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
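SAXBasic ignores the `Attributes` argument of `startElement`, although the sample document carries attributes such as `id` and `length`. A hedged sketch of reading them by index follows. The class `AttributePrinter` and its `describe` helper are hypothetical names, and the JDK's bundled JAXP parser stands in for Xerces so the sketch runs without extra JAR files. That parser is not namespace-aware by default, so the sketch uses `qName` rather than `localName`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class AttributePrinter extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts) {
        out.append(qName);
        // Attributes are exposed as an indexed list of name/value pairs.
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getQName(i))
               .append('=').append(atts.getValue(i));
        }
        out.append('\n');
    }

    /** Parses the given XML text and returns one line per start tag. */
    static String describe(String xml) throws Exception {
        AttributePrinter handler = new AttributePrinter();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.out.toString();
    }
}
```

A single value can also be looked up by name with `atts.getValue("length")`, which returns `null` when the attribute is absent.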


Main SAX Interfaces

• SAX provides two main interfaces:
  – XMLReader: implemented by the XML parser.
  – ContentHandler: implemented by your application in order to receive SAX events.

• Each time an event occurs (e.g. start element, end element), the XML parser calls the ContentHandler and informs you of the specific event.


XMLReader Interface

• You have three main options for instantiating an XMLReader implementation.

• Option 1: Use the SAX XMLReaderFactory class with no arguments:

    XMLReader parser = XMLReaderFactory.createXMLReader();

• The factory will attempt to instantiate an XMLReader based on system defaults.


Option 1: Continued

• You can specify a system property from the java command line via the -D option.

• For example, the following line invokes the SAXBasic class and specifies the Xerces2 XML Parser:

    javaw.exe -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser com.oreilly.bioxml.sax.SAXBasic sample.xml

• The advantage of using system properties is that you can dynamically change parsers at any time without recompiling any code.

• If the Factory is unable to determine any valid system defaults, it will throw a SAXException, with a specific message: "System property org.xml.sax.driver not specified."
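The role of the `org.xml.sax.driver` property can be made explicit with a small sketch. This is an illustration under assumptions rather than the factory's actual lookup logic: the class `DriverPropertySketch` is hypothetical, and the fallback to the JDK's bundled JAXP parser is a choice made for this sketch only. Note that `XMLReaderFactory` is marked deprecated in Java 9 and later, hence the suppressed warning.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

@SuppressWarnings("deprecation") // XMLReaderFactory is deprecated on newer JDKs
class DriverPropertySketch {

    /** Picks the parser class named by -Dorg.xml.sax.driver, if one was given. */
    static XMLReader createReader() throws Exception {
        String driver = System.getProperty("org.xml.sax.driver");
        if (driver != null) {
            // The named class must be on the classpath and implement XMLReader.
            return XMLReaderFactory.createXMLReader(driver);
        }
        // Fallback chosen for this sketch (assumption): the JDK's default parser.
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(createReader().getClass().getName());
    }
}
```

Running the sketch with and without the `-D` option shows which implementation class was selected, with no recompilation in between.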


Using Different Parsers

• The specific class that implements the XMLReader interface varies from parser to parser. For example:

• For the Xerces XML Parser, it's org.apache.xerces.parsers.SAXParser.

• For the Crimson XML Parser, it's org.apache.crimson.parser.XMLReaderImpl.


Option 2

• Call the XMLReaderFactory with a String argument indicating the class name that implements the XMLReader interface:

• For example:

XMLReader parser = XMLReaderFactory.createXMLReader ("org.apache.xerces.parsers.SAXParser");


Option 3

• Instantiate the XMLReader implementation directly:

• For example:

    XMLReader parser = new org.apache.xerces.parsers.SAXParser();

• This option works fine.

• However, note that if you switch parsers, you will need to recompile.


Using an XMLReader

• Once you have an XMLReader instance, you can call the parse() method to start parsing:

    XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    parser.parse("simple.xml");

• You can pass a local file name or an absolute URL to the parse() method.
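Besides the `parse(String)` form, the `XMLReader` interface also declares `parse(InputSource)`, which is convenient when the XML is already in memory or arrives on a stream. A minimal sketch, with the hypothetical names `InputSourceSketch` and `rootName`, and with the JDK's bundled JAXP parser standing in for Xerces:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InputSourceSketch {

    /** Returns the name of the root element of an XML document held in a String. */
    static String rootName(String xml) throws Exception {
        final String[] root = {null};
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                // The first start-element event belongs to the root element.
                if (root[0] == null) {
                    root[0] = qName;
                }
            }
        });
        // InputSource can wrap a Reader, an InputStream, or a system identifier.
        reader.parse(new InputSource(new StringReader(xml)));
        return root[0];
    }
}
```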


ContentHandler Interface

• The ContentHandler receives all SAX events.

• In total, there are 11 defined events.

• The most important events/methods are defined below:

    characters            Receive notification of character data.
    endDocument           Receive notification of the end of a document.
    endElement            Receive notification of the end of an element.
    ignorableWhitespace   Receive notification of ignorable whitespace in element content.
    setDocumentLocator    Receive an object for locating the origin of SAX document events.
    startDocument         Receive notification of the beginning of a document.
    startElement          Receive notification of the beginning of an element.
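Of the methods listed, `setDocumentLocator` is the least obvious. The following sketch shows one way it might be used: the handler stores the `Locator` it receives and asks it for the current line number on each start-element event. The class `LocatorSketch` and its `locate` helper are hypothetical, the JDK's bundled JAXP parser stands in for Xerces, and a parser is not strictly required to supply a locator, so the code checks for `null`.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class LocatorSketch extends DefaultHandler {
    private Locator locator;
    private final List<String> positions = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        // Called by the parser before startDocument, if it supports locators.
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        int line = (locator != null) ? locator.getLineNumber() : -1;
        positions.add(qName + "@" + line);
    }

    /** Returns entries of the form "elementName@lineNumber" in document order. */
    static List<String> locate(String xml) throws Exception {
        LocatorSketch handler = new LocatorSketch();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.positions;
    }
}
```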


Character “Chunking”

• Suppose you have the following piece of XML:

    <DNA length="51">taatgcaactaaatccaggcgaagcatttcagcttaaccccg</DNA>

• You will receive a start element event, followed by one or more character events.

• Parsers are free to split the character data across calls to characters() in any way they want. For example, one parser might do the following:
  – characters("t");
  – characters("a");
  – characters("a");

• Another parser might do this:
  – characters("taatgcaactaaatccagg");
  – characters("cgaagcatttcagcttaaccccg");


Character Chunking

• Your application needs to be able to handle either of these strategies.

• To do this, it is best to store character data in some kind of buffer, like StringBuffer.

• For example:

    /**
     * Processes Character Events via Buffer
     */
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // currentText is a StringBuffer field of the handler
        currentText.append(ch, start, length);
    }
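The buffering fragment leaves open when the buffer is reset and when it is read. One possible arrangement, sketched under assumptions, is to clear it on every start tag and consume it on the matching end tag. The class `DnaCollector` and its `collect` helper are hypothetical, `StringBuilder` is used in place of `StringBuffer` because no synchronization is needed here, and the JDK's bundled JAXP parser stands in for Xerces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DnaCollector extends DefaultHandler {
    private final StringBuilder currentText = new StringBuilder();
    private final List<String> sequences = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Start each element with an empty buffer.
        currentText.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called once or many times per text node; just accumulate.
        currentText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Only at the end tag is the text known to be complete.
        if ("DNA".equals(qName)) {
            sequences.add(currentText.toString().trim());
        }
    }

    /** Returns the trimmed text of every DNA element, in document order. */
    static List<String> collect(String xml) throws Exception {
        DnaCollector handler = new DnaCollector();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.sequences;
    }
}
```

Because the text is read only in `endElement`, the result is the same whether the parser delivers the sequence one character at a time or in a single call.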


Using ContentHandlers

• To receive events, you must:
  – Implement the ContentHandler interface
  – Register your content handler with the XML parser:

    XMLReader parser = XMLReaderFactory.createXMLReader
        ("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(saxHandler);
    parser.parse(args[0]);


ContentHandler Implementation

• Here’s a sample implementation that just outputs information about each event:

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }


Using the SAX Default Handler


SAX Default Handler

• In total, an implementation of ContentHandler must implement 11 methods.

• You usually don’t need to intercept all 11 of these events.

• It is therefore much easier to extend the SAX DefaultHandler.

• The DefaultHandler provides no-op implementations of all methods. You can therefore simply override those that you want.

• The next few slides provide an example.


package com.oreilly.bioxml.sax;

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
import org.xml.sax.XMLReader;

import java.io.IOException;

/**
 * Basic SAX Example.
 * Illustrates extending of DefaultHandler
 */
public class SAXDefaultHandler extends DefaultHandler {

    public void startDocument() throws SAXException {
        System.out.println("Start Document");
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endDocument() throws SAXException {
        System.out.println("End Document");
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }

    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        System.out.println("Start Element: " + localName);
    }

    /**
     * Prints Command Line Usage
     */
    private static void printUsage() {
        System.out.println("usage: SAXDefaultHandler xml-file");
        System.exit(0);
    }

    /**
     * Main Method
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            printUsage();
        }
        try {
            SAXDefaultHandler saxHandler = new SAXDefaultHandler();
            XMLReader parser = XMLReaderFactory.createXMLReader
                ("org.apache.xerces.parsers.SAXParser");
            parser.setContentHandler(saxHandler);
            parser.parse(args[0]);
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Only override those methods that you need.

By extending the Default Handler, your code is much more compact and concise.

The output of this program is identical to the first example.
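To show how little code a DefaultHandler subclass can need, here is a hedged sketch that overrides only two methods and builds an indented outline of element names. The class `OutlineHandler` and its `outline` helper are hypothetical, and the JDK's bundled JAXP parser stands in for Xerces; since that parser is not namespace-aware by default, the sketch reads `qName` instead of `localName`.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class OutlineHandler extends DefaultHandler {
    private final StringBuilder outline = new StringBuilder();
    private int depth = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Indent two spaces per nesting level, then descend one level.
        for (int i = 0; i < depth; i++) {
            outline.append("  ");
        }
        outline.append(qName).append('\n');
        depth++;
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        depth--;
    }

    /** Returns one line per element, indented by nesting depth. */
    static String outline(String xml) throws Exception {
        OutlineHandler handler = new OutlineHandler();
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.outline.toString();
    }
}
```

All other events (character data, whitespace, processing instructions, and so on) fall through to the no-op implementations inherited from DefaultHandler.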


Checking for Well-Formedness


Defaults

• By default, the Xerces XML parser (like most other parsers) checks for well-formedness, but does not automatically check for validity.

• Consider the document on the next slide.


Sample Document: Not Well-formed

<?xml version='1.0' standalone='no' ?>
<!DOCTYPE DASDNA SYSTEM 'http://servlet.sanger.ac.uk:8080/das/dasdna.dtd' >
<DASDNA>
  <SEQUENCE id="1" version="8.30" start="1000" stop="1050">
    <DNA length="51">
      taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
  </SEQUENCE>
  <SEQUENCE id="2" version="8.30" start="1000" stop="1050">
    <DNA length="51">
      taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
    </DNA>
  </SEQUENCE>
</DASDNA>

This document is not well-formed, because I deleted one of the </DNA> end tags.


Sample Output

Start Document
Start Element: DASDNA
Start Element: SEQUENCE
Start Element: DNA
Characters: taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
Characters: 
[Fatal Error] ensemble_dna_error.xml:8:5: The element type "DNA" must be terminated by the matching end-tag "</DNA>".
org.xml.sax.SAXParseException: The element type "DNA" must be terminated by the matching end-tag "</DNA>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.oreilly.bioxml.sax.SAXBasic.main(SAXBasic.java:101)

This is a fatal error. The parser therefore throws a SAXParseException.


Try / Catch Clause

    try {
        SAXDefaultHandler saxHandler = new SAXDefaultHandler();
        XMLReader parser = XMLReaderFactory.createXMLReader
            ("org.apache.xerces.parsers.SAXParser");
        parser.setContentHandler(saxHandler);
        parser.parse(args[0]);
    } catch (SAXException e) {
        // Indicates a fatal parsing error, such as errors in well-formedness.
        e.printStackTrace();
    } catch (IOException e) {
        // Indicates an IO error, such as a failed network connection.
        e.printStackTrace();
    }
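Since `SAXParseException` is a subclass of `SAXException`, it can be caught first to report where the problem occurred rather than printing a full stack trace. The following is a sketch under assumptions: the class `WellFormednessCheck` and its `check` method are hypothetical, and the JDK's bundled JAXP parser stands in for Xerces so the example runs without extra JAR files.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class WellFormednessCheck {

    /** Parses the text and returns a one-line report instead of a stack trace. */
    static String check(String xml) {
        try {
            XMLReader reader =
                SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            reader.parse(new InputSource(new StringReader(xml)));
            return "well-formed";
        } catch (SAXParseException e) {
            // Must be caught before SAXException, its superclass.
            return "not well-formed at line " + e.getLineNumber()
                 + ", column " + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException e) {
            return "SAX error: " + e.getMessage();
        } catch (IOException e) {
            return "IO error: " + e.getMessage();
        } catch (ParserConfigurationException e) {
            return "parser configuration error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<DNA>taat</DNA>"));
        System.out.println(check("<SEQUENCE><DNA>taat</SEQUENCE>"));
    }
}
```

The parser may still print its own "[Fatal Error]" line to standard error, as in the sample output; the returned report is in addition to that.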


Summary

• SAX is a standard, event-based interface for parsing XML documents.

• It is a de facto standard, not an official W3C standard.

• XML Parsers must implement the XMLReader interface.

• Applications must implement the ContentHandler interface.

• For more concise programs, extend the SAX Default Handler.

• Make sure to surround calls to parse() with a try/catch clause.