XWeave - SAX2 Parser Tutorial

This tutorial is extracted from Appendix B of Designing XML Databases and has been updated to use the SAX2 parser. The complete code described in this tutorial is available as a jar file.

There are two major types of XML parsers: tree-based parsers and event-based parsers. A tree-based parser creates an internal tree structure, which the application navigates to extract information. The DOM (Document Object Model) parser is a tree-based parser that creates objects for each element and character data region in the document. An event-based parser reports parsing events, such as the start and end of an element, and does not build an internal tree structure. The SAX (Simple API for XML) parser is an event-based parser that may be more efficient for extracting data from large documents.

The SAX parser calls a user-defined method at the beginning and end of each element and for each character data region. These user-defined methods can build an alternative data structure that meets the requirements of the application, often more efficiently than building a DOM tree. For example, if the application extracts only part of the data from an XML stream, parsing with SAX may be much faster than creating DOM objects that would never be used. However, event-based parsing may require additional, temporary data structures to retain parsed information before it is placed in the final data structure. For example, character data may need to be associated with its containing element; in that case, a stack of elements is necessary to facilitate the association.
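The stack of elements mentioned above can be sketched as follows. This is a minimal illustration, not code from the book: the class names ElementStackDemo and StackHandler and the "element=text" output format are hypothetical, and the parser is the JDK's built-in one obtained through SAXParserFactory rather than Xerces.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ElementStackDemo {

    /** Hypothetical handler: pairs each non-blank text region with its enclosing element. */
    static class StackHandler extends DefaultHandler {
        private final Deque<String> open = new ArrayDeque<>(); // elements currently open
        final List<String> pairs = new ArrayList<>();          // "element=text" entries

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            open.push(qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            String text = new String(ch, start, length).trim();
            // The top of the stack is the element that contains this text
            if (!text.isEmpty() && !open.isEmpty()) {
                pairs.add(open.peek() + "=" + text);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            open.pop();
        }
    }

    /** Parses an in-memory document and returns the collected pairs. */
    static List<String> collect(String xml) {
        StackHandler handler = new StackHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.pairs;
    }

    public static void main(String[] args) {
        System.out.println(collect("<doc><title>Intro</title><body>Hello</body></doc>"));
    }
}
```

The stack is needed because the characters callback carries no information about where the text occurred; only the sequence of start and end events reveals the enclosing element.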

XML parsers are often created using a “factory” pattern. The class name of the parser is passed as a string to the parser factory, which creates the parser. Many companies, including IBM, Sun, and Oracle, provide parsers that can be created using this approach. An advantage of using a parser factory is that the parser can be easily replaced if a more efficient one is found. The main caveat is that the specified parser class must be available on the classpath (for example, through the CLASSPATH environment variable).
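One way to soften the classpath caveat is to fall back to the parser bundled with the JDK when the named class cannot be loaded. The sketch below is an assumption about how that might be done, not part of the original code: ParserLookup and createReader are hypothetical names. Note that XMLReaderFactory is marked deprecated in recent JDKs, although it is still available.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

class ParserLookup {

    /**
     * Returns a reader for the named parser class, or the platform default
     * parser if that class cannot be loaded or instantiated.
     */
    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated since Java 9
    static XMLReader createReader(String parserClass) {
        try {
            return XMLReaderFactory.createXMLReader(parserClass);
        } catch (SAXException notAvailable) {
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                // Readers from XMLReaderFactory are namespace-aware by default; match that here
                factory.setNamespaceAware(true);
                return factory.newSAXParser().getXMLReader();
            } catch (ParserConfigurationException | SAXException e) {
                throw new IllegalStateException("No SAX2 parser available", e);
            }
        }
    }

    public static void main(String[] args) {
        XMLReader reader = createReader("org.apache.xerces.parsers.SAXParser");
        System.out.println("Using parser: " + reader.getClass().getName());
    }
}
```

Because the rest of the program only depends on the XMLReader interface, swapping the parser means changing the class name string and nothing else.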

The process of creating a SAX parser consists of the following:

Creating an instance of the Parser.
Creating and initializing an instance of the Parser Handler.
Initializing the Parser to use the Parser Handler.
Calling the Parser with the XML URL.

The Parser retrieves and parses the document, calling the Handler at:

The beginning of the document.
The beginning of each element.
The end of each character data region.
The end of each element.
The end of the document.

There are two versions of the SAX parser currently available. The code in this tutorial uses the SAX2 parser framework, which includes support for namespaces (among other things). The methods described here are slightly different from the ones in the original SAX (SAX1). Also, the classes in this tutorial refer to the Apache Xerces parser, though any SAX2 parser may be used. The code to create a parser is:

String parserClass = "org.apache.xerces.parsers.SAXParser";

XMLReader parser = XMLReaderFactory.createXMLReader(parserClass);

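The parser class is defined and then the parser factory is used to create a parser. After the parser is created, a content handler is created, and the parser is set to use that handler for both content and error events. In SAX2 the method is setContentHandler; setDocumentHandler belongs to the SAX1 Parser interface and is not defined on XMLReader:

DefaultHandler handler = new DebugHandler();

parser.setContentHandler(handler);

parser.setErrorHandler(handler);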

The parser is invoked by passing it the URL of the XML document, such as:

parser.parse(xmlFile);
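Besides a URL string, XMLReader also accepts an InputSource, which is convenient when the document is already in memory or arrives on a stream. The following is a small sketch under that assumption; InMemoryParse and countElements are hypothetical names, and the JDK's built-in parser stands in for Xerces.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class InMemoryParse {

    /** Counts the elements in an XML document held in a String. */
    static int countElements(String xml) {
        int[] count = {0}; // single-element array so the anonymous handler can update it
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // parse(InputSource) reads from the character stream instead of fetching a URL
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println(countElements("<doc><body>Hello world!!!</body></doc>"));
    }
}
```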

Wrapping the parser creation, handler setup, and parse call in exception handlers gives the following method. XMLReaderFactory.createXMLReader reports a parser class that cannot be loaded or instantiated as a SAXException, so only SAXException and IOException need to be caught:

public void parse(String xmlFile) {
    String parserClass = "org.apache.xerces.parsers.SAXParser";
    try {
        // Throws SAXException if the parser class cannot be loaded or instantiated
        XMLReader parser = XMLReaderFactory.createXMLReader(parserClass);
        DefaultHandler handler = new DebugHandler();
        // SAX2 registers the handler with setContentHandler
        parser.setContentHandler(handler);
        parser.setErrorHandler(handler);
        parser.parse(xmlFile);
    } catch (SAXException se) {
        se.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
}
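The setErrorHandler call matters because DefaultHandler silently ignores warnings and recoverable errors and only rethrows fatal ones. A handler that records where a problem occurred might look like the sketch below. The names ErrorReportDemo, ReportingHandler, and check are hypothetical, and the message format is an illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class ErrorReportDemo {

    /** Hypothetical handler that records each reported problem with its line number. */
    static class ReportingHandler extends DefaultHandler {
        final List<String> problems = new ArrayList<>();

        @Override
        public void warning(SAXParseException e) {
            problems.add(format("warning", e));
        }

        @Override
        public void error(SAXParseException e) {
            problems.add(format("error", e));
        }

        @Override
        public void fatalError(SAXParseException e) throws SAXException {
            problems.add(format("fatal", e));
            throw e; // parsing cannot usefully continue after a fatal error
        }

        private static String format(String level, SAXParseException e) {
            return level + " at line " + e.getLineNumber() + ": " + e.getMessage();
        }
    }

    /** Parses the document and returns the problems that were reported. */
    static List<String> check(String xml) {
        ReportingHandler handler = new ReportingHandler();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Fatal parse errors were already recorded by fatalError; record anything else
            if (handler.problems.isEmpty()) {
                handler.problems.add("failed: " + e.getMessage());
            }
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.problems;
    }

    public static void main(String[] args) {
        // The <body> element is never closed, so the parser reports a fatal error
        check("<doc><body>oops</doc>").forEach(System.out::println);
    }
}
```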

A simple class to execute the parser is:

package com.xweave.xmldb.ui.test1;

import org.xml.sax.*;
import org.xml.sax.helpers.*;
import java.io.IOException;

public class Test {
    public Test() {
        super();
    }

    public static void main(String[] args) {
        // The XML file (or URL) to parse is the only argument
        if (args.length < 1) {
            System.err.println("Test: requires <file> as argument");
            return;
        }
        (new Test()).parse(args[0]);
    }

    public void parse(String xmlFile) {
        // code given above
    }
}

This code is available as Test.java.

A simple handler can be created as:

package com.xweave.xmldb.ui.test1;

import org.xml.sax.*;

/**
 * Test Handler that prints debug statements
 */
class DebugHandler extends org.xml.sax.helpers.DefaultHandler implements org.xml.sax.ContentHandler {
    public DebugHandler() {
        super();
    }

    public void characters(char[] chars, int start, int length) {
        String string = new String(chars, start, length);
        // The trailing "!" delimits white space in the output
        System.out.println("chars=" + string + "!");
    }

    public void endDocument() {
        System.out.println("end document");
    }

    public void endElement(String namespaceURI, String localName, String qName) {
        System.out.println("end: " + localName);
    }

    public void startDocument() {
        System.out.println("start document");
    }

    public void startElement(String namespaceURI, String localName, String qName, Attributes attrList) {
        System.out.println("start: " + localName);
    }
}

This code is available as DebugHandler.java.
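The namespaceURI, localName, and qName arguments passed to startElement and endElement are where the SAX2 namespace support becomes visible. The sketch below shows what each argument holds for a prefixed element. It is an illustration only: NamespaceDemo and describeElements are hypothetical names, the urn:example:inventory namespace is made up, and the JDK parser is used with namespace awareness switched on explicitly.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NamespaceDemo {

    /** Returns one "{namespaceURI}localName (qName)" entry per element start. */
    static List<String> describeElements(String xml) {
        List<String> seen = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                seen.add("{" + uri + "}" + localName + " (" + qName + ")");
            }
        };
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // JAXP factories are not namespace-aware unless asked to be
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        String xml = "<inv:order xmlns:inv='urn:example:inventory'><inv:item/></inv:order>";
        describeElements(xml).forEach(System.out::println);
    }
}
```

Printing localName, as DebugHandler does, therefore drops the prefix; an application that mixes vocabularies would normally compare the namespace URI as well.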

The DebugHandler class implements the ContentHandler SAX interface (through DefaultHandler) and has methods for the following:

The beginning of the document—emits “start document”
The beginning of each element—emits “start: <name>”
The end of each character data region—emits the character data region (ended with a “!” to delimit white space)
The end of each element—emits “end: <name>”
The end of the document—emits “end document”

When executing the test procedure on a simple document like the following:

<?xml version="1.0"?>
<doc>
<body>Hello world!!!</body>
</doc>

The handler will create output something like this:

start document
start: doc
start: body
chars=Hello world!!!!
end: body
end: doc
end document

The phrase “something like” is somewhat misleading because the output will be this:

start document
start: doc
chars=
!
start: body
chars=Hello world!!!!
end: body
chars=
!
end: doc
end document

The two extra character data regions surround the element with type name “body”. Each region consists of a line break (the white space between the tags). In practice, character data regions can be trimmed of white space and empty regions ignored. A document that would give the former output is:

<?xml version="1.0"?>

<doc><body>Hello world!!!</body></doc>
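The trimming and filtering of empty regions described above can be sketched as follows. The names TrimmedTextDemo and TrimmingHandler are hypothetical. The handler also buffers text until the next element boundary, because a SAX parser is allowed to deliver one character data region in several characters calls.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class TrimmedTextDemo {

    /** Hypothetical variant of DebugHandler that drops white-space-only regions. */
    static class TrimmingHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        final List<String> events = new ArrayList<>();

        @Override
        public void characters(char[] ch, int start, int length) {
            buffer.append(ch, start, length); // a region may arrive in several pieces
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flushText();
            events.add("start: " + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flushText();
            events.add("end: " + qName);
        }

        /** Emits the buffered text if anything is left after trimming. */
        private void flushText() {
            String text = buffer.toString().trim();
            buffer.setLength(0);
            if (!text.isEmpty()) {
                events.add("chars=" + text + "!");
            }
        }
    }

    static List<String> parse(String xml) {
        TrimmingHandler handler = new TrimmingHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        parse("<doc>\n<body>Hello world!!!</body>\n</doc>").forEach(System.out::println);
    }
}
```

With this handler, the document with line breaks and the single-line document produce the same sequence of events.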

A tree-based parser, such as DOM, may filter such occurrences (for example, when it is configured to ignore white space in element content), but because the SAX event-based parser is at a lower level, the application that uses it must filter those occurrences itself.