This tutorial is extracted from Appendix B of Designing
XML Databases and has been updated to use a SAX2 parser. The
complete code described in this tutorial is available as
a jar file.
There are two major types of XML parsers: tree-based parsers
and event-based parsers. A tree-based parser creates an
internal tree structure. The tree structure is navigated
by the application to extract information. The DOM (Document
Object Model) parser is a tree-based parser that creates
objects for each element and character data region in the
document. An event-based parser reports parsing events,
such as the start and end of an element, and does not build
an internal tree structure. The SAX (Simple API for XML)
parser is an event-based parser that may be more efficient
for extracting data from large documents.
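The difference between the two parser types can be sketched by counting elements both ways. This is a minimal illustration, not code from the book: it assumes the parsers bundled with the JDK (reached through javax.xml.parsers) rather than Xerces, and the class name TreeVersusEvent and the sample document are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: counts elements with a tree-based and an event-based parser.
class TreeVersusEvent {
    static final String XML = "<doc><body>Hello</body><body>World</body></doc>";

    private static ByteArrayInputStream input() {
        return new ByteArrayInputStream(XML.getBytes(StandardCharsets.UTF_8));
    }

    // Tree-based: the whole document is built in memory, then navigated.
    static int countWithDom(String name) {
        try {
            Document tree = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(input());
            return tree.getElementsByTagName(name).getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Event-based: no tree is built; a callback counts matching start tags.
    static int countWithSax(String name) {
        final int[] count = {0};
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if (qName.equals(name)) {
                    count[0]++;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(input(), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println("DOM: " + countWithDom("body"));
        System.out.println("SAX: " + countWithSax("body"));
    }
}
```

Both methods return the same count. The SAX version never holds more than the current event in memory, which is where the potential saving on large documents comes from.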
The SAX parser calls a user-defined method at the beginning
and end of each element and for each character data region.
The user-defined methods can build an alternative data
structure that meets the requirements of the application,
often more efficiently than building the DOM tree. For example, if the application is only
extracting part of the data from an XML stream, then the
parsing may be much faster using SAX than creating the DOM
objects that would not be used. However, the event-based
parsing may require the creation of additional, temporary
data structures to retain parsed information before placing
it in the final data structure. For example, character data
may be associated with the containing element. In that case,
a stack of elements is necessary to facilitate the association.
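The stack of elements mentioned above might look like the following sketch. It is an illustration under assumptions: the class name ElementStackHandler and the "element=text" output format are invented here, and the JDK's built-in parser is used so that the example runs without extra jars.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: pairs each piece of text with its containing element.
class ElementStackHandler extends DefaultHandler {
    // Names of the elements that are currently open, innermost on top.
    private final Deque<String> open = new ArrayDeque<>();
    // Results in document order, each formatted as "element=text".
    final List<String> pairs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        String text = new String(chars, start, length).trim();
        if (!text.isEmpty()) {
            // The top of the stack is the element that contains this text.
            pairs.add(open.peek() + "=" + text);
        }
    }

    static List<String> parse(String xml) {
        try {
            ElementStackHandler handler = new ElementStackHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.pairs;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("<doc><title>Hi</title><body>Hello</body></doc>"));
    }
}
```

A SAX parser is allowed to deliver one run of text in several characters() calls, so a production handler would normally accumulate text in a buffer instead of treating each call as a complete region.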
XML parsers are often created using a factory
pattern. A string of the class name for the parser is passed
to the parser factory to create a parser. Many companies, including
IBM, Sun, and Oracle, provide parsers that can be created
using this approach. An advantage of using a parser factory
is that the parser can be easily replaced if a more efficient
one is found. The main caveat when using the factory pattern
is to be sure to include the specified parser class in the
CLASSPATH environment variable.
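One way to take advantage of the factory pattern is to read the parser class name from configuration, so that the parser can be swapped without recompiling. The sketch below is an assumption-laden illustration: the property name example.sax.parser and the class ParserSelector are hypothetical, and the fallback uses the JDK's default parser. Note that XMLReaderFactory is marked deprecated in recent JDKs, although it is still available.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

// Hypothetical helper: picks the parser class from a system property.
class ParserSelector {
    // Property name invented for this sketch.
    static final String PROPERTY = "example.sax.parser";

    @SuppressWarnings("deprecation")
    static XMLReader createParser() {
        String parserClass = System.getProperty(PROPERTY);
        try {
            if (parserClass != null) {
                // SAX2 factory: instantiate the parser named by the string.
                return XMLReaderFactory.createXMLReader(parserClass);
            }
            // No class configured: fall back to the platform's default parser.
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            // Typically the named class is missing from the class path.
            throw new IllegalStateException("Cannot create parser: " + parserClass, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(createParser().getClass().getName());
    }
}
```

Running with -Dexample.sax.parser=org.apache.xerces.parsers.SAXParser would select Xerces, provided its jar is on the class path.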
The process of creating a SAX parser consists of the following:
- Creating an instance of the Parser.
- Creating and initializing an instance of the Parser Handler.
- Initializing the Parser to use the Parser Handler.
- Calling the Parser with the XML URL.
- The Parser retrieves and parses the document, calling
  the Handler at:
  - The beginning of the document.
  - The beginning of each element.
  - The end of each character data region.
  - The end of each element.
  - The end of the document.
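The steps above can be sketched as a small program that records the order in which the handler is called. The class name EventOrder and the event labels are assumptions made for this illustration, and the JDK's default parser stands in for whichever SAX2 parser is actually used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the name of each callback as it happens.
class EventOrder extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        events.add("start:" + qName);
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        events.add("chars");
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
    }

    @Override
    public void endDocument() { events.add("endDocument"); }

    static List<String> run(String xml) {
        try {
            // Step 1: create the parser.
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            // Step 2: create the handler.
            EventOrder handler = new EventOrder();
            // Step 3: tell the parser to use the handler.
            parser.setContentHandler(handler);
            // Step 4: parse; the parser calls back into the handler as it reads.
            parser.parse(new InputSource(new StringReader(xml)));
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<doc>hi</doc>"));
    }
}
```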
There are two versions of the SAX API currently available: SAX1 and SAX2.
The code in this tutorial uses the SAX2 parser framework,
which includes support for namespaces (among other things).
The methods described here are slightly different from the
ones in SAX1. Also, the classes in this tutorial refer to
an Apache Xerces parser, though any SAX2 parser may be used.
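To make the namespace support concrete, the sketch below records the namespace URI and local name that a SAX2 handler receives for each element. The class name NamespaceNames and the urn:example:inventory namespace are invented for the example. It also assumes the JDK's SAXParserFactory, which is not namespace-aware unless setNamespaceAware(true) is called.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records "namespaceURI|localName" for every element.
class NamespaceNames extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes attrs) {
        // The URI is the empty string for elements that are in no namespace.
        seen.add(namespaceURI + "|" + localName);
    }

    static List<String> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Without this call the JDK parser does not report namespace URIs.
            factory.setNamespaceAware(true);
            NamespaceNames handler = new NamespaceNames();
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.seen;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<inv:order xmlns:inv=\"urn:example:inventory\">"
                + "<inv:item/><note/></inv:order>";
        System.out.println(parse(xml));
    }
}
```

The prefix inv never appears in the recorded names: the handler sees the URI the prefix is bound to, which is what lets an application compare element names reliably across documents that choose different prefixes.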
Code to create a parser is:
    String parserClass = "org.apache.xerces.parsers.SAXParser";
    XMLReader parser = XMLReaderFactory.createXMLReader(parserClass);
The parser class is defined and then the parser factory
is used to create a parser. After the parser is created,
a content handler is created, and the parser is set to
use that handler:
    DefaultHandler handler = new DebugHandler();
    parser.setContentHandler(handler);
    parser.setErrorHandler(handler);
The parser is invoked by passing it the URL of the document, such
as:
    parser.parse(xmlFile);
The entire code is wrapped in exception handlers to catch
exceptions, and the result as a method is:
    public void parse(String xmlFile) {
        String parserClass = "org.apache.xerces.parsers.SAXParser";
        try {
            // Create the parser named by parserClass via the SAX2 factory
            XMLReader parser = XMLReaderFactory.createXMLReader(parserClass);
            // One handler object receives both content and error events
            DefaultHandler handler = new DebugHandler();
            parser.setContentHandler(handler);
            parser.setErrorHandler(handler);
            parser.parse(xmlFile);
        }
        catch (SAXException se) {
            // Raised if the parser cannot be created or the document cannot be parsed
            se.printStackTrace();
        }
        catch (IOException ioe) {
            // Raised if the document cannot be read
            ioe.printStackTrace();
        }
    }
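The exception handling can be exercised with a document that is not well formed. The following is a sketch under assumptions: the class ParseErrorDemo and its check method are hypothetical, and the JDK's built-in parser is used in place of Xerces. It relies on SAXParseException, a subclass of SAXException that carries the position of the error.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports where a document stops being well formed.
class ParseErrorDemo {
    // Returns null for a well-formed document, otherwise a short description.
    static String check(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return null;
        } catch (SAXParseException spe) {
            // The more specific exception must be caught before SAXException.
            return "line " + spe.getLineNumber() + ": " + spe.getMessage();
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("<doc><body>ok</body></doc>"));
        System.out.println(check("<doc><body>oops</doc>"));
    }
}
```

The exact wording of the message depends on the parser, so an application should rely on the exception type and position rather than on the text.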
A simple class to execute the parser is:
    package com.xweave.xmldb.ui.test1;

    import org.xml.sax.*;
    import org.xml.sax.helpers.*;
    import java.io.IOException;

    public class Test {
        public Test() {
            super();
        }

        public static void main(java.lang.String[] args) {
            // The XML file (or URL) to parse is the single required argument
            if (args.length < 1) {
                System.err.println("Test: requires <file> as argument");
                return;
            }
            String fileName = args[0];
            (new Test()).parse(fileName);
        }

        public void parse(String xmlFile) {
            // code given above
        }
    }
This code is available as Test.java.
A simple handler can be created as:
    package com.xweave.xmldb.ui.test1;

    import org.xml.sax.*;

    /**
     * Test Handler that prints debug statements
     */
    class DebugHandler extends org.xml.sax.helpers.DefaultHandler
            implements org.xml.sax.ContentHandler {
        public DebugHandler() {
            super();
        }

        // Character data is printed with a trailing "!" to delimit white space
        public void characters(char[] chars, int start, int length) {
            String string = new String(chars, start, length);
            System.out.println("chars=" + string + "!");
        }

        public void endDocument() {
            System.out.println("end document");
        }

        public void endElement(String namespaceURI, String localName, String qName) {
            System.out.println("end: " + localName);
        }

        public void startDocument() {
            System.out.println("start document");
        }

        public void startElement(String namespaceURI, String localName,
                String qName, Attributes attrList) {
            System.out.println("start: " + localName);
        }
    }
This code is available as DebugHandler.java.
The DebugHandler class implements the ContentHandler SAX
interface and has methods for the following:
- The beginning of the document: emits start document
- The beginning of each element: emits start: <name>
- The end of each character data region: emits the
  character data region (ended with a ! to delimit
  white space)
- The end of each element: emits end: <name>
- The end of the document: emits end document
When executing the test procedure on a simple document
like the following:
<?xml version="1.0"?>
<doc>
<body>Hello world!!!</body>
</doc>
The handler will create output something like this:
start document
start: doc
start: body
chars=Hello world!!!!
end: body
end: doc
end document
The phrase "something like" is somewhat misleading
because the actual output will be this:
start document
start: doc
chars=
!
start: body
chars=Hello world!!!!
end: body
chars=
!
end: doc
end document
Two extra character data regions surround the
element with type name "body". Each consists of the line break
between adjacent tags. In practice, character data
regions may be trimmed of white space and empty regions
can be ignored. A document that would give the former output
is:
<?xml version="1.0"?>
<doc><body>Hello world!!!</body></doc>
A tree-based parser, such as DOM, can often be configured to filter such occurrences,
but because the SAX event-based parser is at a lower level,
the application that uses it must filter those occurrences.
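Filtering of this kind might be done in the handler itself, as in the sketch below. It is one possible approach, not the book's code: the class name TrimmingHandler is hypothetical, the JDK's built-in parser is assumed, and text is buffered until the next tag so that a region delivered in several characters() calls is still handled as a whole.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: trims character data and drops empty regions.
class TrimmingHandler extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    final List<String> events = new ArrayList<>();

    // Emits the buffered text, if any remains after trimming.
    private void flush() {
        String text = buffer.toString().trim();
        buffer.setLength(0);
        if (!text.isEmpty()) {
            events.add("chars=" + text);
        }
    }

    @Override
    public void characters(char[] chars, int start, int length) {
        // Only accumulate here; the region ends at the next tag.
        buffer.append(chars, start, length);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attrs) {
        flush();
        events.add("start: " + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
        events.add("end: " + qName);
    }

    static List<String> parse(String xml) {
        try {
            TrimmingHandler handler = new TrimmingHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n<doc>\n<body>Hello world!!!</body>\n</doc>";
        for (String event : parse(xml)) {
            System.out.println(event);
        }
    }
}
```

With this handler the multi-line document and the single-line document produce the same sequence of events, because the white-space-only regions never reach the event list.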
This tutorial is Copyright 2002 by Mark
Graves and contains material Copyright 2002 by Prentice Hall
PTR. All rights reserved.