Learn Java: Java API for XML Processing

The Java API for XML Processing (JAXP) is for processing XML data using applications written in the Java programming language. JAXP leverages the parser standards Simple API for XML Parsing (SAX) and Document Object Model (DOM) so that you can choose to parse your data as a stream of events or to build an object representation of it. JAXP also supports the Extensible Stylesheet Language Transformations (XSLT) standard, giving you control over the presentation of the data and enabling you to convert the data to other XML documents or to other formats, such as HTML. JAXP also provides namespace support, allowing you to work with DTDs that might otherwise have naming conflicts.

Designed to be flexible, JAXP allows you to use any XML-compliant parser from within your application. It does this with what is called a pluggability layer, which lets you plug in an implementation of the SAX or DOM API. The pluggability layer also allows you to plug in an XSL processor, letting you control how your XML data is displayed.

The basic outline of the SAX parsing APIs are shown in Figure 4-1. To start the process, an instance of the SAXParserFactory class is used to generate an instance of the parser.

The main JAXP APIs are defined in the javax.xml.parsers package. That package contains vendor-neutral factory classes--SAXParserFactory, DocumentBuilderFactory, and TransformerFactory--which give you a SAXParser, a DocumentBuilder, and an XSLT transformer, respectively. DocumentBuilder, in turn, creates a DOM-compliant Document object.

The factory APIs let you plug in an XML implementation offered by another vendor without changing your source code. The implementation you get depends on the setting of the javax.xml.parsers.SAXParserFactory, javax.xml.parsers.DocumentBuilderFactory, and javax.xml.transform.TransformerFactory system properties, using System.setProperties() in the code, in an Ant build script, or -DpropertyName="..." on the command line. The default values (unless overridden at runtime on the command line or in the code) point to Sun's implementation.

The SAX and DOM APIs are defined by the XML-DEV group and by the W3C, respectively. The libraries that define those APIs are as follows:

javax.xml.parsers: The JAXP APIs, which provide a common interface for different vendors' SAX and DOM parsers
org.w3c.dom: Defines the Document class (a DOM) as well as classes for all the components of a DOM
org.xml.sax: Defines the basic SAX APIs
javax.xml.transform: Defines the XSLT APIs that let you transform XML into other forms
The Simple API for XML (SAX) is the event-driven, serial-access mechanism that does element-by-element processing. The API for this level reads and writes XML to a data repository or the web. For server-side and high-performance applications, you will want to fully understand this level. But for many applications, a minimal understanding will suffice.

The DOM API is generally an easier API to use. It provides a familiar tree structure of objects. You can use the DOM API to manipulate the hierarchy of application objects it encapsulates. The DOM API is ideal for interactive applications because the entire object model is present in memory, where it can be accessed and manipulated by the user.

On the other hand, constructing the DOM requires reading the entire XML structure and holding the object tree in memory, so it is much more CPU- and memory-intensive. For that reason, the SAX API tends to be preferred for server-side applications and data filters that do not require an in-memory representation of the data.

Finally, the XSLT APIs defined in javax.xml.transform let you write XML data to a file or convert it into other forms. And, as you'll see in the XSLT section of this tutorial, you can even use it in conjunction with the SAX APIs to convert legacy data to XML.

The parser wraps a SAXReader object. When the parser's parse() method is invoked, the reader invokes one of several callback methods implemented in the application. Those methods are defined by the interfaces ContentHandler, ErrorHandler, DTDHandler, and EntityResolver.

Here is a summary of the key SAX APIs:

SAXParserFactory

A SAXParserFactory object creates an instance of the parser determined by the system property, javax.xml.parsers.SAXParserFactory.

SAXParser

The SAXParser interface defines several kinds of parse() methods. In general, you pass an XML data source and a DefaultHandler object to the parser, which processes the XML and invokes the appropriate methods in the handler object.

SAXReader

The SAXParser wraps a SAXReader. Typically, you don't care about that, but every once in a while you need to get hold of it using SAXParser's getXMLReader() so that you can configure it. It is the SAXReader that carries on the conversation with the SAX event handlers you define.

DefaultHandler

Not shown in the diagram, a DefaultHandler implements the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces (with null methods), so you can override only the ones you're interested in.

ContentHandler

Methods such as startDocument, endDocument, startElement, and endElement are invoked when an XML tag is recognized. This interface also defines the methods characters and processingInstruction, which are invoked when the parser encounters the text in an XML element or an inline processing instruction, respectively.

ErrorHandler

Methods error, fatalError, and warning are invoked in response to various parsing errors. The default error handler throws an exception for fatal errors and ignores other errors (including validation errors). That's one reason you need to know something about the SAX parser, even if you are using the DOM. Sometimes, the application may be able to recover from a validation error. Other times, it may need to generate an exception. To ensure the correct handling, you'll need to supply your own error handler to the parser.

DTDHandler

Defines methods you will generally never be called upon to use. Used when processing a DTD to recognize and act on declarations for an unparsed entity.

EntityResolver

The resolveEntity method is invoked when the parser must identify data identified by a URI. In most cases, a URI is simply a URL, which specifies the location of a document, but in some cases the document may be identified by a URN--a public identifier, or name, that is unique in the web space. The public identifier may be specified in addition to the URL. The EntityResolver can then use the public identifier instead of the URL to find the document--for example, to access a local copy of the document if one exists.

A typical application implements most of the ContentHandler methods, at a minimum. Because the default implementations of the interfaces ignore all inputs except for fatal errors, a robust implementation may also want to implement the ErrorHandler methods.

The SAX Packages
The SAX parser is defined in the packages listed in Table 4-1.

Table 4-1 SAX Packages Package Description
org.xml.sax Defines the SAX interfaces. The name org.xml is the package prefix that was settled on by the group that defined the SAX API.
org.xml.sax.ext Defines SAX extensions that are used for doing more sophisticated SAX processing--for example, to process a document type definition (DTD) or to see the detailed syntax for a file.
org.xml.sax.helpers Contains helper classes that make it easier to use SAX--for example, by defining a default handler that has null methods for all the interfaces, so that you only need to override the ones you actually want to implement.
javax.xml.parsers Defines the SAXParserFactory class, which returns the SAXParser. Also defines exception classes for reporting errors.

The Document Object Model APIs
Figure 4-2 shows the DOM APIs in action.

You use the javax.xml.parsers.DocumentBuilderFactory class to get a DocumentBuilder instance, and you use that instance to produce a Document object that conforms to the DOM specification. The builder you get, in fact, is determined by the system property javax.xml.parsers.DocumentBuilderFactory, which selects the factory implementation that is used to produce the builder. (The platform's default value can be overridden from the command line.)

You can also use the DocumentBuilder newDocument() method to create an empty Document that implements the org.w3c.dom.Document interface. Alternatively, you can use one of the builder's parse methods to create a Document from existing XML data. The result is a DOM tree like that shown in Figure 4-2.

--------------------------------------------------------------------------------

Note: Although they are called objects, the entries in the DOM tree are actually fairly low-level data structures. For example, consider this structure: blue. There is an element node for the color tag, and under that there is a text node that contains the data, blue! This issue will be explored at length in the DOM section of the tutorial, but developers who are expecting objects are usually surprised to find that invoking getNodeValue() on the element node returns nothing! For a truly object-oriented tree, see the JDOM API at http://www.jdom.org.

--------------------------------------------------------------------------------

The DOM Packages
The Document Object Model implementation is defined in the packages listed in Table 4-2.

Table 4-2 DOM Packages Package Description
org.w3c.dom Defines the DOM programming interfaces for XML (and, optionally, HTML) documents, as specified by the W3C.
javax.xml.parsers Defines the DocumentBuilderFactory class and the DocumentBuilder class, which returns an object that implements the W3C Document interface. The factory that is used to create the builder is determined by the javax.xml.parsers system property, which can be set from the command line or overridden when invoking the new Instance method. This package also defines the ParserConfigurationException class for reporting errors.

The Extensible Stylesheet Language Transformations APIs
Figure 4-3 shows the XSLT APIs in action.

A TransformerFactory object is instantiated and used to create a Transformer. The source object is the input to the transformation process. A source object can be created from a SAX reader, from a DOM, or from an input stream.

Similarly, the result object is the result of the transformation process. That object can be a SAX event handler, a DOM, or an output stream.

When the transformer is created, it can be created from a set of transformation instructions, in which case the specified transformations are carried out. If it is created without any specific instructions, then the transformer object simply copies the source to the result.

The XSLT Packages
The XSLT APIs are defined in the packages shown in Table 4-3.

Table 4-3 XSLT Packages Package Description
javax.xml.transform Defines the TransformerFactory and Transformer classes, which you use to get an object capable of doing transformations. After creating a transformer object, you invoke its transform() method, providing it with an input (source) and output (result).
javax.xml.transform.dom Classes to create input (source) and output (result) objects from a DOM.
javax.xml.transform.sax Classes to create input (source) objects from a SAX parser and output (result) objects from a SAX event handler.
javax.xml.transform.stream Classes to create input (source) objects and output (result) objects from an I/O stream.

Simple API for XML (SAX), an event-driven, serial-access mechanism for accessing XML documents. This protocol is frequently used by servlets and network-oriented programs that need to transmit and receive XML documents, because it's the fastest and least memory-intensive mechanism that is currently available for dealing with XML documents, other than StAX.

SAX is an event-driven model (you provide the callback methods, and the parser invokes them as it reads the XML data), and that makes it harder to visualize. Finally, you can't "back up" to an earlier part of the document, or rearrange it, any more than you can back up a serial data stream or rearrange characters you have read from that stream.

For those reasons, developers who are writing a user-oriented application that displays an XML document and possibly modifies it will want to use the DOM mechanism

even if you plan to build DOM applications exclusively, there are several important reasons for familiarizing yourself with the SAX model:

Same Error Handling: The same kinds of exceptions are generated by the SAX and DOM APIs, so the error handling code is virtually identical.
Handling Validation Errors: By default, the specifications require that validation errors (which you'll learn more about in this part of the tutorial) are ignored. If you want to throw an exception in the event of a validation error (and you probably do), then you need to understand how SAX error handling works.
Converting Existing Data: There is a mechanism we can use to convert an existing data set to XML. However, taking advantage of that mechanism requires an understanding of the SAX model.

SAX is fast and efficient, but its event model makes it most useful for such state-independent filtering. For example, a SAX parser calls one method in your application when an element tag is encountered and calls a different method when text is found. If the processing you're doing is state-independent (meaning that it does not depend on the elements have come before), then SAX works fine.

On the other hand, for state-dependent processing, where the program needs to do one thing with the data under element A but something different with the data under element B, then a pull parser such as the Streaming API for XML (StAX) would be a better choice. With a pull parser, you get the next node, whatever it happens to be, at any point in the code that you ask for it. So it's easy to vary the way you process text (for example), because you can process it multiple places in the program.

SAX requires much less memory than DOM, because SAX does not construct an internal representation (tree structure) of the XML data, as a DOM does. Instead, SAX simply sends data to the application as it is read; your application can then do whatever it wants to do with the data it sees.

Pull parsers and the SAX API both act like a serial I/O stream. You see the data as it streams in, but you can't go back to an earlier position or leap ahead to a different position. In general, such parsers work well when you simply want to read data and have the application act on it.
But when you need to modify an XML structure--especially when you need to modify it interactively--an in-memory structure makes more sense. DOM is one such model.

Tuesday, January 1, 2008

Java API for XML Processing

1 comment:

About Me

Blog Archive

Followers