com.aliasi.xml
Class SAXWriter

java.lang.Object
  extended by org.xml.sax.helpers.DefaultHandler
      extended by com.aliasi.xml.SimpleElementHandler
          extended by com.aliasi.xml.SAXWriter
All Implemented Interfaces:
ContentHandler, DTDHandler, EntityResolver, ErrorHandler
Direct Known Subclasses:
XHtmlWriter

public class SAXWriter
extends SimpleElementHandler

A SAXWriter handles SAX events and writes a character-based representation to a specified output stream in the specified character encoding. Characters that can't be encoded in the specified encoding will be written as question marks, rather than being escaped or throwing exceptions; this substitution is the default behavior of Java's character encoders, and it lets the character-handling methods run faster than they would if every character had to be inspected for escapes.
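
The question-mark substitution mentioned above can be observed with the JDK alone. The following is a minimal sketch, not the SAXWriter implementation: the class name ReplacementSketch and the method encodeAscii are hypothetical, and only java.io.OutputStreamWriter is used.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical illustration class; not part of com.aliasi.xml.
final class ReplacementSketch {

    // Encodes the text as US-ASCII through a plain OutputStreamWriter
    // and returns the resulting bytes decoded back into a string.
    static String encodeAscii(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(bytes, "US-ASCII");
        writer.write(text);
        writer.flush(); // push buffered characters through the encoder
        return bytes.toString("US-ASCII");
    }

    public static void main(String[] args) throws IOException {
        // U+00E9 has no US-ASCII encoding, so the encoder substitutes '?'.
        System.out.println(encodeAscii("caf\u00E9")); // prints "caf?"
    }
}
```

No exception is raised for the unencodable character; the substitution happens silently inside the encoder.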

An XML 1.0 declaration with an explicit encoding specification is inserted as the first line of the file, where CharSet is the name of the specified charset:

<?xml version="1.0" encoding="CharSet"?>
A DTD may be specified with the method setDTDString(String), which must be called before the start document handler is called.

Comments, processing instructions, and non-ignorable whitespace are left in place. Ignorable whitespace is removed. Attributes are written in alphabetical order.

Characters in PCDATA content are rendered as entities if they are one of the reserved characters listed below, or if they are Unicode code points that are not directly encodable in the current character set.

Character   Name                           Entity Escape
&           Ampersand                      &amp;
<           Less than                      &lt;
>           Greater than                   &gt;
"           Double quote                   &quot;
U+wxyz      Unencodable Hex Unicode wxyz   &#xwxyz;

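
The escaping rules in the table can be sketched with the JDK's CharsetEncoder. This is an illustration under stated assumptions rather than the writer's actual code: the class PcdataEscapeSketch and its escape method are hypothetical names, and the encodability test uses CharsetEncoder.canEncode.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical illustration class; not part of com.aliasi.xml.
final class PcdataEscapeSketch {

    // Escapes the four reserved characters and any code point
    // that the named charset cannot encode directly.
    static String escape(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:
                    String s = new String(Character.toChars(cp));
                    if (encoder.canEncode(s)) {
                        sb.append(s);
                    } else {
                        // Hex character reference, e.g. &#xE9;
                        sb.append("&#x")
                          .append(Integer.toHexString(cp).toUpperCase())
                          .append(';');
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a<b & caf\u00E9", "US-ASCII"));
        // prints "a&lt;b &amp; caf&#xE9;"
    }
}
```

With ISO-8859-1 as the charset, the same input keeps U+00E9 as a literal character because that charset can encode it.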
Note that unmatched Unicode surrogates should not be presented through characters(char[],int,int). Specifically, every high surrogate must be followed by a low surrogate, and every low surrogate must be preceded by a high surrogate. A high surrogate is a Unicode code unit in the range U+D800 to U+DBFF inclusive. A low surrogate is a code unit in the range U+DC00 to U+DFFF inclusive. The code points U+FFFF (used as a sentinel) and U+FFFE (the byte-swapped form of the byte-order mark) should also not be encoded. An attempt to encode an unmatched surrogate or one of these code points will not raise an exception on output; the characters will simply be output. The resulting XML bytes will not be converted back to their original form by a Unicode-compliant byte-to-character converter. Default settings of InputStreamReader will simply perform a substitution.
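
Because the writer does not reject such input, a caller may want to check a character slice before passing it on. The following is a minimal sketch of such a check; SurrogateCheckSketch and isWellFormed are hypothetical names, and the surrogate tests follow the naming of java.lang.Character (isHighSurrogate covers U+D800 to U+DBFF, isLowSurrogate covers U+DC00 to U+DFFF).

```java
// Hypothetical pre-check; not part of com.aliasi.xml.
final class SurrogateCheckSketch {

    // Returns true if the slice contains only properly paired surrogates
    // and neither U+FFFE nor U+FFFF.
    static boolean isWellFormed(char[] ch, int start, int length) {
        int end = start + length;
        for (int i = start; i < end; i++) {
            char c = ch[i];
            if (c == 0xFFFE || c == 0xFFFF) {
                return false;
            }
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed by a low surrogate.
                if (i + 1 >= end || !Character.isLowSurrogate(ch[i + 1])) {
                    return false;
                }
                i++; // skip the low surrogate just checked
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate without a preceding high surrogate.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        char[] paired = Character.toChars(0x1F600); // one supplementary code point
        char[] lone = { 'a', (char) 0xD83D, 'b' };  // high surrogate without a partner
        System.out.println(isWellFormed(paired, 0, paired.length)); // true
        System.out.println(isWellFormed(lone, 0, lone.length));     // false
    }
}
```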

The SAXWriter does not test document well-formedness. Nor does it test validity with respect to a document-type definition (DTD). For instance, an element foo can be started and then ended by an end tag for an element bar, and <foo></bar> will be output. As with other handlers, to throw exceptions in the face of ill-formed documents, compose a well-formedness filter with a SAXWriter.
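
One way to compose such a filter is to subclass the standard org.xml.sax.helpers.XMLFilterImpl. The sketch below checks only tag balance, which is a small part of well-formedness; TagBalanceFilter is a hypothetical class name, not something supplied by this package.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter that rejects mismatched start/end element events
// before forwarding them to the downstream content handler.
class TagBalanceFilter extends XMLFilterImpl {

    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        open.push(qName);
        super.startElement(uri, localName, qName, atts);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if (open.isEmpty() || !open.peek().equals(qName)) {
            String expected = open.isEmpty() ? "(none)" : "<" + open.peek() + ">";
            throw new SAXException("End tag </" + qName
                                   + "> does not match open element " + expected);
        }
        open.pop();
        super.endElement(uri, localName, qName);
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("Unclosed element <" + open.peek() + ">");
        }
        super.endDocument();
    }
}
```

To use it, a SAXWriter would be installed as the filter's content handler with setContentHandler, and the filter itself would be given to the XMLReader in place of the writer. Events then reach the writer only after passing the balance check.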

If xmlReader is an XMLReader, contentHandler is a ContentHandler, and in is an input stream, then

   xmlReader.setContentHandler(contentHandler);
   xmlReader.parse(new InputSource(in));
 
calls the same methods (modulo order of attributes and I/O exceptions) on the contentHandler as the following sequence of calls, which writes an intermediate XML file:
  FileOutputStream out = new FileOutputStream(fileName);
  xmlReader.setContentHandler(new SAXWriter(out,"UTF-8"));
  xmlReader.parse(new InputSource(in));
  out.close();
  xmlReader.setContentHandler(contentHandler);
  xmlReader.parse(fileName);
 

The SAXWriter handles namespace declarations according to the SAX 2 specification. It does this by storing the URI and namespace prefix received through the startPrefixMapping(String,String) event and printing them in the usual way as an attribute declaration on the next element. For proper behavior, the SAXWriter must receive start element events that are consistent with the following feature settings:

Feature                                          Value           Description
http://xml.org/sax/features/namespaces           true or false   Provides URI arguments to elements and attributes.
http://xml.org/sax/features/namespace-prefixes   true            Provides qualified name arguments to elements and attributes.

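
As a sketch of how a reader might be configured to match these settings, the following uses only the JDK's JAXP and SAX classes. NamespaceFeatureSketch and its events method are hypothetical names; the recording handler stands in for whatever content handler is actually used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of com.aliasi.xml.
final class NamespaceFeatureSketch {

    // Parses the XML with namespaces and namespace-prefixes both enabled
    // and records the prefix-mapping and start-element events received.
    static List<String> events(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // the "namespaces" feature
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setFeature("http://xml.org/sax/features/namespace-prefixes", true);

        final List<String> events = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startPrefixMapping(String prefix, String uri) {
                events.add("map " + prefix + "=" + uri);
            }
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("start " + qName);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(events("<a:b xmlns:a=\"urn:x\"/>"));
    }
}
```

With these settings the handler receives the qualified name a:b for the element, which is the name a writer needs in order to reproduce the original tag.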
For more information on these and other features, see:

Since:
LingPipe1.0
Version:
3.8
Author:
Bob Carpenter

Field Summary
 
Fields inherited from class com.aliasi.xml.SimpleElementHandler
CDATA_ATTS_TYPE, EMPTY_ATTS, NO_OP_DEFAULT_HANDLER
 
Constructor Summary
SAXWriter()
          Construct a SAX writer that does not have an output stream or character set specified.
SAXWriter(boolean xhtmlMode)
          Construct a SAX writer with the specified XHTML compliance mode, but without an output stream or character set specified.
SAXWriter(OutputStream out, String charsetName)
          Construct a SAX writer that writes to the specified output stream using the specified character set.
SAXWriter(OutputStream out, String charsetName, boolean xhtmlMode)
          Construct a SAX writer that writes to the specified output stream using the specified character set and specified XHTML compliance.
 
Method Summary
 void characters(char[] ch, int start, int length)
          Prints the characters in the specified range.
 String charsetName()
          Returns the name of the character set being used by this writer.
 void comment(char[] cs, int start, int length)
          Convenience method to write a slice of character data as a comment.
 void comment(String comment)
          Write the specified string as a comment.
 void endDocument()
          Flushes the underlying character writer's output to the output stream, trapping all exceptions.
 void endElement(String namespaceURI, String localName, String qName)
          Prints the end element, using the qualified name.
 void ignorableWhitespace(char[] ch, int start, int length)
          Does not print ignorable whitespace.
 void processingInstruction(String target, String data)
          Prints a representation of the processing instruction.
 void setDTDString(String dtdString)
          Sets the DTD to be written by this writer to the specified value.
 void setOutputStream(OutputStream out, String charsetName)
          Sets the output stream to which the XML is written, and the character set which is used to encode characters.
 void startDocument()
          Prints the XML declaration, and DTD declaration if any.
 void startElement(String namespaceURI, String localName, String qName, Attributes atts)
          Prints the start element, using the qualified name, and sorting the attributes using the underlying string ordering.
 void startPrefixMapping(String prefix, String uri)
          Handles the declaration of a namespace mapping from a specified URI to its identifying prefix.
 
Methods inherited from class com.aliasi.xml.SimpleElementHandler
addSimpleAttribute, characters, characters, characters, characters, createAttributes, createAttributes, createAttributes, createAttributes, createAttributes, createAttributes, endSimpleElement, endSimpleElement, startEndSimpleElement, startEndSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement
 
Methods inherited from class org.xml.sax.helpers.DefaultHandler
endPrefixMapping, error, fatalError, notationDecl, resolveEntity, setDocumentLocator, skippedEntity, unparsedEntityDecl, warning
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SAXWriter

public SAXWriter(OutputStream out,
                 String charsetName)
          throws UnsupportedEncodingException
Construct a SAX writer that writes to the specified output stream using the specified character set. See setOutputStream(OutputStream,String) for details on the management of the output stream and character set.

By default, the SAXWriter is not in XHTML mode. See SAXWriter(OutputStream,String,boolean) for more information.

Parameters:
out - Output stream to which bytes are written.
charsetName - Name of character encoding used to write output.
Throws:
UnsupportedEncodingException - If the character set is not supported.

SAXWriter

public SAXWriter(OutputStream out,
                 String charsetName,
                 boolean xhtmlMode)
          throws UnsupportedEncodingException
Construct a SAX writer that writes to the specified output stream using the specified character set and specified XHTML compliance. See setOutputStream(OutputStream,String) for details on the management of the output stream and character set.

Compliance with XHTML goes beyond well-formed XML. Although each XHTML document must be well-formed XML, not all well-formed XML documents are XHTML compliant. XHTML imposes two additional requirements on the expression of elements. The first requires a space before the closing slash of an element ended inline. Thus although the element <br/> is perfectly well-formed XML, in XHTML it must be written as <br />. The second requirement is that there be a distinct end tag for elements with attributes. This forbids well-formed XML such as <a name="foo"/>, requiring the alternative form <a name="foo"></a> for XHTML compliance.
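
The two rules can be sketched as a small rendering function. This is an illustration of the rules as stated above, not the writer's code; EmptyElementSketch and emptyElement are hypothetical names, and the attributes argument is assumed to be an already rendered string such as ` name="foo"` (with its leading space) or the empty string.

```java
// Hypothetical illustration class; not part of com.aliasi.xml.
final class EmptyElementSketch {

    // Renders an element with no content in plain XML or XHTML mode.
    static String emptyElement(String qName, String attributes, boolean xhtmlMode) {
        if (!xhtmlMode) {
            return "<" + qName + attributes + "/>";
        }
        if (attributes.isEmpty()) {
            // Rule 1: a space before the closing slash.
            return "<" + qName + " />";
        }
        // Rule 2: a distinct end tag for elements with attributes.
        return "<" + qName + attributes + "></" + qName + ">";
    }

    public static void main(String[] args) {
        System.out.println(emptyElement("br", "", false));              // <br/>
        System.out.println(emptyElement("br", "", true));               // <br />
        System.out.println(emptyElement("a", " name=\"foo\"", false));  // <a name="foo"/>
        System.out.println(emptyElement("a", " name=\"foo\"", true));   // <a name="foo"></a>
    }
}
```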

Parameters:
out - Output stream to which bytes are written.
charsetName - Name of character encoding used to write output.
xhtmlMode - Set to true to render XHTML-compliant output.
Throws:
UnsupportedEncodingException - If the character set is not supported.

SAXWriter

public SAXWriter()
Construct a SAX writer that does not have an output stream or character set specified. These must be set through setOutputStream(OutputStream,String) or an illegal state exception will be thrown by any output method. By default, the XHTML mode is turned off. See SAXWriter(OutputStream,String,boolean) for more information on XHTML compliance.


SAXWriter

public SAXWriter(boolean xhtmlMode)
Construct a SAX writer with the specified XHTML compliance mode, but without an output stream or character set specified. These must be set through setOutputStream(OutputStream,String) or an illegal state exception will be thrown by any output method. See SAXWriter(OutputStream,String,boolean) for more information on XHTML compliance.

Parameters:
xhtmlMode - Set to true to render XHTML-compliant output.

Method Detail

setDTDString

public void setDTDString(String dtdString)
Sets the DTD to be written by this writer to the specified value. There is no error checking on its well-formedness, and it is not wrapped in any way other than being printed on its own line; this allows arbitrary DTDs to be written.

Parameters:
dtdString - String to write after the XML declaration as the DTD declaration.
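
The prolog that results from the XML declaration and an optional DTD string can be sketched as follows. PrologSketch and prolog are hypothetical names, and the use of a single newline as the line separator is an assumption; the documentation above states only that the declaration is the first line and that the DTD string is printed unmodified on its own line.

```java
// Hypothetical illustration class; not part of com.aliasi.xml.
final class PrologSketch {

    // Builds the XML declaration followed by the DTD string, if any.
    // The DTD string is copied verbatim with no checking or wrapping.
    static String prolog(String charsetName, String dtdString) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"")
          .append(charsetName)
          .append("\"?>\n"); // assumed line separator
        if (dtdString != null) {
            sb.append(dtdString).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(prolog("UTF-8", "<!DOCTYPE doc SYSTEM \"doc.dtd\">"));
    }
}
```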

setOutputStream

public final void setOutputStream(OutputStream out,
                                  String charsetName)
                           throws UnsupportedEncodingException
Sets the output stream to which the XML is written, and the character set which is used to encode characters. Before writing a document, the output stream and character set must be set by the constructor or by this method. The output stream is not closed after an XML document is written, but all output is flushed to the stream at the end of the document, so the stream does not need to be flushed separately.

Parameters:
out - Output stream to which encoded characters are written.
charsetName - Character set to use for encoding characters.
Throws:
UnsupportedEncodingException - If the character set is not supported by the Java runtime.

startDocument

public void startDocument()
Prints the XML declaration, and DTD declaration if any.

Specified by:
startDocument in interface ContentHandler
Overrides:
startDocument in class DefaultHandler

endDocument

public void endDocument()
Flushes the underlying character writer's output to the output stream, trapping all exceptions.

Specified by:
endDocument in interface ContentHandler
Overrides:
endDocument in class DefaultHandler

startPrefixMapping

public void startPrefixMapping(String prefix,
                               String uri)
Handles the declaration of a namespace mapping from a specified URI to its identifying prefix. The mapping is buffered and then flushed and printed as an attribute during the next start-element call.

Specified by:
startPrefixMapping in interface ContentHandler
Overrides:
startPrefixMapping in class DefaultHandler
Parameters:
prefix - The namespace prefix being declared.
uri - The namespace URI mapped to prefix.

startElement

public void startElement(String namespaceURI,
                         String localName,
                         String qName,
                         Attributes atts)
Prints the start element, using the qualified name, and sorting the attributes using the underlying string ordering. The namespace URI and local name parameters are ignored, and the qualified name must not be null.

Specified by:
startElement in interface ContentHandler
Overrides:
startElement in class DefaultHandler
Parameters:
namespaceURI - The URI of the namespace for this element.
localName - The local name (without prefix) for this element.
qName - The qualified name (with prefix, if any) for this element.
atts - The attributes for this element.
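
The attribute sorting and the flushing of buffered prefix mappings can be sketched together. This is a hedged illustration, not the writer's implementation: StartTagSketch and startTag are hypothetical names, sorting the namespace declarations in with the ordinary attributes is an assumption, and attribute-value escaping is left out to keep the sketch short.

```java
import java.util.Collections;
import java.util.Map;
import java.util.TreeMap;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration class; not part of com.aliasi.xml.
final class StartTagSketch {

    // Renders a start tag from the qualified name, the SAX attributes, and
    // any prefix-to-URI mappings buffered since the previous element.
    static String startTag(String qName, Attributes atts,
                           Map<String, String> pendingPrefixes) {
        // TreeMap orders keys by String.compareTo, i.e. the string ordering.
        TreeMap<String, String> sorted = new TreeMap<>();
        for (Map.Entry<String, String> e : pendingPrefixes.entrySet()) {
            String name = e.getKey().isEmpty() ? "xmlns" : "xmlns:" + e.getKey();
            sorted.put(name, e.getValue());
        }
        for (int i = 0; i < atts.getLength(); i++) {
            sorted.put(atts.getQName(i), atts.getValue(i));
        }
        StringBuilder sb = new StringBuilder("<").append(qName);
        for (Map.Entry<String, String> e : sorted.entrySet()) {
            // Values are written as-is; real output would need escaping.
            sb.append(' ').append(e.getKey())
              .append("=\"").append(e.getValue()).append('"');
        }
        return sb.append('>').toString();
    }

    public static void main(String[] args) {
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "z", "z", "CDATA", "1");
        atts.addAttribute("", "a", "a", "CDATA", "2");
        System.out.println(
            startTag("e", atts, Collections.singletonMap("p", "urn:p")));
        // prints <e a="2" xmlns:p="urn:p" z="1">
    }
}
```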

endElement

public void endElement(String namespaceURI,
                       String localName,
                       String qName)
Prints the end element, using the qualified name. The namespace URI and local name parameters are ignored, and the qualified name must not be null.

Specified by:
endElement in interface ContentHandler
Overrides:
endElement in class DefaultHandler
Parameters:
namespaceURI - The URI of the namespace for this element.
localName - The local name (without prefix) for this element.
qName - The qualified name (with prefix, if any) for this element.

characters

public void characters(char[] ch,
                       int start,
                       int length)
Prints the characters in the specified range.

Specified by:
characters in interface ContentHandler
Overrides:
characters in class DefaultHandler
Parameters:
ch - Character array from which to draw characters.
start - Index of first character to print.
length - Number of characters to print.

ignorableWhitespace

public void ignorableWhitespace(char[] ch,
                                int start,
                                int length)
Does not print ignorable whitespace.

Specified by:
ignorableWhitespace in interface ContentHandler
Overrides:
ignorableWhitespace in class DefaultHandler
Parameters:
ch - Character array from which to draw characters.
start - Index of first character to print.
length - Number of characters to print.

processingInstruction

public void processingInstruction(String target,
                                  String data)
Prints a representation of the processing instruction. This will be <?Target?> if there is no data, or <?Target Data?> if there is data.

Specified by:
processingInstruction in interface ContentHandler
Overrides:
processingInstruction in class DefaultHandler
Parameters:
target - Target of the instruction.
data - Value of the instruction, or null.

comment

public void comment(char[] cs,
                    int start,
                    int length)
Convenience method to write a slice of character data as a comment. This method delegates a new string created from the specified slice to comment(String); see that method's documentation for more information.

Exceptions match those thrown by String.String(char[],int,int).

Parameters:
cs - Underlying characters.
start - First character in sequence.
length - Number of characters in sequence.
Throws:
IndexOutOfBoundsException - if start and length are out of bounds.

comment

public void comment(String comment)
Write the specified string as a comment. The string is first sanitized by breaking any double hyphens ("--") with a space (producing "- -"). If the comment starts with a hyphen (-), a space is inserted before the comment (causing it to start with " -"). If the comment ends with a hyphen (-), a space is appended (causing it to end with "- ").

Comments are written between the comment delimiters for the begin (<!--) and end (-->) of a comment. No extra space is inserted after the opening hyphen or before the closing hyphen, and no extra line-breaks, etc. are inserted. The method characters(char[],int,int) may be used for inserting additional formatting, but beware that this adds whitespace to the current element's content, which is only ignored if there is a DTD specifying that no text content is allowed in the current element.

Parameters:
comment - Comment to write.
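
The sanitization steps described for comment(String) can be sketched as follows. CommentSketch, sanitize, and render are hypothetical names, and the loop is one possible way to break every run of hyphens; it is not taken from the writer's source.

```java
// Hypothetical illustration class; not part of com.aliasi.xml.
final class CommentSketch {

    // Breaks double hyphens and pads a leading or trailing hyphen with a
    // space so the text cannot collide with the comment delimiters.
    static String sanitize(String comment) {
        String s = comment;
        // Repeat until no "--" remains, so runs such as "---" are fully broken.
        while (s.contains("--")) {
            s = s.replace("--", "- -");
        }
        if (s.startsWith("-")) {
            s = " " + s;
        }
        if (s.endsWith("-")) {
            s = s + " ";
        }
        return s;
    }

    // Wraps the sanitized text in the comment delimiters, adding no other spacing.
    static String render(String comment) {
        return "<!--" + sanitize(comment) + "-->";
    }

    public static void main(String[] args) {
        System.out.println(render("a--b")); // <!--a- -b-->
        System.out.println(render("-x-"));  // <!-- -x- -->
    }
}
```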

charsetName

public String charsetName()
Returns the name of the character set being used by this writer.

Returns:
The character set for this writer.