com.aliasi.sentences
Class SentenceAnnotateFilter
java.lang.Object
org.xml.sax.helpers.DefaultHandler
com.aliasi.xml.SimpleElementHandler
com.aliasi.xml.SAXFilterHandler
com.aliasi.xml.ElementStackFilter
com.aliasi.xml.TextContentFilter
com.aliasi.sentences.SentenceAnnotateFilter
- All Implemented Interfaces:
- ContentHandler, DTDHandler, EntityResolver, ErrorHandler
public class SentenceAnnotateFilter
- extends TextContentFilter
A SentenceAnnotateFilter applies sentence-boundary
annotation to the text content of the specified elements. An
instance is constructed with a sentence model and a tokenizer
factory. Optionally, an array of elements to annotate may be
provided; if no array is specified, all text content is annotated.
The element sent is used to wrap sentences. If the
filtered element contains only whitespace, it is not annotated.
There will be no whitespace characters at the start or end of a
sentence element's text content. All inter-sentence whitespace is
retained, but included between sentence elements in the filtered
element's content. For instance, the input
<p> A b.  C d. </p>
will yield
<p> <sent>A b.</sent>  <sent>C d.</sent> </p>
Note that the text of a sentence element starts with the first
character of the first token and ends with the last character of
the last token. Inter-sentential whitespace winds up as text
content outside of the sentence. In this case, there is a single
space before the first sentence, two spaces between the
sentences, and a single space after the second sentence.
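The whitespace handling described above can be illustrated without LingPipe. The following is a minimal sketch, not the filter's implementation: the class SentWrapSketch, its annotate method, and the regular expression standing in for a sentence model and tokenizer factory are all hypothetical. It operates on a plain string rather than on SAX events and leaves any trailing text that lacks sentence-final punctuation unwrapped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of LingPipe.
final class SentWrapSketch {

    // Naive stand-in for a sentence model: a sentence starts at a
    // non-whitespace character and ends at the first '.', '!' or '?'
    // that is followed by whitespace or the end of the input.
    private static final Pattern SENTENCE =
        Pattern.compile("\\S.*?[.!?](?=\\s|$)", Pattern.DOTALL);

    // Wraps each detected sentence in <sent>...</sent>, copying all
    // whitespace outside the sentences through unchanged.
    static String annotate(String text) {
        if (text.trim().isEmpty()) {
            return text; // whitespace-only content is not annotated
        }
        StringBuilder out = new StringBuilder();
        Matcher m = SENTENCE.matcher(text);
        int pos = 0;
        while (m.find()) {
            out.append(text, pos, m.start()); // inter-sentence whitespace
            out.append("<sent>").append(m.group()).append("</sent>");
            pos = m.end();
        }
        out.append(text.substring(pos)); // trailing whitespace or leftover text
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + annotate(" A b.  C d. ") + "]");
    }
}
```

For the text content of the example above, annotate(" A b.  C d. ") returns " <sent>A b.</sent>  <sent>C d.</sent> ", so the single leading space, the two spaces between sentences, and the single trailing space all remain outside the sent elements.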
- Since:
- LingPipe1.0
- Version:
- 1.0.3
- Author:
- Bob Carpenter
Field Summary
static String SENTENCE_ELEMENT
Element used to group sentences in sentence annotation, namely "sent".
Method Summary
void characters(char[] cs, int start, int length)
Annotates the characters if all text content is being annotated or if the
current position is inside an annotated element; otherwise passes the
characters directly to the contained handler.
void filteredCharacters(char[] cs, int start, int length)
Performs sentence-boundary annotation of the specified characters.
Methods inherited from class com.aliasi.xml.SAXFilterHandler
endDocument, endPrefixMapping, error, fatalError, ignorableWhitespace, notationDecl, processingInstruction, resolveEntity, setDocumentLocator, setHandler, skippedEntity, startPrefixMapping, unparsedEntityDecl, warning
Methods inherited from class com.aliasi.xml.SimpleElementHandler
addSimpleAttribute, characters, characters, characters, characters, createAttributes, createAttributes, createAttributes, createAttributes, createAttributes, createAttributes, endSimpleElement, endSimpleElement, startEndSimpleElement, startEndSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement, startSimpleElement
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SENTENCE_ELEMENT
public static final String SENTENCE_ELEMENT
- Element used to group sentences in sentence annotation,
namely "sent".
- See Also:
- Constant Field Values
SentenceAnnotateFilter
public SentenceAnnotateFilter(SentenceModel sentenceModel,
TokenizerFactory tokenizerFactory)
- Constructs a sentence annotation filter with the specified
sentence model and tokenizer factory.
- Parameters:
sentenceModel - Sentence model to use for boundary detection.
tokenizerFactory - Factory to produce tokenizers for text.
SentenceAnnotateFilter
public SentenceAnnotateFilter(SentenceModel sentenceModel,
TokenizerFactory tokenizerFactory,
String[] elements)
- Constructs a sentence annotation filter with the specified
sentence model and tokenizer factory, and elements whose
text content should be annotated.
- Parameters:
sentenceModel - Sentence model to use for boundary detection.
tokenizerFactory - Factory to produce tokenizers for text.
elements - Array of names of the elements whose text content is to be annotated.
characters
public void characters(char[] cs,
int start,
int length)
throws SAXException
- Annotates the characters if all text content is being annotated
or if the current position is inside an annotated element; otherwise
passes the characters directly to the contained handler. All boundary
events will be passed to the contained handler.
- Specified by:
characters in interface ContentHandler
- Overrides:
characters in class TextContentFilter
- Parameters:
cs - Character array to filter.
start - First character to filter.
length - Number of characters to filter.
- Throws:
SAXException - If there is an exception thrown by the
contained handler.
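The three-way decision described for characters can be sketched with the JDK's SAX classes alone. This is a hedged illustration, not LingPipe's code: the class DispatchSketch and its log field are hypothetical, a null element array stands in for "annotate all text content", and the sketch assumes that text counts as being in an annotated element when any currently open element is listed (the documentation above does not say whether only the innermost element is checked). Instead of forwarding events, it records which path each run of text would take.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration only; not part of LingPipe.
class DispatchSketch extends DefaultHandler {

    // Names of elements to annotate; null means annotate all text.
    private final Set<String> annotated;
    // Names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();
    // Records the routing decision made for each run of characters.
    final StringBuilder log = new StringBuilder();

    DispatchSketch(String[] elements) {
        annotated = (elements == null)
            ? null
            : new HashSet<>(Arrays.asList(elements));
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.pop();
    }

    @Override
    public void characters(char[] cs, int start, int length) {
        String text = new String(cs, start, length);
        boolean annotate = (annotated == null) || insideAnnotated();
        // A real filter would call its annotation routine or forward the
        // characters to the contained handler here; the sketch just logs.
        log.append(annotate ? "[filtered:" : "[passed:")
           .append(text).append("]");
    }

    // Assumption: any open ancestor listed in 'annotated' counts.
    private boolean insideAnnotated() {
        for (String name : open) {
            if (annotated.contains(name)) {
                return true;
            }
        }
        return false;
    }
}
```

With new DispatchSketch(new String[] {"p"}), text inside a title element is logged as passed and text inside a p element is logged as filtered; with a null array, all text is logged as filtered.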
filteredCharacters
public void filteredCharacters(char[] cs,
int start,
int length)
throws SAXException
- Performs sentence-boundary annotation of the specified
characters. Markup and text SAX events are delegated to the
contained handler.
- Specified by:
filteredCharacters in class TextContentFilter
- Parameters:
cs - Character array to annotate.
start - First character to annotate.
length - Number of characters to annotate.
- Throws:
SAXException - If there is an exception thrown by the
contained handler.