com.aliasi.corpus
Class ChunkHandlerAdapter

java.lang.Object
  extended by com.aliasi.corpus.ChunkHandlerAdapter
All Implemented Interfaces:
Handler, ObjectHandler<Chunking>

Deprecated. Use TagChunkCodecAdapters.taggingToChunking(TagChunkCodec,ObjectHandler) instead.

@Deprecated
public class ChunkHandlerAdapter
extends Object
implements ObjectHandler<Chunking>

A ChunkHandlerAdapter converts a BIO-coded tag handler to a chunk handler. The adapter handles chunkings by tokenizing their character sequences and then using their chunk sets to produce tags in the begin-in-out (BIO) tagging scheme. For an adapter from a chunk handler to a BIO-coded tag handler, see the sister class ChunkTagHandlerAdapter.

The BIO tagging scheme marks each token as either beginning a chunk (B), continuing a chunk (I), or not in a chunk (O). For example, consider the following string (with character indices annotated below it):

 John J. Smith lives in Washington.
 0123456789012345678901234567890123
 0         1         2         3
 
with a chunk of type PERSON spanning from character 0 (inclusive) to 13 (exclusive) and a chunk of type LOCATION spanning from 23 to 33. With the standard tokenizer IndoEuropeanTokenizerFactory providing tokenization, the tokens, whitespaces and their associated BIO tags are:
 Index  Whitespace  Token       Tag
 0      ""          John        B-PERSON
 1      " "         J           I-PERSON
 2      ""          .           I-PERSON
 3      " "         Smith       I-PERSON
 4      " "         lives       O
 5      " "         in          O
 6      " "         Washington  B-LOCATION
 7      ""          .           O
 8      ""          n/a         n/a

As usual, the whitespace with the same index as a token occurs before it. Thus the first token and the two periods in the input have no space before them, but all other tokens do. Note also that there is one additional whitespace following the last token. The tag B-PERSON is assigned to the first token of the PERSON chunk, with the subsequent tokens being assigned I-PERSON; likewise, B-LOCATION is assigned to the single token of the LOCATION chunk. The "out" tag O is assigned to each token that is not part of a chunk, including the final period.
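
The tagging in the table above can be reproduced with a minimal, self-contained sketch. It does not use LingPipe: the class BioSketch, its nested Span class, and the regular-expression tokenizer are hypothetical stand-ins, and the tokenizer only approximates IndoEuropeanTokenizerFactory (runs of letters, runs of digits, or single non-whitespace characters).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BioSketch {

    // Hypothetical stand-in for a chunk: a type plus a [start, end) character span.
    static final class Span {
        final String type;
        final int start;
        final int end;

        Span(String type, int start, int end) {
            this.type = type;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed simplified tokenizer, not the real IndoEuropeanTokenizerFactory:
    // a run of letters, a run of digits, or one other non-whitespace character.
    static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|[0-9]+|\\S");

    // Returns one BIO tag per token of the text.
    static String[] toTags(String text, List<Span> chunks) {
        List<String> tags = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String tag = "O"; // default: token lies outside every chunk
            for (Span c : chunks) {
                if (m.start() >= c.start && m.end() <= c.end) {
                    // The token starting exactly at the chunk start gets B-, the rest get I-.
                    tag = (m.start() == c.start ? "B-" : "I-") + c.type;
                    break;
                }
            }
            tags.add(tag);
        }
        return tags.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String text = "John J. Smith lives in Washington.";
        List<Span> chunks = Arrays.asList(
            new Span("PERSON", 0, 13),
            new Span("LOCATION", 23, 33));
        // prints: B-PERSON I-PERSON I-PERSON I-PERSON O O B-LOCATION O
        System.out.println(String.join(" ", toTags(text, chunks)));
    }
}
```

The sketch assumes chunks do not overlap and that each one is aligned with token boundaries; a token only partly covered by a chunk is simply tagged O here.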

In order for this adaptation to be faithful, the chunks must be consistent with the tokenizer. Specifically, each chunk must start on the first character of a token and end on the last character of a token. If the person chunk ended at character 14 (exclusive) to include the space after the token Smith, it would no longer be consistent with the tokenizer. The adapter may be configured, either in the constructor or through the method setValidateTokenizer(boolean), to throw an exception when asked to handle a chunking inconsistent with its tokenizer. The static method consistentTokens(String[],String[],TokenizerFactory) is also provided to test whether a given set of tokens and whitespaces is consistent with a tokenizer factory.
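
The boundary condition on chunks can be illustrated with a small sketch that is independent of LingPipe. The class ChunkBoundarySketch, the method alignsWithTokens, and the regular-expression tokenizer are hypothetical and only approximate the real tokenizer; how the adapter itself performs validation is not shown in this documentation.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ChunkBoundarySketch {

    // Assumed simplified tokenizer: letter runs, digit runs, or single non-whitespace characters.
    static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|[0-9]+|\\S");

    // True if the span [chunkStart, chunkEnd) starts at the first character of some
    // token and ends just past the last character of some token.
    static boolean alignsWithTokens(String text, int chunkStart, int chunkEnd) {
        if (chunkStart >= chunkEnd) {
            return false; // empty or inverted spans are never aligned
        }
        Set<Integer> starts = new HashSet<>();
        Set<Integer> ends = new HashSet<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            starts.add(m.start());
            ends.add(m.end());
        }
        return starts.contains(chunkStart) && ends.contains(chunkEnd);
    }

    public static void main(String[] args) {
        String text = "John J. Smith lives in Washington.";
        System.out.println(alignsWithTokens(text, 0, 13)); // prints: true
        System.out.println(alignsWithTokens(text, 0, 14)); // prints: false (includes the space after Smith)
    }
}
```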

Since:
LingPipe 2.1
Version:
3.9
Author:
Bob Carpenter

Constructor Summary
ChunkHandlerAdapter(TagHandler tagHandler, TokenizerFactory tokenizerFactory, boolean validateTokenizer)
          Deprecated. See class documentation.
ChunkHandlerAdapter(TokenizerFactory tokenizerFactory, boolean validateTokenizer)
          Deprecated. Construct a chunk handler based on the specified tokenizer factory and an initially null tag handler.
 
Method Summary
static boolean consistentTokens(String[] toks, String[] whitespaces, TokenizerFactory tokenizerFactory)
          Deprecated. Returns true if the specified tokens and whitespaces are consistent with the specified tokenizer factory.
 void handle(Chunking chunking)
          Deprecated. Handle the specified chunking by converting it to a tagging using the BIO scheme and contained tokenizer, then delegating to the contained tag handler.
 void setTagHandler(TagHandler tagHandler)
          Deprecated. See class documentation.
 void setValidateTokenizer(boolean validateTokenizer)
          Deprecated. Sets the tokenizer validation status to the specified value.
static String[] toTags(Chunking chunking, TokenizerFactory factory)
          Deprecated. Returns the array of tags for the specified chunking, relative to the specified tokenizer factory.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ChunkHandlerAdapter

@Deprecated
public ChunkHandlerAdapter(TagHandler tagHandler,
                                      TokenizerFactory tokenizerFactory,
                                      boolean validateTokenizer)
Deprecated. See class documentation.

Create a chunk handler based on the specified tag handler and tokenizer factory. The tag handler may be reset later using setTagHandler(TagHandler). The chunks handled by this handler will be converted to BIO-encoded tag sequences.

Parameters:
tagHandler - Tag handler.
tokenizerFactory - Tokenizer factory.
validateTokenizer - Whether or not to validate tokenizer.

ChunkHandlerAdapter

public ChunkHandlerAdapter(TokenizerFactory tokenizerFactory,
                           boolean validateTokenizer)
Deprecated. 
Construct a chunk handler based on the specified tokenizer factory and an initially null tag handler. The tag handler may be reset later using setTagHandler(TagHandler).

Parameters:
tokenizerFactory - Tokenizer factory.
validateTokenizer - Whether or not to validate tokenizer.

Method Detail

setTagHandler

@Deprecated
public void setTagHandler(TagHandler tagHandler)
Deprecated. See class documentation.

Set the tag handler to the specified value.

Parameters:
tagHandler - New tag handler for this class.

setValidateTokenizer

public void setValidateTokenizer(boolean validateTokenizer)
Deprecated. 
Sets the tokenizer validation status to the specified value. If the value is set to true, then every chunking is tested for whether or not it is consistent with the specified tokenizer for this handler.

Parameters:
validateTokenizer - Whether or not to validate tokenizer.

handle

public void handle(Chunking chunking)
Deprecated. 
Handle the specified chunking by converting it to a tagging using the BIO scheme and contained tokenizer, then delegating to the contained tag handler.

Specified by:
handle in interface ObjectHandler<Chunking>
Parameters:
chunking - Chunking to handle.
Throws:
IllegalArgumentException - If tokenizer consistency is being validated and the tokenization is not consistent with the specified chunking.

toTags

public static String[] toTags(Chunking chunking,
                              TokenizerFactory factory)
Deprecated. 
Returns the array of tags for the specified chunking, relative to the specified tokenizer factory.

Parameters:
chunking - Chunking to convert to tags.
factory - Tokenizer factory for token generation.

consistentTokens

public static boolean consistentTokens(String[] toks,
                                       String[] whitespaces,
                                       TokenizerFactory tokenizerFactory)
Deprecated. 
Returns true if the specified tokens and whitespaces are consistent with the specified tokenizer factory. A tokenizer is consistent with the specified tokens and whitespaces if running the tokenizer over the concatenation of the tokens and whitespaces produces the same tokens and whitespaces.

Parameters:
toks - Tokens to check.
whitespaces - Whitespaces to check.
tokenizerFactory - Factory to create tokenizers.
Returns:
true if the tokenizer is consistent with the tokens and whitespaces.
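
The round-trip check described for consistentTokens can be sketched without LingPipe as follows. The class ConsistentTokensSketch and the regular-expression tokenizer are hypothetical stand-ins for a TokenizerFactory, and the requirement that the whitespace array holds exactly one more element than the token array is an assumption drawn from the table in the class description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ConsistentTokensSketch {

    // Assumed simplified tokenizer: letter runs, digit runs, or single non-whitespace characters.
    static final Pattern TOKEN = Pattern.compile("[A-Za-z]+|[0-9]+|\\S");

    // Concatenates whitespaces and tokens, re-tokenizes the result, and checks
    // that the same tokens and whitespaces come back.
    static boolean consistent(String[] toks, String[] whitespaces) {
        if (whitespaces.length != toks.length + 1) {
            return false; // assumption: one whitespace before each token plus one at the end
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < toks.length; i++) {
            sb.append(whitespaces[i]).append(toks[i]);
        }
        sb.append(whitespaces[toks.length]);
        String text = sb.toString();

        List<String> newToks = new ArrayList<>();
        List<String> newWhitespaces = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        int prevEnd = 0;
        while (m.find()) {
            newWhitespaces.add(text.substring(prevEnd, m.start())); // gap before this token
            newToks.add(m.group());
            prevEnd = m.end();
        }
        newWhitespaces.add(text.substring(prevEnd)); // trailing whitespace
        return newToks.equals(Arrays.asList(toks))
            && newWhitespaces.equals(Arrays.asList(whitespaces));
    }

    public static void main(String[] args) {
        String[] toks = {"John", "J", ".", "Smith"};
        String[] whitespaces = {"", " ", "", " ", ""};
        System.out.println(consistent(toks, whitespaces)); // prints: true
        // "J." is split into two tokens by this tokenizer, so one token "J." is inconsistent.
        System.out.println(consistent(new String[] {"J."}, new String[] {"", ""})); // prints: false
    }
}
```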