com.aliasi.chunk
Class IoTagChunkCodec

java.lang.Object
  extended by com.aliasi.chunk.IoTagChunkCodec
All Implemented Interfaces:
TagChunkCodec, Serializable

public class IoTagChunkCodec
extends Object
implements Serializable

The IoTagChunkCodec implements a chunk to tag coder/decoder based on the IO encoding scheme and a specified tokenizer factory.

Degenerate Encoding

Although this is a compact encoding in number of tags, it is degenerate in that it does not allow adjacent chunks of the same type. The isEncodable(Chunking) method reflects this behavior.

If consistency is not being enforced, two adjacent chunks of the same type will simply be run together as a single chunk.

IO Encoding

The basis of the IO encoding of a chunking is to break the chunking down into tokens. All tokens that are part of a chunk of type X are tagged as X and all tokens that are not part of an entity are tagged as O.

For instance, consider the following input string:

 John Jones Mary and Mr. J. J. Jones ran to Washington.
 012345678901234567890123456789012345678901234567890123
 0         1         2         3         4         5
and a chunking consisting of that string and the chunks:
 (0,10):PER, (11,15):PER, (24,35):PER, (43,53):LOC
Recall that each chunk is indexed by the position of its first character and one past the position of its last character. Note that the two person names "John Jones" and "Mary" are separate chunks of type PER (for persons), and the location chunk for "Washington" ends before the period.

If we have a tokenizer that breaks on whitespace and punctuation, we have tokens starting at + and continuing through the - signs.

 John Jones Mary and Mr. J. J. Jones ran to Washington.
 +--- +---- +--- +-- +-+ ++ ++ +---- +-- +- +---------+
In particular, note that the four periods form their own tokens, even though they are adjacent to characters in other tokens. Writing the tokens out in a column, we show the tags used by the IO encoding to the right:
 Token       Tag
 John        PER
 Jones       PER
 Mary        PER
 and         O
 Mr          O
 .           O
 J           PER
 .           PER
 J           PER
 .           PER
 Jones       PER
 ran         O
 to          O
 Washington  LOC
 .           O
Note that chunks may be any number of tokens long.
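
The encoding of the example can be sketched in plain Java without any LingPipe classes. This is an illustration only: the class IoEncodeSketch, the Span type, and the regular expression standing in for a tokenizer that breaks on whitespace and punctuation are assumptions made for the sketch, not part of this API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; none of these names are LingPipe classes.
class IoEncodeSketch {

    /** Character span [start, end) with a chunk type; a stand-in for a chunk. */
    static final class Span {
        final int start;
        final int end;
        final String type;
        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Assumed tokenizer: a run of word characters, or one punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    /** Returns {start, end} offsets for each token, with end exclusive. */
    static List<int[]> tokenize(String text) {
        List<int[]> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(new int[] { m.start(), m.end() });
        }
        return tokens;
    }

    /** Tags each token with the type of the chunk that contains it, or "O". */
    static List<String> encode(String text, List<Span> chunks) {
        List<String> tags = new ArrayList<>();
        for (int[] tok : tokenize(text)) {
            String tag = "O";
            for (Span c : chunks) {
                if (c.start <= tok[0] && tok[1] <= c.end) {
                    tag = c.type;
                    break;
                }
            }
            tags.add(tag);
        }
        return tags;
    }

    public static void main(String[] args) {
        String text = "John Jones Mary and Mr. J. J. Jones ran to Washington.";
        List<Span> chunks = Arrays.asList(
            new Span(0, 10, "PER"), new Span(11, 15, "PER"),
            new Span(24, 35, "PER"), new Span(43, 53, "LOC"));
        // prints [PER, PER, PER, O, O, O, PER, PER, PER, PER, PER, O, O, LOC, O]
        System.out.println(encode(text, chunks));
    }
}
```

The first three tags are all PER, so the boundary between "John Jones" and "Mary" is no longer visible in the tag sequence.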

Set of Tags

The tag set consists of the single tag O together with one tag X for each chunk type X.
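
As a rough illustration, the tag set for n chunk types has n + 1 members. The sketch below is a stand-alone approximation of that rule; the class name IoTagSetSketch and the choice of a TreeSet are assumptions, not the actual implementation of tagSet(Set).

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone sketch; not the LingPipe implementation.
class IoTagSetSketch {

    static final String OUT_TAG = "O";

    /** Returns a fresh set holding "O" plus one tag per chunk type. */
    static Set<String> tagSet(Set<String> chunkTypes) {
        // Copy the input so that callers may modify the result freely.
        Set<String> tags = new TreeSet<>(chunkTypes);
        tags.add(OUT_TAG);
        return tags;
    }

    public static void main(String[] args) {
        Set<String> types = new TreeSet<>(Arrays.asList("PER", "LOC"));
        System.out.println(tagSet(types)); // prints [LOC, O, PER]
    }
}
```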

Legal Tag Sequences

One nice property of the IO encoding is that all sequences of tags are legal.

Enforcing Tokenization Consistency

If the consistency flag is set on the constructor, attempts to encode chunkings or decode taggings that are inconsistent with the tokenizer will throw illegal argument exceptions.

In order for a tokenizer to be consistent with a chunking, the tokenization of the character sequence for the chunking must be such that every chunk start occurs at a token start and every chunk end occurs at a token end. The same rule applies when decoding a tagging: the chunking produced must satisfy the same condition.

For example, if a regular-expression based tokenizer that breaks on whitespace were used for the above example, the character sequence "Washington." is a token, including the final period. This conflicts with the location-type entity, which ends with the last character before the period.
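
The boundary condition can be checked with a few lines of plain Java. The sketch below is only an approximation of the rule stated above; the class ConsistencySketch and the two regular expressions standing in for tokenizers are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not the LingPipe implementation.
class ConsistencySketch {

    /** Returns {start, end} offsets of the tokens matched by the pattern. */
    static List<int[]> tokenize(Pattern tokenPattern, String text) {
        List<int[]> tokens = new ArrayList<>();
        Matcher m = tokenPattern.matcher(text);
        while (m.find()) {
            tokens.add(new int[] { m.start(), m.end() });
        }
        return tokens;
    }

    /** True if the chunk starts at some token start and ends at some token end. */
    static boolean isConsistent(List<int[]> tokens, int chunkStart, int chunkEnd) {
        boolean startOk = false;
        boolean endOk = false;
        for (int[] tok : tokens) {
            if (tok[0] == chunkStart) startOk = true;
            if (tok[1] == chunkEnd) endOk = true;
        }
        return startOk && endOk;
    }

    public static void main(String[] args) {
        String text = "John Jones Mary and Mr. J. J. Jones ran to Washington.";
        Pattern whitespaceOnly = Pattern.compile("\\S+");
        Pattern withPunctuation = Pattern.compile("\\w+|[^\\w\\s]");
        // The LOC chunk (43,53) stops before the final period.
        System.out.println(isConsistent(tokenize(whitespaceOnly, text), 43, 53));  // prints false
        System.out.println(isConsistent(tokenize(withPunctuation, text), 43, 53)); // prints true
    }
}
```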

Serialization

Instances of this class are serializable if their underlying tokenizer factories are serializable. Reading them back in produces an instance of the same class with the same behavior.

Since:
LingPipe3.9.1
Version:
3.9.1
Author:
Bob Carpenter
See Also:
Serialized Form

Constructor Summary
IoTagChunkCodec()
          Construct an IO-encoding based tag-chunk coder with a null tokenizer factory that does not enforce consistency.
IoTagChunkCodec(TokenizerFactory tokenizerFactory, boolean enforceConsistency)
          Construct an IO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of chunkings and taggings if the specified flag is set.
 
Method Summary
 boolean enforceConsistency()
          Returns true if this codec enforces consistency of the chunkings relative to the tokenizer factory.
 boolean isDecodable(StringTagging tagging)
          Returns true if the specified tagging may be consistently decoded into a chunking.
 boolean isEncodable(Chunking chunking)
          Returns true if the specified chunking may be consistently encoded as a tagging.
 boolean legalTags(String... tags)
          Returns true if the specified sequence of tags is a complete legal tag sequence.
 boolean legalTagSubSequence(String... tags)
          Returns true if the specified sequence of tags is a legal subsequence of tags.
 Iterator<Chunk> nBestChunks(TagLattice<String> lattice, int[] tokenStarts, int[] tokenEnds, int maxResults)
          Returns an iterator over chunks extracted in order of highest probability up to the specified maximum number of results.
 Set<String> tagSet(Set<String> chunkTypes)
          Returns the complete set of tags used by this codec for the specified set of chunk types.
 Chunking toChunking(StringTagging tagging)
          Return the result of decoding the specified tagging into a chunking.
 TokenizerFactory tokenizerFactory()
          Return the tokenizer factory for this codec.
 String toString()
          Return a string-based representation of this codec.
 StringTagging toStringTagging(Chunking chunking)
          Return the string tagging that fully encodes the specified chunking.
 Tagging<String> toTagging(Chunking chunking)
          Return the tagging that partially encodes the specified chunking.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

IoTagChunkCodec

public IoTagChunkCodec()
Construct an IO-encoding based tag-chunk coder with a null tokenizer factory that does not enforce consistency. A codec constructed with this method only supports the conversion of a string tagging to a chunking, not vice-versa.


IoTagChunkCodec

public IoTagChunkCodec(TokenizerFactory tokenizerFactory,
                       boolean enforceConsistency)
Construct an IO-encoding based tag-chunk coder/decoder based on the specified tokenizer factory, enforcing consistency of chunkings and taggings if the specified flag is set.

Parameters:
tokenizerFactory - Tokenizer factory for generating tokens.
enforceConsistency - Set to true to ensure all coded chunkings and decoded taggings are consistent for round trips.

Method Detail

tagSet

public Set<String> tagSet(Set<String> chunkTypes)
Description copied from interface: TagChunkCodec
Returns the complete set of tags used by this codec for the specified set of chunk types.

Modifying the returned set will not affect the codec.

Specified by:
tagSet in interface TagChunkCodec
Parameters:
chunkTypes - Set of types for chunks.
Returns:
Set of all tags used to encode chunks of types in the specified set.

legalTagSubSequence

public boolean legalTagSubSequence(String... tags)
Description copied from interface: TagChunkCodec
Returns true if the specified sequence of tags is a legal subsequence of tags. See the companion method TagChunkCodec.legalTags(String[]) to test if a complete sequence is legal.

A sequence of tags is a legal subsequence if a legal sequence may be created by adding more tags to the front and/or end of the specified sequence.

Providing an empty sequence of tags always returns true. The result for a single input tag determines if the tag itself is legal. For longer sequences, the tags must all be legal and their order must be legal.

Specified by:
legalTagSubSequence in interface TagChunkCodec
Parameters:
tags - Sequence of tags to test.
Returns:
true if the sequence of tags is legal as a subsequence of some larger sequence.

legalTags

public boolean legalTags(String... tags)
Description copied from interface: TagChunkCodec
Returns true if the specified sequence of tags is a complete legal tag sequence. The companion method TagChunkCodec.legalTagSubSequence(String[]) tests if a subsequence of tags is legal.

Specified by:
legalTags in interface TagChunkCodec
Parameters:
tags - Variable length array of tags.
Returns:
true if the specified sequence of tags is a complete legal tag sequence.

toChunking

public Chunking toChunking(StringTagging tagging)
Description copied from interface: TagChunkCodec
Return the result of decoding the specified tagging into a chunking.

Specified by:
toChunking in interface TagChunkCodec
Parameters:
tagging - Tagging to decode.
Returns:
Chunking resulting from tagging.
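
For the IO scheme, decoding amounts to merging each maximal run of identical non-O tags into one chunk. The following sketch shows that idea in plain Java; the class IoDecodeSketch, the string rendering of chunks, and the parallel arrays of token offsets are assumptions for illustration and do not reproduce this method's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not the LingPipe implementation.
class IoDecodeSketch {

    /** Merges each maximal run of identical non-"O" tags into one "(start,end):TYPE" chunk. */
    static List<String> decode(List<String> tags, int[] tokenStarts, int[] tokenEnds) {
        List<String> chunks = new ArrayList<>();
        int i = 0;
        while (i < tags.size()) {
            String tag = tags.get(i);
            if (tag.equals("O")) {
                i++;
                continue;
            }
            int first = i;
            // Extend the run while the next token carries the same tag.
            while (i + 1 < tags.size() && tags.get(i + 1).equals(tag)) {
                i++;
            }
            chunks.add("(" + tokenStarts[first] + "," + tokenEnds[i] + "):" + tag);
            i++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // First five tokens of the class example: John Jones Mary and Mr
        List<String> tags = Arrays.asList("PER", "PER", "PER", "O", "O");
        int[] starts = { 0, 5, 11, 16, 20 };
        int[] ends = { 4, 10, 15, 19, 22 };
        System.out.println(decode(tags, starts, ends)); // prints [(0,15):PER]
    }
}
```

The chunks (0,10):PER and (11,15):PER from the class example come back as the single chunk (0,15):PER, which is the degeneracy described in the class documentation.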

toStringTagging

public StringTagging toStringTagging(Chunking chunking)
Description copied from interface: TagChunkCodec
Return the string tagging that fully encodes the specified chunking.

Specified by:
toStringTagging in interface TagChunkCodec
Parameters:
chunking - Chunking to encode.
Returns:
Tagging that encodes the chunking.
Throws:
UnsupportedOperationException - If the tokenizer factory is null.

toTagging

public Tagging<String> toTagging(Chunking chunking)
Description copied from interface: TagChunkCodec
Return the tagging that partially encodes the specified chunking. This method does not return the underlying character sequence or token positions -- that functionality is available from the method TagChunkCodec.toStringTagging(Chunking).

This method will typically be more efficient than toStringTagging(). Implementations may nevertheless delegate to TagChunkCodec.toStringTagging(Chunking) and return its result directly, because StringTagging extends Tagging<String>.

Specified by:
toTagging in interface TagChunkCodec
Parameters:
chunking - Chunking to encode.
Returns:
Tagging that encodes the chunking.
Throws:
UnsupportedOperationException - If the tokenizer factory is null.

nBestChunks

public Iterator<Chunk> nBestChunks(TagLattice<String> lattice,
                                   int[] tokenStarts,
                                   int[] tokenEnds,
                                   int maxResults)
Description copied from interface: TagChunkCodec
Returns an iterator over chunks extracted in order of highest probability up to the specified maximum number of results.

Specified by:
nBestChunks in interface TagChunkCodec
Parameters:
lattice - Lattice from which chunks are extracted.
maxResults - Maximum number of chunks to return.
Returns:
Iterator over the chunks in the lattice in order from highest to lowest probability.

toString

public String toString()
Return a string-based representation of this codec.

Overrides:
toString in class Object
Returns:
A string-based representation of this codec.

enforceConsistency

public boolean enforceConsistency()
Returns true if this codec enforces consistency of the chunkings relative to the tokenizer factory. Consistency requires each chunk to start on the first character of a token and requires each chunk to end on the last character of a token (as usual, ends are one past the last character).

Returns:
true if this codec enforces consistency of chunkings relative to tokenization.

tokenizerFactory

public TokenizerFactory tokenizerFactory()
Return the tokenizer factory for this codec. The tokenizer factory may be null if this was only constructed as a decoder.

Returns:
The underlying tokenizer factory.

isEncodable

public boolean isEncodable(Chunking chunking)
Returns true if the specified chunking may be consistently encoded as a tagging. A chunking is encodable if none of the chunks overlap, and if all chunks begin on the first character of a token and end on the character one past the end of the last character in a token.

Subclasses may enforce further conditions as defined in their class documentation.

Specified by:
isEncodable in interface TagChunkCodec
Parameters:
chunking - Chunking to test.
Returns:
true if the chunking is consistently encodable.
Throws:
UnsupportedOperationException - If the tokenizer is null so that this is only a decoder.
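
The non-overlap half of the encodability condition can be sketched independently of any tokenizer. The class OverlapSketch and the representation of chunks as {start, end} pairs are assumptions for illustration; the token-boundary half of the condition is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not the LingPipe implementation.
class OverlapSketch {

    /** True if no two [start, end) spans share a character position. */
    static boolean nonOverlapping(List<int[]> spans) {
        List<int[]> sorted = new ArrayList<>(spans);
        sorted.sort((a, b) -> Integer.compare(a[0], b[0]));
        // After sorting by start, only neighbouring spans need to be compared.
        for (int i = 1; i < sorted.size(); i++) {
            if (sorted.get(i)[0] < sorted.get(i - 1)[1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<int[]> example = Arrays.asList(
            new int[] { 0, 10 }, new int[] { 11, 15 },
            new int[] { 24, 35 }, new int[] { 43, 53 });
        System.out.println(nonOverlapping(example)); // prints true
        List<int[]> clash = Arrays.asList(new int[] { 0, 10 }, new int[] { 5, 15 });
        System.out.println(nonOverlapping(clash)); // prints false
    }
}
```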

isDecodable

public boolean isDecodable(StringTagging tagging)
Returns true if the specified tagging may be consistently decoded into a chunking. A tagging is decodable if its tokens are the tokens produced by the tokenizer for this codec and if the tags form a legal sequence.

Specified by:
isDecodable in interface TagChunkCodec
Parameters:
tagging - Tagging to test for decodability.
Returns:
true if decoding then encoding produces the specified tagging.
Throws:
UnsupportedOperationException - If the tokenizer is null so that this is only a decoder.