com.aliasi.chunk
Class TokenShapeChunker

java.lang.Object
  extended by com.aliasi.chunk.TokenShapeChunker
All Implemented Interfaces:
Chunker

public class TokenShapeChunker
extends Object
implements Chunker

A TokenShapeChunker uses a named-entity TokenShapeDecoder and tokenizer factory to implement entity detection through the chunk.Chunker interface. A named-entity chunker is constructed from a tokenizer factory and decoder. The tokenizer factory creates the tokens that are sent to the decoder. The chunks have types derived from the named-entity types found.

The tokens and whitespaces returned by the tokenizer are concatenated to form the underlying text slice of the chunks returned by the chunker. Thus a tokenizer that removes or rewrites tokens, such as the stop list tokenizer or the Porter stemmer tokenizer, will produce a character slice that does not match the input. A whitespace-normalizing tokenizer filter can be used, for example, to produce normalized text as the basis of the chunks.
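The following is a minimal sketch of why the concatenated text can differ from the input. It does not use any LingPipe classes: SliceReconstructionSketch, split and rebuild are hypothetical names, and the way the stop words and their preceding whitespace are dropped is an assumed simplification, not the library's actual behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative only: none of these names are LingPipe classes.
final class SliceReconstructionSketch {

    // Splits text into alternating whitespace and token strings:
    // [ws0, tok0, ws1, tok1, ..., wsN]. The list always starts and ends
    // with a (possibly empty) whitespace string, so tokens sit at odd indices.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        int i = 0;
        while (true) {
            int wsStart = i;
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            parts.add(text.substring(wsStart, i));
            if (i == text.length()) {
                break;
            }
            int tokStart = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            parts.add(text.substring(tokStart, i));
        }
        return parts;
    }

    // Concatenates the surviving tokens and whitespaces. Tokens in stopWords
    // are dropped together with the whitespace before them (an assumption
    // made for this sketch). If normalizeWhitespace is true, every non-empty
    // whitespace run is replaced by a single space.
    static String rebuild(List<String> parts, Set<String> stopWords,
                          boolean normalizeWhitespace) {
        StringBuilder sb = new StringBuilder();
        for (int k = 1; k < parts.size(); k += 2) {
            String token = parts.get(k);
            if (stopWords.contains(token)) {
                continue;
            }
            sb.append(ws(parts.get(k - 1), normalizeWhitespace)).append(token);
        }
        // Trailing whitespace after the last token.
        sb.append(ws(parts.get(parts.size() - 1), normalizeWhitespace));
        return sb.toString();
    }

    private static String ws(String s, boolean normalize) {
        return (normalize && !s.isEmpty()) ? " " : s;
    }
}
```

With no filtering, rebuild reproduces the input exactly. Once a token is dropped or whitespace is normalized, the rebuilt text is shorter than the input, which is the situation in which the slice underlying the chunks no longer matches what was passed in.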

Since:
LingPipe 2.1
Version:
3.8
Author:
Mitzi Morris, Bob Carpenter

Method Summary
 Chunking chunk(char[] cs, int start, int end)
          Return the set of named-entity chunks derived from the underlying decoder over the tokenization of the specified character slice.
 Chunking chunk(CharSequence cSeq)
          Return the set of named-entity chunks derived from the underlying decoder over the tokenization of the specified character sequence.
 void setLog2Beam(double beamWidth)
          Sets the log (base 2) beam width for the decoder.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

chunk

public Chunking chunk(CharSequence cSeq)
Return the set of named-entity chunks derived from the underlying decoder over the tokenization of the specified character sequence.

For more information on return results, see chunk(char[],int,int).

Specified by:
chunk in interface Chunker
Parameters:
cSeq - Character sequence to chunk.
Returns:
The named-entity chunking of the specified character sequence.

chunk

public Chunking chunk(char[] cs,
                      int start,
                      int end)
Return the set of named-entity chunks derived from the underlying decoder over the tokenization of the specified character slice. Iterating over the returned set is guaranteed to return the chunks in their original textual order. As noted in the class documentation, a tokenizer factory may cause the underlying character slice for the chunks to differ from the slice provided as an argument.

Specified by:
chunk in interface Chunker
Parameters:
cs - Characters underlying slice.
start - Index of first character in slice.
end - Index of one past the last character in the slice.
Returns:
The chunking over the specified character slice.
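As a rough illustration of the two conventions described for this method, the sketch below shows a half-open character slice (start inclusive, end exclusive) and an ordered set that iterates spans in textual order. CharSliceSketch and its methods are hypothetical names, spans are modeled as plain int[] {start, end} pairs rather than LingPipe chunks, and ordering by start and then end is an assumption made for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: none of these names are LingPipe classes.
final class CharSliceSketch {

    // Returns the characters cs[start] .. cs[end - 1] as a string.
    // The slice length is end - start because end is one past the last character.
    static String slice(char[] cs, int start, int end) {
        if (start < 0 || end < start || end > cs.length) {
            throw new IndexOutOfBoundsException(
                "start=" + start + " end=" + end + " length=" + cs.length);
        }
        return new String(cs, start, end - start);
    }

    // Collects {start, end} pairs into a set whose iteration order follows
    // the text: earlier start first, then earlier end (assumed tie-break).
    static Set<int[]> inTextOrder(List<int[]> spans) {
        Comparator<int[]> byPosition =
            Comparator.<int[]>comparingInt(a -> a[0]).thenComparingInt(a -> a[1]);
        Set<int[]> ordered = new TreeSet<>(byPosition);
        ordered.addAll(spans);
        return ordered;
    }
}
```

For example, slice("xxJohn Smithxx".toCharArray(), 2, 12) yields "John Smith", a slice of 12 - 2 = 10 characters.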

setLog2Beam

public void setLog2Beam(double beamWidth)
Sets the log (base 2) beam width for the decoder. The beam is synchronous by token: any hypothesis whose log (base 2) probability is more than the beam width worse than that of the best hypothesis is removed from further consideration.

Parameters:
beamWidth - Width of beam.
Throws:
IllegalArgumentException - If the beam width is not positive.
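The pruning rule can be sketched for a single token position as follows. This is a hedged illustration rather than the decoder's actual code: BeamPruneSketch and prune are hypothetical names, and hypotheses are reduced to their log (base 2) probabilities.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not part of LingPipe.
final class BeamPruneSketch {

    // Keeps the hypotheses whose log2 probability is within beamWidth of the
    // best one, that is, those with best - p <= beamWidth. Anything more than
    // beamWidth worse than the best is dropped.
    static List<Double> prune(List<Double> log2Probs, double beamWidth) {
        if (!(beamWidth > 0.0)) {
            throw new IllegalArgumentException(
                "Beam width must be positive. Found beamWidth=" + beamWidth);
        }
        double best = Double.NEGATIVE_INFINITY;
        for (double p : log2Probs) {
            best = Math.max(best, p);
        }
        List<Double> kept = new ArrayList<>();
        for (double p : log2Probs) {
            if (best - p <= beamWidth) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

With a beam width of 10.0 and scores -1.0, -3.0 and -12.5, the last hypothesis is 11.5 worse than the best and is removed. Because the scores are base-2 logs, a width of 10.0 corresponds to a probability ratio of 2^10 = 1024 between the best hypothesis and the worst one retained.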