com.aliasi.sentences
Class HeuristicSentenceModel

java.lang.Object
  extended by com.aliasi.sentences.AbstractSentenceModel
      extended by com.aliasi.sentences.HeuristicSentenceModel
All Implemented Interfaces:
SentenceModel
Direct Known Subclasses:
IndoEuropeanSentenceModel, MedlineSentenceModel

public class HeuristicSentenceModel
extends AbstractSentenceModel

A HeuristicSentenceModel determines sentence boundaries based on sets of tokens, a pair of flags, and an overridable method describing boundary conditions.

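There are three sets of tokens specified for a heuristic model:

Possible Stops - Tokens on which a sentence may stop.
Impossible Penultimates - Tokens that may not precede a stop.
Impossible Starts - Tokens that may not follow a stop.

Note that all of these sets perform case-insensitive tests.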

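There are also two flags in the constructor that determine aspects of sentence boundary detection:

Force Final Stop - Whether the final token in the input is treated as a stop even if it is not in the set of possible stops.
Balance Parentheses - Whether the model does parenthesis balancing.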

A further condition is imposed on sentence-initial tokens by the method possibleStart(String[],String[],int,int). This method checks a given token in a sequence of tokens and whitespaces to determine whether it is a possible sentence start. The default implementation in this class rules out tokens that start with lowercase letters.

The final condition is that a token cannot be a stop unless it is followed by non-empty whitespace.

The resulting model will miss a boundary when a token serves both as a sentence boundary and as the end-of-abbreviation marker of a known abbreviation. It will add a spurious sentence boundary after an unknown abbreviation that is followed by whitespace and a capitalized word.
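
The conditions above can be sketched in plain Java. This is a simplified stand-alone illustration, not LingPipe's implementation: the class and method names are hypothetical, the two constructor flags and the input-final token are not handled, the token sets are assumed to hold lowercase entries, and whitespaces[i + 1] is assumed to be the whitespace following tokens[i].

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the heuristics; not part of LingPipe.
class HeuristicBoundarySketch {

    // Returns indices of input-internal tokens accepted as sentence-final.
    static List<Integer> boundaries(String[] tokens, String[] whitespaces,
                                    Set<String> possibleStops,
                                    Set<String> impossiblePenultimate,
                                    Set<String> impossibleStarts) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            // The token itself must be a possible stop (case insensitive).
            if (!possibleStops.contains(lower(tokens[i]))) continue;
            // A stop must be followed by non-empty whitespace.
            if (whitespaces[i + 1].isEmpty()) continue;
            // The token before the stop must not be an impossible penultimate.
            if (i > 0 && impossiblePenultimate.contains(lower(tokens[i - 1]))) continue;
            String next = tokens[i + 1];
            // The following token must not be an impossible start.
            if (impossibleStarts.contains(lower(next))) continue;
            // The following token must be non-empty and not start in lowercase.
            if (next.isEmpty() || Character.isLowerCase(next.charAt(0))) continue;
            result.add(i);
        }
        return result;
    }

    private static String lower(String s) {
        return s.toLowerCase(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        // Tokens and whitespaces for the text "Dr. Smith left. He ran."
        String[] tokens = {"Dr", ".", "Smith", "left", ".", "He", "ran", "."};
        String[] whitespaces = {"", "", " ", " ", "", " ", " ", "", ""};
        List<Integer> result = boundaries(tokens, whitespaces,
                Collections.singleton("."),   // possible stops
                Collections.singleton("dr"),  // impossible penultimates
                Collections.singleton(")"));  // impossible starts
        System.out.println(result); // prints [4]
    }
}
```

In this sketch the period after "Dr" is rejected because "dr" is listed as an impossible penultimate, which mirrors how a known abbreviation suppresses a boundary.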

Our approach is loosely based on the article:

Mikheev, Andrei. 2002. Periods, Capitalized Words, etc. Computational Linguistics 28(3):289-318.

Since:
LingPipe 1.0
Version:
3.8
Author:
Mitzi Morris, Bob Carpenter

Constructor Summary
HeuristicSentenceModel(Set<String> possibleStops, Set<String> impossiblePenultimate, Set<String> impossibleStarts)
          Constructs a capitalization-sensitive heuristic sentence model with the specified set of possible stop tokens, impossible penultimate tokens, and impossible sentence start tokens.
HeuristicSentenceModel(Set<String> possibleStops, Set<String> impossiblePenultimate, Set<String> impossibleStarts, boolean forceFinalStop, boolean balanceParens)
          Constructs a heuristic sentence model with the specified sets of possible stop tokens, impossible penultimate tokens, impossible start tokens, and flags for whether the final token is forced to be a stop, and whether parentheses are balanced.
 
Method Summary
 boolean balanceParens()
          Returns true if this model does parenthesis balancing.
 void boundaryIndices(String[] tokens, String[] whitespaces, int start, int length, Collection<Integer> indices)
          Adds the sentence-final token indices as Integer instances to the specified collection, only considering tokens between index start and start+length-1 inclusive.
 boolean forceFinalStop()
          Returns true if this model treats any input-final token as a stop.
protected  boolean possibleStart(String[] tokens, String[] whitespaces, int start, int end)
          Returns true if the specified start index can be a sentence start in the specified array of tokens and whitespaces running up to the end token.
 
Methods inherited from class com.aliasi.sentences.AbstractSentenceModel
boundaryIndices, boundaryIndices, verifyBounds, verifyTokensWhitespaces
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HeuristicSentenceModel

public HeuristicSentenceModel(Set<String> possibleStops,
                              Set<String> impossiblePenultimate,
                              Set<String> impossibleStarts)
Constructs a capitalization-sensitive heuristic sentence model with the specified set of possible stop tokens, impossible penultimate tokens, and impossible sentence start tokens. Note that these sets are case insensitive. This constructor sets the balance-parentheses and force-final-stop flags to false.

Parameters:
possibleStops - Possible tokens on which to stop a sentence.
impossiblePenultimate - Tokens that may not precede a stop.
impossibleStarts - Tokens that may not follow a stop.

HeuristicSentenceModel

public HeuristicSentenceModel(Set<String> possibleStops,
                              Set<String> impossiblePenultimate,
                              Set<String> impossibleStarts,
                              boolean forceFinalStop,
                              boolean balanceParens)
Constructs a heuristic sentence model with the specified sets of possible stop tokens, impossible penultimate tokens, impossible start tokens, and flags for whether the final token is forced to be a stop, and whether parentheses are balanced. Note that the token sets are case insensitive.

Parameters:
possibleStops - Possible tokens on which to stop a sentence.
impossiblePenultimate - Tokens that may not precede a stop.
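impossibleStarts - Tokens that may not follow a stop.
forceFinalStop - Whether the final token in the input is forced to be a stop.
balanceParens - Whether parentheses are balanced.

Method Detail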

forceFinalStop

public boolean forceFinalStop()
Returns true if this model treats any input-final token as a stop. This ensures that, in truncated inputs, every token either is a sentence boundary or is followed by one. For instance, suppose the input is the array of tokens {"a", "b", ".", "c", "d"}. If "d" is not in the set of possible stops and the flag is false, the tokens "c" and "d" will not be assigned to a sentence. If the flag is true, then "d", being final in the input, is taken to end a sentence.

The value is set in the constructor HeuristicSentenceModel(Set,Set,Set,boolean,boolean). See the class documentation for more information.

Returns:
true if any token may be a stop if it is final in the input.
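
The example with the tokens {"a", "b", ".", "c", "d"} can be reproduced with a small stand-alone sketch. The class and method below are hypothetical and model only possible-stop membership and the force-final-stop flag; the other heuristics of this class are deliberately left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the force-final-stop flag; not part of LingPipe.
class ForceFinalStopSketch {

    // Returns indices of tokens treated as sentence-final.
    static List<Integer> boundaries(String[] tokens,
                                    Set<String> possibleStops,
                                    boolean forceFinalStop) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            boolean isFinal = (i == tokens.length - 1);
            // A token ends a sentence if it is a possible stop, or if it is
            // final in the input and the flag forces a final stop.
            if (possibleStops.contains(tokens[i]) || (forceFinalStop && isFinal)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"a", "b", ".", "c", "d"};
        Set<String> stops = Collections.singleton(".");
        System.out.println(boundaries(tokens, stops, false)); // prints [2]
        System.out.println(boundaries(tokens, stops, true));  // prints [2, 4]
    }
}
```

With the flag off, "c" and "d" follow the last boundary and belong to no sentence; with the flag on, index 4 closes a second sentence.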

balanceParens

public boolean balanceParens()
Returns true if this model does parenthesis balancing. Note that the value is set in the constructor HeuristicSentenceModel(Set,Set,Set,boolean,boolean). See the class documentation for more information.

Returns:
true if this model does parenthesis balancing.

boundaryIndices

public void boundaryIndices(String[] tokens,
                            String[] whitespaces,
                            int start,
                            int length,
                            Collection<Integer> indices)
Adds the sentence-final token indices as Integer instances to the specified collection, only considering tokens between index start and start+length-1 inclusive.

Specified by:
boundaryIndices in interface SentenceModel
Specified by:
boundaryIndices in class AbstractSentenceModel
Parameters:
tokens - Array of tokens to annotate.
whitespaces - Array of whitespaces to annotate.
start - Index of first token to annotate.
length - Number of tokens to annotate.
indices - Collection into which to write the boundary indices.

possibleStart

protected boolean possibleStart(String[] tokens,
                                String[] whitespaces,
                                int start,
                                int end)
Returns true if the specified start index can be a sentence start in the specified array of tokens and whitespaces running up to the end token.

The implementation in this class requires the first token to be non-empty and have a first character that is not lower case according to Character.isLowerCase(char).

The start and end indices should be within range for the tokens and whitespaces as a precondition to this method being called. For a precise definition, see AbstractSentenceModel.verifyBounds(String[],String[],int,int). All calls from the abstract sentence model obey this constraint.
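
The default check described above amounts to a single test on the first character. The following is a minimal stand-alone sketch with a hypothetical class name and a reduced parameter list; it is not the library source, and it assumes the start index is already within range.

```java
// Hypothetical sketch of the default possible-start test; not part of LingPipe.
class PossibleStartSketch {

    // True if tokens[start] is non-empty and does not begin with a
    // lowercase character according to Character.isLowerCase(char).
    static boolean possibleStart(String[] tokens, int start) {
        String first = tokens[start];
        return !first.isEmpty() && !Character.isLowerCase(first.charAt(0));
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "The", "3", "(", ""};
        for (int i = 0; i < tokens.length; i++) {
            System.out.println("\"" + tokens[i] + "\" -> " + possibleStart(tokens, i));
        }
    }
}
```

Note that the test is "not lowercase" rather than "uppercase", so tokens beginning with digits or punctuation are accepted as possible starts.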

Parameters:
tokens - Array of tokens to check.
whitespaces - Array of whitespaces to check.
start - Index of first token to check.
end - Index of last token to check.