com.aliasi.sentences
Class MedlineSentenceModel

java.lang.Object
  extended by com.aliasi.sentences.AbstractSentenceModel
      extended by com.aliasi.sentences.HeuristicSentenceModel
          extended by com.aliasi.sentences.MedlineSentenceModel
All Implemented Interfaces:
SentenceModel, Serializable

public class MedlineSentenceModel
extends HeuristicSentenceModel
implements Serializable

A MedlineSentenceModel is a heuristic sentence model designed for operating over biomedical research abstracts as found in MEDLINE.

The MEDLINE model assumes that parentheses are balanced as defined in the class documentation for HeuristicSentenceModel. It also treats the final token as a sentence boundary, overriding any other checks. The final boundary is forced because many MEDLINE abstracts are truncated, and forcing it ensures that every token falls within a sentence in the result.

The sets required by the superclass constructor HeuristicSentenceModel.HeuristicSentenceModel(Set,Set,Set,boolean,boolean) determine which tokens are possible sentence stops, which are disallowed before stops, and which are disallowed as starts. These three sets are:

Possible Stops
  .
  ..
  !
  ?
Impossible Penultimates
  some scientific and publishing terms
  personal/professional titles/suffixes
  months, times
  corporate designators
  common abbreviations
  back quotes, commas
Impossible Sentence Starts
  possible stops (see above)
  close parens, brackets, braces
  ;
  :
  -
  --
  ---
  %
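
As a rough illustration of how the three sets interact, the following is a minimal standalone sketch and not the LingPipe implementation. The class name ThreeSetBoundarySketch, the method boundaries, and the entries "Dr" and "Fig" in the penultimate set are hypothetical. The sketch matches tokens exactly and leaves out parenthesis balancing, whitespace checks, and the possibleStart test, all of which the real model applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified illustration; not part of LingPipe.
final class ThreeSetBoundarySketch {

    static final Set<String> POSSIBLE_STOPS =
        new HashSet<>(Arrays.asList(".", "..", "!", "?"));

    // Assumed sample entries only; the real set is much larger.
    static final Set<String> IMPOSSIBLE_PENULTIMATES =
        new HashSet<>(Arrays.asList("Dr", "Fig", ","));

    static final Set<String> IMPOSSIBLE_STARTS =
        new HashSet<>(Arrays.asList(".", "..", "!", "?", ")", "]", "}",
                                    ";", ":", "-", "--", "---", "%"));

    // Returns the indices of tokens treated as sentence-final.
    static List<Integer> boundaries(String[] tokens) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i == tokens.length - 1) {
                result.add(i); // the final token is always a boundary
                break;
            }
            if (!POSSIBLE_STOPS.contains(tokens[i])) {
                continue; // not a candidate stop
            }
            if (i > 0 && IMPOSSIBLE_PENULTIMATES.contains(tokens[i - 1])) {
                continue; // e.g. the period after an abbreviation
            }
            if (IMPOSSIBLE_STARTS.contains(tokens[i + 1])) {
                continue; // the next token cannot begin a sentence
            }
            result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"See", "Fig", ".", "2", ".", "Dr", ".", "Smith", "agreed"};
        System.out.println(boundaries(tokens)); // prints [4, 8]
    }
}
```

In this sketch the periods after "Fig" and "Dr" are rejected by the penultimate check, and the last token is a boundary even though it is not a stop.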

This class overrides the default implementation of the possible start token method so that a sentence may start with any sequence of tokens, uninterrupted by whitespace, that contains an uppercase letter or a digit. This behavior is described with examples in the documentation of the implementing method: possibleStart(String[],String[],int,int).

Singleton Instance

The instance accessible through the static constant INSTANCE may be used anywhere a MEDLINE sentence model is needed.

Thread Safety

A MEDLINE sentence model is thread safe once it has been safely published.

Serialization

A MEDLINE sentence model may be serialized. The deserialized object will be the singleton instance.
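
One common way to make deserialization yield a singleton in Java is the readResolve hook. The sketch below shows that general pattern on a hypothetical class, SingletonModelSketch; it is an assumption for illustration and may differ from how MedlineSentenceModel achieves the same effect (unlike this sketch, the real class has a public constructor).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical class illustrating singleton-preserving serialization.
final class SingletonModelSketch implements Serializable {

    private static final long serialVersionUID = 1L;

    static final SingletonModelSketch INSTANCE = new SingletonModelSketch();

    private SingletonModelSketch() { }

    // Invoked by Java serialization after an object is read;
    // the freshly read copy is discarded in favor of the singleton.
    private Object readResolve() {
        return INSTANCE;
    }

    // Serializes the object to a byte array and reads it back.
    static Object roundTrip(Object in) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(in);
            }
            try (ObjectInputStream objIn = new ObjectInputStream(
                     new ByteArrayInputStream(bytes.toByteArray()))) {
                return objIn.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object copy = roundTrip(INSTANCE);
        System.out.println(copy == INSTANCE); // prints true
    }
}
```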

Since:
LingPipe2.1
Version:
3.9
Author:
Mitzi Morris, Bob Carpenter
See Also:
Serialized Form

Field Summary
static MedlineSentenceModel INSTANCE
          A single instance which may be used anywhere a MEDLINE sentence model is needed.
 
Constructor Summary
MedlineSentenceModel()
          Construct a MEDLINE sentence model.
 
Method Summary
protected  boolean possibleStart(String[] tokens, String[] whitespaces, int start, int end)
          Return true if the specified start index can be a sentence start in the specified array of tokens and whitespaces running up to the end token.
 
Methods inherited from class com.aliasi.sentences.HeuristicSentenceModel
balanceParens, boundaryIndices, forceFinalStop
 
Methods inherited from class com.aliasi.sentences.AbstractSentenceModel
boundaryIndices, boundaryIndices, verifyBounds, verifyTokensWhitespaces
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INSTANCE

public static final MedlineSentenceModel INSTANCE
A single instance which may be used anywhere a MEDLINE sentence model is needed.

Constructor Detail

MedlineSentenceModel

public MedlineSentenceModel()
Construct a MEDLINE sentence model.

Method Detail

possibleStart

protected boolean possibleStart(String[] tokens,
                                String[] whitespaces,
                                int start,
                                int end)
Return true if the specified start index can be a sentence start in the specified array of tokens and whitespaces running up to the end token.

For MEDLINE, this implementation returns true if the sequence of contiguous tokens starting with the specified token contains an uppercase or digit character. Each token is considered, beginning with the specified start token and continuing through all tokens that are not separated by non-empty whitespace, up to the token with the end index minus one. If any of the tokens contains an uppercase or digit character, then the result is true. Otherwise, the result is false.

For example, if the first token is "Therefore", then it can be a sentence start because it contains the uppercase letter "T". Similarly, the token "pH" can be a sentence start, as can "p53", because they contain the uppercase letter "H" and the digit "5" respectively. If the underlying sequence is " correlation. p-53 was ...", then the array of tokens and whitespaces is:

Index  Whitespace  Token
0      " "         correlation
1      ""          .
2      " "         p
3      ""          -
4      ""          53
5      " "         was
6      " "         ...
Tokenization of: " correlation. p-53 was ..."
Here, "p" is a valid sentence start token even though it is only a single lowercase character, because it is joined to the following tokens "-" and "53" with no intervening whitespace, and "53" contains digits. By way of contrast, the first token "and" in the sequence "and Foo" can't start a sentence: it contains no uppercase or digit character, and it is separated from the following token by non-empty whitespace. Recall that the whitespace with the same index as a token precedes the token.
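
The rule described above can be sketched as a standalone method. This is a minimal reading of the documentation, not the LingPipe source: the class name PossibleStartSketch and the helper containsUpperOrDigit are hypothetical, and the end index is treated as exclusive, following the "end index minus one" wording.

```java
// Hypothetical re-implementation of the documented rule; not LingPipe code.
final class PossibleStartSketch {

    // True if the token contains at least one uppercase letter or digit.
    static boolean containsUpperOrDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isUpperCase(c) || Character.isDigit(c)) {
                return true;
            }
        }
        return false;
    }

    // whitespaces[i] is the whitespace preceding tokens[i]; end is exclusive.
    static boolean possibleStart(String[] tokens, String[] whitespaces,
                                 int start, int end) {
        for (int i = start; i < end; i++) {
            // Non-empty whitespace before a later token ends the contiguous run.
            if (i > start && !whitespaces[i].isEmpty()) {
                return false;
            }
            if (containsUpperOrDigit(tokens[i])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] tokens = {"correlation", ".", "p", "-", "53", "was"};
        String[] whitespaces = {" ", "", " ", "", "", " ", ""};
        // "p", "-", "53" form one run, and "53" contains digits.
        System.out.println(possibleStart(tokens, whitespaces, 2, tokens.length)); // prints true

        String[] tokens2 = {"and", "Foo"};
        String[] whitespaces2 = {"", " ", ""};
        // "and" has no uppercase or digit and is followed by a space.
        System.out.println(possibleStart(tokens2, whitespaces2, 0, tokens2.length)); // prints false
    }
}
```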

Overrides:
possibleStart in class HeuristicSentenceModel
Parameters:
tokens - Array of tokens to check.
whitespaces - Array of whitespaces to check.
start - Index of first token to check.
end - Index one past the last token to check.