com.aliasi.sentences
Class MedlineSentenceModel
java.lang.Object
com.aliasi.sentences.AbstractSentenceModel
com.aliasi.sentences.HeuristicSentenceModel
com.aliasi.sentences.MedlineSentenceModel
- All Implemented Interfaces:
- SentenceModel
public class MedlineSentenceModel
- extends HeuristicSentenceModel
A MedlineSentenceModel is a heuristic sentence model
designed for operating over biomedical research abstracts as found
in MEDLINE.
The MEDLINE model assumes that parentheses are balanced, as
defined in the class documentation for HeuristicSentenceModel. It also
treats the final token as a sentence boundary, overriding any other
checks. This option is set because many MEDLINE abstracts are
truncated, and it ensures that every token falls within a sentence
in the result.
The sets required by the superclass constructor HeuristicSentenceModel.HeuristicSentenceModel(Set,Set,Set,boolean,boolean)
determine which tokens are possible sentence stops, which are
disallowed before stops, and which are disallowed as starts. These
three sets are:
| Impossible Penultimates | Impossible Sentence Starts |
|---|---|
| some scientific and publishing terms | possible stops (see above) |
| personal/professional titles/suffixes | close parens, brackets, braces |
| months, times | ; |
| corporate designators | : |
| common abbreviations | - |
| back quotes, commas | -- |
| | --- |
| | % |
This class overrides the default implementation of the possible
start token method so that a sentence may start with any sequence of
tokens uninterrupted by whitespace that contains an uppercase letter
or digit character. This behavior is described with examples in the
documentation of the implementing method: possibleStart(String[],String[],int,int).
- Since:
- LingPipe2.1
- Version:
- 2.1
- Author:
- Mitzi Morris
Method Summary
protected boolean possibleStart(String[] tokens, String[] whitespaces, int start, int end)
- Return true if the specified start index can
be a sentence start in the specified array of tokens and
whitespaces running up to the end token.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
MedlineSentenceModel
public MedlineSentenceModel()
- Construct a MEDLINE sentence model.
possibleStart
protected boolean possibleStart(String[] tokens,
String[] whitespaces,
int start,
int end)
- Return true if the specified start index can
be a sentence start in the specified array of tokens and
whitespaces running up to the end token.
For MEDLINE, this implementation returns true
if the sequence of contiguous tokens starting with the
specified token contains an uppercase letter or digit character.
Tokens are examined in order, beginning with the specified start
token and continuing through all tokens that are not separated by
non-empty whitespace, up to the token with index end minus
one. If any examined token contains an uppercase letter or digit
character, the result is true; otherwise, the result is false.
For example, if the first token is "Therefore", then
it can be a sentence start because it contains the uppercase
letter "T". Similarly, the token "pH" can be a sentence start,
as can "p53", because they contain the uppercase letter "H"
and the digit "5", respectively. If the underlying sequence is
" correlation. p-53 was ...", then the array of tokens
and whitespaces is:
| Index | Whitespace | Token |
|---|---|---|
| 0 | " " | correlation |
| 1 | "" | . |
| 2 | " " | p |
| 3 | "" | - |
| 4 | "" | 53 |
| 5 | " " | was |
| 6 | " " | ... |

Tokenization of: " correlation. p-53 was ..."
Here, "p" is a valid sentence start token even though
it is only a single lowercase character, because it is followed,
with no intervening whitespace, by a hyphen (-) and then by the
token "53", which contains digits.
By way of contrast, the first token
"and" in the sequence "and
Foo" can't start a sentence, because it contains no uppercase
letter or digit and is separated from the following token by
non-empty whitespace, so the "F" in "Foo" is never considered.
Recall that the whitespace with
the same index as a token precedes the token.
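The rule and examples above can be approximated in a few lines of standalone Java. This is a minimal sketch of the behavior as documented here, not LingPipe's actual source: the class name MedlineStartSketch and the helper containsUpperOrDigit are hypothetical, and the sketch ignores any checks the superclass may apply, such as the set of impossible sentence starts.

```java
// Hypothetical standalone sketch; not part of LingPipe.
final class MedlineStartSketch {

    // True if the token contains at least one uppercase letter or digit.
    static boolean containsUpperOrDigit(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isUpperCase(c) || Character.isDigit(c)) {
                return true;
            }
        }
        return false;
    }

    // Assumes whitespaces[i] is the whitespace preceding tokens[i],
    // and that end is one past the last token to examine.
    static boolean possibleStart(String[] tokens, String[] whitespaces,
                                 int start, int end) {
        for (int i = start; i < end; i++) {
            if (containsUpperOrDigit(tokens[i])) {
                return true;
            }
            // Stop scanning once non-empty whitespace follows token i.
            if (i + 1 < whitespaces.length && !whitespaces[i + 1].isEmpty()) {
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // First six tokens of the tokenization table above.
        String[] tokens = {"correlation", ".", "p", "-", "53", "was"};
        String[] whitespaces = {" ", "", " ", "", "", " ", " "};

        // "p" is contiguous with "-" and "53", and "53" has digits.
        System.out.println(possibleStart(tokens, whitespaces, 2, 6)); // true

        // "and" is followed by a space before "Foo".
        String[] tokens2 = {"and", "Foo"};
        String[] whitespaces2 = {"", " ", ""};
        System.out.println(possibleStart(tokens2, whitespaces2, 0, 2)); // false
        System.out.println(possibleStart(tokens2, whitespaces2, 1, 2)); // true
    }
}
```

The sketch returns as soon as it finds a qualifying token, so a lowercase token such as "p" only depends on the tokens glued to it without whitespace.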
- Overrides:
possibleStart in class HeuristicSentenceModel
- Parameters:
tokens - Array of tokens to check.
whitespaces - Array of whitespaces to check.
start - Index of first token to check.
end - Index of last token to check.