By Mitzi Morris

What is Sentence Detection?

This tutorial shows how to segment a text into its constituent sentences using a LingPipe SentenceModel, and how to evaluate and tune sentence models.

It uses MEDLINE data as the example data. MEDLINE is a collection of more than 13 million citations to the bio-medical literature maintained by the United States National Library of Medicine (NLM), and is distributed in XML format. The MEDLINE Parsing and Indexing Demo covers how to parse this data from XML into a structured Java object.

The first part of this tutorial shows how to segment a text into its constituent sentences using a LingPipe SentenceModel. The second part shows how to use the LingPipe SentenceEvaluator together with a corpus of correctly annotated data (a gold standard) to determine the accuracy of a model. Finally, we discuss the existing sentence models in the API, and ways to tune them.

Using Sentence Models

The SentenceModel Interface

The LingPipe com.aliasi.sentences.SentenceModel interface specifies a means of doing sentence segmentation from arrays of tokens and whitespaces, namely the boundaryIndices method, which takes an array of tokens, and an array of whitespaces, and returns an array of indices of sentence-final tokens.
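
For reference, here is a simplified sketch of the relevant part of the interface (the full contract is in the SentenceModel javadoc); the method shown is the one invoked later in this demo:

// Simplified sketch only; see the javadoc for the complete interface.
public interface SentenceModel {

    // Returns the indices into tokens[] of the sentence-final tokens.
    int[] boundaryIndices(String[] tokens, String[] whitespaces);

}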

The SentenceBoundaryDemo.java program shows how to use a sentence model to find sentence boundaries in a text. It takes an input file of plain text. It first processes the file into lists of tokens and whitespace, and then uses the MEDLINE sentence model to find the sentence boundaries. To run this from the command line, type the following on one line (if using Windows, replace the colon ":" with a semicolon ";"):

java
-cp "sentence-demo.jar:../../../lingpipe-4.1.0.jar"
SentenceBoundaryDemo ../../data/sentence_demo.txt

This tutorial also comes with an Ant build.xml file which defines targets used to run all of the demo programs. To run the SentenceBoundaryDemo program execute the Ant target findbounds:

> ant findbounds

which produces the following output (with the [java] tags inserted by Ant removed for clarity):

findbounds:
 INPUT TEXT:
  The induction of immediate-early (IE) response genes, such as egr-1,
  c-fos, and c-jun, occurs rapidly after the activation of T
  lymphocytes. The process of activation involves calcium mobilization,
  activation of protein kinase C (PKC), and phosphorylation of tyrosine
  kinases. p21(ras), a guanine nucleotide binding factor, mediates
  T-cell signal transduction through PKC-dependent and PKC-independent
  pathways. The involvement of p21(ras) in the regulation of
  calcium-dependent signals has been suggested through analysis of its
  role in the activation of NF-AT. We have investigated the inductions
  of the IE genes in response to calcium signals in Jurkat cells (in
  the presence of activated p21(ras)) and their correlated
  consequences.

 150 TOKENS
 151 WHITESPACES
 5 SENTENCE END TOKEN OFFSETS
 SENTENCE 1:
 The induction of immediate-early (IE) response genes, such as egr-1,
  c-fos, and c-jun, occurs rapidly after the activation of T
  lymphocytes.
 SENTENCE 2:
 The process of activation involves calcium mobilization,
  activation of protein kinase C (PKC), and phosphorylation of tyrosine
  kinases.
 SENTENCE 3:
 p21(ras), a guanine nucleotide binding factor, mediates
  T-cell signal transduction through PKC-dependent and PKC-independent
  pathways.
 SENTENCE 4:
 The involvement of p21(ras) in the regulation of
  calcium-dependent signals has been suggested through analysis of its
  role in the activation of NF-AT.
 SENTENCE 5:
 We have investigated the inductions
  of the IE genes in response to calcium signals in Jurkat cells (in
  the presence of activated p21(ras)) and their correlated
  consequences.

The inputs to the SentenceModel method boundaryIndices are an array of tokens and an array of whitespaces. Therefore we must first process the text into token and whitespace arrays, then identify sentence boundaries. The SentenceBoundaryDemo.java program uses the class com.aliasi.tokenizer.IndoEuropeanTokenizerFactory to provide a tokenizer, and a com.aliasi.sentences.MedlineSentenceModel to do the sentence boundary detection:

static final TokenizerFactory TOKENIZER_FACTORY
    = IndoEuropeanTokenizerFactory.INSTANCE;
static final SentenceModel SENTENCE_MODEL
    = new MedlineSentenceModel();

The TokenizerFactory method tokenizer returns a com.aliasi.tokenizer.Tokenizer. The tokenize method parses the text into tokens and whitespaces, adding them to their respective lists:

List<String> tokenList = new ArrayList<String>();
List<String> whiteList = new ArrayList<String>();
Tokenizer tokenizer
    = TOKENIZER_FACTORY.tokenizer(text.toCharArray(),
                                  0,text.length());
tokenizer.tokenize(tokenList,whiteList);

The tokenList and whiteList produced by the tokenizer are parallel lists: the whitespace at index [i] is the whitespace that precedes the token at index [i]. The whitespace list includes both the whitespace preceding the first token and the whitespace following the last token, so it always contains one more element than the token list. Therefore in the above example the whitespace list contains 151 elements, while the token list contains 150 elements.
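
To make the alignment concrete, here is a small illustration (not part of the demo) using the same tokenizer factory on a toy string; the expected contents of the two lists are shown in the comments:

// Toy illustration of the parallel token and whitespace lists.
String toy = "He ran. She hid.";
List<String> toyTokens = new ArrayList<String>();
List<String> toyWhites = new ArrayList<String>();
TOKENIZER_FACTORY.tokenizer(toy.toCharArray(),0,toy.length())
                 .tokenize(toyTokens,toyWhites);
// toyTokens: [He, ran, ., She, hid, .]        (6 tokens)
// toyWhites: ["", " ", "", " ", " ", "", ""]  (7 whitespaces)
// toyWhites.get(i) precedes toyTokens.get(i); the final element is
// the whitespace following the last token.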

We convert the ArrayList objects into their corresponding String arrays, and then invoke the boundaryIndices method:

String[] tokens = new String[tokenList.size()];
String[] whites = new String[whiteList.size()];
tokenList.toArray(tokens);
whiteList.toArray(whites);
int[] sentenceBoundaries
    = SENTENCE_MODEL.boundaryIndices(tokens,whites);

The boundaryIndices method returns an array whose values are the indices of the elements in the tokens array which are sentence-final tokens. To extract the sentences we iterate through the sentence boundaries array, keeping track of the indices of the sentence start and end tokens, and printing out the correct elements from the tokens and whitespaces arrays. Here is the code to print out the sentences found in the abstract, one per line:

int sentStartTok = 0;
int sentEndTok = 0;
for (int i = 0; i < sentenceBoundaries.length; ++i) {
    sentEndTok = sentenceBoundaries[i];
    System.out.println("SENTENCE "+(i+1)+": ");
    for (int j=sentStartTok; j <= sentEndTok; j++) {
        System.out.print(tokens[j]+whites[j+1]);
    }
    System.out.println();
    sentStartTok = sentEndTok+1;
}

The above code block prints every token in the tokens array, and the whitespace following that token. Because line breaks count as whitespace, the individual sentences show the same pattern of spacing and linebreaks as in the input text.

Chunkings and Chunkers

In this section we show how to simplify the task of dealing with sentences and sentence boundaries, by rewriting the SentenceBoundaryDemo to use a com.aliasi.sentences.SentenceChunker.

The rewritten program is SentenceChunkerDemo.java. To run this program execute the Ant target findchunks as before, which produces:

> ant findchunks

findchunks:
 INPUT TEXT:
  The induction of immediate-early (IE) response genes, such as egr-1,
  c-fos, and c-jun, occurs rapidly after the activation of T
  lymphocytes. The process of activation involves calcium mobilization,
  activation of protein kinase C (PKC), and phosphorylation of tyrosine
  kinases. p21(ras), a guanine nucleotide binding factor, mediates
  T-cell signal transduction through PKC-dependent and PKC-independent
  pathways. The involvement of p21(ras) in the regulation of
  calcium-dependent signals has been suggested through analysis of its
  role in the activation of NF-AT. We have investigated the inductions
  of the IE genes in response to calcium signals in Jurkat cells (in
  the presence of activated p21(ras)) and their correlated
  consequences.

 SENTENCE 1:
 The induction of immediate-early (IE) response genes, such as egr-1,
  c-fos, and c-jun, occurs rapidly after the activation of T
  lymphocytes.
 SENTENCE 2:
 The process of activation involves calcium mobilization,
  activation of protein kinase C (PKC), and phosphorylation of tyrosine
  kinases.
 SENTENCE 3:
 p21(ras), a guanine nucleotide binding factor, mediates
  T-cell signal transduction through PKC-dependent and PKC-independent
  pathways.
 SENTENCE 4:
 The involvement of p21(ras) in the regulation of
  calcium-dependent signals has been suggested through analysis of its
  role in the activation of NF-AT.
 SENTENCE 5:
 We have investigated the inductions
  of the IE genes in response to calcium signals in Jurkat cells (in
  the presence of activated p21(ras)) and their correlated
  consequences.

The above output is almost identical to that of SentenceBoundaryDemo except that there is no tokenization information. This is because the SentenceChunker handles tokenization.

A SentenceChunker is constructed from a TokenizerFactory and a SentenceModel:

static final TokenizerFactory TOKENIZER_FACTORY
    = IndoEuropeanTokenizerFactory.INSTANCE;
static final SentenceModel SENTENCE_MODEL
    = new MedlineSentenceModel();
static final SentenceChunker SENTENCE_CHUNKER
    = new SentenceChunker(TOKENIZER_FACTORY,
                          SENTENCE_MODEL);

The SentenceChunker method chunk produces a com.aliasi.chunk.Chunking over the text. A Chunking is a set of com.aliasi.chunk.Chunk objects over a shared CharSequence. The chunkSet method returns the set of (sentence) chunks, and the charSequence method returns the underlying character sequence.

Chunking chunking
    = SENTENCE_CHUNKER.chunk(text.toCharArray(),
                             0,text.length());
Set<Chunk> sentences = chunking.chunkSet();
String slice = chunking.charSequence().toString();

We use the start and end index information from each chunk to print the text of the sentence in the abstract:

int i = 1;
for (Chunk sentence : sentences) {
    int start = sentence.start();
    int end = sentence.end();
    System.out.println("SENTENCE "+(i++)+":");
    System.out.println(slice.substring(start,end));
}

Evaluating Sentence Models

In this section we show how to evaluate a sentence model.

To evaluate a sentence model, we need a reference corpus of text which has a set of sentence boundary markers. For MEDLINE data, we can use the GENIA XML corpus as the gold standard. The GENIA XML corpus is a set of 2000 MEDLINE abstracts which have been annotated for sentence boundaries and biomedical terms ("cons" elements). Here is a sample abstract from this corpus (prettified with whitespace):

<abstract>
<sentence>

<cons lex="Toremifene"
sem="G#other_organic_compound">Toremifene</cons> exerts
multiple and varied effects on the <cons lex="gene_expression"
sem="G#other_name">gene expression</cons> of <cons
lex="human_peripheral_mononuclear_cell" sem="G#cell_type">human
peripheral mononuclear cells</cons>.
</sentence>
<sentence>
After short-term, in <cons
lex="vitro_exposure" sem="G#other_name">vitro exposure</cons>
to <cons lex="therapeutical_level"
sem="G#other_name">therapeutical levels</cons>, distinct
changes in <cons lex="(AND P-glycoprotein_expression
steroid_receptors_expression p53_expression Bcl-2_expression)"
sem="(AND G#other_name G#other_name G#other_name
G#other_name)"><cons
lex="P-glycoprotein*">P-glycoprotein</cons>, <cons
lex="steroid_receptor*">steroid receptors</cons>, <cons
lex="p53*">p53</cons> and <cons
lex="Bcl-2*">Bcl-2</cons> <cons
lex="*expression">expression</cons></cons> take
place.
</sentence>
<sentence>
In view of the increasing use of <cons
lex="antiestrogen" sem="G#lipid">antiestrogens</cons> in
<cons lex="cancer_therapy" sem="G#other_name">cancer
therapy</cons> and <cons lex="cancer_prevention"
sem="G#other_name">prevention</cons>, there is obvious merit
in <cons lex="long-term_in_vivo_study"
sem="G#other_name">long-term in vivo studies</cons> to be
conducted.
</sentence>
</abstract>

In order to run parts 2 and 3 of this tutorial, the GENIA corpus must be downloaded directly from the GENIA project website; the corpus file GENIAcorpus3.02.xml and its DTD gpml.dtd should then be placed in the lingpipe/demos/data directory (see the note on the evaluate target below).

To use this corpus in an evaluation, we parse each abstract into a com.aliasi.chunk.Chunking object. The LingPipe API provides a SAX parser with this functionality: com.aliasi.corpus.parsers.GeniaSentenceParser. The Chunking.charSequence holds the plain text of the abstract, and the Chunking.chunkSet holds the sentence boundary information.

We use this as the reference chunking against which to evaluate the performance of a sentence model, using a SentenceChunker as in the SentenceChunkerDemo above. First we create a SentenceChunker for the sentence model we wish to evaluate, then invoke its chunk method on the text of the abstract (the charSequence of the reference chunking). This gives us a response chunking.

To evaluate the response chunking against the reference chunking we compare the members of the respective chunkSet objects, that is, we compare the set of sentences that we know to be in the abstract with the set of sentences found by the sentence model, using a 4-way classification: true positives (sentences in both the reference and the response), false negatives (sentences in the reference that are missing from the response), false positives (sentences in the response that are not in the reference), and true negatives (sentences in neither set, always zero for this task).

The LingPipe API provides a com.aliasi.sentences.SentenceEvaluator which creates this evaluation for us. From the javadoc:

A SentenceEvaluator handles reference chunkings by constructing a response chunking and adding them to a sentence evaluation. The resulting evaluation may be retrieved through the method evaluation() at any time.

This evaluator class implements the ObjectHandler<Chunking> interface. The chunkings passed to the handle(Chunking) method are treated as reference chunkings. Their character sequence is extracted using Chunking#charSequence() and the contained sentence chunker is used to produce a response chunking over the character sequence. The resulting pair of chunkings is passed to the contained sentence evaluation.

Running the evaluation is straightforward: we create a GeniaSentenceParser instance and a SentenceEvaluator instance, and then set the SentenceEvaluator as the handler for the GeniaSentenceParser. The GeniaSentenceParser parses each abstract into a reference chunking, and then invokes the handle(Chunking) method of the SentenceEvaluator. The SentenceEvaluator creates the response chunking from the reference chunking, and adds the pair of reference and response chunkings to the evaluation, so that parsing and evaluation are carried out in tandem. The SentenceEvaluator object contains a com.aliasi.sentences.SentenceEvaluation object, which contains all the evaluation cases (the pairs of reference and response chunkings) and the evaluation metrics, which are updated as each new case is added to the evaluation.

The SentenceEvaluation contains a com.aliasi.chunk.ChunkingEvaluation object, which evaluates the sentences qua chunkings. The SentenceEvaluation also evaluates the sentence model solely in terms of the sentence end boundaries. As we saw in the first part of this tutorial, the SentenceModel doesn't identify sentence initial tokens, only the sentence-final tokens. Implicit in this model is the assumption that all tokens belong to a sentence, therefore once we have found the end token in a sentence, we know that the start token of the next sentence must be the following token. Evaluations which score chunking errors and evaluations which score sentence end boundary errors yield different counts of the errors made by the sentence model. Consider the case where the sentence model fails to identify a sentence boundary in a sentence:

ref:  ------------X-------------+------
text: See Spot run. Run spot run. (...)
resp: --------------------------+------
pos:            11111111112222222222
pos:  012345678901234567890123456789

The reference chunking will contain two Chunk objects, with start and end values of (0,13) and (14,27) respectively. The response chunking will contain one Chunk object, with start and end values (0,27). The ChunkingEvaluation will add the two reference chunks to the set of false negatives, and the one response chunk to the set of false positives. The SentenceEvaluation will instead compare sets of end boundaries. The reference chunking end boundaries set contains the values 13 and 27, while the response chunking contains only 27, therefore the SentenceEvaluation counts the missed sentence boundary at position 13 as a single false negative. This approach to counting errors has two advantages: the statistics returned by counting only end boundary errors are better, since the overall number of false positives and false negatives is lower; and the sets of false positives and negatives contain only examples where the sentence-final boundary was incorrect. This latter point is relevant for the developer who is building or tuning the sentence model and will be covered in detail in the third part of this tutorial.
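
To make the example concrete, here is a small sketch (not part of the demo) that builds the reference and response chunkings from the diagram above using com.aliasi.chunk.ChunkingImpl and com.aliasi.chunk.ChunkFactory:

// Reference chunking: two sentences, end boundaries at 13 and 27.
String text = "See Spot run. Run spot run.";
ChunkingImpl reference = new ChunkingImpl(text);
reference.add(ChunkFactory.createChunk(0,13));   // "See Spot run."
reference.add(ChunkFactory.createChunk(14,27));  // "Run spot run."

// Response chunking: the boundary at 13 was missed, so a single
// chunk spans both sentences.
ChunkingImpl response = new ChunkingImpl(text);
response.add(ChunkFactory.createChunk(0,27));

The chunking evaluation compares the two chunk sets directly (two false negatives, one false positive), while the end boundary evaluation compares only the end offsets {13, 27} and {27} (one false negative).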

The SentenceModelEvaluator.java program shows how to construct and run an evaluator, and report the results of the evaluation. This program runs the GeniaSentenceParser over the GENIA XML corpus, and prints out the result of the evaluation.

To run this program execute the Ant target evaluate.

> ant evaluate

evaluate:
 Chunking Evaluation statistics
   Total=16623
   True Positive=16350
   False Negative=134
   False Positive=139
   True Negative=0
   Positive Reference=16484
   Positive Response=16489
   Negative Reference=139
   Negative Response=134
   Accuracy=0.9835769716657643
   Recall=0.9918709051201164
   Precision=0.9915701376675359
   Rejection Recall=0.0
   Rejection Precision=0.0
   F(1)=0.9917204985897552
     (...)

 Sentence Evaluation end boundary statistics
   Total=16542
   True Positive=16431
   False Negative=53
   False Positive=58
   True Negative=0
   Positive Reference=16484
   Positive Response=16489
   Negative Reference=58
   Negative Response=53
   Accuracy=0.9932898077620602
   Recall=0.9967847609803446
   Precision=0.9964825034871733
   Rejection Recall=0.0
   Rejection Precision=0.0
   F(1)=0.9966336093167136
     (...)

The accuracy, precision, and recall statistics are derived from the counts of True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN) sentences as follows:

  Accuracy  = (TP + TN) / (TP + TN + FP + FN)
  Recall    = TP / (TP + FN)
  Precision = TP / (TP + FP)
  F(1)      = 2 * Precision * Recall / (Precision + Recall)

Note: The evaluate Ant target assumes that the GENIA corpus has been downloaded per the instructions above, and that the files "GENIAcorpus3.02.xml" and "gpml.dtd" are in the lingpipe/demos/data directory. If either of these files is missing, the task will fail with a java.io.FileNotFoundException.

The SentenceModelEvaluator program is straightforward. First we create a SentenceChunker (as we did in the SentenceChunkerDemo.java program in section 1.2, above), and pass it in to the SentenceEvaluator constructor:

TokenizerFactory tokenizerFactory
    = IndoEuropeanTokenizerFactory.INSTANCE;
SentenceModel sentenceModel
    = new MedlineSentenceModel();
SentenceChunker sentenceChunker
    = new SentenceChunker(tokenizerFactory,sentenceModel);
SentenceEvaluator sentenceEvaluator
    = new SentenceEvaluator(sentenceChunker);

Then we create a GeniaSentenceParser, and set the SentenceEvaluator as its handler:

GeniaSentenceParser parser
    = new GeniaSentenceParser(sentenceEvaluator);
parser.setHandler(sentenceEvaluator);

The name of the GENIA XML corpus file is passed in as a command-line argument to the program. As the parser parses the corpus, the SentenceEvaluator adds pairs of reference and response chunkings to the evaluation, so the only call needed to carry out the evaluation is the call to the parser's parse method:

File inFile = new File(args[0]);
parser.parse(inFile);

Once the file has been parsed, we obtain the results of the evaluation from the com.aliasi.sentences.SentenceEvaluation object that the SentenceEvaluator contains. Both the chunking evaluation and the sentence end boundary evaluation use a com.aliasi.classify.PrecisionRecallEvaluation object to tally their results. This class contains a suite of descriptive statistics for binary classification tasks. The toString method returns a formatted representation of these statistics.

SentenceEvaluation sentenceEvaluation
    = sentenceEvaluator.evaluation();

PrecisionRecallEvaluation chunkingStats =
    sentenceEvaluation.chunkingEvaluation()
                      .precisionRecallEvaluation();
System.out.println("Chunking Evaluation statistics");
System.out.println(chunkingStats.toString());

PrecisionRecallEvaluation endBoundaryStats =
    sentenceEvaluation.endBoundaryEvaluation();
System.out.println("Sentence Evaluation end boundary statistics");
System.out.println(endBoundaryStats.toString());

The errors made by the sentence model are written to two files: EvaluatorFalseNegatives.txt and EvaluatorFalsePositives.txt.

EvaluatorFalseNegatives.txt contains a listing of sentences in the reference set (GENIA corpus) which are not in the response set (the sentence chunking returned by the MEDLINE sentence model), i.e. these are the sentences where the sentence model missed an end boundary. Here is an excerpt from this output file:

17. n has been termed "A/R tolerance." Exposing HUVECs to A/R induces an
18. d by electromobility shift assays. alpha 4 beta 1 ligation alone had
19. oetic cell lines containing Oct2,. CRISP-3 is pre-B cell-specific, N
20. ated by [3H]dexamethasone binding. Serum cortisol and urinary free c
21. ed lower calcemic effects in vivo. Large or polar substitutions on C
22. pha B, encoding the alpha subunit. alpha B is the mouse homologue of

EvaluatorFalsePositives.txt contains sentences in the response chunking which are not in the reference chunking, i.e. these are places where the sentence model incorrectly identified a token as a sentence-final token. Here is an excerpt from this output file:

20. ited neutrophil influx into the E. histolytica-infected intestinal x
21. d demonstrate a lower effect of C. Sub. on Ca2+ transport. Finally,
22. tous bioinactivation mechanism. 4. Fluorescence HPLC showed that SMX
23. d upstream of Cp resulted in a ca. two- to fivefold reduction in Cp
24.  p65, prompt rapid apoptosis of T. parva-transformed T cells. Our fi
25. of exposure to MTBE or benzene. 3. Peripheral blood lymphocytes (PBL
26. signal transduction pathways in C. pneumoniae-infected endothelial c

This output is generated by iterating over the set of false negatives and false positives returned by the SentenceEvaluation object. The members of this set are com.aliasi.chunk.ChunkAndCharSeq objects. A ChunkAndCharSeq object is a composite, containing a Chunk and the character sequence that contains it. This allows us to examine the start and end points of the sentence in context, using the spanStartContext and spanEndContext methods:

int i = 0;
Set<ChunkAndCharSeq> falseNegatives
    = sentenceEvaluation.falseNegativeEndBoundaries();
OutputStream fnFileOut
    = new FileOutputStream("EvaluatorFalseNegatives.txt");
PrintStream falseNegOut =
    new PrintStream(fnFileOut);
for (Iterator<ChunkAndCharSeq> it = falseNegatives.iterator();
     it.hasNext(); ++i ) {

    ChunkAndCharSeq sentence = it.next();
    falseNegOut.println(i + ". "
                        + sentence.spanEndContext(34));
}
falseNegOut.close();
int j = 0;
Set<ChunkAndCharSeq> falsePositives
    = sentenceEvaluation.falsePositiveEndBoundaries();
OutputStream fpFileOut
    = new FileOutputStream("EvaluatorFalsePositives.txt");
PrintStream falsePosOut =
    new PrintStream(fpFileOut);
for (Iterator<ChunkAndCharSeq> it = falsePositives.iterator();
     it.hasNext(); ++j ) {

    ChunkAndCharSeq sentence = it.next();
    falsePosOut.println(j + ". "
                        + sentence.spanEndContext(34));
}
falsePosOut.close();

These files are mainly of interest to the model developer who wishes to identify the kinds of errors made by the sentence model.

Developing and Tuning Sentence Models

In this section we show how to develop and tune a sentence model, again using the GENIA corpus as a gold standard. The source code for this demo contains a class DemoSentenceModel.java. The reader is encouraged to try successive modifications to the DemoSentenceModel program, and to use the SentenceModelEvaluator.java program to assess the impact of these changes on the model's performance.

Like the MedlineSentenceModel, the DemoSentenceModel extends the com.aliasi.sentences.HeuristicSentenceModel class. A HeuristicSentenceModel determines sentence boundaries based on sets of tokens, a pair of flags, and an overridable method describing boundary conditions, the boundaryIndices method. The gist of the HeuristicSentenceModel.boundaryIndices algorithm is that sentence boundaries are identified by looking at a token together with the tokens which precede and follow it. If a token is a sentence-final token, then the sentence boundary is the index of the character one past the last character in that token. In order for a token to be a sentence-final token, it must be a member of the set of sentence-final punctuation tokens, such as periods (.) and question marks (?). Furthermore, it must be followed by whitespace, and the following token (if any) must be a legal start token for a sentence. Sentences containing abbreviations such as "Mr. Smith" are problematic because a simplistic sentence model will treat the period following "Mr." as a sentence-final token. Therefore it is necessary to check the penultimate token in the sentence, and disallow common abbreviations.
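
To make the conditions above concrete, here is a rough sketch (not LingPipe's actual implementation) of the test such a heuristic model applies when deciding whether the token at position i ends a sentence; the three sets correspond to the token sets described below:

// Rough sketch of the per-token boundary test; illustration only.
static boolean isSentenceFinal(String[] tokens, String[] whitespaces, int i,
                               Set<String> possibleStops,
                               Set<String> impossiblePenultimates,
                               Set<String> impossibleStarts) {
    if (!possibleStops.contains(tokens[i]))
        return false;                     // not ".", "?", etc.
    if (i > 0 && impossiblePenultimates.contains(tokens[i-1]))
        return false;                     // e.g. an abbreviation before "."
    if (i+1 == tokens.length)
        return true;                      // stop token at end of input
    if (whitespaces[i+1].length() == 0)
        return false;                     // stop must be followed by whitespace
    return !impossibleStarts.contains(tokens[i+1]);  // next token may start a sentence
}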

The heuristic sentence model uses three sets of tokens: possible stops (tokens such as periods and question marks that may end a sentence), impossible penultimates (tokens such as common abbreviations that may not immediately precede a sentence-final stop), and impossible sentence starts (tokens that may not begin a sentence, such as close parentheses).

A further condition is imposed on sentence-initial tokens by the method possibleStart(String[],String[],int,int). This method checks a given token in a sequence of tokens and whitespaces to determine if it is a possible sentence start.

There are also two flags in the constructor that determine aspects of sentence boundary detection: forceFinalStop, which treats the final token of the input as a sentence-final token even if it is not a stop token, and balanceParens, which prevents sentence boundaries from being placed inside unbalanced parentheses or brackets.

The initial version of the DemoSentenceModel defines minimal sets of possible stops, impossible penultimates, and impossible sentence starts, and doesn't override any methods in HeuristicSentenceModel. Here is its constructor:

public DemoSentenceModel() {
   super(POSSIBLE_STOPS,
         IMPOSSIBLE_PENULTIMATES,
         IMPOSSIBLE_SENTENCE_STARTS,
         false,  // force final stop
         false); // balance parens
}
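
The POSSIBLE_STOPS, IMPOSSIBLE_PENULTIMATES, and IMPOSSIBLE_SENTENCE_STARTS constants are defined in DemoSentenceModel.java; their exact contents are best read from the source. A minimal sketch of what such definitions might look like (the membership below is illustrative, not the demo's actual lists):

// Illustrative only -- the demo's actual sets may differ.
static final Set<String> POSSIBLE_STOPS
    = new HashSet<String>(Arrays.asList(".","!","?"));
static final Set<String> IMPOSSIBLE_PENULTIMATES
    = new HashSet<String>(Arrays.asList("Mr","Mrs","Dr","vs"));
static final Set<String> IMPOSSIBLE_SENTENCE_STARTS
    = new HashSet<String>(Arrays.asList(")","]","}",",",";"));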

To evaluate the performance of the DemoSentenceModel we change the SentenceModelEvaluator to use the DemoSentenceModel instead (at line 36):

SentenceModel sentenceModel  = new DemoSentenceModel();

Then we once again execute the Ant target evaluate:

> ant evaluate

evaluate:
     (...)
 Sentence Evaluation end boundary statistics
   Total=16620
   True Positive=16344
   False Negative=140
   False Positive=136
   True Negative=0
   Positive Reference=16484
   Positive Response=16480
   Negative Reference=136
   Negative Response=140
   Accuracy=0.9833935018050541
   Recall=0.9915069157971366
   Precision=0.9917475728155339
   Rejection Recall=0.0
   Rejection Precision=0.0
   F(1)=0.9916272297051328
     (...)

This model performs quite well, with overall accuracy and F-measures above 99%. However, the number of false positives and false negatives is markedly higher than the corresponding numbers for the MEDLINE sentence model, so we examine the EvaluatorFalseNegatives.txt and EvaluatorFalsePositives.txt output files. Here are the first 21 false negatives (sentence boundaries that the DemoSentenceModel failed to identify):

 0.  but not IL-5-nonproducing clones. pIL-5(-511)Luc was transcribed by
 1.  stages of B-cell differentiation. mBob1 interacts with the octamer
 2. omains of phospholipase C gamma 1. p38 also forms a complex with the
 3. f this positive regulatory element
 4. h PKC-dependent signaling systems. gamma B*CaM-K and delta CaM-AI, k
 5. (c))-like molecule, IL-13R alpha1. mRNA levels for IL-13R alpha1, bu
 6. -jun expression was not modulated. c-myc mRNA expression, constituti
 7. the nuclear translocation signals. mNFATc complexed with AP-1 bound
 8. regulating cell-cycle progression. p27Kip1 directly inhibits the cat
 9.  aberrant retinoic acid metabolism
10. f the lytic regulatory gene BZLF 1
11. nally differentiated myeloid cells
12. merized with phosphorylated c-Jun. c-Jun protein isolated from phorb
13. es integration of opposing signals
14. c.beta differs from that of NFATc. alpha in the first NH2-terminal 2
15. ted, regardless of disease status. hLH-2 was mapped to chromosome 9Q
16.  (ABSTRACT TRUNCATED AT 250 WORDS)
17.  (ABSTRACT TRUNCATED AT 250 WORDS)
18.  (ABSTRACT TRUNCATED AT 250 WORDS)
19. o cortisol resistance in monocytes
20. th premature aging syndromes (Down

Roughly half of the above entries arise because MEDLINE abstracts are sometimes truncated, and these truncated abstracts don't end with proper sentence-final punctuation. In the GENIA corpus, these are labeled as sentences. To handle this, we change the constructor, setting the forceFinalStop argument in the call to the superclass constructor to true:

public DemoSentenceModel() {
   super(POSSIBLE_STOPS,
         IMPOSSIBLE_PENULTIMATES,
         IMPOSSIBLE_SENTENCE_STARTS,
         true,  // force final stop
         false); // balance parens
}

Then we once again execute the Ant target evaluate:

> ant evaluate

evaluate:
     (...)
 Sentence Evaluation end boundary statistics
   Total=16620
   True Positive=16409
   False Negative=75
   False Positive=136
   True Negative=0
   Positive Reference=16484
   Positive Response=16545
   Negative Reference=136
   Negative Response=75
   Accuracy=0.9873044524669073
   Recall=0.9954501334627518
   Precision=0.9917799939558779
   Rejection Recall=0.0
   Rejection Precision=0.0
   F(1)=0.9936116745889976
     (...)

This change cuts the number of false negatives from 140 to 75. The number of false positives remains unchanged. Now we look at the entries in the file EvaluatorFalsePositives.txt. These are places where the DemoSentenceModel mistakenly identified punctuation as a sentence boundary. Here are the first 21 entries:

 0.  J., Hinrichs, S. H., Reynolds, R. K., Luciw, P. A., and Jay, G. (19
 1. These suggest that prolonged, i.e. 28 day, glucocorticoid therapy ma
 2. 20%, respectively in monocytes. 2. Danazol did not alter the degrada
 3. of exposure to MTBE or benzene. 3. Peripheral blood lymphocytes (PBL
 4. cocorticoids (Cushing's syndrome). Type II corticosteroid receptors
 5. .-M.Chen, and D.G.Tenen, Mol.Cell. Biol.14:373-381, 1994). Here we r
 6. on of inflammatory cytokine genes. Several other transcription facto
 7. f CD19 cross-linking in 1E8 cells. Supershift experiments revealed t
 8.  4.0 +/- 0.31 and 4.1 +/- 0.34 vs. 2.9 +/- 0.29 nmol/L, p < .001) an
 9. zed an epitope mapping within E1B. When inoculated twice with Ad vec
10. 0.5 (0.2-1.6) fmol/10(7) cells vs. 2.3 +/- 0.9 (1.1-4.4) fmol/10(7)
11. lated at the S-G2/M boundaries. 5. One of the signaling molecules wh
12. ation of GABP factors (E.Flory, A. Hoffmeyer, U.Smola, U.R.Rapp, and
13. 1. Administration of danazol for ove
14. d significantly (1.73 +/- 0.08 vs. 1.16 +/- 0.09 arbitrary units, P
15. the pol gene (E. Verdin, J. Virol. 65:6790-6799, 1991). In the prese
16. genome [Noteborn et al., J. Virol. 65 (1991) 3131-3139] of chicken a
17. y recognized in vitro by donor (D. E.) CD4 T cells in a HLA class II
18. d with the pol gene (E. Verdin, J. Virol. 65:6790-6799, 1991). In th
19. formed cell line from the patient. These observations indicate that
20.  S. H., Reynolds, R. K., Luciw, P. A., and Jay, G. (1988) Nature 335

Entry #15 is typical of many of these errors. MEDLINE abstracts frequently contain citations to other journal articles. These citations contain many abbreviations, both of names and journal titles, and the periods are mistakenly identified as end of sentence markers. Since these citations are almost always offset by parentheses or brackets, using the parenthesis balancing feature of the HeuristicSentenceModel will eliminate this error. Therefore we change the DemoSentenceModel constructor again, this time to:

public DemoSentenceModel() {
   super(POSSIBLE_STOPS,
         IMPOSSIBLE_PENULTIMATES,
         IMPOSSIBLE_SENTENCE_STARTS,
         true,  // force final stop
         true); // balance parens
}

Then we once again execute the Ant target evaluate:

> ant evaluate

evaluate:
     (...)
 Sentence Evaluation end boundary statistics
   Total=16538
   True Positive=16407
   False Negative=77
   False Positive=54
   True Negative=0
   Positive Reference=16484
   Positive Response=16461
   Negative Reference=54
   Negative Response=77
   Accuracy=0.9920788487120571
   Recall=0.9953288036884251
   Precision=0.9967195188627666
   Rejection Recall=0.0
   Rejection Precision=0.0
   F(1)=0.9960236758233421
     (...)

This change cuts the number of false positives from 136 to 54. The number of false negatives increases from 75 to 77. Overall the accuracy of the model is improved, so we keep this change in place.

Once again we consider the remaining false negatives in EvaluatorFalseNegatives.txt. Here are the first 21 entries:

 0. he specific 'pre-activation', i.e. constitutive nuclear translocatio
 1. cal use of steroid receptor drugs. --Vegeto, E., Pollio, G., Pellicc
 2. n has been termed "A/R tolerance." Exposing HUVECs to A/R induces an
 3. rotein expression in FDC clusters. p65 was detected in the cytoplasm
 4. -jun expression was not modulated. c-myc mRNA expression, constituti
 5. n 2 (IL-2) stimulates IL-2R alpha. transcription, thereby amplifying
 6.  stages of B-cell differentiation. mBob1 interacts with the octamer
 7. ly independent of LEF/TCF factors. beta-Catenin and LEF-1 complexes
 8.  of the heme biosynthetic pathway. cDNA clones for the human erythro
 9.  57.1 dpm mg-1 cytosol protein vs. 227.0 +/- 90.8 dpm mg-1 cytosol p
10. (c))-like molecule, IL-13R alpha1. mRNA levels for IL-13R alpha1, bu
11.  distinct genes of the Rel family. p50 is translated as a precursor
12. media has been sustained for 3 mo. with culture doubling times of ab
13. o nephrotoxicity and fibrogenesis? How important are the anti-inflam
14. to the full-length protein as p97. p50B is able to form heteromeric
15. d by alpha-interferon (alpha-IFN). alpha-IFN causes dephosphorylatio
16.  in collagen-stimulated platelets. p38 and p63 may provide a docking
17. HLA-Cw*0702, while FCS reduced it. beta 2-m increased the binding to
18. lls (macrophages and neutrophils). mRNA for c-fes has been detected
19. p50.p65 heterodimers was observed. p50.c-rel heterodimers were also
20. inase-associated lipocalin (NGAL). ngal gene expression was found at

Entry #3 shows a remaining problem for this model: there are biological names which are never capitalized, such as "p65", "mRNA", "alpha-IFN", or "beta-Catenin", therefore determining a possible sentence start cannot be done on the basis of initial capitalization. Examination of these names shows that most of them contain digits or uppercase letters. Many names contain hyphens, such as "alpha-IFN" and "c-FOS". These names are problematic since the Indo-European tokenizer will break them into a sequence of three tokens: "c", "-", "FOS"; therefore it is necessary to look through the next several tokens following the possible sentence boundary token to determine whether or not what follows is a good sentence start.
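
A quick way to see this splitting (not part of the demo) is to run the same tokenizer factory over one of these names:

// Check how the Indo-European tokenizer splits a hyphenated name.
String name = "c-FOS";
List<String> toks = new ArrayList<String>();
List<String> whites = new ArrayList<String>();
IndoEuropeanTokenizerFactory.INSTANCE
    .tokenizer(name.toCharArray(),0,name.length())
    .tokenize(toks,whites);
System.out.println(toks);   // expected output: [c, -, FOS]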

The MEDLINE sentence model class overrides the method possibleStart to allow for names like these. The MedlineSentenceModel.possibleStart method allows any contiguous (whitespace-free) run of tokens in which some token contains an uppercase letter or a digit to be a good sentence start. The arguments to this method are the arrays of tokens and whitespaces that the tokenizer produces from the text of the abstract, along with indices into these arrays that give the region of the tokenization that needs to be checked for a possible start. Here is a (slightly simplified) version of this method:

protected boolean possibleStart(String[] tokens,
                                String[] whitespaces,
                                int start, int end) {
    for (int i = start; i < end; i++) {
        if (containsDigitOrUpper(tokens[i]))
            return true;
        if (whitespaces[i+1].length() > 0)
            return false;
    }
    return false;
}

private boolean containsDigitOrUpper(String token) {
    int len = token.length();
    for (int i=0; i < len; i++) {
        if (Character.isUpperCase(token.charAt(i)))
            return true;
        if (Character.isDigit(token.charAt(i)))
            return true;
    }
    return false;
}

If we cut and paste these two methods into the DemoSentenceModel and re-run the evaluation, we get the following results:

> ant evaluate

evaluate:
     (...)
 Sentence Evaluation end boundary statistics
   Total=16539
   True Positive=16452
   False Negative=32
   False Positive=55
   True Negative=0
   Positive Reference=16484
   Positive Response=16507
   Negative Reference=55
   Negative Response=32
   Accuracy=0.9947397061491021
   Recall=0.9980587236107741
   Precision=0.9966680802083965
   Rejection Recall=0.0
   Rejection Precision=0.0
   F(1)=0.9973629171592253
     (...)

Once again we have reduced the number of false negatives by half. This performance is almost as good as that of the LingPipe MEDLINE sentence model (reported in section 2 of this tutorial). The interested reader is encouraged to examine the code of the MedlineSentenceModel class to see further possible refinements.

At this point we have achieved very high accuracy against the GENIA corpus. It is not clear how much further tuning of the model will be useful for the general task of processing the MEDLINE citation index. The GENIA corpus contains only 2000 MEDLINE abstracts, while the number of abstracts in the MEDLINE citation index stands at around 10 million. Continuing to tune and evaluate the DemoSentenceModel against the GENIA corpus runs the risk of overfitting the model to the data, and might actually detract from overall accuracy when processing new data. Therefore we conclude this tutorial here.

References