By Mitzi Morris (Tiger Blue Software)
What is Sentence Detection?
This tutorial shows how to segment a text into its constituent sentences using
a LingPipe SentenceModel, and how to evaluate and tune sentence models.
It uses MEDLINE data as the example data. MEDLINE is a collection of 13 million plus citations into the bio-medical literature maintained by the United States National Library of Medicine (NLM), and is distributed in XML format. The MEDLINE Parsing and Indexing Demo covers how to parse this data from XML into a structured Java object.
The first part of this tutorial shows how to segment a text into its
constituent sentences using a LingPipe SentenceModel.
The second part shows how to use the LingPipe SentenceEvaluator
together with a corpus of correctly annotated data (a gold standard)
to determine the accuracy of a model.
Finally, we discuss the existing sentence models in the API,
and ways to tune them.
Using Sentence Models
The SentenceModel Interface
The LingPipe
com.aliasi.sentences.SentenceModel interface
specifies a means of doing sentence segmentation from arrays of
tokens and whitespaces, namely the boundaryIndices method,
which takes an array of tokens, and an array of whitespaces, and returns
an array of indices of sentence-final tokens.
The SentenceBoundaryDemo.java
program shows how to use a sentence model to find sentence boundaries in a text.
It takes an input file of plain text.
It first processes the file into lists of tokens and whitespace,
and then uses the MEDLINE sentence model to find the sentence boundaries.
To run this from the command line, type the following (on one line):
java -cp "sentence-demo.jar;../../../lingpipe-3.6.0.jar" SentenceBoundaryDemo ../../data/sentence_demo.txt
This tutorial also comes with an Ant
build.xml file which defines
targets used to run all of the demo programs.
To run the SentenceBoundaryDemo program
execute the Ant target findbounds:
> ant findbounds
which produces the following output (with the [java]
tags inserted by Ant removed for clarity):
findbounds:
INPUT TEXT:
The induction of immediate-early (IE) response genes, such as egr-1, c-fos, and c-jun, occurs rapidly after the activation of T lymphocytes. The process of activation involves calcium mobilization, activation of protein kinase C (PKC), and phosphorylation of tyrosine kinases. p21(ras), a guanine nucleotide binding factor, mediates T-cell signal transduction through PKC-dependent and PKC-independent pathways. The involvement of p21(ras) in the regulation of calcium-dependent signals has been suggested through analysis of its role in the activation of NF-AT. We have investigated the inductions of the IE genes in response to calcium signals in Jurkat cells (in the presence of activated p21(ras)) and their correlated consequences.
150 TOKENS
151 WHITESPACES
5 SENTENCE END TOKEN OFFSETS
SENTENCE 1:
The induction of immediate-early (IE) response genes, such as egr-1, c-fos, and c-jun, occurs rapidly after the activation of T lymphocytes.
SENTENCE 2:
The process of activation involves calcium mobilization, activation of protein kinase C (PKC), and phosphorylation of tyrosine kinases.
SENTENCE 3:
p21(ras), a guanine nucleotide binding factor, mediates T-cell signal transduction through PKC-dependent and PKC-independent pathways.
SENTENCE 4:
The involvement of p21(ras) in the regulation of calcium-dependent signals has been suggested through analysis of its role in the activation of NF-AT.
SENTENCE 5:
We have investigated the inductions of the IE genes in response to calcium signals in Jurkat cells (in the presence of activated p21(ras)) and their correlated consequences.
The inputs to the SentenceModel method
boundaryIndices are an array of tokens and an array of
whitespaces. Therefore we must first process the text into token and
whitespace arrays, then identify sentence boundaries. The
SentenceBoundaryDemo.java program uses the class
com.aliasi.tokenizer.IndoEuropeanTokenizerFactory to
provide a tokenizer, and a
com.aliasi.sentences.MedlineSentenceModel
to do the sentence boundary detection:
static final TokenizerFactory TOKENIZER_FACTORY
= new IndoEuropeanTokenizerFactory();
static final SentenceModel SENTENCE_MODEL
= new MedlineSentenceModel();
The TokenizerFactory method tokenizer
returns a
com.aliasi.tokenizer.Tokenizer. The
tokenize method parses the text into tokens and
whitespaces, adding them to their respective lists:
ArrayList tokenList = new ArrayList();
ArrayList whiteList = new ArrayList();
Tokenizer tokenizer
= TOKENIZER_FACTORY.tokenizer(text.toCharArray(),
0,text.length());
tokenizer.tokenize(tokenList,whiteList);
The tokenList and whiteList lists filled
by the tokenizer are parallel. The whitespace at index
[i] is the whitespace that precedes the token at index [i].
The tokenizer returns elements for the whitespace preceding the first token and
the whitespace following the last token. Therefore in the above example we see that
the whitespace array contains 151 elements, while the token array contains 150 elements.
We convert the ArrayList objects into their corresponding String
arrays, and then invoke the boundaryIndices method:
String[] tokens = new String[tokenList.size()];
String[] whites = new String[whiteList.size()];
tokenList.toArray(tokens);
whiteList.toArray(whites);
int[] sentenceBoundaries
= SENTENCE_MODEL.boundaryIndices(tokens,whites);
The boundaryIndices method returns an array whose values are the indices of the
elements in the tokens array which are sentence final tokens.
To extract the sentences we iterate through the sentence boundaries array,
keeping track of the indices of the sentence start and end tokens, and printing
out the correct elements from the tokens and whitespaces arrays.
Here is the code to print out the sentences found in the abstract, one per line:
int sentStartTok = 0;
int sentEndTok = 0;
for (int i = 0; i < sentenceBoundaries.length; ++i) {
sentEndTok = sentenceBoundaries[i];
System.out.println("SENTENCE "+(i+1)+": ");
for (int j=sentStartTok; j <= sentEndTok; j++) {
System.out.print(tokens[j]+whites[j+1]);
}
System.out.println();
sentStartTok = sentEndTok+1;
}
The above code block prints every token in the tokens array,
and the whitespace following that token.
Because line breaks count as whitespace, the individual sentences show the same
pattern of spacing and linebreaks as in the input text.
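As a self-contained illustration of this loop, the following sketch rebuilds sentences from hand-written token and whitespace arrays. The class SentenceSlicer and its sample data are hypothetical and not part of LingPipe, and the boundary indices are supplied by hand rather than computed by a SentenceModel. Unlike the demo, the sketch leaves out the whitespace that follows each sentence-final token.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of LingPipe: rebuilds sentence strings from
// parallel token/whitespace arrays and an array of sentence-final token indices.
class SentenceSlicer {

    // whites[i] precedes tokens[i]; whites[tokens.length] follows the last token.
    static List<String> sentences(String[] tokens, String[] whites, int[] boundaries) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int end : boundaries) {
            StringBuilder sb = new StringBuilder();
            for (int j = start; j <= end; j++) {
                sb.append(tokens[j]);
                // keep whitespace between tokens, but not after the final token
                if (j < end) {
                    sb.append(whites[j + 1]);
                }
            }
            result.add(sb.toString());
            start = end + 1; // the next sentence starts at the following token
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"See", "Spot", "run", ".", "Run", "Spot", "run", "."};
        String[] whites = {"", " ", " ", "", " ", " ", " ", "", ""};
        int[] boundaries = {3, 7}; // indices of the two period tokens
        for (String s : sentences(tokens, whites, boundaries)) {
            System.out.println(s);
        }
    }
}
```

Note that the whitespace array has one more element than the token array, matching the 151 whitespaces and 150 tokens reported above.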
Chunkings and Chunkers
In this section we show how to simplify the task of dealing with
sentences and sentence boundaries, by rewriting the
SentenceBoundaryDemo to use a
com.aliasi.sentences.SentenceChunker.
The rewritten program is SentenceChunkerDemo.java.
To run this program execute the Ant target findchunks as before,
which produces:
> ant findchunks
findchunks:
INPUT TEXT:
The induction of immediate-early (IE) response genes, such as egr-1, c-fos, and c-jun, occurs rapidly after the activation of T lymphocytes. The process of activation involves calcium mobilization, activation of protein kinase C (PKC), and phosphorylation of tyrosine kinases. p21(ras), a guanine nucleotide binding factor, mediates T-cell signal transduction through PKC-dependent and PKC-independent pathways. The involvement of p21(ras) in the regulation of calcium-dependent signals has been suggested through analysis of its role in the activation of NF-AT. We have investigated the inductions of the IE genes in response to calcium signals in Jurkat cells (in the presence of activated p21(ras)) and their correlated consequences.
SENTENCE 1:
The induction of immediate-early (IE) response genes, such as egr-1, c-fos, and c-jun, occurs rapidly after the activation of T lymphocytes.
SENTENCE 2:
The process of activation involves calcium mobilization, activation of protein kinase C (PKC), and phosphorylation of tyrosine kinases.
SENTENCE 3:
p21(ras), a guanine nucleotide binding factor, mediates T-cell signal transduction through PKC-dependent and PKC-independent pathways.
SENTENCE 4:
The involvement of p21(ras) in the regulation of calcium-dependent signals has been suggested through analysis of its role in the activation of NF-AT.
SENTENCE 5:
We have investigated the inductions of the IE genes in response to calcium signals in Jurkat cells (in the presence of activated p21(ras)) and their correlated consequences.
The above output is almost identical to that of SentenceBoundaryDemo except that
there is no tokenization information.
This is because the SentenceChunker handles tokenization.
A SentenceChunker is constructed from a
TokenizerFactory and a SentenceModel:
static final TokenizerFactory TOKENIZER_FACTORY
= new IndoEuropeanTokenizerFactory();
static final SentenceModel SENTENCE_MODEL
= new MedlineSentenceModel();
static final SentenceChunker SENTENCE_CHUNKER
= new SentenceChunker(TOKENIZER_FACTORY,
SENTENCE_MODEL);
The SentenceChunker method chunk produces a
com.aliasi.chunk.Chunking over the text.
A Chunking is a set of
com.aliasi.chunk.Chunk objects
over a shared CharSequence.
The chunkSet method returns the set of (sentence) chunks,
and the charSequence method returns the underlying
character sequence.
Chunking chunking
= SENTENCE_CHUNKER.chunk(text.toCharArray(),
0,text.length());
Set sentences = chunking.chunkSet();
String slice = chunking.charSequence().toString();
We use the start and end index information from each chunk to print the text of the sentence in the abstract:
int i = 1;
for (Iterator it = sentences.iterator();
it.hasNext(); ) {
Chunk sentence = (Chunk)it.next();
int start = sentence.start();
int end = sentence.end();
System.out.println("SENTENCE "+(i++)+":");
System.out.println(slice.substring(start,end));
}
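The way start and end offsets index into the shared character sequence can be sketched without LingPipe. The Span class below is a hypothetical stand-in for a chunk, assuming (as the substring call above implies) that the end offset is one past the last character of the sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a chunk: a [start, end) span over shared text.
class Span {
    final int start; // index of the first character
    final int end;   // index one past the last character

    Span(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

class SpanDemo {

    // Slice each span out of the shared character sequence.
    static List<String> slices(CharSequence text, List<Span> spans) {
        List<String> out = new ArrayList<>();
        for (Span s : spans) {
            out.add(text.subSequence(s.start, s.end).toString());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "See Spot run. Run Spot run.";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span(0, 13));
        spans.add(new Span(14, 27));
        int i = 1;
        for (String s : slices(text, spans)) {
            System.out.println("SENTENCE " + (i++) + ": " + s);
        }
    }
}
```

Because every span refers to the same underlying text, the whitespace between sentences (offset 13 here) belongs to no span and is simply skipped.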
SAX Filters
The LingPipe
com.aliasi.sentences.SentenceAnnotateFilter class
provides a means of adding sentence annotation to some or all text elements
in an XML document.
In this way, an XML-based processing pipeline can use
sentence boundary information.
The SAXSentenceFilterDemo.java
program shows how to use this filter with a SAX parser. It takes a
file of MEDLINE citations, and finds all sentences within
<ArticleTitle> and
<AbstractText> elements, wrapping each sentence in
a <sent> element. The demo program uses medsamp2006.xml, a small sample
file of MEDLINE data in plain XML as the input to this demo. The
output is written to the file SentenceFilterDemoOutput.xml
(this file only exists after the demo is run). To run this
program execute the Ant target saxfilter:
> ant saxfilter
The SentenceAnnotateFilter is constructed with a sentence model and
tokenizer factory and optionally, a list of the XML elements whose content should
be annotated. If no list is provided, then all text elements will be annotated.
TokenizerFactory tokenizerFactory
= new IndoEuropeanTokenizerFactory();
SentenceModel sentenceModel
= new MedlineSentenceModel();
String[] elts = new String[] {
MedlineCitationSet.ARTICLE_TITLE_ELT,
MedlineCitationSet.ABSTRACT_TEXT_ELT };
SentenceAnnotateFilter sentenceFilter
= new SentenceAnnotateFilter(sentenceModel,
tokenizerFactory,
elts);
The SAX parser sends its inputs through a chain of handlers (filters). In this demo, we chain together the following:
- com.aliasi.xml.GroupCharactersFilter: concatenates adjacent characters within an element's content into a single call.
- com.aliasi.sentences.SentenceAnnotateFilter: annotates sentences in a text element, wrapping each sentence in a <sent> element.
- com.aliasi.xml.SAXWriter: writes the annotated XML to an output file.
Filter chains are set up by setting the successive filter as the handler of the filter that precedes it. Therefore, we start by creating the last filter in the chain, and then set it as the handler to its predecessor.
FileOutputStream fileOut
= new FileOutputStream("SentenceFilterDemoOutput.xml");
SAXWriter writer
= new SAXWriter(fileOut,Strings.UTF8);
sentenceFilter.setHandler(writer);
GroupCharactersFilter handler
= new GroupCharactersFilter(sentenceFilter);
The first filter in the chain is set as the handler to the
XMLReader:
XMLReader xmlReader = XMLReaderFactory.createXMLReader();
xmlReader.setContentHandler(handler);
xmlReader.setDTDHandler(handler);
xmlReader.setErrorHandler(handler);
xmlReader.setEntityResolver(handler);
Finally, we get the name of the input file of XML data from the
command line, use the utility util.Files.fileToURLName(File)
to convert it to a URL, and then parse its contents with the XML
reader:
File file = new File(args[0]);
String url = Files.fileToURLName(file);
InputSource inSource = new InputSource(url);
xmlReader.parse(inSource);
The SAXWriter, which is part of the handler chain for the XML reader, will stream the output to the file as the input document is being parsed. In fact, this filter can be used to parse the larger 20GB compressed MEDLINE documents using very little memory.
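The handler-chaining pattern used in this section can be sketched with only the SAX classes that ship with the JDK. In the sketch below, UpperCaseFilter and Collector are hypothetical stand-ins for the annotating filter and the writer; they are not LingPipe classes, and the upper-casing is only a placeholder for real annotation.

```java
import java.io.StringReader;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch of a SAX handler chain using only JDK classes.
class FilterChainSketch {

    // Middle link: upper-cases character content, standing in for a
    // filter that rewrites text (such as a sentence annotator).
    static class UpperCaseFilter extends XMLFilterImpl {
        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            char[] upper = new String(ch, start, length)
                .toUpperCase(Locale.ROOT).toCharArray();
            super.characters(upper, 0, upper.length); // forward to the next handler
        }
    }

    // Last link: collects a crude serialization (attributes are ignored).
    static class Collector extends DefaultHandler {
        final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }
    }

    static String run(String xml) {
        try {
            Collector collector = new Collector();   // create the last link first
            UpperCaseFilter filter = new UpperCaseFilter();
            filter.setContentHandler(collector);     // attach it to its predecessor
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(filter);        // the first link receives parser events
            reader.parse(new InputSource(new StringReader(xml)));
            return collector.out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<doc><title>hello world</title></doc>"));
    }
}
```

Each link only holds a reference to the next handler, so events flow through the chain as the parser produces them and no link needs to buffer the whole document.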
Evaluating Sentence Models
In this section we show how to evaluate a sentence model.
To evaluate a sentence model, we need a reference corpus of text which has a set of sentence boundary markers. For MEDLINE data, we can use the GENIA XML corpus as the gold standard. The GENIA XML corpus is a set of 2000 MEDLINE abstracts which have been annotated for sentence boundaries and biomedical terms ("cons" elements). Here is a sample abstract from this corpus (prettified with whitespace):
<abstract>
  <sentence>
    <cons lex="Toremifene" sem="G#other_organic_compound">Toremifene</cons> exerts multiple and varied effects on the <cons lex="gene_expression" sem="G#other_name">gene expression</cons> of <cons lex="human_peripheral_mononuclear_cell" sem="G#cell_type">human peripheral mononuclear cells</cons>.
  </sentence>
  <sentence>
    After short-term, in <cons lex="vitro_exposure" sem="G#other_name">vitro exposure</cons> to <cons lex="therapeutical_level" sem="G#other_name">therapeutical levels</cons>, distinct changes in <cons lex="(AND P-glycoprotein_expression steroid_receptors_expression p53_expression Bcl-2_expression)" sem="(AND G#other_name G#other_name G#other_name G#other_name)"><cons lex="P-glycoprotein*">P-glycoprotein</cons>, <cons lex="steroid_receptor*">steroid receptors</cons>, <cons lex="p53*">p53</cons> and <cons lex="Bcl-2*">Bcl-2</cons> <cons lex="*expression">expression</cons></cons> take place.
  </sentence>
  <sentence>
    In view of the increasing use of <cons lex="antiestrogen" sem="G#lipid">antiestrogens</cons> in <cons lex="cancer_therapy" sem="G#other_name">cancer therapy</cons> and <cons lex="cancer_prevention" sem="G#other_name">prevention</cons>, there is obvious merit in <cons lex="long-term_in_vivo_study" sem="G#other_name">long-term in vivo studies</cons> to be conducted.
  </sentence>
</abstract>
In order to run parts 2 and 3 of this tutorial, the GENIA corpus must be downloaded directly from the GENIA project website. To do this:
- Go to the GENIA corpus download page.
- Choose item: "GENIA Corpus/GPML Ver 3.0", fill in all other required form data, and submit.
- Save the file "GENIAcorpus3.02.tgz"
- Uncompress and untar this archive (tar -zxvf GENIAcorpus3.02.tgz). It should contain the following files:
  - GENIAcorpus3.02.xml
  - gpml.css
  - gpml.css.legend.html
  - gpml.dtd
  - gpml.readme.html
- Move or copy these files into the directory "lingpipe/demos/data"
To use this corpus in an evaluation, we parse each abstract into a
com.aliasi.chunk.Chunking object.
The LingPipe API provides a SAX parser with this functionality:
com.aliasi.corpus.parsers.GeniaSentenceParser.
The Chunking.charSequence holds the plain text of the abstract,
and the Chunking.chunkSet holds the sentence boundary information.
We use this as the reference chunking against which to evaluate the performance
of a sentence model by using a SentenceChunker, as in the
SentenceChunkerDemo, above:
first we create a SentenceChunker for the sentence model
we wish to evaluate. We invoke the chunk method on the
text of the abstract (the charSequence of the reference chunking).
This gives us a response chunking.
To evaluate the response chunking against the reference chunking
we compare the members of the respective chunkSet objects,
that is, we compare the set of sentences that we know to be in the abstract
with the set of sentences found by the sentence model, using a 4-way classification:
- True Positives (TP): sentences in the reference chunking and in the response chunking.
- False Positives (FP): sentences in the response chunking which are not in the reference chunking.
- False Negatives (FN): sentences in the reference chunking which are not in the response chunking.
- True Negatives (TN): this number is always zero. It is the number of items which are neither in the reference chunking nor in the response chunking. Since we only collect the sentences from the GENIA corpus and the response chunking, we have no true negatives.
The LingPipe API provides a
com.aliasi.sentences.SentenceEvaluator
which creates this evaluation for us. From the javadoc:
A SentenceEvaluator handles reference chunkings by
constructing a response chunking and adding them to a sentence
evaluation. The resulting evaluation may be retrieved through the
method evaluation() at any time.
This evaluator class implements the ChunkHandler
interface.
The chunkings passed to the handle(Chunking)
method are treated as reference chunkings.
Their character sequence is extracted using Chunking#charSequence()
and the contained sentence chunker is used to produce a
response chunking over the character sequence.
The resulting pair of chunkings is passed to the contained sentence evaluation.
Running the evaluation is straightforward:
we create a GeniaSentenceParser instance
and a SentenceEvaluator instance, and
then set the SentenceEvaluator as the default handler
for the GeniaSentenceParser.
The GeniaSentenceParser parses each abstract into a reference chunking,
and then invokes the handle(Chunking) method of the
SentenceEvaluator.
The SentenceEvaluator creates the response chunking from the
reference chunking, and adds the pair of reference, response chunkings to the
evaluation, so that parsing and evaluation are carried out in tandem.
The SentenceEvaluator object contains a
com.aliasi.sentences.SentenceEvaluation object,
which contains all the evaluation cases, (the pairs of reference and response chunkings),
and the evaluation metrics, which are updated as each new case is added to the evaluation.
The SentenceEvaluation contains a
com.aliasi.chunk.ChunkingEvaluation object,
which evaluates the sentences qua chunkings.
The SentenceEvaluation also evaluates the sentence model solely
in terms of the sentence end boundaries.
As we saw in the first part of this tutorial,
the SentenceModel doesn't identify sentence initial tokens,
only the sentence-final tokens.
Implicit in this model is the assumption that all tokens belong to a sentence,
therefore once we have found the end token in a sentence, we know that the start
token of the next sentence must be the following token.
Evaluations which score chunking errors and evaluations which score sentence end boundary errors
yield different counts of the errors made by the sentence model.
Consider the case where the sentence model fails to identify a sentence boundary in a sentence:
ref:  ------------X-------------+------
text: See Spot run. Run spot run. (...)
resp: --------------------------+------
pos:            11111111112222222222
pos:  012345678901234567890123456789
The reference chunking will contain two Chunk objects,
with start and end values of (0,13), (14,27) respectively. The
response chunking will contain one Chunk object, with
start and end values (0,27). The ChunkingEvaluation will
add the two reference chunking chunks to the set of false negatives,
and one response chunking chunk to the set of false positives. The
SentenceEvaluation will compare sets of end boundaries.
The reference chunking end boundaries set contains the values 13 and
27, while the response chunking contains only 27, therefore the
SentenceEvaluation counts the missed sentence boundary at
position 13 as a single false negative. This approach to counting
errors has two advantages: the statistics returned by counting only
end boundary errors are better, since the overall number of false
positives and false negatives is lower; and the sets of false
positives and negatives contain only examples where the sentence-final
boundary was incorrect. This latter point is relevant for the
developer who is building or tuning the sentence model and will be
covered in detail in the third part of this tutorial.
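The two ways of counting can be reproduced for the "See Spot run" example with plain set differences. The sketch below is a hypothetical illustration, not the LingPipe evaluation code: chunks are encoded as "start,end" strings and end boundaries as integers.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch contrasting chunk-level and end-boundary error counts.
class ChunkVsEndBoundary {

    // Number of elements of a that are absent from b.
    static <T> int missingFrom(Set<T> a, Set<T> b) {
        Set<T> diff = new HashSet<>(a);
        diff.removeAll(b);
        return diff.size();
    }

    // Whole chunks, encoded as "start,end" strings.
    static Set<String> chunks(int[][] spans) {
        Set<String> result = new HashSet<>();
        for (int[] span : spans) {
            result.add(span[0] + "," + span[1]);
        }
        return result;
    }

    // End offsets only.
    static Set<Integer> ends(int[][] spans) {
        Set<Integer> result = new HashSet<>();
        for (int[] span : spans) {
            result.add(span[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        int[][] reference = { {0, 13}, {14, 27} }; // two gold-standard sentences
        int[][] response = { {0, 27} };            // model missed the boundary at 13
        System.out.println("chunk FN=" + missingFrom(chunks(reference), chunks(response))
            + " FP=" + missingFrom(chunks(response), chunks(reference)));
        System.out.println("end boundary FN=" + missingFrom(ends(reference), ends(response))
            + " FP=" + missingFrom(ends(response), ends(reference)));
    }
}
```

For this example the chunk-level comparison reports two false negatives and one false positive, while the end-boundary comparison reports a single false negative and no false positives.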
The SentenceModelEvaluator.java
program shows how to construct and run an evaluator, and report the
results of the evaluation. This program runs the
GeniaSentenceParser over the GENIA XML corpus, and prints
out the result of the evaluation.
To run this program execute the Ant target evaluate.
> ant evaluate
evaluate:
Chunking Evaluation statistics
Total=16623
True Positive=16350
False Negative=134
False Positive=139
True Negative=0
Positive Reference=16484
Positive Response=16489
Negative Reference=139
Negative Response=134
Accuracy=0.9835769716657643
Recall=0.9918709051201164
Precision=0.9915701376675359
Rejection Recall=0.0
Rejection Precision=0.0
F(1)=0.9917204985897552
(...)
Sentence Evaluation end boundary statistics
Total=16542
True Positive=16431
False Negative=53
False Positive=58
True Negative=0
Positive Reference=16484
Positive Response=16489
Negative Reference=58
Negative Response=53
Accuracy=0.9932898077620602
Recall=0.9967847609803446
Precision=0.9964825034871733
Rejection Recall=0.0
Rejection Precision=0.0
F(1)=0.9966336093167136
(...)
The accuracy, precision, and recall statistics are derived from the counts of True Positive (TP), False Negative (FN), False Positive (FP), and True Negative (TN) sentences as follows:
- Precision is TP/(TP+FP).
- Recall is TP/(TP+FN).
- Accuracy is just (TP+TN)/(TP+FP+FN+TN). Because there are no TNs, accuracy reduces to the Jaccard measure TP/(TP+FP+FN).
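As a check on these definitions, the following sketch recomputes the end boundary statistics from the counts reported above (TP=16431, FN=53, FP=58, TN=0). The class PrecisionRecallSketch is a hypothetical helper, not the LingPipe PrecisionRecallEvaluation class; it also computes F(1) as the harmonic mean of precision and recall.

```java
// Hypothetical helper: recomputes the reported statistics from raw counts.
class PrecisionRecallSketch {

    static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }

    static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }

    static double accuracy(long tp, long fp, long fn, long tn) {
        return (double) (tp + tn) / (tp + fp + fn + tn);
    }

    // F(1) is the harmonic mean of precision and recall.
    static double f1(long tp, long fp, long fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2 * p * r / (p + r);
    }

    public static void main(String[] args) {
        long tp = 16431, fn = 53, fp = 58, tn = 0; // end boundary counts from the run above
        System.out.println("Accuracy=" + accuracy(tp, fp, fn, tn));
        System.out.println("Recall=" + recall(tp, fn));
        System.out.println("Precision=" + precision(tp, fp));
        System.out.println("F(1)=" + f1(tp, fp, fn));
    }
}
```

The printed values should agree with the Accuracy, Recall, Precision, and F(1) lines of the end boundary report, up to floating-point rounding in the last digits.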
Note: The evaluate ant task assumes that the GENIA corpus
has been downloaded per instructions above, and that the files
"GENIAcorpus3.02.xml" and "gpml.dtd" are in the
lingpipe/demos/data directory.
If either of these files is missing, the task will fail with a java.io.FileNotFoundException.
The SentenceModelEvaluator program is straightforward.
First we create a SentenceChunker (as we did in the
SentenceChunkerDemo.java program in section 1.2, above),
and pass it in to the SentenceEvaluator constructor:
TokenizerFactory tokenizerFactory
= new IndoEuropeanTokenizerFactory();
SentenceModel sentenceModel
= new MedlineSentenceModel();
SentenceChunker sentenceChunker
= new SentenceChunker(tokenizerFactory,sentenceModel);
SentenceEvaluator sentenceEvaluator
= new SentenceEvaluator(sentenceChunker);
Then we create a GeniaSentenceParser, and set the
SentenceEvaluator as its handler:
GeniaSentenceParser parser
= new GeniaSentenceParser(sentenceEvaluator);
parser.setHandler(sentenceEvaluator);
The name of the GENIA XML corpus file is passed in as a command line argument
to the program.
As the parser parses the corpus, SentenceEvaluator
adds pairs of reference, response chunkings to the evaluation,
therefore the only call that we need to carry out evaluation
is the call to the parser's parse method:
File inFile = new File(args[0]);
parser.parse(inFile);
Once the file has been parsed, we obtain the results of the evaluation
from the
com.aliasi.sentences.SentenceEvaluation
object that the SentenceEvaluator contains.
Both the chunking evaluation and the sentence end boundary evaluation
use a
com.aliasi.classify.PrecisionRecallEvaluation object
to tally their results.
This class
contains a suite of descriptive statistics for binary classification
tasks.
The toString method returns a formatted representation
of these statistics.
SentenceEvaluation sentenceEvaluation
= sentenceEvaluator.evaluation();
PrecisionRecallEvaluation chunkingStats =
sentenceEvaluation.chunkingEvaluation()
.precisionRecallEvaluation();
System.out.println("Chunking Evaluation statistics");
System.out.println(chunkingStats.toString());
PrecisionRecallEvaluation endBoundaryStats =
sentenceEvaluation.endBoundaryEvaluation();
System.out.println("Sentence Evaluation end boundary statistics");
System.out.println(endBoundaryStats.toString());
The errors made by the sentence model are written to two files:
EvaluatorFalseNegatives.txt
and
EvaluatorFalsePositives.txt.
EvaluatorFalseNegatives.txt contains
a listing of sentences in the reference set (GENIA corpus) which
are not in the response set (the sentence chunking returned
by the MEDLINE sentence model), i.e. these are the sentences where
the sentence model missed an end boundary.
Here is an excerpt from this output file:
17. n has been termed "A/R tolerance." Exposing HUVECs to A/R induces an
18. d by electromobility shift assays. alpha 4 beta 1 ligation alone had
19. oetic cell lines containing Oct2,. CRISP-3 is pre-B cell-specific, N
20. ated by [3H]dexamethasone binding. Serum cortisol and urinary free c
21. ed lower calcemic effects in vivo. Large or polar substitutions on C
22. pha B, encoding the alpha subunit. alpha B is the mouse homologue of
EvaluatorFalsePositives.txt contains sentences in the
response chunking which are not in the reference chunking, i.e.
these are chunks where the sentence model incorrectly identified
a token as a sentence-final token.
Here is an excerpt from this output file:
20. ited neutrophil influx into the E. histolytica-infected intestinal x
21. d demonstrate a lower effect of C. Sub. on Ca2+ transport. Finally,
22. tous bioinactivation mechanism. 4. Fluorescence HPLC showed that SMX
23. d upstream of Cp resulted in a ca. two- to fivefold reduction in Cp
24. p65, prompt rapid apoptosis of T. parva-transformed T cells. Our fi
25. of exposure to MTBE or benzene. 3. Peripheral blood lymphocytes (PBL
26. signal transduction pathways in C. pneumoniae-infected endothelial c
This output is generated by iterating over the set of false negatives and false positives
returned by the SentenceEvaluation object.
The members of this set are
com.aliasi.chunk.ChunkAndCharSeq objects.
A ChunkAndCharSeq object is a composite, containing
a Chunk and the character sequence that contains it.
This allows us to examine the start and end points of the sentence in context,
using the spanStartContext and spanEndContext methods:
int i = 0;
Set falseNegatives
= sentenceEvaluation.falseNegativeEndBoundaries();
OutputStream fnFileOut
= new FileOutputStream("EvaluatorFalseNegatives.txt");
PrintStream falseNegOut =
new PrintStream(fnFileOut);
for (Iterator it = falseNegatives.iterator();
it.hasNext(); ++i ) {
ChunkAndCharSeq sentence
= (ChunkAndCharSeq)it.next();
falseNegOut.println(i + ". "
+ sentence.spanEndContext(34));
}
falseNegOut.close();
int j = 0;
Set falsePositives
= sentenceEvaluation.falsePositiveEndBoundaries();
OutputStream fpFileOut
= new FileOutputStream("EvaluatorFalsePositives.txt");
PrintStream falsePosOut =
new PrintStream(fpFileOut);
for (Iterator it = falsePositives.iterator();
it.hasNext(); ++j ) {
ChunkAndCharSeq sentence
= (ChunkAndCharSeq)it.next();
falsePosOut.println(j + ". "
+ sentence.spanEndContext(34));
}
falsePosOut.close();
These files are mainly of interest to the model developer who wishes to identify the kinds of errors made by the sentence model.
Developing and Tuning Sentence Models
In this section we show how to develop and tune a sentence model,
again using the GENIA corpus as a gold standard. The source code for
this demo contains a class DemoSentenceModel.java.
The reader is encouraged to try successive modifications to the
DemoSentenceModel program, and to use the
SentenceModelEvaluator.java program to assess the impact
of these changes on the model's performance.
Like the MedlineSentenceModel, the DemoSentenceModel extends the
com.aliasi.sentences.HeuristicSentenceModel class.
A HeuristicSentenceModel determines sentence
boundaries based on sets of tokens, a pair of flags, and an
overridable method describing boundary conditions, the
boundaryIndices method. The gist of the
HeuristicSentenceModel.boundaryIndices algorithm is
that sentence boundaries are identified by looking at a token together
with the tokens which precede and follow it. If a token is a
sentence-final token, then the sentence boundary is the index of the
character one past the last character in that token. In order for a
token to be a sentence-final token, it must be a member of the set of
sentence-final punctuation tokens, such as periods (.)
and question marks (?). Furthermore, it must be followed
by whitespace, and the following token (if any) must be a legal start
token for a sentence. Sentences containing abbreviations such as
"Mr. Smith" are problematic because a simplistic sentence
model will treat the period following "Mr." as a
sentence-final token. Therefore it is necessary to check the
penultimate token in the sentence, and disallow common abbreviations.
The heuristic sentence model uses three sets of tokens:
- Possible Stops: These are tokens that are allowed to be the final token in a sentence.
- Impossible Penultimates: These are tokens that may not be the penultimate (second-to-last) token in a sentence. This set is typically made up of abbreviations or acronyms such as "Mr".
- Impossible Starts: These are tokens that may not be the first token in a sentence. This set typically includes punctuation characters that should be attached to the previous sentence such as end quotes ('').
A further condition is imposed on sentence initial tokens by method
possibleStart(String[],String[],int,int). This method
checks a given token in a sequence of tokens and whitespaces to
determine if it is a possible sentence start.
There are also two flags in the constructor that determine aspects of sentence boundary detection:
- Force Final Boundary: If this flag is set to true, the final token in any input is taken to be a sentence terminator, whether or not it is a possible stop token. This is useful for dealing with truncated inputs, such as those in MEDLINE abstracts.
- Balance Parentheses: If parentheses are being balanced, then as long as there are open parentheses that have not been closed, the current sentence may not end. Square brackets ("[", "]") and round brackets ("(", ")") are balanced separately, so that a close square bracket doesn't close an open paren, and vice versa. The heuristic sentence model doesn't keep track of nested parentheses: the first close paren following any number of open parens closes all parens, and any extra close parentheses (")") and brackets ("]") are ignored. This approach avoids the pitfall of missing all sentence boundaries past a missing close paren if only one close paren is used to close multiple open parens.
The initial version of the DemoSentenceModel
defines minimal sets of possible stops, impossible penultimates,
and impossible starts, and
doesn't override any methods in HeuristicSentenceModel.
Here is its constructor:
public DemoSentenceModel() {
super(POSSIBLE_STOPS,
IMPOSSIBLE_PENULTIMATES,
IMPOSSIBLE_SENTENCE_STARTS,
false, // force final stop
false); // balance parens
}
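To make the roles of the three token sets and the force-final flag concrete, here is a much simplified sketch of the heuristic. ToyHeuristicModel is hypothetical and is not LingPipe's implementation: it omits parenthesis balancing and the possibleStart hook, and its token sets are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified, hypothetical model; it is NOT LingPipe's HeuristicSentenceModel.
class ToyHeuristicModel {
    private final Set<String> possibleStops;
    private final Set<String> impossiblePenultimates;
    private final Set<String> impossibleStarts;
    private final boolean forceFinalStop;

    ToyHeuristicModel(Set<String> possibleStops,
                      Set<String> impossiblePenultimates,
                      Set<String> impossibleStarts,
                      boolean forceFinalStop) {
        this.possibleStops = possibleStops;
        this.impossiblePenultimates = impossiblePenultimates;
        this.impossibleStarts = impossibleStarts;
        this.forceFinalStop = forceFinalStop;
    }

    // whites[i] precedes tokens[i]; whites[tokens.length] follows the last token.
    int[] boundaryIndices(String[] tokens, String[] whites) {
        List<Integer> found = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            boolean isLast = (i == tokens.length - 1);
            if (isLast && forceFinalStop) {
                found.add(i); // truncated input still ends a sentence
                continue;
            }
            if (!possibleStops.contains(tokens[i])) continue;
            if (i > 0 && impossiblePenultimates.contains(tokens[i - 1])) continue;
            if (!isLast) {
                if (whites[i + 1].isEmpty()) continue;                  // needs following whitespace
                if (impossibleStarts.contains(tokens[i + 1])) continue; // next token must be a legal start
            }
            found.add(i);
        }
        int[] result = new int[found.size()];
        for (int k = 0; k < result.length; k++) {
            result[k] = found.get(k);
        }
        return result;
    }

    // Builds a model with small, made-up token sets.
    static ToyHeuristicModel demo(boolean forceFinalStop) {
        return new ToyHeuristicModel(
            new HashSet<>(Arrays.asList(".", "?", "!")),
            new HashSet<>(Arrays.asList("Mr", "Dr")),
            new HashSet<>(Arrays.asList("''", ")")),
            forceFinalStop);
    }

    public static void main(String[] args) {
        // "Mr. Smith ran. He left" (truncated: no final period)
        String[] tokens = {"Mr", ".", "Smith", "ran", ".", "He", "left"};
        String[] whites = {"", "", " ", " ", "", " ", " ", ""};
        System.out.println(Arrays.toString(demo(false).boundaryIndices(tokens, whites)));
        System.out.println(Arrays.toString(demo(true).boundaryIndices(tokens, whites)));
    }
}
```

With the flag off, only the period after "ran" (token index 4) is reported, because the period after "Mr" is blocked by the impossible-penultimates set and the truncated input has no final stop. With the flag on, the last token (index 6) is reported as well.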
To evaluate the performance of the DemoSentenceModel
we change the SentenceModelEvaluator
to use the DemoSentenceModel instead (at line 36):
SentenceModel sentenceModel = new DemoSentenceModel();
Then we once again execute the Ant target evaluate:
> ant evaluate
evaluate:
(...)
Sentence Evaluation end boundary statistics
Total=16620
True Positive=16344
False Negative=140
False Positive=136
True Negative=0
Positive Reference=16484
Positive Response=16480
Negative Reference=136
Negative Response=140
Accuracy=0.9833935018050541
Recall=0.9915069157971366
Precision=0.9917475728155339
Rejection Recall=0.0
Rejection Precision=0.0
F(1)=0.9916272297051328
(...)
This model performs quite well, with an F-measure above 99% and accuracy above 98%.
The number of false positives and false negatives is markedly higher than the
corresponding numbers for the MEDLINE sentence model, therefore we examine
the EvaluatorFalseNegatives.txt
and EvaluatorFalsePositives.txt output files.
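Conceptually, the two files list the two ways in which the model's boundaries (the response) can disagree with the gold standard (the reference). The following is a minimal sketch of that comparison, assuming boundaries are represented as sets of token indices. The class and method names are hypothetical and this is not the evaluator's actual code.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class BoundaryDiffSketch {

    // Boundaries in the reference that the model failed to propose.
    public static Set<Integer> falseNegatives(Set<Integer> reference, Set<Integer> response) {
        Set<Integer> missed = new TreeSet<>(reference);
        missed.removeAll(response);
        return missed;
    }

    // Boundaries the model proposed that are not in the reference.
    public static Set<Integer> falsePositives(Set<Integer> reference, Set<Integer> response) {
        Set<Integer> spurious = new TreeSet<>(response);
        spurious.removeAll(reference);
        return spurious;
    }

    public static void main(String[] args) {
        // Made-up indices of sentence-final tokens, for illustration only.
        Set<Integer> reference = new TreeSet<>(Arrays.asList(5, 12, 20));
        Set<Integer> response  = new TreeSet<>(Arrays.asList(5, 9, 20));
        System.out.println("False negatives: " + falseNegatives(reference, response));
        System.out.println("False positives: " + falsePositives(reference, response));
    }
}
```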
Here are the first 20 false negatives (sentence boundaries that the
DemoSentenceModel failed to identify):
0. but not IL-5-nonproducing clones. pIL-5(-511)Luc was transcribed by
1. stages of B-cell differentiation. mBob1 interacts with the octamer
2. omains of phospholipase C gamma 1. p38 also forms a complex with the
3. f this positive regulatory element
4. h PKC-dependent signaling systems. gamma B*CaM-K and delta CaM-AI, k
5. (c))-like molecule, IL-13R alpha1. mRNA levels for IL-13R alpha1, bu
6. -jun expression was not modulated. c-myc mRNA expression, constituti
7. the nuclear translocation signals. mNFATc complexed with AP-1 bound
8. regulating cell-cycle progression. p27Kip1 directly inhibits the cat
9. aberrant retinoic acid metabolism
10. f the lytic regulatory gene BZLF 1
11. nally differentiated myeloid cells
12. merized with phosphorylated c-Jun. c-Jun protein isolated from phorb
13. es integration of opposing signals
14. c.beta differs from that of NFATc. alpha in the first NH2-terminal 2
15. ted, regardless of disease status. hLH-2 was mapped to chromosome 9Q
16. (ABSTRACT TRUNCATED AT 250 WORDS)
17. (ABSTRACT TRUNCATED AT 250 WORDS)
18. (ABSTRACT TRUNCATED AT 250 WORDS)
19. o cortisol resistance in monocytes
20. th premature aging syndromes (Down
Roughly half of the above entries arise because the MEDLINE abstracts
are sometimes truncated, and these truncated abstracts don't end with
proper punctuation. In the GENIA corpus, these are labeled as
sentences. To handle this, we change the constructor, setting the
forceFinalStop argument of the superclass's constructor
to true:
public DemoSentenceModel() {
    super(POSSIBLE_STOPS,
          IMPOSSIBLE_PENULTIMATES,
          IMPOSSIBLE_SENTENCE_STARTS,
          true,   // force final stop
          false); // balance parens
}
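The effect of the force-final-stop flag can be illustrated in isolation. The sketch below is a drastically simplified stand-in for a heuristic model: it treats every stop token as a boundary and ignores the penultimate and sentence-start checks. The class name, the method, and the one-element stop set are all assumptions made for illustration, not the contents of DemoSentenceModel or the LingPipe implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

public class ForceFinalStopSketch {

    // Hypothetical stop set; the real sets are defined in DemoSentenceModel.
    static final Set<String> STOPS = Collections.singleton(".");

    // Returns indices of sentence-final tokens under the simplified rule.
    public static List<Integer> boundaries(String[] tokens, boolean forceFinalStop) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (STOPS.contains(tokens[i])) {
                result.add(i);
            }
        }
        int last = tokens.length - 1;
        // With the flag set, the final token always ends a sentence,
        // whether or not it is a possible stop.
        if (forceFinalStop && last >= 0 && !result.contains(last)) {
            result.add(last);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = {"Results", "were", "mixed", ".",
                           "(", "ABSTRACT", "TRUNCATED", ")"};
        System.out.println(boundaries(tokens, false)); // [3]
        System.out.println(boundaries(tokens, true));  // [3, 7]
    }
}
```

Without the flag, the truncation notice at the end of the input is never closed off as a sentence, which corresponds to the kind of false negative seen in the truncated entries above.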
Then we once again execute the Ant target evaluate:
> ant evaluate
evaluate:
(...)
Sentence Evaluation end boundary statistics
Total=16620
True Positive=16409
False Negative=75
False Positive=136
True Negative=0
Positive Reference=16484
Positive Response=16545
Negative Reference=136
Negative Response=75
Accuracy=0.9873044524669073
Recall=0.9954501334627518
Precision=0.9917799939558779
Rejection Recall=0.0
Rejection Precision=0.0
F(1)=0.9936116745889976
(...)
This change cuts the number of false negatives from 140 to 75.
The number of false positives remains unchanged.
Now we look at the entries in the file
EvaluatorFalsePositives.txt.
These are places where the DemoSentenceModel
mistakenly identified punctuation as a sentence boundary.
Here are the first 20 entries:
0. J., Hinrichs, S. H., Reynolds, R. K., Luciw, P. A., and Jay, G. (19
1. These suggest that prolonged, i.e. 28 day, glucocorticoid therapy ma
2. 20%, respectively in monocytes. 2. Danazol did not alter the degrada
3. of exposure to MTBE or benzene. 3. Peripheral blood lymphocytes (PBL
4. cocorticoids (Cushing's syndrome). Type II corticosteroid receptors
5. .-M.Chen, and D.G.Tenen, Mol.Cell. Biol.14:373-381, 1994). Here we r
6. on of inflammatory cytokine genes. Several other transcription facto
7. f CD19 cross-linking in 1E8 cells. Supershift experiments revealed t
8. 4.0 +/- 0.31 and 4.1 +/- 0.34 vs. 2.9 +/- 0.29 nmol/L, p < .001) an
9. zed an epitope mapping within E1B. When inoculated twice with Ad vec
10. 0.5 (0.2-1.6) fmol/10(7) cells vs. 2.3 +/- 0.9 (1.1-4.4) fmol/10(7)
11. lated at the S-G2/M boundaries. 5. One of the signaling molecules wh
12. ation of GABP factors (E.Flory, A. Hoffmeyer, U.Smola, U.R.Rapp, and
13. 1. Administration of danazol for ove
14. d significantly (1.73 +/- 0.08 vs. 1.16 +/- 0.09 arbitrary units, P
15. the pol gene (E. Verdin, J. Virol. 65:6790-6799, 1991). In the prese
16. genome [Noteborn et al., J. Virol. 65 (1991) 3131-3139] of chicken a
17. y recognized in vitro by donor (D. E.) CD4 T cells in a HLA class II
18. d with the pol gene (E. Verdin, J. Virol. 65:6790-6799, 1991). In th
19. formed cell line from the patient. These observations indicate that
20. S. H., Reynolds, R. K., Luciw, P. A., and Jay, G. (1988) Nature 335
Entry #15 is typical of many of these errors. MEDLINE abstracts frequently contain
citations to other journal articles.
These citations contain many abbreviations,
both of names and of journal titles, and the periods are mistakenly identified as
end-of-sentence markers.
Since these citations are almost always set off by parentheses or brackets,
using the parenthesis balancing feature of the HeuristicSentenceModel
should eliminate most errors of this kind.
Therefore we change the DemoSentenceModel
constructor again, this time to:
public DemoSentenceModel() {
    super(POSSIBLE_STOPS,
          IMPOSSIBLE_PENULTIMATES,
          IMPOSSIBLE_SENTENCE_STARTS,
          true,  // force final stop
          true); // balance parens
}
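The balancing rule enabled by the last constructor argument can be sketched as follows. This is an illustration of the rule as described earlier (no nesting, round and square brackets tracked separately, extra closers ignored), not the HeuristicSentenceModel source. The class and method names are hypothetical, and the token array is a shortened, hand-tokenized version of the citation in entry #15.

```java
public class ParenBalanceSketch {

    // For each token, reports whether a sentence would be allowed to end there,
    // i.e. whether no round or square bracket is open after reading it.
    public static boolean[] mayEndSentence(String[] tokens) {
        boolean roundOpen = false;   // some "(" has been seen and not yet closed
        boolean squareOpen = false;  // some "[" has been seen and not yet closed
        boolean[] ok = new boolean[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            String t = tokens[i];
            if (t.equals("(")) {
                roundOpen = true;
            } else if (t.equals(")")) {
                roundOpen = false;   // one ")" closes all open "("; an extra ")" is a no-op
            } else if (t.equals("[")) {
                squareOpen = true;
            } else if (t.equals("]")) {
                squareOpen = false;  // same rule for square brackets
            }
            ok[i] = !roundOpen && !squareOpen;
        }
        return ok;
    }

    public static void main(String[] args) {
        String[] tokens = {"gene", "(", "E", ".", "Verdin", ",", "J", ".",
                           "Virol", ".", "1991", ")", ".", "In"};
        boolean[] ok = mayEndSentence(tokens);
        for (int i = 0; i < tokens.length; i++) {
            System.out.println(i + "\t" + tokens[i] + "\t" + ok[i]);
        }
    }
}
```

In this example the periods at indices 3, 7 and 9 fall inside the open parenthesis and so cannot end a sentence, while the period at index 12, after the close paren, can.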
Then we once again execute the Ant target evaluate:
> ant evaluate
evaluate:
(...)
Sentence Evaluation end boundary statistics
Total=16538
True Positive=16407
False Negative=77
False Positive=54
True Negative=0
Positive Reference=16484
Positive Response=16461
Negative Reference=54
Negative Response=77
Accuracy=0.9920788487120571
Recall=0.9953288036884251
Precision=0.9967195188627666
Rejection Recall=0.0
Rejection Precision=0.0
F(1)=0.9960236758233421
(...)
This change cuts the number of false positives from 136 to 54. The number of false negatives increases from 75 to 77. Overall the accuracy of the model is improved, so we keep this change in place.
Once again we consider the remaining false negatives in
EvaluatorFalseNegatives.txt.
Here are the first 20 entries:
0. he specific 'pre-activation', i.e. constitutive nuclear translocatio
1. cal use of steroid receptor drugs. --Vegeto, E., Pollio, G., Pellicc
2. n has been termed "A/R tolerance." Exposing HUVECs to A/R induces an
3. rotein expression in FDC clusters. p65 was detected in the cytoplasm
4. -jun expression was not modulated. c-myc mRNA expression, constituti
5. n 2 (IL-2) stimulates IL-2R alpha. transcription, thereby amplifying
6. stages of B-cell differentiation. mBob1 interacts with the octamer
7. ly independent of LEF/TCF factors. beta-Catenin and LEF-1 complexes
8. of the heme biosynthetic pathway. cDNA clones for the human erythro
9. 57.1 dpm mg-1 cytosol protein vs. 227.0 +/- 90.8 dpm mg-1 cytosol p
10. (c))-like molecule, IL-13R alpha1. mRNA levels for IL-13R alpha1, bu
11. distinct genes of the Rel family. p50 is translated as a precursor
12. media has been sustained for 3 mo. with culture doubling times of ab
13. o nephrotoxicity and fibrogenesis? How important are the anti-inflam
14. to the full-length protein as p97. p50B is able to form heteromeric
15. d by alpha-interferon (alpha-IFN). alpha-IFN causes dephosphorylatio
16. in collagen-stimulated platelets. p38 and p63 may provide a docking
17. HLA-Cw*0702, while FCS reduced it. beta 2-m increased the binding to
18. lls (macrophages and neutrophils). mRNA for c-fes has been detected
19. p50.p65 heterodimers was observed. p50.c-rel heterodimers were also
20. inase-associated lipocalin (NGAL). ngal gene expression was found at
Entry #3 shows a remaining problem for this model: there are biological names which are never capitalized, such as "p65", "mRNA", "alpha-IFN", or "beta-Catenin", so a possible sentence start cannot be determined on the basis of initial capitalization alone. Examination of these names shows that most of them contain digits or uppercase letters. Many names also contain hyphens, such as "alpha-IFN" and "c-FOS". These names are problematic because the Indo-European tokenizer breaks them into a sequence of three tokens: "c", "-", "FOS". It is therefore necessary to look through the next several tokens following the possible sentence boundary token to determine whether or not what follows is a good sentence start.
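To make the token split concrete, here is a rough stand-in for the tokenizer. It assumes that runs of ASCII letters and digits form one token and that every other non-space character is a token of its own. That rule reproduces the "c", "-", "FOS" split described above, but it is only an approximation written for this illustration, not LingPipe's Indo-European tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenSplitSketch {

    // Assumed rule: an alphanumeric run is one token; any other
    // non-whitespace character stands alone.
    private static final Pattern TOKEN = Pattern.compile("[A-Za-z0-9]+|\\S");

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("c-FOS"));            // [c, -, FOS]
        System.out.println(tokenize("alpha-IFN causes")); // [alpha, -, IFN, causes]
        System.out.println(tokenize("p65 was"));          // [p65, was]
    }
}
```

In "alpha-IFN" the first token with an uppercase letter is the third one, which is why a check of only the first token after the boundary is not enough.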
The MEDLINE sentence model class overrides the method
possibleStart to allow for names like these. The
MedlineSentenceModel.possibleStart
method allows any sequence of contiguous tokens
containing a non-lowercase character to be a good sentence start.
The arguments to this method are the arrays of tokens and whitespace
that the tokenizer produces from the text of the abstract, along with
indices into these arrays that give the region of the tokenization that
needs to be checked for a possible start.
Here is a (slightly simplified) version of this method:
protected boolean possibleStart(String[] tokens,
                                String[] whitespaces,
                                int start, int end) {
    for (int i = start; i < end; i++) {
        // a token with a digit or uppercase letter makes a good start
        if (containsDigitOrUpper(tokens[i]))
            return true;
        // whitespaces[i+1] follows tokens[i]; stop at the first gap
        if (whitespaces[i+1].length() > 0)
            return false;
    }
    return false;
}

private boolean containsDigitOrUpper(String token) {
    int len = token.length();
    for (int i = 0; i < len; i++) {
        if (Character.isUpperCase(token.charAt(i)))
            return true;
        if (Character.isDigit(token.charAt(i)))
            return true;
    }
    return false;
}
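To see how this check behaves, the two methods can be exercised outside the model. The harness below copies them as static methods and feeds them hand-built token and whitespace arrays. The arrays are assumptions about how the text would be tokenized (the whitespace array is one element longer than the token array), and the class name is hypothetical.

```java
public class PossibleStartDemo {

    public static boolean possibleStart(String[] tokens, String[] whitespaces,
                                        int start, int end) {
        for (int i = start; i < end; i++) {
            if (containsDigitOrUpper(tokens[i]))
                return true;
            if (whitespaces[i + 1].length() > 0)
                return false;
        }
        return false;
    }

    private static boolean containsDigitOrUpper(String token) {
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isUpperCase(c) || Character.isDigit(c))
                return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // "alpha-IFN causes": "IFN" is reached before any whitespace
        String[] t1 = {"alpha", "-", "IFN", "causes"};
        String[] w1 = {"", "", "", " ", ""};
        System.out.println(possibleStart(t1, w1, 0, t1.length)); // true

        // "p65 was": the first token contains digits
        String[] t2 = {"p65", "was"};
        String[] w2 = {"", " ", ""};
        System.out.println(possibleStart(t2, w2, 0, t2.length)); // true

        // "transcription, thereby": whitespace arrives first
        String[] t3 = {"transcription", ",", "thereby"};
        String[] w3 = {"", "", " ", ""};
        System.out.println(possibleStart(t3, w3, 0, t3.length)); // false

        // "c-myc mRNA": all lowercase up to the first whitespace
        String[] t4 = {"c", "-", "myc", "mRNA"};
        String[] w4 = {"", "", "", " ", ""};
        System.out.println(possibleStart(t4, w4, 0, t4.length)); // false
    }
}
```

The last case shows a limit of this simplified version: a fully lowercase name such as "c-myc" is still not accepted as a sentence start.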
If we cut and paste these two methods into the DemoSentenceModel
and re-run the evaluation, we get the following results:
> ant evaluate
evaluate:
(...)
Sentence Evaluation end boundary statistics
Total=16539
True Positive=16452
False Negative=32
False Positive=55
True Negative=0
Positive Reference=16484
Positive Response=16507
Negative Reference=55
Negative Response=32
Accuracy=0.9947397061491021
Recall=0.9980587236107741
Precision=0.9966680802083965
Rejection Recall=0.0
Rejection Precision=0.0
F(1)=0.9973629171592253
(...)
This time we have reduced the number of false negatives by more than half, from 77 to 32.
This performance is almost as good as that of the LingPipe MEDLINE sentence model
(reported in section 2 of this tutorial).
The interested reader is encouraged to examine the code of the MedlineSentenceModel class
to see further possible refinements.
At this point we have achieved very high accuracy against the GENIA corpus.
It is not clear how much further tuning of the model will be useful for the general
task of processing the MEDLINE citation index.
The GENIA corpus contains only 2000 MEDLINE abstracts, while the number of abstracts
in the MEDLINE citation index stands at around 10 million. Continuing to tune and
evaluate the DemoSentenceModel against the GENIA corpus runs the
risk of overfitting the model to the data, and might actually detract from overall
accuracy when processing new data.
Therefore we conclude this tutorial here.
References
- GENIA Project Home Page
- http://www.nlm.nih.gov/databases/dtd/medsamp2006.xml: The location of the sample file on the NLM site.
- How to License MEDLINE Data; it's free for research and most commercial purposes.