What is Named Entity Recognition?
Named entity recognition (NER) is the process of finding mentions of specified things in running text.
News Entities: People, Locations and Organizations
For instance, a simple news named-entity recognizer for English might find the person mention John J. Smith and the location mention Seattle in the text John J. Smith lives in Seattle.
Biomedical Entities: Genes, Organisms, Malignancies, Chemicals, ...
There are a range of biomedical corpora available to find mentions of various entities. For instance, human fibrinogen and thrombin are mentions of genes in the sentence Native human fibrinogen was brought to coagulation by adding thrombin.
How does LingPipe Recognize Entities?
Like many of the other modules in LingPipe, named entity
recognition involves the supervised training of a statistical model or
more direct methods like dictionary matching or regular expression
matching. All these methods are designed to work together smoothly by
using the same class for annotating the text
com.aliasi.chunk.
In the statistical models, the training data must be labeled with all of the entities of interest and their types. Furthermore, the training data should match the data on which the system will be run. If a system is trained on well-edited newswire data and is then run on blogs, performance will degrade, as it will be tuned to the clues provided by newswire text (e.g. the use of honorifics such as Mr.) and will miss the clues provided in blogs (e.g. e-mail links).
Rule-Based Named Entity Detection
Named entity recognition is particularly easy if it's possible to write a regular expression that captures the intended pattern of entities.
We'll write an entity detector that finds well-formed email addresses. The format of emails is explained, with references to the standards, in:
Briefly, an e-mail address of the form
lingpipe@alias-i.com consists of two parts, the local
part, lingpipe and the domain name,
alias-i.com. These are separated by the at-sign
(@).
The first step is writing a regular expression that matches all and only e-mail addresses. Rather than writing a fully specification-compliant email recognizer, we'll borrown one from:
- Email Matching RegExes at (from
regexlib.com)
The following example, contributed by Bilou McGyver, will work for example purposes:
[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})
Basically, the regular expression is the only substantial part of the
implementation. The complete chunker is in
the file src/EmailRegExChunker.java, which we repeat here:
public class EmailRegexChunker extends RegExChunker {
public EmailRegexChunker() {
super(EMAIL_REGEX,CHUNK_TYPE,CHUNK_SCORE);
}
private final static String EMAIL_REGEX
= "[A-Za-z0-9](([_\\.\\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\\.\\-]?[a-zA-Z0-9]+)*)\\.([A-Za-z]{2,})";
private final static String CHUNK_TYPE = "email";
private final static double CHUNK_SCORE = 0.0;
}
There is also a main method in that class for testing, which
simply applies the chunker to the command-line arguments.
public static void main(String[] args) {
Chunker chunker = new EmailRegExChunker();
for (int i = 0; i < args.length; ++i) {
Chunking chunking = chunker.chunk(args[i]);
System.out.println("input=" + args[0]);
System.out.println("chunking=" + chunking);
Set<Chunk> chunkSet = chunking.chunkSet();
Iterator<Chunk> it = chunkSet.iterator();
while (it.hasNext()) {
Chunk chunk = it.next();
int start = chunk.start();
int end = chunk.end();
String text = args[0].substring(start,end);
System.out.println(" chunk=" + chunk + " text=" + text);
}
}
}
This code first creates the chunker, then analyzes each of the arguments using the chunker. It does this by pulling out the chunking, then pulling the set of chunks out of the chunking, then printing each of these chunks.
The sample we'll use is the following (with positional information displayed below the actual input text):
John's email is john@his.company.com and his friend's is foo.bar@123.foo.ca. 0123456789012345678901234567890123456789012345678901234567890123456789012345 0 1 2 3 4 5 6 7
There is an ant task for running it, the arguments to which are the text that will be analyzed.
% ant email-chunker
...
input=John's email is john@his.company.com and his friend's is foo.bar@123.foo.ca.
chunking=John's email is john@his.company.com and his friend's is foo.bar@123.foo.ca. : [16-36:email@0.0, 57-75:email@0.0]
chunk=16-36:email@0.0 text=john@his.company.com
chunk=57-75:email@0.0 text=foo.bar@123.foo.ca
The output reproduces the input string, then lists the chunks
that were extracted. The first chunk is 16-36:email@0.0,
which indicates there is an email starting at position 16
and ending at position 35 (we use the standard half-open
notation of first character to one plus the last character),
yielding the entity john@his.company.com. The
second entity is foo.bar@123.foo.ca (note the
final period is not part of the email).
Exact Dictionary-Based Chunking
In many applications, it is relatively straightforward to compile a list of names (and aliases) for entities and their types. For instance, westlaw.com has lists of all the registered lawyers in the United States. Such lists are extremely helpful for finding mentions of lawyers in legal documents such as case law or court transcripts. A baseball site such as mlb.com has a list of all current and retired baseball players to work from.
Finding Names in a Dictionary
Sometimes, it is enough to simply find all the mentions of phrases in a dictionary in a text. (See the sidebar for an example.)
A name like "50 Cent" or product name like "xyx120 dvd player" can prove to be very difficult to find in text with statistical recognizers or generic rule-based ones. Sometimes a dictionary may be used in conjunction with a statistical recognizer in order to improve recall for "difficult" names. Names tend to be difficult if they're easily confusible as to type, or if they are confusible with common words.
Another strategy that may be used is to run statistical named entity recognition on a corpus and then use the output to build a dictionary that may be used to reannotate the corpus for chunks that may have been missed in the first pass.
The naive approach of doing a disjunctive regular expression or a substring search quickly falls over in the face of even modestly sized dictionaries. LingPipe provides an implementation of the Aho-Corasick algorithm. The beauty of the Aho-Corasick algorithm is that it finds all matches against a dictionary in linear time independent of the number of matches or size of the dictionary.
Demo Code
The code for the demo is at src/DictionaryChunker.java. The first step of the code simply creates a dictionary from some
hard-coded entries:
static final double CHUNK_SCORE = 1.0;
public static void main(String[] args) {
MapDictionary<String> dictionary = new MapDictionary<String>();
dictionary.addEntry(new DictionaryEntry<String>("50 Cent","PERSON",CHUNK_SCORE));
dictionary.addEntry(new DictionaryEntry<String>("XYZ120 DVD Player","DB_ID_1232",CHUNK_SCORE));
dictionary.addEntry(new DictionaryEntry<String>("cent","MONETARY_UNIT",CHUNK_SCORE));
dictionary.addEntry(new DictionaryEntry<String>("dvd player","PRODUCT",CHUNK_SCORE));
Note that in the DictionaryEntry constructor, the
first argument is the phrase, the second string argument is the type,
and the final double-precision floating point argument is the score
for the chunk. Dictionary entries are themselves always case
sensitive.
We set "DB_ID_1232" as the type for the phrase "XYZ120 DVD Player". This is a simple way to link database IDs to text mentions, but beware of ambiguity. Not all "John Smith"s are the same (see the sidebar on Times Topics). There is no limit to the number of different entity types in a dictionary. The scores will simply be passed along as chunk scores in the dictionary-based chunker.
The next step in the code builds the dictionary-based chunker:
ExactDictionaryChunker dictionaryChunkerTT
= new ExactDictionaryChunker(dictionary,
IndoEuropeanTokenizerFactory.INSTANCE,
true,true);
Note that the dictionary chunker is constructed from a dictionary, a tokenizer factory, and two flags. The tokenizer factory we pass in is an instance of the IndoEuropean tokenizer factory. Because our matching treats tokens as symbols, the chunker needs a tokenizer for matching. Whitespace will be ignored in the matching process. The first flag indicates whether or not all matches are found, even in cases where they overlap. The second flag indicates whether or not the chunker is case sensitive. Thus the dictionary chunker constructed above will find all chunks and will be case sensitive. Three other chunkers covering all the flag settings are constructed next:
ExactDictionaryChunker dictionaryChunkerTF
= new ExactDictionaryChunker(dictionary,
IndoEuropeanTokenizerFactory.INSTANCE,
true,false);
ExactDictionaryChunker dictionaryChunkerFT
= new ExactDictionaryChunker(dictionary,
IndoEuropeanTokenizerFactory.INSTANCE,
false,true);
ExactDictionaryChunker dictionaryChunkerFF
= new ExactDictionaryChunker(dictionary,
IndoEuropeanTokenizerFactory.INSTANCE,
false,false);
After the chunkers are created, the rest of the code simply runs through the command-line arguments and calls each chunker on the resulting text in turn.
for (int i = 0; i < args.length; ++i) {
String text = args[i];
chunk(dictionaryChunkerTT,text);
chunk(dictionaryChunkerTF,text);
chunk(dictionaryChunkerFT,text);
chunk(dictionaryChunkerFF,text);
}
The chunk() method simply applies the chunker
to the text and prints out the result.
static void chunk(ExactDictionaryChunker chunker, String text) {
Chunking chunking = chunker.chunk(text);
for (Chunk chunk : chunking.chunkSet()) {
int start = chunk.start();
int end = chunk.end();
String type = chunk.type();
double score = chunk.score();
String phrase = text.substring(start,end);
System.out.println(" phrase=|" + phrase + "|"
+ " start=" + start
+ " end=" + end
+ " type=" + type
+ " score=" + score);
}
}
The example may be run using a pre-package Ant target,
which we indicate below along with the results:
% cd $LINGPIPE/demos/tutorial/ne
% ant dictionary-chunker
DICTIONARY
[cent:MONETARY_UNIT 1.0, 50 Cent:PERSON 1.0, XYZ120 DVD Player:DB_ID_1232 1.0, dvd player:PRODUCT 1
TEXT=50 Cent is hard to distinguish from 50 cent and just plain cent without case
Chunker. All matches=true Case sensitive=true
phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
phrase=|cent| start=39 end=43 type=MONETARY_UNIT score=1.0
phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0
Chunker. All matches=true Case sensitive=false
phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
phrase=|Cent| start=3 end=7 type=MONETARY_UNIT score=1.0
phrase=|50 cent| start=36 end=43 type=PERSON score=1.0
phrase=|cent| start=39 end=43 type=MONETARY_UNIT score=1.0
phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0
Chunker. All matches=false Case sensitive=true
phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
phrase=|cent| start=39 end=43 type=MONETARY_UNIT score=1.0
phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0
Chunker. All matches=false Case sensitive=false
phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
phrase=|50 cent| start=36 end=43 type=PERSON score=1.0
phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0
TEXT=The product xyz120 DVD player won't match unless it's exact like XYZ120 DVD Player.
Chunker. All matches=true Case sensitive=true
phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0
Chunker. All matches=true Case sensitive=false
phrase=|xyz120 DVD player| start=12 end=29 type=DB_ID_1232 score=1.0
phrase=|DVD player| start=19 end=29 type=PRODUCT score=1.0
phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0
phrase=|DVD Player| start=72 end=82 type=PRODUCT score=1.0
Chunker. All matches=false Case sensitive=true
phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0
Chunker. All matches=false Case sensitive=false
phrase=|xyz120 DVD player| start=12 end=29 type=DB_ID_1232 score=1.0
phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0
Take a moment to compare the different outputs based on the constructor flags. When case sensitivity is enabled, the dictionary entry "50 Cent" does not match the text "50 cent", When all matches are enabled without case sensitivity (the second example), the dictionary entry "cent" matches the second token in "50 Cent". Note that in this case, "50 Cent" and "Cent" are pulled out of the same text, with character spans (0,7) and (3,7) respectively.
Approximate Dictionary-Based Chunking
The class dict.ApproxDictionaryChunker
implements a dictionary-based chunker that performs approximate
matching. An approximate dictionary chunker is populated with
a dictionary, the same way as an exact dictionary chunker. But
when it is run, it doesn't just look for exact matches, but for
all matches within a fixed weighted edit distance threshold.
Running the Demo
The approximate dictionary matching demo may be invoked
from ant using the target approx-chunker:
% ant approx-chunker
Dictionary=[Mdm:Mdm 1.0, P53:P53 1.0, protein 53:P53 1.0]
A protein called Mdm2 binds to p53 and transports it from the nucleus to the cytosol.
Matched Phrase Dict Entry Distance
p53 P53 1.0
Mdm2 Mdm 1.0
p53, also known as protein 53 (TP53), functions as a tumor supressor.
Matched Phrase Dict Entry Distance
53 P53 1.0
p53, P53 2.0
p53 P53 1.0
(TP53 P53 2.0
protein 53 P53 0.0
TP53 P53 1.0
TP53) P53 2.0
protein 53 ( P53 2.0
The output first prints out the dictionary, in this case
consisting of three phrasal entries,
"Mdm",
"P53", and
"protein 53",
with lexical entries consisting of the strings Mdm,
P53 and P53 respecitvely.
The numerical values are defaults, as dictionary entries
are weighted in the interface.
Next comes the analysis of two sample sentences. Each
is first printed, then all of the approximately matching substrings,
along with their distances. Closer (lower distance) values are
better matches. For instance, the dictionary phrase
"p53" matched the substrings
"53",
"p53,", and
"p53" in the first four characters of the
input, with distances 1.0, 2.0, and 1.0 respectively.
Code Walk Through
The code for the demo is in in a single main() method
defined in src/ApproximateChunkerDemo.java.
It starts by setting up a trie-based dictionary:
public static void main(String[] args) {
DictionaryEntry entry1
= new DictionaryEntry<String>("P53", "P53");
DictionaryEntry entry2
= new DictionaryEntry<String>("protein 53", "P53");
DictionaryEntry entry3
= new DictionaryEntry<String>("Mdm","Mdm");
TrieDictionary<String> dict = new TrieDictionary<String>();
dict.addEntry(entry1);
dict.addEntry(entry2);
dict.addEntry(entry3);
...
Next, the chunker is created from a tokenizer factory and a weighted edit distance:
TokenizerFactory tokenizerFactory
= IndoEuropeanTokenizerFactory.INSTANCE;
WeightedEditDistance editDistance
= new FixedWeightEditDistance(0,-1,-1,-1,Double.NaN);
double maxDistance = 2.0;
ApproxDictionaryChunker chunker
= new ApproxDictionaryChunker(dict,tokenizerFactory,
editDistance,maxDistance);
The edit distance assigns a cost of 0 to matches, a cost of -1 to insertions, deletions and substitutions, and does not specify a transposition cost. Transpositions are not available as part of the approximate dictionary chunker. These values may be changed in the fixed weight edit distance implementation, or a custom edit distance may be implemented.
The chunker itself is constructed with the dictionary, which must be a trie-backed dictionary, a tokenizer factory, a weighted edit distance, and critically, a maximum distance at which results are returned.
The tokenizer factory only controls possible chunk boundaries. All
chunks must start at the first character of a token and must end at
the last character of a token (offset by one, of course). By using an
instance of tokenizer.CharacterTokenizerFactory,
the token-sensitivity will be turned off. Note that even with
the character tokenizer factory, chunks will still not start
and end on whitespace.
Finallly, the maximum distance parameter controls the maximum edit distance a returned chunk can be from a dictionary entry. In the case here, we set this at 2, meaning we get all matches within edit distance 2 of entries in the dictionary.
The rest of the code simply applies chunking and prints out the results:
for (String text : args) {
Chunking chunking = chunker.chunk(text);
CharSequence cs = chunking.charSequence();
Set<Chunk> chunkSet = chunking.chunkSet();
for (Chunk chunk : chunkSet) {
int start = chunk.start();
int end = chunk.end();
CharSequence str = cs.subSequence(start,end);
double distance = chunk.score();
String match = chunk.type();
System.out.printf("%15s %15s %8.1f\n",
str, match, distance);
}
}
As with the other chunkers, we simply chunk the text, extract the underlying character sequence and set of chunks, then iterate over them extracting their start and end points, distances (which are scores in the chunk interface), and the matched dictionary entry, which is encoded on the chunk type.
In some applications, a non-overlapping set of results is required. This can be done by eliminating every chunk which overlaps a chunk of higher score, working left to right.
Running a Statistical Named Entity Recognizer
Named entity recognition is particularly simple if a trained model already exists. Like our idol Julia Child, we have an already-baked cake waiting to show you. And like Julia, after we let you see the finished product, we'll show you how to bake it step by step.
First-Best Named Entity Chunking
The following program, src/RunChunker.java, provides a
command that loads a named-entity recognizer as an instance of the
Chunker interface and then applies it to the remaining
command-line arguments, printing out the results.
public static void main(String[] args) throws Exception {
File modelFile = new File(args[0]);
System.out.println("Reading chunker from file=" + modelFile);
Chunker chunker
= (Chunker) AbstractExternalizable.readObject(modelFile);
for (int i = 1; i < args.length; ++i) {
Chunking chunking = chunker.chunk(args[i]);
System.out.println("Chunking=" + chunking);
}
}
The code may be run from an Ant task, using:
% ant run-genetag
or it may be run directly from Java, using:
% java -cp neDemo.java:../../../lingpipe-4.1.0.jar RunChunker \ ../../models/ne-en-bio-genetag.HmmChunker \ "p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rhabdomyosarcoma cells."
Windows note: Replace the colon (:) in the classpath with a semicolon (;) and put everything on one line without any backslashes.
The output is the following:
run-genetag:
[java] Reading chunker from file=..\..\models\ne-en-bio-genetag.HmmChunker
[java] Chunking=p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rha
bdomyosarcoma cells. : [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 81-92:GENE@-Infinity]
This output repeats the input sentence and then, following the colon,
the list of chunks found. Here we found a chunk running from
character position 0 (inclusive) to character position 3 (exclusive),
with type GENE; this is right, because p53 is a
gene. It also find a gene from psoition 20 to 54, for the expression
insulin-like growth factor II gene, and a final one from 81 to
92 for P4 promoter.
The model used by this chunker is
../../models/ne-en-bio-genetag.HmmChunker.
Like our other models, this one is labeled by task (ne
for named-entity recognition), language (en for English),
genre (bio for biology) and corpus (genetag for
the
GENETAG
corpus), and suffixed with the name of the class of the
serialized object (HmmChunker for
com.aliasi.chunk.HmmChunker).
N-Best Named Entity Chunking
Our next example runs n-best chunking. This means it returns not only its first answer, but also alternative hypotheses, in order of their estimated likelihoods, which are also returned.
NBestChunker chunker
= (NBestChunker) AbstractExternalizable.readObject(modelFile);
for (int i = 1; i < args.length; ++i) {
char[] cs = args[i].toCharArray();
Iterator<ScoredObject<Chunking>> it = chunker.nBest(cs,0,cs.length,MAX_N_BEST);
System.out.println(args[i]);
for (int n = 0; it.hasNext(); ++n) {
ScoredObject so = it.next();
double jointProb = so.score();
Chunking chunking = so.getObject();
System.out.println(n + " " + jointProb
+ " " + chunking.chunkSet());
}
The n-best chunker is reconstituted in the same manner as the
regular chunker. We then iterate over arguments, and for each
argument, apply the method nBest(), which takes as
arguments the command-line argument converted to a character slice
and a maximum for the number of results to return. Then there's
an iterator over the results. The results returned by
an n-best chunker are instances of
ScoredObject, as shown by the cast. The score is
the log (base 2) joint probability of the input character sequence
and the chunking. The object is the chunking itself.
N-best chunking can be run with the following Ant task:
% ant run-genetag-nbest
The results are provided below.
Reading chunker from file=..\..\models\ne-en-bio-genetag.HmmChunker p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rhabdomyosarcoma cells. 0 -182.7325747972808 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 81-92:GENE@-Infinity] 1 -183.39842051103167 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity, 81-92:GENE@-Infinity] 2 -185.1133085819252 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 74-92:GENE@-Infinity] 3 -185.73246504074297 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 81-83:GENE@-Infinity] 4 -185.77915429567608 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity, 74-92:GENE@-Infinity] 5 -186.39831075449385 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity, 81-83:GENE@-Infinity] 6 -187.47968583511886 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity] 7 -188.14553154886974 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity]
The chunkings are ordered in descending (log base 2) joint probability. For instance, the first result has a joint estimate of -182.7, whereas the second has one of -183.4. By some simple math, we can see how much more likely the first analysis is than the second:
p1/p2 = 2 ** (log (p1/p2))
= 2 ** (log p1 - log p2)
= 2 ** -182.7-183.4
= 2 ** .7
~ 1.62
That is, the first analysis is only 1.62 times as likely as the second, analysis. The first analysis is more than five times as likely as the third analysis, with a log probability of -185.1.
N-best analyses are particularly useful for long-distance rescoring
systems. The class com.aliasi.chunk.RescoringChunker provides an abstract implementation of rescoring and com.aliasi.chunk.CharLmRescoringChunker provides a concrete implementation.
Confidence Named Entity Chunking
The final example runs chunking in such a way as to return chunks
in order of confidence, which we take to be the probability of
the chunk given the input text, P(chunk|text). The
program is straightforward:
ConfidenceChunker chunker
= (ConfidenceChunker) AbstractExternalizable.readObject(modelFile);
for (int i = 1; i < args.length; ++i) {'
char[] cs = args[i].toCharArray();
Iterator<Chunk> it
= chunker.nBestChunks(cs,0,cs.length,MAX_N_BEST_CHUNKS);
System.out.println(args[i]);
System.out.println("Rank Conf Span Type Phrase");
for (int n = 0; it.hasNext(); ++n) {
Chunk chunk = it.next();
double conf = Math.pow(2.0,chunk.score());
int start = chunk.start();
int end = chunk.end();
String phrase = args[i].substring(start,end);
System.out.println(n + " "
+ Strings.decimalFormat(conf,"0.0000",12)
+ " (" + start
+ ", " + end
+ ") " + chunk.type()
+ " " + phrase);
}
}
The program is mostly like the previous one, and begins by
reconstituting a chunker, this time a ConfidenceChunker.
Then we walk over the arguments, extracting a character array as
before, and then providing it to the chunker, this time calling the
nBestChunks method. With a confidence chunker, the
result iterator is over chunks, so the cast is to (Chunk)
in the body of the iteration. We then compute the confidence by
exponentiating the log (base 2) value so that it is easier to read.
We then print the confidence out formatted, and then the spans, the
type and phrase, which we extract as a substring of the arguments
Confidence chunking can be run with the following Ant task:
% ant run-genetag-conf
Running this program on the same input provides the following results
Reading chunker from file=..\..\models\ne-en-bio-genetag.HmmChunker p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rhabdomyosarcoma cells. Rank Conf Span Type Phrase 0 0.9999 (0, 3) GENE p53 1 0.7328 (81, 92) GENE P4 promoter 2 0.6055 (20, 54) GENE insulin-like growth factor II gene 3 0.3817 (14, 54) GENE human insulin-like growth factor II gene 4 0.1395 (74, 92) GENE active P4 promoter 5 0.0916 (81, 83) GENE P4 6 0.0088 (74, 83) GENE active P4 7 0.0070 (20, 49) GENE insulin-like growth factor II 8 0.0044 (14, 49) GENE human insulin-like growth factor II
The entity recognizer is 99.99% confident that p53 is a mention of a gene. It's only 73.28% confident about P4 promoter being a gene, and even less confident about insulin-line growth factor II gene. The list gets interesting when we see names that overlap, such as the fourth (index 3) result, which prefixes human to the third result, or the fifth result, which prefixes active to the second result. These confidences reflect the uncertainty of the recognizers.
We believe confidence-ranked results will be very useful for information extraction applications where recall and/or confidence in results is important. Rather than just counting instances of first-best results, a system may push the uncertainty of the gene annotator through later processes in a Bayesian fashion. In practice, the results are available for iteration.
Training a Named Entity Recognizer
Download Training Data
The first requirement for training a named entity recognizer is gathering the data. For this demo, we use the GeneTag named entity data, provided by the Unites States National Center for Biotechnology Information (NCBI), a part of the U.S. Library of Medicine, and described in this paper:
- Tanabe, L., N. Xie L. H. Thom, W. Matten, And W. J. Wilbur. 2004. GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 2005, 6(Suppl 1):S3.
The data itself may be downloaded from the following anonymous FTP server:
The data will then need to be untarred (and gunzipped). Assuming it
has been downloaded to directory dataDir:
% cd dataDir % tar -xzf medtag.tar.gz
The file necessary for training will then be at the path
dataDir/medtag/genetag/genetag.tag; this is
what we'll call GeneTagFilePath in the next section.
Running the Training Code
The following code, drawn from src/TrainGeneTag.java provides
all you need to train a named entity recognizer based on the
GeneTag corpus:
static final int MAX_N_GRAM = 8;
static final int NUM_CHARS = 256;
static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior
public static void main(String[] args) throws IOException {
File corpusFile = new File(args[0]);
File modelFile = new File(args[1]);
System.out.println("Setting up Chunker Estimator");
TokenizerFactory factory
= IndoEuropeanTokenizerFactory.INSTANCE;
HmmCharLmEstimator hmmEstimator
= new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
CharLmHmmChunker chunkerEstimator
= new CharLmHmmChunker(factory,hmmEstimator);
System.out.println("Setting up Data Parser");
GeneTagParser parser = new GeneTagParser();
parser.setHandler(chunkerEstimator);
System.out.println("Training with Data from File=" + corpusFile);
parser.parse(corpusFile);
System.out.println("Compiling and Writing Model to File=" + modelFile);
AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
}
The GeneTag corpus is provided in a single file, given as the first command-line argument. The second argument is just the name of the target file into which to write the model.
The first block of code initializes an estimator for a chunker. In
particular, it initializes an instance of
CharLmHmmChunker, which provides chunking based on a
hidden Markov model (HMM).
language models. This particular chunker requires a tokenizer factory
for breaking the input into tokens and a trainable HMM on which to
base the model. For the first, we use our standard Indo-European
tokenizer factory. For the HMM, we use a HmmCharLmEstimator,
which estimates state emissions with sequence character language models.
In this case, we set the maximum n-gram to a constant (8), set the
number of characters (we'll assume 256), and the interpolation parameter
for the language models (8.0).
The second block of code just defines the parser, in this case an
instance of com.aliasi.corpus.parsers.GeneTagParser,
which is the parser for the GeneTag corpus. The second line sets the
handler for the parser to the chunker estimator. This means that
chunkings extracted by the parser will be handed off to the chunker
estimator for training. This pattern is just like that used for XML
SAX parsing, with our parser playing the role of an
org.xml.sax.XMLReader and our handler playing the role of
an org.xml.sax.ContentHandler.
The next block of code just parses the corpus file specified in the
command-line argument. The parsed chunkings are handed to the estimator for
training. The final block of code just compiles the model to a file,
using a utility method in com.aliasi.util.AbstractExternalizable.
This code can be compiled to neDemo.jar from Ant using
the following target:
% ant jar
It may then be run with the Ant target:
% ant -Ddir.genetag=GeneTagDir train-genetag
where GeneTagDir is the path to the directory
the genetag.tag after unpacking, which will be
at the relative path medtag/medtag/genetag from
where the medtag data is unpacked.
This command may be run directly in Java from the directory
$LINGPIPE/demos/tutorial/ne with:
% java -cp ../../../lingpipe-4.1.0.jar;neDemo.jar TrainGeneTag GeneTagDirPath ../../models/ne-en-bio-genetag.HmmChunker
Note that the resulting model is written into the standard location
$LINGPIPE/demos/models.
To evaluate with the model we just built, use:
% ant -Ddir.genetag=GeneTagParentDir eval-genetag
This performs a cross-validation on the data. We discuss cross-validation in the section on Arabic NER below.
Evaluating a Named Entity Recognizer
The standard evaluation setup for a named-entity recognizer (and any other statistical component) involves partitioning the labeled data into training and test segments. If you're playing by old-fashioned hypothesis-testing statistical rules, or if you're competing in a bakeoff that only allows a single submission, you only get one shot at the test data. In the machine-learning community, it's not uncommon to see things like what we're about to do, which is cheat. In slightly more technical terms, we'll be reporting post-hoc results on a so-called development set. Luckily for our systems, the parameters are very stable, meaning that slight perturbations of the parameters only lead to slight changes in performance.
CoNLL Spanish Task
We'll use data for this tutorial that's available from the web. The Conference on Computational Natural Language Learning (ConNLL), is an annual meeting that sponsors a bakeoff. In 2002, the bakeoff involved Spanish and Dutch named-entity recognition. Here's a directory of the relevant CoNLL web pages:
- CoNLL 2002 Main Site
- CoNLL 2002 Shared Task: Language-Independent Named Entity Recognition (I)
- Download Data: ner.tgz
- Workshop Paper on Results
An illustration of the encoding from the first few lines of the Spanish training data is:
El O Abogado B-PER General I-PER del I-PER Estado I-PER , O Daryl B-PER Williams I-PER , O
With this encoding scheme, the phrases El Abogado General del
Estado and Daryl Williams are coded as
persons, with their beginnnig and continuing tokens picked out with
tags B-PER and I-PER respectively.
Unfortunately, there are several formatting errors in the data that must be fixed before our parsers can handle them.
-
esp.train, line 221619, changeI-LOCtoB-LOC -
esp.testa, line 30882, changeI-LOCtoB-LOC -
esp.testb, line 9291, changeI-LOCtoB-LOC
After downloading the data, simply unarchive it. Assuming the
compressed tarball ner.tgz
has been downloaded to a directory conll2002.
% cd conll2002 % tar -xzf ner.tgz
You should then see the following files under conll2002:
ner/data/esp.train.gzner/data/esp.testa.gzner/data/esp.testb.gz
You can further gunzip each of these, to produce:
ner/data/esp.trainner/data/esp.testaner/data/esp.testb
These are the files we use in the evaluation.
Training the Recognizer
First, we need to train a recognizer using the Spanish data. For
that, we have another program in src/TrainConll2002.java.
The relevant details are shown below.
static final int NUM_CHUNKINGS_RESCORED = 64; static final int MAX_N_GRAM = 12; static final int NUM_CHARS = 256; static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior public static void main(String[] args) throws IOException { File modelFile = new File(args[0]); File trainFile = new File(args[1]); File devFile = new File(args[2]); TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE; CharLmRescoringChunker chunkerEstimator = new CharLmRescoringChunker(factory,NUM_CHUNKINGS_RESCORED, MAX_N_GRAM,NUM_CHARS, LM_INTERPOLATION); Conll2002ChunkTagParser parser = new Conll2002ChunkTagParser(); parser.setHandler(chunkerEstimator); parser.parse(trainFile); parser.parse(devFile); AbstractExternalizable.compileTo(chunkerEstimator,modelFile); }
The most significant difference is that we use a different chunker estimator,
this time a com.aliasi.chunk.CharLmRescoringChunker. This chunker
provides a higher-level model that rescores the output of an underlying chunker.
This is why the first static declares how many such underlying n-best chunkings
are rescored. The second notable difference is that it uses a tag parser
designed for the CoNLL data format, the implementation of which can be
found in src/Conll2002ChunkTagParser.java. The only other difference between this new training scheme is that
we train on two files, both the training data and the development data (which were
both available to participants in the bakeoff before the final evaluation).
Assuming that the CoNLL data was installed in the directory conll2002 as described above, the recognizer can be trained with the Ant target:
% ant -Ddata.conll2002.spanish=conll2002/ner/data train-conll2002-spanish
Test Program
The testing program is almost as simple as the run program, and can
be found in src/TestConll2002Chunker.java,
the contents of which minus the imports and print statements are reproduced here.
public static void main(String[] args) throws Exception {
File chunkerFile = new File(args[0]);
File testFile = new File(args[1]);
AbstractCharLmRescoringChunker chunker
= (AbstractCharLmRescoringChunker)
AbstractExternalizable.readObject(chunkerFile);
ChunkerEvaluator evaluator = new ChunkerEvaluator(chunker);
evaluator.setVerbose(true);
Conll2002ChunkTagParser parser
= new Conll2002ChunkTagParser();
parser.setHandler(evaluator);
parser.parse(testFile);
System.out.println(evaluator.toString());
}
We've simply set up an evaluator wrapping our chunker, and set it to provide verbose output (case-by-case results to standard output). We've provided an Ant task to run the evaluation:
% ant -Ddata.conll2002.spanish=conll2002/ner/data test-conll2002-spanish
Incremental Per-Sentence Results
Because we have set the evaluator to be verbose, it prints a report for each sentence it processes. Here's an example:
CHUNKS=(65,77):MISC (118,136):ORG
Dicha concentración se hace para preparar el partido final de la Copa del Rey que disputará el próximo sábado ante el Atlético de Madrid .
012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567
1 2 3 4 5 6 7 8 9 0 1 2 3
1 1 1 1
REF M........... O.................
RESP M........... O.................
CHUNKS=(65,77):MISC (118,136):ORG
Dicha concentración se hace para preparar el partido final de la Copa del Rey que disputará el próximo sábado ante el Atlético de Madrid .
012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567
1 2 3 4 5 6 7 8 9 0 1 2 3
1 1 1 1
REF M........... O.................
0 -143 M........... O.................
-----------
1 -156 M........... O....... L.....
2 -158 O.... M........... O.................
3 -159 M.............. O.................
4 -163 M........... O....... O.....
5 -164 M....................... O.................
6 -165 M... M.. O.................
7 -166 M........... P.. O.................
Correct Rank=0
CHUNKS=(65,77):MISC (118,136):ORG
Dicha concentración se hace para preparar el partido final de la Copa del Rey que disputará el próximo sábado ante el Atlético de Madrid .
012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567
1 2 3 4 5 6 7 8 9 0 1 2 3
1 1 1 1
TRUE (65, 77): MISC 0.9999877514430212
TRUE (118, 136): ORG 0.9998255148751076
false (118, 126): ORG 1.7448512066560313E-4
false (130, 136): LOC 1.7339828700790064E-4
false (0, 5): ORG 2.679282465485203E-5
false (62, 77): MISC 1.1394528562976028E-5
false (130, 136): ORG 1.062746657431117E-6
false (53, 77): MISC 4.978971914365618E-7
The report begins by printing the sentence and underlying chunk boundaries in the reference (truth). Below the reference, the system result is printed as the response.
Next, the n-best results are printed. These again begin with a printout of the input string and the reference tagging. Then, the n-best list is printed up to the size specified on the evaluator (8, in our case). A line is printed below the correct answer, if any, and the rank of the correct answer is printed at the bottom. In this case, our first-best answer (rank 0) was correct. Next to the rank, the score is provided. This is a joint probability estimate presented in log (base 2) form. Thus we see that our first answer at -143 is 213 times more likely than the second analysis at -156.
Finally, the confidence-based results are printed, again after the
input is displayed. The chunks returned by the system are printed
with their character offsets (begin inclusive, end exclusive), their
type, and then their score. The score is the conditional probability
estimate of the chunk given the input. Thus the system is 99.98% sure
that the characters from 118 to 136 refer to an organization. This is
not a good estimate. The variety of chunker we're using for this demo
tends to attenuate the results far too much in the direction of the
top scorers. This is because it estimates confidence from the top
results on the n-best list, which is itself very attenuated in this
example. The simpler chunker,
com.aliasi.chunk.CharLmHmmChunker provides more reliable
confidence estimates than this chunker, the
com.aliasi.chunk.CharLmRescoringChunker.
First-best Results
The run takes a little under a minute on my machine and produces the following first-best results:
FIRST-BEST EVAL Total=4307 True Positive=2674 False Negative=742 False Positive=891 True Negative=0 Positive Reference=3416 Positive Response=3565 Negative Reference=891 Negative Response=742 Accuracy=0.6208497794288368 Recall=0.7827868852459017 Precision=0.7500701262272089 Rejection Recall=0.0 Rejection Precision=0.0 F(1)=0.7660793582581292 Fowlkes-Mallows=3489.70485858045 Jaccard Coefficient=0.6208497794288368 Yule's Q=-1.0 Yule's Y=-1.0 Reference Likelihood=0.7931274669143256 Response Likelihood=0.8277223125145112 Random Accuracy=0.6921288226373674 Random Accuracy Unbiased=0.6927272243084177 kappa=-0.23152230039570446 kappa Unbiased=-0.23392064174187105 kappa No Prevalence=0.24169955885767358 chi Squared=233.8186156392983 phi Squared=0.054288046352286574 Accuracy Deviation=0.0073928430493272
The F(1) score of 76.61% would've tied LingPipe at fourth place overall (out of what would be 14 contestants if you include us). We'd have been in third place for recall and fifth place for precision. You can check out the CoNLL 2002 NER Task Home for the result (they're near the bottom). The best scoring system was by the organizers, Xavier Carreras, Luís Màrquez and Luís Padró, which scored 81.39% f-measure using dictionaries; without dictionaries, their system scored 79.28%, only slightly better than the second place system of Radu Florian at 79.05%, which didn't train on any external data. Links to papers describing all of the systems submitted to the bakeoff are available from CoNLL 2002 NER Task Home.
The organizers computed per-system 95% confidence intervals (using bootstrap resampling) at roughly 1.5%, making systems two and three at 79.1+/-1.5% not significantly different than ours at 76.6+/-1.5%.
Now is the time to recall what we said about cheating. The bakeoff contestants submitted their system runs once and got a score back. We took the default LingPipe settings for Western languages -- 12-grams with a 12.0 interpolation ratio and no pruning. We also took the default of rescoring 128 sentences. We did some post-hoc tests and found if we rescore 64 results instead of 128, F-measure drops a fairly substantial 0.3%, whereas if we rescore 256 results instead of 128, F-measure drops a fairly trivial 0.03%. Rescoring more results doesn't seem to matter much in terms of scores.
N-best Results
Our chunking evaluator also returns a report on evaluating n-best results.
It returns these results as a histogram mapping rank to number of test cases
where the reference result was that rank, with -1 being assigned
to cases whose reference result was not in the n-best list. For the settings
above, the n-best results are (reformatted into three columns):
N-BEST EVAL
0=969
-1=173 15=4 27=1
1=140 14=3 29=1
2=53 20=3 30=1
3=25 21=3 31=1
4=20 26=3 34=1
5=15 32=3 38=1
6=14 12=2 39=1
7=11 16=2 41=1
8=10 19=2 44=1
11=8 23=2 48=1
18=7 24=2 49=1
9=6 28=2 50=1
10=5 36=2 57=1
17=5 45=2 67=1
13=4 22=1 76=1
83=1
This report indicates there are 969 cases for which the system's first-best answer (rank 0) was correct. There were 140 cases when the second-best answer (rank 1) was correct, and so on. Note that the count for -1, namely 173, is the count for the number of cases for which the reference chunking was beyond the 128th-best response chunking.
Confidence Results
The final piece of the report is the confidence report. This provides a view of the results that would be obtained by returning chunks according to their global confidence, without concern about overlap, and with a given threshold in place.
CONFIDENCE EVALUATION Area Under PR Curve (interpolated)=0.7344955594638413 Area Under PR Curve (uninterpolated)=0.7316142729153922 Area Under ROC Curve (interpolated)=0.8111854024938068 Area Under ROC Curve (uninterpolated)=0.8111854024938073 Average Precision=0.8628569727835927 Maximum F(1) Measure=0.756586983074507 BEP (Precision-Recall break even point)=0.7508207485226527 Reciprocal Rank=1.0
The average precision and area under the PR/ROC curves are the standard reports from ranked systems, as used in TREC. See the class documentation for ScoredPrecisionRecallEvaluation for more details and examples of how these are computed and what they mean. The most significant one is perhaps the maximum F(1) measure, as that indicates the maximum score possible by setting a threshold on confidence results and returning all chunks above that confidence. Here, as in many other cases we've evaluated, it's very close to the first-best F(1) measure.
Named Entities in Arabic
LingPipe's basic approach is portable across languages. All that needs to change is sentence detection and tokenization. We've successfully built named-entity recognizers for English, Spanish, Dutch, German, French, Hindi, Chinese, and Arabic.
The ANER Corpus
In this section, we use Yassine Benajiba's
ANER corpus for Arabic, which consists of roughly 150K tokens tagged
in CoNLL column format for person, location, organization and
miscellaneous entity mention types (typesMISC, PERS, ORG,
LOC).
The corpus is available from:
Benajiba provides more details on the corpus and an evaluation of his logistic regression plus part-of-speech based system in the following two papers (the earlier one gives more details about the corpus):
- Benajiba, Y. and P. Rosso 2007. ANERsys 2.0 : Conquering the NER task for the Arabic language by combining the Maximum Entropy with POS-tag information. In IICAI-2007.
- Benajiba, Y., P. Rosso, and Benedí J. M. 2007. ANERsys: An Arabic Named Entity Recognition system based on Maximum Entropy. In CICLing-2007.
Download the data into a directory $ANER and unpack the top-level
zip files, and if you are going to use the dictionaries, unpack
the dictionary level gzip files. You should wind up with a directory
structure with the following structure, where ANERCorp is
the CoNLL-formatted entity corpus and the other files are gazetteers
with one entry per line:
$ANER/ANERCorp $ANER/ANERGazet/AlignedLocGazetteer $ANER/ANERGazet/AlignedOrgGazetteer $ANER/ANERGazet/AlignedPersGazetteer
The data was collected from a variety of sources:
| Source | Ratio |
|---|---|
| http://www.aljazeera.net/ | 34.8% |
| Other newspapers and magazines | 17.8% |
| http://www.raya.com/ | 15.5% |
| http://ar.wikipedia.org/ | 6.6% |
| http://www.alalam.ma/ | 5.4% |
| http://www.ahram.eg.org/ | 5.4% |
| http://www.alittihad.ae/ | 3.5% |
| http://www.bbc.co.uk/arabic/ | 3.5% |
| http://arabic.cnn.com/ | 2.8% |
| http://www.addustour.com/ | 2.8% |
| http://kassioun.org/ | 1.9% |
The contents of the labeled training file looks like:
فرانكفورت B-LOC (د O ب O أ) O أعلن O اتحاد B-ORG صناعة I-ORG السيارات I-ORG في O ألمانيا B-LOC ...
Arabic words are typically printed right-to-left, and
HTML lets you specifiy direction with the cascading style
sheets (CSS) style attribute style="direction:rtl"
on the pre specification. The result looks better if you
read Arabic:
فرانكفورت B-LOC (د O ب O أ) O أعلن O اتحاد B-ORG صناعة I-ORG السيارات I-ORG في O ألمانيا B-LOC ...
The dictionaries (also known as gazetteers when involving locations) are just lists. For instance, the first few lines of the location dictionary are:
==== اسيا ==== اردن الاردن **** عجلون عمان اربد الرمثا الشوبك جرش زرقاء ...
We just trained each line as an entry, including all the punctuation-only lines, which we probably should've stripped out, because the dictionary didn't help accuracy.
Data Munging
Automatic Sentence Chunking
We chunked the corpus into sentences, using periods, exclamation points and question marks as sentence breaks. We made sure not to munge out entities this way.
Tokenization
Because the CoNLL data format removes whitespace information, we reinsered single whitespaces between all tokens, which let us use a simple whitespace-based tokenizer.
Clearly for a real application, a realistic tokenizer and sentence detector would be required. Breaking on spaces and punctuation might work OK.
Cross-Validating Arabic
To run cross-validation on the corpus, set the system
property aner.dir to the top level directory
$ANER:
> ant -Daner.dir=e:\data\ANERCorp\unpacked -Daner.dict=false -Daner.misc=true -Daner.ngram=8 -Daner.rescored=512 aner
the first output of which is a dump of the parameters:
Input Files
NE Corpus: E:\data\ANERCorp\unpacked\ANERCorp
Location Gazetteer: E:\data\ANERCorp\unpacked\ANERGazet\AlignedLocGazetteer
Organization Gazetteer: E:\data\ANERCorp\unpacked\ANERGazet\AlignedOrgGazetteer
Person Gazetteer: E:\data\ANERCorp\unpacked\ANERGazet\AlignedPersGazetteer
Parameters
N-Gram=8
Num chars=1024
Interpolation Ratio=8.0
Including MISC entity type=true
Use dictionary=false
Corpus Statistics
Location Dict Entries=2183
Organization Dict Entries=403
Person Dict Entries=2309
# sentences=4900
# tokens=150286
...
The other system properties determine whether to use a dictionary,
whether to include the miscellaneous category in training and
evaluation, and the n-gram length. There are a few more parameters to
tweak finer-grained parameters; see the aner target in
the build.xml and the source file
for details.
LingPipe's results are very similar to what Yassine achieved with a logistic-regression classifier on top of part-of-speech tagged and morphologically analyzed input.
Finally, note that these results are without a dictionary. Adding a dictionary improved recall, but hurt precision, with a net result of a lower F measure. Many of the dictionary entries look similar, so I'm thinking this is an issue with not understanding the dictionary's structured format (there are no hints in the paper).
The Code
The source code for parsing the corpus, training, and cross-validating
is in src/ANERXVal.java. The only point at
which this code departs from previous examples is in allowing dictionary-based
training. With either an HMM chunker or HMM rescoring chunnker, dictionaries
may be used to train the categories. Here's the relevant code snippet:
static final String LOC_TAG = "LOC";
...
String[] locDict
= FileLineReader.readLineArray(locGazFile,Strings.UTF8);
...
CharLmRescoringChunker chunker = ...;
...
for (String locName : locDict)
if (locName.length() > 0)
chunker.trainDictionary(locName,LOC_TAG);
All that is required is that the locations be fed into the
trainDictionary() method. This method is not abstracted
into an interface, but is available for both the plain HMM chunker
chunk.CharLmHmmChunker and the longer-distance rescoring
version chunk.CharLmRescoringChunker.
The Results
Here's the rest of the output:
Input Files
NE Corpus: C:\carp\data\aner\ANERCorp
Location Gazetteer: C:\carp\data\aner\ANERGazet\AlignedLocGazetteer
Organization Gazetteer: C:\carp\data\aner\ANERGazet\AlignedOrgGazetteer
Person Gazetteer: C:\carp\data\aner\ANERGazet\AlignedPersGazetteer
Parameters
N-Gram=8
Num chars=1024
Interpolation Ratio=8.0
Number of Analyses Rescored=512
Including MISC entity type=true
Use dictionary=false
Corpus Statistics
Location Dict Entries=2183
Organization Dict Entries=403
Person Dict Entries=2309
# sentences=4888
# tokens=150285
NE Types Found=[MISC, PERS, ORG, LOC]
-----------------------------------------------
FOLD= 5 start sent= 4073 end sent= 4888
training labeled data
compiling
evaluating
FOLD=5 LOC P=0.754 R=0.877 F=0.811
FOLD=5 PERS P=0.670 R=0.624 F=0.646
FOLD=5 ORG P=0.625 R=0.504 F=0.558
FOLD=5 MISC P=0.656 R=0.417 F=0.510
FOLD=5 COMBINED P=0.690 R=0.648 F=0.668
-----------------------------------------------
FOLD= 4 start sent= 3258 end sent= 4073
training labeled data
compiling
evaluating
FOLD=4 LOC P=0.769 R=0.835 F=0.801
FOLD=4 PERS P=0.644 R=0.711 F=0.676
FOLD=4 ORG P=0.717 R=0.507 F=0.594
FOLD=4 MISC P=0.610 R=0.413 F=0.493
FOLD=4 COMBINED P=0.705 R=0.666 F=0.685
-----------------------------------------------
FOLD= 3 start sent= 2444 end sent= 3258
training labeled data
compiling
evaluating
FOLD=3 LOC P=0.853 R=0.834 F=0.843
FOLD=3 PERS P=0.714 R=0.690 F=0.702
FOLD=3 ORG P=0.694 R=0.611 F=0.650
FOLD=3 MISC P=0.686 R=0.484 F=0.568
FOLD=3 COMBINED P=0.765 R=0.710 F=0.736
-----------------------------------------------
FOLD= 2 start sent= 1629 end sent= 2444
training labeled data
compiling
evaluating
FOLD=2 LOC P=0.724 R=0.674 F=0.698
FOLD=2 PERS P=0.478 R=0.594 F=0.530
FOLD=2 ORG P=0.507 R=0.565 F=0.535
FOLD=2 MISC P=0.398 R=0.385 F=0.391
FOLD=2 COMBINED P=0.585 R=0.608 F=0.596
-----------------------------------------------
FOLD= 1 start sent= 814 end sent= 1629
training labeled data
compiling
evaluating
FOLD=1 LOC P=0.818 R=0.829 F=0.823
FOLD=1 PERS P=0.657 R=0.660 F=0.658
FOLD=1 ORG P=0.512 R=0.502 F=0.507
FOLD=1 MISC P=0.481 R=0.388 F=0.430
FOLD=1 COMBINED P=0.690 R=0.680 F=0.685
-----------------------------------------------
FOLD= 0 start sent= 0 end sent= 814
training labeled data
compiling
evaluating
FOLD=0 LOC P=0.811 R=0.750 F=0.779
FOLD=0 PERS P=0.638 R=0.663 F=0.650
FOLD=0 ORG P=0.534 R=0.482 F=0.507
FOLD=0 MISC P=0.529 R=0.451 F=0.487
FOLD=0 COMBINED P=0.698 R=0.662 F=0.679
===============================================
COMBINED CROSS-VALIDATION RESULTS
X-Val LOC P=0.788 R=0.789 F=0.789
X-Val PERS P=0.638 R=0.658 F=0.648
X-Val ORG P=0.612 R=0.527 F=0.566
X-Val MISC P=0.557 R=0.426 F=0.483
X-Val COMBINED P=0.689 R=0.663 F=0.676
The results reported by Benajiba for his ANER 2.0 system in the paper above (p. 1821) are:
LOC P=0.917 R=0.822 F=0.867 MISC P=0.723 R=0.557 F=0.630 ORG P=0.480 R=0.450 F=0.464 PERS P=0.563 R=0.486 F=0.521 TOTAL P=0.702 R=0.621 F=0.659
He only evaluated on one fold, but I'm not sure which one, or even if it corresponded to one of our folds. Fold 5 (printed first), is the last 25K tokens of the file.
As with the other applications of character LMs to chunking, the Arabic named-entity recognizer is not particularly sensitive to the parameters used for training.
Which Named Entity Recognizer?
LingPipe actually provides three generic, trainable chunkers which may
be used for named-entity recognizers, all in the package
com.aliasi.chunk. These are:
CharLmHmmChunker: This chunker is based on an encoding of chunking as a tagging problem (slightly richer and more context sensitive than the standard BIO encoding internally). A character language model HMM then handles the tagging, using character language models for each tag (state) in the HMM, and a maximum likelihood bigram transition model. This chunker runs first-best, n-best and confidence output. This is the simplest, but also the least accurate chunker. It's very fast with caching turned on. It also has good recall for its confidence output and fairly reliable confidence estimates.CharLmRescoringChunker: This is the most accurate, but also the slowest chunker. It uses aCharLmHmmChunkerto generate hypotheses which it then rescores using longer-distance character language models. This chunker is fairly slow, especially when rescoring large n-best lists. It also approximates a confidence-based chunker by estimating confidence through the n-best list, but the confidence estimates tend to be highly attenuated by the model.TokenShapeChunker: This is the LingPipe 1.0 chunker rewrapped in LingPipe 2.0 interfaces. It runs with a generative model jointly predicting the next token and tag based on the previous two tokens and previous tag. Unknown words are replaced with shape features, based on an instance ofTokenCategorizer. This chunker must be trained using the classTrainTokenShapeChunker. It's very fast, but only provides first-best output. It's performance is usually halfway between the other two approaches.
Scoring Files versus Files
In this section, we show how to use LingPipe's chunking evaluation framework and chunking parser framework to score a chunking in one file against another. It's easy to extend to multiple parallel files.
Sample File Scoring Data
We've provided MUC-6 formatted reference and response files in the demo data directory supplied with the distribution:
Running File Scoring
We've also provided an ant task through which file scoring may be run:
% cd $LINGPIPE/demos/tutorial/ne % ant eval-files-muc6
This produces the following output:
EVALUATION RESULT Total=7 True Positive=2 False Negative=2 False Positive=3 True Negative=0 Positive Reference=4 Positive Response=5 Negative Reference=3 Negative Response=2 Accuracy=0.2857142857142857 Recall=0.5 Precision=0.4 Rejection Recall=0.0 Rejection Precision=0.0 F(1)=0.4444444444444445 Fowlkes-Mallows=4.47213595499958 Jaccard Coefficient=0.2857142857142857 Yule's Q=-1.0 Yule's Y=-1.0 Reference Likelihood=0.5714285714285714 Response Likelihood=0.7142857142857143 Random Accuracy=0.5306122448979592 Random Accuracy Unbiased=0.5408163265306122 kappa=-0.5217391304347827 kappa Unbiased=-0.5555555555555554 kappa No Prevalence=-0.4285714285714286 chi Squared=2.1 phi Squared=0.3 Accuracy Deviation=0.17074694419062766
The File Scoring Code
The file scorer code is found in src/FileScorer.java.
Tracing from the top-level main method, which is called with
the reference and response file paths as arguments:
public static void main(String[] args) throws IOException {
File refFile = new File(args[0]);
File responseFile = new File(args[1]);
Parser parser = new Muc6ChunkParser();
FileScorer scorer = new FileScorer(parser);
scorer.score(refFile,responseFile);
System.out.println(scorer.evaluation().toString());
}
First, the files are created from the command-line arguments.
Then, we create a parser for the data and use it to construct
the file scorer. The scorer is used to score the reference
file against the response file and the result is printed out.
To generalize this code to handle multiple files, the arguments
would be directories which would be walked in parallel, with
each pair of reference and response files being given to the
scorer via its score method.
The file scorer's constructor simply sets the parser and a chunking evaluation is created and assigned to a member variable.
private final Parser mParser;
private final ChunkingEvaluation mEvaluation
= new ChunkingEvaluation();
public FileScorer(Parser parser) {
mParser = parser;
}
The remaining method does all the work of parsing and adding the cases to the evaluation.
public void score(File refFile, File responseFile) throws IOException {
ChunkingCollector refCollector = new ChunkingCollector();
mParser.setHandler(refCollector);
mParser.parse(refFile);
List<Chunking> refChunkings = refCollector.mChunkingList;
ChunkingCollector responseCollector = new ChunkingCollector();
mParser.setHandler(responseCollector);
mParser.parse(responseFile);
List<Chunking> responseChunkings = responseCollector.mChunkingList;
if (refChunkings.size() != responseChunkings.size())
throw new IllegalArgumentException("chunkings not same size");
for (int i = 0; i < refChunkings.size(); ++i)
mEvaluation.addCase((Chunking) refChunkings.get(i),
(Chunking) responseChunkings.get(i));
}
It relies on the following static class to collect the chunks and store them in order in a list.
private static class ChunkingCollector
implements ObjectHandler<Chunking> {
private final List<Chunking> mChunkingList
= new ArrayList<Chunking>();
public void handle(Chunking chunking) {
mChunkingList.add(chunking);
}
}
The appropriate collector is set as the handler for the parser before
the file is parsed. The collector receives all of the chunkings found
by the parser through its handle(Chunking) method.
Next, the chunkings are retrieved as lists; if they are not
the same length, an exception is thrown. To avoid this exception,
the files must be formatted to match in terms of number of chunkings.
For instance, the MUC-6 parser used in the demo breaks on sentence
elements, with each sentence corresponding to a chunk.
The addCase method further checks that the chunkings
are over the same strings. Try adding space or changing the spelling
in just the response file and check out the exception that's thrown.
The final step is to simply loop over the chunkings and add each as a case to the evaluation.
Chinese and Beyond
New Data Formats
All that's needed to handle a new data format is a new parser for it.
Alternatively, the data can be transcoded into a format that LingPipe
already recognizes. An example of a parser can be found in src/Conll2002ChunkTagParser.java.
Everything else stays the same, and slight variants of the programs in this demo
may be used to build models and evaluate them.
Chinese Named Entities in the Sandbox
We entered LingPipe into the 2006 SIGHAN Chinese named-entity (and word segmentation) bakeoff. Our entry had an F-measure of around 82%, leaving it 3.5% behind the best named-entity recognition entry in F-measure. Full details, including all of the code used to bake our entry, can be found in:
In addition to a parser for the data format (which is basically just like CoNLL's), we also used a different tokenizer, in this case, one that treats every character as a token.