What is Named Entity Recognition?

Named entity recognition (NER) is the process of finding mentions of specified things in running text.

News Entities: People, Locations and Organizations

For instance, a simple news named-entity recognizer for English might find the person mention John J. Smith and the location mention Seattle in the text John J. Smith lives in Seattle.

Biomedical Entities: Genes, Organisms, Malignancies, Chemicals, ...

There are a range of biomedical corpora available to find mentions of various entities. For instance, human fibrinogen and thrombin are mentions of genes in the sentence Native human fibrinogen was brought to coagulation by adding thrombin.

How does LingPipe Recognize Entities?

Like many of the other modules in LingPipe, named entity recognition may be based on the supervised training of a statistical model, or on more direct methods such as dictionary matching or regular expression matching. All of these methods are designed to work together smoothly because they annotate text through the same interface, com.aliasi.chunk.Chunker.

In the statistical models, the training data must be labeled with all of the entities of interest and their types. Furthermore, the training data should match the data on which the system will be run. If a system is trained on well-edited newswire data and is then run on blogs, performance will degrade, as it will be tuned to the clues provided by newswire text (e.g. the use of honorifics such as Mr.) and will miss the clues provided in blogs (e.g. e-mail links).

Rule-Based Named Entity Detection

Named entity recognition is particularly easy if it's possible to write a regular expression that captures the intended pattern of entities.

We'll write an entity detector that finds well-formed email addresses. The format of emails is explained, with references to the standards, in:

Briefly, an e-mail address of the form lingpipe@alias-i.com consists of two parts, the local part, lingpipe and the domain name, alias-i.com. These are separated by the at-sign (@).

The first step is writing a regular expression that matches all and only e-mail addresses. Rather than writing a fully specification-compliant email recognizer, we'll borrow one from:

The following example, contributed by Bilou McGyver, will work for example purposes:

[A-Za-z0-9](([_\.\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\.\-]?[a-zA-Z0-9]+)*)\.([A-Za-z]{2,})
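Before wrapping the expression in a chunker, it's easy to sanity-check it with the standard java.util.regex package. The following throwaway class is not part of the demo; it just prints every match in a hard-coded test string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmailRegexCheck {

    public static void main(String[] args) {
        // same pattern as above, with backslashes doubled for a Java string literal
        String regex = "[A-Za-z0-9](([_\\.\\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\\.\\-]?[a-zA-Z0-9]+)*)\\.([A-Za-z]{2,})";
        String text = "John's email is john@his.company.com and his friend's is foo.bar@123.foo.ca.";
        Matcher matcher = Pattern.compile(regex).matcher(text);
        while (matcher.find())
            System.out.println(matcher.start() + "-" + matcher.end() + ": " + matcher.group());
    }
}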

Basically, the regular expression is the only substantial part of the implementation. The complete chunker is in the file src/EmailRegExChunker.java, which we repeat here:

public class EmailRegExChunker extends RegExChunker {

    public EmailRegExChunker() {
        super(EMAIL_REGEX,CHUNK_TYPE,CHUNK_SCORE);
    }

    private final static String EMAIL_REGEX
        = "[A-Za-z0-9](([_\\.\\-]?[a-zA-Z0-9]+)*)@([A-Za-z0-9]+)(([\\.\\-]?[a-zA-Z0-9]+)*)\\.([A-Za-z]{2,})";

    private final static String CHUNK_TYPE = "email";

    private final static double CHUNK_SCORE = 0.0;

}

There is also a main method in that class for testing, which simply applies the chunker to the command-line arguments.

public static void main(String[] args) {
    Chunker chunker = new EmailRegExChunker();
    for (int i = 0; i < args.length; ++i) {
        Chunking chunking = chunker.chunk(args[i]);
        System.out.println("input=" + args[0]);
        System.out.println("chunking=" + chunking);
        Set<Chunk> chunkSet = chunking.chunkSet();
        Iterator<Chunk> it = chunkSet.iterator();
        while (it.hasNext()) {
            Chunk chunk = it.next();
            int start = chunk.start();
            int end = chunk.end();
            String text = args[i].substring(start,end);
            System.out.println("     chunk=" + chunk + "  text=" + text);
        }
    }
}

This code first creates the chunker, then analyzes each of the arguments using the chunker. It does this by pulling out the chunking, then pulling the set of chunks out of the chunking, then printing each of these chunks.

The sample we'll use is the following (with positional information displayed below the actual input text):

John's email is john@his.company.com and his friend's is foo.bar@123.foo.ca.
0123456789012345678901234567890123456789012345678901234567890123456789012345
0         1         2         3         4         5         6         7

There is an ant task for running it, the arguments to which are the text that will be analyzed.

% ant email-chunker
...
input=John's email is john@his.company.com and his friend's is foo.bar@123.foo.ca.
chunking=John's email is john@his.company.com and his friend's is foo.bar@123.foo.ca. : [16-36:email@0.0, 57-75:email@0.0]
    chunk=16-36:email@0.0  text=john@his.company.com
    chunk=57-75:email@0.0  text=foo.bar@123.foo.ca

The output reproduces the input string, then lists the chunks that were extracted. The first chunk is 16-36:email@0.0, which indicates an email chunk spanning character positions 16 (inclusive) through 36 (exclusive); as usual, we use half-open spans, so the end index is one past the position of the last character. This yields the entity john@his.company.com. The second entity is foo.bar@123.foo.ca (note that the final period is not part of the email).

Exact Dictionary-Based Chunking

In many applications, it is relatively straightforward to compile a list of names (and aliases) for entities and their types. For instance, westlaw.com has lists of all the registered lawyers in the United States. Such lists are extremely helpful for finding mentions of lawyers in legal documents such as case law or court transcripts. A baseball site such as mlb.com has a list of all current and retired baseball players to work from.

Finding Names in a Dictionary

Sometimes, it is enough to simply find all the mentions of phrases in a dictionary in a text. (See the sidebar for an example.)

A name like "50 Cent" or a product name like "xyz120 dvd player" can prove very difficult to find in text with statistical recognizers or generic rule-based ones. Sometimes a dictionary may be used in conjunction with a statistical recognizer in order to improve recall for "difficult" names. Names tend to be difficult if they're easily confusable as to type, or if they are confusable with common words.

Another strategy that may be used is to run statistical named entity recognition on a corpus and then use the output to build a dictionary that may be used to reannotate the corpus for chunks that may have been missed in the first pass.
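As a rough sketch of that bootstrapping strategy, a first statistical pass can populate a dictionary that drives a second, exact-match pass. The snippet below uses the dictionary classes introduced in the next section; the variables firstPassChunker, texts and CHUNK_SCORE are assumed to be defined elsewhere:

// collect phrases found by a statistical chunker into a dictionary
MapDictionary<String> dictionary = new MapDictionary<String>();
for (String text : texts) {
    Chunking chunking = firstPassChunker.chunk(text);
    for (Chunk chunk : chunking.chunkSet()) {
        String phrase = text.substring(chunk.start(),chunk.end());
        dictionary.addEntry(new DictionaryEntry<String>(phrase,chunk.type(),CHUNK_SCORE));
    }
}

// reannotate the corpus with an exact-match pass over the harvested phrases
ExactDictionaryChunker secondPassChunker
    = new ExactDictionaryChunker(dictionary,
                                 IndoEuropeanTokenizerFactory.INSTANCE,
                                 true,true);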

The naive approach of using a disjunctive regular expression or a substring search quickly falls over in the face of even modestly sized dictionaries. LingPipe provides an implementation of the Aho-Corasick algorithm. The beauty of the Aho-Corasick algorithm is that it finds all matches against a dictionary in time linear in the input, independent of the size of the dictionary.

Demo Code

The code for the demo is at src/DictionaryChunker.java. The first step of the code simply creates a dictionary from some hard-coded entries:

static final double CHUNK_SCORE = 1.0;

public static void main(String[] args) {

    MapDictionary<String> dictionary = new MapDictionary<String>();
    dictionary.addEntry(new DictionaryEntry<String>("50 Cent","PERSON",CHUNK_SCORE));
    dictionary.addEntry(new DictionaryEntry<String>("XYZ120 DVD Player","DB_ID_1232",CHUNK_SCORE));
    dictionary.addEntry(new DictionaryEntry<String>("cent","MONETARY_UNIT",CHUNK_SCORE));
    dictionary.addEntry(new DictionaryEntry<String>("dvd player","PRODUCT",CHUNK_SCORE));

Note that in the DictionaryEntry constructor, the first argument is the phrase, the second string argument is the type, and the final double-precision floating point argument is the score for the chunk. Dictionary entries are themselves always case sensitive.

We set "DB_ID_1232" as the type for the phrase "XYZ120 DVD Player". This is a simple way to link database IDs to text mentions, but beware of ambiguity. Not all "John Smith"s are the same (see the sidebar on Times Topics). There is no limit to the number of different entity types in a dictionary. The scores will simply be passed along as chunk scores in the dictionary-based chunker.

The next step in the code builds the dictionary-based chunker:

    ExactDictionaryChunker dictionaryChunkerTT
        = new ExactDictionaryChunker(dictionary,
                                     IndoEuropeanTokenizerFactory.INSTANCE,
                                     true,true);

Note that the dictionary chunker is constructed from a dictionary, a tokenizer factory, and two flags. The tokenizer factory we pass in is an instance of the IndoEuropean tokenizer factory. Because our matching treats tokens as symbols, the chunker needs a tokenizer for matching. Whitespace will be ignored in the matching process. The first flag indicates whether or not all matches are found, even in cases where they overlap. The second flag indicates whether or not the chunker is case sensitive. Thus the dictionary chunker constructed above will find all chunks and will be case sensitive. Three other chunkers covering all the flag settings are constructed next:

    ExactDictionaryChunker dictionaryChunkerTF
        = new ExactDictionaryChunker(dictionary,
                                     IndoEuropeanTokenizerFactory.INSTANCE,
                                     true,false);

     ExactDictionaryChunker dictionaryChunkerFT
        = new ExactDictionaryChunker(dictionary,
                                     IndoEuropeanTokenizerFactory.INSTANCE,
                                     false,true);

    ExactDictionaryChunker dictionaryChunkerFF
        = new ExactDictionaryChunker(dictionary,
                                     IndoEuropeanTokenizerFactory.INSTANCE,
                                     false,false);

After the chunkers are created, the rest of the code simply runs through the command-line arguments and calls each chunker on the resulting text in turn.

    for (int i = 0; i < args.length; ++i) {
        String text = args[i];
        chunk(dictionaryChunkerTT,text);
        chunk(dictionaryChunkerTF,text);
        chunk(dictionaryChunkerFT,text);
        chunk(dictionaryChunkerFF,text);
    }

The chunk() method simply applies the chunker to the text and prints out the result.

static void chunk(ExactDictionaryChunker chunker, String text) {
    Chunking chunking = chunker.chunk(text);
    for (Chunk chunk : chunking.chunkSet()) {
        int start = chunk.start();
        int end = chunk.end();
        String type = chunk.type();
        double score = chunk.score();
        String phrase = text.substring(start,end);
        System.out.println("     phrase=|" + phrase + "|"
                           + " start=" + start
                           + " end=" + end
                           + " type=" + type
                           + " score=" + score);
    }
}

The example may be run using a pre-packaged Ant target, which we show below along with the results:

% cd $LINGPIPE/demos/tutorial/ne
% ant dictionary-chunker
DICTIONARY
[cent:MONETARY_UNIT 1.0, 50 Cent:PERSON 1.0, XYZ120 DVD Player:DB_ID_1232 1.0, dvd player:PRODUCT 1.0]


TEXT=50 Cent is hard to distinguish from 50 cent and just plain cent without case

Chunker. All matches=true Case sensitive=true
     phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
     phrase=|cent| start=39 end=43 type=MONETARY_UNIT score=1.0
     phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0

Chunker. All matches=true Case sensitive=false
     phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
     phrase=|Cent| start=3 end=7 type=MONETARY_UNIT score=1.0
     phrase=|50 cent| start=36 end=43 type=PERSON score=1.0
     phrase=|cent| start=39 end=43 type=MONETARY_UNIT score=1.0
     phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0

Chunker. All matches=false Case sensitive=true
     phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
     phrase=|cent| start=39 end=43 type=MONETARY_UNIT score=1.0
     phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0

Chunker. All matches=false Case sensitive=false
     phrase=|50 Cent| start=0 end=7 type=PERSON score=1.0
     phrase=|50 cent| start=36 end=43 type=PERSON score=1.0
     phrase=|cent| start=59 end=63 type=MONETARY_UNIT score=1.0


TEXT=The product xyz120 DVD player won't match unless it's exact like XYZ120 DVD Player.

Chunker. All matches=true Case sensitive=true
     phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0

Chunker. All matches=true Case sensitive=false
     phrase=|xyz120 DVD player| start=12 end=29 type=DB_ID_1232 score=1.0
     phrase=|DVD player| start=19 end=29 type=PRODUCT score=1.0
     phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0
     phrase=|DVD Player| start=72 end=82 type=PRODUCT score=1.0

Chunker. All matches=false Case sensitive=true
     phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0

Chunker. All matches=false Case sensitive=false
     phrase=|xyz120 DVD player| start=12 end=29 type=DB_ID_1232 score=1.0
     phrase=|XYZ120 DVD Player| start=65 end=82 type=DB_ID_1232 score=1.0

Take a moment to compare the different outputs based on the constructor flags. When case sensitivity is enabled, the dictionary entry "50 Cent" does not match the text "50 cent". When all matches are enabled without case sensitivity (the second example), the dictionary entry "cent" matches the second token of "50 Cent". Note that in this case, "50 Cent" and "Cent" are pulled out of the same text, with character spans (0,7) and (3,7) respectively.

Approximate Dictionary-Based Chunking

The class dict.ApproxDictionaryChunker implements a dictionary-based chunker that performs approximate matching. An approximate dictionary chunker is populated with a dictionary, the same way as an exact dictionary chunker. But when it is run, it doesn't just look for exact matches, but for all matches within a fixed weighted edit distance threshold.

Running the Demo

The approximate dictionary matching demo may be invoked from ant using the target approx-chunker:

% ant approx-chunker

Dictionary=[Mdm:Mdm 1.0, P53:P53 1.0, protein 53:P53 1.0]


 A protein called Mdm2 binds to p53 and transports it from the nucleus to the cytosol.

 Matched Phrase       Dict Entry   Distance
            p53              P53        1.0
           Mdm2              Mdm        1.0


 p53, also known as protein 53 (TP53), functions as a tumor supressor.

 Matched Phrase       Dict Entry   Distance
             53              P53        1.0
           p53,              P53        2.0
            p53              P53        1.0
          (TP53              P53        2.0
     protein 53              P53        0.0
           TP53              P53        1.0
          TP53)              P53        2.0
   protein 53 (              P53        2.0

The output first prints out the dictionary, in this case consisting of three entries with the phrases "Mdm", "P53", and "protein 53", whose categories are Mdm, P53 and P53 respectively. The numerical values are the default weights of 1.0, as dictionary entries are weighted in the interface.

Next comes the analysis of two sample sentences. Each is first printed, then all of the approximately matching substrings, along with their distances. Closer (lower distance) values are better matches. For instance, the dictionary phrase "p53" matched the substrings "53", "p53,", and "p53" in the first four characters of the input, with distances 1.0, 2.0, and 1.0 respectively.

Code Walk Through

The code for the demo is in a single main() method defined in src/ApproximateChunkerDemo.java. It starts by setting up a trie-based dictionary:

public static void main(String[] args) {
    DictionaryEntry<String> entry1
        = new DictionaryEntry<String>("P53", "P53");
    DictionaryEntry<String> entry2
        = new DictionaryEntry<String>("protein 53", "P53");
    DictionaryEntry<String> entry3
        = new DictionaryEntry<String>("Mdm","Mdm");
    TrieDictionary<String> dict = new TrieDictionary<String>();
    dict.addEntry(entry1);
    dict.addEntry(entry2);
    dict.addEntry(entry3);
    ...

Next, the chunker is created from a tokenizer factory and a weighted edit distance:

    TokenizerFactory tokenizerFactory
        = IndoEuropeanTokenizerFactory.INSTANCE;

    WeightedEditDistance editDistance
        = new FixedWeightEditDistance(0,-1,-1,-1,Double.NaN);

    double maxDistance = 2.0;

    ApproxDictionaryChunker chunker
        = new ApproxDictionaryChunker(dict,tokenizerFactory,
                                      editDistance,maxDistance);

The edit distance assigns a weight of 0 to matches, a weight of -1 to insertions, deletions and substitutions, and leaves the transposition weight unspecified (Double.NaN); transpositions are not used by the approximate dictionary chunker. These values may be changed in the fixed-weight edit distance implementation, or a custom edit distance may be implemented.
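As a minimal sketch of that last option, a custom distance can be defined by extending spell.WeightedEditDistance and supplying the five weight methods; the class name and the choice of making case-changing substitutions free are just for illustration:

public class CaseFreeEditDistance extends WeightedEditDistance {

    public double matchWeight(char cMatched) { return 0.0; }

    public double deleteWeight(char cDeleted) { return -1.0; }

    public double insertWeight(char cInserted) { return -1.0; }

    public double substituteWeight(char cDeleted, char cInserted) {
        // substituting a character for its other-case variant costs nothing
        return Character.toLowerCase(cDeleted) == Character.toLowerCase(cInserted)
            ? 0.0 : -1.0;
    }

    public double transposeWeight(char cFirst, char cSecond) {
        return Double.NaN; // transpositions are not used by the approximate chunker
    }
}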

The chunker itself is constructed with the dictionary, which must be a trie-backed dictionary, a tokenizer factory, a weighted edit distance, and critically, a maximum distance at which results are returned.

The tokenizer factory only controls possible chunk boundaries. All chunks must start at the first character of a token and must end at the last character of a token (offset by one, of course). By using an instance of tokenizer.CharacterTokenizerFactory, the token-sensitivity will be turned off. Note that even with the character tokenizer factory, chunks will still not start and end on whitespace.

Finally, the maximum distance parameter controls the maximum edit distance a returned chunk can be from a dictionary entry. In the case here, we set this to 2, meaning we get all matches within edit distance 2 of entries in the dictionary.

The rest of the code simply applies chunking and prints out the results:

    for (String text : args) {
        Chunking chunking = chunker.chunk(text);
        CharSequence cs = chunking.charSequence();
        Set<Chunk> chunkSet = chunking.chunkSet();
        for (Chunk chunk : chunkSet) {
            int start = chunk.start();
            int end = chunk.end();
            CharSequence str = cs.subSequence(start,end);
            double distance = chunk.score();
            String match = chunk.type();
            System.out.printf("%15s  %15s   %8.1f\n",
                              str, match, distance);
        }
    }

As with the other chunkers, we simply chunk the text, extract the underlying character sequence and set of chunks, then iterate over them extracting their start and end points, distances (which are scores in the chunk interface), and the matched dictionary entry, which is encoded on the chunk type.

In some applications, a non-overlapping set of results is required. This can be done by eliminating every chunk which overlaps a chunk of higher score, working left to right.
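A minimal sketch of that pruning strategy, using only the Chunk accessors shown above (the helper name and the java.util imports are ours; note that for the approximate chunker the score is a distance, so "better" would mean lower rather than higher):

static Set<Chunk> removeOverlaps(Chunking chunking) {
    List<Chunk> chunks = new ArrayList<Chunk>(chunking.chunkSet());
    // best score first; ties broken by left-most start
    Collections.sort(chunks, new Comparator<Chunk>() {
            public int compare(Chunk c1, Chunk c2) {
                if (c1.score() != c2.score())
                    return c1.score() > c2.score() ? -1 : 1;
                return c1.start() - c2.start();
            }
        });
    Set<Chunk> kept = new LinkedHashSet<Chunk>();
    for (Chunk candidate : chunks) {
        boolean overlaps = false;
        for (Chunk k : kept)
            if (candidate.start() < k.end() && k.start() < candidate.end())
                overlaps = true;
        if (!overlaps)
            kept.add(candidate);
    }
    return kept;
}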

Running a Statistical Named Entity Recognizer

Named entity recognition is particularly simple if a trained model already exists. Like our idol Julia Child, we have an already-baked cake waiting to show you. And like Julia, after we let you see the finished product, we'll show you how to bake it step by step.

First-Best Named Entity Chunking

The following program, src/RunChunker.java, provides a command that loads a named-entity recognizer as an instance of the Chunker interface and then applies it to the remaining command-line arguments, printing out the results.

public static void main(String[] args) throws Exception {
    File modelFile = new File(args[0]);

    System.out.println("Reading chunker from file=" + modelFile);
    Chunker chunker
        = (Chunker) AbstractExternalizable.readObject(modelFile);

    for (int i = 1; i < args.length; ++i) {
        Chunking chunking = chunker.chunk(args[i]);
        System.out.println("Chunking=" + chunking);
    }
}

The code may be run from an Ant task, using:

% ant run-genetag

or it may be run directly from Java, using:

% java -cp neDemo.jar:../../../lingpipe-3.9.3.jar RunChunker \
../../models/ne-en-bio-genetag.HmmChunker \
"p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rhabdomyosarcoma cells."

Windows note: Replace the colon (:) in the classpath with a semicolon (;) and put everything on one line without any backslashes.

The output is the following:

run-genetag:
     [java] Reading chunker from file=..\..\models\ne-en-bio-genetag.HmmChunker
     [java] Chunking=p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rha
bdomyosarcoma cells. : [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 81-92:GENE@-Infinity]

This output repeats the input sentence and then, following the colon, the list of chunks found. Here we found a chunk running from character position 0 (inclusive) to character position 3 (exclusive), with type GENE; this is right, because p53 is a gene. It also finds a gene from position 20 to 54, for the expression insulin-like growth factor II gene, and a final one from 81 to 92 for P4 promoter.

The model used by this chunker is ../../models/ne-en-bio-genetag.HmmChunker. Like our other models, this one is labeled by task (ne for named-entity recognition), language (en for English), genre (bio for biology) and corpus (genetag for the GENETAG corpus), and suffixed with the name of the class of the serialized object (HmmChunker for com.aliasi.chunk.HmmChunker).

N-Best Named Entity Chunking

Our next example runs n-best chunking. This means the chunker returns not only its first answer, but also alternative hypotheses, in order of their estimated likelihood, along with those likelihood estimates.

NBestChunker chunker
    = (NBestChunker) AbstractExternalizable.readObject(modelFile);
for (int i = 1; i < args.length; ++i) {
    char[] cs = args[i].toCharArray();
    Iterator<ScoredObject<Chunking>> it = chunker.nBest(cs,0,cs.length,MAX_N_BEST);
    System.out.println(args[i]);
    for (int n = 0; it.hasNext(); ++n) {
        ScoredObject<Chunking> so = it.next();
        double jointProb = so.score();
        Chunking chunking = so.getObject();
        System.out.println(n + " " + jointProb
                           +  " " + chunking.chunkSet());
    }
}

The n-best chunker is reconstituted in the same manner as the regular chunker. We then iterate over the arguments, and for each argument apply the method nBest(), which takes the command-line argument converted to a character slice along with a maximum number of results to return. The result is an iterator over instances of ScoredObject<Chunking>. The score is the log (base 2) joint probability of the input character sequence and the chunking; the object is the chunking itself.

N-best chunking can be run with the following Ant task:

% ant run-genetag-nbest

The results are provided below.

Reading chunker from file=..\..\models\ne-en-bio-genetag.HmmChunker
p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rhabdomyosarcoma cells.

0 -182.7325747972808 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 81-92:GENE@-Infinity]
1 -183.39842051103167 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity, 81-92:GENE@-Infinity]
2 -185.1133085819252 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 74-92:GENE@-Infinity]
3 -185.73246504074297 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity, 81-83:GENE@-Infinity]
4 -185.77915429567608 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity, 74-92:GENE@-Infinity]
5 -186.39831075449385 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity, 81-83:GENE@-Infinity]
6 -187.47968583511886 [0-3:GENE@-Infinity, 20-54:GENE@-Infinity]
7 -188.14553154886974 [0-3:GENE@-Infinity, 14-54:GENE@-Infinity]

The chunkings are ordered in descending (log base 2) joint probability. For instance, the first result has a joint estimate of -182.7, whereas the second has one of -183.4. By some simple math, we can see how much more likely the first analysis is than the second:

p1/p2 = 2 ** (log (p1/p2))
      = 2 ** (log p1 - log p2)
      = 2 ** (-182.7 - (-183.4))
      = 2 ** 0.7
      ~ 1.62

That is, the first analysis is only 1.62 times as likely as the second analysis. The first analysis is more than five times as likely as the third analysis, with a log probability of -185.1.

N-best analyses are particularly useful for long-distance rescoring systems. The class com.aliasi.chunk.RescoringChunker provides an abstract implementation of rescoring and com.aliasi.chunk.CharLmRescoringChunker provides a concrete implementation.

Confidence Named Entity Chunking

The final example runs chunking in such a way as to return chunks in order of confidence, which we take to be the probability of the chunk given the input text, P(chunk|text). The program is straightforward:

ConfidenceChunker chunker
    = (ConfidenceChunker) AbstractExternalizable.readObject(modelFile);
for (int i = 1; i < args.length; ++i) {
    char[] cs = args[i].toCharArray();
    Iterator<Chunk> it
      = chunker.nBestChunks(cs,0,cs.length,MAX_N_BEST_CHUNKS);
    System.out.println(args[i]);
    System.out.println("Rank          Conf      Span    Type     Phrase");
    for (int n = 0; it.hasNext(); ++n) {
        Chunk chunk = it.next();
        double conf = Math.pow(2.0,chunk.score());
        int start = chunk.start();
        int end = chunk.end();
        String phrase = args[i].substring(start,end);
        System.out.println(n + " "
                           + Strings.decimalFormat(conf,"0.0000",12)
                           + "       (" + start
                           + ", " + end
                           + ")       " + chunk.type()
                           + "         " + phrase);
     }
}

The program is mostly like the previous one. It begins by reconstituting a chunker, this time a ConfidenceChunker. Then we walk over the arguments, extracting a character array as before and providing it to the chunker, this time by calling the nBestChunks() method. With a confidence chunker, the result iterator is directly over chunks, so no cast is needed in the body of the iteration. We compute the confidence by exponentiating the log (base 2) score so that it is easier to read, then print the formatted confidence followed by the span, the type, and the phrase, which we extract as a substring of the argument.

Confidence chunking can be run with the following Ant task:

% ant run-genetag-conf

Running this program on the same input provides the following results

Reading chunker from file=..\..\models\ne-en-bio-genetag.HmmChunker
p53 regulates human insulin-like growth factor II gene expression through active P4 promoter in rhabdomyosarcoma cells.

Rank      Conf      Span    Type    Phrase
0       0.9999   (0, 3)   GENE     p53
1       0.7328   (81, 92)   GENE     P4 promoter
2       0.6055   (20, 54)   GENE     insulin-like growth factor II gene
3       0.3817   (14, 54)   GENE     human insulin-like growth factor II gene
4       0.1395   (74, 92)   GENE     active P4 promoter
5       0.0916   (81, 83)   GENE     P4
6       0.0088   (74, 83)   GENE     active P4
7       0.0070   (20, 49)   GENE     insulin-like growth factor II
8       0.0044   (14, 49)   GENE     human insulin-like growth factor II

The entity recognizer is 99.99% confident that p53 is a mention of a gene. It's only 73.28% confident about P4 promoter being a gene, and even less confident about insulin-like growth factor II gene. The list gets interesting when we see names that overlap, such as the fourth (index 3) result, which prefixes human to the third result, or the fifth result, which prefixes active to the second result. These confidences reflect the uncertainty of the recognizer.

We believe confidence-ranked results will be very useful for information extraction applications where recall and/or confidence in results is important. Rather than just counting instances of first-best results, a system may push the uncertainty of the gene annotator through later processes in a Bayesian fashion. In practice, the results are available for iteration.
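For instance, rather than incrementing a count of 1 for each first-best gene mention, a downstream process might add up chunk confidences to get expected mention counts. Here is a minimal sketch, assuming the chunker, text and MAX_N_BEST_CHUNKS variables from the program above:

// accumulate expected mention counts per phrase from chunk confidences
Map<String,Double> expectedCounts = new HashMap<String,Double>();
char[] cs = text.toCharArray();
Iterator<Chunk> it = chunker.nBestChunks(cs,0,cs.length,MAX_N_BEST_CHUNKS);
while (it.hasNext()) {
    Chunk chunk = it.next();
    double conf = Math.pow(2.0,chunk.score()); // log (base 2) prob to probability
    String phrase = text.substring(chunk.start(),chunk.end());
    Double soFar = expectedCounts.get(phrase);
    expectedCounts.put(phrase, conf + (soFar == null ? 0.0 : soFar));
}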

Training a Named Entity Recognizer

Download Training Data

The first requirement for training a named entity recognizer is gathering the data. For this demo, we use the GeneTag named entity data, provided by the United States National Center for Biotechnology Information (NCBI), part of the U.S. National Library of Medicine, and described in this paper:

The data itself may be downloaded from the following anonymous FTP server:

The data will then need to be untarred (and gunzipped). Assuming it has been downloaded to directory dataDir:

% cd dataDir
% tar -xzf medtag.tar.gz

The file necessary for training will then be at the path dataDir/medtag/genetag/genetag.tag; this is what we'll call GeneTagFilePath in the next section.

Running the Training Code

The following code, drawn from src/TrainGeneTag.java provides all you need to train a named entity recognizer based on the GeneTag corpus:

static final int MAX_N_GRAM = 8;
static final int NUM_CHARS = 256;
static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

public static void main(String[] args) throws IOException {
    File corpusFile = new File(args[0]);
    File modelFile = new File(args[1]);

    System.out.println("Setting up Chunker Estimator");
    TokenizerFactory factory
      = IndoEuropeanTokenizerFactory.INSTANCE;
    HmmCharLmEstimator hmmEstimator
      = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
    CharLmHmmChunker chunkerEstimator
      = new CharLmHmmChunker(factory,hmmEstimator);

    System.out.println("Setting up Data Parser");
    GeneTagParser parser = new GeneTagParser();
    parser.setHandler(chunkerEstimator);

    System.out.println("Training with Data from File=" + corpusFile);
    parser.parse(corpusFile);

    System.out.println("Compiling and Writing Model to File=" + modelFile);
    AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
}

The GeneTag corpus is provided in a single file, given as the first command-line argument. The second argument is just the name of the target file into which to write the model.

The first block of code initializes an estimator for a chunker. In particular, it initializes an instance of CharLmHmmChunker, which provides chunking based on a hidden Markov model (HMM) whose emissions are character language models. This particular chunker requires a tokenizer factory for breaking the input into tokens and a trainable HMM on which to base the model. For the former, we use our standard Indo-European tokenizer factory. For the HMM, we use an HmmCharLmEstimator, which estimates state emissions with character-sequence language models. In this case, we set the maximum n-gram to a constant (8), the number of characters (we assume 256), and the interpolation parameter for the language models (8.0).

The second block of code just defines the parser, in this case an instance of com.aliasi.corpus.parsers.GeneTagParser, which is the parser for the GeneTag corpus. The second line sets the handler for the parser to the chunker estimator. This means that chunkings extracted by the parser will be handed off to the chunker estimator for training. This pattern is just like that used for XML SAX parsing, with our parser playing the role of an org.xml.sax.XMLReader and our handler playing the role of an org.xml.sax.ContentHandler.

The next block of code just parses the corpus file specified in the command-line argument. The parsed chunkings are handed to the estimator for training. The final block of code just compiles the model to a file, using a utility method in com.aliasi.util.AbstractExternalizable.

This code can be compiled to neDemo.jar from Ant using the following target:

% ant jar

It may then be run with the Ant target:

% ant -Ddir.genetag=GeneTagDir train-genetag

where GeneTagDir is the path to the directory containing genetag.tag after unpacking, which will be at the relative path medtag/genetag from wherever the medtag data was unpacked.

This command may be run directly in Java from the directory $LINGPIPE/demos/tutorial/ne with:

% java -cp ../../../lingpipe-3.9.3.jar;neDemo.jar TrainGeneTag GeneTagDirPath ../../models/ne-en-bio-genetag.HmmChunker

Note that the resulting model is written into the standard location $LINGPIPE/demos/models.

To evaluate with the model we just built, use:

% ant -Ddir.genetag=GeneTagParentDir eval-genetag

This performs a cross-validation on the data. We discuss cross-validation in the section on Arabic NER below.

Evaluating a Named Entity Recognizer

The standard evaluation setup for a named-entity recognizer (and any other statistical component) involves partitioning the labeled data into training and test segments. If you're playing by old-fashioned hypothesis-testing statistical rules, or if you're competing in a bakeoff that only allows a single submission, you only get one shot at the test data. In the machine-learning community, it's not uncommon to see things like what we're about to do, which is cheat. In slightly more technical terms, we'll be reporting post-hoc results on a so-called development set. Luckily for our systems, the parameters are very stable, meaning that slight perturbations of the parameters only lead to slight changes in performance.

CoNLL Spanish Task

We'll use data for this tutorial that's available from the web. The Conference on Computational Natural Language Learning (CoNLL) is an annual meeting that sponsors a bakeoff. In 2002, the bakeoff involved Spanish and Dutch named-entity recognition. Here's a directory of the relevant CoNLL web pages:

An illustration of the encoding from the first few lines of the Spanish training data is:

El O
Abogado B-PER
General I-PER
del I-PER
Estado I-PER
, O
Daryl B-PER
Williams I-PER
, O

With this encoding scheme, the phrases Abogado General del Estado and Daryl Williams are coded as persons, with their beginning and continuing tokens picked out with the tags B-PER and I-PER respectively.
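To make the encoding concrete, here is a simplified sketch of decoding parallel token and tag arrays into half-open character spans, assuming tokens are rejoined with single spaces; LingPipe's CoNLL parser (described below) does this for you, so the method here is purely illustrative:

static void decodeBio(String[] tokens, String[] tags) {
    int pos = 0;          // character offset of the current token
    int chunkStart = -1;  // start of the open chunk, or -1 if none
    String chunkType = null;
    for (int i = 0; i < tokens.length; ++i) {
        boolean continues = tags[i].startsWith("I-") && chunkStart >= 0;
        if (!continues && chunkStart >= 0) {
            System.out.println(chunkStart + "-" + (pos-1) + ":" + chunkType);
            chunkStart = -1;
        }
        if (tags[i].startsWith("B-")) {
            chunkStart = pos;
            chunkType = tags[i].substring(2);
        }
        pos += tokens[i].length() + 1; // +1 for the single separating space
    }
    if (chunkStart >= 0)
        System.out.println(chunkStart + "-" + (pos-1) + ":" + chunkType);
}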

Unfortunately, there are several formatting errors in the data that must be fixed before our parsers can handle them.

After downloading the data, simply unarchive it. Assuming the compressed tarball ner.tgz has been downloaded to a directory conll2002:

% cd conll2002
% tar -xzf ner.tgz

You should then see the following files under conll2002:

You can further gunzip each of these, to produce:

These are the files we use in the evaluation.

Training the Recognizer

First, we need to train a recognizer using the Spanish data. For that, we have another program in src/TrainConll2002.java. The relevant details are shown below.

static final int NUM_CHUNKINGS_RESCORED = 64;
static final int MAX_N_GRAM = 12;
static final int NUM_CHARS = 256;
static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior

public static void main(String[] args) throws IOException {
    File modelFile = new File(args[0]);
    File trainFile = new File(args[1]);
    File devFile = new File(args[2]);

    TokenizerFactory factory
        = IndoEuropeanTokenizerFactory.INSTANCE;
    CharLmRescoringChunker chunkerEstimator
        = new CharLmRescoringChunker(factory,NUM_CHUNKINGS_RESCORED,
                                     MAX_N_GRAM,NUM_CHARS,
                                     LM_INTERPOLATION);

    Conll2002ChunkTagParser parser
        = new Conll2002ChunkTagParser();
    parser.setHandler(chunkerEstimator);

    parser.parse(trainFile);
    parser.parse(devFile);

    AbstractExternalizable.compileTo(chunkerEstimator,modelFile);
}

The most significant difference is that we use a different chunker estimator, this time a com.aliasi.chunk.CharLmRescoringChunker. This chunker provides a higher-level model that rescores the output of an underlying chunker, which is why the first static constant declares how many of the underlying n-best chunkings are rescored. The second notable difference is that it uses a tag parser designed for the CoNLL data format, the implementation of which can be found in src/Conll2002ChunkTagParser.java. The only other difference in this training scheme is that we train on two files, both the training data and the development data (both of which were available to participants in the bakeoff before the final evaluation).

Assuming that the CoNLL data was installed in the directory conll2002 as described above, the recognizer can be trained with the Ant target:

% ant -Ddata.conll2002.spanish=conll2002/ner/data train-conll2002-spanish

Test Program

The testing program is almost as simple as the run program, and can be found in src/TestConll2002Chunker.java, the contents of which minus the imports and print statements are reproduced here.

public static void main(String[] args) throws Exception {
    File chunkerFile = new File(args[0]);
    File testFile = new File(args[1]);


    AbstractCharLmRescoringChunker chunker
        = (AbstractCharLmRescoringChunker)
        AbstractExternalizable.readObject(chunkerFile);

    ChunkerEvaluator evaluator = new ChunkerEvaluator(chunker);
    evaluator.setVerbose(true);

    Conll2002ChunkTagParser parser
        = new Conll2002ChunkTagParser();
    parser.setHandler(evaluator);

    parser.parse(testFile);

    System.out.println(evaluator.toString());
}

We've simply set up an evaluator wrapping our chunker, and set it to provide verbose output (case-by-case results to standard output). We've provided an Ant task to run the evaluation:

% ant -Ddata.conll2002.spanish=conll2002/ner/data test-conll2002-spanish

Incremental Per-Sentence Results

Because we have set the evaluator to be verbose, it prints a report for each sentence it processes. Here's an example:

     CHUNKS=(65,77):MISC   (118,136):ORG
     Dicha concentración se hace para preparar el partido final de la Copa del Rey que disputará el próximo sábado ante el Atlético de Madrid .
     012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567
               1         2         3         4         5         6         7         8         9         0         1         2         3
                                                                                                         1         1         1         1

 REF                                                                  M...........                                         O.................
RESP                                                                  M...........                                         O.................


             CHUNKS=(65,77):MISC   (118,136):ORG
             Dicha concentración se hace para preparar el partido final de la Copa del Rey que disputará el próximo sábado ante el Atlético de Madrid .
             012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567
                       1         2         3         4         5         6         7         8         9         0         1         2         3
                                                                                                                 1         1         1         1
 REF                                                                          M...........                                         O.................
    0   -143                                                                  M...........                                         O.................
  -----------
    1   -156                                                                  M...........                                         O.......    L.....
    2   -158 O....                                                            M...........                                         O.................
    3   -159                                                               M..............                                         O.................
    4   -163                                                                  M...........                                         O.......    O.....
    5   -164                                                      M.......................                                         O.................
    6   -165                                                                  M...     M..                                         O.................
    7   -166                                                                  M........... P..                                     O.................
Correct Rank=0


     CHUNKS=(65,77):MISC   (118,136):ORG
     Dicha concentración se hace para preparar el partido final de la Copa del Rey que disputará el próximo sábado ante el Atlético de Madrid .
     012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567
               1         2         3         4         5         6         7         8         9         0         1         2         3
                                                                                                         1         1         1         1
TRUE  (65, 77): MISC  0.9999877514430212
TRUE  (118, 136): ORG  0.9998255148751076
false (118, 126): ORG  1.7448512066560313E-4
false (130, 136): LOC  1.7339828700790064E-4
false (0, 5): ORG  2.679282465485203E-5
false (62, 77): MISC  1.1394528562976028E-5
false (130, 136): ORG  1.062746657431117E-6
false (53, 77): MISC  4.978971914365618E-7

The report begins by printing the sentence and underlying chunk boundaries in the reference (truth). Below the reference, the system result is printed as the response.

Next, the n-best results are printed. These again begin with a printout of the input string and the reference tagging. Then, the n-best list is printed up to the size specified on the evaluator (8, in our case). A line is printed below the correct answer, if any, and the rank of the correct answer is printed at the bottom. In this case, our first-best answer (rank 0) was correct. Next to the rank, the score is provided. This is a joint probability estimate presented in log (base 2) form. Thus we see that our first answer at -143 is 2^13 (roughly 8000) times more likely than the second analysis at -156.

Finally, the confidence-based results are printed, again after the input is displayed. The chunks returned by the system are printed with their character offsets (begin inclusive, end exclusive), their type, and then their score. The score is the conditional probability estimate of the chunk given the input. Thus the system is 99.98% sure that the characters from 118 to 136 refer to an organization. This is almost certainly an overconfident estimate. The variety of chunker we're using for this demo tends to skew confidence far too much toward the top scorers, because it estimates confidence from the top results of the n-best list, which is itself very peaked in this example. The simpler chunker, com.aliasi.chunk.CharLmHmmChunker, provides more reliable confidence estimates than the chunker used here, the com.aliasi.chunk.CharLmRescoringChunker.

First-best Results

The run takes a little under a minute on my machine and produces the following first-best results:

FIRST-BEST EVAL
 Total=4307
 True Positive=2674
 False Negative=742
 False Positive=891
 True Negative=0
 Positive Reference=3416
 Positive Response=3565
 Negative Reference=891
 Negative Response=742
 Accuracy=0.6208497794288368
 Recall=0.7827868852459017
 Precision=0.7500701262272089
 Rejection Recall=0.0
 Rejection Precision=0.0
 F(1)=0.7660793582581292
 Fowlkes-Mallows=3489.70485858045
 Jaccard Coefficient=0.6208497794288368
 Yule's Q=-1.0
 Yule's Y=-1.0
 Reference Likelihood=0.7931274669143256
 Response Likelihood=0.8277223125145112
 Random Accuracy=0.6921288226373674
 Random Accuracy Unbiased=0.6927272243084177
 kappa=-0.23152230039570446
 kappa Unbiased=-0.23392064174187105
 kappa No Prevalence=0.24169955885767358
 chi Squared=233.8186156392983
 phi Squared=0.054288046352286574
 Accuracy Deviation=0.0073928430493272

The F(1) score of 76.61% would've tied LingPipe at fourth place overall (out of what would be 14 contestants if you include us). We'd have been in third place for recall and fifth place for precision. You can check out the CoNLL 2002 NER Task Home for the results (they're near the bottom). The best-scoring system was by the organizers, Xavier Carreras, Lluís Màrquez and Lluís Padró, which scored 81.39% F-measure using dictionaries; without dictionaries, their system scored 79.28%, only slightly better than the second-place system of Radu Florian at 79.05%, which didn't train on any external data. Links to papers describing all of the systems submitted to the bakeoff are available from the CoNLL 2002 NER Task Home.

The organizers computed per-system 95% confidence intervals (using bootstrap resampling) at roughly 1.5%, making systems two and three at 79.1+/-1.5% not significantly different from ours at 76.6+/-1.5%.

Now is the time to recall what we said about cheating. The bakeoff contestants submitted their system runs once and got a score back. We took the default LingPipe settings for Western languages -- 12-grams with a 12.0 interpolation ratio and no pruning. We also took the default of rescoring 128 sentences. We did some post-hoc tests and found if we rescore 64 results instead of 128, F-measure drops a fairly substantial 0.3%, whereas if we rescore 256 results instead of 128, F-measure drops a fairly trivial 0.03%. Rescoring more results doesn't seem to matter much in terms of scores.

N-best Results

Our chunking evaluator also returns a report on evaluating n-best results. It returns these results as a histogram mapping rank to number of test cases where the reference result was that rank, with -1 being assigned to cases whose reference result was not in the n-best list. For the settings above, the n-best results are (reformatted into three columns):

N-BEST EVAL
0=969
-1=173        15=4          27=1
1=140         14=3          29=1
2=53          20=3          30=1
3=25          21=3          31=1
4=20          26=3          34=1
5=15          32=3          38=1
6=14          12=2          39=1
7=11          16=2          41=1
8=10          19=2          44=1
11=8          23=2          48=1
18=7          24=2          49=1
9=6           28=2          50=1
10=5          36=2          57=1
17=5          45=2          67=1
13=4          22=1          76=1
                            83=1


This report indicates there are 969 cases for which the system's first-best answer (rank 0) was correct. There were 140 cases when the second-best answer (rank 1) was correct, and so on. Note that the count for -1, namely 173, is the count for the number of cases for which the reference chunking was beyond the 128th-best response chunking.

Confidence Results

The final piece of the report is the confidence report. This provides a view of the results that would be obtained by returning chunks according to their global confidence, without concern about overlap, and with a given threshold in place.

CONFIDENCE EVALUATION  
  Area Under PR Curve (interpolated)=0.7344955594638413
  Area Under PR Curve (uninterpolated)=0.7316142729153922
  Area Under ROC Curve (interpolated)=0.8111854024938068
  Area Under ROC Curve (uninterpolated)=0.8111854024938073
  Average Precision=0.8628569727835927
  Maximum F(1) Measure=0.756586983074507
  BEP (Precision-Recall break even point)=0.7508207485226527
  Reciprocal Rank=1.0

The average precision and area under the PR/ROC curves are the standard reports from ranked systems, as used in TREC. See the class documentation for ScoredPrecisionRecallEvaluation for more details and examples of how these are computed and what they mean. The most significant one is perhaps the maximum F(1) measure, as that indicates the maximum score possible by setting a threshold on confidence results and returning all chunks above that confidence. Here, as in many other cases we've evaluated, it's very close to the first-best F(1) measure.
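As a sketch, such a threshold is straightforward to apply on top of the confidence chunker from the earlier demo; the method name and threshold parameter are ours:

static List<Chunk> chunksAboveThreshold(ConfidenceChunker chunker,
                                        String text,
                                        int maxChunks,
                                        double minConf) {
    List<Chunk> result = new ArrayList<Chunk>();
    char[] cs = text.toCharArray();
    Iterator<Chunk> it = chunker.nBestChunks(cs,0,cs.length,maxChunks);
    while (it.hasNext()) {
        Chunk chunk = it.next();
        if (Math.pow(2.0,chunk.score()) >= minConf) // score is log (base 2) P(chunk|text)
            result.add(chunk);
    }
    return result;
}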

Named Entities in Arabic

LingPipe's basic approach is portable across languages. All that needs to change is sentence detection and tokenization. We've successfully built named-entity recognizers for English, Spanish, Dutch, German, French, Hindi, Chinese, and Arabic.

The ANER Corpus

In this section, we use Yassine Benajiba's ANER corpus for Arabic, which consists of roughly 150K tokens tagged in CoNLL column format for person, location, organization and miscellaneous entity mention types (MISC, PERS, ORG, LOC).

The corpus is available from:

Benajiba provides more details on the corpus and an evaluation of his logistic regression plus part-of-speech based system in the following two papers (the earlier one gives more details about the corpus):

Download the data into a directory $ANER and unpack the top-level zip files, and if you are going to use the dictionaries, unpack the dictionary level gzip files. You should wind up with a directory structure with the following structure, where ANERCorp is the CoNLL-formatted entity corpus and the other files are gazetteers with one entry per line:

$ANER/ANERCorp
$ANER/ANERGazet/AlignedLocGazetteer
$ANER/ANERGazet/AlignedOrgGazetteer
$ANER/ANERGazet/AlignedPersGazetteer

The data was collected from a variety of sources:

Source                               Ratio
http://www.aljazeera.net/            34.8%
Other newspapers and magazines       17.8%
http://www.raya.com/                 15.5%
http://ar.wikipedia.org/              6.6%
http://www.alalam.ma/                 5.4%
http://www.ahram.eg.org/              5.4%
http://www.alittihad.ae/              3.5%
http://www.bbc.co.uk/arabic/          3.5%
http://arabic.cnn.com/                2.8%
http://www.addustour.com/             2.8%
http://kassioun.org/                  1.9%

The contents of the labeled training file looks like:

فرانكفورت B-LOC
(د O
ب O
أ) O
أعلن O
اتحاد B-ORG
صناعة I-ORG
السيارات I-ORG
في O
ألمانيا B-LOC
...

Arabic words are typically printed right-to-left, and HTML lets you specify the direction with the cascading style sheets (CSS) attribute style="direction:rtl" on the pre element. The result looks better if you read Arabic:

فرانكفورت B-LOC
(د O
ب O
أ) O
أعلن O
اتحاد B-ORG
صناعة I-ORG
السيارات I-ORG
في O
ألمانيا B-LOC
...

The dictionaries (also known as gazetteers when involving locations) are just lists. For instance, the first few lines of the location dictionary are:

====
اسيا
====
اردن
الاردن
****
عجلون
عمان
اربد
الرمثا
الشوبك
جرش
زرقاء
...

We just trained each line as an entry, including all the punctuation-only lines, which we probably should have stripped out; in any case, the dictionary didn't help accuracy.

Data Munging

Automatic Sentence Chunking

We chunked the corpus into sentences, using periods, exclamation points and question marks as sentence breaks. We made sure not to split any entities across sentence boundaries in the process.

Tokenization

Because the CoNLL data format removes whitespace information, we reinserted single spaces between all tokens, which let us use a simple whitespace-based tokenizer.

Clearly for a real application, a realistic tokenizer and sentence detector would be required. Breaking on spaces and punctuation might work OK.
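A rough sketch of the sentence-splitting step, working over parallel token and tag lists and rejoining tokens with single spaces (the method name is ours, and the entity check is the simplification described above):

static List<String> toSentences(List<String> tokens, List<String> tags) {
    List<String> sentences = new ArrayList<String>();
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < tokens.size(); ++i) {
        String tok = tokens.get(i);
        if (sb.length() > 0) sb.append(' ');
        sb.append(tok);
        boolean sentenceFinal
            = ".".equals(tok) || "!".equals(tok) || "?".equals(tok);
        // only break on punctuation that falls outside any entity
        if (sentenceFinal && "O".equals(tags.get(i))) {
            sentences.add(sb.toString());
            sb.setLength(0);
        }
    }
    if (sb.length() > 0)
        sentences.add(sb.toString());
    return sentences;
}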

Cross-Validating Arabic

To run cross-validation on the corpus, set the system property aner.dir to the top level directory $ANER:

> ant -Daner.dir=e:\data\ANERCorp\unpacked -Daner.dict=false -Daner.misc=true -Daner.ngram=8 -Daner.rescored=512 aner

the first output of which is a dump of the parameters:

Input Files
    NE Corpus: E:\data\ANERCorp\unpacked\ANERCorp
    Location Gazetteer: E:\data\ANERCorp\unpacked\ANERGazet\AlignedLocGazetteer
    Organization Gazetteer: E:\data\ANERCorp\unpacked\ANERGazet\AlignedOrgGazetteer
    Person Gazetteer: E:\data\ANERCorp\unpacked\ANERGazet\AlignedPersGazetteer

Parameters
   N-Gram=8
   Num chars=1024
   Interpolation Ratio=8.0
   Including MISC entity type=true
   Use dictionary=false


Corpus Statistics
    Location Dict Entries=2183
    Organization Dict Entries=403
    Person Dict Entries=2309
    # sentences=4900
    # tokens=150286

...

The other system properties determine whether to use a dictionary, whether to include the miscellaneous category in training and evaluation, and the n-gram length. There are a few more system properties for tweaking finer-grained parameters; see the aner target in build.xml and the source file for details.

LingPipe's results are very similar to what Yassine achieved with a logistic-regression classifier on top of part-of-speech tagged and morphologically analyzed input.

Finally, note that these results are without a dictionary. Adding a dictionary improved recall, but hurt precision, with a net result of a lower F measure. Many of the dictionary entries look similar, so I'm thinking this is an issue with not understanding the dictionary's structured format (there are no hints in the paper).

The Code

The source code for parsing the corpus, training, and cross-validating is in src/ANERXVal.java. The only point at which this code departs from previous examples is in allowing dictionary-based training. With either an HMM chunker or HMM rescoring chunker, dictionaries may be used to train the categories. Here's the relevant code snippet:

static final String LOC_TAG = "LOC";
...
String[] locDict
    = FileLineReader.readLineArray(locGazFile,Strings.UTF8);
...
CharLmRescoringChunker chunker = ...;
...
for (String locName : locDict)
    if (locName.length() > 0)
        chunker.trainDictionary(locName,LOC_TAG);

All that is required is that the locations be fed into the trainDictionary() method. This method is not abstracted into an interface, but is available for both the plain HMM chunker chunk.CharLmHmmChunker and the longer-distance rescoring version chunk.CharLmRescoringChunker.

The Results

Here's the rest of the output:

Input Files
    NE Corpus: C:\carp\data\aner\ANERCorp
    Location Gazetteer: C:\carp\data\aner\ANERGazet\AlignedLocGazetteer
    Organization Gazetteer: C:\carp\data\aner\ANERGazet\AlignedOrgGazetteer
    Person Gazetteer: C:\carp\data\aner\ANERGazet\AlignedPersGazetteer

Parameters
   N-Gram=8
   Num chars=1024
   Interpolation Ratio=8.0
   Number of Analyses Rescored=512
   Including MISC entity type=true
   Use dictionary=false


Corpus Statistics
    Location Dict Entries=2183
    Organization Dict Entries=403
    Person Dict Entries=2309
    # sentences=4888
    # tokens=150285

NE Types Found=[MISC, PERS, ORG, LOC]

-----------------------------------------------
FOLD= 5  start sent= 4073  end sent= 4888
training labeled data
compiling
evaluating
    FOLD=5       LOC P=0.754 R=0.877 F=0.811
    FOLD=5      PERS P=0.670 R=0.624 F=0.646
    FOLD=5       ORG P=0.625 R=0.504 F=0.558
    FOLD=5      MISC P=0.656 R=0.417 F=0.510

    FOLD=5  COMBINED P=0.690 R=0.648 F=0.668
-----------------------------------------------
FOLD= 4  start sent= 3258  end sent= 4073
training labeled data
compiling
evaluating
    FOLD=4       LOC P=0.769 R=0.835 F=0.801
    FOLD=4      PERS P=0.644 R=0.711 F=0.676
    FOLD=4       ORG P=0.717 R=0.507 F=0.594
    FOLD=4      MISC P=0.610 R=0.413 F=0.493

    FOLD=4  COMBINED P=0.705 R=0.666 F=0.685
-----------------------------------------------
FOLD= 3  start sent= 2444  end sent= 3258
training labeled data
compiling
evaluating
    FOLD=3       LOC P=0.853 R=0.834 F=0.843
    FOLD=3      PERS P=0.714 R=0.690 F=0.702
    FOLD=3       ORG P=0.694 R=0.611 F=0.650
    FOLD=3      MISC P=0.686 R=0.484 F=0.568

    FOLD=3  COMBINED P=0.765 R=0.710 F=0.736
-----------------------------------------------
FOLD= 2  start sent= 1629  end sent= 2444
training labeled data
compiling
evaluating
    FOLD=2       LOC P=0.724 R=0.674 F=0.698
    FOLD=2      PERS P=0.478 R=0.594 F=0.530
    FOLD=2       ORG P=0.507 R=0.565 F=0.535
    FOLD=2      MISC P=0.398 R=0.385 F=0.391

    FOLD=2  COMBINED P=0.585 R=0.608 F=0.596
-----------------------------------------------
FOLD= 1  start sent=  814  end sent= 1629
training labeled data
compiling
evaluating
    FOLD=1       LOC P=0.818 R=0.829 F=0.823
    FOLD=1      PERS P=0.657 R=0.660 F=0.658
    FOLD=1       ORG P=0.512 R=0.502 F=0.507
    FOLD=1      MISC P=0.481 R=0.388 F=0.430

    FOLD=1  COMBINED P=0.690 R=0.680 F=0.685
-----------------------------------------------
FOLD= 0  start sent=    0  end sent=  814
training labeled data
compiling
evaluating
    FOLD=0       LOC P=0.811 R=0.750 F=0.779
    FOLD=0      PERS P=0.638 R=0.663 F=0.650
    FOLD=0       ORG P=0.534 R=0.482 F=0.507
    FOLD=0      MISC P=0.529 R=0.451 F=0.487

    FOLD=0  COMBINED P=0.698 R=0.662 F=0.679

===============================================
COMBINED CROSS-VALIDATION RESULTS
     X-Val       LOC P=0.788 R=0.789 F=0.789
     X-Val      PERS P=0.638 R=0.658 F=0.648
     X-Val       ORG P=0.612 R=0.527 F=0.566
     X-Val      MISC P=0.557 R=0.426 F=0.483

     X-Val  COMBINED P=0.689 R=0.663 F=0.676

The results reported by Benajiba for his ANER 2.0 system in the paper above (p. 1821) are:

LOC    P=0.917  R=0.822 F=0.867
MISC   P=0.723  R=0.557 F=0.630
ORG    P=0.480  R=0.450 F=0.464
PERS   P=0.563  R=0.486 F=0.521

TOTAL  P=0.702  R=0.621 F=0.659

He only evaluated on one fold, but I'm not sure which one, or even whether it corresponded to one of our folds. Fold 5 (printed first) is the last 25K tokens of the file.

As with the other applications of character LMs to chunking, the Arabic named-entity recognizer is not particularly sensitive to the parameters used for training.

Which Named Entity Recognizer?

LingPipe actually provides three generic, trainable chunkers which may be used for named-entity recognizers, all in the package com.aliasi.chunk. These are:

Scoring Files versus Files

In this section, we show how to use LingPipe's chunking evaluation framework and chunking parser framework to score a chunking in one file against another. It's easy to extend to multiple parallel files.

Sample File Scoring Data

We've provided MUC-6 formatted reference and response files in the demo data directory supplied with the distribution:

Running File Scoring

We've also provided an ant task through which file scoring may be run:

% cd $LINGPIPE/demos/tutorial/ne
% ant eval-files-muc6

This produces the following output:

EVALUATION RESULT
  Total=7
  True Positive=2
  False Negative=2
  False Positive=3
  True Negative=0
  Positive Reference=4
  Positive Response=5
  Negative Reference=3
  Negative Response=2
  Accuracy=0.2857142857142857
  Recall=0.5
  Precision=0.4
  Rejection Recall=0.0
  Rejection Precision=0.0
  F(1)=0.4444444444444445
  Fowlkes-Mallows=4.47213595499958
  Jaccard Coefficient=0.2857142857142857
  Yule's Q=-1.0
  Yule's Y=-1.0
  Reference Likelihood=0.5714285714285714
  Response Likelihood=0.7142857142857143
  Random Accuracy=0.5306122448979592
  Random Accuracy Unbiased=0.5408163265306122
  kappa=-0.5217391304347827
  kappa Unbiased=-0.5555555555555554
  kappa No Prevalence=-0.4285714285714286
  chi Squared=2.1
  phi Squared=0.3
  Accuracy Deviation=0.17074694419062766

The File Scoring Code

The file scorer code is found in src/FileScorer.java. Tracing from the top-level main method, which is called with the reference and response file paths as arguments:

public static void main(String[] args) throws IOException {
    File refFile = new File(args[0]);
    File responseFile = new File(args[1]);

    Parser parser = new Muc6ChunkParser();
    FileScorer scorer = new FileScorer(parser);
    scorer.score(refFile,responseFile);

    System.out.println(scorer.evaluation().toString());
}

First, the files are created from the command-line arguments. Then, we create a parser for the data and use it to construct the file scorer. The scorer is used to score the reference file against the response file and the result is printed out. To generalize this code to handle multiple files, the arguments would be directories which would be walked in parallel, with each pair of reference and response files being given to the scorer via its score method.
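A sketch of that generalization, assuming the two directories contain identically named reference and response files:

public static void scoreDirectories(FileScorer scorer,
                                    File refDir,
                                    File responseDir) throws IOException {
    for (File refFile : refDir.listFiles()) {
        // pair each reference file with the same-named response file
        File responseFile = new File(responseDir,refFile.getName());
        scorer.score(refFile,responseFile);
    }
    // the scorer's evaluation now accumulates results over all the pairs
}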

The file scorer's constructor simply sets the parser; the chunking evaluation is created directly as a member variable.

private final Parser mParser;

private final ChunkingEvaluation mEvaluation
    = new ChunkingEvaluation();

public FileScorer(Parser parser) {
    mParser = parser;
}

The remaining method does all the work of parsing and adding the cases to the evaluation.

public void score(File refFile, File responseFile) throws IOException {
    ChunkingCollector refCollector = new ChunkingCollector();
    mParser.setHandler(refCollector);
    mParser.parse(refFile);
    List<Chunking> refChunkings = refCollector.mChunkingList;

    ChunkingCollector responseCollector = new ChunkingCollector();
    mParser.setHandler(responseCollector);
    mParser.parse(responseFile);
    List<Chunking> responseChunkings = responseCollector.mChunkingList;

    if (refChunkings.size() != responseChunkings.size())
        throw new IllegalArgumentException("chunkings not same size");

    for (int i = 0; i < refChunkings.size(); ++i)
        mEvaluation.addCase((Chunking) refChunkings.get(i),
                            (Chunking) responseChunkings.get(i));
}

It relies on the following static class to collect the chunks and store them in order in a list.

private static class ChunkingCollector 
    implements ObjectHandler<Chunking> {

    private final List<Chunking> mChunkingList 
        = new ArrayList<Chunking>();
    public void handle(Chunking chunking) {
        mChunkingList.add(chunking);
    }
}

The appropriate collector is set as the handler for the parser before the file is parsed. The collector receives all of the chunkings found by the parser through its handle(Chunking) method.

Next, the chunkings are retrieved as lists; if they are not the same length, an exception is thrown. To avoid this exception, the files must be formatted to match in terms of the number of chunkings. For instance, the MUC-6 parser used in the demo breaks on sentence elements, with each sentence corresponding to a chunking. The addCase method further checks that the chunkings are over the same strings. Try adding a space or changing the spelling in just the response file and check out the exception that's thrown.

The final step is to simply loop over the chunkings and add each as a case to the evaluation.

Chinese and Beyond

New Data Formats

All that's needed to handle a new data format is a new parser for it. Alternatively, the data can be transcoded into a format that LingPipe already recognizes. An example of a parser can be found in src/Conll2002ChunkTagParser.java. Everything else stays the same, and slight variants of the programs in this demo may be used to build models and evaluate them.
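For example, if data arrived in a hypothetical tab-separated token/tag format, a few lines of transcoding into the space-separated CoNLL-style format shown earlier would let the existing parser handle it. This is only an illustrative sketch, not part of the demo:

import java.io.*;

public class TabToConllTranscoder {

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new FileInputStream(args[0]),"UTF-8"));
        PrintWriter out = new PrintWriter(
            new OutputStreamWriter(new FileOutputStream(args[1]),"UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.length() == 0) {
                out.println();  // preserve blank lines as sentence breaks
            } else {
                String[] fields = line.split("\t");
                out.println(fields[0] + " " + fields[1]); // token, space, tag
            }
        }
        in.close();
        out.close();
    }
}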

Chinese Named Entities in the Sandbox

We entered LingPipe into the 2006 SIGHAN Chinese named-entity (and word segmentation) bakeoff. Our entry had an F-measure of around 82%, leaving it 3.5% behind the best named-entity recognition entry in F-measure. Full details, including all of the code used to bake our entry, can be found in:

In addition to a parser for the data format (which is basically just like CoNLL's), we also used a different tokenizer, in this case, one that treats every character as a token.