About the Named Entity Demo

Named entity recognition finds mentions of named things, such as people, locations, organizations or genes, in text. LingPipe's interface returns the mentions as chunkings, which identify each mention by its character offsets in the text.
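
As a rough illustration of what a character-offset representation looks like, the sketch below models a mention as a start offset, an end offset and a type over the original string. The Chunk class, the sample sentence and its offsets are hypothetical and written in Python for brevity; they are not LingPipe's Java API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    """A mention covering text[start:end] with an entity type (hypothetical model)."""
    start: int  # offset of the first character
    end: int    # offset one past the last character
    type: str

def mentions(text, chunks):
    """Return (substring, type) pairs for each chunk, ordered by start offset."""
    ordered = sorted(chunks, key=lambda c: c.start)
    return [(text[c.start:c.end], c.type) for c in ordered]

text = "John Smith lives in Washington."
chunking = [Chunk(20, 30, "LOCATION"), Chunk(0, 10, "PERSON")]
print(mentions(text, chunking))
```

Keeping offsets rather than copied strings means the original text is never altered, and overlapping or nested mentions can be compared by position.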

Genre-Specific Models

Named entity recognizers in LingPipe are trained from a corpus of data. The examples below extract mentions of people, locations or organizations in English news texts, and mentions of genes and other biological entities of interest in biomedical research literature.

Language-Specific Models

Although we're only providing English data here, there is training data available (usually for research purposes only) in a number of languages, including Arabic, Chinese, Dutch, German, Greek, Hindi, Japanese, Korean, Portuguese and Spanish. Many of these training sets may be purchased for commercial applications. There are additional biology-based corpora, most of which are available with unrestricted licensing.

LingPipe's Recognizers

LingPipe provides three statistical named-entity recognizers:

com.aliasi.chunk.        Size         1st-best           n-best             confidence
                                      speed   accuracy   speed   accuracy   speed    accuracy
TokenShapeChunker        small        fast    medium     n/a     n/a        n/a      n/a
CharLmHmmChunker         medium       fast    low        medium  medium     slow     high
CharLmRescoringChunker   very large   slow    high       slower  high       slowest  low

Sentence Annotation Included

The demos use the appropriate sentence models. See the Sentence Demo for more information.

Named Entity XML Markup

First-best output

Entities are marked as in MUC, with an ENAMEX element with attribute TYPE indicating the kind of entity.
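
A minimal sketch of reading this markup back into typed mentions, written in Python with the standard library. The sample string and its output wrapper element are made up for illustration; only the ENAMEX element and its TYPE attribute come from the description above, and real demo output may nest them inside other elements such as sentence markup.

```python
import xml.etree.ElementTree as ET

def enamex_mentions(xml_text):
    """Return (entity text, TYPE) pairs for every ENAMEX element, in document order."""
    root = ET.fromstring(xml_text)
    return [("".join(e.itertext()), e.get("TYPE")) for e in root.iter("ENAMEX")]

# Hypothetical first-best output; the <output> wrapper is an assumption.
sample = ('<output><ENAMEX TYPE="PERSON">John Smith</ENAMEX> lives in '
          '<ENAMEX TYPE="LOCATION">Washington</ENAMEX>.</output>')
print(enamex_mentions(sample))
```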

N-best output

Each analysis is marked with an analysis element, with attribute jointLog2Prob providing the joint log (base 2) probability of the analysis, and rank providing the rank on the n-best list (numbering from zero), e.g. <analysis jointLog2Prob="-39.9" rank="5">. Within each analysis, tokens are tagged as for first-best output.
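
Because jointLog2Prob is a base-2 logarithm, the joint probability is recovered by raising 2 to that value. The Python sketch below reads the two attributes from the example element (closed here so that it parses on its own) and also shows one common way to compare analyses by renormalizing over just the listed entries. The renormalization is an illustrative assumption, not something the demo itself reports.

```python
import xml.etree.ElementTree as ET

def analysis_info(xml_text):
    """Return (rank, joint probability) from a single analysis element."""
    elt = ET.fromstring(xml_text)
    log2_prob = float(elt.get("jointLog2Prob"))
    return int(elt.get("rank")), 2.0 ** log2_prob

def relative_weights(log2_probs):
    """Renormalize joint log2 probabilities over just the listed analyses."""
    top = max(log2_probs)
    # Subtract the maximum first so the powers of two stay in a safe range.
    scaled = [2.0 ** (lp - top) for lp in log2_probs]
    total = sum(scaled)
    return [s / total for s in scaled]

rank, prob = analysis_info('<analysis jointLog2Prob="-39.9" rank="5"></analysis>')
print(rank, prob)
print(relative_weights([-38.9, -39.9]))
```

An analysis whose jointLog2Prob is one unit lower is half as probable, so in the second call the first analysis receives about two thirds of the weight.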

Per tag confidence output

Each token and its analyses are wrapped with an nBestEntities element. Following the text content is a sequence of ENAMEX elements, marking types, conditional probabilities of the entity given the text, and start/end position markers, as well as confidence rank and the text of the entity.

Named Entity Demo on the Web

The demos are hosted on the web at the following URLs:

English News: MUC6 Corpus (CharLmRescoringChunker)

http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_news_muc6/textInput.html

English Biomedical Text: GeneTag Corpus (CharLmHmmChunker)

http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_bio_genetag/textInput.html

English Biomedical Text: GENIA Corpus (TokenShapeChunker)

http://lingpipe-demos.com:8080/lingpipe-demos/ne_en_bio_genia/textInput.html

For detailed information about using web demos, including web form, file upload and web service instructions, see the web demo instructions.

Named Entity Demo via GUI

To launch the demo in a GUI, first change directories to the command directory and then invoke the demo batch script. Note: Parameters are set in the GUI, not as arguments to the launch script.

Windows Operating System

English News: MUC6 Corpus (CharLmRescoringChunker)

> cd %LINGPIPE_HOME%\demos\generic\bin
> gui_ne_en_news_muc6.bat 

English Biomedical: GeneTag Corpus (CharLmHmmChunker)

> cd %LINGPIPE_HOME%\demos\generic\bin
> gui_ne_en_bio_genetag.bat 

English Biomedical: GENIA Corpus (TokenShapeChunker)

> cd %LINGPIPE_HOME%\demos\generic\bin
> gui_ne_en_bio_genia.bat 

Unix-like Operating Systems

English News: MUC6 Corpus (CharLmRescoringChunker)

> cd $LINGPIPE_HOME/demos/generic/bin
> sh gui_ne_en_news_muc6.sh

English Biomedical: GeneTag Corpus (CharLmHmmChunker)

> cd $LINGPIPE_HOME/demos/generic/bin
> sh gui_ne_en_bio_genetag.sh

English Biomedical: GENIA Corpus (TokenShapeChunker)

> cd $LINGPIPE_HOME/demos/generic/bin
> sh gui_ne_en_bio_genia.sh

For detailed information about running demos in a GUI, see the GUI demo instructions.

Named Entity Demo via Shell Command

Shell commands may be run over single files, all of the files in a directory, or using standard input/output.

Running over a Directory

English News: MUC6 Corpus (CharLmRescoringChunker)

> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_news_muc6.bat -inDir=../../data/testdir -outDir=/testout

English Biomedical: GeneTag Corpus (CharLmHmmChunker)

> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_bio_genetag.bat -inDir=../../data/testdir -outDir=/testout

English Biomedical: GENIA Corpus (TokenShapeChunker)

> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_bio_genia.bat -inDir=../../data/testdir -outDir=/testout

Running a Single File

English News: MUC6 Corpus (CharLmRescoringChunker)

> cd $LINGPIPE/demos/generic/bin
> cmd_ne_en_news_muc6.bat -inFile=../../data/testdir/foo.txt -outFile=foo.out.xml

The other genres are handled the same way, with different suffixes in place of news_muc6.

Running through a Pipe (Standard input/output)

English News: MUC6 Corpus (CharLmRescoringChunker)

> cd demos/generic/bin
> echo See Spot. See Spot run. | cmd_ne_en_news_muc6.bat

The other genres are handled the same way, with different suffixes in place of news_muc6.

Running in Unix-like Operating Systems

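For Unix-like operating systems such as Unix, Solaris, Linux, or Macintosh OS X, the commands are the same, with the corresponding .sh scripts run through sh in place of the .bat scripts.

For detailed information about running demos from the command line, see the command line demo instructions.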

Named Entity Demo Scripts

The following scripts are available in $LINGPIPE/demos/generic/bin for running the demo. Note that each script comes in four flavors, distinguishing command line from GUI, and the Windows DOS shell from the Unix shell.

Language  Genre       Corpus   Mode     Windows DOS                Unix/Linux/Mac sh
English   General     MUC 6    Command  cmd_ne_en_news_muc6.bat    cmd_ne_en_news_muc6.sh
English   General     MUC 6    GUI      gui_ne_en_news_muc6.bat    gui_ne_en_news_muc6.sh
English   Biomedical  GeneTag  Command  cmd_ne_en_bio_genetag.bat  cmd_ne_en_bio_genetag.sh
English   Biomedical  GeneTag  GUI      gui_ne_en_bio_genetag.bat  gui_ne_en_bio_genetag.sh
English   Biomedical  GENIA    Command  cmd_ne_en_bio_genia.bat    cmd_ne_en_bio_genia.sh
English   Biomedical  GENIA    GUI      gui_ne_en_bio_genia.bat    gui_ne_en_bio_genia.sh
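
The script names in the table follow a visible pattern: a mode prefix (cmd or gui), then ne, the language code, the genre and corpus, and a shell-specific extension. The Python helper below is a hypothetical convenience that rebuilds those names from the table; it is not shipped with LingPipe.

```python
# Genre/corpus suffixes taken from the script table above.
SUFFIXES = {"news_muc6", "bio_genetag", "bio_genia"}

def script_name(suffix, mode="cmd", windows=False):
    """Build a demo script file name such as cmd_ne_en_news_muc6.sh."""
    if suffix not in SUFFIXES:
        raise ValueError("unknown genre/corpus suffix: " + suffix)
    if mode not in ("cmd", "gui"):
        raise ValueError("mode must be 'cmd' or 'gui'")
    ext = "bat" if windows else "sh"
    return f"{mode}_ne_en_{suffix}.{ext}"

print(script_name("bio_genia", mode="gui", windows=True))
```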

Named Entity Demo Parameters

The following is a complete list of parameters for the demo.

Demo-Specific Parameters

The following parameter is specific to the named entity demo (though also found in the part-of-speech demo).

Parameter Description Usage Constraints
resultType Form of results Values determine output:
  • firstBest
  • nBest
  • conf

General Demo Parameters

These parameters apply to every version (web/GUI/command) of every demo.

Parameter Description Usage Constraints
inCharset Input character set Optional. Defaults to platform default.
outCharset Output character set Optional. Defaults to platform default.
contentType Input content type May be one of:
  • text/plain
  • text/html
  • text/xml
Defaults to text/plain.
removeElts Element tags to remove Optional. May only be used with contentType=text/html or contentType=text/xml. Each value may be a comma-separated list. If neither removeElts nor includeElts is specified, all text content is processed.
includeElts Elements to annotate Same constraints as removeElts.

Command-Line Only Parameters

These parameters apply to every command-line demo, but are not relevant for the GUI or web versions of the demos.

Parameter Description Usage Constraints
inFile Readable input file May not be used with inDir. If either inFile or outFile is not specified, it defaults to standard input or standard output, respectively.
outFile Writeable output file Same constraints as inFile.
inDir Readable input directory May not be used with inFile or outFile. If used, inDir and outDir must both be specified.
outDir Writeable output directory Same constraints as inDir.
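
The file and directory constraints in this table can be checked before a script is launched. The following Python sketch is a hypothetical helper, not part of the demo distribution. It assumes the -name=value argument form used in the shell examples earlier, and that resultType is passed on the command line in the same form.

```python
def demo_args(in_file=None, out_file=None, in_dir=None, out_dir=None,
              result_type=None):
    """Build -name=value arguments, enforcing the file/directory constraints."""
    if (in_dir or out_dir) and (in_file or out_file):
        raise ValueError("inDir/outDir may not be combined with inFile/outFile")
    if bool(in_dir) != bool(out_dir):
        raise ValueError("inDir and outDir must both be specified")
    if result_type not in (None, "firstBest", "nBest", "conf"):
        raise ValueError("resultType must be firstBest, nBest or conf")
    pairs = [("inFile", in_file), ("outFile", out_file),
             ("inDir", in_dir), ("outDir", out_dir),
             ("resultType", result_type)]
    # An omitted inFile or outFile leaves the demo on standard input or output.
    return [f"-{name}={value}" for name, value in pairs if value is not None]

print(demo_args(in_file="foo.txt", out_file="foo.out.xml", result_type="nBest"))
```

The returned list can be appended to a script name when building a command, for example with Python's subprocess module.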