What is LingPipe?
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
Feature Overview
LingPipe's information extraction and data mining tools:
- track mentions of entities (e.g. people or proteins);
- link entity mentions to database entries;
- uncover relations between entities and actions;
- classify text passages by language, character encoding, genre, topic, or sentiment;
- correct spelling with respect to a text collection;
- cluster documents by implicit topic and discover significant trends over time; and
- provide part-of-speech tagging and phrase chunking.
Architecture
LingPipe's architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:
- Java API with source code and unit tests;
- multi-lingual, multi-domain, multi-genre models;
- training with new data for new tasks;
- n-best output with statistical confidence estimates;
- online training (learn-a-little, tag-a-little);
- thread-safe models and decoders for concurrent-read exclusive-write (CREW) synchronization; and
- character encoding-sensitive I/O.
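The concurrent-read exclusive-write (CREW) pattern named in the list above can be sketched with the standard java.util.concurrent.locks.ReentrantReadWriteLock. The class CrewModel and its methods are hypothetical stand-ins for illustration, not part of the LingPipe API: training takes the write lock, and lookups share the read lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a trainable model; not a LingPipe class.
class CrewModel {
    private final Map<String, Integer> counts = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Training mutates the model, so it takes the exclusive write lock.
    void train(String token) {
        lock.writeLock().lock();
        try {
            counts.merge(token, 1, Integer::sum);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Lookups only read, so any number of threads may hold the read lock at once.
    int count(String token) {
        lock.readLock().lock();
        try {
            return counts.getOrDefault(token, 0);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

With this arrangement, a "learn-a-little, tag-a-little" loop can interleave calls to train and count from several threads without readers ever observing a partially updated map.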
Latest Release: LingPipe 3.9.0
Minor Release
The latest release of LingPipe is LingPipe 3.9.0. This release replaces LingPipe 3.8.2 and is backward compatible with it, except for the MEDLINE parser and object framework, which were updated for the 2010 DTDs.
Upgrade Path to LingPipe 4.0
LingPipe 3.9 is scheduled to be the final 3.x release. LingPipe 4.0 will remove all currently deprecated classes and methods, and may introduce other changes, such as new package and class names.
Conditional Random Fields
The main addition is conditional random fields (CRFs) and related support packages.
CRF Tutorial
Details may be found in the CRF Tutorial.
Taggers and Chunkers
There is a new package
com.aliasi.crf with an implementation of
first-order chain CRFs, with applications to tagging
and chunking through the classes crf.ChainCrf
and crf.ChainCrfChunker.
Tag Package
To support CRFs, we added a new package
com.aliasi.tag, which contains tagger interfaces that
are implemented by chain CRFs and retrofitted for HMMs.
First-Best Tagging
The first tagging interface, tag.Tagger,
is for first-best taggers. The class tag.Tagging
represents a complete tagging of a generic list of tokens, including
the tokens and the tags.
N-Best Tagging
The second tagging interface, tag.NBestTagger,
is for taggers that are able to return the N best taggings.
The extension tag.ScoredTagging
represents a tagging with a real-valued score attached, implementing
util.Scored for convenience.
Marginal Probability Tagging
The third tagging interface, tag.MarginalTagger,
represents the full results of running a forward-backward-type algorithm
to deduce marginal conditional probabilities of tags for a token given
the entire input token sequence.
The interface tag.TagLattice
represents the result of marginal tagging.
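As a rough illustration of what a forward-backward computation produces, the sketch below derives per-token tag marginals for a small HMM-style model. The class ForwardBackwardSketch and its array-based parameters are assumptions made for this example. It does not use tag.MarginalTagger or tag.TagLattice, and it works with plain probabilities rather than the log scale a production decoder would normally use.

```java
// Hypothetical illustration; not LingPipe's implementation.
class ForwardBackwardSketch {

    // start[k]    = p(first tag is k)
    // trans[j][k] = p(next tag is k | current tag is j)
    // emit[k][n]  = p(token at position n | tag k)
    // Returns marginals[n][k] = p(tag at n is k | all tokens).
    // Assumes at least one token and a non-zero total probability.
    static double[][] marginals(double[] start, double[][] trans, double[][] emit) {
        int numTags = start.length;
        int len = emit[0].length;
        double[][] fwd = new double[len][numTags];
        double[][] bwd = new double[len][numTags];

        // Forward pass: probability of the prefix ending in tag k at position n.
        for (int k = 0; k < numTags; ++k)
            fwd[0][k] = start[k] * emit[k][0];
        for (int n = 1; n < len; ++n) {
            for (int k = 0; k < numTags; ++k) {
                double sum = 0.0;
                for (int j = 0; j < numTags; ++j)
                    sum += fwd[n - 1][j] * trans[j][k];
                fwd[n][k] = sum * emit[k][n];
            }
        }

        // Backward pass: probability of the suffix after position n given tag j at n.
        for (int k = 0; k < numTags; ++k)
            bwd[len - 1][k] = 1.0;
        for (int n = len - 2; n >= 0; --n) {
            for (int j = 0; j < numTags; ++j) {
                double sum = 0.0;
                for (int k = 0; k < numTags; ++k)
                    sum += trans[j][k] * emit[k][n + 1] * bwd[n + 1][k];
                bwd[n][j] = sum;
            }
        }

        // Total probability of the token sequence.
        double total = 0.0;
        for (int k = 0; k < numTags; ++k)
            total += fwd[len - 1][k];

        // Combine and normalize.
        double[][] result = new double[len][numTags];
        for (int n = 0; n < len; ++n)
            for (int k = 0; k < numTags; ++k)
                result[n][k] = fwd[n][k] * bwd[n][k] / total;
        return result;
    }
}
```

Each row of the returned array sums to one, which is the property that distinguishes marginal tagging from first-best or n-best output.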
String Tagging
The class tag.StringTagging
extends Tagging, adding information specific to
the result of tagging a string, such as token offsets and the underlying
character sequence.
Tagger Evaluation
Evaluation for taggers is now implemented in the tag package,
with specific classes for each of the three varieties of tagger,
tag.TaggerEvaluator,
tag.NBestTaggerEvaluator, and
tag.MarginalTaggerEvaluator.
HMM Retrofitting
We also retrofitted HMMs to implement the tagging interfaces.
HMM Deprecations
All of the decoding methods that are not part of the tagging
interface have been deprecated. The HMM-specific result class
hmm.TagWordLattice is also deprecated in favor of the new
marginal tagging class. Finally, the two evaluators are deprecated in
favor of the general tagger evaluators. All of the tutorials have
been updated to reflect the deprecation.
Tag/Chunk Transcoding
There is a new interface
chunk.TagChunkCodec,
which specifies conversions between chunkings and taggings. There
is an implementation, chunk.BioTagChunkCodec, based on the BIO encoding.
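To make the BIO idea concrete, here is a minimal sketch of one direction of the conversion, from a tag sequence to chunks expressed as token spans. The class BioSketch, the "B_"/"I_"/"O" tag spelling, and the string rendering of chunks are assumptions for this example. It does not call chunk.BioTagChunkCodec, whose behavior on malformed sequences may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of BIO decoding; not LingPipe's codec.
class BioSketch {

    // Converts tags such as B_PER, I_PER, O into chunks rendered as
    // "TYPE[startToken,endToken)" with an exclusive end index.
    // An I_ tag that does not continue an open chunk of the same type
    // is treated as outside (an assumption of this sketch).
    static List<String> decode(String[] tags) {
        List<String> chunks = new ArrayList<>();
        int start = -1;
        String type = null;
        // Iterate one position past the end so a trailing chunk is closed.
        for (int i = 0; i <= tags.length; ++i) {
            String tag = i < tags.length ? tags[i] : "O";
            boolean continues = type != null && tag.equals("I_" + type);
            if (type != null && !continues) {
                chunks.add(type + "[" + start + "," + i + ")");
                type = null;
            }
            if (tag.startsWith("B_")) {
                start = i;
                type = tag.substring(2);
            }
        }
        return chunks;
    }
}
```

For the tag sequence B_PER, I_PER, O, B_LOC, this yields one PER chunk over tokens 0 and 1 and one LOC chunk over token 3.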
2010 MEDLINE DTD Update
The com.aliasi.medline package was updated for
the 2010 DTDs from the U.S. National Library of Medicine.
The tutorial and included demo data for MEDLINE were also updated.
Tokenized LM Serialization
We added methods to serialize tokenized LMs using the same kind of format as the Google N-gram corpus. We also implemented input methods to read in corpora in that format. Details may be found in the LM Tutorial in the section "Scaling Token Language Models".
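The Google N-gram corpus stores each n-gram as space-separated tokens followed by a tab and a count. The sketch below round-trips a count table through that line layout. The class NGramCountFormat and its methods are hypothetical and are not the serialization methods added to the tokenized LM classes; the file layout LingPipe actually uses is described in the LM Tutorial.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of a "tokens<TAB>count" line format.
class NGramCountFormat {

    // Writes one line per n-gram, sorted by n-gram for reproducible output.
    static String format(Map<String, Long> counts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Long> e : new TreeMap<>(counts).entrySet())
            sb.append(e.getKey()).append('\t').append(e.getValue()).append('\n');
        return sb.toString();
    }

    // Parses lines of the same form; lines without a tab are skipped.
    static Map<String, Long> parse(String text) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : text.split("\n")) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) continue;
            String ngram = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            counts.put(ngram, count);
        }
        return counts;
    }
}
```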
Tokenization Encapsulation
There is a new class
token.Tokenization
that encapsulates the information in a tokenized string including
the string and token start/end boundaries.
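The kind of information such a class bundles can be sketched as follows. TokenizationSketch is a hypothetical whitespace-only stand-in written for this example, not token.Tokenization itself, and its field names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: a string, its tokens, and their character offsets.
class TokenizationSketch {
    final String text;
    final List<String> tokens = new ArrayList<>();
    final List<Integer> starts = new ArrayList<>(); // inclusive start offsets
    final List<Integer> ends = new ArrayList<>();   // exclusive end offsets

    // Splits on whitespace and records where each token begins and ends.
    TokenizationSketch(String text) {
        this.text = text;
        int i = 0;
        int len = text.length();
        while (i < len) {
            while (i < len && Character.isWhitespace(text.charAt(i))) ++i;
            int start = i;
            while (i < len && !Character.isWhitespace(text.charAt(i))) ++i;
            if (i > start) {
                tokens.add(text.substring(start, i));
                starts.add(start);
                ends.add(i);
            }
        }
    }
}
```

Keeping the offsets alongside the tokens means a token-level result, such as a tag or chunk, can be mapped back to a character span of the original string.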
Cross-Validating Object Handler Corpus
There is a new class corpus.XValidatingObjectCorpus
that is like the cross-validating classifier corpus, only for general
object handlers. This is particularly useful for cross-validating chunkers.
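The general idea of a cross-validating corpus over arbitrary objects can be sketched as below. XValSketch is a hypothetical class using contiguous folds, which is one common scheme. It is not corpus.XValidatingObjectCorpus, and that class's method names and fold layout are not assumed here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of k-fold partitioning over generic items.
class XValSketch<E> {
    private final List<E> items = new ArrayList<>();
    private final int numFolds;

    XValSketch(int numFolds) {
        if (numFolds < 1)
            throw new IllegalArgumentException("need at least one fold");
        this.numFolds = numFolds;
    }

    // Collects an item, in the style of an object handler.
    void handle(E item) {
        items.add(item);
    }

    // Index of the first item in the given fold; fold numFolds maps to size().
    private int foldStart(int fold) {
        return (int) ((long) fold * items.size() / numFolds);
    }

    // Items held out for evaluation in this fold.
    List<E> test(int fold) {
        return new ArrayList<>(items.subList(foldStart(fold), foldStart(fold + 1)));
    }

    // All remaining items, used for training in this fold.
    List<E> train(int fold) {
        List<E> result = new ArrayList<>(items.subList(0, foldStart(fold)));
        result.addAll(items.subList(foldStart(fold + 1), items.size()));
        return result;
    }
}
```

For chunker evaluation, E would be a chunking: train on train(fold), evaluate on test(fold), and repeat for each fold.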
Chunking Merges
We added a static method
ChunkingImpl.merge(Chunking,Chunking), which merges two chunkings
into a single chunking. This is useful to combine dictionary output with
statistical chunker outputs, or to combine two statistical chunkers' outputs.
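A merge of this kind needs a policy for overlapping chunks. The sketch below keeps every chunk from the first chunking and adds chunks from the second only where they overlap nothing already kept. That priority rule, the class ChunkMergeSketch, and the int[] {start, end} chunk representation are all assumptions for illustration; the actual rule used by ChunkingImpl.merge should be checked in its documentation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of merging two chunkings; not LingPipe's method.
class ChunkMergeSketch {

    // Each chunk is {start, end} with an exclusive end offset.
    static List<int[]> merge(List<int[]> first, List<int[]> second) {
        List<int[]> merged = new ArrayList<>(first);
        for (int[] candidate : second) {
            boolean overlaps = false;
            for (int[] kept : merged) {
                // Two half-open spans overlap when each starts before the other ends.
                if (candidate[0] < kept[1] && kept[0] < candidate[1]) {
                    overlaps = true;
                    break;
                }
            }
            if (!overlaps)
                merged.add(candidate);
        }
        // Return chunks in text order.
        merged.sort(Comparator.comparingInt((int[] c) -> c[0]));
        return merged;
    }
}
```

Passing dictionary matches as the first argument and statistical chunker output as the second would, under this policy, let dictionary entries win wherever the two disagree.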
Handler Deprecations
The classes corpus.StringArrayHandler,
corpus.IntArrayHandler, and
corpus.ChunkHandler have been deprecated. They may be
replaced with instances of ObjectHandler<String[]>,
ObjectHandler<int[]>,
and ObjectHandler<Chunking> respectively.
Classes that depended on these interfaces, such as
lm.TokenizedLM, lm.TrieIntSeqCounter, a host
of classes in the chunk package, and
corpus.ChunkTagHandlerAdapter have been generalized to
the relevant object handler interfaces.
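The migration amounts to replacing a type-specific handler with a parameterized one. In the sketch below, the locally declared ObjectHandler interface is a stand-in with the single handle method assumed for the real corpus.ObjectHandler, so that the example compiles on its own. TokenCollector is a hypothetical handler written for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Local stand-in so the example is self-contained; in real code, import
// the interface from the corpus package instead of declaring it.
interface ObjectHandler<E> {
    void handle(E e);
}

// Code that previously implemented a string-array handler can instead
// implement the generic interface with String[] as its type argument.
class TokenCollector implements ObjectHandler<String[]> {
    final List<String> tokens = new ArrayList<>();

    @Override
    public void handle(String[] toks) {
        for (String tok : toks)
            tokens.add(tok);
    }
}
```

The same pattern applies to the other two deprecated interfaces, with int[] or Chunking as the type argument.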
Bug Fixes
There were various bug fixes as reported on the LingPipe Newsgroup.