What is LingPipe?
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
Feature Overview
LingPipe's information extraction and data mining tools:
- track mentions of entities (e.g. people or proteins);
- link entity mentions to database entries;
- uncover relations between entities and actions;
- classify text passages by language, character encoding, genre, topic, or sentiment;
- correct spelling with respect to a text collection;
- cluster documents by implicit topic and discover significant trends over time; and
- provide part-of-speech tagging and phrase chunking.
Architecture
LingPipe's architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:
- Java API with source code and unit tests;
- multi-lingual, multi-domain, multi-genre models;
- training with new data for new tasks;
- n-best output with statistical confidence estimates;
- online training (learn-a-little, tag-a-little);
- thread-safe models and decoders for concurrent-read exclusive-write (CREW) synchronization; and
- character encoding-sensitive I/O.
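As a rough illustration of the concurrent-read exclusive-write (CREW) pattern named above, the sketch below guards a toy count model with a standard `ReentrantReadWriteLock`. The class `CrewCounter` and its methods are hypothetical and are not part of the LingPipe API; the sketch only shows how many readers can share a model while a single writer trains it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a trainable model; not a LingPipe class.
class CrewCounter {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // "Learn a little": a writer holds the exclusive write lock.
    void train(String token) {
        lock.writeLock().lock();
        try {
            Integer old = counts.get(token);
            counts.put(token, old == null ? 1 : old + 1);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // "Tag a little": any number of readers may hold the read lock at once.
    int count(String token) {
        lock.readLock().lock();
        try {
            Integer c = counts.get(token);
            return c == null ? 0 : c;
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Under this arrangement, calls to `count` never block one another, and a call to `train` waits until no reader or other writer holds the lock.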
Latest Release: LingPipe 3.5.1
Patch Release
The latest release of LingPipe is LingPipe 3.5.1. This release replaces LingPipe 3.5.0, with which it is fully backward compatible.
Bug Fixes
Fast Cache
The performance (speed and size) bug in util.FastCache has been patched. In addition, a new cache implementation, util.HardFastCache, avoids soft references in order to reduce load on the garbage collector.
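To make the trade-off concrete, here is a minimal sketch of a cache that holds its entries through ordinary (hard) references. It is not the source of util.HardFastCache; the class name `HardBucketCache`, the fixed bucket array, and the overwrite-on-collision policy are all assumptions chosen for brevity. Because no `SoftReference` wrappers are involved, the garbage collector has no extra reference objects to track, at the cost that entries are never reclaimed under memory pressure.

```java
// Hypothetical illustration; not the implementation of util.HardFastCache.
class HardBucketCache<K, V> {

    // Each bucket holds at most one entry, referenced directly (hard reference).
    private static final class Entry<K, V> {
        final K key;
        final V value;
        Entry(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    private final Entry<K, V>[] buckets;

    @SuppressWarnings("unchecked")
    HardBucketCache(int size) {
        buckets = (Entry<K, V>[]) new Entry[size];
    }

    private int index(K key) {
        // Mask the sign bit so the index is never negative.
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    // A new entry simply overwrites whatever occupied its bucket.
    void put(K key, V value) {
        buckets[index(key)] = new Entry<K, V>(key, value);
    }

    // Returns the cached value, or null if the key is absent or was overwritten.
    V get(K key) {
        Entry<K, V> e = buckets[index(key)];
        return (e != null && e.key.equals(key)) ? e.value : null;
    }
}
```

A cache like this has a fixed memory footprint set by its bucket count, so the size must be chosen up front rather than left to the collector.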
XML Element Stack Filter
The class xml.ElementStackFilter was patched to handle SAX implementations that reuse the same attributes object on each callback. The element stack filter now copies the attributes to a local version for later access.
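The defensive copy described here can be sketched with the standard SAX helper `org.xml.sax.helpers.AttributesImpl`, whose copy constructor snapshots an `Attributes` object. The handler below is a hypothetical example, not the code of xml.ElementStackFilter; the class name `CopyingStackHandler` and its `parse` helper are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; keeps a stack of element names and copied attributes.
class CopyingStackHandler extends DefaultHandler {
    private final List<String> names = new ArrayList<String>();
    private final List<Attributes> atts = new ArrayList<Attributes>();
    final List<String> seen = new ArrayList<String>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes a) {
        names.add(qName);
        // Copy the attributes: the parser may reuse 'a' on the next callback.
        atts.add(new AttributesImpl(a));
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        int top = names.size() - 1;
        Attributes saved = atts.remove(top);
        // The saved copy is still valid even if the parser reused its object.
        seen.add(names.remove(top) + ":" + saved.getValue("id"));
    }

    // Parses an XML string and reports "name:id" for each closed element.
    static List<String> parse(String xml) {
        CopyingStackHandler handler = new CopyingStackHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.seen;
    }
}
```

Storing the parser's `Attributes` argument directly would instead risk every stack entry pointing at the same, later-overwritten object.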
Missing Demo Files and Libraries
A demo input file for evaluating spell checking was missing and is now included.
Generic Demo XML Dependency
The NekoHTML package we use in our demos to parse HTML input depends on the XercesJ XML libraries, so the release now includes them, along with the appropriate classpath entries for all of the commands and the web service demos.
Convenience Methods and Generic Specifications
Version 3.5.1 includes a few new convenience methods in classes such as util.Streams, util.AbstractCommand, util.Arrays, xml.TextAccumulatorHandler, and stats.LogisticRegression, and in all vector.Vector implementations.
The cache specification for hmm.HmmDecoder was given generic type parameters. (This change is fully backward compatible; the implementation did not change, and the generic specifications are optional.)
Generic Model Interface
Version 3.5.1 introduces a generic stats.Model interface. The language model implementations have been retrofitted to implement this interface.