What is LingPipe?
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
Feature Overview
LingPipe's information extraction and data mining tools:
- track mentions of entities (e.g. people or proteins);
- link entity mentions to database entries;
- uncover relations between entities and actions;
- classify text passages by language, character encoding, genre, topic, or sentiment;
- correct spelling with respect to a text collection;
- cluster documents by implicit topic and discover significant trends over time; and
- provide part-of-speech tagging and phrase chunking.
Architecture
LingPipe's architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:
- Java API with source code and unit tests;
- multi-lingual, multi-domain, multi-genre models;
- training with new data for new tasks;
- n-best output with statistical confidence estimates;
- online training (learn-a-little, tag-a-little);
- thread-safe models and decoders for concurrent-read exclusive-write (CREW) synchronization; and
- character encoding-sensitive I/O.
Latest Release: LingPipe 3.8.1
Intermediate Release
The latest release of LingPipe is LingPipe 3.8.1. This release replaces LingPipe 3.8.0, with which it is fully backward compatible. It is also backward compatible with LingPipe 3.7, except for the new dictionary interface return types.
Upgrade Path to LingPipe 4.0
LingPipe 3.8 is scheduled to be the final 3.x release. LingPipe 4.0 is going to introduce some major changes:
- com.lingpipe. The most dramatic is that we are going to change the package path from com.aliasi to com.lingpipe and change the web site to match.
- Deprecation. Every interface, class, method and field that is deprecated in 3.8 will be removed in 4.0. Each deprecation is documented with instructions for migrating away from it.
- License. The current plan is to release 4.0 under the AGPLv3 license.
3.8.1: Logistic Regression Efficiency and Bug Fix
Object allocations inside of logistic regression were removed, which improves speed and lowers thread usage. This patch combines the efficiency of 3.7.0 with the cleaner refactored code of 3.8.0.
We also patched a bug that affected regularized, multinomial logistic regression and prevented any but the first coefficient vector from being regularized.
3.8.1: Tokenizer End Positions
We added an end position method to the tokenizer abstract base class and implemented it in all of the built-in LingPipe tokenizers.
This uncovered bugs in the token start position calculations for the Indo-European tokenizer factory, which was not taking offset starting positions into account. We also added start position implementations for all tokenizers that previously threw unsupported operation exceptions.
Filtered tokenizers return the start/end position of the underlying token. For instance, stemming tokenizers return the start and end position of the underlying token which was stemmed. This allows text highlighting to work properly for tokenizers that modify tokens.
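The position-preserving behavior can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the Token class, the whitespace tokenizer, and the toy suffix-stripping "stemmer" below are hypothetical and are not LingPipe's tokenizer API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token holder; not a LingPipe class.
final class Token {
    final String text; // token text, possibly modified by a filter
    final int start;   // offset of the first character in the original input
    final int end;     // offset one past the last character in the original input

    Token(String text, int start, int end) {
        this.text = text;
        this.start = start;
        this.end = end;
    }
}

final class PositionSketch {
    // Splits on whitespace and records offsets relative to the input string.
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) tokens.add(new Token(input.substring(start, i), start, i));
        }
        return tokens;
    }

    // Toy "stemmer" that strips one trailing "s". The text changes, but the
    // start and end offsets are copied from the underlying token.
    static List<Token> stem(List<Token> in) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            String stemmed = (t.text.length() > 1 && t.text.endsWith("s"))
                ? t.text.substring(0, t.text.length() - 1)
                : t.text;
            out.add(new Token(stemmed, t.start, t.end));
        }
        return out;
    }
}
```

Because the offsets still refer to the original input, a highlighter can mark input.substring(start, end) even though the token text has been shortened.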
Dictionary Interface Return Types
Methods that returned generic arrays in
dictionary.Dictionary have been deprecated,
and new methods that return lists of dictionary entries
have been added.
Any class implementing Dictionary must
be updated to convert the array returns into lists.
Automatic Text Completion
We've added a new auto-completion class
spell.AutoCompleter which takes a corpus of phrases with
weights and provides an online auto-completion service, returning the
top N matches against the specified prefixes. Like Google's
auto-completer, it allows spelling correction as it goes.
There's a new section in the spelling tutorial that covers how to use the auto-completer as an API and in a Swing GUI combo-box-like way.
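As a rough picture of what weighted prefix completion involves, here is a minimal sketch in plain Java. It is not the spell.AutoCompleter implementation or its API: the class and method names are hypothetical, it scans every phrase linearly, and it deliberately omits the spelling correction described above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and behavior are assumptions, not LingPipe's API.
final class PrefixCompleterSketch {
    private final Map<String, Double> phraseWeights = new LinkedHashMap<>();

    // Registers a phrase with its weight (for example, a query count).
    void add(String phrase, double weight) {
        phraseWeights.put(phrase, weight);
    }

    // Returns up to maxResults phrases that start with the prefix,
    // ordered from highest weight to lowest.
    List<String> complete(String prefix, int maxResults) {
        List<Map.Entry<String, Double>> matches = new ArrayList<>();
        for (Map.Entry<String, Double> entry : phraseWeights.entrySet()) {
            if (entry.getKey().startsWith(prefix)) matches.add(entry);
        }
        matches.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        List<String> results = new ArrayList<>();
        for (int i = 0; i < matches.size() && i < maxResults; i++) {
            results.add(matches.get(i).getKey());
        }
        return results;
    }
}
```

With the phrases "new york" (10), "newark" (7), and "new jersey" (5), completing the prefix "new" with a limit of two would return "new york" followed by "newark".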
New Feature Extraction Package and Utilities
We've introduced a
com.aliasi.features package for 3.8, which
contains feature extractor utilities for combining
multiple extractors, converting to z-scores, etc. We will
continue to build this out with new utilities.
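For readers unfamiliar with z-score conversion, the following is a small standalone sketch of the arithmetic, not the com.aliasi.features code. It assumes the population standard deviation (dividing by n) and maps constant features to zero; the package may make different choices.

```java
// Hypothetical helper; not part of com.aliasi.features.
final class ZScoreSketch {
    // Returns (x - mean) / standardDeviation for each value, using the
    // population standard deviation (an assumption of this sketch).
    static double[] zScores(double[] values) {
        int n = values.length;
        double mean = 0.0;
        for (double v : values) mean += v / n;
        double variance = 0.0;
        for (double v : values) variance += (v - mean) * (v - mean) / n;
        double sd = Math.sqrt(variance);
        double[] z = new double[n];
        for (int i = 0; i < n; i++) {
            // A constant feature has zero deviation; report 0 rather than NaN.
            z[i] = (sd == 0.0) ? 0.0 : (values[i] - mean) / sd;
        }
        return z;
    }
}
```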
Primitive Integer Iterators
The util.Iterators class now contains a primitive
iterator abstract base class util.Iterators.PrimitiveInt
for iterating over integers without the overhead of auto-boxing.
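The general pattern looks roughly like the sketch below. The class and method names here are hypothetical stand-ins chosen for illustration and should not be read as the signatures of util.Iterators.PrimitiveInt.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical base class illustrating the pattern; not LingPipe's class.
abstract class PrimitiveIntIteratorSketch implements Iterator<Integer> {
    // Returns the next value as a primitive int, with no Integer allocation.
    abstract int nextInt();

    // Generic clients can still use the boxed view.
    @Override
    public Integer next() {
        return nextInt();
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }
}

// Iterates over the integers start (inclusive) to end (exclusive).
final class RangeIterator extends PrimitiveIntIteratorSketch {
    private int next;
    private final int end;

    RangeIterator(int start, int end) {
        this.next = start;
        this.end = end;
    }

    @Override
    public boolean hasNext() {
        return next < end;
    }

    @Override
    int nextInt() {
        if (!hasNext()) throw new NoSuchElementException();
        return next++;
    }
}
```

Callers that loop with hasNext() and nextInt() never create Integer objects, while code written against Iterator<Integer> continues to work through next().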
Traditional Naive Bayes with EM
There's a new, much more traditional implementation of naive Bayes
text classifiers in classify.TradNaiveBayes. This class
is much faster than the current naive Bayes classifiers, both for
training and at run time.
The new naive Bayes implementation allows length normalization of inputs and takes two standard uniform Dirichlet (additive) priors for tokens in categories and categories themselves.
It also supports semi-supervised learning through expectation maximization.
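The effect of an additive prior can be shown with the standard smoothed estimate, (count + prior) / (total + prior * numOutcomes). The sketch below illustrates that textbook formula only; it is an assumption that classify.TradNaiveBayes computes its estimates in exactly this form, and the class and parameter names are hypothetical.

```java
// Hypothetical helper illustrating additive (uniform Dirichlet) smoothing.
final class AdditiveSmoothingSketch {
    // count:       observed count of one outcome (a token in a category,
    //              or a category among the training items)
    // total:       sum of the counts over all outcomes
    // prior:       additive prior count given to every outcome
    // numOutcomes: number of possible outcomes (vocabulary size or
    //              number of categories)
    static double estimate(double count, double total, double prior, int numOutcomes) {
        return (count + prior) / (total + prior * numOutcomes);
    }
}
```

With a prior of 1, an outcome never seen among 10 observations over 5 possible outcomes receives probability 1/15 rather than zero, and the estimates over all outcomes still sum to one.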
EM Tutorial
There's a new tutorial for using naive Bayes with expectation maximization for semi-supervised learning, which involves training with a mixture of labeled and unlabeled data.
Improved K-Means(++)
We've completely overhauled our k-means clustering implementation. It's now several orders of magnitude faster, provides intermediate reports, and supports k-means++ initialization.
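For reference, k-means++ initialization picks the first center uniformly at random and each later center with probability proportional to its squared distance from the nearest center chosen so far. The following is a minimal standalone sketch of that published procedure over dense double[] points. It is not LingPipe's implementation, and the class and method names are hypothetical.

```java
import java.util.Random;

// Hypothetical sketch of k-means++ seeding; not LingPipe's clusterer.
final class KMeansPlusPlusSketch {
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the indices of the k points chosen as initial centers.
    static int[] chooseCenters(double[][] points, int k, Random random) {
        int n = points.length;
        if (k < 1 || k > n)
            throw new IllegalArgumentException("need 1 <= k <= number of points");
        int[] centers = new int[k];
        double[] minSqDist = new double[n]; // squared distance to nearest center
        centers[0] = random.nextInt(n);     // first center: uniform at random
        for (int i = 0; i < n; i++)
            minSqDist[i] = squaredDistance(points[i], points[centers[0]]);
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (double d : minSqDist) total += d;
            // Sample an index with probability minSqDist[i] / total.
            double target = random.nextDouble() * total;
            double cumulative = 0.0;
            int chosen = -1;
            for (int i = 0; i < n; i++) {
                if (minSqDist[i] == 0.0) continue; // coincides with a center
                cumulative += minSqDist[i];
                chosen = i;
                if (cumulative > target) break;
            }
            // Every point coincides with a center: fall back to a uniform pick.
            if (chosen < 0) chosen = random.nextInt(n);
            centers[c] = chosen;
            for (int i = 0; i < n; i++)
                minSqDist[i] = Math.min(minSqDist[i],
                                        squaredDistance(points[i], points[chosen]));
        }
        return centers;
    }
}
```

Points far from every existing center are the most likely to be picked next, which tends to spread the initial centers across the data before the usual k-means iterations begin.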
Classifiers Settable in Classifier Evaluations
To facilitate cross-validation evaluations with a single classifier evaluator, we made the contained classifier mutable and provided a set method.
Short Priority Queue
The class util.ShortPriorityQueue implements
priority queues in a way that's optimized for short queues.
This class is ideal for collecting small n-best lists. We developed
it for the auto-completer.
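One common way to optimize for short queues is to keep a small array sorted by insertion, which avoids tree or heap bookkeeping when only a handful of items are retained. The sketch below shows that idea; it is an assumption that util.ShortPriorityQueue works this way internally, and the class shown is hypothetical.

```java
// Hypothetical bounded n-best list; not LingPipe's ShortPriorityQueue.
final class ShortBestListSketch {
    private final String[] items;
    private final double[] scores; // kept sorted, highest score first
    private int size = 0;

    ShortBestListSketch(int maxSize) {
        if (maxSize < 1) throw new IllegalArgumentException("maxSize must be >= 1");
        items = new String[maxSize];
        scores = new double[maxSize];
    }

    // Inserts the item if it belongs among the best maxSize seen so far.
    // Returns false if the list is full and the score is not high enough.
    boolean offer(String item, double score) {
        if (size == scores.length && score <= scores[size - 1]) return false;
        // Use a fresh slot if there is room; otherwise overwrite the worst.
        int i = (size < scores.length) ? size++ : size - 1;
        // Shift lower-scoring entries down one slot to open the right position.
        while (i > 0 && scores[i - 1] < score) {
            scores[i] = scores[i - 1];
            items[i] = items[i - 1];
            i--;
        }
        scores[i] = score;
        items[i] = item;
        return true;
    }

    // Returns the item at the given rank, where rank 0 is the best.
    String get(int rank) {
        if (rank < 0 || rank >= size) throw new IndexOutOfBoundsException();
        return items[rank];
    }

    int size() {
        return size;
    }
}
```

Each insertion costs time linear in the queue length, which is negligible when only a few best results are kept.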
Tokenizer Factories and Filters
The tokenizer package has been completely overhauled to supply serializable tokenizer factory and filter implementations for all of the built-in tokenizer functionality such as stop-listing, stemming, case normalization, etc.
Tokenized LM Serialization
There was a hack in the tokenized LMs where the fully qualified name of a tokenizer factory was written and then reflection using a nullary constructor was used to reconstitute a tokenizer factory. We are still supporting this for backward compatibility, but the preferred approach is to now have a serializable tokenizer factory. If a tokenizer factory is serializable, it will now be serialized and deserialized in the usual way.
Report Logger
The io.Reporter class is for use in long-running
methods, such as classifier training or singular value decomposition,
to report back to their clients. It's designed to print out
timestamps and to allow formatted prints to a range of outputs
(standard out, file, etc.).
The direct use of print writers as reporters has been deprecated in
favor of the reporter class. Methods that previously took
PrintWriter arguments for reporting have been updated to
take reporters. We maintain backward compatibility by wrapping the
print writers in debug-level reporters.
Line-Oriented Reader
The io.FileLineReader class supports line-based
reading of files in a single go with simpler loops than
java.io.BufferedReader.
Sandbox now in Subversion
We've converted from CVS to Subversion, including the sandbox projects, which are now available via anonymous Subversion.
Deprecated Generic Array Returns
Generic array returns in the dictionary, HMM, language model, and utility package were deprecated in favor of generic lists.
Deprecated Priority Queue Interface, Bounded Priority Queue implements Queue
We modified util.BoundedPriorityQueue so that it
implements Java's util.Queue and
util.SortedSet interfaces. The interface
util.PriorityQueue has been deprecated in favor
of Java's Queue interface.
Strict Math Converter
After experiencing test failures on Athlon processors in non-strict math mode, we've included an Ant task in the top-level build.xml file to convert the code to strict math and back again using scripts.
Annotations Everywhere
We added @Override, @SuppressWarnings,
and @Deprecated annotations where appropriate.
Updated Jars and Dependencies
JUnit 4.5
We finally upgraded our tests to use the annotations that have been around since JUnit 4.0.
Updated Demo Jars
We updated the demo jars for the latest Lucene release (2.4.1), the latest Luke release (0.9.2), and the latest NekoHTML release (1.9.11).
Lint Free
Base LingPipe code (rooted at com.aliasi) and all
of the demos (generic and tutorial) have been
cleaned up so there are no more warnings during compilation, other
than for one warning due to util.BoundedPriorityQueue extending
the now-deprecated priority queue interface util.PriorityQueue.
Bug Fixes for Logistic Regression
We fixed a bug that under-regularized each feature: the regularization owed for the epochs between the last epoch in which the feature was seen and the final epoch was never applied.
We reimplemented softmax so there are no more
overflows and NaN coefficient estimates.
There may be infinite log likelihood estimates, but
that's just the limit of double-precision arithmetic
and not an unrecoverable error.
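A standard way to avoid overflow in softmax is to subtract the largest input before exponentiating, which leaves the result mathematically unchanged. The sketch below shows that general technique; it is an assumption, not a statement about how the LingPipe reimplementation is written, and the class name is hypothetical.

```java
// Hypothetical illustration of overflow-safe softmax.
final class SoftmaxSketch {
    // Returns exp(x[i]) / sum_j exp(x[j]) for each i. Shifting by the
    // maximum makes the largest exponent exp(0) = 1, so nothing overflows
    // and the denominator is always at least 1.
    static double[] softmax(double[] x) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) max = Math.max(max, v);
        double[] result = new double[x.length];
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            result[i] = Math.exp(x[i] - max);
            sum += result[i];
        }
        for (int i = 0; i < x.length; i++) result[i] /= sum;
        return result;
    }
}
```

A naive version would evaluate Math.exp(1000.0), which is infinite in double precision, and then produce NaN from infinity divided by infinity. In the shifted version, very small probabilities may underflow to zero, so their logarithms are infinite, which matches the note above about infinite log likelihoods.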
Bug Fixes for Corpora
We fixed a bug in file-based corpora that improperly switched between default and user-specified character encodings.