What is LingPipe?

LingPipe is a suite of Java libraries for the linguistic analysis of human language.

Feature Overview

LingPipe's information extraction and data mining tools:

Architecture

LingPipe's architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:

Latest Release: LingPipe 3.8.1

Intermediate Release

The latest release of LingPipe is LingPipe 3.8.1. This release replaces LingPipe 3.8.0, with which it is fully backward compatible. It is also backward compatible with LingPipe 3.7, except for the new dictionary interface return types.

Upgrade Path to LingPipe 4.0

LingPipe 3.8 is scheduled to be the final 3.x release. LingPipe 4.0 is going to introduce some major changes:

3.8.1: Logistic Regression Efficiency and Bug Fix

Object allocations inside of logistic regression were removed, which improves speed and lowers thread usage. This patch combines the efficiency of 3.7.0 with the cleaner refactored code of 3.8.0.

We also patched a bug that affected regularized, multinomial logistic regression and prevented any but the first coefficient vector from being regularized.

3.8.1: Tokenizer End Positions

We added an end position method to the tokenizer abstract base class and implemented it in all of the built-in LingPipe tokenizers.

This uncovered bugs in the token start position calculations for the Indo-European tokenizer factory, which did not account for offset starting positions. We also added start position implementations for all tokenizers that previously threw unsupported operation exceptions.

Filtered tokenizers return the start/end position of the underlying token. For instance, stemming tokenizers return the start and end position of the underlying token which was stemmed. This allows text highlighting to work properly for tokenizers that modify tokens.
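To make the position behavior concrete, here is a minimal sketch of a whitespace tokenizer that records start and end offsets, followed by a toy filter that rewrites token text while keeping the offsets of the underlying token. The class and method names (TokenPositionSketch, Token, stripPluralS) are hypothetical and are not LingPipe's tokenizer API; the "stemmer" just drops a trailing "s".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; these types are illustrative and are not LingPipe's tokenizer API. */
final class TokenPositionSketch {

    /** A token together with the character offsets of the text it came from. */
    static final class Token {
        final String text;
        final int start; // inclusive offset into the original string
        final int end;   // exclusive offset into the original string

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Splits on whitespace, recording where each token begins and ends. */
    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
            int start = i;
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            if (i > start) tokens.add(new Token(input.substring(start, i), start, i));
        }
        return tokens;
    }

    /** Toy "stemming" filter: changes the token text but keeps the underlying offsets. */
    static List<Token> stripPluralS(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            String stem = (t.text.length() > 1 && t.text.endsWith("s"))
                ? t.text.substring(0, t.text.length() - 1)
                : t.text;
            out.add(new Token(stem, t.start, t.end));
        }
        return out;
    }
}
```

Because the offsets still point into the original string, a highlighter can call substring(start, end) on the input and recover the unmodified surface form even though the token text has changed.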

Dictionary Interface Return Types

Methods that returned generic arrays in dictionary.Dictionary have been deprecated, and new methods that return lists of dictionary entries have been added.

Any class implementing Dictionary must be updated to convert the array returns into lists.
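The conversion is usually mechanical. The sketch below shows one way to wrap an existing array-returning method in a list-returning one; the method names are hypothetical stand-ins and are not the actual Dictionary signatures.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; the method names are illustrative, not the Dictionary signatures. */
final class ArrayToListSketch {

    /** Stands in for a legacy method that returned an array of entries. */
    static String[] phraseEntriesAsArray() {
        return new String[] { "new york", "new jersey" };
    }

    /** List-returning replacement: wraps the array in a read-only list view. */
    static List<String> phraseEntryList() {
        return Collections.unmodifiableList(Arrays.asList(phraseEntriesAsArray()));
    }
}
```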

Automatic Text Completion

We've added a new auto-completion class spell.AutoCompleter which takes a corpus of phrases with weights and provides an online auto-completion service, returning the top N matches against the specified prefixes. Like Google's auto-completer, it allows spelling correction as it goes.

There's a new section in the spelling tutorial that covers how to use the auto-completer as an API and in a Swing GUI combo-box-like way.
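The core idea of weighted prefix completion can be sketched in a few lines. The example below is not the spell.AutoCompleter implementation and omits spelling correction entirely; it simply ranks phrases that start with the typed prefix by weight. All names (PrefixCompleterSketch, add, complete) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of weighted prefix completion; no spelling correction. */
final class PrefixCompleterSketch {
    private final Map<String, Double> phraseWeights = new LinkedHashMap<>();

    /** Registers a phrase with its weight (for example, a query count). */
    void add(String phrase, double weight) {
        phraseWeights.put(phrase, weight);
    }

    /** Returns up to maxResults phrases starting with the prefix, heaviest first. */
    List<String> complete(String prefix, int maxResults) {
        List<Map.Entry<String, Double>> matches = new ArrayList<>();
        for (Map.Entry<String, Double> entry : phraseWeights.entrySet()) {
            if (entry.getKey().startsWith(prefix)) matches.add(entry);
        }
        matches.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> results = new ArrayList<>();
        for (int i = 0; i < Math.min(maxResults, matches.size()); i++) {
            results.add(matches.get(i).getKey());
        }
        return results;
    }
}
```

This linear scan is fine for a small phrase list; a production completer would typically index the phrases (for instance in a trie) and score near-miss prefixes by edit distance to allow for typos.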

New Feature Extraction Package and Utilities

We've introduced a com.aliasi.features package for 3.8, which contains feature extractor utilities for combining multiple extractors, converting to z-scores, etc. We will continue to build this out with new utilities.
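As an illustration of what z-score conversion does to a feature's values, the following sketch standardizes an array of raw values using the population standard deviation. It is a standalone example under that assumption, not code from the com.aliasi.features package, and the class name is hypothetical.

```java
/** Hypothetical sketch of z-score conversion for one feature's values. */
final class ZScoreSketch {

    /** Returns (value - mean) / stdDev for each value, using the population standard deviation. */
    static double[] zScores(double[] values) {
        double mean = 0.0;
        for (double v : values) mean += v;
        mean /= values.length;

        double variance = 0.0;
        for (double v : values) variance += (v - mean) * (v - mean);
        variance /= values.length;
        double stdDev = Math.sqrt(variance);

        double[] scores = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // A constant feature has zero deviation; map it to 0 rather than dividing by zero.
            scores[i] = (stdDev == 0.0) ? 0.0 : (values[i] - mean) / stdDev;
        }
        return scores;
    }
}
```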

Primitive Integer Iterators

The util.Iterators class now contains a primitive iterator abstract base class, util.Iterators.PrimitiveInt, for iterating over integers without the overhead of auto-boxing.
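The pattern looks roughly like the sketch below: an abstract base class exposes a method returning a primitive int, so a loop can consume values without allocating Integer objects. The names IntIterator, nextPrimitive, and range are hypothetical and may not match the members of util.Iterators.PrimitiveInt.

```java
import java.util.NoSuchElementException;

/** Hypothetical sketch of a boxing-free integer iterator. */
final class PrimitiveIntIteratorSketch {

    /** Base class: yields int values directly instead of Integer objects. */
    abstract static class IntIterator {
        abstract boolean hasNext();
        abstract int nextPrimitive();
    }

    /** Iterates from start (inclusive) to endExclusive. */
    static IntIterator range(final int start, final int endExclusive) {
        return new IntIterator() {
            private int next = start;

            @Override
            boolean hasNext() {
                return next < endExclusive;
            }

            @Override
            int nextPrimitive() {
                if (!hasNext()) throw new NoSuchElementException();
                return next++;
            }
        };
    }

    /** Consumes an iterator without creating any Integer instances. */
    static long sum(IntIterator it) {
        long total = 0L;
        while (it.hasNext()) total += it.nextPrimitive();
        return total;
    }
}
```

Later versions of the JDK (8 and up) ship a comparable interface, java.util.PrimitiveIterator.OfInt.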

Traditional Naive Bayes with EM

There's a new, much more traditional implementation of naive Bayes text classifiers in classify.TradNaiveBayes. This class is much faster than the current naive Bayes classifiers at both training time and run time.

The new naive Bayes implementation allows length normalization of inputs and takes two standard uniform Dirichlet (additive) priors for tokens in categories and categories themselves.

It also supports semi-supervised learning through expectation maximization.
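For readers unfamiliar with additive priors, here is a minimal multinomial naive Bayes sketch in which one constant is added to every category count and another to every token count. It is not the classify.TradNaiveBayes code: the class and method names are hypothetical, unknown tokens are simply skipped, and length normalization and EM are left out.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of naive Bayes with additive (uniform Dirichlet) priors. */
final class AdditiveNaiveBayesSketch {
    private final double categoryPrior; // pseudo-count added to each category
    private final double tokenPrior;    // pseudo-count added to each token in each category
    private final Map<String, Integer> docCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Map<String, Integer> tokenTotals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int numDocs = 0;

    AdditiveNaiveBayesSketch(double categoryPrior, double tokenPrior) {
        this.categoryPrior = categoryPrior;
        this.tokenPrior = tokenPrior;
    }

    /** Adds one labeled document, given as its tokens, to the counts. */
    void train(String category, String[] tokens) {
        docCounts.merge(category, 1, Integer::sum);
        numDocs++;
        Map<String, Integer> counts = tokenCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokenTotals.merge(category, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    /** Returns log p(category) + sum of log p(token | category) with smoothing. */
    double logJoint(String category, String[] tokens) {
        double logProb = Math.log((docCounts.getOrDefault(category, 0) + categoryPrior)
            / (numDocs + categoryPrior * docCounts.size()));
        Map<String, Integer> counts =
            tokenCounts.getOrDefault(category, Collections.<String, Integer>emptyMap());
        double denominator = tokenTotals.getOrDefault(category, 0) + tokenPrior * vocabulary.size();
        for (String token : tokens) {
            if (!vocabulary.contains(token)) continue; // tokens never seen in training are ignored
            logProb += Math.log((counts.getOrDefault(token, 0) + tokenPrior) / denominator);
        }
        return logProb;
    }

    /** Picks the category with the highest joint log probability. */
    String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = logJoint(category, tokens);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

An EM loop on top of this would repeatedly classify the unlabeled documents and retrain on the resulting (fractional) category assignments; that step is not shown here.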

EM Tutorial

There's a new tutorial for using naive Bayes with expectation maximization for semi-supervised learning, which involves training with a mixture of labeled and unlabeled data.

Improved K-Means(++)

We've completely overhauled our k-means clustering implementation. It's now several orders of magnitude faster, provides intermediate reports, and supports k-means++ initialization.
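For reference, k-means++ initialization picks the first center uniformly at random and each later center with probability proportional to its squared distance from the nearest center chosen so far. The sketch below follows that published recipe; it is a standalone illustration with hypothetical names, not LingPipe's clusterer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of k-means++ seeding. */
final class KMeansPlusPlusSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Chooses k initial centers from the points; assumes 1 <= k <= points.length. */
    static List<double[]> chooseInitialCenters(double[][] points, int k, Random random) {
        List<double[]> centers = new ArrayList<>();
        centers.add(points[random.nextInt(points.length)]);

        // nearest[i] holds the squared distance from point i to its closest chosen center.
        double[] nearest = new double[points.length];
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);

        while (centers.size() < k) {
            double[] newest = centers.get(centers.size() - 1);
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                nearest[i] = Math.min(nearest[i], squaredDistance(points[i], newest));
                total += nearest[i];
            }
            // Sample index i with probability nearest[i] / total.
            double target = random.nextDouble() * total;
            double cumulative = 0.0;
            int chosen = points.length - 1; // fallback if rounding leaves no index selected
            for (int i = 0; i < points.length; i++) {
                cumulative += nearest[i];
                if (cumulative > target) {
                    chosen = i;
                    break;
                }
            }
            centers.add(points[chosen]);
        }
        return centers;
    }
}
```

Points that are already centers have a squared distance of zero, so they are not selected again as long as some other point remains at a positive distance.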

Classifiers Settable in Classifier Evaluations

To facilitate cross-validation evaluations with a single classifier evaluator, we made the contained classifier mutable and provided a set method.

Short Priority Queue

The class util.ShortPriorityQueue implements priority queues in a way that's optimized for short queues. This class is ideal for collecting small n-best lists. We developed it for the auto-completer.
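One common way to optimize for short queues is to keep the elements in a small sorted array and insert by shifting, which avoids heap bookkeeping and per-element node allocation. The sketch below illustrates that approach for an n-best list of scored strings; it is an assumption about the general technique, not a description of how util.ShortPriorityQueue is written, and the names are hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch of a bounded n-best list backed by sorted arrays. */
final class ShortNBestSketch {
    private final double[] scores; // kept in descending order
    private final String[] items;
    private int size = 0;

    ShortNBestSketch(int maxSize) {
        if (maxSize < 1) throw new IllegalArgumentException("maxSize must be at least 1");
        scores = new double[maxSize];
        items = new String[maxSize];
    }

    /** Offers an item; returns false if the list is full and the score is too low to keep. */
    boolean offer(String item, double score) {
        if (size == scores.length && score <= scores[size - 1]) return false;
        // Use the next free slot, or overwrite the current worst entry when full.
        int i = (size < scores.length) ? size++ : size - 1;
        // Shift lower-scoring entries down to make room (insertion sort step).
        while (i > 0 && scores[i - 1] < score) {
            scores[i] = scores[i - 1];
            items[i] = items[i - 1];
            i--;
        }
        scores[i] = score;
        items[i] = item;
        return true;
    }

    /** Returns the kept items, best first. */
    String[] toArray() {
        return Arrays.copyOf(items, size);
    }
}
```

Insertion is linear in the queue length, which is acceptable precisely because the queue is short.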

Tokenizer Factories and Filters

The tokenizer package has been completely overhauled to supply serializable tokenizer factory and filter implementations for all of the built-in tokenizer functionality such as stop-listing, stemming, case normalization, etc.

Tokenized LM Serialization

There was a hack in the tokenized LMs where the fully qualified name of a tokenizer factory was written and then reflection using a nullary constructor was used to reconstitute the tokenizer factory. We still support this for backward compatibility, but the preferred approach now is to use a serializable tokenizer factory. If a tokenizer factory is serializable, it will be serialized and deserialized in the usual way.
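The practical advantage of ordinary serialization is that a factory can carry configuration, which a name plus a nullary constructor cannot restore. The sketch below round-trips a configured factory through Java's standard object streams; DelimiterTokenizerFactory and roundTrip are hypothetical names, not LingPipe classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical sketch of serializing a configured tokenizer factory. */
final class SerializableFactorySketch {

    /** A factory whose behavior depends on a constructor argument. */
    static final class DelimiterTokenizerFactory implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String delimiterRegex;

        DelimiterTokenizerFactory(String delimiterRegex) {
            this.delimiterRegex = delimiterRegex;
        }

        String[] tokenize(String text) {
            return text.split(delimiterRegex);
        }
    }

    /** Serializes the object to bytes and reads it back, as a model file would. */
    static <T extends Serializable> T roundTrip(T object) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(object);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                @SuppressWarnings("unchecked")
                T copy = (T) in.readObject();
                return copy;
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }
}
```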

Report Logger

The io.Reporter class is for use in long-running methods, such as classifier training or singular value decomposition, to report back to their clients. It's designed to print out timestamps and to allow formatted prints to a range of outputs (standard out, file, etc.).

The direct use of print writers as reporters has been deprecated in favor of the reporter class. Methods that previously took PrintWriter arguments for reporting have been updated to take reporters. We maintain backward compatibility by wrapping the print writers in debug-level reporters.
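The general shape of such a reporter, with a severity threshold and timestamps wrapped around a print writer, might look like the sketch below. It is only an illustration of the idea; the class name, the Level enum, and the report method are assumptions and do not reproduce the io.Reporter API.

```java
import java.io.PrintWriter;

/** Hypothetical sketch of a leveled, timestamped progress reporter. */
final class ReporterSketch {

    enum Level { DEBUG, INFO, WARN }

    private final PrintWriter out;
    private final Level threshold;
    private final long startMillis = System.currentTimeMillis();

    ReporterSketch(PrintWriter out, Level threshold) {
        this.out = out;
        this.threshold = threshold;
    }

    /** Prints a formatted message with elapsed time if the level meets the threshold. */
    void report(Level level, String format, Object... args) {
        if (level.compareTo(threshold) < 0) return;
        long elapsed = System.currentTimeMillis() - startMillis;
        out.printf("[%6d ms] %-5s %s%n", elapsed, level, String.format(format, args));
        out.flush();
    }
}
```

A long-running training loop could then call something like report(Level.INFO, "epoch=%d error=%.4f", epoch, error) once per epoch, and the caller decides where the output goes by choosing the writer.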

Line-Oriented Reader

The io.FileLineReader class supports line-based reading of files in a single go with simpler loops than java.io.BufferedReader.
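For comparison, this is the kind of loop that plain java.io.BufferedReader requires, wrapped in a helper that returns all lines at once. The helper is a hypothetical sketch using only the JDK and is not the LingPipe class; it takes a Reader so that it works with files and in-memory strings alike.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of reading every line from a source in one call. */
final class LineReadingSketch {

    /** Reads all lines from the source and closes it when done. */
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```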

Sandbox now in Subversion

We've converted from CVS to Subversion, including the sandbox projects, which are now available via anonymous Subversion.

Deprecated Generic Array Returns

Generic array returns in the dictionary, HMM, language model, and utility package were deprecated in favor of generic lists.

Deprecated Priority Queue Interface, Bounded Priority Queue implements Queue

We modified util.BoundedPriorityQueue so that it implements Java's util.Queue and util.SortedSet interfaces. The interface util.PriorityQueue has been deprecated in favor of Java's Queue interface.

Strict Math Converter

After experiencing test failures on Athlon processors in non-strict math mode, we've included an Ant task in the top-level build.xml file to convert the code to strict math and back again using scripts.

Annotations Everywhere

We added @Override, @SuppressWarnings, and @Deprecated annotations where appropriate.

Updated Jars and Dependencies

JUnit 4.5

We finally upgraded our tests to use the annotations that have been around since JUnit 4.0.

Updated Demo Jars

We updated the demo jars for the latest Lucene release (2.4.1), the latest Luke release (0.9.2), and the latest NekoHTML release (1.9.11).

Lint Free

Base LingPipe code (rooted at com.aliasi) and all of the demos (generic and tutorial) have been cleaned up so that compilation produces no warnings, except for one caused by util.BoundedPriorityQueue extending the now-deprecated priority queue interface util.PriorityQueue.

Bug Fixes for Logistic Regression

We fixed a bug that under-regularized a feature by a number of epochs equal to the total number of epochs minus the last epoch in which the feature was seen.

We reimplemented softmax so there are no more overflows and NaN coefficient estimates. There may be infinite log likelihood estimates, but that's just the limit of double-precision arithmetic and not an unrecoverable error.
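A standard way to avoid overflow in softmax is to subtract the largest score before exponentiating, which leaves the result unchanged mathematically but keeps every exponent at or below zero. The sketch below shows that technique; whether LingPipe's reimplementation uses exactly this form is an assumption, and the class name is hypothetical.

```java
/** Hypothetical sketch of an overflow-safe softmax. */
final class StableSoftmaxSketch {

    /** Converts finite scores to probabilities without overflowing Math.exp. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double score : scores) max = Math.max(max, score);

        double[] probs = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            // score - max is never positive, so exp() stays in [0, 1].
            probs[i] = Math.exp(scores[i] - max);
            sum += probs[i];
        }
        // The largest score contributes exp(0) = 1, so sum is at least 1 and division is safe.
        for (int i = 0; i < probs.length; i++) probs[i] /= sum;
        return probs;
    }
}
```

With this form, scores such as 1000 and 1000 yield probabilities of 0.5 each, whereas a naive exp(1000) / (exp(1000) + exp(1000)) evaluates to infinity divided by infinity, which is NaN.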

Bug Fixes for Corpora

We fixed a bug in file-based corpora that improperly switched between default and user-specified character encodings.