What is LingPipe?

LingPipe is a suite of Java libraries for the linguistic analysis of human language.

Feature Overview

LingPipe's information extraction and data mining tools:

Architecture

LingPipe's architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:

Latest Release: LingPipe 3.9.0

Minor Release

The latest release of LingPipe is LingPipe 3.9.0. This release replaces LingPipe 3.8.2 and is backward compatible with it, except for the MEDLINE parser and object framework, which were updated for the 2010 DTDs.

Upgrade Path to LingPipe 4.0

LingPipe 3.9 is scheduled to be the final 3.x release. LingPipe 4.0 will remove all currently deprecated classes and methods, and may introduce other changes, such as renamed packages and classes.

Conditional Random Fields

The main addition is conditional random fields (CRFs) and their supporting packages.

CRF Tutorial

Details may be found in the CRF Tutorial.

Taggers and Chunkers

There is a new package, com.aliasi.crf, with an implementation of first-order chain CRFs. It is applied to tagging and chunking through the classes crf.ChainCrf and crf.ChainCrfChunker.

Tag Package

To support CRFs, we added a new package, com.aliasi.tag, containing tagger interfaces that chain CRFs implement and that HMMs have been retrofitted to implement.

First-Best Tagging

The first tagging interface, tag.Tagger, is for first-best taggers. The class tag.Tagging represents a complete tagging of a generic list of tokens, including the tokens and the tags.

N-Best Tagging

The second tagging interface, tag.NBestTagger, is for taggers that are able to return the top N best taggings.

The extension tag.ScoredTagging represents a tagging with a real-valued score attached, implementing util.Scored for convenience.

Marginal Probability Tagging

The third tagging interface, tag.MarginalTagger, represents the full results of running a forward-backward-type algorithm to deduce marginal conditional probabilities of tags for a token given the entire input token sequence.

The interface tag.TagLattice represents the result of marginal tagging.
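
As a rough illustration of what marginal tagging computes, the following self-contained sketch runs forward-backward over a toy HMM and returns p(tag | all tokens) for each position. The class MarginalTagSketch, its method, and the toy probabilities are hypothetical and are not part of LingPipe; the real results are exposed through tag.MarginalTagger and tag.TagLattice.

```java
import java.util.Arrays;

/** Toy forward-backward sketch; a hypothetical stand-in, not LingPipe's classes. */
final class MarginalTagSketch {

    /**
     * Returns m[n][k] = p(tag k at position n | all observations).
     * start[k] is the initial tag probability, trans[j][k] is
     * p(tag k | previous tag j), and emit[k][o] is p(observation o | tag k).
     * Probabilities are unscaled, so this only suits short inputs.
     */
    static double[][] marginals(double[] start, double[][] trans,
                                double[][] emit, int[] obs) {
        if (obs.length == 0)
            throw new IllegalArgumentException("need at least one observation");
        int len = obs.length;
        int numTags = start.length;
        double[][] fwd = new double[len][numTags];
        double[][] bwd = new double[len][numTags];

        // Forward pass: fwd[n][k] = p(obs[0..n], tag k at n).
        for (int k = 0; k < numTags; ++k)
            fwd[0][k] = start[k] * emit[k][obs[0]];
        for (int n = 1; n < len; ++n) {
            for (int k = 0; k < numTags; ++k) {
                double sum = 0.0;
                for (int j = 0; j < numTags; ++j)
                    sum += fwd[n - 1][j] * trans[j][k];
                fwd[n][k] = sum * emit[k][obs[n]];
            }
        }

        // Backward pass: bwd[n][j] = p(obs[n+1..] | tag j at n).
        Arrays.fill(bwd[len - 1], 1.0);
        for (int n = len - 2; n >= 0; --n) {
            for (int j = 0; j < numTags; ++j) {
                double sum = 0.0;
                for (int k = 0; k < numTags; ++k)
                    sum += trans[j][k] * emit[k][obs[n + 1]] * bwd[n + 1][k];
                bwd[n][j] = sum;
            }
        }

        // Normalize by the total probability of the observations.
        double total = 0.0;
        for (int k = 0; k < numTags; ++k)
            total += fwd[len - 1][k];
        double[][] m = new double[len][numTags];
        for (int n = 0; n < len; ++n)
            for (int k = 0; k < numTags; ++k)
                m[n][k] = fwd[n][k] * bwd[n][k] / total;
        return m;
    }

    public static void main(String[] args) {
        double[] start = {0.6, 0.4};                  // hypothetical toy model
        double[][] trans = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] emit = {{0.5, 0.5}, {0.1, 0.9}};
        int[] obs = {0, 1, 1};
        for (double[] row : marginals(start, trans, emit, obs))
            System.out.println(Arrays.toString(row));
    }
}
```

Each row of the result is a distribution over tags for one token, which is the kind of information a tag lattice carries.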

String Tagging

The class tag.StringTagging extends Tagging, adding information specific to tagging a string, such as the token offsets and the underlying character sequence.

Tagger Evaluation

Evaluation for taggers is now implemented in the tag package, with a specific class for each of the three varieties of tagger: tag.TaggerEvaluator, tag.NBestTaggerEvaluator, and tag.MarginalTaggerEvaluator.
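
For orientation, the simplest quantity a first-best evaluation compares is token-level accuracy between reference and response tags. The sketch below computes only that; the class TagAccuracySketch is hypothetical and does not reproduce the statistics reported by the LingPipe evaluators.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical token-accuracy sketch; not the tag.TaggerEvaluator API. */
final class TagAccuracySketch {

    /** Fraction of positions where the response tag equals the reference tag. */
    static double tokenAccuracy(List<String> reference, List<String> response) {
        if (reference.size() != response.size() || reference.isEmpty())
            throw new IllegalArgumentException("need equal-length, non-empty taggings");
        int correct = 0;
        for (int i = 0; i < reference.size(); ++i)
            if (reference.get(i).equals(response.get(i)))
                ++correct;
        return correct / (double) reference.size();
    }

    public static void main(String[] args) {
        List<String> reference = Arrays.asList("DT", "NN", "VB");
        List<String> response = Arrays.asList("DT", "NN", "NN");
        System.out.println(tokenAccuracy(reference, response));
    }
}
```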

HMM Retrofitting

We also retrofitted HMMs to implement the tagging interfaces.

HMM Deprecations

All of the decoding methods that are not part of the tagging interfaces have been deprecated. The HMM-specific result class hmm.TagWordLattice is also deprecated in favor of the new marginal tagging class. Finally, the two HMM evaluators are deprecated in favor of the general tagger evaluators. All of the tutorials have been updated to reflect the deprecations.

Tag/Chunk Transcoding

There is a new interface chunk.TagChunkCodec, which specifies conversions between chunkings and taggings. There is an implementation, chunk.BioTagChunkCodec, based on the BIO encoding.
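
To make the BIO idea concrete, here is a minimal sketch that converts token-level spans to begin/in/out tags and back. It is an illustration only: the class BioSketch, the Span type, and the B_/I_/O tag spelling are assumptions, and the sketch works on token indices rather than the character offsets and tokenizers a real codec has to deal with.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical BIO encode/decode sketch; not the chunk.BioTagChunkCodec API. */
final class BioSketch {

    /** A typed chunk covering tokens [start, end); hypothetical stand-in. */
    static final class Span {
        final int start;
        final int end;
        final String type;
        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Encodes non-overlapping, non-empty spans as one tag per token. */
    static List<String> encode(int numTokens, List<Span> spans) {
        String[] tags = new String[numTokens];
        Arrays.fill(tags, "O");                       // default: outside any chunk
        for (Span s : spans) {
            tags[s.start] = "B_" + s.type;            // first token begins the chunk
            for (int i = s.start + 1; i < s.end; ++i)
                tags[i] = "I_" + s.type;              // remaining tokens are inside
        }
        return Arrays.asList(tags);
    }

    /** Decodes tags back to spans; an I_ tag with no open chunk is ignored. */
    static List<Span> decode(List<String> tags) {
        List<Span> spans = new ArrayList<>();
        int start = -1;
        String type = null;                           // type of the open chunk, if any
        for (int i = 0; i <= tags.size(); ++i) {
            // A sentinel "O" past the end closes any chunk still open.
            String tag = i < tags.size() ? tags.get(i) : "O";
            boolean continues = type != null && tag.equals("I_" + type);
            if (!continues) {
                if (type != null)
                    spans.add(new Span(start, i, type));
                type = null;
                if (tag.startsWith("B_")) {
                    start = i;
                    type = tag.substring(2);
                }
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        List<Span> spans = Arrays.asList(new Span(0, 2, "PER"), new Span(3, 4, "LOC"));
        System.out.println(encode(5, spans));
    }
}
```

Encoding followed by decoding returns the original spans as long as they do not overlap, which is the round-trip property a tag/chunk codec is expected to provide.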

2010 MEDLINE DTD Update

The com.aliasi.medline package was updated for the 2010 DTDs from the U.S. National Library of Medicine.

The tutorial and included demo data for MEDLINE were also updated.

Tokenized LM Serialization

We added methods to serialize tokenized LMs using the same kind of format as the Google N-gram corpus. We also implemented input methods to read in corpora in that format. Details may be found in the LM Tutorial in the section "Scaling Token Language Models".
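
As a sketch of the general idea, the code below writes and reads one n-gram per line as space-separated tokens, a tab, and a count. That line layout is an assumption about the Google N-gram style, and the class NGramFormatSketch and its methods are hypothetical; the LM Tutorial remains the reference for the actual LingPipe methods and file layout.

```java
import java.util.Arrays;

/** Hypothetical "tokens TAB count" line format sketch; not LingPipe's I/O API. */
final class NGramFormatSketch {

    /** Formats an n-gram and its count as a single line (without a newline). */
    static String formatLine(String[] tokens, long count) {
        return String.join(" ", tokens) + "\t" + count;
    }

    /** Extracts the tokens from a well-formed line. */
    static String[] parseTokens(String line) {
        return line.substring(0, line.lastIndexOf('\t')).split(" ");
    }

    /** Extracts the count from a well-formed line. */
    static long parseCount(String line) {
        return Long.parseLong(line.substring(line.lastIndexOf('\t') + 1));
    }

    public static void main(String[] args) {
        String line = formatLine(new String[] {"new", "york"}, 42L);
        System.out.println(Arrays.toString(parseTokens(line)) + " " + parseCount(line));
    }
}
```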

Tokenization Encapsulation

There is a new class token.Tokenization that encapsulates the information in a tokenized string including the string and token start/end boundaries.
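
The kind of information involved can be sketched with a whitespace tokenizer that records where each token starts and ends in the original string. The class TokenizationSketch and its accessor names are hypothetical stand-ins chosen for illustration; they are not the token.Tokenization API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenization-with-offsets sketch; not token.Tokenization. */
final class TokenizationSketch {
    private final String text;
    private final List<String> tokens = new ArrayList<>();
    private final List<Integer> starts = new ArrayList<>();
    private final List<Integer> ends = new ArrayList<>();

    /** Splits on whitespace, keeping each token's [start, end) offsets. */
    TokenizationSketch(String text) {
        this.text = text;
        int len = text.length();
        int i = 0;
        while (i < len) {
            while (i < len && Character.isWhitespace(text.charAt(i)))
                ++i;                                  // skip whitespace
            if (i == len)
                break;
            int start = i;
            while (i < len && !Character.isWhitespace(text.charAt(i)))
                ++i;                                  // consume one token
            tokens.add(text.substring(start, i));
            starts.add(start);
            ends.add(i);
        }
    }

    String text() { return text; }
    int numTokens() { return tokens.size(); }
    String token(int n) { return tokens.get(n); }
    int tokenStart(int n) { return starts.get(n); }
    int tokenEnd(int n) { return ends.get(n); }

    public static void main(String[] args) {
        TokenizationSketch t = new TokenizationSketch("John  ran");
        for (int n = 0; n < t.numTokens(); ++n)
            System.out.println(t.token(n) + " " + t.tokenStart(n) + " " + t.tokenEnd(n));
    }
}
```

Keeping the offsets alongside the tokens lets later stages, such as chunkers, map token-level results back to character positions in the input.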

Cross-Validating Object Handler Corpus

There is a new class corpus.XValidatingObjectCorpus that is like the cross-validating classifier corpus, only for general object handlers. This is particularly useful for cross-validating chunkers.
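
The underlying idea can be sketched as a container that collects arbitrary items and, for a chosen fold, hands back a held-out test section and the remaining training section. The class XValSketch, its methods, and the contiguous fold-boundary formula are assumptions made for illustration, not the corpus.XValidatingObjectCorpus API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical cross-validation fold sketch for arbitrary item types. */
final class XValSketch<E> {
    private final List<E> items = new ArrayList<>();
    private final int numFolds;

    XValSketch(int numFolds) {
        if (numFolds < 1)
            throw new IllegalArgumentException("need at least one fold");
        this.numFolds = numFolds;
    }

    /** Collects one item, in the style of an object handler. */
    void handle(E item) {
        items.add(item);
    }

    /** Index of the first item in the given fold (assumed contiguous folds). */
    private int foldStart(int fold) {
        return (int) ((long) fold * items.size() / numFolds);
    }

    /** Items held out for testing in the given fold. */
    List<E> test(int fold) {
        return new ArrayList<>(items.subList(foldStart(fold), foldStart(fold + 1)));
    }

    /** Items available for training in the given fold. */
    List<E> train(int fold) {
        List<E> result = new ArrayList<>(items.subList(0, foldStart(fold)));
        result.addAll(items.subList(foldStart(fold + 1), items.size()));
        return result;
    }

    public static void main(String[] args) {
        XValSketch<String> corpus = new XValSketch<String>(2);
        for (String s : new String[] {"a", "b", "c", "d", "e"})
            corpus.handle(s);
        for (int fold = 0; fold < 2; ++fold)
            System.out.println(corpus.train(fold) + " / " + corpus.test(fold));
    }
}
```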

Chunking Merges

We added a static method ChunkingImpl.merge(Chunking,Chunking), which merges two chunkings into a single chunking. This is useful to combine dictionary output with statistical chunker outputs, or to combine two statistical chunkers' outputs.
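
One plausible way to combine two chunkers' outputs is sketched below: keep every chunk from the first source, then add chunks from the second that do not overlap anything already kept. This overlap policy is an assumption chosen for illustration, and MergeSketch and its Chunk type are hypothetical; the Javadoc for ChunkingImpl.merge defines the actual behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical chunk-merging sketch; the policy is assumed, not LingPipe's. */
final class MergeSketch {

    /** A typed chunk over characters [start, end); hypothetical stand-in. */
    static final class Chunk {
        final int start;
        final int end;
        final String type;
        Chunk(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
        boolean overlaps(Chunk other) {
            return start < other.end && other.start < end;
        }
        @Override
        public String toString() {
            return start + "-" + end + ":" + type;
        }
    }

    /** Keeps all of first, then non-overlapping chunks of second, sorted by start. */
    static List<Chunk> merge(List<Chunk> first, List<Chunk> second) {
        List<Chunk> result = new ArrayList<>(first);
        for (Chunk candidate : second) {
            boolean clash = false;
            for (Chunk kept : result) {
                if (kept.overlaps(candidate)) {
                    clash = true;
                    break;
                }
            }
            if (!clash)
                result.add(candidate);
        }
        result.sort(Comparator.comparingInt((Chunk c) -> c.start));
        return result;
    }

    public static void main(String[] args) {
        List<Chunk> dictionary = Arrays.asList(new Chunk(0, 4, "PER"));
        List<Chunk> statistical = Arrays.asList(new Chunk(2, 6, "ORG"), new Chunk(10, 15, "LOC"));
        System.out.println(merge(dictionary, statistical));
    }
}
```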

Handler Deprecations

The classes corpus.StringArrayHandler, corpus.IntArrayHandler, and corpus.ChunkHandler have been deprecated. They may be replaced with instances of ObjectHandler<String[]>, ObjectHandler<int[]>, and ObjectHandler<Chunking> respectively.

Classes that depended on these interfaces, such as lm.TokenizedLM, lm.TrieIntSeqCounter, a host of classes in the chunk package, and corpus.ChunkTagHandlerAdapter, have been generalized to the relevant object handler interfaces.
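
The migration pattern can be sketched as follows: a handler that previously implemented a type-specific interface instead implements the generic handler parameterized by the same type. To keep the example self-contained, it declares its own one-method ObjectHandler<E> as a stand-in for the LingPipe interface; the TokenCounter class is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of moving to a generic object handler. */
final class HandlerSketch {

    /** Local stand-in for a generic handler interface with one handle method. */
    interface ObjectHandler<E> {
        void handle(E e);
    }

    /** Counts tokens from handled arrays; plays the role of a string-array handler. */
    static final class TokenCounter implements ObjectHandler<String[]> {
        private final Map<String, Integer> counts = new HashMap<>();

        @Override
        public void handle(String[] tokens) {
            for (String token : tokens)
                counts.merge(token, 1, Integer::sum);
        }

        int count(String token) {
            return counts.getOrDefault(token, 0);
        }
    }

    public static void main(String[] args) {
        TokenCounter counter = new TokenCounter();
        counter.handle(new String[] {"a", "b", "a"});
        System.out.println(counter.count("a"));
    }
}
```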

Bug Fixes

There were various bug fixes as reported on the LingPipe Newsgroup.