What is LingPipe?

LingPipe is a suite of Java libraries for the linguistic analysis of human language.

Feature Overview

LingPipe's information extraction and data mining tools:

Architecture

LingPipe's architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:

Latest Release: LingPipe 3.6.0

Intermediate Release

The latest release of LingPipe is LingPipe 3.6.0. This release replaces LingPipe 3.5.1, with which it is backward compatible except for the removal of some unused methods from the MEDLINE package. In addition to minor API additions in the form of utility methods and some clarifications in the Javadoc for various methods and classes, the following substantial changes were made:

Hyphenation and Syllabification Tutorial

There's a new hyphenation and syllabification tutorial, with evaluations on a range of publicly available and for-a-fee datasets.

Spell Checking Loader Improvements

We increased the speed and reduced the memory requirements for loading a compiled model into the spell checker. Loading should now require no memory overhead beyond the size of the compiled model, and should be about twice as fast.

Length-Based Tokenizer Filter

We added a class tokenizer.LengthStopFilterTokenizer, which filters a tokenizer's output by removing tokens that are longer than a specified maximum length.
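The idea behind a length-based filter can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the LingPipe API, and the class name LengthFilterSketch and method filterByLength are hypothetical names chosen for this example, not part of tokenizer.LengthStopFilterTokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; NOT the LingPipe implementation.
class LengthFilterSketch {

    // Returns the tokens whose length is at most maxLength, in their original order.
    // Tokens longer than maxLength are dropped.
    public static List<String> filterByLength(Iterable<String> tokens, int maxLength) {
        if (maxLength < 0) {
            throw new IllegalArgumentException("maxLength must be non-negative");
        }
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "a", "tokenizer", "pneumonoultramicroscopicsilicovolcanoconiosis", "filter");
        // Drops the one token longer than 10 characters.
        System.out.println(filterByLength(tokens, 10)); // prints [a, tokenizer, filter]
    }
}
```

A filter like this is typically useful for discarding very long character runs (for example, URLs or encoded data) before they reach downstream models.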

MEDLINE Package Update

The com.aliasi.medline package was updated to reflect the fact that there are no book entries in MEDLINE. (There are Book elements in NLM DTDs used for MEDLINE, but there are no books in the data itself.)

The changes removed the medline.Book class; the methods inBook(), inJournal(), book(), and in() from the class medline.Article; and the element constant BOOK_ELT from medline.MedlineCitationSet.

The MEDLINE tutorials were also updated to remove all the branching logic for books.

MEDLINE Tutorial Update

We've updated the MEDLINE tutorial with much better downloaders and cleaned up the broken links and erroneous property specifications in the read-me so that they match the build file.

LingMed Sandbox Project

We've included a link from the LingPipe Sandbox to the code we use for our back-end updating, storage, and indexing of biomedical resources such as MEDLINE, Entrez-Gene, OMIM, and GO. It contains extensive documentation and build files, but has many moving parts, ranging from MySQL to RMI to Log4J.

The project includes a robust downloader to keep MEDLINE up to date, as well as index construction that may be used remotely through Lucene's RMI integration. There's a generic abstraction layer that supports object-relational mapping and querying through MySQL, and object-document mapping and search through Lucene.

The LingMed sandbox project also includes a basic version of our gene linkage application, which links mentions of genes and proteins to Entrez-Gene using name matching and context matching.
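As a rough illustration of combining name matching with context matching, the following minimal sketch first selects dictionary entries whose names match a mention and then prefers the entry whose descriptive words overlap most with the words around the mention. It is not the LingMed code: the class GeneLinkSketch, the Entry type, the link method, and the toy identifiers and words are all hypothetical and chosen only for this example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-alone sketch; NOT the LingMed gene linkage application.
class GeneLinkSketch {

    // A toy dictionary entry: an identifier, its known names, and words describing it.
    public static final class Entry {
        final String id;
        final Set<String> names;
        final Set<String> contextWords;

        public Entry(String id, List<String> names, List<String> contextWords) {
            this.id = id;
            this.names = lowerCaseSet(names);
            this.contextWords = lowerCaseSet(contextWords);
        }
    }

    static Set<String> lowerCaseSet(Collection<String> words) {
        Set<String> result = new HashSet<String>();
        for (String word : words) {
            result.add(word.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // Returns the id of the name-matching entry with the largest context overlap,
    // or null if no entry has a matching name. Ties go to the earlier entry.
    public static String link(String mention, Collection<String> contextTokens, List<Entry> dictionary) {
        String normalized = mention.toLowerCase(Locale.ROOT);
        String bestId = null;
        int bestOverlap = -1;
        for (Entry entry : dictionary) {
            if (!entry.names.contains(normalized)) {
                continue; // name matching: skip entries that do not list this name
            }
            int overlap = 0; // context matching: count context tokens found in the entry's words
            for (String token : contextTokens) {
                if (entry.contextWords.contains(token.toLowerCase(Locale.ROOT))) {
                    overlap++;
                }
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                bestId = entry.id;
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        // Toy data only; these are not real Entrez-Gene records.
        List<Entry> dictionary = Arrays.asList(
            new Entry("toy-1", Arrays.asList("xyz1", "alpha"), Arrays.asList("kidney", "transport")),
            new Entry("toy-2", Arrays.asList("xyz1"), Arrays.asList("neuron", "synapse")));
        System.out.println(link("XYZ1", Arrays.asList("expressed", "in", "neuron"), dictionary));
    }
}
```

In this toy setup both entries share the name "xyz1", so the surrounding word "neuron" is what selects the second entry; a real system would need far richer normalization and scoring.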

Citations Web Page

We added a new page for citations of LingPipe, including papers, books, classes, and patents.