What is LingPipe?
LingPipe is a suite of Java libraries for the linguistic analysis of human language.
Feature Overview
LingPipe's information extraction and data mining tools:
- track mentions of entities (e.g. people or proteins);
- link entity mentions to database entries;
- uncover relations between entities and actions;
- classify text passages by language, character encoding, genre, topic, or sentiment;
- correct spelling with respect to a text collection;
- cluster documents by implicit topic and discover significant trends over time; and
- provide part-of-speech tagging and phrase chunking.
Architecture
LingPipe's architecture is designed to be efficient, scalable, reusable, and robust. Highlights include:
- Java API with source code and unit tests;
- multi-lingual, multi-domain, multi-genre models;
- training with new data for new tasks;
- n-best output with statistical confidence estimates;
- online training (learn-a-little, tag-a-little);
- thread-safe models and decoders for concurrent-read exclusive-write (CREW) synchronization; and
- character encoding-sensitive I/O.
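As a rough illustration of what concurrent-read exclusive-write (CREW) synchronization means for a model that is trained and queried at the same time, here is a minimal sketch using the JDK's ReentrantReadWriteLock. The class CrewGuardedCounts and its methods are hypothetical stand-ins invented for this example; they are not LingPipe classes, and this is not a description of how LingPipe implements its own synchronization.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical toy "model" (token counts); an assumption for illustration,
// not part of the LingPipe API.
class CrewGuardedCounts {
    private final Map<String, Integer> counts = new HashMap<>();
    private final ReadWriteLock lock = new ReentrantReadWriteLock();

    // Exclusive write: a training update blocks readers and other writers.
    void train(String token) {
        lock.writeLock().lock();
        try {
            counts.merge(token, 1, Integer::sum);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Concurrent read: any number of threads may query at the same time,
    // as long as no writer holds the lock.
    int count(String token) {
        lock.readLock().lock();
        try {
            return counts.getOrDefault(token, 0);
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CrewGuardedCounts model = new CrewGuardedCounts();
        // One thread trains a little while another queries a little.
        Thread trainer = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                model.train("protein");
            }
        });
        Thread reader = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                model.count("protein");
            }
        });
        trainer.start();
        reader.start();
        trainer.join();
        reader.join();
        System.out.println(model.count("protein")); // prints 1000
    }
}
```

The read lock lets many query threads proceed in parallel, while the write lock gives each training update exclusive access, which is the pattern the CREW acronym names.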
Latest Release: LingPipe 3.6.0
Intermediate Release
The latest release of LingPipe is LingPipe 3.6.0. This release replaces LingPipe 3.5.1 and is backward compatible with it, except that some unused methods were removed from the MEDLINE package. In addition to minor API additions in the form of utility methods and some Javadoc clarifications for various methods and classes, the following substantial changes were made:
Hyphenation and Syllabification Tutorial
There's a new hyphenation and syllabification tutorial, with evaluations on a range of publicly available and for-a-fee datasets.
Spell Checking Loader Improvements
We increased the speed and reduced the memory requirements for loading a compiled model into the spell checker. Loading should now require no memory beyond the size of the compiled model, and it should be about twice as fast.
Length-Based Tokenizer Filter
We added the class tokenizer.LengthStopFilterTokenizer, which removes from a tokenizer's output any tokens longer than a specified maximum length.
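The filtering behavior described above can be sketched in plain Java as follows. This is only an illustration of the idea under stated assumptions: the class LengthFilterSketch and the method filterByLength are hypothetical names made up for this example, the sketch works on a list of strings rather than on a LingPipe tokenizer, and it does not show the constructor or methods of LengthStopFilterTokenizer itself (see the Javadoc for those).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not the LingPipe class.
class LengthFilterSketch {

    // Returns the tokens whose length is at most maxLength, in order.
    // Tokens longer than maxLength are dropped.
    static List<String> filterByLength(List<String> tokens, int maxLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "the", "p53", "pneumonoultramicroscopicsilicovolcanoconiosis", "gene");
        // With a maximum length of 10, only the very long token is removed.
        System.out.println(filterByLength(tokens, 10)); // prints [the, p53, gene]
    }
}
```

A filter of this kind is typically useful for discarding junk tokens, such as long runs of unbroken characters, before they reach a downstream model.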
MEDLINE Package Update
The com.aliasi.medline package was updated to reflect the fact that there are no book entries in MEDLINE. (There are Book elements in the NLM DTDs used for MEDLINE, but there are no books in the data itself.) The changes removed the medline.Book class; the methods inBook(), inJournal(), book(), and in() from the class medline.Article; and the element constant BOOK_ELT from medline.MedlineCitationSet.
The MEDLINE tutorials were also updated to remove all the branching logic for books.
MEDLINE Tutorial Update
We've updated the MEDLINE tutorial with much better downloaders and cleaned up the broken links and erroneous property specifications in the read-me so that they match the build file.
LingMed Sandbox Project
We've included a link from the LingPipe Sandbox to the code we use for our back-end updating, storage, and indexing of biomedical resources such as MEDLINE, Entrez-Gene, OMIM, and GO. It contains extensive documentation and build files, but it has many moving parts, ranging from MySQL to RMI to Log4J.
The project includes a robust downloader to keep MEDLINE up to date, as well as construction of indexes that may be used remotely through Lucene's RMI integration. There's a generic abstraction layer that supports object-relational mapping and querying through MySQL, and object-document mapping and search through Lucene.
The LingMed sandbox project also includes a basic version of our gene linkage application, which links mentions of genes and proteins to Entrez-Gene using name matching and context matching.
Citations Web Page
We added a new page for citations of LingPipe, including papers, books, classes, and patents.