LingPipe Customers

After its October 2003 release, LingPipe was adopted by commercial customers ranging from the defense and health industries to Web 2.0 startups. LingPipe is also widely used in the academic community for both research and teaching.

Customers are listed in the following sections:

Commercial Customers

Thomson Legal and Regulatory

Thomson provides the Westlaw legal search engine. Several projects in development.

EdgarOnline

EdgarOnline provides a financial report and news search and data aggregation service. Project in development.

Technorati

Technorati provides a blog search, tagging and syndication service. Project in development.

Nielsen Buzzmetrics

The Buzzmetrics group at Nielsen measures consumer-generated media. We did language-model-driven sentiment analysis for brands in blog data. Project deployed.

U. S. Department of Defense

We provided the ThreatTracker application, which is based on LingPipe, and deployed several applications for training and evaluation. We estimate that over 200 intelligence analysts were trained in the use of ThreatTracker.

Mitre

We released a daily Osama bin Laden tracker for Mitre's MiTAP product, which was used worldwide by intelligence analysts and other government offices for years. Derivative products included a "Top Ten Terrorist Suspects" tracker as well as a product that tracked infectious disease outbreaks in FBIS and open source data feeds. The system was in production for two years.

Endeca

We supply search faceting technology based on noun phrase extraction in French and English for Endeca.

In 2009 we joined the Endeca Extend Program, which exposes LingPipe functionality as a plug-in to Endeca's faceted search engine.

Research Patrons

U. S. National Library of Medicine

Alias-i pitched an improved ThreatTracker-like application for bioinformatics and received a two-year Small Business Innovation Research (SBIR) grant to develop it from NLM, a part of the National Institutes of Health (NIH).

This project has driven most of LingPipe's recent API development, including confidence-ranked entities and part-of-speech tags as well as approximate dictionary-matched extraction and classification.

Application deliverables include LingBlast, a cross-document coreference application that linked 40,000 human genes to 15 million MEDLINE abstracts using language models to improve search results over expanded aliases, and the relation extraction mechanism described in the database tutorial. Development of LingArray, a literature assay and visualization tool similar to microarrays, is the final deliverable.

U. S. Defense Advanced Research Projects Agency

Three years of seed funding came from DARPA's Information Processing Technology Office through the Translingual Information Detection, Extraction and Summarization (TIDES) program. Alias-i explored applications of cross-document coreference resolution to search, tracking and relationship mining. (Coreference involves linking mentions of objects in text to their real-world referents and/or to other mentions with the same referent.)

Pre-release versions of LingPipe were deployed as part of our ThreatTracker product. Check out a screen shot of our translingual ThreatTracker prototype.

Academic Research

We are amazed at the quantity, creativity and quality of work done with LingPipe in fields ranging from bioinformatics to blog processing experiments. Below are some selected publications with a brief description of how LingPipe was used and/or quotes.

MITRE, Brandeis University

LingPipe is used in concert with another named entity recognition tool (Carafe) to de-identify patient records. De-identification means removing all patient-specific information from the text.

University of North Texas

LingPipe is used for sentence detection for the TREC conference, and perhaps coreference resolution (not entirely clear from paper).

German Research Center for AI (DFKI)

"LingPipe, which is a software package from Alias-i, consists of several language processing modules: a statistical named entity recognizer, a heuristic sentence splitter, and a heuristic within-document co-reference resolution system. LingPipe comes with a English language model. The types of NE covered by LingPipe are locations, persons and organizations. We have re-trained LingPipe so as to cover more named entities types: DATE for English, and both DATE and NUMBER for German. We extended the co-reference resolution algorithm to count for German pronouns as well. A large Gazetteer of named entity instances has been used for both languages and for English a PERSON Gazetteer with gender attributes has been integrated for a better co-reference resolution."

University of Hildesheim

"LingPipe was used as a basic tool. Lingpipe applies a statistical machine learning approach to named entity recognition and categorization. For training LingPipe, we used one annotated corpus for each language: German: Frankfurter Rundschau with 36 Million word forms (Source: Linguistic Data Consortium, LDC); English: Reuters News (810.000 news texts)"

Thomson Legal and Regulatory

Thomson Legal and Regulatory used LingPipe to do pronoun resolution for a summarization system they built. LingPipe was not doing the right thing out of the box, so they extended it to handle longer-distance anaphora. That extensibility is of course a selling point for LingPipe.

Cambridge University

" The Lingpipe NER module achieves high precision by only generalizing to unseen names in lexical contexts which are clearly indicative of gene names in the training data.... We tested the performance of LingPipe on both annotations using standard definitions of Recall/Precision/F-score achieving 0.8086/0.7485/0.7774 and 0.8423/0.8483/0.8453, respectively."

"Morgan et al, evaluating on the the first set of annotations, reported that they achieved 0.71/0.78/0.75. Comparing the systems, our performance is a little better, especially in terms of recall. "

" On unseen tokens, compared to Morgan et al. our performance is significantly higher (0.619 on merged and 0.5365 on morgan compared to 0.33 F-score), which can be attributed to the treatment of unseen tokens by LingPipe. "

" For each token classified, we estimated the entropy of the distribution of Equation 1 computed by LingPipe, which gave us an indication of how (un)certain the classifier was of its decision. We observed that many of the recall errors occurred in cases in which the HMM model classified a token with entropy close to 1, i.e. with high uncertainty. We post-processed the output of the classifier by re-annotating as genes unseen tokens that were classified as ordinary words with entropy higher than a specified threshold. "

The very satisfying bit of the paper is that they get into the guts of LingPipe and improve it; this is why we give it to researchers. In return, we helped them integrate LingPipe into their partially word-annotated XML pipeline.
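The entropy-based post-processing described in the last quotation can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' code and not LingPipe's API: the token dictionaries, the `O` and `GENE` labels, the function names and the 0.9 threshold are all assumptions made for the example, and the probabilities are made up. Entropy is measured in bits, so a two-way distribution peaks at 1, which matches the "entropy close to 1" wording in the quotation.

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def relabel_uncertain(tokens, threshold=0.9):
    """Re-annotate as GENE any unseen token that was labeled as an ordinary
    word ("O") with entropy above the threshold. The threshold value is a
    hypothetical choice for this sketch."""
    relabeled = []
    for tok in tokens:
        uncertain = entropy_bits(tok["dist"]) > threshold
        if tok["label"] == "O" and not tok["seen"] and uncertain:
            tok = {**tok, "label": "GENE"}  # copy, so the input is not mutated
        relabeled.append(tok)
    return relabeled


# Made-up example: "dist" holds hypothetical [P(GENE), P(O)] values per token,
# and "seen" says whether the token occurred in the training data.
tokens = [
    {"text": "the", "label": "O", "seen": True, "dist": [0.01, 0.99]},
    {"text": "dpp", "label": "O", "seen": False, "dist": [0.45, 0.55]},
    {"text": "wing", "label": "O", "seen": False, "dist": [0.02, 0.98]},
]
print([t["label"] for t in relabel_uncertain(tokens)])  # ['O', 'GENE', 'O']
```

Only the unseen token with a near-uniform distribution is relabeled; confident decisions and tokens seen in training are left alone, which is how the step can recover recall without touching the rest of the output.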

University of New South Wales

Used LingPipe for named entity recognition for the geographic part of the CLEF cross-lingual information retrieval evaluation.

LingPipe in Education

LingPipe is widely used as the basis for assignments or even whole courses. These are mostly upper-level undergraduate and beginning graduate courses in search, natural language processing or data mining.

The following is a list of some courses we know about. Please feel free to submit more.

University of Washington

Illinois Institute of Technology

University of Amsterdam

A project-oriented master's course that used LingPipe as a significant component. They produced an interesting system report, as well.