We're transitioning from online documentation to a textbook format. All of the latest demos, as well as rewritten versions of some existing tutorials, are in the book.

Reference and Link

Here's the reference:

Because we're developing the book independently of the LingPipe release cycle, we're keeping it on its own page:

Publishing Plans

Because of our non-standard license, we won't be able to get it published by a standard outlet like O'Reilly, as Steven Bird et al. did with the NLTK Book. We plan to self-publish a paper version and make it available through Amazon and other e-outlets, if anyone still reads paper by the time the book's done. The upside is that we'll continue to distribute the PDF for free.

New Tutorial Code

There is also an extensive set of demo code that goes along with the book. It's available from the book home page, as linked above. There are many more simple examples in the book than in our tutorials. We also explain some additional libraries, like the International Components for Unicode (ICU) for analyzing encodings and normalizing character sequences.

Table of Contents

For the latest information, see the LingPipe book home page.

At the time of the current LingPipe release, we've completed the following chapters, totalling a little over 450 pages printed in a relatively compact programming text format. Note that this includes some extensive introductions to the relevant features of Java, specifically characters, encodings, strings, regular expressions, and I/O. These introductions to text processing in Java go well beyond anything I've seen in other introductions to Java. There is also a much more thorough discussion of the underlying mathematical basis of our models, though that has been moved to an appendix for each chapter to make the rest of the chapter less demanding for those without strong algorithm and statistics backgrounds.

  1. Getting Started
  2. Characters and Strings
  3. Regular Expressions
  4. Input and Output
  5. Handlers, Parsers and Corpora
  6. Tokenization
  7. Suffix Arrays
  8. Symbol Tables
  9. Character Language Models
  10. Classifiers and Evaluation
  11. Naive Bayes Classifiers
  12. Latent Dirichlet Allocation

There are also some appendices covering even more basic material.

  1. Mathematics and Statistics
  2. Corpora
  3. Java Basics
  4. The Lucene Search Library
  5. Further Reading

We're currently working on the tagging and chunking chapters. We'll start with HMMs and language model chunkers. We'll introduce logistic regression classifiers before we get to CRF-based taggers and chunkers.