What are (Chain) Conditional Random Fields?

LingPipe implements first-order chain conditional random fields (CRFs). A chain conditional random field is a model for labeling sequences of tokens with tags drawn from a finite set. Typical applications include part-of-speech tagging and, by coding chunks as sequences of tags, named-entity recognition and other chunking problems, such as sentence detection.

Logistic Regression Refresher

Chain CRFs extend the classification strategy embodied in logistic regression to sequence tagging. For background, see our logistic regression tutorial.

Logistic regression involves classifying objects based on feature vectors extracted from the objects by a feature extractor. Each category k is assigned a coefficient vector β[k]. Then for a given feature vector x extracted from an object, the dot product β[k] * x provides an unnormalized probability on the log scale,

      Pr(cat = k|x) = exp(β[k] * x) / Z(x),

where the normalizing term sums the unnormalized probabilities over all categories,

      Z(x) = Σk' exp(β[k'] * x).

In the LingPipe implementation, we assume that β[0] = 0; this does not lose any generality and uniquely identifies the model.
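As a sketch (with made-up coefficient values, not a LingPipe API), the model above can be computed directly; note that β[0] is pinned to the zero vector:

```java
public class SoftmaxSketch {
    // Hypothetical coefficients for K=3 categories over 2 features;
    // beta[0] is pinned to the zero vector for identifiability.
    static final double[][] BETA = {
        { 0.0, 0.0 },
        { 1.0, -0.5 },
        { -2.0, 0.3 }
    };

    // Pr(cat = k | x) = exp(beta[k] * x) / Z(x),
    // where Z(x) sums exp(beta[k'] * x) over all categories k'.
    static double[] categoryProbs(double[] x) {
        double[] probs = new double[BETA.length];
        double z = 0.0;
        for (int k = 0; k < BETA.length; ++k) {
            double dot = 0.0;
            for (int i = 0; i < x.length; ++i)
                dot += BETA[k][i] * x[i];
            probs[k] = Math.exp(dot);
            z += probs[k];
        }
        for (int k = 0; k < probs.length; ++k)
            probs[k] /= z;   // normalize by Z(x)
        return probs;
    }

    public static void main(String[] args) {
        double[] probs = categoryProbs(new double[] { 1.0, 2.0 });
        double sum = 0.0;
        for (double p : probs) sum += p;
        System.out.println(Math.abs(sum - 1.0) < 1e-12);   // probabilities sum to 1
    }
}
```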

Conditional Random Fields

A sequence tagging problem, such as part-of-speech tagging, can be viewed as a sequence classification problem where the categories are sequences of tags. If there are K different tags, a sequence of length N has up to K**N possible sequences of tags (though some may be eliminated on structural grounds). CRFs carry out logistic regression over these possible sequences.

First-Order Chain Feature Extraction

In order to efficiently compute the best tagging(s) given an input sequence, the features are restricted to depend only on local features of the output tags. In first-order chain CRFs, features may only depend on pairs of output tags. Although we can think of this as providing one tag of context in predicting the next tag, very much like a first-order HMM, CRFs are critically different in that they are normalized over the entire input sequence.

Suppose we have an input sequence of length N, x[1],...,x[N]. And suppose we have K output categories coded as integers 1,...,K. We extract features for each position based on the position in the input (1,...,N) and for each previous tag (1,...,K). If feats() is the feature extractor for a given position and previous tag, it is run with all of the following arguments:

   feats(pos=1,prevTag=null),
   feats(pos=2,prevTag=1), ..., feats(pos=2,prevTag=K),
   ...,
   feats(pos=N,prevTag=1), ..., feats(pos=N,prevTag=K).

For additional efficiency, we factor the feature extractor into two parts, one of which pays attention to the previous tag, and one of which doesn't. The reason for this is that the computation for the non-previous-tag-dependent features can be shared, and these are most of the features used.

Coefficient Vectors by Tag

A CRF consists of a coefficient vector β[k] for each of K output tags, 1,...,K.

Just as with logistic regression, training CRFs constitutes fitting coefficient vectors based on training data.

Linear Predictor

As in logistic regression, we compute unnormalized non-negative scores for each outcome, here a tag sequence, then normalize by dividing by the sum of these scores. Consider the sequence of output tags k[1],...,k[N]. The score of this output for input sequence x[1],...,x[N] is given by exponentiating a sum of linear predictors:

   φ(k[1],...,k[N] | x[1],...,x[N])
     = exp( feats(1,null) * β[k[1]]
          + feats(2,k[1]) * β[k[2]]
          + ... + feats(N,k[N-1]) * β[k[N]] )

We then compute the probability of the output by normalizing, just as in logistic regression, setting

   p(k[1],...,k[N] | x[1],...,x[N])
      = φ(k[1],...,k[N] | x[1],...,x[N]) / Z(x[1],...,x[N]),

where

   Z(x[1],...,x[N])
      = Σk[1] in 1:K ... Σk[N] in 1:K φ(k[1],...,k[N] | x[1],...,x[N])
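A brute-force sketch makes the normalization concrete by enumerating all K**N tag sequences; the local score function here is made up purely for illustration, standing in for feats(...) * β[...]:

```java
public class CrfNormalizerSketch {
    static final int K = 3;   // number of tags
    static final int N = 4;   // sequence length

    // Hypothetical local score; in a real CRF this would be the
    // dot product feats(pos, prevTag) * beta[tag].
    static double score(int pos, int prevTag, int tag) {
        return 0.1 * tag - 0.05 * prevTag + 0.01 * pos;
    }

    // phi = exp of the sum of local scores over the whole sequence
    static double phi(int[] tags) {
        double sum = score(0, -1, tags[0]);   // prevTag = null coded as -1
        for (int n = 1; n < tags.length; ++n)
            sum += score(n, tags[n - 1], tags[n]);
        return Math.exp(sum);
    }

    public static void main(String[] args) {
        // Z = sum of phi over all K^N possible tag sequences
        double z = 0.0;
        int[] tags = new int[N];
        int numSeqs = (int) Math.pow(K, N);
        for (int code = 0; code < numSeqs; ++code) {
            for (int n = 0, c = code; n < N; ++n, c /= K)
                tags[n] = c % K;    // decode a tag sequence from the integer
            z += phi(tags);
        }
        // normalized probability of the all-zeros tagging
        double p = phi(new int[N]) / z;
        System.out.println(p > 0.0 && p < 1.0);
    }
}
```

Real implementations never enumerate the K**N sequences; the forward-backward algorithm computes Z efficiently by exploiting the first-order factoring.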

Global Normalization and Label Bias

One of the motivations of conditional random fields was to avoid the label-bias problem found in the obvious generalization of logistic regression to sequences (known somewhat confusingly in natural language processing as "maximum entropy Markov models," or MEMMs). MEMMs are like conditional random fields, but normalize their probabilities on a per-tag basis rather than over the whole sequence.

Label bias arises when the choice of an upstream (earlier) tag resets probabilities, allowing a choice of bad downstream tags based on the upstream tag. By normalizing over the whole sequence at once, longer-distance effects are taken into account, even though the features are only extracted locally: we find a globally optimal tag sequence without normalizing on a tag-by-tag basis. Thus if there are no good downstream tags, they will all be downweighted, whereas in a per-tag-normalized model like an MEMM, even tags that received little weight in training are renormalized at each position in such a way that they remain viable.

Overview of LingPipe's CRF APIs

LingPipe codes up first-order chain CRFs in the package com.aliasi.crf, with applications to tagging and to chunking. In the rest of the tutorial we will go into detail with code samples; in this preliminary section we just provide an overview of what's to come.

CRFs for Tagging

Tagging Package

The general interfaces for taggers, which assign labels to each symbol in a sequence of symbols, are in com.aliasi.tag. It contains interfaces for first-best sequence taggers, n-best sequence taggers, and for computing marginal probabilities of tags for symbols given a sequence. This package also contains evaluation tools for tagging.

Chain CRF Class

LingPipe's implementation of chain CRFs as sequence taggers is in the class com.aliasi.crf.ChainCrf.

CRFs for Chunking

Chunking Package

The general interface for chunkers, which extract chunks from input strings, is found in com.aliasi.chunk. The chunking package contains interfaces for first-best chunking, n-best chunking, and for extracting chunks in order of marginal probabilities given the input text. There are also evaluation tools for chunkers.

Chain CRF Chunking Class

LingPipe's implementation of CRF-based chunkers is in com.aliasi.crf.ChainCrfChunker. It relies on the implementation of linear chain CRFs in the class crf.ChainCrf.

The CRF chunker translates between taggings and chunkings using an instance of com.aliasi.chunk.TagChunkCodec. In this tutorial, we will consider the standard begin/in/out (BIO) encoding of chunks as tags.

Static Estimators (Factories)

Both the tagging and chunking CRFs are estimated using a static factory method.

Corpus Implementations

Training in both cases is based on a corpus of training data implementing com.aliasi.corpus.Corpus. For tagging, it is a corpus of tag.Tagging objects; for chunking, it is a corpus of chunk.Chunking objects.

Feature Extractors

In order to estimate coefficients, the training corpus needs to be converted to vector format. Feature extraction for chain CRFs is more complex than for something like logistic regression because of the need for repeated extractions from the same context for different positions in the input and different previous tags.

There is a special interface for chain CRF feature extractors, crf.ChainCrfFeatureExtractor. A feature extractor takes an input sequence of tokens and set of possible tags, and produces an instance of crf.ChainCrfFeatures.

The returned features object in turn provides feature maps for nodes, based on the input tokens and position, and edges, based on the input tokens, position, and a previous tag. It is constructed once for a given input and then used to populate the lattice required for decoding with feature vectors. This allows operations performed on the input sequence to be cached and reused. For instance, part-of-speech tags would be computed once at features object construction time and then reused in calls to get node or edge feature maps.

Stochastic Gradient Implementation

Estimation is based on stochastic gradient descent, the same method used for logistic regression estimation. The difference is that for CRFs, the forward-backward algorithm must be used to compute expectations; otherwise, the algorithms are identical.

There is only one underlying implementation of CRF estimation, found in the class crf.ChainCrf; the chunking CRF just delegates its factory call to the chain CRF estimation factory after converting its inputs from chunkings to taggings.

Training involves setting a range of parameters similar to those described in LingPipe's logistic regression tutorial, including parameters for the learning rate and annealing rate.

Unlike logistic regression, which uses lazy prior updates for a purely stochastic gradient, the CRF stochastic gradient implementation blocks the prior updates. This means that the gradient for the prior is only applied every so often, typically every 10 to 100 items or so. The larger the block size, the faster the processing. We've found block sizes of around 50 provide a good balance between speed and stochasticity, returning virtually the same results as block sizes of 1.
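A minimal sketch of the blocking idea, with hypothetical values and the per-item likelihood update elided: the Gaussian prior gradient is applied only once per block, scaled up by the block size to compensate.

```java
public class BlockedPriorSketch {
    public static void main(String[] args) {
        int numItems = 1000;
        int blockSize = 50;        // apply the prior every 50 items
        double learningRate = 0.05;
        double priorVariance = 4.0;
        double beta = 1.0;         // a single coefficient, for illustration

        for (int i = 1; i <= numItems; ++i) {
            // (the likelihood gradient step for item i would go here)
            if (i % blockSize == 0) {
                // Gaussian prior gradient is -beta/variance; applying it
                // once per block, scaled by blockSize, approximates
                // applying it after every single item.
                beta += learningRate * blockSize * (-beta / priorVariance);
            }
        }
        // with no likelihood term, the prior shrinks beta toward zero
        System.out.println(beta > 0.0 && beta < 1.0);
    }
}
```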

Feature Caching

Because of the lattice required for forward-backward decoding, CRFs require N node feature vectors and K*(N-1) edge feature vectors per input, where K is the number of possible tags and N is the number of tokens. Caching these extracted vectors across training epochs trades memory for speed.

Coefficient Priors

Coefficients in CRFs are given priors in exactly the same way as coefficients in logistic regression, using an instance of stats.RegressionPrior. The most commonly used priors are implemented, including Gaussian (L2), Laplace (L1), Cauchy (Student-t with one degree of freedom), and maximum likelihood (uniform).

Learning and Annealing Rates

The stochastic gradient descent implementation allows constant learning rates, or learning rates that are annealed (gradually lowered over time). Annealing is technically required to ensure convergence, but often can be eliminated in practice where resulting estimates need only be close enough.

Annealing is handled through an instance of stats.AnnealingSchedule. The common annealing schedules are implemented, including constant learning rates (no annealing), exponential learning rate decay, and inverse learning rate decay.
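As a sketch, exponential decay multiplies the learning rate by a fixed base after each epoch, i.e. lr(epoch) = initialRate * base^epoch; the values here match the trainer configuration used later in this tutorial:

```java
public class AnnealingSketch {
    // Exponential decay: lr(epoch) = initialRate * base^epoch.
    static double exponentialRate(double initialRate, double base, int epoch) {
        return initialRate * Math.pow(base, epoch);
    }

    public static void main(String[] args) {
        double initialRate = 0.05;
        double base = 0.995;
        // epoch 0 uses the initial rate unchanged
        System.out.println(exponentialRate(initialRate, base, 0) == 0.05);
        // the rate decreases monotonically across epochs
        System.out.println(exponentialRate(initialRate, base, 108)
                           < exponentialRate(initialRate, base, 107));
    }
}
```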

Reporting During Estimation

Just as for logistic regression, an instance of io.Reporter is used to provide feedback during estimation. The verbosity of the reporter and the destination of its reports may be set programmatically, in the same way as is typically done for loggers.

CRF Taggers

We'll first describe tagging using CRFs with a very simple toy example for part-of-speech tagging. The objects being tagged are simple sequences of strings, so the training corpus is specified with handlers that deal with Tagging<String> instances.

Corpus Implementation

Here's a simple hard-coded corpus, as implemented in src/TinyPosCorpus.java. First, the specification of what it implements:

public class TinyPosCorpus
    extends Corpus<ObjectHandler<Tagging<String>>> {
...

This corpus is going to use a handler that handles objects of type Tagging<String>. We hard code the words and tags with a simple array:

    static final String[][][] WORDS_TAGSS = new String[][][] {
        { { "John", "ran", "." }, { "PN", "IV", "EOS" } },
        { { "Mary", "ran", "." }, { "PN", "IV", "EOS" } },
        { { "John", "jumped", "!" }, { "PN", "IV", "EOS" } },
        { { "The", "dog", "jumped", "!" }, { "DET", "N", "IV", "EOS" } },
        { { "The", "dog", "sat", "." }, { "DET", "N", "IV", "EOS" } },
        { { "Mary", "sat", "!" }, { "PN", "IV", "EOS" } },
        { { "Mary", "likes", "John", "." }, { "PN", "TV", "PN", "EOS" } },
        { { "The", "dog", "likes", "Mary", "." }, { "DET", "N", "TV", "PN", "EOS" } },
        { { "John", "likes", "the", "dog", "." }, { "PN", "TV", "DET", "N", "EOS" } },
        { { "The", "dog", "ran", "." }, { "DET", "N", "IV", "EOS", } },
        { { "The", "dog", "ran", "." }, { "DET", "N", "IV", "EOS", } }
    };

The idea here is that each line represents first the tokens, then the tags, both as arrays of strings. That allows the visitor method for training to simply iterate over the array:

    public void visitTrain(ObjectHandler<Tagging<String>> handler) {
        for (String[][] wordsTags : WORDS_TAGSS) {
            String[] words = wordsTags[0];
            String[] tags = wordsTags[1];
            Tagging<String> tagging
                = new Tagging<String>(Arrays.asList(words),
                                      Arrays.asList(tags));
            handler.handle(tagging);
        }
    }

We won't use the corpus for testing, so the test visitor is implemented as a no-op (not shown).

Feature Extractors

We'll implement the simplest possible feature extractors for a CRF, which only record the current token and previous tag respectively. This is the same set of features used in a traditional hidden Markov model (HMM).

The code for feature extractors is in src/SimpleChainCrfFeatureExtractor.java.

Chain CRF Feature Extractor

The top-level shell is an implementation of crf.ChainCrfFeatureExtractor over string input tokens. This is declared as

public class SimpleChainCrfFeatureExtractor
    implements ChainCrfFeatureExtractor<String>,
               Serializable {
...

This declares our feature extractor to operate over token sequences consisting of strings. Also note that it is declared to be serializable, which will allow it to be serialized along with the rest of the CRF after training.

There is only a single method, which returns a new features instance,

...
    public ChainCrfFeatures<String> extract(List<String> tokens,
                                            List<String> tags) {
        return new SimpleChainCrfFeatures(tokens,tags);
    }
...

Chain CRF Features

The chain CRF features object implementation does the work. It's declared as a static inner class because nothing on the outside needs to know about its identity other than as an extension of the abstract base class crf.ChainCrfFeatures. The implementation is thus declared as

...
    static class SimpleChainCrfFeatures
        extends ChainCrfFeatures<String> {
...

and the constructor simply sends the list of tokens and tags up to the superclass for management.

...
        public SimpleChainCrfFeatures(List<String> tokens,
                                      List<String> tags) {
            super(tokens,tags);
        }
...

The real work is done in the method that returns node features,

...
        public Map<String,Integer> nodeFeatures(int n) {
            return Collections
                .singletonMap("TOK_" + token(n),
                              Integer.valueOf(1));
        }
...

and the method that returns edge features,

...
        public Map<String,Integer> edgeFeatures(int n, int k) {
            return Collections
                .singletonMap("PREV_TAG_" + tag(k),
                              Integer.valueOf(1));
        }
    }
}

The node feature map consists of a single entry whose key represents the token, which is retrieved using the method token(int) on the abstract base class, crf.ChainCrfFeatures. The feature name is the prefix TOK_ followed by the token string itself, and its value is 1.

The edge feature map is represented and extracted in the same way, keying on the previous tag, which is retrieved with tag(int), under the prefix PREV_TAG_.

The java.util.Collections convenience method singletonMap is useful for this operation. It's particularly convenient due to the availability of type inference on methods, which allows us to drop the type of the returned map. In general, an instance of LingPipe's util.ObjectToDoubleMap might be used.

Note that we have used a more specific return type, namely Map<String,Integer>, which represents a feature vector as a mapping from string-based feature names to numerical values. This is legal because Integer extends the specified wildcard type, (? extends Number).

Note that we have left out the serialVersionUID declaration for the top-level class; it is only necessary for forward compatibility of serialized versions (and to stop the lint tool from griping during compile).

Estimating CRF Taggers

The trainer for the simple POS tagging CRF is in src/SimplePosTrain.java.

We need to marshal several arguments before training a CRF tagger. In the simple trainer, we just create them all in the static main() method, starting with the corpus and feature extractors we just described:

public class SimplePosTrain {

    public static void main(String[] args) throws IOException {

        Corpus<ObjectHandler<Tagging<String>>> corpus
            = new TinyPosCorpus();

        ChainCrfFeatureExtractor<String> featureExtractor
            = new SimpleChainCrfFeatureExtractor();

        boolean addIntercept = true;

        int minFeatureCount = 1;

        boolean cacheFeatures = false;

        boolean allowUnseenTransitions = true;
...

We just create a corpus instance and a feature extractor instance. We set the minimum count for features to not be pruned from the corpus to 1. We also set the flag for automatically adding an intercept feature to feature vectors (with feature identifier 0 and value 1.0). A second flag tells the estimator not to cache the feature vectors (this saves space, but is slower).

The next step is to set a flag to allow unseen transitions in the output taggings, which means that we are not restricted to taggings with starts, ends, or transitions that were seen in the training data. For instance, even if the category NN was never seen starting a sentence in the training data, or an (ADJ,NN) tag pair was never seen in the training data in that order, we will allow them as potential outputs if their estimated scores are high enough. Setting this flag to false can speed up processing for first-best, n-best and marginal taggings, because it eliminates vector multiplications for illegal tag positions or transitions.

Next, we need to set up the prior. Here, we'll use a Gaussian prior with a prior variance of 4.0, which is fairly diffuse (meaning it doesn't force coefficients very hard back toward the prior mean of 0.0).

...
        double priorVariance = 4.0;
        boolean uninformativeIntercept = true;
        RegressionPrior prior
            = RegressionPrior.gaussian(priorVariance,
                                       uninformativeIntercept);
        int priorBlockSize = 3;
...

We have also set a prior block size of 3, which means that we update the coefficients by the prior gradient every three instances. Appropriate scaling of the learning rate based on the block size is carried out automatically.

Next, we set up the learning rate annealing schedule. We'll use an exponential decay and a fairly high initial learning rate:

...
        double initialLearningRate = 0.05;
        double learningRateDecay = 0.995;
        AnnealingSchedule annealingSchedule
            = AnnealingSchedule.exponential(initialLearningRate,
                                            learningRateDecay);
...

We then set the three search parameters, which indicate the minimum and maximum number of epochs, as well as the minimum amount by which the log likelihood must improve (relatively) per epoch to go beyond the minimum number of epochs.

...
        double minImprovement = 0.00001;
        int minEpochs = 2;
        int maxEpochs = 2000;
...

Finally, we set up a reporter which reports at the debug level to the standard output, so we can watch what happens during training on the console:

...
        Reporter reporter
            = Reporters.stdOut().setLevel(LogLevel.DEBUG);
...

At last, we are in a position to call the estimation method, which takes all of the previously defined variables as arguments:

        ChainCrf<String> crf
            = ChainCrf.estimate(corpus,
                                featureExtractor,
                                addIntercept,
                                minFeatureCount,
                                cacheFeatures,
                                allowUnseenTransitions,
                                prior,
                                priorBlockSize,
                                annealingSchedule,
                                minImprovement,
                                minEpochs,
                                maxEpochs,
                                reporter);

The remainder of the method simply writes the model to a file in serialized form using the utility method in util.AbstractExternalizable:

        File modelFile = new File(args[0]);
        AbstractExternalizable.serializeTo(crf,modelFile);

Because Java serialization is used to store the model, it may be written to any output stream by wrapping the stream in an object output stream and writing the CRF to the stream.
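Java's serialization machinery works the same for any Serializable object; a minimal round-trip sketch, using a String as a stand-in for the trained CRF:

```java
import java.io.*;

public class SerializeRoundTrip {
    public static void main(String[] args) throws Exception {
        // Write any Serializable object by wrapping an output stream
        // in an object output stream ...
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        ObjectOutputStream objOut = new ObjectOutputStream(bytesOut);
        objOut.writeObject("stand-in for a serialized CRF");
        objOut.close();

        // ... and read it back with an object input stream.
        ObjectInputStream objIn
            = new ObjectInputStream(
                  new ByteArrayInputStream(bytesOut.toByteArray()));
        Object obj = objIn.readObject();
        System.out.println("stand-in for a serialized CRF".equals(obj));
    }
}
```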

Running the Trainer

It is all set up in Ant to run with the following command (from the tutorial/crf directory):

lingpipe\trunk\demos\tutorial\crf> ant simple-pos-train
Estimating
      :00 ChainCrf.estimate Parameters
      :00 featureExtractor=SimpleChainCrfFeatureExtractor@158b649
      :00 addInterceptFeature=true
      :00 minFeatureCount=1
      :00 cacheFeatureVectors=false
      :00 allowUnseenTransitions=true
      :00 prior=GaussianRegressionPrior(Variance=4.0, noninformativeIntercept=true)
      :00 annealingSchedule=Exponential(initialLearningRate=0.05, base=0.995)
      :00 minImprovement=1.0E-5
      :00 minEpochs=2
      :00 maxEpochs=2000
      :00 priorBlockSize=3
      :00 Computing corpus tokens and features
      :00 Corpus Statistics
      :00 Num Training Instances=11
      :00 Num Training Tokens=42
      :00 Num Dimensions After Pruning=17
      :00 Tags={0=PN, 1=IV, 2=EOS, 3=DET, 4=N, 5=TV}
      :00 epoch=    0 lr=0.050000000 ll=   -59.7809 lp=  -223.3503 llp=  -283.1311 llp*=  -283.1311
      :00 epoch=    1 lr=0.049750000 ll=   -49.2419 lp=  -223.5537 llp=  -272.7956 llp*=  -272.7956
      :00 epoch=    2 lr=0.049501250 ll=   -40.7240 lp=  -223.8438 llp=  -264.5677 llp*=  -264.5677
...
      :00 epoch=  108 lr=0.029097972 ll=    -3.5685 lp=  -230.4528 llp=  -234.0213 llp*=  -234.0213
      :00 epoch=  109 lr=0.028952482 ll=    -3.5671 lp=  -230.4540 llp=  -234.0211 llp*=  -234.0211
      :00 epoch=  110 lr=0.028807720 ll=    -3.5658 lp=  -230.4552 llp=  -234.0210 llp*=  -234.0210
      :00 Converged with rollingAverageRelativeDiff=9.966186465104444E-6
      :00 Feat Extraction Time=:00
      :00 Forward Backward Time=:00
      :00 Update Time=:00
      :00 Prior Update Time=:00
      :00 Loss Time=:00

Compiling to file=simplePos.ChainCrf
BUILD SUCCESSFUL
Total time: 1 second

The first part of the print out simply reports the input parameters.

As with logistic regression training, when the reporter is set to debug level reporting, each epoch's score is reported. Like for logistic regression, we report the epochs' learning rate (lr), log likelihoods (ll), log priors (lp), the sum of log likelihood and log prior (llp), which is the objective function, and the maximum value of the objective so far (llp*).

The log likelihood is the log probability of the training data given the current parameters. The log prior is the log probability of the current parameters under the prior. The sum of the log likelihood and log prior defines the objective function whose gradient is climbed during estimation (using stochastic gradient descent).

Maximum likelihood coefficients can be computed by setting the prior to be uninformative. Otherwise, the coefficients will balance the prior (which favors coefficients as small as possible) against the likelihood (which favors coefficients as close as possible, as a group, to the maximum likelihood solution).

Training for Convergence

The goal of training is to get the SGD algorithm to converge to a maximum of the sum of the log likelihood and log prior (noted llp in the verbose output). As with logistic regression, first exploration should involve a large number of epochs and a relatively modest learning rate. The learning rate should then be increased as high as possible while maintaining stability of the estimates, and annealed as quickly as possible while maintaining convergence.

Convergence to a high level of accuracy (many decimal places) is usually not necessary. Rarely is there enough training data to support the extra precision anyway. Thus held-out evaluations can be used to determine whether models are close enough.

Running CRF Taggers

It's straightforward to take a compiled CRF tagger and run it on new data. We provide a sample implementation in src/SimplePosTag.java that displays first-best, n-best, and marginal tagging results.

Running the Simple POS Tag Demo

Once a model is trained (using ant simple-pos-train, as described above), it may be used to decode input sequences of symbols. We've provided an ant target that does this:

lingpipe\trunk\demos\tutorial\crf> ant simple-pos-run

FIRST BEST
John/PN ran/IV ./EOS

5 BEST CONDITIONAL
Rank log p(tags|tokens)  Tagging
0    -0.19285257035395986 John/PN ran/IV ./EOS
1    -3.358109848458719 John/N ran/IV ./EOS
2    -3.931239459222259 John/EOS ran/IV ./EOS
3    -4.006593072681378 John/PN ran/EOS ./EOS
4    -4.593808019587663 John/DET ran/IV ./EOS

MARGINAL TAG PROBABILITIES
Token .. Tag log p(tag|pos,tokens)
John
     PN -0.11804521988882044
     IV -4.157694169296432
     EOS -3.7176228157431774
     DET -3.991055779078051
     N -3.2772606247513387
     TV -4.186642002447862
ran
     PN -4.136730600811648
     IV -0.08419970812374089
     EOS -3.4907943870346685
     DET -4.912971648808058
     N -4.424359623821775
     TV -4.201827184540062
.
     PN -4.881890987444048
     IV -5.41613925820661
     EOS -0.029776364673478994
     DET -5.12562023731763
     N -5.028825282359939
     TV -5.334679320548158

BUILD SUCCESSFUL
Total time: 0 seconds

First, there's first-best output, which is guaranteed to be the single most likely tag sequence for the input given the model.

Next, there's n-best output, which returns the top whole-sequence analyses. Here it's run in conditional probability form (which is a bit more expensive than the unnormalized version), which reports natural log conditional probabilities of the tag sequence given the input symbols. For instance, exp(-0.19) = 0.83, so the first-best whole-string analysis, John/PN ran/IV ./EOS, has a probability of 0.83 according to the model.

Finally, we report marginal outputs. For every token in the input, we report the natural log conditional probability of the token receiving each tag, computed relative to the entire input sequence.
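As a sanity check on the numbers above, exponentiating the six log probabilities reported for the token John and summing gives a value very close to 1:

```java
public class MarginalSanityCheck {
    public static void main(String[] args) {
        // log marginal tag probabilities reported for the token "John"
        double[] logProbs = {
            -0.11804521988882044,  // PN
            -4.157694169296432,    // IV
            -3.7176228157431774,   // EOS
            -3.991055779078051,    // DET
            -3.2772606247513387,   // N
            -4.186642002447862     // TV
        };
        double sum = 0.0;
        for (double lp : logProbs)
            sum += Math.exp(lp);   // convert from the natural log scale
        // the marginal tag distribution for a token sums to 1
        System.out.println(Math.abs(sum - 1.0) < 1e-3);
    }
}
```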

Code Walkthrough

The code is in src/SimplePosTag.java. It's all in a single main method declared as:

public class SimplePosTag {

    public static void main(String[] args)
        throws ClassNotFoundException, IOException {
...

Note that the method throws class-not-found and I/O exceptions, both of which may arise in general from deserialization.

Next, we read the model back in from the file using the readObject() method from util.AbstractExternalizable:

...
        File modelFile = new File(args[0]);
        @SuppressWarnings("unchecked")
        ChainCrf<String> crf
            = (ChainCrf<String>)
            AbstractExternalizable.readObject(modelFile);
...

We need to suppress the warning from a cast to stop the compiler from griping, but it's unavoidable given Java's untyped serialization. If someone supplies the wrong file, the cast will break (if not something sooner).

Now that we have the CRF, we run through the command-line arguments and convert each one to a list of symbols using String's split() method to get an input we can analyze:

...
        for (int i = 1; i < args.length; ++i) {
            String arg = args[i];
            List<String> tokens = Arrays.asList(arg.split(" "));
...

Typically, this splitting would be done with a tokenization factory, but we're trying to keep the demo simple.

Next, we can extract the first-best output in one line and print it:

...
            Tagging<String> tagging = crf.tag(tokens);
            System.out.println(tagging);

...

The tag.Tagging implementation also supplies methods to access the input tokens and output tags programmatically.

We access the n-best outputs through an iterator over scored taggings. Scored taggings are just like taggings, only with a score; they implement util.Scored for convenience. We access them in the usual way for an iterator:

...
            int maxNBest = 5;
            Iterator<ScoredTagging<String>> it
                = crf.tagNBestConditional(tokens,maxNBest);
            for (int rank = 0; rank < maxNBest && it.hasNext(); ++rank) {
                ScoredTagging<String> scoredTagging = it.next();
                System.out.println(rank + "    " + scoredTagging);
            }
...

We set the maximum number of n-best results to be 5 and also check it in the loop. The lower the maximum n best, the less memory the iterator will use. We just access the scored tagging and print it.

Finally, we access the marginal tag probabilities through a lattice result. We can access them programmatically as shown in the code:

...
            ForwardBackwardTagLattice<String> fbLattice
                = crf.tagMarginal(tokens);
            for (int n = 0; n < tokens.size(); ++n) {
                System.out.println(tokens.get(n));
                for (int k = 0; k < fbLattice.numTags(); ++k) {
                    String tag = fbLattice.tag(k);
                    double prob = fbLattice.logProbability(n,k);
                    System.out.println("     " + tag + " " + prob);
                }
            }

Note that the tags are accessed by number, with the lattice supplying both the size of the tag set through numTags(), and dereferencing tag identifiers to string symbols using tag(). The log probability of token n receiving tag k is then retrieved using the logProbability() method in the lattice.

CRF Chunkers

In this section, we'll show you how to build a chunker based on CRFs. It's very much like building a tagger, with the additional step of transcoding a chunking problem as a tagging problem.

Transcoding Chunkers as Taggers

Just as with HMMs, we attack chunking problems with CRFs by transcoding them as tagging problems. For CRF chunkers, the transcoding is encapsulated in the encoder/decoder interface chunk.TagChunkCodec.

Encoding Chunkings as Taggings

The codec interface supplies an encoder method, toStringTagging(), that converts a chunking to a tagging implementing tag.StringTagging. This is typically accomplished by means of a tokenizer factory that is consistent with the chunking (i.e., all chunks start and end on token boundaries).

Decoding Taggings to Chunkings

The reverse mapping from a string tagging to a chunking is carried out by the codec method toChunking().

Lattice Decoding

In order to allow efficient n-best chunks to be returned, the codec must also implement a chunk iterator given a tag lattice implementation along with arrays indicating where the tokens start and end.

Encodability Tests

The codecs also implement methods that determine whether a chunking is encodable or a tagging is decodable, whether a sequence of tags is legal, and that return the entire set of tags.

BIO Codec

As of LingPipe 3.9, there is only one tag-chunk codec implementation, chunk.BioTagChunkCodec. This codec uses the standard begin/in/out encoding of chunkings as taggings. It is constructed with a tokenizer factory, which it uses to break up the chunking's text. The tokenizer factory should be consistent with the chunkings in the sense that each chunk starts and ends on a token boundary.

The BIO codec is most easily understood with an example. Suppose we have the chunking:

Char Sequence
John J. Smith lives near Washington.
0123456789012345678901234567890123456
          1         2         3

Chunk Set
(0,13):PER
(25,35):LOC

consisting of a person chunk covering "John J. Smith" and a location chunk covering "Washington". Next, suppose we use the Indo-European tokenizer factory to tokenize the input. The tokens (with start and end [plus 1] character offsets) and the tags produced by the BIO tag/chunk codec are:

Token       Start  End  Tag
John            0    4  B_PER
J               5    6  I_PER
.               6    7  I_PER
Smith           8   13  I_PER
lives          14   19  O
near           20   24  O
Washington     25   35  B_LOC
.              35   36  O

The token/tag sequences are then provided as input to CRF training.
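The encoding in the table above can be sketched as follows (illustrative code, not the LingPipe implementation): tokens whose spans fall inside a chunk get B_X for the first and I_X for the rest; everything else gets O.

```java
import java.util.Arrays;

// Sketch of BIO encoding: given token offsets and chunk spans, assign
// B_X to a chunk's first token, I_X to its remaining tokens, O elsewhere.
// Mirrors the table above; illustrative, not LingPipe's code.
public class BioEncodeSketch {

    public static String[] encode(int[] starts, int[] ends,
                                  int[][] chunks, String[] types) {
        String[] tags = new String[starts.length];
        Arrays.fill(tags, "O");
        for (int c = 0; c < chunks.length; ++c) {
            boolean first = true;
            for (int i = 0; i < starts.length; ++i) {
                if (starts[i] >= chunks[c][0] && ends[i] <= chunks[c][1]) {
                    tags[i] = (first ? "B_" : "I_") + types[c];
                    first = false;
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // Tokens and chunks for "John J. Smith lives near Washington."
        int[] starts = {0, 5, 6, 8, 14, 20, 25, 35};
        int[] ends   = {4, 6, 7, 13, 19, 24, 35, 36};
        int[][] chunks = {{0, 13}, {25, 35}};
        String[] types = {"PER", "LOC"};
        System.out.println(Arrays.toString(encode(starts, ends, chunks, types)));
        // [B_PER, I_PER, I_PER, I_PER, O, O, B_LOC, O]
    }
}
```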

Feature Extraction

Feature extraction for CRF chunkers is the same as that for taggers, requiring an implementation of the chain CRF feature extractor. The tokens are those supplied by the codec's tokenizer and the tags are supplied by the codec coding of chunkings as taggings.

For our simple demo, we'll use exactly the same feature extractor as we did for the tagging example.

Estimating CRF Chunkers

Training Corpus

We implement a hard-coded training corpus as we did for the part-of-speech tagging case. The code is in src/TinyEntityCorpus.java. The declaration shows it requires a chunking handler:

public class TinyEntityCorpus
    extends Corpus<ObjectHandler<Chunking>> {
...

As with our toy POS corpus, it hard-codes an array of chunkings:

    static final Chunking[] CHUNKINGS
        = new Chunking[] {
        chunking(""),
        chunking("The"),
        chunking("John ran.",
                 chunk(0,4,"PER")),
        chunking("Mary ran.",
                 chunk(0,4,"PER")),
        chunking("The kid ran."),
        chunking("John likes Mary.",
                 chunk(0,4,"PER"),
                 chunk(11,15,"PER")),
        chunking("Tim lives in Washington",
                 chunk(0,3,"PER"),
                 chunk(13,23,"LOC")),
        chunking("Mary Smith is in New York City",
                 chunk(0,10,"PER"),
                 chunk(17,30,"LOC")),
        chunking("New York City is fun",
                 chunk(0,13,"LOC")),
        chunking("Chicago is not like Washington",
                 chunk(0,7,"LOC"),
                 chunk(20,30,"LOC"))
    };

Two static factory convenience methods allow the array of chunkings to be relatively compactly defined:

    static Chunking chunking(String s, Chunk... chunks) {
        ChunkingImpl chunking = new ChunkingImpl(s);
        for (Chunk chunk : chunks)
            chunking.add(chunk);
        return chunking;
    }

    static Chunk chunk(int start, int end, String type) {
        return ChunkFactory.createChunk(start,end,type);
    }

Training Code

The code for training a CRF chunker is very much like that for training the POS tagger, consisting of a long sequence of variables used for estimation, followed by a call to the estimation method, followed by a serialization to a specified file. In fact, we use similar parameters for the entity trainer.

The source code for training is in src/SimpleEntityTrain.java.

public class SimpleEntityTrain {

    public static void main(String[] args) throws IOException {
        Corpus<ObjectHandler<Chunking>> corpus
            = new TinyEntityCorpus();

        TokenizerFactory tokenizerFactory
            = IndoEuropeanTokenizerFactory.INSTANCE;
        boolean enforceConsistency = true;
        TagChunkCodec tagChunkCodec
            = new BioTagChunkCodec(tokenizerFactory,
                                   enforceConsistency);
...

Note that we use the new chunking corpus, and create an instance of the BIO tag/chunk codec using an Indo-European tokenizer factory. The flag to enforce consistency will cause training to raise an exception if there is a training chunking that is not consistent with the specified tokenizer factory.

The other arguments are roughly the same as for the POS tagger trainer:

...
        int minFeatureCount = 1;

        boolean cacheFeatures = true;

        boolean addIntercept = true;

        double priorVariance = 4.0;
        boolean uninformativeIntercept = true;
        RegressionPrior prior
            = RegressionPrior.gaussian(priorVariance,
                                       uninformativeIntercept);
        int priorBlockSize = 3;

        double initialLearningRate = 0.05;
        double learningRateDecay = 0.995;
        AnnealingSchedule annealingSchedule
            = AnnealingSchedule.exponential(initialLearningRate,
                                            learningRateDecay);

        double minImprovement = 0.00001;
        int minEpochs = 10;
        int maxEpochs = 5000;

        Reporter reporter
            = Reporters.stdOut().setLevel(LogLevel.DEBUG);
...

The call to the factory method for estimation is also similar, but requires the tag-chunk codec and a tokenizer factory for handling inputs. This should be the same tokenizer factory used to construct the codec, to ensure consistency.

...
        ChainCrfChunker crfChunker
            = ChainCrfChunker.estimate(corpus,
                                       tagChunkCodec,
                                       tokenizerFactory,
                                       featureExtractor,
                                       addIntercept,
                                       minFeatureCount,
                                       cacheFeatures,
                                       prior,
                                       priorBlockSize,
                                       annealingSchedule,
                                       minImprovement,
                                       minEpochs,
                                       maxEpochs,
                                       reporter);
...

The code to serialize the model to a file is the same as before:

...
        File modelFile = new File(args[0]);
        AbstractExternalizable.serializeTo(crfChunker,modelFile);
    }
}

Running the Trainer

There's an ant task that trains the model and writes it to the specified file; run it after changing to the CRF tutorial directory:

lingpipe\trunk\demos\tutorial\crf> ant simple-entity-train

Estimating
      :00 Training chain CRF chunker
      :00 Converting chunk corpus to tag corpus using codec.
      :00 ChainCrf.estimate Parameters
      :00 featureExtractor=SimpleChainCrfFeatureExtractor@d8a7efd
      :00 addInterceptFeature=true
      :00 minFeatureCount=1
      :00 cacheFeatureVectors=true
      :00 allowUnseenTransitions=false
      :00 prior=GaussianRegressionPrior(Variance=4.0, noninformativeIntercept=true)
      :00 annealingSchedule=Exponential(initialLearningRate=0.05, base=0.995)
      :00 minImprovement=1.0E-5
      :00 minEpochs=10
      :00 maxEpochs=5000
      :00 priorBlockSize=3
      :00 Computing corpus tokens and features
      :00 Corpus Statistics
      :00 Num Training Instances=10
      :00 Num Training Tokens=36
      :00 Num Dimensions After Pruning=26
      :00 Tags={0=O, 1=B_PER, 2=B_LOC, 3=I_PER, 4=I_LOC}
      :00 Caching Feature Vectors
      :00 epoch=    0 lr=0.050000000 ll=   -26.7467 lp=  -290.7354 llp=  -317.4821 llp*=  -317.4821
      :00 epoch=    1 lr=0.049750000 ll=   -24.3225 lp=  -290.7784 llp=  -315.1009 llp*=  -315.1009
      :00 epoch=    2 lr=0.049501250 ll=   -22.3622 lp=  -290.8411 llp=  -313.2033 llp*=  -313.2033
      :00 epoch=    3 lr=0.049253744 ll=   -20.6893 lp=  -290.9193 llp=  -311.6085 llp*=  -311.6085
...
...
      :00 epoch=  109 lr=0.028952482 ll=    -4.4427 lp=  -295.3488 llp=  -299.7915 llp*=  -299.7915
      :00 epoch=  110 lr=0.028807720 ll=    -4.4355 lp=  -295.3555 llp=  -299.7911 llp*=  -299.7911
      :00 epoch=  111 lr=0.028663681 ll=    -4.4285 lp=  -295.3621 llp=  -299.7906 llp*=  -299.7906
      :00 Converged with rollingAverageRelativeDiff=9.845569372249269E-6
      :00 Feat Extraction Time=:00
      :00 Forward Backward Time=:00
      :00 Update Time=:00
      :00 Prior Update Time=:00
      :00 Loss Time=:00

Compiling to file=simpleEntity.ChainCrfChunker

BUILD SUCCESSFUL
Total time: 1 second

The main difference here is that we turned the learning rate down a bit and let it train longer. That'll usually produce a better solution (in terms of the objective function, log likelihood plus log prior; the likelihood itself won't always go up, especially with a strong prior).

Also note that in this case, training converged (at epoch 111) rather than running to the maximum number of epochs; estimation stops as soon as the rolling average relative improvement falls below the minimum-improvement threshold. These are all parameters to tune to get the best performance out of the conditional random field estimator.

Running CRF Chunkers

Before we walk through the code to do chunking, here is what the output looks like from the built-in ant task we provide for a demo:

lingpipe\trunk\demos\tutorial\crf> ant simple-entity-run

FIRST BEST
John Smith lives in New York. : [0-10:PER@-Infinity, 20-28:LOC@-Infinity]

10 BEST CONDITIONAL
Rank log p(tags|tokens)  Tagging
0    -1.4770389901734227 John Smith lives in New York. : [0-10:PER@-Infinity, 20-28:LOC@-Infinity]
1    -2.2003980711615068 John Smith lives in New York. : [0-10:PER@-Infinity, 20-29:LOC@-Infinity]
2    -2.5871005572290713 John Smith lives in New York. : [0-10:PER@-Infinity]
3    -2.6727698300919283 John Smith lives in New York. : [0-4:PER@-Infinity, 20-28:LOC@-Infinity]
4    -2.817671646241407 John Smith lives in New York. : [0-10:PER@-Infinity, 20-23:LOC@-Infinity]
5    -3.3114207441453383 John Smith lives in New York. : [0-10:PER@-Infinity, 24-28:PER@-Infinity]
6    -3.3961289110800124 John Smith lives in New York. : [0-4:PER@-Infinity, 20-29:LOC@-Infinity]
7    -3.5368388002299245 John Smith lives in New York. : [0-10:PER@-Infinity, 20-23:PER@-Infinity]
8    -3.782831397147575 John Smith lives in New York. : [0-4:PER@-Infinity]
9    -3.937601248049126 John Smith lives in New York. : [0-10:PER@-Infinity, 17-28:LOC@-Infinity]

MARGINAL CHUNK PROBABILITIES
Rank Chunk Phrase
0 0-10:PER@-0.42743033470330616 John Smith
1 20-28:LOC@-1.0601439103952472 New York
2 0-4:PER@-1.4068707794176127 John
3 20-23:LOC@-2.348813706950642 New
4 24-28:PER@-2.725454312173648 York
5 20-23:PER@-3.0679808609391594 New
6 11-16:PER@-3.488779736670839 lives
7 20-28:PER@-3.5393559022676406 New York
8 24-28:LOC@-3.569822426270683 York
9 17-28:LOC@-3.5881505770573288 in New York

BUILD SUCCESSFUL
Total time: 1 second

We first display the first-best output as a chunking. The spans, such as 0-10, indicate the characters covered, such as "John Smith", and the label after the colon, such as PER, indicates the entity type, such as a person. Note that the first-best analysis recovers both entities even though the training data never contained the phrase "John Smith"; the training data did contain lots of evidence that "John" makes a good single-word person entity, and that near-miss analysis (span 0-4) shows up further down the n-best list.

The second part of the output provides the 10 best analyses of the whole phrase, along with their conditional (natural log) probabilities. In this case, we see the system isn't particularly confident of any of its analyses. For instance, the estimated probability of the first-best analysis being correct is exp(-1.48) = 0.23, which is not much confidence in the first-best analysis.

Finally, we present the n-best chunks. This is different from the n-best whole analyses, though scores are also reported as conditional (natural log) probabilities. Thus the first-best chunk, John Smith as type PER (person), has estimated probability exp(-0.43) = 0.65. We can keep enumerating chunks until we get into very low confidence outputs. Here, we've only gone out as far as considering the phrase in New York as a location, the probability of which is estimated as exp(-3.59) = 0.03.
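Since all of these scores are natural-log probabilities, converting them back to ordinary probabilities is just exponentiation:

```java
// The reported scores are natural-log conditional probabilities;
// exponentiate to recover ordinary probabilities.
public class LogProbDemo {

    public static double prob(double logProb) {
        return Math.exp(logProb);
    }

    public static void main(String[] args) {
        System.out.printf("%.2f%n", prob(-1.4770389901734227)); // first-best analysis: 0.23
        System.out.printf("%.2f%n", prob(-0.42743033470330616)); // first-best chunk: 0.65
    }
}
```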

CRF Chunking Code Walk Through

Now that we have a CRF chunker, we can use it to produce first-best, n-best, or individual chunks in order of likelihood. These calls are also very similar to the part-of-speech tagging example.

The code is in src/SimpleEntityChunk.java. Like the POS tagging code, it starts by deserializing the trained chunker and casting it to the appropriate type:

public class SimpleEntityChunk {
    public static void main(String[] args)
        throws ClassNotFoundException, IOException {

        File modelFile = new File(args[0]);
        @SuppressWarnings("unchecked")
        ChainCrfChunker crfChunker
            = (ChainCrfChunker)
            AbstractExternalizable.readObject(modelFile);
...

As with the POS tagger, we iterate over the remaining arguments (the first is the model file), analyzing each one.

...
        for (int i = 1; i < args.length; ++i) {
            String arg = args[i];
            char[] cs = arg.toCharArray();
...

Here, we'll need the array of characters to deal with the chunking interfaces.

For first-best output, we just apply the chunking method and print the result:

...
            Chunking chunking = crfChunker.chunk(arg);
            System.out.println(chunking);
...

This is the same interface method that is implemented by the non-CRF chunkers, such as the HMM chunker, rescoring chunker, and token-shape chunker.

For n-best output, we need to iterate over scored chunkings, represented using LingPipe's utility class util.ScoredObject, which wraps an object and provides a score.

...
            int maxNBest = 10;
            Iterator<ScoredObject<Chunking>> it
                = crfChunker.nBestConditional(cs,0,cs.length,maxNBest);
            for (int rank = 0; rank < maxNBest && it.hasNext(); ++rank) {
                ScoredObject<Chunking> scoredChunking = it.next();
                System.out.println(rank
                                   + "    " + scoredChunking.score()
                                   + " " + scoredChunking.getObject());
            }
...

Note the use of the scored object methods score() and getObject() to return the score for the chunking and the chunking itself.

As with the POS example, using the nBestConditional() method of the CRF chunker normalizes the scores to (natural) log conditional probabilities of chunking given the input. In principle, if all possibilities are enumerated, the sum of their (naturally exponentiated) scores would be 1.0. In practice, the combinatorics usually prevent such an enumeration; instead, the forward-backward algorithm is used to compute the sum for normalization. As with the tagging CRFs, it's slightly more computationally expensive to compute the conditional probabilities because the normalizing factor needs to be computed.
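To make the normalization concrete, here is a sketch of a forward pass computing the log normalizer over all K**N tag sequences in O(N*K*K) time. The score array is an assumed unnormalized log potential for each (position, previous tag, tag) triple, not LingPipe's internal representation:

```java
// Sketch of computing the log normalizer Z(x) with the forward algorithm.
// score[n][j][k] is an assumed unnormalized log potential for tag k at
// position n given previous tag j (position 0 uses a dummy previous tag 0).
// Illustrative code, not LingPipe's implementation.
public class ForwardNormalizer {

    public static double logZ(double[][][] score) {
        int numToks = score.length;
        int numTags = score[0][0].length;
        double[] alpha = new double[numTags];
        for (int k = 0; k < numTags; ++k)
            alpha[k] = score[0][0][k];
        for (int n = 1; n < numToks; ++n) {
            double[] next = new double[numTags];
            for (int k = 0; k < numTags; ++k) {
                double[] terms = new double[numTags];
                for (int j = 0; j < numTags; ++j)
                    terms[j] = alpha[j] + score[n][j][k];
                next[k] = logSumExp(terms);
            }
            alpha = next;
        }
        return logSumExp(alpha);
    }

    // numerically stable log of a sum of exponentiated log values
    static double logSumExp(double[] xs) {
        double max = Double.NEGATIVE_INFINITY;
        for (double x : xs)
            max = Math.max(max, x);
        double sum = 0.0;
        for (double x : xs)
            sum += Math.exp(x - max);
        return max + Math.log(sum);
    }
}
```

As a sanity check, with all potentials zero, every one of the K**N sequences has weight 1, so the log normalizer is N * log K.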

The generation of chunks is a bit different from POS tagging. Rather than accessing the tag lattice and computing individual tag marginal probabilities, we get an iterator over chunks, the scores of which are the marginal probabilities of the chunk given the input.

...
            int maxNBestChunks = 10;
            Iterator<Chunk> nBestChunkIt
                = crfChunker.nBestChunks(cs,0,cs.length,maxNBestChunks);
            for (int n = 0; n < maxNBestChunks && nBestChunkIt.hasNext(); ++n) {
                Chunk chunk = nBestChunkIt.next();
                System.out.println(n
                                   + " " + chunk
                                   + " " + arg.substring(chunk.start(),chunk.end()));
            }
        }
    }
}

Note that we use the start and end of the chunk to pull the phrase corresponding to the chunk out of the original input.

Richer Feature Extractor

In this section, we'll show how to create a realistic, though not quite state-of-the-art, set of features for CRFs. In the next section, we evaluate it on the CoNLL 2003 English named entity mention detection task.

The features will include normalized tokens, part-of-speech tags, word-shape features, position features, and token prefixes and suffixes, all in a window of 1 token around the current token. More sophisticated feature extractors might include a larger window, more normalization, dictionary-based features, and cluster-based features.

The Code

The code for the feature extractor may be found in src/ChunkerFeatureExtractor.java. In the rest of this section, we'll walk through it.

The code begins with the requisite feature extractor definition:

public class ChunkerFeatureExtractor
    implements ChainCrfFeatureExtractor<String>,
               Serializable {

    static final long serialVersionUID = 123L;
...

Note that it is also declared to implement java.io.Serializable, and thus has a static serial version ID declared to stop the compiler from griping (the particular value doesn't really matter for a first version with no backward-compatibility requirements).

HMM for Part-of-Speech Tagging

We've also hard-coded a path to an HMM implementing part-of-speech tagging:

...
    static final File POS_HMM_FILE
        = new File("../../models/pos-en-general-brown.HiddenMarkovModel");
...

This HMM is included in the $LINGPIPE/demos/models directory, and details on how it was built and other ways in which it can be used may be found in the part-of-speech tagging tutorial.

There is a local variable to hold the tagger, and a constructor which reads in the model and constructs an HMM part-of-speech tagger with a cache.

...
    private final Tagger<String> mPosTagger;

    public ChunkerFeatureExtractor()
        throws ClassNotFoundException, IOException {

        @SuppressWarnings("unchecked") 
        HiddenMarkovModel posHmm
            = (HiddenMarkovModel)
            AbstractExternalizable
            .readObject(POS_HMM_FILE);

        FastCache<String,double[]> emissionCache
            = new FastCache<String,double[]>(100000);
        mPosTagger = new HmmDecoder(posHmm,null,emissionCache);
    }
...

We are able to declare the tagger final because we have built a custom externalizer for our feature extractor, which we'll discuss later. Reconstituting the HMM from a file requires an unchecked cast; that's just part of serialization. We then build a cache and use it to construct a decoder that caches log emission likelihoods, which is all we need for first-best decoding; hence the linear-probability cache argument is null. Also see the part-of-speech tagging tutorial for detailed information on how to configure HMMs for runtime use, including caching.

The exceptions may be thrown by the read object utility method in util.AbstractExternalizable.

Feature Extraction Method

The extraction method itself just returns a new features object created from the tags and tokens.

    public ChainCrfFeatures<String> extract(List<String> tokens,
                                            List<String> tags) {
        return new ChunkerFeatures(tokens,tags);
    }

Chain CRF Features Object Declaration

The ChunkerFeatures class is defined to extend a string-based CRF features object:

    class ChunkerFeatures extends ChainCrfFeatures<String> {
    ...

It is defined as a (non-static) inner class so that it can maintain access to things like the part-of-speech tagger found in the containing class, ChunkerFeatureExtractor.

Computing Part-of-Speech Tags

The part-of-speech tags are computed in the constructor and saved in a private final tagging object:

...
        private final Tagging<String> mPosTagging;

        public ChunkerFeatures(List<String> tokens,
                               List<String> tags) {
            super(tokens,tags);
            mPosTagging = mPosTagger.tag(tokens);
        }
...

The call to super(tokens,tags) is required because it's the only constructor for the superclass, ChainCrfFeatures<String>. The superclass saves the tokens and tags objects and makes them available to the subclass through public methods.

The tokens are the actual tokens for this input. The tags, on the other hand, are the full set of tags used by the CRF, in symbol table order.

The constructor then creates the part-of-speech tagging for the tokens using the tagger mPosTagger from its containing class. The computation is put in the constructor so that the results may be reused for the set of nodes and edges for which feature maps are generated.

Edge Features

We'll start with the edge feature extraction method, because it's simpler:

        public Map<String,? extends Number> edgeFeatures(int n, int k) {
            ObjectToDoubleMap<String> feats
                = new ObjectToDoubleMap<String>();
            feats.set("PREV_TAG_" + tag(k),
                      1.0);
            feats.set("PREV_TAG_TOKEN_CAT_"  + tag(k)
                      + "_" + tokenCat(n-1),
                      1.0);
            return feats;
        }

This is the method that's called to generate edge features for token position n in the situation where the previous tag has identifier k. The features are put into an instance of LingPipe's util.ObjectToDoubleMap, which saves a type declaration for the result.

There are two features that we return here. The first is the previous tag, which we provide with the prefix PREV_TAG_. The name of the previous tag is returned by tag(k); the method is implemented by the superclass, which has access to the set of stored tags. We set the value of this, and all other features, to 1.0. This is not a requirement of our CRF implementation, though it's common to find systems that support only binary (0 or 1) feature values.

The second feature is a "shape" feature for the token corresponding to the previous tag (which has position n-1, because the current tag has position n). The shape feature is computed with an instance of tokenizer.IndoEuropeanTokenCategorizer. The tokenCat() method that retrieves it is implemented as:

...
        public String tokenCat(int n) {
            return IndoEuropeanTokenCategorizer
                   .CATEGORIZER
                   .categorize(token(n));
        }
...

Normalized Token Features

As is usual with CRFs built for chunking, the node features, based only on the input tokens, are much richer than the edge features. They're based on the same convenience object-to-double mapping:

...
        public Map<String,? extends Number> nodeFeatures(int n) {
            ObjectToDoubleMap<String> feats
                = new ObjectToDoubleMap<String>();
...

The first thing the method does is compute a bunch of values it will reuse, starting with flags indicating whether the current token is the beginning of a sentence or an end of sentence:

...
            boolean bos = n == 0;
            boolean eos = (n + 1) >= numTokens();
...

Next, we compute the token categories, tokens, and part-of-speech tags for the current position, previous position, and next position of the input:

...
            String tokenCat = tokenCat(n);
            String prevTokenCat = bos ? null : tokenCat(n-1);
            String nextTokenCat = eos ? null : tokenCat(n+1);

            String token = normedToken(n);
            String prevToken = bos ? null : normedToken(n-1);
            String nextToken = eos ? null : normedToken(n+1);

            String posTag = mPosTagging.tag(n);
            String prevPosTag = bos ? null : mPosTagging.tag(n-1);
            String nextPosTag = eos ? null : mPosTagging.tag(n+1);
...

The previous and next methods check if we're at the begin or end of the sentence and return null accordingly. The part-of-speech tagging is taken from the saved part-of-speech taggings computed in the constructor.

The token methods provide some light normalization of tokens to compress all numbers to the same kind of value. That method is:

        public String normedToken(int n) {
            return token(n).replaceAll("\\d+","*$0*").replaceAll("\\d","D");
        }

This just takes every sequence of numbers and replaces it with *D...D*. For instance, 12/3/08 is converted to *DD*/*D*/*DD*.
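The normalization can be checked with a standalone version of the method (the class name here is just for the demo):

```java
// Standalone demo of the token normalization: each maximal digit run is
// first wrapped in asterisks, then every digit is collapsed to D.
public class NormTokenDemo {

    public static String norm(String token) {
        return token.replaceAll("\\d+", "*$0*").replaceAll("\\d", "D");
    }

    public static void main(String[] args) {
        System.out.println(norm("12/3/08")); // *DD*/*D*/*DD*
        System.out.println(norm("Smith"));   // unchanged: Smith
    }
}
```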

We then set feature values for the preceding, current, and following tokens. First, a flag indicating whether the position is the beginning or end of the sentence or an internal node:

...
            if (bos)
                feats.set("BOS",1.0);
            if (eos)
                feats.set("EOS",1.0);
            if (!bos && !eos)
                feats.set("!BOS!EOS",1.0);
...

Next, we include the tokens, token categories, and their parts of speech:

...
            feats.set("TOK_" + token, 1.0);
            if (!bos)
                feats.set("TOK_PREV_" + prevToken,1.0);
            if (!eos)
                feats.set("TOK_NEXT_" + nextToken,1.0);

            feats.set("TOK_CAT_" + tokenCat, 1.0);
            if (!bos)
                feats.set("TOK_CAT_PREV_" + prevTokenCat, 1.0);
            if (!eos)
                feats.set("TOK_CAT_NEXT_" + nextTokenCat, 1.0);

            feats.set("POS_" + posTag,1.0);
            if (!bos)
                feats.set("POS_PREV_" + prevPosTag,1.0);
            if (!eos)
                feats.set("POS_NEXT_" + nextPosTag,1.0);
...

Finally, we add prefix and suffix features, which add features for each suffix and prefix (up to a pre-specified length):

...
            for (String suffix : suffixes(token))
                feats.set("SUFF_" + suffix,1.0);
            if (!bos)
                for (String suffix : suffixes(prevToken))
                    feats.set("SUFF_PREV_" + suffix,1.0);
            if (!eos)
                for (String suffix : suffixes(nextToken))
                    feats.set("SUFF_NEXT_" + suffix,1.0);

            for (String prefix : prefixes(token))
                feats.set("PREF_" + prefix,1.0);
            if (!bos)
                for (String prefix : prefixes(prevToken))
                    feats.set("PREF_PREV_" + prefix,1.0);
            if (!eos)
                for (String prefix : prefixes(nextToken))
                    feats.set("PREF_NEXT_" + prefix,1.0);
...

After this, we just return the feature mapping generated.

The prefix function is simply implemented with a list:

    static int MAX_PREFIX_LENGTH = 4;
    static List<String> prefixes(String s) {
        int numPrefixes = Math.min(MAX_PREFIX_LENGTH,s.length());
        if (numPrefixes == 0)
            return Collections.emptyList();
        if (numPrefixes == 1)
            return Collections.singletonList(s);
        List<String> result = new ArrayList<String>(numPrefixes);
        for (int i = 1; i <= numPrefixes; ++i)
            result.add(s.substring(0,i));
        return result;
    }

It'd be faster to unfold all this code to avoid the object creation, but profiling doesn't show it to be a particular bottleneck in the overall estimation process (it may be different at runtime, where you might want a faster version for increased throughput). Suffixes are generated similarly.
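The tutorial doesn't show the suffix code, but it presumably mirrors the prefix code above. Here is one plausible sketch, with an assumed MAX_SUFFIX_LENGTH of 4 (hypothetical; the actual implementation may differ):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical suffix analogue of prefixes(): all suffixes of length
// 1 up to MAX_SUFFIX_LENGTH, shortest first.
public class SuffixSketch {

    static int MAX_SUFFIX_LENGTH = 4;

    static List<String> suffixes(String s) {
        int numSuffixes = Math.min(MAX_SUFFIX_LENGTH, s.length());
        if (numSuffixes == 0)
            return Collections.emptyList();
        List<String> result = new ArrayList<String>(numSuffixes);
        for (int i = 1; i <= numSuffixes; ++i)
            result.add(s.substring(s.length() - i));
        return result;
    }
}
```

For instance, suffixes("Smith") would yield h, th, ith, and mith.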

Custom Serialization

Feature extractors need to be serializable if they are part of a CRF that is going to be serialized, which means they should be serializable for all practical purposes.

To avoid having the part-of-speech tagger serialized along with the feature extractor (which won't work, because the compiled tagger isn't itself serializable [perhaps it should be!]), we use a serialization proxy. The serialization code involves LingPipe's util.AbstractExternalizable abstract base class for serialization proxies:

    static class Externalizer extends AbstractExternalizable {
        static final long serialVersionUID = 4321L;
        private final ChunkerFeatureExtractor mExtractor;
        public Externalizer() {
            this(null);
        }
        public Externalizer(ChunkerFeatureExtractor extractor) {
            mExtractor = extractor;
        }
        public Object read(ObjectInput in)
            throws IOException, ClassNotFoundException {
            return new ChunkerFeatureExtractor();
        }
        public void writeExternal(ObjectOutput out)
            throws IOException {
            /* no op */
        }
    }

This is the standard way to serialize an immutable object. Note that nothing is written as part of the serializer (other than the serial version UID and fully qualified class name, which are always written out as part of serialization). The read method just creates a new feature extractor from scratch.

With more elaborate feature extractors, more may need to be serialized.

Running CoNLL 2003 English Named Entities

In this section, we consider data drawn from the 2003 shared task of the Conference on Natural Language Learning (CoNLL). The task and data are available from:

Unfortunately, you only get the annotations from that site, and need the big Reuters data collection, sometimes called RCV1. Then you need to run a unix script to combine the RCV1 data with the tag file in order to generate a (semi-)standard CoNLL representation.

The Data Format

Once everything's put together, the data looks like this:

-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O
to TO I-VP O
boycott VB I-VP O
British JJ I-NP I-MISC
lamb NN I-NP O
. . O O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER

BRUSSELS NNP I-NP I-LOC
1996-08-22 CD I-NP O

The DT I-NP O
European NNP I-NP I-ORG

The data contains four columns, consisting of a token, a part-of-speech tag, a phrase-chunking tag, and an entity tag. Unlike previous CoNLL data sets, there are no begin tags (e.g., B-LOC), only in tags (e.g., I-LOC).

Corpus Implementation

The corpus implementation may be found in src/Conll2003EnglishNeCorpus.java.

It's defined to directly extend Corpus and to hang onto its data directory:

public class Conll2003EnglishNeCorpus
    extends Corpus<ObjectHandler<Chunking>> {

    private final File mConllDataDir;

    public Conll2003EnglishNeCorpus(File conllMungedDataDir)
        throws IOException {

        mConllDataDir = conllMungedDataDir;
    }

The data is distributed as three files in the munged data directory with the following file names, defined as constants:

    static final String TRAIN_FILE_NAME = "eng.train";
    static final String DEV_FILE_NAME = "eng.testa";
    static final String TEST_FILE_NAME = "eng.testb";

The train and test visitor methods are implemented by calling a helper method with the file name. Note that visitTrain() visits both the official training file and the development file; the held-out test file is reserved for visitTest():

    public void visitTrain(ObjectHandler<Chunking> handler)
        throws IOException {

        visit(TRAIN_FILE_NAME,handler);
        visit(DEV_FILE_NAME,handler);
    }

    public void visitTest(ObjectHandler<Chunking> handler)
        throws IOException {

        visit(TEST_FILE_NAME,handler);
    }

The helper method is:

    private void visit(String fileName, 
                       final ObjectHandler<Chunking> handler)
        throws IOException {

        TagChunkCodec codec
            = new IoTagChunkCodec(); 

        ObjectHandler<Tagging<String>> tagHandler
            = TagChunkCodecAdapters
            .chunkingToTagging(codec,handler);

        Parser<ObjectHandler<Tagging<String>>> parser
            = new LineTaggingParser(TOKEN_TAG_LINE_REGEX,
                                    TOKEN_GROUP, TAG_GROUP,
                                    IGNORE_LINE_REGEX,
                                    EOS_REGEX);
        parser.setHandler(tagHandler);
        File file = new File(mConllDataDir,fileName);
        parser.parse(file);
    }

The method first constructs an IO-encoding codec using the nullary constructor of chunk.IoTagChunkCodec. Because no tokenizer factory is specified, this codec may only be used to convert string taggings to chunkings, not the other way around. But that is all we need to convert a chunking handler to a tagging handler. This is done with the static helper method TagChunkCodecAdapters.chunkingToTagging(), which takes a chunking handler and returns a tagging handler. Internally, the codec is used to decode the string taggings into chunkings.

The parser itself is defined in terms of three constants that define a regular expression to match a training line with an indication of which matching group is the token and which is the tag. The other two arguments are regular expressions for lines that are to be ignored and lines that constitute end-of-sentence markers. The constants used for the parser are:

    // token posTag chunkTag entityTag
    static final String TOKEN_TAG_LINE_REGEX
        = "(\\S+)\\s\\S+\\s\\S+\\s(O|[BI]-\\S+)"; 

    static final int TOKEN_GROUP = 1; 
    static final int TAG_GROUP = 2;   

    // lines that start with "-DOCSTART"
    static final String IGNORE_LINE_REGEX
        = "-DOCSTART(.*)";  

    // blank lines end sentences
    static final String EOS_REGEX
        = "\\A\\Z"; 

The regex matches a line with four whitespace-separated fields, with the first pair of parentheses picking out group 1 for the token and the second pair of parentheses picking out group 2 for the entity tag.
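The matching behavior can be checked standalone (the class here is just for the demo; the character class is written [BI], matching a literal B or I):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Demo of the CoNLL line regex: group 1 is the token, group 2 the
// entity tag; the POS and phrase-chunk columns are matched but skipped.
public class ConllLineDemo {

    static final Pattern LINE
        = Pattern.compile("(\\S+)\\s\\S+\\s\\S+\\s(O|[BI]-\\S+)");

    public static String[] tokenAndTag(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches()
            ? new String[] { m.group(1), m.group(2) }
            : null;
    }

    public static void main(String[] args) {
        String[] tt = tokenAndTag("EU NNP I-NP I-ORG");
        System.out.println(tt[0] + " / " + tt[1]); // EU / I-ORG
    }
}
```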

Estimating the Model

The model coefficients are estimated just like the simpler models. The code is in the main method of src/Conll2003EnglishNeEval.java.

The main difference is that we take a bunch of arguments that are converted to training parameters. These are:

public class Conll2003EnglishNeEval {

    public static void main(String[] args) 
        throws IOException, ClassNotFoundException {

        File conllMungedDataDir = new File(args[0]);
        int minFeatureCount = Integer.parseInt(args[1]);
        boolean addIntercept = Boolean.parseBoolean(args[2]);
        boolean cacheFeatures = Boolean.parseBoolean(args[3]);
        double priorVariance = Double.parseDouble(args[4]);
        int priorBlockSize = Integer.parseInt(args[5]);
        double initialLearningRate = Double.parseDouble(args[6]);
        double learningRateDecay = Double.parseDouble(args[7]);
        double minImprovement = Double.parseDouble(args[8]);
        int maxEpochs = Integer.parseInt(args[9]);
...

The rest of the code proceeds exactly as for the toy examples, only now using the CoNLL corpus we just defined: it sets up all the necessary resources and calls the static ChainCrfChunker.estimate() method.

        Conll2003EnglishNeCorpus corpus
            = new Conll2003EnglishNeCorpus(conllMungedDataDir);

        TokenizerFactory tokenizerFactory
            = IndoEuropeanTokenizerFactory.INSTANCE;

        boolean enforceConsistency = true;
        TagChunkCodec tagChunkCodec
            = new BioTagChunkCodec(tokenizerFactory,
                                   enforceConsistency);

        ChainCrfFeatureExtractor<String> featureExtractor
            = new ChunkerFeatureExtractor();

        boolean uninformativeIntercept = addIntercept;
        RegressionPrior prior
            = RegressionPrior.laplace(priorVariance,
                                      uninformativeIntercept);

        AnnealingSchedule annealingSchedule
            = AnnealingSchedule.exponential(initialLearningRate,
                                            learningRateDecay);

        Reporter reporter
            = Reporters.stdOut().setLevel(LogLevel.DEBUG);

        int minEpochs = 1;
        
        ChainCrfChunker crfChunker
            = ChainCrfChunker.estimate(corpus,
                                       tagChunkCodec,
                                       tokenizerFactory,
                                       featureExtractor,
                                       addIntercept,
                                       minFeatureCount,
                                       cacheFeatures,
                                       prior,
                                       priorBlockSize,
                                       annealingSchedule,
                                       minImprovement,
                                       minEpochs,
                                       maxEpochs,
                                       reporter);
...

Evaluation

The rest of the main method performs evaluation:

...
        System.out.println("compiling");
        @SuppressWarnings("unchecked") // required for serialized compile
            ChainCrfChunker compiledCrfChunker
            = (ChainCrfChunker)
            AbstractExternalizable.serializeDeserialize(crfChunker);
        System.out.println("     compiled");

        System.out.println("\nEvaluating");
        ChunkerEvaluator evaluator
            = new ChunkerEvaluator(compiledCrfChunker);

        corpus.visitTest(evaluator);
        System.out.println("\nEvaluation");
        System.out.println(evaluator);
        
    }
    
}

First, the model is compiled. Next, an evaluator is created from the compiled model, and the corpus is used to walk the evaluator over the test cases. Finally, the default evaluation is printed.

Running the Evaluation

There is an Ant task from which the evaluation may be run. Each of the parameters is defined as a variable in the Ant build.xml file. Particular parameters may be overridden as usual for Ant by defining them on the command line as system properties, as in this invocation:

lingpipe\trunk\demos\tutorial\crf> ant -Dconll2003.ev.initialLearningRate=0.05 conll-ev

Here we've set the initial learning rate, controlled in the build file by the property conll2003.ev.initialLearningRate, to 0.05. Typically, several values will be tried and the one with the best performance kept. Also, we typically use annealing, but here we just use a constant learning rate.

Report on the Parameters

The results of the run are printed because the standard-output reporter was set to the debug level. First, we have a report of the training parameters.

      :00 Training chain CRF chunker
      :00 Converting chunk corpus to tag corpus using codec.
      :00 ChainCrf.estimate Parameters
      :00 featureExtractor=ChunkerFeatureExtractor@43256ea2
      :00 addInterceptFeature=true
      :00 minFeatureCount=2
      :00 cacheFeatureVectors=true
      :00 allowUnseenTransitions=false
      :00 prior=LaplaceRegressionPrior(Variance=10000.0, noninformativeIntercept=true)
      :00 annealingSchedule=Exponential(initialLearningRate=0.05, base=1.0)
      :00 minImprovement=1.0E-4
      :00 minEpochs=1
      :00 maxEpochs=50
      :00 priorBlockSize=100
...

As usual with reporters, the time since the reporter's creation is listed first, in HH:MM:SS format, that is, down to the second level.

Report on the Corpus

Next, a report of the corpus statistics, including the number of dimensions after pruning (note that we required every feature to appear at least twice).

...
       :00 Computing corpus tokens and features
       :24 Corpus Statistics
       :24 Num Training Instances=18453
       :24 Num Training Tokens=278100
       :24 Num Dimensions After Pruning=108196
       :24 Tags={0=B_ORG, 1=O, 2=B_MISC, 3=B_PER, 4=I_PER, 5=B_LOC, 6=I_ORG, 7=I_MISC, 8=I_LOC}
...

Because we chose to cache feature vectors, we also see:

...
      :24 Caching Feature Vectors
...

This drives memory use close to 2GB for training with this feature extractor (with 64-bit Java). This can be reduced to around 350MB at the expense of a large speed hit per epoch, because the features get extracted twice, once for updates, and once for computing the log likelihoods for convergence.

There are 18,453 sentences in the training set, and 278,100 tokens. There are 108,196 dimensions in the coefficient vectors, one for each feature. Finally, the set of 9 tags is listed, which includes types for persons, organizations, locations, and other (miscellaneous) entities.

Epoch-by-Epoch Estimation

Next, we get the epoch-by-epoch report on log likelihood, log prior, their sum, and the maximum of their sum:

...
     1:00 epoch=    0 lr=0.050000000 ll=-22947.2446 lp=-6956506.0166 llp=-6979453.2612 llp*=-6979453.2612
     1:12 epoch=    1 lr=0.050000000 ll=-14985.2173 lp=-6956565.0275 llp=-6971550.2448 llp*=-6971550.2448
     1:24 epoch=    2 lr=0.050000000 ll=-11354.4609 lp=-6956604.4636 llp=-6967958.9246 llp*=-6967958.9246
     1:35 epoch=    3 lr=0.050000000 ll= -9744.2856 lp=-6956633.7058 llp=-6966377.9914 llp*=-6966377.9914
     1:47 epoch=    4 lr=0.050000000 ll= -9404.9320 lp=-6956656.7684 llp=-6966061.7004 llp*=-6966061.7004
     1:58 epoch=    5 lr=0.050000000 ll= -7370.6916 lp=-6956675.6223 llp=-6964046.3139 llp*=-6964046.3139
...
    10:05 epoch=   47 lr=0.050000000 ll= -1231.0988 lp=-6956855.6274 llp=-6958086.7262 llp*=-6958086.7262
    10:17 epoch=   48 lr=0.050000000 ll= -1214.1489 lp=-6956856.8518 llp=-6958071.0007 llp*=-6958071.0007
    10:28 epoch=   49 lr=0.050000000 ll= -1197.9094 lp=-6956858.0347 llp=-6958055.9441 llp*=-6958055.9441
...

We can see from the log likelihood progression (here denoted ll) that we haven't converged on the likelihood front to even two digits of accuracy. The overall objective, the sum of the log likelihood and log prior, here denoted llp, has converged to at least a few digits of accuracy.

Timing Report

After the last epoch, we get statistics on how the time was spent during estimation:

...
    10:28 Feat Extraction Time=:00
    10:28 Forward Backward Time=2:25
    10:28 Update Time=3:40
    10:28 Prior Update Time=1:08
    10:28 Loss Time=2:27
...

The first report is time spent on feature extraction. Here that's no time at all, because we cached the features. If caching is turned off for features, the feature extraction time will dominate with a complex feature extractor like the one considered here for CoNLL.

Next, we see the time spent by the decoder computing the full forward-backward lattice for each training example in each epoch. That's roughly 278K tokens times 50 epochs, or the equivalent of decoding about 14M tokens in under two and a half minutes, or roughly 95K tokens/second. First-best decoding is considerably faster. But keep in mind that this is only the decoding time; the feature extraction time is greater. Including feature extraction time, overall throughput is closer to 5K tokens/second for new sentences.

The prior update time is a little over a minute. This time is inversely proportional to the prior block size. If an uninformative prior is used to produce a maximum likelihood estimate, this time will be zero.

The last value is the loss computation time. This is essentially doing the same work as forward-backward with some additional summations. It walks over the entire corpus, tags each instance, and computes its log likelihood, which requires the full forward-backward algorithm to compute the normalizing term.

First-Best Results

After compilation, we see the first-best evaluation results, which are the results of only looking at the first-best whole sequence predicted by the trained CRF on the test data. The report starts with the raw contingency table counts:

...
compiling
     compiled

Evaluating

Evaluation
FIRST-BEST EVAL
  Total=6406
  True Positive=4409
  False Negative=994
  False Positive=1003
  True Negative=0
  Positive Reference=5403
  Positive Response=5412
  Negative Reference=1003
  Negative Response=994
...

As usual, true positives are chunks produced by the CRF that match the reference chunking. Specifically, the system found 4409 correct named-entity mentions in the test data.

False negatives are chunks in the reference chunking missed by the CRF; there were 994 chunks in the reference chunking that were missed by the CRF.

False positives are chunks returned by the CRF that are not in the reference chunking; there were 1003 of these. True negatives are not evaluated by chunkers.

We also report the number of positive and negative reference and response items, which are the sums typically displayed at the side of contingency tables.

Next, we see the basic results calculated from these contingency tables, including accuracy (not a very reliable number), precision (TP/[TP+FP]), recall (TP/[TP+FN]), rejection recall and rejection precision (both of which involve true negative counts). And finally, F(1), which is the balanced harmonic mean of precision and recall.

...
  Accuracy=0.6882610053075242
  Recall=0.816028132518971
  Precision=0.8146711012564671
  Rejection Recall=0.0
  Rejection Precision=0.0
  F(1)=0.8153490522422563
...
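These figures can be reproduced directly from the contingency counts reported above. A minimal stand-alone sketch:

```java
public class FirstBestMetrics {
    public static void main(String[] args) {
        // Contingency counts from the first-best eval above.
        double tp = 4409.0, fn = 994.0, fp = 1003.0;
        double precision = tp / (tp + fp);       // TP/[TP+FP]
        double recall    = tp / (tp + fn);       // TP/[TP+FN]
        double f1        = 2.0 * precision * recall / (precision + recall);
        double accuracy  = tp / (tp + fn + fp);  // no true negatives for chunkers
        System.out.printf("P=%.4f R=%.4f F1=%.4f Acc=%.4f%n",
                          precision, recall, f1, accuracy);
        // prints P=0.8147 R=0.8160 F1=0.8153 Acc=0.6883
    }
}
```

Note that with the true negative count pinned to zero, accuracy reduces to TP over the total, which is why it is the same value reported as the Jaccard coefficient below.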

Next are a number of less common statistics calculated from the contingency table, not all of which make sense without true negative counts. See the chunker evaluation doc for more detail.

...
  Fowlkes-Mallows=5407.49812760023
  Jaccard Coefficient=0.6882610053075242
  Yule's Q=-1.0
  Yule's Y=-1.0
  Reference Likelihood=0.8434280362160474
  Response Likelihood=0.8448329690914768
  Random Accuracy=0.7368506187952697
  Random Accuracy Unbiased=0.736851605713462
  kappa=-0.18464650483043604
  kappa Unbiased=-0.18465094775774404
  kappa No Prevalence=0.3765220106150484
  chi Squared=218.41451486192213
  phi Squared=0.03409530360005029
  Accuracy Deviation=0.005787335774511212
...

After the first-best eval, we have the n-best eval, which evaluates the n-best chunker. The results are presented as the count of test sequences that had a given n-best rank.

...
N-BEST EVAL (rank=count)
0=2732
1=281
2=165
-1=97
3=74
4=47
5=37
7=22
10=21
6=20
9=20
8=15
12=13
11=9
16=8
13=7
19=7
15=6
17=5
18=5
22=5
26=5
27=5
20=4
21=4
29=4
42=4
49=4
51=4
55=4
25=3
30=3
34=3
39=3
14=2
23=2
28=2
31=2
33=2
35=2
38=2
40=2
43=2
45=2
50=2
61=2
32=1
36=1
37=1
41=1
44=1
47=1
48=1
53=1
54=1
56=1
57=1
59=1
62=1
63=1
...

For example, 0=2732 indicates that 2732 of the test sequences had a correct first-best analysis by the chunker; 17=5 indicates that the 17th best result from the CRF chunker was correct for 5 sequences. Finally, -1=97 indicates that for 97 test sequences, the correct result was not produced by the n-best chunker. The default is to only evaluate the 64 best chunkings; this could be set higher at the cost of more time to enumerate the n-best outcomes.

Finally, we have a per-chunk confidence evaluation, reported with the usual information retrieval metrics for precision-recall curves. This is a ranked evaluation where all the mentions returned by the CRF are placed in rank order by score to allow precision, recall and other values to be calculated per returned answer.

...
CONFIDENCE EVALUATION  Area Under PR Curve (interpolated)=0.8618478546835892
  Area Under PR Curve (uninterpolated)=0.8615241699138296
  Area Under ROC Curve (interpolated)=0.9710497025169711
  Area Under ROC Curve (uninterpolated)=0.9710497025169751
  Average Precision=0.8826484588010134
  Maximum F(1) Measure=0.8116515837104072
  BEP (Precision-Recall break even point)=0.8030330062444246
  Reciprocal Rank=1.0
...

The full (interpolated or uninterpolated) precision-recall or ROC curves may also be retrieved from the evaluation object, but they are not printed as part of the default toString() output.
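LingPipe computes these curve statistics internally; as a rough stand-alone illustration (not LingPipe's implementation), average precision over a ranked list of relevance judgments can be sketched as:

```java
public class AveragePrecisionSketch {
    // Average precision over a ranked list of relevance judgments:
    // at each rank holding a relevant item, take the precision so far,
    // then average over the total number of relevant items.
    static double averagePrecision(boolean[] ranked, int numRelevant) {
        double sum = 0.0;
        int truePositives = 0;
        for (int i = 0; i < ranked.length; ++i) {
            if (ranked[i]) {
                ++truePositives;
                sum += truePositives / (double) (i + 1); // precision at rank i+1
            }
        }
        return sum / numRelevant;
    }

    public static void main(String[] args) {
        // Relevant items at ranks 1, 2, and 4, with 3 relevant overall.
        boolean[] ranked = { true, true, false, true };
        System.out.println(averagePrecision(ranked, 3));
    }
}
```

For the confidence evaluation, the ranked list is the set of chunks returned by the CRF ordered by score, and a "relevant" item is a chunk that appears in the reference chunking.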

Finally, we see overall time reported by Ant.

...
BUILD SUCCESSFUL
Total time: 10 minutes 56 seconds

Overall, the evaluation took about 11 minutes on my quad-core Xeon (2.33GHz E5410) workstation running 64-bit Java 1.7 (which only has server mode) under 64-bit Windows Vista.

References